[Solved-1 Solution] How to refer to a field after a join and a group by in Pig?



What is a join?

  • A join brings together two data sets. Pig supports inner joins as well as left, right, and full outer joins. Beyond these simple joins, Pig also supports specialized joins such as replicated, skewed, and merge joins.
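As a rough illustration of the join types listed above, the sketch below uses two hypothetical relations, customers and orders. The file paths and schemas are assumptions made up for this example and do not come from the problem that follows.

```pig
-- Hypothetical inputs: paths and field names are illustrative only.
customers = LOAD '/tmp/customers.csv' USING PigStorage(',') AS (cust_id:int, name:chararray);
orders    = LOAD '/tmp/orders.csv'    USING PigStorage(',') AS (cust_id:int, amount:double);

-- Inner join: keeps only customers that have at least one matching order.
inner_j = JOIN customers BY cust_id, orders BY cust_id;

-- Outer joins: keep unmatched rows from the left, right, or both sides.
left_j  = JOIN customers BY cust_id LEFT OUTER,  orders BY cust_id;
right_j = JOIN customers BY cust_id RIGHT OUTER, orders BY cust_id;
full_j  = JOIN customers BY cust_id FULL OUTER,  orders BY cust_id;

-- Specialized join: 'replicated' loads the last relation into memory,
-- so it is only suitable when that relation is small.
repl_j  = JOIN orders BY cust_id, customers BY cust_id USING 'replicated';
```

In every case the joined relation contains the fields of both inputs, which is why a field name such as cust_id that exists on both sides becomes ambiguous afterwards.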

Problem:

  • Suppose you have the following Pig code (win, request, and response are relations loaded directly from the filesystem):
win_request = JOIN win BY bid_id, request BY bid_id;
win_request_response = JOIN win_request BY win.bid_id, response BY bid_id;

win_group = GROUP win_request_response BY (win.campaign_id);
  • The goal is to sum bid_price after joining and grouping, but Pig reports the following error:
Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit.
Please use an explicit cast.

Solution 1:

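  • When performing multiple joins, we recommend giving the fields unique names (e.g. for bid_id) so that no field name is ambiguous after the join. Alternatively, we can use the disambiguation operator '::', but the prefixed names become hard to read as joins are chained.
-- Give every field a unique name at load time so that no '::' prefix is needed later.
wins = LOAD '/user/hadoop/rtb/wins' USING PigStorage(',') AS (f1_w:int, f2_w:int, f3_w:chararray);
reqs = LOAD '/user/hadoop/rtb/reqs' USING PigStorage(',') AS (f1_r:int, f2_r:int, f3_r:chararray);
resps = LOAD '/user/hadoop/rtb/resps' USING PigStorage(',') AS (f1_rp:int, f2_rp:int, f3_rp:chararray);

-- Join wins to requests, then join that result to responses.
wins_reqs = JOIN wins BY f1_w, reqs BY f1_r;
wins_reqs_reps = JOIN wins_reqs BY f1_r, resps BY f1_rp;

-- Group the joined rows by a field that came from wins.
win_group = GROUP wins_reqs_reps BY (f3_w);

-- Inside the FOREACH, the field is projected through the name of the grouped relation.
win_sum = FOREACH win_group GENERATE group, SUM(wins_reqs_reps.f2_w);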
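
For comparison, here is a minimal sketch of the alternative based on the disambiguation operator '::', keeping the relation names from the problem. The load paths, the field types, and the assumption that each table contains only the fields shown are hypothetical, since the original schemas are not given.

```pig
-- Hypothetical paths and schemas: only bid_id, campaign_id and bid_price
-- are named in the problem; types and file locations are assumptions.
win      = LOAD '/path/to/win'      USING PigStorage(',') AS (bid_id:chararray, campaign_id:chararray, bid_price:double);
request  = LOAD '/path/to/request'  USING PigStorage(',') AS (bid_id:chararray);
response = LOAD '/path/to/response' USING PigStorage(',') AS (bid_id:chararray);

win_request = JOIN win BY bid_id, request BY bid_id;

-- After the first join, the key that came from win is named win::bid_id
-- (note '::', not '.').
win_request_response = JOIN win_request BY win::bid_id, response BY bid_id;

-- Print the schema to check the exact prefixed field names.
DESCRIBE win_request_response;

win_group = GROUP win_request_response BY win::campaign_id;

-- Project the field through the grouped relation, then disambiguate with '::'.
win_sum = FOREACH win_group GENERATE group, SUM(win_request_response.win::bid_price);
```

This sketch assumes that the short form win::bid_price resolves because it is unambiguous. If Pig rejects it, use the full field name printed by DESCRIBE instead. Declaring bid_price with a numeric type in the LOAD schema also lets Pig choose a SUM implementation without an explicit cast.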
