[Solved-2 Solutions] Pig: Get top n values per group?



Problem :

The below data is already grouped and aggregated.

user    value     count
----    ------    -----
wiki    third     5
wiki    first     11
wiki    second    10
wiki    fourth    2
...
tiki    second    20
tiki    third     18
tiki    first     21
tiki    fourth    8
  • For every user (wiki and tiki), we want to retrieve their top n values (let's say 2), sorted by 'count' in descending order. The desired output is:
wiki first 11
wiki second 10
tiki first 21
tiki second 20

How can we accomplish that?

Solution 1:

  • The code below returns the top n values (here n = 2) for each user:
records = LOAD '/user/nubes/ncdc/micro-tab/top.txt' AS (user:chararray,value:chararray,counter:int);
grpd = GROUP records BY user;

top2 = foreach grpd {
        sorted = order records by counter desc;
        top    = limit sorted 2;
        generate group, flatten(top);
};

Input :

wiki    third     5
wiki    first     11
wiki    second    10
wiki    fourth    2
tiki    second    20
tiki    third     18
tiki    first     21
tiki    fourth    8

Output :

(wiki,wiki,first,11)
(wiki,wiki,second,10)
(tiki,tiki,first,21)
(tiki,tiki,second,20)

Solution 2:

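Solution 1 names the nested relation top:

top    = limit sorted 2;
  • TOP is a built-in function in Pig, so using top as a relation name may throw an error. The only change needed is to rename that relation (top1 in the script below). Solution 1 also generates both the group key and the flattened records:
generate group, flatten(top);

Output:

(wiki,wiki,first,11)
(wiki,wiki,second,10)
(tiki,tiki,first,21)
(tiki,tiki,second,20)

The user appears twice in each tuple. To avoid that, generate only the flattened records. The modified script is shown below: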

records = load 'test1.txt' using PigStorage(',') as (user:chararray, value:chararray, count:int);
grpd = GROUP records BY user;
top2 = foreach grpd {
        sorted = order records by count desc;
        top1    = limit sorted 2;
        generate flatten(top1);
};

Output:

(wiki,first,11)
(wiki,second,10)
(tiki,first,21)
(tiki,second,20)
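
Since TOP is itself a built-in function, the sort-and-limit step can probably be replaced by a single call to it. The following is only a sketch, not part of the original answers. It assumes the same comma-separated 'test1.txt' input as above, and the relation names best and top2_builtin are made up for illustration. TOP takes the number of tuples to keep, the 0-based index of the column to compare, and the bag to read from.

```pig
-- Sketch: same input file and schema as the script above (assumed).
records = load 'test1.txt' using PigStorage(',') as (user:chararray, value:chararray, count:int);
grpd = GROUP records BY user;

top2_builtin = foreach grpd {
        -- TOP(n, column index, bag): column 2 is 'count' (indexes start at 0)
        best = TOP(2, 2, records);
        generate flatten(best);
};

dump top2_builtin;
```

TOP returns a bag, so the order of the two tuples within each user is not guaranteed. If the result must be sorted by count, the order ... limit form shown above is the safer choice.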
