[Solved-1 Solution] Hadoop Pig: return top 5 rows?



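What is Group By?

The GROUP operator groups the data in one or more relations. It collects all tuples that share the same key into one tuple per key: a field named group holds the key, and a bag named after the original relation holds the matching tuples.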

Syntax

grunt> Group_data = GROUP Relation_name BY field_name;
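
As a small illustration of the syntax above, here is a sketch that groups city records by state. The file name cities.txt, its comma-separated layout, and the field names are assumptions made for this example only.

-- Assumption: cities.txt is a hypothetical file with one line per city: state,cityname,citysize
A = LOAD 'cities.txt' USING PigStorage(',') AS (state:chararray, cityname:chararray, citysize:int);
-- One output tuple per state: (group, {bag of that state's city tuples})
B = GROUP A BY state;
-- Print the schema of B to see the group field and the bag named A
DESCRIBE B;

Each tuple of B carries the state name in the group field and all cities of that state in the bag A.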

What is Order By?

  • The ORDER BY operator sorts the contents of a relation by one or more fields, in ascending (ASC, the default) or descending (DESC) order.

Syntax

grunt> Relation_name2 = ORDER Relation_name1 BY field_name (ASC|DESC);
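
A short sketch of the ORDER syntax in use, again assuming a hypothetical comma-separated file cities.txt with the fields state, cityname and citysize (these names are not part of the syntax, only of this example):

-- Assumption: cities.txt is a hypothetical file with one line per city: state,cityname,citysize
A = LOAD 'cities.txt' USING PigStorage(',') AS (state:chararray, cityname:chararray, citysize:int);
-- Largest cities first
by_size = ORDER A BY citysize DESC;
-- Sort by state name, and within each state by size, largest first
by_state_then_size = ORDER A BY state ASC, citysize DESC;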

Problem:

We want to return the top 5 rows of each group. We have a table of state names and their cities, grouped by state name, and we want only the 5 largest cities of each state rather than all of them. How can we do this using Pig?

Solution 1:

  • First we GROUP the relation BY state.
  • Then, inside a nested FOREACH over the groups, we ORDER BY city size and apply LIMIT. This sorts the cities in each group by size, largest first, and keeps the top 5.

The code below returns the top 5 rows of each group:

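B = GROUP A BY state;
C = FOREACH B {
   -- sort the cities of this state, largest first
   DA = ORDER A BY citysize DESC;
   -- keep only the first 5 cities
   DB = LIMIT DA 5;
   -- flatten both fields together: two separate FLATTENs would produce a cross product (25 rows per state)
   GENERATE group, FLATTEN(DB.(citysize, cityname));
}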
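
The snippet above assumes that the relation A already exists. To try the approach end to end, a minimal sketch could look like the following; the file cities.txt and its comma-separated layout are assumptions for illustration, not part of the original question.

-- Assumption: cities.txt is a hypothetical file with one line per city: state,cityname,citysize
A = LOAD 'cities.txt' USING PigStorage(',') AS (state:chararray, cityname:chararray, citysize:int);
B = GROUP A BY state;
C = FOREACH B {
   DA = ORDER A BY citysize DESC;
   DB = LIMIT DA 5;
   -- a single FLATTEN over both fields keeps each city's size and name in the same row
   GENERATE group AS state, FLATTEN(DB.(citysize, cityname));
}
-- Print at most 5 rows per state: (state, citysize, cityname)
DUMP C;

A state with fewer than 5 cities simply yields all of its cities, since LIMIT returns at most the requested number of tuples.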
