Apache Pig - TOP()



What is the TOP() function in Apache Pig?

  • The TOP() function of Pig Latin returns the top N tuples of a bag.
  • It takes three inputs: the number of tuples you need, the index (starting at 0) of the column whose values are compared, and the relation (bag) to pick the tuples from.
  • It returns a bag containing the N tuples that have the largest values in that column.

Syntax

grunt> TOP(topN,column,relation)

Example

  • Assume we have a file named wikitechy_emp_details.txt in the HDFS directory /pig_data/, with the following content.

wikitechy_emp_details.txt

111,Anu,22,newyork 
112,Bastin,23,Kolkata 
113,Cimen,23,Tokyo 
114,Darathy,25,London 
115,Enba,23,Bhuwaneshwar 
116,Favin,22,Chennai 
117,Robert,22,newyork 
118,Syam,23,Kolkata 
119,Mary,25,Tokyo 
120,Vincent,25,London 
121,Preethi,25,Bhuwaneshwar 
122,Antony,22,Chennai
  • Load this file into Pig under the relation name emp_data as given below.
grunt> emp_data = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_emp_details.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, city:chararray);
  • Group the relation emp_data by age, and store it in the relation emp_group.
grunt> emp_group = Group emp_data BY age;

Now verify the relation emp_group using the Dump operator as given below.

grunt> Dump emp_group;
(22,{(122,Antony,22,Chennai),(117,Robert,22,newyork),(116,Favin,22,Chennai),(111,Anu,22,newyork)})
(23,{(118,Syam,23,Kolkata),(115,Enba,23,Bhuwaneshwar),(113,Cimen,23,Tokyo),(112,Bastin,23,Kolkata)})
(25,{(121,Preethi,25,Bhuwaneshwar),(120,Vincent,25,London),(119,Mary,25,Tokyo),(114,Darathy,25,London)})
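
To see why emp_data is later passed to TOP() as its third argument, it can help to inspect the schema of the grouped relation. The sketch below uses Pig's Describe operator. The exact formatting of its output can differ between Pig versions, so no output is shown here.

```pig
-- Print the schema of the grouped relation.
-- Each tuple of emp_group is expected to have two fields:
--   group    : the age value the tuples were grouped on
--   emp_data : a bag holding every emp_data tuple with that age
DESCRIBE emp_group;
```

The inner bag keeps the name of the relation that was grouped, which is why the bag is referred to as emp_data inside a FOREACH over emp_group.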

Now, you can get the two records with the highest id in each group as given below.

grunt> data_top = FOREACH emp_group { 
   top = TOP(2, 0, emp_data); 
   GENERATE top; 
}
  • Here we retrieve, from each group, the 2 tuples with the greatest id.
  • Because the tuples are ranked by id, we pass the index of the id column (0) as the second parameter of the TOP() function.
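
The statement above keeps only the bag returned by TOP(), so the age each bag belongs to is lost. A minimal sketch of one way to keep it, assuming the emp_group relation defined earlier, is to generate the group key next to the flattened result. The names data_top_flat and group_age are hypothetical and used only for illustration.

```pig
-- Keep the grouping key (age) and flatten the bag returned by TOP(),
-- so that each selected employee becomes its own output tuple.
-- data_top_flat and group_age are hypothetical names for this sketch.
data_top_flat = FOREACH emp_group GENERATE
   group AS group_age,
   FLATTEN(TOP(2, 0, emp_data));
```

Each output tuple should then start with the age of the group, followed by the fields of one of the two selected employees.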

Verification

You can verify the contents of the data_top relation using the Dump operator as given below.

grunt> Dump data_top;
({(117,Robert,22,newyork),(122,Antony,22,Chennai)})
({(115,Enba,23,Bhuwaneshwar),(118,Syam,23,Kolkata)})
({(120,Vincent,25,London),(121,Preethi,25,Bhuwaneshwar)})
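
A bag is an unordered collection, so the order of the tuples inside each bag returned by TOP() should not be relied on. If the order matters, one alternative is to sort and limit the bag explicitly inside a nested FOREACH. The following is a sketch under that assumption, not a replacement for TOP(); the names data_top_sorted, sorted, top2 and group_age are hypothetical.

```pig
-- Alternative sketch: sort each group's bag explicitly, then keep two tuples.
-- data_top_sorted, sorted, top2 and group_age are hypothetical names.
data_top_sorted = FOREACH emp_group {
   sorted = ORDER emp_data BY id DESC;   -- highest id first
   top2 = LIMIT sorted 2;                -- keep the first two tuples
   GENERATE group AS group_age, top2;
}
```

This selects the same two employees per age group as TOP(2, 0, emp_data), and typically lists them from the highest id downwards.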
