Apache Pig - TOP()




What is the TOP() function in Apache Pig?

  • The TOP() function of Pig Latin returns the top N tuples of a bag.
  • It takes three inputs: the number of tuples you need, the index of the column whose values are compared (0 denotes the first column), and the relation (bag) to read from.
  • It returns a bag containing the top N tuples, each with all of its columns.

Syntax

TOP(topN, column, relation)
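To make the three parameters concrete, here is a minimal Python sketch that models what TOP() computes. It is not Pig code and not Pig's implementation: the helper name `top` and the sample bag are assumptions chosen for illustration. It also returns a sorted list, whereas a Pig bag carries no guaranteed order.

```python
import heapq

def top(top_n, column, relation):
    """Hypothetical model of TOP(topN, column, relation).

    Returns the top_n tuples of `relation` that have the largest
    values at position `column` (0 is the first column).
    """
    return heapq.nlargest(top_n, relation, key=lambda t: t[column])

# A small bag of (id, name, age) tuples, made up for this sketch.
bag = [(111, "Anu", 22), (116, "Favin", 22), (117, "Robert", 22), (122, "Antony", 22)]

# Top 2 tuples compared on column 0 (the id).
result = top(2, 0, bag)
print(result)
```

The second argument is a position, not a column name, which is why the call passes 0 rather than "id".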

Example

  • Ensure we have a file named wikitechy_emp_details.txt in the HDFS directory /pig_data/, with the following content.

wikitechy_emp_details.txt

111,Anu,22,newyork 
112,Bastin,23,Kolkata 
113,Cimen,23,Tokyo 
114,Darathy,25,London 
115,Enba,23,Bhuwaneshwar 
116,Favin,22,Chennai 
117,Robert,22,newyork 
118,Syam,23,Kolkata 
119,Mary,25,Tokyo 
120,Vincent,25,London 
121,Preethi,25,Bhuwaneshwar 
122,Antony,22,Chennai
  • Load this file into Pig under the relation name emp_data, as shown below.
grunt> emp_data = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_emp_details.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, city:chararray);
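The LOAD statement above splits each line on commas and applies the declared schema. As a rough illustration only, the following Python sketch does the same for two of the sample lines. The helper `parse_line` is hypothetical, and the sketch ignores how Pig itself handles malformed fields.

```python
def parse_line(line, delimiter=","):
    """Hypothetical helper: split one line and cast the fields to
    the schema (id:int, name:chararray, age:int, city:chararray)."""
    emp_id, name, age, city = line.rstrip("\n").split(delimiter)
    return (int(emp_id), name, int(age), city)

# Two lines taken from the sample file.
lines = ["111,Anu,22,newyork", "112,Bastin,23,Kolkata"]

emp_data = [parse_line(line) for line in lines]
print(emp_data)
```

Each resulting tuple has id at index 0, name at 1, age at 2, and city at 3, which is the indexing that TOP() relies on.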
  • Group the relation emp_data by age, and store it in the relation emp_group.
grunt> emp_group = Group emp_data BY age;

Now verify the relation emp_group using the Dump operator as given below.

grunt> Dump emp_group;
(22,{(122,Antony,22,Chennai),(117,Robert,22,newyork),(116,Favin,22,Chennai),(111,Anu,22,newyork)})
(23,{(118,Syam,23,Kolkata),(115,Enba,23,Bhuwaneshwar),(113,Cimen,23,Tokyo),(112,Bastin,23,Kolkata)})
(25,{(121,Preethi,25,Bhuwaneshwar),(120,Vincent,25,London),(119,Mary,25,Tokyo),(114,Darathy,25,London)})

Now you can get the two records with the highest id in each group, as given below.

grunt> data_top = FOREACH emp_group { 
   top = TOP(2, 0, emp_data); 
   GENERATE top; 
}
  • Here we retrieve the two tuples with the greatest id from each group.
  • Because the comparison is based on id, we pass 0, the index of the id column, as the second parameter of the TOP() function.
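The group-then-TOP logic can be cross-checked outside Pig. The sketch below is a plain Python model of the two steps, run on the contents of the sample file. It is an illustration under that assumption, not what Pig executes internally, and it sorts each result where Pig gives no ordering guarantee inside a bag.

```python
import heapq
from collections import defaultdict

# Rows of the sample file as (id, name, age, city) tuples.
emp_data = [
    (111, "Anu", 22, "newyork"), (112, "Bastin", 23, "Kolkata"),
    (113, "Cimen", 23, "Tokyo"), (114, "Darathy", 25, "London"),
    (115, "Enba", 23, "Bhuwaneshwar"), (116, "Favin", 22, "Chennai"),
    (117, "Robert", 22, "newyork"), (118, "Syam", 23, "Kolkata"),
    (119, "Mary", 25, "Tokyo"), (120, "Vincent", 25, "London"),
    (121, "Preethi", 25, "Bhuwaneshwar"), (122, "Antony", 22, "Chennai"),
]

# Model of: emp_group = GROUP emp_data BY age;
emp_group = defaultdict(list)
for row in emp_data:
    emp_group[row[2]].append(row)

# Model of: TOP(2, 0, emp_data) inside FOREACH emp_group.
# Column 0 is the id, so the two largest ids per age group are kept.
data_top = {
    age: heapq.nlargest(2, rows, key=lambda r: r[0])
    for age, rows in sorted(emp_group.items())
}

for age, rows in data_top.items():
    print(age, rows)
```

For the age 22 group this keeps ids 122 and 117, the two largest of 111, 116, 117, and 122.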

Verification

You can verify the contents of the data_top relation using the Dump operator as given below.

grunt> Dump data_top;
({(117,Robert,22,newyork),(122,Antony,22,Chennai)})
({(115,Enba,23,Bhuwaneshwar),(118,Syam,23,Kolkata)})
({(120,Vincent,25,London),(121,Preethi,25,Bhuwaneshwar)})
