Apache Pig - GROUP Operator
What is the GROUP operator in Apache Pig?
- The GROUP operator is used to group the data in one or more relations.
- It collects the tuples that share the same key.
Pig Operations - Grouping
- Produces records with two fields: the key (named group) and the bag of collected records
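- USING 'collected' - performs the grouping in the map phase and so avoids a reduce phase; it requires that all records for a key are already stored together in the input and that the loader supports this
- GROUP ALL - groups together all of the records into a single group
GroupedAll = GROUP Users ALL;
CountedAll = FOREACH GroupedAll GENERATE COUNT(Users);
- Creates tuples containing the key and a bag of the tuples that share that key value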
- Ensure that a file named wikitechy_employee_details.txt exists in the HDFS directory /pig_data/.
- Load this file into Apache Pig under the relation name wikitechy_employee_details.
- Group the records (tuples) in the relation by age and store the result in the relation group_data.
- Verify the relation group_data using the DUMP operator.
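The steps above can be sketched in Pig Latin as follows. This is a minimal sketch, not the original listing: the comma delimiter and the field names (id, firstname, lastname, age, phone, city) are assumptions made for illustration, since only age and city are named in the text.

```pig
-- Load the employee file from HDFS.
-- ASSUMPTION: comma-separated fields with this schema; adjust to the real file.
wikitechy_employee_details = LOAD '/pig_data/wikitechy_employee_details.txt'
    USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray,
        age:int, phone:chararray, city:chararray);

-- Group the tuples by age; each result tuple is (group, bag of matching tuples).
group_data = GROUP wikitechy_employee_details BY age;

-- Print the grouped relation; each line has the form (age,{(tuple),(tuple),...}).
DUMP group_data;
```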
The output displays the contents of the relation named group_data.
Here you can observe that the resulting schema has two columns:
- One is the key, named group, which holds the age by which we grouped the relation.
- The other is a bag, named after the input relation, which contains the employee records (tuples) with the respective age.
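To make the two columns concrete, the sketch below reads both of them in a FOREACH: the key through the field name group, and the bag through the name of the input relation. The schema in the LOAD statement is the same assumed one (only age and city are named in the text), and the relation name age_counts is hypothetical.

```pig
-- ASSUMPTION: comma-separated fields with this illustrative schema.
wikitechy_employee_details = LOAD '/pig_data/wikitechy_employee_details.txt'
    USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray,
        age:int, phone:chararray, city:chararray);

group_data = GROUP wikitechy_employee_details BY age;

-- 'group' is the key column; the bag column carries the input relation's name.
-- age_counts is a hypothetical relation name used only for this example.
age_counts = FOREACH group_data
    GENERATE group AS age,
             COUNT(wikitechy_employee_details) AS employees;

DUMP age_counts;
```

Each output tuple pairs one age with the number of employee records in that age's bag.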
You can view the schema of the relation after grouping the data by using the DESCRIBE command.
- You can also get a sample illustration of the schema by using the ILLUSTRATE command.
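A short sketch of both commands, again assuming the illustrative schema used for the LOAD statement (the original listing is not available here):

```pig
-- ASSUMPTION: comma-separated fields with this illustrative schema.
wikitechy_employee_details = LOAD '/pig_data/wikitechy_employee_details.txt'
    USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray,
        age:int, phone:chararray, city:chararray);

group_data = GROUP wikitechy_employee_details BY age;

-- Print the schema: a field named 'group' followed by a bag
-- named wikitechy_employee_details that holds the original fields.
DESCRIBE group_data;

-- Run the pipeline on a small sample and show the data at each step.
ILLUSTRATE group_data;
```

DESCRIBE only inspects the schema and does not launch a job, while ILLUSTRATE executes the statements on sampled data, so it is the slower of the two.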
Grouping by Multiple Columns
Group the relation by age and city, storing the result in the relation group_multiple.
You can verify the content of the relation named group_multiple using the DUMP operator.
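Grouping by two columns might look like the sketch below. The LOAD schema is an assumption for illustration, and the relation name counts_by_age_city is hypothetical. With more than one grouping column, the group field becomes a tuple, here (age, city).

```pig
-- ASSUMPTION: comma-separated fields with this illustrative schema.
wikitechy_employee_details = LOAD '/pig_data/wikitechy_employee_details.txt'
    USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray,
        age:int, phone:chararray, city:chararray);

-- Multiple grouping columns go in parentheses; the key is the tuple (age, city).
group_multiple = GROUP wikitechy_employee_details BY (age, city);

DUMP group_multiple;

-- Unpack the composite key into separate columns and count each group.
-- counts_by_age_city is a hypothetical name used only for this example.
counts_by_age_city = FOREACH group_multiple
    GENERATE FLATTEN(group) AS (age, city),
             COUNT(wikitechy_employee_details) AS employees;

DUMP counts_by_age_city;
```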
We can also place every tuple of a relation into a single group by using the ALL keyword, storing the result in the relation group_all.
Now verify the content of the relation group_all.
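A sketch of GROUP ALL on the same relation, with the LOAD schema again assumed for illustration and total_employees a hypothetical relation name. GROUP ALL takes no BY clause and produces exactly one tuple, whose key is the literal string all.

```pig
-- ASSUMPTION: comma-separated fields with this illustrative schema.
wikitechy_employee_details = LOAD '/pig_data/wikitechy_employee_details.txt'
    USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray,
        age:int, phone:chararray, city:chararray);

-- Put every tuple into one group; the result is a single tuple (all,{...}).
group_all = GROUP wikitechy_employee_details ALL;

DUMP group_all;

-- A common use: compute an aggregate over the whole relation.
-- total_employees is a hypothetical name used only for this example.
total_employees = FOREACH group_all
    GENERATE COUNT(wikitechy_employee_details) AS employees;

DUMP total_employees;
```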