Apache Pig - Group Operator



What is the GROUP operator in Apache Pig?

    • The GROUP operator is used to group the data in one or more relations.
    • It gathers the data having the same key.

    Pig Operations - Grouping

  • GROUP collects together records with the same key
    • Produces records with two fields: the key (named group) and the bag of collected records
  • Supports an expression or user-defined function as the group key
  • Supports grouping on multiple keys
  • Special versions
    • USING 'collected' - avoids a reduce phase
    • GROUP ALL - groups together all of the records into a single group
      GroupedAll = GROUP Users ALL;
      CountedAll = FOREACH GroupedAll GENERATE COUNT(Users);
  • GROUP instruction:
    • Creates tuples with the key and a bag of the tuples that share the same key value
  • We can group multiple relations in one statement. The result contains one bag per relation
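
    The points above on expression keys and multiple keys can be sketched as follows. This is only a minimal illustration: the file users.txt and its schema (name, age, city) are hypothetical and are not part of this tutorial's data set. The relation name Users simply follows the GROUP ALL example in the list.

```pig
-- Hypothetical input file and schema, used only for illustration.
Users = LOAD 'users.txt' USING PigStorage(',')
        AS (name:chararray, age:int, city:chararray);

-- Expression as the group key: city names are upper-cased before grouping.
ByCity = GROUP Users BY UPPER(city);

-- Multiple keys: the group key becomes a tuple (age, city).
ByAgeCity = GROUP Users BY (age, city);

-- Each grouped record is (group, bag of Users tuples), so the bag can be aggregated.
CityCounts = FOREACH ByCity GENERATE group AS city, COUNT(Users) AS total;
DUMP CityCounts;
```

    In both cases the first field of every output record is named group, and the second field is a bag named after the input relation.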

    Syntax

      grunt> Group_data = GROUP Relation_name BY key;
      

      Example

        • Ensure that you have a file named wikitechy_employee_details.txt in the HDFS directory /pig_data/ as shown below.

        wikitechy_employee_details.txt

          111,Anu,Shankar,23,9876543210,Chennai
          112,Barvathi,Nambiayar,24,9876543211,Chennai
          113,Kajal,Nayak,24,9876543212,Trivendram
          114,Preethi,Antony,21,9876543213,Pune
          115,Raj,Gopal,21,9876543214,Hyderabad
          116,Yashika,Kannan,22,9876543215,Delhi
          117,siddu,Narayanan,22,9876543216,Kolkata
          118,Timple,Mohanthy,23,9876543217,Bhuwaneshwar
          
          • Load this file into Apache Pig with the relation name wikitechy_employee_details as given below.
          grunt> wikitechy_employee_details = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_employee_details.txt' USING PigStorage(',')
             as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
          
          • Let us group the records/tuples in the relation by age as shown below.
          grunt> group_data = GROUP wikitechy_employee_details by age;
          

          Verification

            • Verify the relation group_data using the DUMP operator as given below.
            grunt> Dump group_data;
            

            Output

              The output displays the contents of the relation named group_data as given below.

              Here you can observe that each resulting tuple has two fields,

              • One is the key, named group, which holds the age by which we have grouped the relation.
              • The other is a bag, which contains the group of tuples, that is, the employee records with the respective age.
              (21,{(114,Preethi,Antony,21,9876543213,Pune),(115,Raj,Gopal,21,9876543214,Hyderabad)})
              (22,{(116,Yashika,Kannan,22,9876543215,Delhi),(117,siddu,Narayanan,22,9876543216,Kolkata)})
              (23,{(111,Anu,Shankar,23,9876543210,Chennai),(118,Timple,Mohanthy,23,9876543217,Bhuwaneshwar)})
              (24,{(112,Barvathi,Nambiayar,24,9876543211,Chennai),(113,Kajal,Nayak,24,9876543212,Trivendram)})
              

              Now you can see the schema of the relation after grouping the data using the describe command as given below.

              grunt> Describe group_data;
                
              group_data: {group: int,wikitechy_employee_details: {(id: int,firstname: chararray,
                             lastname: chararray,age: int,phone: chararray,city: chararray)}}
              
              • You can get a sample illustration of the schema using the illustrate command as given below.
              grunt> Illustrate group_data;
              

              Output

                
                -------------------------------------------------------------------------------------------------
                | group_data | group:int | wikitechy_employee_details:bag{:tuple(id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray)} |
                -------------------------------------------------------------------------------------------------
                |            | 21        | {(114,Preethi,Antony,21,9876543213,Pune),(115,Raj,Gopal,21,9876543214,Hyderabad)} |
                |            | 22        | {(116,Yashika,Kannan,22,9876543215,Delhi),(117,siddu,Narayanan,22,9876543216,Kolkata)} |
                -------------------------------------------------------------------------------------------------
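
                Because every grouped tuple carries the key in the field named group and the matching records in a bag named wikitechy_employee_details, the bag can be passed to an aggregate function. The sketch below continues the same grunt session and counts the employees per age. The relation name age_counts and the aliases age and total are chosen here only for illustration.

```pig
-- Continues the session above: group_data is wikitechy_employee_details grouped by age.
-- COUNT is applied to the bag that GROUP created for each key.
age_counts = FOREACH group_data GENERATE
                 group AS age,
                 COUNT(wikitechy_employee_details) AS total;
DUMP age_counts;
```

                With the eight sample records shown earlier, each of the ages 21, 22, 23 and 24 should get a count of 2.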
                

                Grouping by Multiple Columns

                  Group the relation by age and city as given below.

                  grunt> group_multiple = GROUP wikitechy_employee_details by (age, city);
                  

                  You can verify the content of the relation named group_multiple using the Dump operator as given below.

                  grunt> Dump group_multiple;
                    
                  ((21,Pune),{(114,Preethi,Antony,21,9876543213,Pune)})
                  ((21,Hyderabad),{(115,Raj,Gopal,21,9876543214,Hyderabad)})
                  ((22,Delhi),{(116,Yashika,Kannan,22,9876543215,Delhi)})
                  ((22,Kolkata),{(117,siddu,Narayanan,22,9876543216,Kolkata)})
                  ((23,Chennai),{(111,Anu,Shankar,23,9876543210,Chennai)})
                  ((23,Bhuwaneshwar),{(118,Timple,Mohanthy,23,9876543217,Bhuwaneshwar)})
                  ((24,Chennai),{(112,Barvathi,Nambiayar,24,9876543211,Chennai)})
                  ((24,Trivendram),{(113,Kajal,Nayak,24,9876543212,Trivendram)})
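
                  When grouping by multiple columns, the group field is itself a tuple, here (age, city). One way to turn it back into ordinary columns is FLATTEN, as in the sketch below, which continues the same grunt session. The relation name age_city_counts and the alias total are chosen here only for illustration.

```pig
-- Continues the session above: group_multiple is grouped by (age, city).
-- FLATTEN unpacks the tuple-valued key into two separate fields.
age_city_counts = FOREACH group_multiple GENERATE
                      FLATTEN(group) AS (age, city),
                      COUNT(wikitechy_employee_details) AS total;
DUMP age_city_counts;
```

                  In the sample data every (age, city) pair occurs only once, so each row should show a count of 1. The key parts can also be read individually as group.age and group.city.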
                  

                  Group All

                    We can place all the tuples of a relation into a single group using the All keyword as given below.

                    grunt> group_all = GROUP wikitechy_employee_details All;
                    

                    Now verify the content of the relation group_all as given below.

                    grunt> Dump group_all;
                      
                    (all,{(118,Timple,Mohanthy,23,9876543217,Bhuwaneshwar),
                    (117,siddu,Narayanan,22,9876543216,Kolkata),
                    (116,Yashika,Kannan,22,9876543215,Delhi),
                    (115,Raj,Gopal,21,9876543214,Hyderabad),
                    (114,Preethi,Antony,21,9876543213,Pune),
                    (113,Kajal,Nayak,24,9876543212,Trivendram),
                    (112,Barvathi,Nambiayar,24,9876543211,Chennai),
                    (111,Anu,Shankar,23,9876543210,Chennai)})
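
                    Group All is typically followed by an aggregate over the whole relation, since the single bag holds every tuple. The sketch below continues the same grunt session. The relation name overall and the aliases total and avg_age are chosen here only for illustration.

```pig
-- Continues the session above: group_all holds every tuple in one bag under the key 'all'.
-- COUNT works on the bag itself; AVG works on the age column projected from the bag.
overall = FOREACH group_all GENERATE
              COUNT(wikitechy_employee_details) AS total,
              AVG(wikitechy_employee_details.age) AS avg_age;
DUMP overall;
```

                    For the eight sample records this should report a total of 8 and an average age of 22.5.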
                    
