pig tutorial - apache pig tutorial - Apache Pig - Join Operator - pig latin - apache pig - pig hadoop



What is Join?

  • JOIN keyword is used to combine rows from two or more logs, based on a common fields .
  • Left join returns all logs from left log(table) and matching logs from the right log.
 customer id column

Learn Apache Pig - Apache Pig tutorial - customer id column - Apache Pig examples - Apache Pig programs

Pig Operations - Joining

  • JOIN joins two or more sets
  • learn apache pig - apache pig tutorial - pig tutorial - apache pig examples - big data - apache pig script - apache pig program - apache pig download - apache pig example  - pig join operations statement
  • Good build-in support for joins
    • Many different join implementations
    • Left, right and full outer joins are supported
    • Joining on multiple keys is supported
  • Merge Join
    • sets are pre-sorted by the join key
  • Merge-Sparse Join (since Pig 0.10.0)
    • sets are pre-sorted and one set has few ( < 1% of its total) matching keys
  • Replicated Join
    • one set is very large, while other sets are small enough to fit into memory
  • Skewed Join
    • when a large number of records for some values of the join key is expected
  • Regular Join
  • (inner) JOIN instruction:
    • Our classic database operator for relations!
    learn apache pig - apache pig tutorial - pig tutorial - apache pig examples - big data - apache pig script - apache pig program - apache pig download - apache pig example  -apache pig inner join operation
  • (left) JOIN instruction:
    • Our classic database operator for relations!
    learn apache pig - apache pig tutorial - pig tutorial - apache pig examples - big data - apache pig script - apache pig program - apache pig download - apache pig example  - apache pig left outer join operation

    How to use join operator in Apache Pig ?

    • The JOIN operator is used to combine records from two or more relations. While performing a join operation, we declare one (or a group of) tuple(s) from each relation, as keys. When these keys match, the two particular tuples are matched, else the records are dropped.
      • Self-join
      • Inner-join
      • Outer-join − left join, right join, and full join

    wikitechy_customers.txt

    1,Raj,32,Ahmedabad,2000.00
    2,Keerthana,25,Delhi,1500.00
    3,karthi,23,Kota,2000.00
    4,Shyam,25,Mumbai,6500.00 
    5,Hari,27,Bhopal,8500.00
    6,Kannan,22,MP,4500.00
    7,Mukil,24,Indore,10000.00
    

    orders.txt

    102,2017-01-08 00:00:00,3,3000
    100,2017-01-08 00:00:00,3,1500
    101,2017-02-20 00:00:00,2,1560
    103,2016-05-20 00:00:00,4,2060
    

    And these two files into Pig with the relations wikitechy_customers and orders as given below.

    grunt> wikitechy_customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
       as (id:int, name:chararray, age:int, address:chararray, salary:int);
      
    grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
       as (oid:int, date:chararray, customer_id:int, amount:int);
    
    • To perform various Join operations on these two relations.

    Self - join

    • Self-join is used to join a table with itself as if the table were two relations, temporarily renaming at least one relation.
    • Usually, in Apache Pig, to perform self-join, we will load the same data multiple times, under different aliases (names).
    • Therefore load the contents of the file wikitechy_customers.txt as two tables as given below.
    grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_customers.txt' USING PigStorage(',')
       as (id:int, name:chararray, age:int, address:chararray, salary:int);
      
    grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_customers.txt' USING PigStorage(',')
       as (id:int, name:chararray, age:int, address:chararray, salary:int); 
    

    Syntax

    • Given below is the syntax of performing self-join operation using the JOIN operator.
    grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;
    

    Example

    • Let us perform self-join operation on the relation customers, by joining the two relations customers1 and customers2 as given below.
    grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
    

    Verification

    • Now verify the relation customers3 using the DUMP operator as given below.
    grunt> Dump customers3;
    

    Output

    • The following output, displaying the contents of the relation customers
    (1,Raj,32,Ahmedabad,2000,1,Raj,32,Ahmedabad,2000)
    (2,Keerthana,25,Delhi,1500,2,Keerthana,25,Delhi,1500)
    (3,karthi,23,Kota,2000,3,karthi,23,Kota,2000)
    (4,Shyam,25,Mumbai,6500,4,Shyam,25,Mumbai,6500) 
    (5,Hari,27,Bhopal,8500,5,Hari,27,Bhopal,8500)
    (6,Kannan,22,MP,4500,6,Kannan,22,MP,4500)
    (7,Mukil,24,Indore,10000,7,Mukil,24,Indore,10000)
    
    

    Inner Join

    • Inner Join is used quite frequently; it is also referred to as equijoin. An inner join returns rows when there is a match in both tables.
    • It creates a new relation by combining column values of two relations (say A and B) based upon the join-predicate.
    • The query compares each row of A with each row of B to find all pairs of rows which satisfy the join-predicate.
    • When the join-predicate is satisfied, the column values for each matched pair of rows of A and B are combined into a result row.
     inner join

    Learn Apache Pig - Apache Pig tutorial - inner join - Apache Pig examples - Apache Pig programs

    Syntax

    • Here is the syntax of performing inner join operation using the JOIN operator.
    grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
    

    Example

    • Let us perform inner join operation on the two relations wikitechy_customers and orders as given below.
    grunt> wikitechy_coustomer_orders = JOIN customers BY id, orders BY customer_id;
    

    Verification

    • Verify the relation wikitechy_coustomer_orders using the DUMP operator as given below.
    grunt> Dump wikitechy_coustomer_orders;
    

    Output

    • The following output that will the contents of the relation named wikitechy_coustomer_orders.
    (2,Keerthana,25,Delhi,1500,101,2017-01-20 00:00:00,2,1560)
    (3,karthi,23,Kota,2000,100,2017-01-08 00:00:00,3,1500)
    (3,karthi,23,Kota,2000,102,2017-02-08 00:00:00,3,3000)
    (4,Shyam,25,Mumbai,6500,103,2016-05-20 00:00:00,4,2060)
    

    Note

    • Outer Join: Unlike inner join, outer join returns all the rows from at least one of the relations. An outer join operation is carried out in three ways ,
      • Left outer join
      • Right outer join
      • Full outer join

    Left Outer Join

    • The left outer Join operation returns all rows from the left table, even if there are no matches in the right relation.
     left outer join

    Learn Apache Pig - Apache Pig tutorial - left outer join - Apache Pig examples - Apache Pig programs

    Syntax

    • Let us perform left outer join operation using the JOIN operator.
    grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;
    

    Example

    • Let us perform left outer join operation on the two relations customers and orders as given below.
    grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
    

    Verification

    • Verify the relation outer_left using the DUMP operator as given below.
    grunt> Dump outer_left;
    

    Output

    • The following output, displaying the contents of the relation outer_left.
    (1,Raj,32,Ahmedabad,2000,,,,)
    (2,Keerthana,25,Delhi,1500,101,2017-01-20 00:00:00,2,1560)
    (3,karthi,23,Kota,2000,100,2017-01-08 00:00:00,3,1500)
    (3,karthi,23,Kota,2000,102,2017-02-08 00:00:00,3,3000)
    (4,Shyam,25,Mumbai,6500,103,2016-05-20 00:00:00,4,2060) (5,Hari,27,Bhopal,8500,,,,)
    (6,Kannan,22,MP,4500,,,,)
    (7,Mukil,24,Indore,10000,,,,)
    

    Right Outer Join

    • The right outer join operation returns all rows from the right table, even if there are no matches in the left table.
     right outer join

    Learn Apache Pig - Apache Pig tutorial - right outer join - Apache Pig examples - Apache Pig programs

    Syntax

    • Given below is the syntax of performing right outer join operation using the JOIN operator.
    grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
    

    Example

    • Let us perform right outer join operation on the two relations wikitechy_customers and orders as given below.
    grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
    

    Verification

    • Verify the relation outer_right using the DUMP operator as given below.
    grunt> Dump outer_right
    

    Output

    • The following output, displaying the contents of the relation outer_right.
    (2,Keerthana,25,Delhi,1500,101,2017-01-20 00:00:00,2,1560)
    (3,karthi,23,Kota,2000,100,2017-01-08 00:00:00,3,1500)
    (3,karthi,23,Kota,2000,102,2017-02-08 00:00:00,3,3000)
    (4,Shyam,25,Mumbai,6500,103,2016-05-20 00:00:00,4,2060)
    
    • The full outer join operation returns rows when there is a match in one of the relations
     full join new

    Learn Apache Pig - Apache Pig tutorial - full join new - Apache Pig examples - Apache Pig programs

    Syntax

    • Given below is the syntax of performing full outer join using the JOIN operator.
    grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
    

    Example

    • Let us perform full outer join operation on the two relations wikitechy_customers and orders as given below.
    grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
    

    Verification

    • Verify the relation outer_full using the DUMP operator as given below.
    grun> Dump outer_full; 
    

    Output

    • The following output, displaying the contents of the relation outer_full.
    (1,Raj,32,Ahmedabad,2000,,,,)
    (2,Keerthana,25,Delhi,1500,101,2017-01-20 00:00:00,2,1560)
    (3,karthi,23,Kota,2000,100,2017-01-08 00:00:00,3,1500)
    (3,karthi,23,Kota,2000,102,2017-02-08 00:00:00,3,3000)
    (4,Shyam,25,Mumbai,6500,103,2016-05-20 00:00:00,4,2060) (5,Hari,27,Bhopal,8500,,,,)
    (6,Kannan,22,MP,4500,,,,)
    (7,Mukil,24,Indore,10000,,,,)
    

    Using Multiple Keys

    • Let us perform JOIN operation using multiple keys are given below.

    Syntax

    • Here is how you can perform a JOIN operation on two tables using multiple keys.
    grunt> Relation3_name = JOIN Relation2_name BY (key1, key2), Relation3_name BY (key1, key2);
    
    • Ensure that we have two files namely wikitechy_employee.txt and wikitechy_employee_contact.txt in the /pig_data/ directory of HDFS as given below.

    wikitechy_employee.txt

    111,Anu,Shankar,21,programmer,003
    112,Barvathi,Nambiayar,22,programmer,003
    113,Kajal,Nayak,22,programmer,003
    114,Preethi,Antony,21,programmer,003
    115,Raj,Gopal,23,programmer,003
    116,Yashika,Kannan,23,programmer,003
    117,siddu,Narayanan,24,teamlead,002
    118,Timple,Mohanthy,24,manager,001
    

    wikitechy_employee_contact.txt

    111,9876543210,[email protected],Hyderabad,003
    112,9876543211,[email protected],Kolkata,003
    113,9876543212,[email protected],Delhi,003
    114,9876543213,[email protected],Pune,003
    115,9876543214,[email protected],Bhuwaneshwar,003
    116,9876543215,[email protected],Chennai,003
    117,9876543216,[email protected],trivendram,002
    118,9876543217,[email protected],Chennai,001
    
    • And we have loaded these two files into Pig with relations wikitechy_employee and wikitechy_employee_contact as given below.
    grunt> wikitechy_employee_contact = LOAD 'hdfs://localhost:9000/pig_data/employee_contact.txt' USING PigStorage(',') 
       as (id:int, phone:chararray, email:chararray, city:chararray, jobid:int);
    
    • Here join the contents of these two relations using the JOIN operator as given below.
    grunt> emp = JOIN wikitechy_employee BY (id,jobid), wikitechy_employee_contact BY (id,jobid);
    

    Verification

    • To verify the relation emp using the DUMP operator as shown below.
    grunt> Dump emp; 
    

    Output

    • The following output, displaying the contents of the relation named emp as given below.
    (111,Anu,Shankar,21,programmer,113,111,9876543210,[email protected],Hyderabad,113)
    (112,Barvathi,Nambiayar,22,programmer,113,112,9876543211,[email protected],Kolka ta,113)  
    (113,Kajal,Nayak,22,programmer,113,113,9876543212,[email protected],Delhi,113)  
    (114,Preethi,Antony,21,programmer,113,114,9876543213,[email protected],Pune,113)  
    (115,Raj,Gopal,23,programmer,113,115,9876543214,[email protected],Bhuwaneshw ar,113)  
    (116,Yashika,Kannan,23,programmer,113,116,9876543215,[email protected],Chennai,113)  
    (117,siddu,Narayanan,24,teamlead,112,117,9876543216,[email protected],trivendram,112)  
    (118,Timple,Mohanthy,24,manager,111,118,9876543217,[email protected],Chennai,111)
    
    

    Related Searches to Apache Pig - Join Operator