pig tutorial - apache pig tutorial - Apache Pig - Join Operator - pig latin - apache pig - pig hadoop




What is Join?

  • JOIN keyword is used to combine rows from two or more logs, based on a common fields .
  • Left join returns all logs from left log(table) and matching logs from the right log.
 customer id column

Learn Apache Pig - Apache Pig tutorial - customer id column - Apache Pig examples - Apache Pig programs

Pig Operations - Joining

  • JOIN joins two or more sets
  • learn apache pig - apache pig tutorial - pig tutorial - apache pig examples - big data - apache pig script - apache pig program - apache pig download - apache pig example  - pig join operations statement
  • Good build-in support for joins
    • Many different join implementations
    • Left, right and full outer joins are supported
    • Joining on multiple keys is supported
  • Merge Join
    • sets are pre-sorted by the join key
  • Merge-Sparse Join (since Pig 0.10.0)
    • sets are pre-sorted and one set has few ( < 1% of its total) matching keys
  • Replicated Join
    • one set is very large, while other sets are small enough to fit into memory
  • Skewed Join
    • when a large number of records for some values of the join key is expected
  • Regular Join
  • (inner) JOIN instruction:
    • Our classic database operator for relations!
    learn apache pig - apache pig tutorial - pig tutorial - apache pig examples - big data - apache pig script - apache pig program - apache pig download - apache pig example  -apache pig inner join operation
  • (left) JOIN instruction:
    • Our classic database operator for relations!
    learn apache pig - apache pig tutorial - pig tutorial - apache pig examples - big data - apache pig script - apache pig program - apache pig download - apache pig example  - apache pig left outer join operation

    How to use join operator in Apache Pig ?

    • The JOIN operator is used to combine records from two or more relations. While performing a join operation, we declare one (or a group of) tuple(s) from each relation, as keys. When these keys match, the two particular tuples are matched, else the records are dropped.
      • Self-join
      • Inner-join
      • Outer-join − left join, right join, and full join

    wikitechy_customers.txt

    1,Raj,32,Ahmedabad,2000.00
    2,Keerthana,25,Delhi,1500.00
    3,karthi,23,Kota,2000.00
    4,Shyam,25,Mumbai,6500.00 
    5,Hari,27,Bhopal,8500.00
    6,Kannan,22,MP,4500.00
    7,Mukil,24,Indore,10000.00
    

    orders.txt

    102,2017-01-08 00:00:00,3,3000
    100,2017-01-08 00:00:00,3,1500
    101,2017-02-20 00:00:00,2,1560
    103,2016-05-20 00:00:00,4,2060
    

    And these two files into Pig with the relations wikitechy_customers and orders as given below.

    grunt> wikitechy_customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
       as (id:int, name:chararray, age:int, address:chararray, salary:int);
      
    grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
       as (oid:int, date:chararray, customer_id:int, amount:int);
    
    • To perform various Join operations on these two relations.

    Self - join

    • Self-join is used to join a table with itself as if the table were two relations, temporarily renaming at least one relation.
    • Usually, in Apache Pig, to perform self-join, we will load the same data multiple times, under different aliases (names).
    • Therefore load the contents of the file wikitechy_customers.txt as two tables as given below.
    grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_customers.txt' USING PigStorage(',')
       as (id:int, name:chararray, age:int, address:chararray, salary:int);
      
    grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_customers.txt' USING PigStorage(',')
       as (id:int, name:chararray, age:int, address:chararray, salary:int); 
    

    Syntax

    • Given below is the syntax of performing self-join operation using the JOIN operator.
    grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;
    

    Example

    • Let us perform self-join operation on the relation customers, by joining the two relations customers1 and customers2 as given below.
    grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
    

    Verification

    • Now verify the relation customers3 using the DUMP operator as given below.
    grunt> Dump customers3;
    

    Output

    • The following output, displaying the contents of the relation customers
    (1,Raj,32,Ahmedabad,2000,1,Raj,32,Ahmedabad,2000)
    (2,Keerthana,25,Delhi,1500,2,Keerthana,25,Delhi,1500)
    (3,karthi,23,Kota,2000,3,karthi,23,Kota,2000)
    (4,Shyam,25,Mumbai,6500,4,Shyam,25,Mumbai,6500) 
    (5,Hari,27,Bhopal,8500,5,Hari,27,Bhopal,8500)
    (6,Kannan,22,MP,4500,6,Kannan,22,MP,4500)
    (7,Mukil,24,Indore,10000,7,Mukil,24,Indore,10000)
    
    

    Inner Join

    • Inner Join is used quite frequently; it is also referred to as equijoin. An inner join returns rows when there is a match in both tables.
    • It creates a new relation by combining column values of two relations (say A and B) based upon the join-predicate.
    • The query compares each row of A with each row of B to find all pairs of rows which satisfy the join-predicate.
    • When the join-predicate is satisfied, the column values for each matched pair of rows of A and B are combined into a result row.
     inner join

    Learn Apache Pig - Apache Pig tutorial - inner join - Apache Pig examples - Apache Pig programs

    Syntax

    • Here is the syntax of performing inner join operation using the JOIN operator.
    grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
    

    Example

    • Let us perform inner join operation on the two relations wikitechy_customers and orders as given below.
    grunt> wikitechy_coustomer_orders = JOIN customers BY id, orders BY customer_id;
    

    Verification

    • Verify the relation wikitechy_coustomer_orders using the DUMP operator as given below.
    grunt> Dump wikitechy_coustomer_orders;
    

    Output

    • The following output that will the contents of the relation named wikitechy_coustomer_orders.
    (2,Keerthana,25,Delhi,1500,101,2017-01-20 00:00:00,2,1560)
    (3,karthi,23,Kota,2000,100,2017-01-08 00:00:00,3,1500)
    (3,karthi,23,Kota,2000,102,2017-02-08 00:00:00,3,3000)
    (4,Shyam,25,Mumbai,6500,103,2016-05-20 00:00:00,4,2060)
    

    Note

    • Outer Join: Unlike inner join, outer join returns all the rows from at least one of the relations. An outer join operation is carried out in three ways ,
      • Left outer join
      • Right outer join
      • Full outer join

    Left Outer Join

    • The left outer Join operation returns all rows from the left table, even if there are no matches in the right relation.
     left outer join

    Learn Apache Pig - Apache Pig tutorial - left outer join - Apache Pig examples - Apache Pig programs

    Syntax

    • Let us perform left outer join operation using the JOIN operator.
    grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;
    

    Example

    • Let us perform left outer join operation on the two relations customers and orders as given below.
    grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
    

    Verification

    • Verify the relation outer_left using the DUMP operator as given below.
    grunt> Dump outer_left;
    

    Output

    • The following output, displaying the contents of the relation outer_left.
    (1,Raj,32,Ahmedabad,2000,,,,)
    (2,Keerthana,25,Delhi,1500,101,2017-01-20 00:00:00,2,1560)
    (3,karthi,23,Kota,2000,100,2017-01-08 00:00:00,3,1500)
    (3,karthi,23,Kota,2000,102,2017-02-08 00:00:00,3,3000)
    (4,Shyam,25,Mumbai,6500,103,2016-05-20 00:00:00,4,2060) (5,Hari,27,Bhopal,8500,,,,)
    (6,Kannan,22,MP,4500,,,,)
    (7,Mukil,24,Indore,10000,,,,)
    

    Right Outer Join

    • The right outer join operation returns all rows from the right table, even if there are no matches in the left table.
     right outer join

    Learn Apache Pig - Apache Pig tutorial - right outer join - Apache Pig examples - Apache Pig programs

    Syntax

    • Given below is the syntax of performing right outer join operation using the JOIN operator.
    grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
    

    Example

    • Let us perform right outer join operation on the two relations wikitechy_customers and orders as given below.
    grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
    

    Verification

    • Verify the relation outer_right using the DUMP operator as given below.
    grunt> Dump outer_right
    

    Output

    • The following output, displaying the contents of the relation outer_right.
    (2,Keerthana,25,Delhi,1500,101,2017-01-20 00:00:00,2,1560)
    (3,karthi,23,Kota,2000,100,2017-01-08 00:00:00,3,1500)
    (3,karthi,23,Kota,2000,102,2017-02-08 00:00:00,3,3000)
    (4,Shyam,25,Mumbai,6500,103,2016-05-20 00:00:00,4,2060)
    
    • The full outer join operation returns rows when there is a match in one of the relations
     full join new

    Learn Apache Pig - Apache Pig tutorial - full join new - Apache Pig examples - Apache Pig programs

    Syntax

    • Given below is the syntax of performing full outer join using the JOIN operator.
    grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
    

    Example

    • Let us perform full outer join operation on the two relations wikitechy_customers and orders as given below.
    grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
    

    Verification

    • Verify the relation outer_full using the DUMP operator as given below.
    grun> Dump outer_full; 
    

    Output

    • The following output, displaying the contents of the relation outer_full.
    (1,Raj,32,Ahmedabad,2000,,,,)
    (2,Keerthana,25,Delhi,1500,101,2017-01-20 00:00:00,2,1560)
    (3,karthi,23,Kota,2000,100,2017-01-08 00:00:00,3,1500)
    (3,karthi,23,Kota,2000,102,2017-02-08 00:00:00,3,3000)
    (4,Shyam,25,Mumbai,6500,103,2016-05-20 00:00:00,4,2060) (5,Hari,27,Bhopal,8500,,,,)
    (6,Kannan,22,MP,4500,,,,)
    (7,Mukil,24,Indore,10000,,,,)
    

    Using Multiple Keys

    • Let us perform JOIN operation using multiple keys are given below.

    Syntax

    • Here is how you can perform a JOIN operation on two tables using multiple keys.
    grunt> Relation3_name = JOIN Relation2_name BY (key1, key2), Relation3_name BY (key1, key2);
    
    • Ensure that we have two files namely wikitechy_employee.txt and wikitechy_employee_contact.txt in the /pig_data/ directory of HDFS as given below.

    wikitechy_employee.txt

    111,Anu,Shankar,21,programmer,003
    112,Barvathi,Nambiayar,22,programmer,003
    113,Kajal,Nayak,22,programmer,003
    114,Preethi,Antony,21,programmer,003
    115,Raj,Gopal,23,programmer,003
    116,Yashika,Kannan,23,programmer,003
    117,siddu,Narayanan,24,teamlead,002
    118,Timple,Mohanthy,24,manager,001
    

    wikitechy_employee_contact.txt

    111,9876543210,[email protected],Hyderabad,003
    112,9876543211,[email protected],Kolkata,003
    113,9876543212,[email protected],Delhi,003
    114,9876543213,[email protected],Pune,003
    115,9876543214,[email protected],Bhuwaneshwar,003
    116,9876543215,[email protected],Chennai,003
    117,9876543216,[email protected],trivendram,002
    118,9876543217,[email protected],Chennai,001
    
    • And we have loaded these two files into Pig with relations wikitechy_employee and wikitechy_employee_contact as given below.
    grunt> wikitechy_employee_contact = LOAD 'hdfs://localhost:9000/pig_data/employee_contact.txt' USING PigStorage(',') 
       as (id:int, phone:chararray, email:chararray, city:chararray, jobid:int);
    
    • Here join the contents of these two relations using the JOIN operator as given below.
    grunt> emp = JOIN wikitechy_employee BY (id,jobid), wikitechy_employee_contact BY (id,jobid);
    

    Verification

    • To verify the relation emp using the DUMP operator as shown below.
    grunt> Dump emp; 
    

    Output

    • The following output, displaying the contents of the relation named emp as given below.
    (111,Anu,Shankar,21,programmer,113,111,9876543210,[email protected],Hyderabad,113)
    (112,Barvathi,Nambiayar,22,programmer,113,112,9876543211,[email protected],Kolka ta,113)  
    (113,Kajal,Nayak,22,programmer,113,113,9876543212,[email protected],Delhi,113)  
    (114,Preethi,Antony,21,programmer,113,114,9876543213,[email protected],Pune,113)  
    (115,Raj,Gopal,23,programmer,113,115,9876543214,[email protected],Bhuwaneshw ar,113)  
    (116,Yashika,Kannan,23,programmer,113,116,9876543215,[email protected],Chennai,113)  
    (117,siddu,Narayanan,24,teamlead,112,117,9876543216,[email protected],trivendram,112)  
    (118,Timple,Mohanthy,24,manager,111,118,9876543217,[email protected],Chennai,111)
    
    

    Related Searches to Apache Pig - Join Operator

    Adblocker detected! Please consider reading this notice.

    We've detected that you are using AdBlock Plus or some other adblocking software which is preventing the page from fully loading.

    We don't have any banner, Flash, animation, obnoxious sound, or popup ad. We do not implement these annoying types of ads!

    We need money to operate the site, and almost all of it comes from our online advertising.

    Please add wikitechy.com to your ad blocking whitelist or disable your adblocking software.

    ×