[Solved-2 Solutions] Pig approach to pairing data fields in a data set?



Self-join

  • A self-join joins a table with itself as if it were two relations, with at least one of them temporarily renamed.
  • In Apache Pig, a self-join is usually performed by loading the same data more than once under different aliases (names). For example, load the contents of the file customers.txt as two relations, as shown below.
grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);
  
grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int); 

Syntax

  • The syntax for performing a self-join with the JOIN operator is given below.
grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;
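
As an illustration, the syntax above could be applied to the two customers relations roughly as follows. This is a minimal sketch: the join key id and the alias customers3 are assumptions chosen for the example, and the LOAD statements are repeated so that the snippet is self-contained.

```pig
-- Load the same file twice under two aliases (same statements as above).
customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   AS (id:int, name:chararray, age:int, address:chararray, salary:int);
customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   AS (id:int, name:chararray, age:int, address:chararray, salary:int);

-- Assumption: 'id' is used as the join key; any shared column works the same way.
customers3 = JOIN customers1 BY id, customers2 BY id;

-- Inspect the schema and contents of the joined relation.
DESCRIBE customers3;
DUMP customers3;
```

In the joined relation every field appears twice, so a field is referenced with the :: prefix, for example customers1::name and customers2::name.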

Problem:

Is there a way to pair data fields in a data set in Pig?

Solution 1:

  • The first approach is a self-join: load the classes data twice, as s1 and s2, and join the two relations on class. The result contains every ordered pair of students in the same class, including each student paired with themselves.

s1 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
s2 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
b = JOIN s1 BY class, s2 BY class;

The other option is to GROUP the data by class and then use CROSS nested inside a FOREACH:

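-- Load the data once; s is the relation that gets grouped and crossed.
s = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
B = GROUP s BY class;
C = FOREACH B {
   -- Cross the bag of each class with itself to get all student pairs.
   DA = CROSS s, s;
   GENERATE FLATTEN(DA);
}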

Solution 2:

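  • This can be done with a self-join and some simple filtering. Keeping only the rows where the first student sorts before the second removes self-pairs and mirrored duplicates such as (bob, alice) when (alice, bob) is already present.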
classes1 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
classes2 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
joined = JOIN classes1 BY class, classes2 BY class;
-- After a JOIN, fields are disambiguated with '::' (relation::field), not '.'.
filtered = FILTER joined BY classes1::student < classes2::student;
pairs = FOREACH filtered GENERATE classes1::student AS student1, classes2::student AS student2;
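
To make the effect of the filter concrete, here is a small worked sketch. The sample contents of the classes file and the extra class column in the output are assumptions made for illustration; they are not part of the original question.

```pig
-- Hypothetical contents of 'classes' (illustrative only):
--   math,alice
--   math,bob
--   math,carol
--   art,dave
classes1 = LOAD 'classes' USING PigStorage(',') AS (class:chararray, student:chararray);
classes2 = LOAD 'classes' USING PigStorage(',') AS (class:chararray, student:chararray);

-- Self-join on class: every ordered pair of students within a class.
joined = JOIN classes1 BY class, classes2 BY class;

-- Keep each unordered pair once and drop self-pairs (chararrays compare lexicographically).
filtered = FILTER joined BY classes1::student < classes2::student;

-- Keep the class alongside the pair so the output shows where each pair comes from.
pairs = FOREACH filtered GENERATE classes1::class AS class,
                                  classes1::student AS student1,
                                  classes2::student AS student2;
DUMP pairs;
-- With the sample data above, the result should contain these tuples
-- (output order is not guaranteed):
--   (math,alice,bob)
--   (math,alice,carol)
--   (math,bob,carol)
```

The art class produces no output because it has only one student, so there is nobody to pair dave with.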
