Apache Pig - DIFF() Function



What is the DIFF() Function in Apache Pig?

  • The DIFF() function in Apache Pig compares two fields in a tuple.
  • If both fields are bags, DIFF() returns a bag containing every tuple that appears in one bag but not in the other. If the bags match, it returns an empty bag.
  • If the fields are not bags, they are wrapped in tuples and returned in a bag when they do not match; an empty bag is returned when they match.
  • DIFF() is typically applied to the bags produced by a grouping operator such as COGROUP, inside a FOREACH ... GENERATE statement.
  • The implementation assumes that both bags fit in memory at the same time; with larger bags it still works but becomes very slow.

Syntax

DIFF(expression, expression)
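
As a rough sketch of how this syntax is used in practice, DIFF() appears inside a FOREACH ... GENERATE statement and is evaluated once per tuple. The file name bag_data, the relation names A and X, and the field names B1 and B2 below are hypothetical, chosen only for illustration:

```pig
-- Hypothetical input: each record carries two bag fields, B1 and B2.
A = LOAD 'bag_data' AS (B1:bag{T1:tuple(t1:int,t2:int)}, B2:bag{T2:tuple(f1:int,f2:int)});

-- For each record, emit the tuples found in only one of the two bags.
X = FOREACH A GENERATE DIFF(B1, B2);
```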

Example

  • Assume that we have two files, wikitechy_employee_sales.txt and wikitechy_employee_bonus.txt, in the HDFS directory /pig_data/, with the contents shown below:

wikitechy_employee_sales.txt

1,Rubin,22,25000,sales 
2,BOB,23,30000,sales 
3,Lavanya,23,25000,sales 
4,Sarah,25,40000,sales 
5,Arvin,23,45000,sales 
6,Sruti,22,35000,sales

wikitechy_employee_bonus.txt

1,Rubin,22,25000,sales 
2,Maya,23,20000,admin 
3,Lavanya,23,25000,sales 
4,Arya,25,50000,admin 
5,Arvin,23,45000,sales 
6,Omakr,30,30000,admin

Load both files into Pig as the relations employee_sales and employee_bonus.

employee_sales

grunt> employee_sales = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_employee_sales.txt' USING PigStorage(',')
as (sno:int, name:chararray, age:int, salary:int, dept:chararray);

employee_bonus

grunt> employee_bonus = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_employee_bonus.txt' USING PigStorage(',')
as (sno:int, name:chararray, age:int, salary:int, dept:chararray);

  • Group the records (tuples) of the relations employee_sales and employee_bonus by the key sno using the COGROUP operator:

grunt> cogroup_data = COGROUP employee_sales by sno, employee_bonus by sno;

Verify the relation cogroup_data using the DUMP operator. Each output tuple holds the group key sno followed by one bag per input relation:

grunt> Dump cogroup_data;
(1,{(1,Rubin,22,25000,sales)},{(1,Rubin,22,25000,sales)}) 
(2,{(2,BOB,23,30000,sales)},{(2,Maya,23,20000,admin)}) 
(3,{(3,Lavanya,23,25000,sales)},{(3,Lavanya,23,25000,sales)}) 
(4,{(4,Sarah,25,40000,sales)},{(4,Arya,25,50000,admin)}) 
(5,{(5,Arvin,23,45000,sales)},{(5,Arvin,23,45000,sales)}) 
(6,{(6,Sruti,22,35000,sales)},{(6,Omakr,30,30000,admin)})
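
To see where the bag names passed to DIFF() come from, it can help to inspect the schema of the grouped relation. A minimal sketch, assuming the relations were loaded and grouped exactly as above:

```pig
-- Print the schema of the cogrouped relation.
DESCRIBE cogroup_data;
```

The schema should list three fields: group (the sno key) and two bags named after the input relations, employee_sales and employee_bonus.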

Calculating the Difference between Two Relations

Compute the difference between the two relations by applying DIFF() to the two bags in each tuple of cogroup_data, and store the result in the relation diff_data:

grunt> diff_data = FOREACH cogroup_data GENERATE DIFF(employee_sales,employee_bonus);

Verification

grunt> Dump diff_data;   
({}) 
({(2,BOB,23,30000,sales),(2,Maya,23,20000,admin)}) 
({}) 
({(4,Sarah,25,40000,sales),(4,Arya,25,50000,admin)}) 
({}) 
({(6,Sruti,22,35000,sales),(6,Omakr,30,30000,admin)})
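
The output above no longer shows which sno each bag belongs to, and half of the rows are empty. One possible refinement, sketched here under the assumption that cogroup_data is defined as above, keeps the group key and filters out the empty bags with the built-in IsEmpty() function. The names diff_with_key, diff_bag, and changed are hypothetical:

```pig
-- Keep the group key next to the bag of differing tuples.
diff_with_key = FOREACH cogroup_data GENERATE group AS sno, DIFF(employee_sales, employee_bonus) AS diff_bag;

-- Drop the keys whose sales and bonus records are identical.
changed = FILTER diff_with_key BY NOT IsEmpty(diff_bag);

DUMP changed;
```

With the sample data, only the rows for sno 2, 4, and 6 should remain.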
