Apache Pig - DIFF() Function




What is the DIFF() Function in Apache Pig?

  • The DIFF() function in Apache Pig compares two fields of a tuple; in practice these are usually two bags.
  • When both arguments are bags, DIFF() returns a bag holding every tuple that appears in one bag but not in the other. If the bags match, it returns an empty bag.
  • When the arguments are not bags, the two values are wrapped in tuples and returned in a bag if they differ; if they are equal, an empty bag is returned.
  • Both bags are assumed to fit in memory at the same time. If they do not, the function still works but can be very slow.
  • Because the two bags must sit in the same tuple, the relations being compared are normally grouped together first.

Syntax

DIFF (expression, expression)

Example

  • Assume that two files, wikitechy_employee_sales.txt and wikitechy_employee_bonus.txt, are stored in the HDFS directory /pig_data/ with the following contents:

wikitechy_employee_sales.txt

1,Rubin,22,25000,sales 
2,BOB,23,30000,sales 
3,Lavanya,23,25000,sales 
4,Sarah,25,40000,sales 
5,Arvin,23,45000,sales 
6,Sruti,22,35000,sales

wikitechy_employee_bonus.txt

1,Rubin,22,25000,sales 
2,Maya,23,20000,admin 
3,Lavanya,23,25000,sales 
4,Arya,25,50000,admin 
5,Arvin,23,45000,sales 
6,Omakr,30,30000,admin

Load both files into Pig as the relations employee_sales and employee_bonus.

grunt> employee_sales = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_employee_sales.txt' USING PigStorage(',')
as (sno:int, name:chararray, age:int, salary:int, dept:chararray);

grunt> employee_bonus = LOAD 'hdfs://localhost:9000/pig_data/wikitechy_employee_bonus.txt' USING PigStorage(',')
as (sno:int, name:chararray, age:int, salary:int, dept:chararray);

  • Group the tuples of the relations employee_sales and employee_bonus by the key sno with the COGROUP operator:

grunt> cogroup_data = COGROUP employee_sales by sno, employee_bonus by sno;

Verify the relation cogroup_data with the DUMP operator:

grunt> Dump cogroup_data;
(1,{(1,Rubin,22,25000,sales)},{(1,Rubin,22,25000,sales)}) 
(2,{(2,BOB,23,30000,sales)},{(2,Maya,23,20000,admin)}) 
(3,{(3,Lavanya,23,25000,sales)},{(3,Lavanya,23,25000,sales)}) 
(4,{(4,Sarah,25,40000,sales)},{(4,Arya,25,50000,admin)}) 
(5,{(5,Arvin,23,45000,sales)},{(5,Arvin,23,45000,sales)}) 
(6,{(6,Sruti,22,35000,sales)},{(6,Omakr,30,30000,admin)})
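
The field names in cogroup_data matter for the next step, because DIFF() is called with them. A small sketch, assuming the cogroup_data relation defined above: DESCRIBE prints the schema, in which COGROUP names the key field group and names each bag after the relation it came from. The statement can be typed at the grunt> prompt.

```pig
-- Print the schema of cogroup_data.
-- Expect three fields: group (the sno key), employee_sales (bag), employee_bonus (bag).
DESCRIBE cogroup_data;
```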

Calculating the Difference between Two Relations

Compute the difference between the two relations with the DIFF() function and store the result in the relation diff_data. DIFF() is applied to the two bags in each cogroup_data tuple, so the comparison is made separately for each sno:

grunt> diff_data = FOREACH cogroup_data GENERATE DIFF(employee_sales,employee_bonus);

Verification

grunt> Dump diff_data;   
({}) 
({(2,BOB,23,30000,sales),(2,Maya,23,20000,admin)}) 
({}) 
({(4,Sarah,25,40000,sales),(4,Arya,25,50000,admin)}) 
({}) 
({(6,Sruti,22,35000,sales),(6,Omakr,30,30000,admin)})
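
The output above does not show which sno each bag belongs to, and the matching keys show up as empty bags. The following is a minimal sketch of one way to make the result easier to read, assuming the cogroup_data relation from the example. The names diff_by_sno, changes, changed_only and changed_rows are made up for illustration and are not part of the original tutorial. The order of tuples inside a bag is not guaranteed.

```pig
-- Keep the COGROUP key next to the DIFF() result.
diff_by_sno = FOREACH cogroup_data GENERATE group AS sno, DIFF(employee_sales, employee_bonus) AS changes;

-- Drop the keys whose two bags matched (DIFF() returned an empty bag).
changed_only = FILTER diff_by_sno BY NOT IsEmpty(changes);

-- Produce one output row per differing record instead of one bag per key.
changed_rows = FOREACH changed_only GENERATE sno, FLATTEN(changes);

DUMP changed_only;
DUMP changed_rows;
```

With the sample data, only the keys 2, 4 and 6 should remain in changed_only, since the records for 1, 3 and 5 are identical in both files.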
