[Solved-1 Solution] Using IN clause with PIG FILTER?



IN operator

  • Before release 0.12.0, Pig had no IN operator. To imitate an IN operation, users had to chain several conditions together with OR.

Here is an example:

a = LOAD '1.txt' USING PigStorage(',') AS (i:int);
b = FILTER a BY
(i == 1) OR
(i == 22) OR
(i == 333) OR
(i == 4444) OR
(i == 55555);
  • The same expression can now be written more compactly with the IN operator:
a = LOAD '1.txt' USING PigStorage(',') AS (i:int);
b = FILTER a BY i IN (1, 22, 333, 4444, 55555);
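
The opposite case, keeping only the rows whose value is not in the list, can be sketched by wrapping the IN test in NOT. This is a minimal sketch rather than a definitive form: it assumes a Pig release that supports IN, reuses the file and field from the example above, and the relation name c is only illustrative.

```pig
-- Sketch only: same input file and schema as the example above.
a = LOAD '1.txt' USING PigStorage(',') AS (i:int);

-- Keep the rows whose value is NOT one of the listed values.
-- The parentheses make it explicit that NOT applies to the whole IN test.
c = FILTER a BY NOT (i IN (1, 22, 333, 4444, 55555));
DUMP c;
```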

Problem:

How do you use the IN clause with Pig FILTER?

Solution 1:

Filter in Pig

  • The FILTER operator in Pig removes unwanted records from a relation based on a condition. It is similar to the WHERE clause in SQL.
  • The syntax of the FILTER operator is shown below:
<new relation> = FILTER <relation> BY <condition>;

Here, <relation> is the data set to which the filter is applied, <condition> is the filter condition, and <new relation> is the relation that holds the rows for which the condition is true.

Pig Filter Examples:

Let's consider the sales data set below as an example:

year,product,quantity
---------------------
2000, iphone, 1000
2001, iphone, 1500
2002, iphone, 2000
2000, nokia, 1200
2001, nokia, 1500
2002, nokia, 900

1. Select the records whose quantity is greater than or equal to 1000.

grunt> A = LOAD '/user/hadoop/sales' USING PigStorage(',') AS (year:int,product:chararray,quantity:int);
grunt> B = FILTER A BY quantity >= 1000;
grunt> DUMP B;
(2000,iphone,1000)
(2001,iphone,1500)
(2002,iphone,2000)
(2000,nokia,1200)
(2001,nokia,1500)

2. Select the records whose quantity is greater than 1000 and whose year is 2001.

grunt> C = FILTER A BY quantity > 1000 AND year == 2001;
grunt> DUMP C;
(2001,iphone,1500)
(2001,nokia,1500)

3. Select the records whose year is not 2000.

grunt> D = FILTER A BY year != 2000;
grunt> DUMP D;
(2001,iphone,1500)
(2002,iphone,2000)
(2001,nokia,1500)
(2002,nokia,900)
  • All the logical operators (NOT, AND, OR) and relational operators (<, >, ==, !=, >=, <=) can be used in filter conditions.
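
To tie this back to the original question, the IN operator can be used in the same kind of FILTER condition on the sales data. The following is a minimal sketch, assuming a Pig release that supports IN and the same path and schema as the examples above; the relation names E and F are only illustrative.

```pig
-- Sketch only: same path and schema as the sales examples above.
A = LOAD '/user/hadoop/sales' USING PigStorage(',') AS (year:int, product:chararray, quantity:int);

-- Keep the records whose year is 2000 or 2002.
E = FILTER A BY year IN (2000, 2002);
DUMP E;

-- IN can be combined with other conditions through AND / OR.
F = FILTER A BY year IN (2000, 2002) AND quantity >= 1000;
DUMP F;
```

With the sample data shown earlier, E should keep the four records for 2000 and 2002, and F should additionally drop the 2002 nokia record, whose quantity is 900.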
