[Solved-1 Solution] Max/Min for whole sets of records in PIG ?



What is max()

  • The Pig Latin MAX() function is used to calculate the highest value for a column (numeric values or chararrays) in a single-column bag. While calculating the maximum value, the Max() function ignores the NULL values.
  • To get the global maximum value, we need to perform a Group All.operation, and calculate the maximum value using the MAX() function.
  • To get the maximum value of a group, we need to group it using the Group By operator and proceed with the maximum function.

Syntax

  • Given below is the syntax of the Max() function.
grunt> Max(expression)

What is min()

  • The MIN() function of Pig Latin is used to get the minimum (lowest) value (numeric or chararray) for a certain column in a single-column bag. While calculating the minimum value, the MIN() function ignores the NULL values.
  • To get the global minimum value, we need to perform a Group All operation, and calculate the minimum value using the MIN() function.
  • To get the minimum value of a group, we need to group it using theGroup By operator and proceed with the minimum function.

Syntax

  • Here is the syntax of the MIN() function.
grunt> MIN(expression)

Problem:

  • Here is an example of set of records that we loading from a file and the first thing we need to do is get the max and min of a column.
  • In SQL we would do this with a subquery like this:
select c.state, c.population, 
(select max(c.population) from state_info c) as max_pop, 
(select min(c.population) from state_info c) as min_pop
from state_info c
  • It has a MAX and MIN function but when we tried doing the following it didn't work:
records=LOAD '/Users/Winter/School/st_incm.txt'  AS (state:chararray, population:int);
with_max = FOREACH records GENERATE state, population, MAX(population);
  • This one is not working. Is there any solution?

Solution 1:

  • We need to group all the data together but no extra column is required if we can use GROUP ALL

Pig:

records = LOAD 'states.txt'  AS (state:chararray, population:int);
records_group = GROUP records ALL;
with_max = FOREACH records_group 
           GENERATE
               FLATTEN(records.(state, population)), MAX(records.population);

Input:

CA  10
VA  5
WI  2

Output:

(CA,10,10)
(VA,5,10)
(WI,2,10)

Related Searches to Max/Min for whole sets of records in PIG