[Solved-2 Solutions] Finding mean using pig or hadoop ?



What is mean ?

  • The mean is the average of the numbers: a calculated "central" value of a set of numbers.

Problem :

Data is saved in directory data/data1.txt, data2.txt and so on

merchant_id, user_id, amount
1234, 9123, 299.2
1233, 9199, 203.2
 1234, 0124, 230
 and so on..

What we want to do is for each merchant, find the average amount. So basically, in the end we want to save the output in file. something like

 merchant_id, average_amount
  1234, avg_amt_1234 a
  and so on.

How to calculate the standard deviation ?

Solution 1:

Apache PIG is well adapted form such tasks.

Example

inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray,c2:chararray);
grp = group inpt by id;
mean = foreach grp {
    sum = SUM(inpt.amount);
    count = COUNT(inpt);
    generate group as id, sum/count as mean, sum as sum, count as count;
};
  • Pay special attention to the data type of the amount column as it will influence which implementation of the SUM function PIG is going to invoke.
  • PIG can also do something that SQL can not, it can put the mean against each input row without using any inner joins. That is useful if you are calculating z-scores using standard deviation.
mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate FLATTEN(inpt), sum/count as mean, sum as sum, count as count;
};
  • FLATTEN(inpt) have an access to the original amount that had contributed to the groups average, sum and count.

The below code helps to Calculate variance and standard deviation

inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
grp = group inpt by id;
mean = foreach grp {
        sum = SUM(inpt.amnt);
        count = COUNT(inpt);
        generate flatten(inpt), sum/count as avg, count as count;
};
tmp = foreach mean {
    dif = (amnt - avg) * (amnt - avg) ;
     generate *, dif as dif;
};
grp = group tmp by id;
standard_tmp = foreach grp generate flatten(tmp), SUM(tmp.dif) as sqr_sum; 
standard = foreach standard_tmp generate *, sqr_sum / count as variance, SQRT(sqr_sum / count) as standard;

Solution 2:

The map step:

record -> (merchant_id as key, amount as value)

The reduce step:

merchant_id, amount) -> (merchant_id, aggregate the value you want)
  • As in the reduce step, we will be provided with a stream of record having the same key and we can do almost everything we can including the average, variance.

Related Searches to finding mean using pig or hadoop