[Solved-1 Solution] Subtract one row's value from another row in Pig?



Problem:

Suppose you are developing a sample Pig program to analyse log files and need to measure the running time of different jobs. When you read a job's log file, you get the start time and the end time of the job, like this:

(Wed,03/20/13,01:03:37,EDT)
(Wed,03/20/13,01:05:00,EDT)

To calculate the elapsed time, you need to subtract these two timestamps, but both are separate rows in the same bag, and it is not clear how to compare them.

How do you subtract one row's value from another row in Pig?

Solution 1

Is there a unique ID for the job that appears in both log lines? Also, is there something to indicate which event is the start and which is the end?

If so, you could read the dataset twice, once for start events and once for end events, and join the two together on the job ID. Then you'll have one record with both events in it.

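-- Keep only the fields needed; id is assumed to identify the job in both log lines.
A = FOREACH logline GENERATE id, type, timestamp;
start_events = FILTER A BY (type == 'start');
end_events = FILTER A BY (type == 'end');
-- After the join, each record holds both the start and the end event of one job.
JOINED = JOIN start_events BY id, end_events BY id;
-- End minus start gives the elapsed time; this assumes timestamp is numeric (for example epoch seconds).
DIFF = FOREACH JOINED GENERATE start_events::id AS id, (end_events::timestamp - start_events::timestamp) AS elapsed;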
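
The timestamps in the question are text tuples rather than numbers, so they would need to be converted before subtracting. A minimal sketch of one way to do that with Pig's built-in ToDate and SecondsBetween functions (available from Pig 0.11) follows. The file name jobs.log, the comma-separated layout, and the id and type fields are assumptions made for illustration; the sample data in the question shows only the four timestamp parts. The time zone field is ignored, which is only safe when both events are logged in the same zone.

```pig
-- Hypothetical input line: job1,start,Wed,03/20/13,01:03:37,EDT
logline = LOAD 'jobs.log' USING PigStorage(',')
          AS (id:chararray, type:chararray, day:chararray,
              date_str:chararray, time_str:chararray, tz:chararray);

-- Build a datetime from the date and time parts (tz is dropped, see the assumption above).
A = FOREACH logline GENERATE id, type,
        ToDate(CONCAT(date_str, CONCAT(' ', time_str)), 'MM/dd/yy HH:mm:ss') AS ts;

start_events = FILTER A BY type == 'start';
end_events   = FILTER A BY type == 'end';

-- One record per job, holding both its start and end event.
JOINED = JOIN start_events BY id, end_events BY id;

-- SecondsBetween expects the later datetime first.
DIFF = FOREACH JOINED GENERATE start_events::id AS id,
        SecondsBetween(end_events::ts, start_events::ts) AS elapsed_seconds;

DUMP DIFF;
```

For the two sample timestamps in the question, 01:03:37 and 01:05:00 on the same day, the elapsed time is 83 seconds.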
