  • In a Big Data context, I have a time series S1 = (t1, t2, t3, ...) sorted in ascending order. I would like to produce the series of time differences S2 = (t2-t1, t3-t2, ...).

Is there a way to do this in Apache Pig? And what would be a good way to do it that is suitable for large amounts of data?

Solution 1:

	S1 = Generate Id,Timestamp for t1...tn (Id = 1...n)
	S2 = Generate Id,Timestamp for t2...tn (Id = 1...n-1)
	S3 = Join S1 by Id, S2 by Id
	S4 = Extract S1.Timestamp, S2.Timestamp, (S2.Timestamp - S1.Timestamp)

Sample Data


s1 = LOAD 'test2.txt' USING PigStorage() AS (t:chararray);
s11 = foreach s1 generate ToDate(t) as t1;
s1_new = rank s11;

s2 = LOAD 'test2.txt' USING PigStorage() AS (t:chararray);
s22 = foreach s2 generate ToDate(t) as t1;
s2_new = rank s22;

-- Drop the first-ranked row and re-rank, so that t2 gets rank 1 and lines up with t1 in s1_new
ss = FILTER s2_new by (rank_s22 > 1);
ss_new = rank ss;

s3 = join s1_new by rank_s11,ss_new by rank_ss;
s4 = foreach s3 generate DaysBetween(ss_new::t1,s1_new::t1) as time_diff;

DUMP s4;
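
A variant of the same idea, as a rough sketch: load the file once and shift the rank by one instead of filtering and re-ranking. It assumes the same 'test2.txt' input, with one ISO-8601 timestamp per line (the format the single-argument ToDate expects) and already sorted as stated in the question. The aliases r, cur, nxt, idx and diffs are made up for illustration, and SecondsBetween is used in place of DaysBetween only to get a finer-grained difference.

```pig
-- Load once and number the rows in input order (assumes the file is already sorted)
s = LOAD 'test2.txt' USING PigStorage() AS (t:chararray);
d = FOREACH s GENERATE ToDate(t) AS t1;
r = RANK d;

-- cur keeps each row at its own rank; nxt moves each row one position earlier
cur = FOREACH r GENERATE rank_d AS idx, t1 AS t_cur;
nxt = FOREACH r GENERATE (rank_d - 1) AS idx, t1 AS t_nxt;

-- After the join, row i of cur is paired with row i+1 of the original series
j = JOIN cur BY idx, nxt BY idx;
diffs = FOREACH j GENERATE cur::idx AS idx,
        SecondsBetween(nxt::t_nxt, cur::t_cur) AS time_diff;

-- A join does not preserve order, so sort by position before printing
sorted = ORDER diffs BY idx;
DUMP sorted;
```

If the input is not guaranteed to be sorted, ranking with RANK d BY t1 should give the same numbering for distinct timestamps. Either way, the inner join drops the last timestamp, which has no successor, so n timestamps yield n-1 differences.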
