[Solved-1 Solutions] Apache Pig load entire relation into UDF?



UDF:

  • Apache Pig provides extensive support for User Defined Functions (UDFs). With UDFs, we can define our own functions and use them in Pig scripts.
  • UDF support is provided in six programming languages: Java, Jython, Python, JavaScript, Ruby and Groovy.
  • Java has complete UDF support; the remaining languages have limited support.
  • In Java, we can write UDFs for every part of the processing, including data load/store, column transformation, and aggregation.
  • Because Apache Pig itself is written in Java, UDFs written in Java run more efficiently than those written in the other languages.
  • Apache Pig also has a Java repository for UDFs named Piggybank. Through Piggybank, we can use Java UDFs written by other users and contribute our own.

Problem:

We have a Pig script that involves two Pig relations, say A and B. A is a small relation and B is a big one. Our UDF should load all of A into memory on each machine and then use it while processing B. Currently we do it like this:

A = foreach smallRelation Generate ...
B = foreach largeRelation Generate propertyOfB;
store A into 'templocation';
C = foreach B Generate CustomUdf(propertyOfB);

Every machine then loads A from 'templocation'. This works, but we have two problems with it:

  • How can we load a relation directly into the HDFS cache?
  • When we reload the file in the UDF, we have to write logic to parse the output of A that was written to the file, when we would rather be using bags and tuples directly (is there a built-in Pig Java function to parse strings back into Bag/Tuple form?).

Does anyone know how this should be done?

Solution 1:

  • We can GROUP ALL on A first, which "bags" all the data in A into one field. Then we artificially add a common field to both A and B and join them on it.
  • This way, for each tuple in the enhanced B, the full data of A is available for the UDF to use.

Here is the code used to load the entire relation into the UDF:

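-- add an artificial join key with value 'xx' to every row of B
B_aux = FOREACH B GENERATE 'xx' AS join_key, fb1, fb2;
-- collapse A into a single row; $1 is the bag holding all of A
A_all = GROUP A ALL;
A_aux = FOREACH A_all GENERATE 'xx' AS join_key, $1 AS a_bag;
-- replicated join: the small relation (A_aux) must be listed last
A_B_JOINED = JOIN B_aux BY join_key, A_aux BY join_key USING 'replicated';
-- pass the bag field, not the relation alias, to the UDF
C = FOREACH A_B_JOINED GENERATE CustomUdf(fb1, fb2, a_bag);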
  • Because this is a replicated join, it runs as a map-side join only: the small relation is held in memory on each mapper, so no reduce phase is needed.
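
With this approach the UDF receives the whole of A as a bag on every call, so it may be worth indexing that bag once and reusing the index for later rows of B. The following is only a minimal sketch of that idea in plain Java. It assumes the rows of A are (key, value) pairs and uses `List<List<Object>>` as a stand-in for Pig's `DataBag` of `Tuple`s; the class name `SmallRelationLookup` and the method `lookup` are hypothetical. In a real UDF this logic would sit inside `exec(Tuple)` of a class extending `org.apache.pig.EvalFunc`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the body of a hypothetical CustomUdf.
// Each row of A is modeled as a List<Object> holding (key, value).
class SmallRelationLookup {
    // Index over A, built lazily on the first call and reused afterwards.
    // Assumption: the same bag of A is passed on every call, which holds
    // when A has been joined onto every row of B.
    private Map<Object, Object> index;

    // rowsOfA: every row of A as (key, value).
    // keyFromB: the field of the current B row to look up.
    // Returns the matching value from A, or null when there is no match.
    Object lookup(List<List<Object>> rowsOfA, Object keyFromB) {
        if (index == null) {
            index = new HashMap<>();
            for (List<Object> row : rowsOfA) {
                index.put(row.get(0), row.get(1));
            }
        }
        return index.get(keyFromB);
    }

    public static void main(String[] args) {
        List<List<Object>> a = Arrays.asList(
            Arrays.<Object>asList("k1", 10),
            Arrays.<Object>asList("k2", 20));
        SmallRelationLookup udf = new SmallRelationLookup();
        System.out.println(udf.lookup(a, "k2"));      // prints 20
        System.out.println(udf.lookup(a, "missing")); // prints null
    }
}
```

Building the map once per UDF instance avoids rescanning A for every row of B; whether that matters in practice depends on how large A is.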
