[Solved-3 Solutions] Pig Script REPLACE with pipe symbol ?



Problem:

  • Suppose you want to strip the characters outside the curly brackets in rows that look like the following:
35|{......}|
  • That means removing the '35|' from the front and the trailing '|' from the end, leaving:
{......}
  • Starting with just the first 3 characters, you try the following, but it removes everything:
a = LOAD '/file' as (line1:chararray);

b = FOREACH a generate REPLACE(line1, '35|','');

Solution 1:

  • The second parameter of REPLACE is a regular expression, and |, { and } are special characters in regular expressions.
  • Escape the pipe so it is matched literally. Inside a Pig string literal the backslash has to be doubled:
b = FOREACH a generate REPLACE(line1, '35\\|','');

Solution 2:

  • If we want to transform the data in a more complex way that can't be achieved simply with REPLACE, we can create a JavaScript/Java/Jython/Ruby/Groovy/Python User Defined Function (UDF), which takes your data as input and returns the processed data.

Example of Javascript UDF:

Pig Script:

--including the js file containing the UDF
 register 'test.js' using javascript as myfuncs;

 a = LOAD '/file' as (line1:chararray);

 --Processing each line1 by calling UDF
 b = FOREACH a generate myfuncs.processData(line1);
 dump b;

test.js

processData.outputSchema = "word:chararray,num:long";

 function processData(word){
    return {word:word, num:word.length};
 }
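
The processData example only shows the shape of a UDF; it does not strip anything. A minimal sketch of a UDF that does the stripping described in the problem could look like the following. The function name stripOutsideBraces and the schema field name body are hypothetical, and the sketch assumes every row has at most one outer pair of curly brackets.

```javascript
// Hypothetical UDF: keeps only the text from the first '{' to the last '}'.
// The schema string follows the same convention as processData above.
stripOutsideBraces.outputSchema = "body:chararray";

function stripOutsideBraces(line) {
    // Pass missing values through as null instead of failing.
    if (line === null || line === undefined) {
        return null;
    }
    var text = String(line);
    var start = text.indexOf('{');
    var end = text.lastIndexOf('}');
    // No well-formed {...} section in this row.
    if (start === -1 || end < start) {
        return null;
    }
    return text.substring(start, end + 1);
}
```

If this function were added to test.js, it could be called in the same way as processData, for example: b = FOREACH a generate myfuncs.stripOutsideBraces(line1);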

Solution 3:

  • Alternatively, use REGEX_EXTRACT to capture the curly-bracket section directly (capture group 1):
b = FOREACH a generate REGEX_EXTRACT(line1, '.*(\\{.*\\}).*', 1);
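
The patterns from Solution 1 and Solution 3 can be sanity-checked outside Pig before running a job. The sketch below uses plain JavaScript and assumes that, for patterns this simple, JavaScript regular expressions behave the same as the Java ones Pig uses. Note that a JavaScript regex literal needs a single backslash where the Pig string literal needs `\\`. The variable names are illustrative only.

```javascript
var row = '35|{......}|';

// Solution 1 pattern: the escaped pipe matches a literal '|',
// so only the leading "35|" is removed.
var withoutPrefix = row.replace(/35\|/, '');

// Solution 3 pattern: capture group 1 holds everything from '{' to '}'.
var match = row.match(/.*(\{.*\}).*/);
var extracted = match ? match[1] : null;

console.log(withoutPrefix); // {......}|
console.log(extracted);     // {......}
```

The first result still ends with '|', which shows that Solution 1 as written handles only the prefix, while the capture group in Solution 3 drops both ends in one step.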
