[Solved-2 Solutions] Pig problem with split string (STRSPLIT)?



What is split string

  • This function is used to split a given string by a given delimiter.

Syntax

  • The syntax of STRSPLIT() is given below. The function accepts the string to be split, a regular expression that acts as the delimiter, and an integer limit that caps the number of substrings the string is split into.
  • The function scans the string and splits it wherever the regular expression matches, producing at most n substrings, where n is the value passed as limit.
grunt> STRSPLIT(string, regex, limit)
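The effect of the limit argument is easiest to see in plain Java. This is a minimal sketch that assumes STRSPLIT hands its arguments to Java's String.split(regex, limit). That matches the behaviour described above, but check it against your Pig version. The class and helper names are made up for illustration.

```java
import java.util.Arrays;

class StrSplitLimitDemo {
    // Hypothetical stand-in for STRSPLIT(string, regex, limit);
    // assumes Pig delegates to String.split with the same arguments.
    static String[] strsplit(String source, String regex, int limit) {
        return source.split(regex, limit);
    }

    public static void main(String[] args) {
        // A positive limit caps the number of pieces; the last piece keeps the rest.
        System.out.println(Arrays.toString(strsplit("a,b,c,d", ",", 2)));  // [a, b,c,d]
        // A limit larger than the number of pieces does not pad the result.
        System.out.println(Arrays.toString(strsplit("a,b,c,d", ",", 50))); // [a, b, c, d]
        // Limit 0 splits as often as possible and drops trailing empty strings.
        System.out.println(Arrays.toString(strsplit("a,b,,", ",", 0)));    // [a, b]
        // A negative limit splits as often as possible and keeps them.
        System.out.println(Arrays.toString(strsplit("a,b,,", ",", -1)));   // [a, b, , ]
    }
}
```

Under that assumption, the limit is an upper bound on the number of substrings, not an exact count.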

Problem:

  • Given the relation H1 below, we want to split its first field ($0) into a tuple with STRSPLIT, but the statement always fails with an error message:
DUMP H1;
(item32;item31;,1)

m = FOREACH H1 GENERATE STRSPLIT($0, ";", 50);
ERROR 1000: Error during parsing. Lexical error at line 1, column 40.  Encountered: <EOF> after : "\";"

Is there any solution?

Solution 1:

  • There is an escaping problem in Pig's parsing routines when they encounter this semicolon inside a quoted string.
  • We can use the Unicode escape sequence for a semicolon instead: \u003B.
  • The backslash must itself be escaped with a second backslash, giving \\u003B.
  • The delimiter must be a single-quoted string; the double-quoted variants (A2 to D2 below) all fail.
H1 = LOAD 'h1.txt' as (splitme:chararray, name);

A1 = FOREACH H1 GENERATE STRSPLIT(splitme,'\\u003B'); -- OK
B1 = FOREACH H1 GENERATE STRSPLIT(splitme,';');       -- ERROR
C1 = FOREACH H1 GENERATE STRSPLIT(splitme,':');       -- OK
D1 = FOREACH H1 {                                     -- OK
    splitup = STRSPLIT( splitme, ';' );
    GENERATE splitup;
}

A2 = FOREACH H1 GENERATE STRSPLIT(splitme,"\\u003B"); -- ERROR
B2 = FOREACH H1 GENERATE STRSPLIT(splitme,";");       -- ERROR
C2 = FOREACH H1 GENERATE STRSPLIT(splitme,":");       -- ERROR
D2 = FOREACH H1 {                                     -- ERROR
    splitup = STRSPLIT( splitme, ";" );
    GENERATE splitup;
}

Dump H1;
(item32;item31;,1)

Dump A1;
((item32,item31))

Dump C1;
((item32;item31;))

Dump D1;
((item32,item31))
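Why the escaped form gives the same result as a literal semicolon can be checked in plain Java. This is a minimal sketch that assumes Pig passes the pattern \u003B through to Java's regex engine unchanged. The class and method names are hypothetical.

```java
import java.util.Arrays;

class SemicolonSplitDemo {
    // The Java literal "\\u003B" is the six-character regex \u003B,
    // which the regex engine reads as the character ';'.
    static String[] splitOnSemicolon(String source, int limit) {
        return source.split("\\u003B", limit);
    }

    public static void main(String[] args) {
        String splitme = "item32;item31;";
        // Same pieces as splitting on a literal ';' (compare A1 and D1).
        System.out.println(Arrays.toString(splitOnSemicolon(splitme, 0))); // [item32, item31]
        System.out.println(Arrays.toString(splitme.split(";", 0)));        // [item32, item31]
        // With the positive limit 50 from the question, the trailing
        // empty string after the last ';' is kept as a third element.
        System.out.println(splitOnSemicolon(splitme, 50).length);          // 3
    }
}
```

If the assumption holds, the only difference between A1 and B1 is what Pig's parser sees, not what the regex engine matches.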

Solution 2:

  • STRSPLIT on a semicolon is tricky. The call that uses the semicolon delimiter should be placed inside a nested FOREACH block:
raw = LOAD 'cname.txt' as (name,cname_string:chararray);

xx = FOREACH raw {
  cname_split = STRSPLIT(cname_string,';');
  GENERATE cname_split;
}
  • This nested-block form is how we originally implemented the STRSPLIT() call.
