Splitting input into substrings in Pig (Hadoop)



What is a substring?

  • A substring of a string S is another string that occurs contiguously inside S. For example, "the best of" is a substring of "It was the best of times".
  • This is not to be confused with a subsequence, which generalizes the substring by allowing gaps between the matched characters. For example, "Itwastimes" is a subsequence of "It was the best of times", but not a substring.
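The difference between the two notions can be checked with a short Python sketch. The helper names `is_substring` and `is_subsequence` are illustrative and not part of Pig or Hadoop.

```python
def is_substring(needle: str, haystack: str) -> bool:
    """True if needle occurs as one contiguous run inside haystack."""
    return needle in haystack


def is_subsequence(needle: str, haystack: str) -> bool:
    """True if needle's characters appear in haystack in order, gaps allowed."""
    remaining = iter(haystack)
    # `ch in remaining` advances the iterator until it finds ch (or runs out),
    # so every character must be found after the previous match.
    return all(ch in remaining for ch in needle)


text = "It was the best of times"
print(is_substring("the best of", text))   # True
print(is_substring("Itwastimes", text))    # False
print(is_subsequence("Itwastimes", text))  # True
```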

What is Hadoop?

  • Hadoop is an open-source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment.
  • On a Hadoop system, Apache Pig lets you write very simple code that splits a file on the fly. You keep the flexibility to control the flow of data, apply any manipulations you need, and then split the file.

Now let us see how to split a file into individual files using a Pig script. Here is our sample file:

TEST.DEV.ENV.SAMPLE.FILE:

 HEADER1DIBBCCLY8-9568347556434756972CMMS21WUE
000001010K1DIBB  7RHUN  2100000AE            J82V  2269167AD         2002-03-079999-12-31+000000000100000    22004-02-28-20.00.13.106749
000002010K1DIBB  7RHUN  2100000AE            J82V  2269167AD         2002-03-072004-07-30+000000000100000    32004-02-28-20.00.13.106749
9TRAILR1DIBBCCLY8-95683475564347560000084
 HEADERVV9IHFYSKN-4654178251104433898CMMS21ANI
0000030108VV9IH  AKB7L  3300000AE            XMV9  2269167AD         2002-03-059999-12-31+000000000100000    22004-02-28-20.00.13.106749
0000040108VV9IH  AKB7L  3300000SE            XMV9  2269167AD         2002-03-052004-07-30+000000000100000    32004-02-28-20.00.13.106749
9TRAILRVV9IHFYSKN-46541782511044330000510
 HEADERLY674FBNR4-9375012333185998800CMMS21AIZ
000005010ULY674  5XWJR  0100000SE            XMV9  2269167AD         2002-03-059999-12-31+000000000100000    22004-02-28-20.00.13.106749
000006010ULY674  5XWJR  0100000AE            XMV9  2269167AD         2002-03-052004-07-30+000000000100000    32004-02-28-20.00.13.106749
9TRAILRLY674FBNR4-93750123331859980000150
 HEADERT0X36Q6YVQ-5632769394873798290CMMS21WLO
000007010QT0X36  RJPWK  5500000AE            J82V  2269167AD         2002-10-229999-12-31+000000000100000    22004-02-28-20.00.13.106749
000008010QT0X36  RJPWK  5500000AE            J82V  2269167AD         2002-10-222004-07-30+000000000100000    32004-02-28-20.00.13.106749
9TRAILRT0X36Q6YVQ-56327693948737980000642
 HEADER8LIKAC67U9-2737265552238819829CMMS21HMV
000009010S8LIKA  07L1X  4400000BE            J82V  2269167AD         2002-03-079999-12-31+000000000100000    22004-02-28-20.00.13.106749
000010010S8LIKA  07L1X  4400000BE            J82V  2269167AD         2002-03-072004-07-30+000000000100000    32004-02-28-20.00.13.106749
9TRAILR8LIKAC67U9-27372655522388190000412
 HEADERAS7G3QPIUC-8825934656338659366CMMS21BQA
000011010QAS7G3  CO46Q  8500000BE            RI12  2269167AD         2002-03-059999-12-31+000000000100000    22004-02-28-20.00.13.106749
000012010QAS7G3  CO46Q  8500000BE            RI12  2269167AD         2002-03-052004-07-30+000000000100000    32004-02-28-20.00.13.106749
9TRAILRAS7G3QPIUC-88259346563386590000865
  • The sample file contains several headers and trailers with detail data between them. Our requirement is to split each set of HEADER, DETAIL DATA and TRAILER rows into its own file. For our sample this should generate 6 different files.
  • We split the file using key values found in each row: positions 11-15 (5 characters) in DETAIL DATA rows and positions 8-12 in HEADER and TRAILER rows, counting from 1. Each output must keep the data consistent as a header row, its detail rows and a trailer row.
  • This preserves the complete structure of the file and keeps all related data together.
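To make the key positions concrete, here is a minimal Python sketch of the extraction rule described above. The function name `record_key` is hypothetical, and the detail row in `sample` is truncated from the sample file for brevity. Python slices are 0-based with an exclusive end, so 1-based positions 8-12 become `line[7:12]` and positions 11-15 become `line[10:15]`.

```python
def record_key(line: str) -> str:
    """Return the 5-character key of a header, detail, or trailer row."""
    # Header rows start with one space, so 'HEADER' sits at 0-based 1..6;
    # trailer rows start with '9TRAILR' at 0-based 0..6.
    if line[1:7] == "HEADER" or line[0:7] == "9TRAILR":
        return line[7:12]   # 1-based positions 8-12
    return line[10:15]      # 1-based positions 11-15 (detail rows)


sample = [
    " HEADER1DIBBCCLY8-9568347556434756972CMMS21WUE",
    "000001010K1DIBB  7RHUN  2100000AE",  # detail row, truncated
    "9TRAILR1DIBBCCLY8-95683475564347560000084",
]
for row in sample:
    print(record_key(row))  # prints 1DIBB for each of the three rows
```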

Now we write the Pig script to split the file:

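-- MultiStorage lives in piggybank, so register the jar first
REGISTER /home/jars/pig/piggybank.jar;

-- The sample rows contain no tabs, so PigStorage('\t') loads each row as one field
A  = LOAD  '/path/to/input/file/TEST.DEV.ENV.SAMPLE.FILE'
     USING PigStorage('\t') AS (line:chararray);

-- B: detail rows (neither header nor trailer)
B  = FILTER A BY SUBSTRING(line, 1, 7) != 'HEADER' AND SUBSTRING(line, 0, 7) != '9TRAILR';

-- C: header rows (one leading space, then 'HEADER') and trailer rows
C  = FILTER A BY SUBSTRING(line, 1, 7) == 'HEADER' OR SUBSTRING(line, 0, 7) == '9TRAILR';

-- Extract the 5-character key from Header, Details and Trailer rows.
-- SUBSTRING indexes are 0-based and the stop index is exclusive.
D  = GROUP B BY SUBSTRING($0, 10, 15);
E  = GROUP C BY SUBSTRING($0, 7, 12);

F  = UNION D, E;

-- Turn each (key, bag of rows) into one (key, row) tuple per row
G  = FOREACH F GENERATE FLATTEN($0), FLATTEN($1);

-- Keep rows with a non-empty key in H; X collects the rest and is not stored
SPLIT G INTO H IF SIZE($0) > 0, X IF SIZE($0) <= 0;

-- Sort by row text: ' HEADER' < detail rows starting with '0' < '9TRAILR'
J  = ORDER H BY $1;

-- MultiStorage writes one output per distinct value of field 0 (the key)
STORE J INTO '/path/to/output/directory'
        USING org.apache.pig.piggybank.storage.MultiStorage('/path/to/output/directory', '0');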
  • The file is split on the value we took as the key. Here the key from all rows (Header, Details and Trailer) is 5 characters long and is extracted with the SUBSTRING() function at transformations 'D' and 'E'.
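The grouping and ordering performed by the script can be mirrored locally in Python as a rough sanity check. This is only a sketch of the logic, not a substitute for running Pig: the function name `split_by_key` is hypothetical, the detail rows are truncated from the sample file, and nothing is written to disk.

```python
from collections import defaultdict


def split_by_key(lines):
    """Group rows by their 5-character key and order each group's rows."""
    groups = defaultdict(list)
    for line in lines:
        is_marker = line[1:7] == "HEADER" or line[0:7] == "9TRAILR"
        key = line[7:12] if is_marker else line[10:15]
        groups[key].append(line)
    # Plain string sorting puts ' HEADER' first, then detail rows starting
    # with '0', then '9TRAILR', which is the effect of ORDER ... BY $1.
    return {key: sorted(rows) for key, rows in groups.items()}


rows = [
    "9TRAILRVV9IHFYSKN-46541782511044330000510",
    "000001010K1DIBB  7RHUN  2100000AE",  # detail rows are truncated
    " HEADERVV9IHFYSKN-4654178251104433898CMMS21ANI",
    "0000030108VV9IH  AKB7L  3300000AE",
    " HEADER1DIBBCCLY8-9568347556434756972CMMS21WUE",
    "9TRAILR1DIBBCCLY8-95683475564347560000084",
]
result = split_by_key(rows)
for key, group in result.items():
    print(key, len(group))
```

Each key ends up with its header first, its detail rows next and its trailer last, even though the input rows were shuffled.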

Now let us look at the files in the output directory:

[Figure: Pig split string example]

[Figure: Pig substring]

[Figure: Split in Pig]

