[Solved-3 Solutions] How to incorporate the current input filename into my Pig Latin script?



What is PigStorage() ?

  • The PigStorage() function loads and stores data as structured text files. It takes one parameter, the delimiter that separates the fields of each tuple. If no delimiter is given, it defaults to the tab character (‘\t’).

Syntax

  • Given below is the syntax of the PigStorage() function.
grunt> PigStorage(field_delimiter)

Problem:

How to incorporate the current input filename into my Pig Latin script?

Solution 1:

We can use PigStorage with the -tagsource option, as follows:

A = LOAD 'input' using PigStorage(',','-tagsource'); 
B = foreach A generate INPUT_FILE_NAME; 

The first field in each tuple will contain the input file name (INPUT_FILE_NAME).

Solution 2:

  • The Pig wiki has an example, PigStorageWithInputPath, which adds the filename in an additional chararray field:

Example:

A = load '/directory/of/files/*' using PigStorageWithInputPath()
    as (field1:chararray, field2:int, field3:chararray);

UDF

// Note that there are several versions of Path and FileSplit. These are intended:
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

public class PigStorageWithInputPath extends PigStorage {
   // Path of the file backing the split currently being read
   Path path = null;

   @Override
   public void prepareToRead(RecordReader reader, PigSplit split) {
       super.prepareToRead(reader, split);
       // Remember the input path of this split for use in getNext()
       path = ((FileSplit)split.getWrappedSplit()).getPath();
   }

   @Override
   public Tuple getNext() throws IOException {
       Tuple myTuple = super.getNext();
       // Append the path as the last field; null means end of input
       if (myTuple != null)
          myTuple.append(path.toString());
       return myTuple;
   }
}
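
To see the effect of the UDF on one record without a cluster, here is a minimal plain-Java sketch. It does not use Pig or Hadoop: the class AppendPathSketch, the method withInputPath and the sample path are hypothetical names made up for illustration, and a list of strings stands in for a Pig Tuple. It only shows that the path ends up as one extra field after the original fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustration only: a List<String> stands in for a Pig Tuple.
class AppendPathSketch {

    // Hypothetical stand-in for getNext(): split the line into fields,
    // then append the input path as the last field.
    static List<String> withInputPath(String line, String delimiter, String path) {
        String[] parts = line.split(Pattern.quote(delimiter), -1);
        List<String> fields = new ArrayList<>(Arrays.asList(parts));
        fields.add(path);
        return fields;
    }

    public static void main(String[] args) {
        // Sample record and path are invented for this sketch.
        List<String> tuple = withInputPath("abc,42,xyz", ",", "hdfs://host/dir/part-0");
        System.out.println(tuple);
    }
}
```

For a file with three columns, the path is therefore the fourth field of the tuple.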

Solution 3:

  • -tagsource is deprecated in Pig 0.12.0. Instead use:
    • -tagFile - Prepends the input source file name to each tuple.
    • -tagPath - Prepends the input source file path to each tuple.
A = LOAD '/user/myFile.TXT' using PigStorage(',','-tagPath'); 
DUMP A;

It will give the full file path as the first column:

( hdfs://myserver/user/blo/input/2015.TXT,439,43,05,4,NAVI,PO,P &C,P &CR,UC,40)
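
The difference between the two options can be sketched in plain Java, using the path from the sample output above. This does not call Pig or Hadoop: the class TagOptionSketch and its two methods are hypothetical helpers, and java.net.URI is used only as a stand-in to pick the last path component.

```java
import java.net.URI;

// Illustration only: mimics what each option contributes to the tuple.
class TagOptionSketch {

    // Roughly what -tagPath contributes: the full path string, unchanged.
    static String tagPath(String fullPath) {
        return fullPath;
    }

    // Roughly what -tagFile contributes: the last component of the path.
    static String tagFile(String fullPath) {
        String path = URI.create(fullPath).getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        String source = "hdfs://myserver/user/blo/input/2015.TXT";
        System.out.println(tagPath(source)); // hdfs://myserver/user/blo/input/2015.TXT
        System.out.println(tagFile(source)); // 2015.TXT
    }
}
```

Use -tagFile when the bare file name is enough to tell records apart, and -tagPath when files with the same name may live in different directories.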
