[Solved-1 Solution] Hadoop Pig - Removing csv header ?

Problem :

The CSV files have a header in the first line. Loading them into Pig makes a mess of any subsequent function (like SUM). For now, I apply a filter on the loaded data to remove the rows containing the headers:

affaires    = load 'affaires.csv'   using PigStorage(',') as (NU_AFFA:chararray,    date:chararray) ;
affaires    = filter affaires by date matches '../../..';

Is there a way to tell Pig not to load the first line of the CSV, such as an "as_header" boolean parameter to the load function? What would be the best practice?

Solution 1:

  • The CSVExcelStorage loader supports skipping the header row, so use CSVExcelStorage instead of PigStorage. It is part of piggybank, so download piggybank.jar and register it in the script.

Example Pig script (with the SKIP_INPUT_HEADER option):

REGISTER '/tmp/piggybank.jar';
A  = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
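
Applied to the data from the question, the same loader might be used roughly as follows. This is a minimal sketch, not a tested script: it assumes piggybank.jar sits in /tmp as in the script above, reuses the file name and schema from the question, and the alias first_rows is only illustrative.

```pig
-- Assumption: piggybank.jar was downloaded to /tmp, as in the script above.
REGISTER '/tmp/piggybank.jar';

-- SKIP_INPUT_HEADER asks the loader to drop the header line,
-- so the "filter ... by date matches" workaround is no longer needed.
affaires = LOAD 'affaires.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (NU_AFFA:chararray, date:chararray);

-- Print a few rows to confirm that the header row is gone.
first_rows = LIMIT affaires 5;
DUMP first_rows;
```

The arguments to CSVExcelStorage are, in order, the field delimiter, the multiline handling, the end-of-line style, and the header handling. Only the last one changes how the header is treated.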



