[Solved-2 Solutions] Skipping the header while loading a text file using Pig Latin



What is filter

  • The FILTER operator is used to select the required tuples from a relation based on a condition.

Syntax

  • Here is the syntax of the FILTER operator.
grunt> Relation2_name = FILTER Relation1_name BY (condition);
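
As a minimal sketch of that syntax, the script below keeps only the tuples that satisfy a condition. The file name students.txt and the fields name and age are hypothetical and serve only as an illustration.

```pig
-- Hypothetical input: comma-separated lines such as "Asha,21"
students = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, age:int);

-- Keep only the tuples for which the condition is true
adults = FILTER students BY (age >= 18);

DUMP adults;
```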

Problem :

  • Suppose a text file's first row contains a header. When the file is loaded with PigStorage, the header row is loaded as data along with the real records, which gets in the way of later operations. Is it possible to skip the header?
  • Here is the command used to load the data:
input_file = load '/home/hadoop/smdb_tracedata.csv'
USING PigStorage(',')
as (trans:chararray, carrier:chararray,aainday:chararray);

Solution 1:

Using Filter:

  • The usual way to solve this problem is to FILTER on a value that we know appears only in the header.

For example, consider the following data:

STATE,NAME
MD,Bob
VA,Larry

Assuming the file is loaded into a relation A whose first field is named state, the header row can be removed as follows:

B = FILTER A BY state != 'STATE';
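
Put together, Solution 1 might look like the sketch below. The path /tmp/states.csv is a hypothetical location for the sample data above, and the field names state and name are assumptions taken from its header row.

```pig
-- Hypothetical path; the file is assumed to contain the STATE,NAME sample above
A = LOAD '/tmp/states.csv' USING PigStorage(',') AS (state:chararray, name:chararray);

-- Drop the row whose first field is the literal header text
B = FILTER A BY state != 'STATE';

DUMP B;
```

For the sample data this should leave only the MD,Bob and VA,Larry records. The approach relies on the header text never occurring as a real value in that column; a genuine record whose state field is 'STATE' would be filtered out as well.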

Solution 2:

Here is another way to achieve this:

  • Load the complete file, including the header record, into a relation

fileAllRecords = LOAD 'csvfilename' using PigStorage(',');
  • Use the Linux tail command to stream only the data records
fileDataRecords = STREAM fileAllRecords THROUGH `tail -n +2` AS (f1:chararray ..);
  • To verify that the header record has been removed, use the following commands:
firstFewRecords = STREAM fileDataRecords THROUGH `head -20`;
DUMP firstFewRecords;
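
The steps of Solution 2 can be sketched end to end as follows. The path /tmp/states.csv and the field names state and name are hypothetical and assume the two-column sample data from Solution 1, stored in a single small file.

```pig
-- Hypothetical path; load every line, header included, without a schema
fileAllRecords = LOAD '/tmp/states.csv' USING PigStorage(',');

-- tail -n +2 prints its input starting at line 2, so the header line is dropped.
-- Pig passes fields to and from the command tab-delimited by default.
fileDataRecords = STREAM fileAllRecords THROUGH `tail -n +2`
                  AS (state:chararray, name:chararray);

DUMP fileDataRecords;
```

The streamed command runs once per task rather than once per file. If the input is divided into several splits, each task would drop the first line of its own split, so this sketch is only safe when the whole file is read by a single task.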
