[Solved-2 Solutions] Export from pig to CSV



What is CSV?

  • CSV is a simple file format used to store tabular data, such as a spreadsheet or database. Files in the CSV format can be imported into and exported from programs that store data in tables, such as Microsoft Excel or OpenOffice Calc.
  • CSV stands for "comma-separated values". Its data fields are most often separated, or delimited, by a comma.
  • Getting data out of Pig and into a CSV file that can be used in Excel or SQL (or R, SPSS, etc.) usually takes a lot of manual manipulation.

Export from Pig to CSV

Problem:

  • We tried the following STORE statement:
STORE pig_object INTO '/Users/Name/Folder/pig_object.csv'
    USING CSVExcelStorage(',','NO_MULTILINE','WINDOWS');
  • This creates a folder with that name containing many part-m-0000# files. We can join them afterwards with cat part* > filename.csv, but the result has no header, so we have to add one manually.
  • PigStorageSchema is supposed to write a separate header file, but it does not seem to work. We get the same result as a plain STORE, with no header file:
STORE pig_object INTO '/Users/Name/Folder/pig_object'
    USING org.apache.pig.piggybank.storage.PigStorageSchema();
  • Is there any way of getting the data out of Pig into a single CSV file without these multiple steps?
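
The manual workaround described in the problem (joining the part files and adding the header by hand) can be sketched as a small shell function. This is only an illustration: the function name merge_with_header, the column names and the folder layout are assumptions, not part of the original question.

```shell
# Sketch: prepend a hand-written header to the concatenated part files.
# merge_with_header is a hypothetical helper; adjust names to your data.
merge_with_header() {
    header="$1"   # header row, e.g. "firstname,lastname,age,location"
    dir="$2"      # local folder that contains the part-* files
    out="$3"      # CSV file to create (must not itself match part-*)
    {
        printf '%s\n' "$header"   # write the header line first
        cat "$dir"/part-*         # then every part file, in name order
    } > "$out"
}
```

For example, merge_with_header "firstname,lastname,age,location" /Users/Name/Folder/pig_object.csv ./pig_object_merged.csv would write one file that starts with the header row. The header still has to be typed manually, which is exactly the step the solutions below try to avoid.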

Solution 1:

  • There is no one-liner that does the job, but the following approach helps (Pig v0.10.0). The '-schema' option makes PigStorage write .pig_header and .pig_schema files alongside the data:
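-- Load the comma-separated input and declare the schema.
A = load '/user/hadoop/csvinput/somedata.txt' using PigStorage(',')
      as (firstname:chararray, lastname:chararray, age:int, location:chararray);
-- '\t' writes tab-separated output; pass ',' instead for comma-separated values.
store A into '/user/hadoop/csvoutput' using PigStorage('\t','-schema');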

1. Copying the result to the local disk:

hadoop fs -rm /user/hadoop/csvoutput/.pig_schema
hadoop fs -getmerge /user/hadoop/csvoutput ./output.csv

-getmerge takes an input directory and merges the files in it, so .pig_schema has to be removed first; otherwise its contents would end up in output.csv.

2. Storing the result on HDFS:

hadoop fs -cat /user/hadoop/csvoutput/.pig_header \
    /user/hadoop/csvoutput/part-x-xxxxx |
    hadoop fs -put - /user/hadoop/csvoutput/result/output.csv
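
The command in step 2 names a single part file. When the job writes several part files, the same idea can be sketched with a glob that HDFS expands. This is a minimal sketch, assuming the hadoop CLI is on the PATH and that the destination's parent directory is writable (it may need to exist beforehand). The function name hdfs_merge_with_header is hypothetical.

```shell
# Sketch: concatenate the header and all part files on HDFS into one file.
# Assumes the directory was written with PigStorage(..., '-schema').
# hdfs_merge_with_header is a hypothetical helper name.
hdfs_merge_with_header() {
    dir="$1"    # HDFS output directory of the STORE statement
    dest="$2"   # HDFS path of the merged file to create
    # The glob is quoted so that HDFS, not the local shell, expands it.
    # "-put -" reads the data to upload from standard input.
    hadoop fs -cat "$dir/.pig_header" "$dir/part-*" | hadoop fs -put - "$dest"
}
```

Usage would look like hdfs_merge_with_header /user/hadoop/csvoutput /user/hadoop/csvoutput/result/output.csv.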

Solution 2:

To store the data and merge it from within Pig, use getmerge with the -nl option, which adds a newline at the end of each merged file:

STORE pig_object INTO '/user/hadoop/csvoutput/pig_object'
    using PigStorage('\t','-schema');
fs -getmerge -nl /user/hadoop/csvoutput/pig_object  /Users/Name/Folder/pig_object.csv;

The result is a single TSV/CSV file with the following structure:

1 - header
2 - empty line
3 - pig schema
4 - empty line
5 - 1st line of DATA
6 - 2nd line of DATA

The extra lines 2, 3 and 4 can then be removed with AWK:

awk 'NR==1 || NR>4 {print}' /Users/Name/Folder/pig_object.csv > /Users/Name/Folder/pig_object_clean.csv
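
Because the STORE statements above use '\t' as the delimiter, the cleaned file is tab-separated rather than comma-separated. A minimal sketch for converting it follows, assuming that no field contains a comma, a double quote or a line break (this sketch does no CSV quoting). The function name tsv_to_csv is hypothetical.

```shell
# Sketch: convert a tab-separated file to comma-separated output on stdout.
# Assumes fields never contain commas, quotes or newlines (no quoting is done).
tsv_to_csv() {
    # FS/OFS set the input and output delimiters; "$1 = $1" forces awk
    # to rebuild each record using the new output delimiter.
    awk 'BEGIN { FS = "\t"; OFS = "," } { $1 = $1; print }' "$1"
}
```

For example, tsv_to_csv /Users/Name/Folder/pig_object_clean.csv > /Users/Name/Folder/pig_object_comma.csv writes a comma-delimited copy. Alternatively, passing ',' as the delimiter to PigStorage in the STORE statement avoids this step.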
