[Solved - 2 Solutions] STORE output to a single CSV file in Pig?



What is CSV?

CSV (comma-separated values) is a plain-text tabular format. Pig supports CSV loading and storing with support for multi-line fields, and escaping of delimiters and double quotes within fields.
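As a sketch of this CSV support, the piggybank library's CSVExcelStorage loader/storer handles quoting, embedded delimiters, and multi-line fields. The jar path, file names, and schema below are assumptions; adjust them to your installation:

```pig
-- Register piggybank (path is an assumption; adjust to your environment).
REGISTER /usr/lib/pig/piggybank.jar;

-- Load a CSV whose fields may contain commas, double quotes, or newlines.
records = LOAD 'input/records.csv'
          USING org.apache.pig.piggybank.storage.CSVExcelStorage(
              ',', 'YES_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
          AS (name:chararray, amount:int);

-- Store back out with the same CSV quoting and escaping rules.
STORE records INTO 'output/records'
      USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE');
```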

Problem:

Is there any way to store output to a single CSV file in Pig?

Solution 1:

We can do this in a few ways:

  • To set the number of reducers for all Pig operations, we can use the default_parallel property - but this means every single reduce phase in the script will use a single reducer, decreasing throughput:
set default_parallel 1;
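In context, a complete script using this setting might look like the following sketch (file names and schema are assumptions):

```pig
-- Force every reduce phase in this script to run with one reducer.
set default_parallel 1;

sales = LOAD 'input/sales.csv' USING PigStorage(',')
        AS (region:chararray, amount:int);

by_region = GROUP sales BY region;
totals = FOREACH by_region GENERATE group AS region, SUM(sales.amount) AS total;

-- With a single reducer, the output directory contains a single part file.
STORE totals INTO 'output/totals' USING PigStorage(',');
```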
  • Prior to calling STORE, if one of the operations executed is COGROUP, CROSS, DISTINCT, GROUP, JOIN (inner), JOIN (outer), or ORDER BY, then we can use the PARALLEL 1 keyword to force that command to run with a single reducer:
GROUP a BY grp PARALLEL 1;
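With this approach, only the final reduce-side operation needs PARALLEL 1; earlier steps keep their default parallelism. A sketch (relation and file names are assumptions):

```pig
sales = LOAD 'input/sales.csv' USING PigStorage(',')
        AS (region:chararray, amount:int);

-- Only this GROUP is forced onto a single reducer;
-- any earlier MapReduce jobs keep their own parallelism.
by_region = GROUP sales BY region PARALLEL 1;
totals = FOREACH by_region GENERATE group AS region, SUM(sales.amount) AS total;

STORE totals INTO 'output/totals' USING PigStorage(',');
```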

Solution 2:

  • We can also use Hadoop's getmerge command to merge all the part-* files into one local file. This is only possible if we run our Pig scripts from the Pig (grunt) shell.
  • This has an advantage over the previous solution: we can still use several reducers to process the data, so the job may run faster, especially if each reducer outputs little data.
grunt> fs -getmerge  <Pig output file> <local file>
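Outside the grunt shell, the equivalent is `hadoop fs -getmerge <HDFS output dir> <local file>`. As a rough illustration of what getmerge does, here is a local-only shell sketch (the paths and part-file contents are made up) that concatenates part files the way getmerge would:

```shell
# Simulate a Pig output directory containing several reducer part files.
mkdir -p /tmp/pig_out
printf 'east,10\n' > /tmp/pig_out/part-r-00000
printf 'west,25\n' > /tmp/pig_out/part-r-00001

# getmerge concatenates the part files, in order, into a single local file.
cat /tmp/pig_out/part-* > /tmp/totals.csv
cat /tmp/totals.csv
```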
