[Solved-3 Solutions] How to store gzipped files using PigStorage in Apache Pig?



Problem:

  • Apache Pig v0.7 can read gzipped files with no extra effort on my part, e.g.:
MyData = LOAD '/tmp/data.csv.gz' USING PigStorage(',') AS (timestamp, user, url);

The data is then processed and written to disk:

PerUser = GROUP MyData BY user;
UserCount = FOREACH PerUser GENERATE group AS user, COUNT(MyData) AS count;
STORE UserCount INTO '/tmp/usercount' USING PigStorage(',');

But the output file isn't compressed:

/tmp/usercount/part-r-00000

Is there a way to make the STORE command write its output in gzip format?

Solution 1:

There are two ways:

Why PigStorage()

  • The PigStorage() function loads and stores data as structured text files. It takes as its parameter the delimiter that separates the fields of a tuple. The default delimiter is the tab character ('\t').

1. Use the same STORE statement as in the question, but give the output directory a .gz extension, e.g. usercount.gz:

STORE UserCount INTO '/tmp/usercount.gz' USING PigStorage(',');

Use compression

  • Compression can be used to reduce the amount of data to be stored to disk and written over the network. By default, compression is turned off, both between map and reduce tasks and between MapReduce jobs.

2. Set the compression method in the script:

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

Solution 2:

Specifying the compression format using the 'STORE' statement

STORE UserCount INTO '/tmp/usercount.gz' USING PigStorage(',');
STORE UserCount INTO '/tmp/usercount.bz2' USING PigStorage(',');
STORE UserCount INTO '/tmp/usercount.lzo' USING PigStorage(',');

In the statements above, the extension of the output path selects the compression format. Pig supports three compression formats: GZip, BZip2 and LZO. LZO is not available out of the box and has to be installed separately.
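
Since Pig reads gzipped input with no extra configuration (as the LOAD in the question shows), one way to check the result is to load the compressed output back in. This is a minimal sketch, assuming the STORE into '/tmp/usercount.gz' above has already run. The alias Check and the field name total are made up for illustration.

```pig
-- Read back the directory written by the STORE statement above.
-- Assumption: the part files inside it are gzip-compressed and carry a .gz
-- extension, so Pig decompresses them on read, as it does for data.csv.gz.
Check = LOAD '/tmp/usercount.gz' USING PigStorage(',') AS (user:chararray, total:long);

-- Print the tuples to the console; readable rows suggest the round trip worked.
DUMP Check;
```

If the rows come back as readable (user, total) pairs, the compressed files are being written and read as expected.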

Solution 3:

Specifying compression via job properties

  • Set the properties output.compression.enabled and output.compression.codec in the Pig script. The codec lines below are alternatives (LZO, GZip, BZip2); keep only the one you need:
set output.compression.enabled true;
set output.compression.codec com.hadoop.compression.lzo.LzopCodec;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
set output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;
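
Put together with the script from the question, the property-based approach might look like the sketch below. It reuses the paths and aliases from the question and picks the gzip codec as one example. The exact names of the resulting part files are not shown here because they depend on the Pig and Hadoop versions in use.

```pig
-- Turn on compressed output and choose a single codec (gzip in this sketch).
set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

-- Same pipeline as in the question: load, group by user, count per user.
MyData = LOAD '/tmp/data.csv.gz' USING PigStorage(',') AS (timestamp, user, url);
PerUser = GROUP MyData BY user;
UserCount = FOREACH PerUser GENERATE group AS user, COUNT(MyData) AS count;

-- The output path needs no special extension here; the properties above
-- are what request compression.
STORE UserCount INTO '/tmp/usercount' USING PigStorage(',');
```

With this variant the compression setting lives in one place at the top of the script, so the codec can be changed without renaming the output path.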
