[Solved-3 Solutions] Merging compressed files on HDFS?

HDFS

Apache Pig runs on top of Hadoop. It is an analytical tool for large datasets stored in the Hadoop Distributed File System (HDFS). To analyze data with Apache Pig, we first have to load the data into Pig.

How do you merge all files in a directory on HDFS, that you know are all compressed, into a single compressed file, without copying the data through the local machine?

Suppose you have a folder /data/input that contains the files part-m-00000.gz and part-m-00001.gz. How do you merge them into a single file?

We suggest looking at FileCrush, a tool that merges files on HDFS using MapReduce. It does exactly what you described and provides several options to deal with compression and to control the number of output files.


  Crush --max-file-blocks XXX /data/input /data/output

With the default value of 8 for --max-file-blocks, 80 small files, each being 1/10th of a dfs block, will be grouped into a single output file, since 80 * 1/10 = 8 dfs blocks. If there are 81 small files, each being 1/10th of a dfs block, two output files will be created. One output file will contain the combined contents of 41 files and the second will contain the combined contents of the other 40. A directory of many small files is thus converted into a smaller number of larger files, where each output file is roughly the same size.
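The arithmetic above can be sketched in a few lines of Python. This is only a simplified model of the grouping, not FileCrush's actual code: the function name plan_outputs and the even split of files across outputs are assumptions made for illustration, chosen so that the numbers match the two cases described.

```python
from fractions import Fraction
from math import ceil


def plan_outputs(num_files, file_size_blocks, max_file_blocks=8):
    """Return how many input files would land in each output file.

    Hypothetical helper: models only the arithmetic described above,
    with all input files assumed to have the same size.
    """
    total_blocks = num_files * file_size_blocks
    # One output file per max_file_blocks of input, rounded up.
    num_outputs = max(1, ceil(total_blocks / max_file_blocks))
    # Spread the input files as evenly as possible over the outputs.
    base, extra = divmod(num_files, num_outputs)
    return [base + 1] * extra + [base] * (num_outputs - extra)


tenth = Fraction(1, 10)  # each small file is 1/10th of a dfs block
print(plan_outputs(80, tenth))  # [80]
print(plan_outputs(81, tenth))  # [41, 40]
```

Fractions are used instead of floats so that 80 * 1/10 comes out as exactly 8 blocks.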

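In your Pig script, add set default_parallel 20; but note that this affects every operation in the script.
Alternatively, change the parallelism for a single operation, for example DISTINCT ID PARALLEL 1;. The number of reducers determines the number of output files, so a final operation that runs with PARALLEL 1 writes a single output file.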

There is also an option to merge to the local filesystem using the "hdfs dfs -getmerge" command. We can use that to merge the files into one file on the local filesystem and then use the "hdfs dfs -copyFromLocal" command to copy it back into HDFS. Note that this approach does route the data through the local machine.
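One reason a byte-level merge works for the .gz files in the question is that a gzip stream may consist of several members back to back, so concatenated .gz files still decompress as one file. A minimal Python sketch of that property, assuming gzip compression: the helper merge_gzip_parts and the sample contents are made up for illustration and stand in for the byte concatenation that getmerge performs on the real part files.

```python
import gzip


def merge_gzip_parts(parts):
    """Concatenate already-compressed gzip blobs byte for byte.

    Hypothetical helper: nothing is decompressed or recompressed here.
    """
    return b"".join(parts)


# Stand-ins for part-m-00000.gz and part-m-00001.gz (contents are invented).
part0 = gzip.compress(b"alpha\nbeta\n")
part1 = gzip.compress(b"gamma\n")

merged = merge_gzip_parts([part0, part1])

# The merged bytes are a valid multi-member gzip stream.
print(gzip.decompress(merged))  # b'alpha\nbeta\ngamma\n'
```

The same does not necessarily hold for other compressed container formats, so it is worth checking the codec before relying on plain concatenation.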
