[Solved - 3 Solutions] Merging compressed files on HDFS?
- How do you merge all the files in an HDFS directory, all of which you know are compressed, into a single compressed file, without copying the data through the local machine?
- Suppose the folder /data/input contains the files part-m-00000.gz and part-m-00001.gz.
How do you merge them into a single file?
- We would suggest looking at FileCrush, a tool that merges files on HDFS using MapReduce. It does exactly what you describe, and provides several options for handling compression and controlling the number of output files.
Here is an Example:
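A sketch of a FileCrush invocation. The jar name, main class, and flags below are recalled from the project's README and may differ in your build, so treat them as assumptions and check the tool's own help output:

```shell
# Hypothetical FileCrush run: crush everything under /data/input into
# fewer gzip-compressed files under /data/output. The jar name, class
# name, and flags depend on your FileCrush version; the trailing
# argument is a timestamp FileCrush uses to mark crushed files.
hadoop jar filecrush-2.2.2-SNAPSHOT.jar com.m6d.filecrush.crush.Crush \
  --input-format=text \
  --output-format=text \
  --compress=org.apache.hadoop.io.compress.GzipCodec \
  --max-file-blocks 8 \
  /data/input /data/output 20161001120000
```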
- max-file-blocks represents the maximum number of DFS blocks per output file.
- For example,
- With the default value of 8, 80 small files, each being 1/10th of a DFS block, will be grouped into a single output file, since 80 * 1/10 = 8 DFS blocks. If there are 81 such small files, two output files will be created: one containing the combined contents of 41 files, the other the remaining 40. In this way a directory of many small files is converted into a smaller number of larger files, each output file being roughly the same size.
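The grouping arithmetic above can be checked with a little shell. One output file absorbs at most max-file-blocks worth of input, so with files 1/10th of a block each, that is 8 * 10 = 80 inputs per output (numbers as in the example above):

```shell
max_blocks=8          # the max-file-blocks default
tenths_per_block=10   # each small file is 1/10th of a DFS block
files=81              # number of small input files

# How many small files fit in one output file: 8 blocks * 10 files/block = 80.
inputs_per_output=$(( max_blocks * tenths_per_block ))

# Number of output files is the ceiling of files / inputs_per_output.
outputs=$(( (files + inputs_per_output - 1) / inputs_per_output ))
echo "$outputs output file(s)"
```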
- In Apache Pig, if we set PARALLEL to 1, we get a single output file.
- This can be done in two ways:
- Add set default_parallel 1; to your Pig script, but note that this affects every job in the script.
- Change PARALLEL for a single operation instead, e.g. DISTINCT id PARALLEL 1;
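The two options above can be sketched as a small Pig script. The paths and aliases are illustrative, and the compression properties are the classic Hadoop names, which may vary with your Hadoop version:

```pig
-- Keep the output compressed (property names vary by Hadoop version).
SET mapred.output.compress true;
SET mapred.output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

-- Option 1: script-wide default; affects every job in this script.
SET default_parallel 1;

data = LOAD '/data/input' USING PigStorage();

-- Option 2: force a reduce phase and set PARALLEL on that one operator.
grouped = GROUP data ALL PARALLEL 1;
merged = FOREACH grouped GENERATE FLATTEN(data);

STORE merged INTO '/data/output' USING PigStorage();
```

Note that a plain LOAD/STORE runs map-only and writes one file per mapper; the reduce-side operator (here GROUP ... ALL) is what funnels everything through a single reducer and hence into a single output file.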
- There is also the option of merging via the local filesystem with the "hdfs dfs -getmerge" command: merge the directory to a local file, then copy it back into HDFS with "hdfs dfs -copyFromLocal". Note, however, that this does route the data through the local machine, which the question wanted to avoid.
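A sketch of that round trip (paths are illustrative). It works for .gz files because the gzip format permits concatenated members, so the merged file remains a valid gzip stream:

```shell
# Merge all part files into one local file, then push it back to HDFS.
# Note: this copies the data through the local machine.
hdfs dfs -getmerge /data/input /tmp/merged.gz
hdfs dfs -copyFromLocal /tmp/merged.gz /data/merged.gz
rm /tmp/merged.gz
```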