[Solved - 2 Solutions] Pig: Hadoop jobs fail?

What is Hadoop?

  • Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment.


We have a Pig script that queries data from a CSV file.

The script has been tested locally with small and large .csv files.

On a small cluster, the script starts processing and then fails after 40% of the job has completed.

The error is:

Failed to read data from "path to file"

Solution 1:

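  • A general answer to this problem is to change the logging levels in the configuration files. These two lines use log4j properties syntax, so they go in Hadoop's log4j.properties file rather than in mapred-site.xml, which is XML. Here A is the name of an appender already defined in that file:
log4j.logger.org.apache.hadoop=error,A
log4j.logger.org.apache.pig=error,A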

In this case, the failure was a kind of OutOfMemory exception.

Solution 2:

  • Check the logs, and increase the logging verbosity if needed.

To change the memory available to Hadoop, edit the hadoop-env.sh file. The relevant setting sits under this comment:

# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
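In stock Hadoop 1.x/2.x copies of hadoop-env.sh, the line under that comment usually sets HADOOP_CLIENT_OPTS, which carries the JVM heap flag for client-side commands. A minimal sketch, assuming that variable is the one in play on your distribution and using 1024m purely as an illustrative value:

```shell
# Sketch of a hadoop-env.sh entry; 1024m is an illustrative value only.
# ${HADOOP_CLIENT_OPTS:-} keeps any options that were already set.
export HADOOP_CLIENT_OPTS="-Xmx1024m ${HADOOP_CLIENT_OPTS:-}"
```

Check your own hadoop-env.sh first, since the default heap value and the exact layout differ between Hadoop versions.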

For Apache Pig, the header of the pig bash script documents this variable:

# PIG_HEAPSIZE The maximum amount of heap to use, in MB.
# Default is 1000.

So we can set it with export. The value is already in MB, so give a bare number without a unit suffix:

$ export PIG_HEAPSIZE=4096
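The reason PIG_HEAPSIZE takes a bare number is that the launcher builds the JVM -Xmx flag from it and appends the "m" unit itself. The following is only an approximation of that logic, not the launcher's actual code, and JAVA_HEAP_MAX is an assumed variable name used here for illustration:

```shell
# PIG_HEAPSIZE is a plain number of megabytes (no "MB" suffix).
export PIG_HEAPSIZE=4096

# Approximation of how the value becomes a JVM flag;
# 1000 is the documented default when the variable is unset.
JAVA_HEAP_MAX="-Xmx${PIG_HEAPSIZE:-1000}m"
echo "$JAVA_HEAP_MAX"
```

With a value such as 4096MB, the same construction would produce -Xmx4096MBm, which is not a valid heap size for the JVM.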

