[Solved-1 Solution] Fine-tuning Pig for local execution?

Execution Modes

Pig has two execution modes or exectypes:

Local Mode - To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local). Note that local mode does not support parallel mapper execution with Hadoop 0.20.x and 1.0.0. This is because the LocalJobRunner of these Hadoop versions is not thread-safe.
Mapreduce Mode - To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. Mapreduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).

Problem:

Are there any recommendations for fine-tuning Pig for local execution?

Solution 1:

Pig's documentation makes it clear that local mode is intended to run single-threaded, and that it takes different code paths for certain functions that would otherwise use a distributed sort. As a result, optimizing Pig's local mode seems like the wrong solution to the problem presented.

Have you considered running a local, "pseudo-distributed" cluster instead of investing in full cluster setup?

We can follow Hadoop's instructions for pseudo-distributed operation, then point Pig at localhost. This would have the desired result, at the expense of two-step startup and teardown.
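
As a rough sketch of what pseudo-distributed operation involves, assuming a Hadoop 1.x layout (the same generation as the mapred.* property names used in this answer), the single-node setup has you add entries like the following to two separate files under $HADOOP_HOME/conf. The ports 9000 and 9001 are the example values commonly used for this setup, not requirements.

```xml
<!-- File 1: $HADOOP_HOME/conf/core-site.xml -->
<!-- Makes HDFS on localhost the default file system (port is an example value). -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- File 2: $HADOOP_HOME/conf/mapred-site.xml -->
<!-- Points job submission at a JobTracker on localhost (port is an example value). -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```

With the Hadoop daemons started, running pig in mapreduce mode (the default) with this configuration directory visible to it, for example through HADOOP_CONF_DIR, should submit jobs to the local cluster instead of the LocalJobRunner.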

You'll also want to raise the default number of mappers and reducers so that jobs can use all the cores available on your machine. To do so, define mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum in your local copy of $HADOOP_HOME/conf/mapred-site.xml.
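
A minimal sketch of those two entries might look like this. The values of 4 are placeholders that assume an 8-core machine split evenly between map and reduce slots; adjust them to your own core count and memory. The properties go inside the existing configuration element, alongside anything already defined there.

```xml
<!-- $HADOOP_HOME/conf/mapred-site.xml (Hadoop 1.x property names) -->
<configuration>
  <!-- Maximum number of map tasks this TaskTracker runs at once (placeholder value). -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <!-- Maximum number of reduce tasks this TaskTracker runs at once (placeholder value). -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>
</configuration>
```

These are TaskTracker settings, so they are read when the TaskTracker starts; restart the Hadoop daemons after changing them.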
