Pig Optimizer Example



Pig Multi-Query Execution

Basic Optimization Rules
- apply filters as early as possible to reduce the amount of data processed
- do not apply a filter if the cost of applying it is high and it filters out only a small amount of data
- remove NULLs before JOIN
- Pig does not (yet) detect that a field is no longer needed and drop it from the record, so project out unneeded fields explicitly
- Pig assumes the type double for numeric computations
- specify the actual type to speed up arithmetic computation (up to a 2x speedup for some queries)
- declaring types also enables early error detection
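
The rules above can be sketched in Pig Latin; the file names and field names here are hypothetical:

```
-- declare real types in the schema instead of letting Pig default to double
users  = LOAD 'users.txt'  USING PigStorage(',') AS (id:int, age:int, name:chararray);
orders = LOAD 'orders.txt' USING PigStorage(',') AS (uid:int, amount:double);

-- apply filters as early as possible: shrink the data before the join
adults = FILTER users BY age >= 18;

-- remove NULLs before JOIN: null keys never match, but would still be shuffled
valid_orders = FILTER orders BY uid IS NOT NULL;

joined = JOIN adults BY id, valid_orders BY uid;
```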

- PARALLEL keyword - sets the number of reducers
- the SET default_parallel command (script level)
- the PARALLEL clause (operator level)
- otherwise, Pig estimates the number of reducers from the size of the input data (assuming the data size does not change)
- by default it allocates one reducer for every 1GB of input data
- (*) some operators force a reduce phase, while others do not
- Example timings for different default_parallel settings:
- SET default_parallel 1 - 3m8.033s
- SET default_parallel 2 - 2m52.972s
- SET default_parallel 6 - 2m42.771s
- SET default_parallel 10 - 2m32.819s
- SET default_parallel 20 - 2m38.023s
- SET default_parallel 50 - 2m48.035s
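
Both levels can be combined in one script; the relation names below are illustrative:

```
-- script level: default number of reducers for all reduce-side operators
SET default_parallel 10;

logs = LOAD 'logs.txt' USING PigStorage(',') AS (user:chararray, bytes:long);

-- operator level: the PARALLEL clause overrides the script-level default
grouped = GROUP logs BY user PARALLEL 20;
```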
- a separate map task is created for each input file
- this may be inefficient if there are many small files
The pig.maxCombinedSplitSize property
- pig.maxCombinedSplitSize – specifies the amount of data to be processed by a single map; smaller files are combined until this size is reached
- default - 2m32.819s
- pig.maxCombinedSplitSize 64MB - 2m42.977s
- pig.maxCombinedSplitSize 128MB - 2m38.076s
- pig.maxCombinedSplitSize 256MB - 3m8.913s
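
The property takes a value in bytes and can be set from the script; for example, to combine small files into roughly 128 MB splits:

```
-- 134217728 bytes = 128 MB per map task
SET pig.maxCombinedSplitSize 134217728;
```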
- Compress the results of intermediate jobs
- Tune MapReduce and Pig parameters
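
A sketch of how intermediate-job compression can be enabled through Pig properties (the codec choice depends on what is installed on the cluster):

```
-- compress the output of intermediate MapReduce jobs
SET pig.tmpfilecompression true;
SET pig.tmpfilecompression.codec gz;
```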