pig tutorial - apache pig tutorial - pig optimizer - pig latin - apache pig - pig hadoop



Pig Optimizer Example

  • can merge together two foreach statements
  • can simplify the expression in filter statement
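
The two optimizer rules listed above can be illustrated with a small Pig Latin sketch. The input path, relation names and field names are hypothetical, and the rewritten forms shown in the comments are only what the optimizer might produce; the actual plan depends on the Pig version and can be inspected with EXPLAIN.

```pig
-- Hypothetical input: tab-separated (user, clicks, cost) records.
raw = LOAD 'input/clicks.tsv' AS (user:chararray, clicks:int, cost:double);

-- Two consecutive FOREACH statements ...
step1 = FOREACH raw GENERATE user, clicks, cost * 100 AS cents;
step2 = FOREACH step1 GENERATE user, cents / clicks AS cents_per_click;
-- ... are candidates for merging into a single projection, roughly:
--   FOREACH raw GENERATE user, (cost * 100) / clicks AS cents_per_click;

-- A FILTER with a constant sub-expression ...
kept = FILTER step2 BY cents_per_click > (2 + 3);
-- ... may be simplified so the constant is computed once, roughly:
--   FILTER step2 BY cents_per_click > 5;

-- Print the logical, physical and MapReduce plans to see what was applied.
EXPLAIN kept;
```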

    Pig Multi-Query Execution

  • parses the entire script to determine if intermediate tasks can be combined
  • Example: one MULTI_QUERY, MAP_ONLY job
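
A minimal sketch of a script that benefits from multi-query execution is shown here. The paths, relation names and field names are assumptions made for illustration. Both STORE statements read the same input and need no reduce phase, so Pig can typically run them as one map-only job instead of two.

```pig
-- Hypothetical input shared by both outputs.
events = LOAD 'input/events.tsv'
         AS (user:chararray, kind:chararray, size_bytes:long);

-- First output: only the "view" events.
views = FILTER events BY kind == 'view';
STORE views INTO 'output/views';

-- Second output: only the large events (over 1 MB).
large = FILTER events BY size_bytes > 1048576L;
STORE large INTO 'output/large';
```

Running the same script with the `-no_multiquery` command-line option disables this behaviour, which makes it possible to compare the two execution plans.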

    Basic Optimization Rules

  • Filter early and often
    • apply filters as early as possible to reduce the amount of data processed
    • do not apply a filter if the cost of applying it is very high and only a small amount of data is filtered out
    • remove NULLs before JOIN
  • Project early and often
    • Pig does not (yet) determine whether a field is no longer needed and drop the field from the record
  • Use the right data type
    • Pig assumes the double type for numeric computations when no type is declared
    • specify the actual type to speed up arithmetic computation (up to a 2x speedup for some queries)
    • declared types also allow early error detection
  • Use the right JOIN implementation
  • Select the right level of parallelism
    • Specify the number of reducers explicitly
      • the SET default_parallel command (script level)
      • the PARALLEL clause (operator level)
    • Let Pig set the number of reducers (since Pig 0.8)
      • based on the size of the input data (assumes the data size does not change)
      • by default allocates one reducer for every 1 GB of data
    • COGROUP*, GROUP*, JOIN*, CROSS, DISTINCT, LIMIT and ORDER start a reduce phase
      • (*) some implementations force a reduce phase, others do not
    • Example run times for different default_parallel values
      • SET default_parallel 1 - 3m8.033s
      • SET default_parallel 2 - 2m52.972s
      • SET default_parallel 6 - 2m42.771s
      • SET default_parallel 10 - 2m32.819s
      • SET default_parallel 20 - 2m38.023s
      • SET default_parallel 50 - 2m48.035s
  • Combine small input files
    • a separate map task is created for each file
    • this may be inefficient if there are many small files
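
Several of the rules above (declared types, early filtering and projection, choice of JOIN implementation, and parallelism) can be combined in one script. The following is a sketch under stated assumptions: all paths, relation names, field names and reducer counts are hypothetical, and the replicated join is only appropriate if the customers relation is small enough to fit in memory.

```pig
-- Script-level default for the number of reducers (illustrative value).
SET default_parallel 10;

-- Use the right data type: declare types instead of relying on defaults.
orders = LOAD 'input/orders.tsv'
         AS (order_id:long, customer_id:long, amount:int, note:chararray);
customers = LOAD 'input/customers.tsv'
            AS (customer_id:long, country:chararray);

-- Filter early: drop NULL join keys and unwanted rows before the JOIN.
orders_ok = FILTER orders BY customer_id IS NOT NULL AND amount > 0;
customers_ok = FILTER customers BY customer_id IS NOT NULL;

-- Project early: keep only the fields that are used later.
orders_slim = FOREACH orders_ok GENERATE customer_id, amount;

-- Use the right JOIN implementation: 'replicated' loads the last relation
-- into memory on every map task (assumes customers_ok is small).
joined = JOIN orders_slim BY customer_id,
              customers_ok BY customer_id USING 'replicated';

-- Operator-level parallelism overrides the script-level default.
by_country = GROUP joined BY country PARALLEL 4;
totals = FOREACH by_country
         GENERATE group AS country, SUM(joined.amount) AS total;

STORE totals INTO 'output/totals_by_country';
```

As the timings in the list suggest, more reducers are not always faster, so the values 10 and 4 here are placeholders to be tuned against the actual data.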

    pig.maxCombinedSplitSize property in Pig

  • pig.maxCombinedSplitSize specifies the size (in bytes) of data to be processed by a single map; smaller files are combined until this size is reached
  • Example run times for different pig.maxCombinedSplitSize values
    • default - 2m32.819s
    • pig.maxCombinedSplitSize 64MB - 2m42.977s
    • pig.maxCombinedSplitSize 128MB - 2m38.076s
    • pig.maxCombinedSplitSize 256MB - 3m8.913s
  • Use the LIMIT operator
  • Compress the results of intermediate jobs
  • Tune MapReduce and Pig parameters
