



Pig Optimizer Example

  • The optimizer can merge two consecutive FOREACH statements into one
  • The optimizer can simplify the expression in a FILTER statement
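The two rewrites above can be sketched on a small script. This is only an illustration: the input path, field names, and aliases are hypothetical, and whether a given Pig version applies each rewrite depends on the optimizer rules it ships with and has enabled.

```pig
-- Hypothetical input: tab-separated (user, clicks, cost)
raw = LOAD 'input/clicks.tsv' AS (user:chararray, clicks:int, cost:double);

-- Two consecutive FOREACH statements ...
step1 = FOREACH raw GENERATE user, clicks, cost * 100 AS cents;
step2 = FOREACH step1 GENERATE user, cents / clicks AS cents_per_click;

-- ... compute the same result as this single, merged FOREACH
merged = FOREACH raw GENERATE user, (cost * 100) / clicks AS cents_per_click;

-- A filter with a constant sub-expression and a redundant condition ...
f1 = FILTER raw BY clicks > 10 + 5 AND (cost > 0.0 OR cost > 0.0);

-- ... is equivalent to this simplified form
f2 = FILTER raw BY clicks > 15 AND cost > 0.0;
```

Running EXPLAIN on step2 or f1 is one way to check what the optimizer did with the plan on a particular installation.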

    Pig Multi-Query Execution

  • Pig parses the entire script to determine whether intermediate tasks can be combined
  • Example: a whole script executed as a single job with the features MULTI_QUERY, MAP_ONLY
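A minimal sketch of a script that multi-query execution can combine, assuming a hypothetical log file and output paths. Both outputs derive from the same LOAD and use only FILTER, so no reduce phase is needed and Pig can serve both STORE statements from one map-only job.

```pig
-- Hypothetical input: tab-separated (level, msg)
logs = LOAD 'input/logs.tsv' AS (level:chararray, msg:chararray);

errors   = FILTER logs BY level == 'ERROR';
warnings = FILTER logs BY level == 'WARN';

-- Two STORE statements share one LOAD; with multi-query execution
-- Pig can read the input once and write both outputs in the same job.
STORE errors   INTO 'output/errors';
STORE warnings INTO 'output/warnings';
```

Multi-query execution is on by default; starting Pig with the `-M` (or `-no_multiquery`) option turns it off, which is a simple way to compare the two behaviours.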

    Basic Optimization Rules

  • Filter early and often
    • apply filters as early as possible to reduce the amount of data processed
    • do not apply a filter if evaluating it is very expensive and it removes only a small amount of data
    • remove NULLs before JOIN
  • Project early and often
    • Pig does not (yet) determine when a field is no longer needed and drop it from the record, so project only the fields you need with FOREACH ... GENERATE
  • Use the right data type
    • when no type is declared, Pig assumes double for numeric computations
    • declare the actual type to speed up arithmetic (up to a 2x speedup for some queries)
    • declared types also allow errors to be detected early
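The first three rules (filter early, project early, declare types) can be combined in one short script. This is a sketch under assumptions: the input files, field names, and aliases are all hypothetical.

```pig
-- Hypothetical inputs; types are declared so Pig does not fall back to double
orders = LOAD 'input/orders.tsv'
         AS (order_id:long, cust_id:long, qty:int, note:chararray);
custs  = LOAD 'input/customers.tsv'
         AS (cust_id:long, country:chararray, bio:chararray);

-- Filter early: drop NULL join keys and unwanted rows before the JOIN
orders_f = FILTER orders BY cust_id IS NOT NULL AND qty > 0;
custs_f  = FILTER custs  BY cust_id IS NOT NULL;

-- Project early: keep only the fields that are used later
orders_p = FOREACH orders_f GENERATE cust_id, qty;
custs_p  = FOREACH custs_f  GENERATE cust_id, country;

joined  = JOIN orders_p BY cust_id, custs_p BY cust_id;
by_ctry = GROUP joined BY country;

-- qty was declared as int, so SUM works on integers rather than doubles
totals  = FOREACH by_ctry GENERATE group AS country, SUM(joined.qty) AS total_qty;
```

The wide chararray fields (note, bio) never reach the JOIN, which is where the shuffle cost would otherwise be paid for them.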
  • Use the right JOIN implementation
  • Select the right level of parallelism
    • Specify the number of reducers explicitly
      • the SET default_parallel command (script level)
      • the PARALLEL clause (operator level)
    • Let Pig set the number of reducers (since Pig 0.8)
      • based on the size of the input data (assumes the data size does not change during processing)
      • by default, one reducer is allocated for every 1 GB of input data
    • COGROUP*, GROUP*, JOIN*, CROSS, DISTINCT, LIMIT and ORDER start a reduce phase
      • (*) some implementations of these operators force a reduce phase, others do not
    • Sample run times for different parallelism settings
      • SET default_parallel 1 - 3m8.033s
      • SET default_parallel 2 - 2m52.972s
      • SET default_parallel 6 - 2m42.771s
      • SET default_parallel 10 - 2m32.819s
      • SET default_parallel 20 - 2m38.023s
      • SET default_parallel 50 - 2m48.035s
  • Combine small input files
    • a separate map task is created for each input file
    • this can be inefficient when there are many small files
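For the JOIN and parallelism rules in the list above, the following sketch shows a replicated join together with both ways of setting the reducer count. The file names, fields, and the specific reducer numbers are hypothetical; suitable values depend on the data and the cluster.

```pig
-- Script level: default number of reducers for reduce-side operators
SET default_parallel 10;

big   = LOAD 'input/events.tsv'    AS (user_id:long, action:chararray);
small = LOAD 'input/countries.tsv' AS (user_id:long, country:chararray);

-- Replicated join: the last relation listed must be small enough to fit
-- in memory; the join then runs map-side without a reduce phase
j = JOIN big BY user_id, small BY user_id USING 'replicated';

-- Operator level: PARALLEL overrides default_parallel for this GROUP only
g = GROUP j BY country PARALLEL 20;
counts = FOREACH g GENERATE group AS country, COUNT(j) AS n;
```

More reducers is not always faster: in the sample timings listed above, run time improves up to 10 reducers and gets worse again at 20 and 50.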

    pig.maxCombinedSplitSize Property in Pig

  • Combine small input files
    • pig.maxCombinedSplitSize: specifies the size (in bytes) of the data to be processed by a single map task; smaller files are combined until this size is reached
  • Sample run times for different pig.maxCombinedSplitSize values
    • default - 2m32.819s
    • pig.maxCombinedSplitSize 64MB - 2m42.977s
    • pig.maxCombinedSplitSize 128MB - 2m38.076s
    • pig.maxCombinedSplitSize 256MB - 3m8.913s
  • Use the LIMIT operator
  • Compress the results of intermediate jobs
  • Tune MapReduce and Pig parameters
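The last few rules are mostly a matter of setting properties at the top of a script. A minimal sketch, assuming a hypothetical input directory of many small files; the property values are illustrative rather than recommended settings.

```pig
-- Combine small input files into splits of up to 128 MB (value is in bytes)
SET pig.maxCombinedSplitSize 134217728;

-- Compress the temporary files written between the jobs of one script
SET pig.tmpfilecompression true;
SET pig.tmpfilecompression.codec gz;

data   = LOAD 'input/many_small_files' AS (id:long, score:double);
sorted = ORDER data BY score DESC;

-- LIMIT directly after ORDER means only the top rows need to be kept
top10  = LIMIT sorted 10;
STORE top10 INTO 'output/top10';
```

As the sample timings above suggest, a larger combined split size does not automatically help, so it is worth measuring a few values on real data.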
