Skewed join in Pig

  • This article covers joining skewed data with the Apache Pig skewed join. In a distributed processing environment, data skew is a serious problem. It occurs when the key tuples from the map phase are not evenly divided among the reducers, so a few reducers receive most of the data.
  • Apache Pig provides the skewed join to address data skew in joins.
What is a skewed join in Pig?
  • Skewed join works with two-table joins; joins of more than two tables are not supported.
  • Write the join with USING 'skewed' to force Pig to use a skewed join.
  • The property pig.skewedjoin.reduce.memusage specifies the fraction of heap available to the reducer to perform the join.
  • A low fraction forces Pig to use more reducers but increases the copying cost.
  • Parallel joins are vulnerable to the presence of skew in the underlying data.
  • If the underlying data is sufficiently skewed, load imbalances swamp much of the parallelism gains.
  • Skewed join does not place a restriction on the size of the input keys.
  • It accomplishes this by splitting one of the inputs on the join predicate and streaming the other input.
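The points above can be put together in a short Pig Latin script. This is a minimal sketch: the input paths, relation names, field names and the value 0.3 are assumptions chosen for illustration, not recommendations.

```pig
-- Hypothetical tuning value: fraction of reducer heap available for the join.
-- A lower fraction makes Pig use more reducers at a higher copying cost.
SET pig.skewedjoin.reduce.memusage 0.3;

-- Hypothetical inputs; 'big_data' is assumed to hold the skewed keys.
big     = LOAD 'big_data'     AS (b1:chararray, b2:chararray);
massive = LOAD 'massive_data' AS (m1:chararray, m2:chararray);

-- USING 'skewed' forces the skewed join. Only two inputs are allowed.
joined = JOIN big BY b1, massive BY m1 USING 'skewed';

STORE joined INTO 'joined_output';
```

In this sketch the left input (big) is the one whose key distribution is sampled, so the relation with the heavily repeated keys is listed first.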

Implementation:

  • A skewed join translates into two map/reduce jobs.
  • The first job samples the input records and computes a histogram of the underlying key space.
  • The second job partitions the input table and performs a join on the predicate.
  • To join the two tables, one table is partitioned and the other is streamed to the reducer.
  • The map task uses the pig.keydist file to determine the number of reducers per key.
  • It sends the key to each of those reducers in a round-robin (RR) fashion.
  • The skewed join itself happens in the reduce phase of the join job.
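The sampling job is internal to Pig, but a rough manual analogue is to count records per join key and look at the heaviest keys. The sketch below is not Pig's own sampler; it is only a way to check whether an input looks skewed enough to justify a skewed join. The input path, schema and the limit of 10 are assumptions.

```pig
-- Hypothetical input; b1 is the intended join key.
big = LOAD 'big_data' AS (b1:chararray, b2:chararray);

-- Count how many records share each key.
by_key     = GROUP big BY b1;
key_counts = FOREACH by_key GENERATE group AS b1, COUNT(big) AS n;

-- Show the most frequent keys first.
sorted   = ORDER key_counts BY n DESC;
top_keys = LIMIT sorted 10;

DUMP top_keys;
```

If a handful of keys account for a large share of the records, a regular parallel join would send all of them to the same few reducers, which is the situation the skewed join is meant to handle.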
