Tool for tuning Hive queries

1. Enable Compression in Hive

  • Compressing data at various phases (for example, the final output and the intermediate data passed between stages) reduces disk and network I/O and can improve the performance of Hive queries.
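
As a minimal sketch, compression for both phases might be switched on per session as shown here. The choice of the Snappy codec is an assumption made for illustration; any codec installed on the cluster could be substituted.

```sql
-- Compress the intermediate data passed between the stages of a query.
SET hive.exec.compress.intermediate=true;
-- Assumed codec for intermediate data; Snappy is picked only as an example.
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final output written by the query.
SET hive.exec.compress.output=true;
-- Assumed codec for the final output.
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
```

The same properties can be placed in hive-site.xml if they should apply to every session.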

2. Optimize Joins

We can improve the performance of joins by enabling automatic map join conversion, skew join optimization, and bucketed map joins.

  1. Auto Map Join
  2. Skew Joins
  3. Enable Bucketed Map Joins

Auto Map Join:

    • Auto Map Join is a useful feature when joining a big table with a small table.
    • When this feature is enabled, the small table is loaded into the local cache on each node and joined with the big table in the map phase.
    • Enabling Auto Map Join provides two advantages.
    • First, loading the small table into cache saves read time on each data node.
    • Second, it avoids skew in the join, since the join has already been done in the map phase for each block of data.
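
The behaviour described above could be enabled roughly as follows. The size threshold shown is the usual default and is included only to show where the "small table" cutoff is configured; the right value depends on available memory.

```sql
-- Let Hive convert a common join into a map join when one side is small enough.
SET hive.auto.convert.join=true;
-- Tables below this size (in bytes) are treated as "small"; 25000000 is the usual default.
SET hive.mapjoin.smalltable.filesize=25000000;
```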

Skew joins:

    • We enable skew joins by setting the hive.optimize.skewjoin property, either with a SET command in the Hive shell or in the hive-site.xml file.
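
A sketch of the SET commands for a Hive shell session. The row threshold shown is the usual default and would normally be tuned to the data.

```sql
-- Enable the skew join optimization.
SET hive.optimize.skewjoin=true;
-- A join key with more rows than this is treated as skewed; 100000 is the usual default.
SET hive.skewjoin.key=100000;
```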

Enable Bucketed Map Joins:

    • If the tables are bucketed on a specific column and that column is used as the join key, a bucketed map join can be used to improve performance.
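
The following sketch shows what such a setup might look like. The table names, column names, and bucket counts are hypothetical; the bucket counts are chosen so that one is a multiple of the other, which a bucketed map join requires.

```sql
-- Hypothetical tables, both bucketed on the join column customer_id.
CREATE TABLE orders_bucketed (order_id BIGINT, customer_id BIGINT, amount DOUBLE)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;

CREATE TABLE customers_bucketed (customer_id BIGINT, name STRING)
CLUSTERED BY (customer_id) INTO 16 BUCKETS
STORED AS ORC;

-- Allow Hive to join matching buckets in the map phase.
SET hive.optimize.bucketmapjoin=true;

SELECT o.order_id, c.name, o.amount
FROM orders_bucketed o
JOIN customers_bucketed c ON o.customer_id = c.customer_id;
```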

3. Enable Parallel Execution

  • Hive converts a query into one or more stages, such as a MapReduce stage, a sampling stage, a merge stage, and a limit stage.
  • By default, Hive executes these stages one at a time.
  • A particular job may consist of some stages that are not dependent on each other and could be executed in parallel, possibly allowing the overall job to complete more quickly.
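
As a sketch, parallel execution of independent stages can be turned on like this. The thread count shown is the usual default, not a recommendation.

```sql
-- Run stages that do not depend on each other in parallel.
SET hive.exec.parallel=true;
-- Maximum number of stages executed at the same time; 8 is the usual default.
SET hive.exec.parallel.thread.number=8;
```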

4. Single Reduce for Multi Group BY

  • With this option enabled, Hive combines multiple GROUP BY operations in a query that share common grouping keys into a single MapReduce job.
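
A rough illustration, assuming hypothetical tables named sales, sales_total_by_region, and sales_count_by_region. Both GROUP BY clauses use the same key, which is the case this setting targets.

```sql
-- Plan multiple GROUP BYs with common keys as a single MapReduce job.
SET hive.multigroupby.singlereducer=true;

-- Hypothetical multi-insert query: two aggregations over the same grouping key.
FROM sales
INSERT OVERWRITE TABLE sales_total_by_region
  SELECT region, SUM(amount) GROUP BY region
INSERT OVERWRITE TABLE sales_count_by_region
  SELECT region, COUNT(1) GROUP BY region;
```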

5. Enable Vectorization

  • Vectorization was first introduced in the Hive 0.13 release.
  • It improves operations such as scans, aggregations, filters, and joins by processing batches of 1024 rows at a time instead of one row at a time.
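
A minimal sketch of the relevant settings. Early releases supported vectorization only for tables stored as ORC, so the storage format may need checking on older versions.

```sql
-- Enable vectorized execution on the map side.
SET hive.vectorized.execution.enabled=true;
-- Enable vectorized execution on the reduce side (available in later releases).
SET hive.vectorized.execution.reduce.enabled=true;
```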

6. Enable Cost Based Optimization

  • Cost-based optimization (CBO) lets Hive plan each query based on its estimated cost, which can lead to different decisions: how to order joins, which type of join to perform, and the degree of parallelism.
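
A sketch of how CBO might be enabled. The optimizer relies on table and column statistics, so the example also gathers them for a hypothetical table named sales.

```sql
-- Enable the cost-based optimizer.
SET hive.cbo.enable=true;
-- Let the optimizer use stored statistics.
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;

-- Gather statistics for a hypothetical table so the cost estimates have data to work with.
ANALYZE TABLE sales COMPUTE STATISTICS;
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;
```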
