How will you optimize Hive performance? - Wikitechy

What is the difference between ‘select * from table’ and ‘select column from table’ in Hive?

Editor — Tue, 13 Jul 2021 22:18:52 +0000

A table in Hive is stored as a directory in HDFS.
With ‘select * from table’, the Hive query processor simply goes to that directory, which holds one or more files, and reads them using the table schema.
This is reasonable only when the data is very small, for example less than a gigabyte.
On a real cluster the table may hold terabytes of data, so ‘select * from table’ can run for a long time just to display the rows.
Any data processing you do in Hive is achieved through a sequence of MapReduce programs that read the table's data from the Hadoop Distributed File System (HDFS); Hive's query processing engine is based on MapReduce.
Tables often have a large number of columns representing different values. To perform ‘select column from table’, the MapReduce program scans all rows and extracts the requested column.
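
As a rough illustration of the last point, the sketch below mimics what a map task does for a single-column select: it reads every row and keeps only one field. The sample rows, the comma delimiter, and the name `project_column` are assumptions made for this example; they are not Hive internals.

```python
from typing import Iterable, Iterator


def project_column(lines: Iterable[str], index: int, delimiter: str = ",") -> Iterator[str]:
    """Scan every row and yield only the field at position `index`."""
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        yield fields[index]


# Hypothetical contents of a table file with columns: id, name, city
rows = ["1,asha,chennai", "2,ravi,mumbai", "3,mei,pune"]
names = list(project_column(rows, index=1))
```

Every row is still read in full, which is why selecting one column from a very large text table is not free.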

Editor — Tue, 13 Jul 2021 22:18:28 +0000

Hive	HBase
Hive is a query engine.	HBase is a data store, particularly suited to unstructured data.
Apache Hive is mainly used for batch processing, i.e. OLAP.	HBase is widely used for transactional processing (OLTP), where individual reads and writes need a fast response.
Operations in Hive are transformed into MapReduce jobs.	Operations in HBase run in real time on the database.
For big data applications that require complex and fine-grained processing, Hadoop MapReduce is the best choice.	HBase should be used when the data model schema is sparse.
Hive is used for data warehousing requirements where programmers do not want to write complex MapReduce code.	HBase is an ideal big data solution if the application requires random reads, random writes, or both.
Hive did not originally support UPDATE statements; they are available only on transactional (ACID) tables since Hive 0.14.	HBase queries are written in a custom language that needs to be learned.
Hive does not provide interactive querying; it only runs batch processes on Hadoop.	Apache HBase is a NoSQL key/value store that runs on top of HDFS.
Hive has the limitation of high latency.	HBase does not have analytical capabilities.
Hive is for analytical queries.	HBase is for real-time querying.
Hive is used for analytical querying of data collected over a period of time and should not be used for real-time querying.	HBase is well suited to real-time workloads; for example, Facebook uses it for messaging and real-time analytics, and may even use it to count Facebook likes.

Editor — Tue, 13 Jul 2021 22:18:19 +0000

Apache Hive	Impala
Hive generates query expressions at compile time.	Impala supports neither complex types nor fault tolerance.
Hive does not generate runtime code for “big loops” using LLVM.	Impala generates runtime code for “big loops” using LLVM.
Hadoop 2.7.3	Hadoop 2.6.0
All queries run through LLAP	Runtime Filtering Optimization Enabled
ORCFile format with zlib compression	Parquet format with snappy compression
Every Hive query suffers from a “cold start” problem.	Impala avoids startup overhead because its daemon processes are started at boot time and are always ready to process a query.
Apache Hive might not be ideal for interactive computing.	Impala is meant for interactive computing.
Hive is batch-oriented and built on Hadoop MapReduce.	Impala is more like an MPP database.
Hive supports complex types.	Impala does not support complex types.
Apache Hive is fault tolerant.	Impala does not support fault tolerance.
Hive is a more universal, versatile, and pluggable tool.	Impala is used for its raw processing power and very fast analytic results.

Editor — Tue, 13 Jul 2021 21:50:34 +0000

Applying compression at various phases of a job (on intermediate data and on the final output) improves the performance of Hive queries.
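
The effect can be sketched in plain Python with `zlib`. This is only an illustration of why compressed intermediate data is cheaper to write and move; it is not Hive's code path, and the sample data is made up. In Hive itself, compression is normally switched on through configuration properties such as `hive.exec.compress.intermediate` and `hive.exec.compress.output`.

```python
import zlib

# Hypothetical intermediate map output: repetitive key/value text compresses well.
intermediate = "\n".join(f"user_{i % 10}\t1" for i in range(10_000)).encode("utf-8")

compressed = zlib.compress(intermediate)   # what would be written or shuffled
restored = zlib.decompress(compressed)     # what the next phase reads back

# Fraction of the original size that actually has to be moved.
ratio = len(compressed) / len(intermediate)
```

The trade-off is extra CPU time for compressing and decompressing in exchange for less disk and network I/O.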

We can improve the performance of joins by enabling auto-convert map joins and by enabling skew-join optimization.

- Auto Map-Join is a useful feature when joining a big table with a small table.
- If we enable this feature, the small table is saved in the local cache on each node and then joined with the big table in the map phase.
- Enabling Auto Map-Join provides two advantages.
- First, loading the small table into the cache saves read time on each data node.
- Second, it avoids skew joins in the Hive query, since the join has already been done in the map phase for each block of data.
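
The idea behind a map join can be sketched as an in-memory hash join. This is a minimal illustration, assuming rows are `(key, value)` tuples and that the small table fits in memory; the function and table names are hypothetical and this is not Hive's implementation.

```python
from collections import defaultdict


def map_side_join(big_rows, small_rows):
    """Join (key, value) rows; small_rows must be small enough to hold in memory."""
    # Build an in-memory hash table from the small table (the "local cache").
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)
    # Stream the big table once; no shuffle or reduce phase is needed.
    for key, big_value in big_rows:
        for small_value in lookup.get(key, []):
            yield key, big_value, small_value


orders = [(1, "order-a"), (2, "order-b"), (1, "order-c"), (3, "order-d")]
countries = [(1, "IN"), (2, "US")]
joined = list(map_side_join(orders, countries))
```

Because each block of the big table is joined locally against the cached small table, rows with the same key never have to be collected on a single reducer.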

- We enable skew joins by setting the hive.optimize.skewjoin property to true, either with the SET command in the Hive shell or in the hive-site.xml file.
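
A simplified way to picture skew handling is to separate the rows whose join key is unusually frequent so they can be processed on their own. The sketch below assumes `(key, value)` rows and a hypothetical `threshold` parameter; Hive's actual mechanism detects skew while the join runs and differs in detail.

```python
from collections import Counter


def split_skewed(rows, threshold):
    """Separate rows whose join key appears at least `threshold` times."""
    counts = Counter(key for key, _ in rows)
    skewed_keys = {key for key, n in counts.items() if n >= threshold}
    regular = [row for row in rows if row[0] not in skewed_keys]
    skewed = [row for row in rows if row[0] in skewed_keys]
    return regular, skewed, skewed_keys


# Hypothetical click data in which the key "home" dominates.
clicks = [("home", i) for i in range(6)] + [("about", 1), ("help", 2)]
regular, skewed, skewed_keys = split_skewed(clicks, threshold=5)
```

The regular rows can go through the normal join, while the skewed rows are small in key count and can be joined separately without overloading one reducer.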

- If the tables are bucketed on a specific column and those tables are used in joins on that column, a bucketed map join can be used to improve performance.
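
The sketch below shows why bucketing helps: when both tables are bucketed on the join key, only matching bucket pairs need to be joined. It assumes integer keys, the same number of buckets on both sides, and a plain modulo as a stand-in for the real hash function; all names are hypothetical.

```python
def bucket_of(key, num_buckets):
    """Assign a key to a bucket (simplified stand-in for a hash function)."""
    return key % num_buckets


def bucketize(rows, num_buckets):
    """Distribute (key, value) rows into `num_buckets` lists."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets


def bucket_map_join(left_buckets, right_buckets):
    """Join matching bucket pairs only; each right bucket is hashed in memory."""
    for left, right in zip(left_buckets, right_buckets):
        lookup = {}
        for key, value in right:
            lookup.setdefault(key, []).append(value)
        for key, left_value in left:
            for right_value in lookup.get(key, []):
                yield key, left_value, right_value


users = bucketize([(1, "asha"), (2, "ravi"), (3, "mei"), (4, "tom")], 2)
visits = bucketize([(1, "home"), (3, "cart"), (4, "help"), (4, "home")], 2)
result = sorted(bucket_map_join(users, visits))
```

Only one bucket of the second table has to be held in memory at a time, rather than the whole table.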

Hive converts a query into one or more stages, such as a MapReduce stage, a sampling stage, a merge stage, and a limit stage.
By default, Hive executes these stages one at a time.
A particular job may consist of some stages that are not dependent on each other and could be executed in parallel, possibly allowing the overall job to complete more quickly.
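
Running independent stages side by side can be sketched with a thread pool. The stage names and the `run_stage` placeholder are hypothetical; in Hive this behavior is controlled by the `hive.exec.parallel` property rather than by user code.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage(name):
    """Placeholder for one independent stage of a query plan."""
    return f"{name} done"


# Hypothetical stages with no dependencies between them.
independent_stages = ["stage-1", "stage-2", "stage-3"]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() returns results in input order even though stages may run concurrently.
    results = list(pool.map(run_stage, independent_stages))
```

Stages that depend on each other's output still have to run in sequence, so the gain depends on how much of the plan is independent.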

A single reducer can serve multiple operations: Hive can combine multiple GROUP BY operations in a query into a single MapReduce job.
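
The benefit of combining aggregations can be pictured as computing several results per key in one pass instead of one pass per aggregation. This is a minimal sketch with made-up sales rows, assuming both aggregations group by the same key; it is not how Hive plans the job internally.

```python
from collections import defaultdict


def multi_aggregate(rows):
    """Compute a count and a sum per key in a single pass over the data."""
    counts = defaultdict(int)
    totals = defaultdict(int)
    for key, amount in rows:
        counts[key] += 1
        totals[key] += amount
    return dict(counts), dict(totals)


sales = [("east", 10), ("west", 5), ("east", 7), ("west", 1), ("north", 4)]
counts, totals = multi_aggregate(sales)
```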

Vectorization was first introduced in the Hive 0.13 release.
It improves operations such as scans, aggregations, filters, and joins by processing batches of 1024 rows at a time instead of one row at a time.
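
The batch-at-a-time control flow can be sketched as follows. Pure Python will not show the speed-up that a vectorized engine gets from tight loops over column arrays, so this only illustrates the shape of the processing; the helper names are hypothetical.

```python
from itertools import islice

BATCH_SIZE = 1024  # batch size mentioned in the text


def batches(rows, size=BATCH_SIZE):
    """Yield lists of up to `size` rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def filtered_sum(rows, threshold):
    """Apply a filter and an aggregation to one batch at a time."""
    total = 0
    batch_count = 0
    for batch in batches(rows):
        total += sum(value for value in batch if value > threshold)
        batch_count += 1
    return total, batch_count


total, batch_count = filtered_sum(range(3000), threshold=2997)
```

With 3000 input rows the loop body runs three times (1024, 1024, and 952 rows) rather than 3000 times, which is where the per-row overhead is saved.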

Hive also provides cost-based optimization (CBO), which makes decisions based on the estimated cost of a query: how to order joins, which type of join to perform, and the degree of parallelism.
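
To make the join-ordering decision concrete, here is a toy cost model that tries every left-deep join order and picks the cheapest. The row counts, the fixed `selectivity`, and the cost formula are all assumptions for illustration; a real optimizer uses table statistics and a far richer model.

```python
from itertools import permutations


def plan_cost(order, row_counts, selectivity=0.001):
    """Toy cost: total rows produced by the intermediate joins of a left-deep plan."""
    size = row_counts[order[0]]
    cost = 0.0
    for table in order[1:]:
        size = size * row_counts[table] * selectivity
        cost += size
    return cost


def cheapest_order(row_counts):
    """Return the join order with the lowest toy cost."""
    return min(permutations(row_counts), key=lambda order: plan_cost(order, row_counts))


# Hypothetical table statistics (row counts).
stats = {"orders": 1_000_000, "customers": 10_000, "regions": 10}
best = cheapest_order(stats)
```

Under this model the two small tables are joined first and the largest table last, which keeps the intermediate result small.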