apache hive - hive mapreduce - hadoop mapreduce - hive tutorial - hadoop hive - hadoop hive - hiveql

What is Hive ?

Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
It lives on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook, later the Apache Software Foundation took it up and developed it further as an open source under the name Apache Hive.

Learn hive - hive tutorial - learn hive tutorial - process in hive vs mapreduce - hive example - hive examples - hive programs

What is Mapreduce?

MapReduce is a programming model suitable for processing of huge data.

learn hive - hive tutorial - apache hive - map reduce plans - hive examples

Hadoop is capable of running MapReduce programs written in various languages:

Java,
Ruby,
Python,
and C++.

MapReduce programs are parallel in nature, thus are very useful for performing large-scale data analysis using multiple machines in the cluster.

Learn hive - hive tutorial - learn hive tutorial - data sharing in hive vs map reduce - hive example - hive examples - hive programs

Difference between Hive and Mapreduce:

Previous to selecting one of these two options, we must look at some of their features. Hive and Map reduce choosing some factors are,

Type of Data
Amount of Data
Complexity of Code

Feature	Hive	Map Reduce
Language	It Supports SQL like query language for interaction and for Data modeling	It compiles language with two main tasks present in it. One is map task, and another one is a reducer.
	Compiled Language	We can define these task using Java or Python Sql like Query Language
Level of abstraction	Higher level of Abstraction on top of HDFS	Lower level of abstraction
Efficiency in Code	Comparatively lesser than Map reduce	Provides High efficiency
Extent of code	Less number of lines code required for execution	More number of lines of codes to be defined
Type of Development work required	Less Development work required	More development work needed

apache hive related article tags - hive tutorial - hadoop hive - hadoop hive - hiveql - hive hadoop - learnhive - hive sql

MAPREDUCE SCRIPT :

Using an approach like Hadoop Streaming, the TRANSFORM, MAP and REDUCE clauses make it possible to invoke an external script or program from Hive.

Example - script to filter out rows to remove poor quality readings.

   import re
   import sys
   for line in sys.stdin:
   (year, temp, q) = line.strip().split()
   if (temp != "9999" and re.match("[01459]", q))
   print "%s\t%s" % (year, temp)

   hive> ADD FILE /path/to/is_good_quality.py;
   hive> FROM records2
   > SELECT TRANSFORM(year, temperature, quality)
   > USING 'is_good)quality.py'
   > AS year, temperature;

Output :
   1949 111
   1949 78
   1950 0
   1950 22
   1950 -11

learn hive - hive tutorial - apache hive - mapreduce vs pig vs hive - hive examples

Hive QL Join :

Hive QL Join - Example 1 :

learn hive - hive tutorial - apache hive - hive sql join - hive examples

learn hive - hive tutorial - apache hive - hive sql join map reduce - hive examples

Hive QL Join - Example 2 :

Below is HiveQL Join,

INSERT OVERWRITE TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv
JOIN user u
ON (pv.userid = u.userid);

learn hive - hive tutorial - apache hive - hive mapreduce programming - hive examples

Rightmost table streamed – whereas inner tables data is kept in memory for a given key. Use largest table as the right most table.
hive.mapred.mode = nonstrict
In strict mode, Cartesian product not allowed

Below is HiveQL Join, :

INSERT OVERWRITE TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view p JOIN user u
ON (pv.userid = u.userid) JOIN newuser x on (u.userid = x.userid);

Same join key – merge into 1 map-reduce job – true for any number of tables with the same join key.
The merging happens for OUTER joins also

INSERT OVERWRITE TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view p JOIN user u
ON (pv.userid = u.userid) JOIN newuser x on (u.age = x.age);

Different join keys – 2 map-reduce jobs Same as:

The above query can be splitted as two seperate queries,

INSERT OVERWRITE TABLE tmptable SELECT *
FROM page_view p JOIN user u
ON (pv.userid = u.userid);

The above query output can be merged with the below query to fetch the final output,

INSERT OVERWRITE TABLE pv_users
SELECT x.pageid, x.age
FROM tmptable x JOIN newuser y on (x.age = y.age);

Join Optimization - Map Joins :

User specified small tables stored in hash tables on the mapper backed by jdbm

No reducer needed

INSERT INTO TABLE pv_users
SELECT /*+ MAPJOIN(pv) */ pv.pageid, u.age
FROM page_view pv JOIN user u
ON (pv.userid = u.userid);

learn hive - hive tutorial - apache hive - hiveql map join and join optimization - hive examples

Optimization phase
n-way map-join if (n-1) tables are map side readable
Mapper reads all (n-1) tables before processing the main table under consideration
Map-side readable tables are cached in memory and backed by JDBM persistent hash tables

Parameters for Join Optimization and Map Joins :

hive.join.emit.interval = 1000
hive.mapjoin.size.key = 10000
hive.mapjoin.cache.numrows = 10000

Hive QL Multiple Tables Join with Map Reduce - Example 3 :

learn hive - hive tutorial - apache hive - hive sql multiple table join map reduce - hive examples

apache hive related article tags - hive tutorial - hadoop hive - hadoop hive - hiveql - hive hadoop - learnhive - hive sql

Hive QL – Group By - Optimizations :

SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;

< p align="center">learn hive - hive tutorial - apache hive - hive group by mapreduce - hive examples

Map side partial aggregations
- Hash-based aggregates
- Serialized key/values in hash tables
- 90% speed improvement on Query
Load balancing for data skew

Parameters for Group by Optimization :

hive.map.aggr = true
hive.groupby.skewindata = false
hive.groupby.mapaggr.checkinterval = 100000
hive.map.aggr.hash.percentmemory = 0.5
hive.map.aggr.hash.min.reduction = 0.5

apache hive - hive mapreduce - hadoop mapreduce - hive tutorial - hadoop hive - hadoop hive - hiveql

apache hive related article tags - hive tutorial - hadoop hive - hadoop hive - hiveql - hive hadoop - learnhive - hive sql

What is Hive ?

What is Mapreduce?

Difference between Hive and Mapreduce:

apache hive related article tags - hive tutorial - hadoop hive - hadoop hive - hiveql - hive hadoop - learnhive - hive sql

MAPREDUCE SCRIPT :