Hadoop, HBase, and Hive
- Hadoop is an open source software stack that runs on a cluster of machines.
- Hadoop provides distributed storage and distributed processing for very large data sets.
It has the following two core components:
- The Hadoop Distributed File System (HDFS) is a Java-based distributed file system that stores big data across multiple nodes in a Hadoop cluster.
- So, if you install Hadoop, you get HDFS as the underlying storage system for storing big data sets in a distributed environment.
- MapReduce is a programming framework for writing applications that process massive amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable, fault-tolerant manner.
- Apache HBase is the Hadoop database: a distributed, scalable, column-oriented big data store.
- HBase is built on top of HDFS, which means that the data you store in HBase is stored in HDFS itself.
- Hive is an important tool in the Hadoop ecosystem. It provides an SQL dialect for querying data stored in HDFS, in other file systems that integrate with Hadoop (such as MapR-FS and Amazon’s S3), and in databases such as HBase (the Hadoop database) and Cassandra.
- Like HBase, Hive stores its data in HDFS, but it also uses MapReduce: it compiles queries into MapReduce jobs and runs them on the cluster. It was one of the first abstraction engines to be built on top of MapReduce. Hive needs a metastore (a JDBC-compliant RDBMS) to store its metadata.
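
To make the MapReduce model mentioned above more concrete, here is a minimal single-process sketch in Python of the three stages a job goes through: map, shuffle (group by key), and reduce. It is only an illustration, not the Hadoop API: the function names (map_phase, shuffle, reduce_phase, run_job) and the sample lines are hypothetical, and a real job would run these stages in parallel across the cluster with its input read from HDFS. A Hive aggregation that counts rows per key with GROUP BY follows roughly the same shape once it is compiled to a MapReduce job.

```python
from collections import defaultdict


def map_phase(record):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence (hypothetical local driver)."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input; on a cluster these lines would come from files in HDFS.
lines = [
    "hive runs on hadoop",
    "hbase runs on hdfs",
    "hadoop stores data in hdfs",
]
counts = run_job(lines)
print(counts)
```

Each mapper sees only one record and each reducer sees only the values for one key. That independence is what lets a framework like MapReduce spread the work over many machines and rerun a failed task without redoing the whole job.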