difference between big data and hadoop - Wikitechy

Difference between Hive and HBase ?

Editor — Mon, 12 Jul 2021 19:18:59 +0000

Hive is a datawarehousing package built on the top of Hadoop. It is mainly used for data analysis. It generally target towards users already comfortable with Structured Query Language (SQL).
It is similar to SQL and called Hive Query Language (HQL).
Hive manages and queries structured data. Moreover, hive abstracts complexity of Hadoop. It does not support
- Not a full database.
- Not a real time processing system.
- Not SQL-92 compliant.
- Does not provide row level insert, updates or deletes.
- Doesn’t support transactions and limited sub-query support.
- Query optimization in evolving stage.

HBase is a column-oriented database management system that runs on top of Hadoop Distributed File System (HDFS).
It is well suited for sparse data sets, which are common in many Big Data use cases.
It is an opensource, distributed database developed by Apache software foundations.
Initially, it was named Google Big Table, afterwards it was re-named as HBase and is primarily written in Java.
It can store massive amount of data from terabytes to petabytes.
It is built for low-latency operations and is used extensively for read and write operations.
It stores large amount of data in the form of tables.

HIVE	HBASE
Hive is a query engine.	Data storage particularly for unstructured data.
Mainly used for batch processing.	Extensively used for transactional processing.
Not a real time processing.	Real-time processing.
Only for analytical queries.	Real-time querying.
Runs on the top of Hadoop.	Runs on the top of HDFS (Hadoop distributed file system).
Apache Hive is not a database.	It support NoSQL database.
It has schema model.	It is free from schema model.
Made for high latency operations.	Made for low level latency operations.

Editor — Mon, 12 Jul 2021 17:16:29 +0000

Big data analytics is the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information.
Hadoop is a framework to store and process big data. Hadoop specifically designed to provide distributed storage and parallel data processing that big data requires.

Hadoop stores huge files as they are (raw) without specifying any schema.

High scalability – We can add any number of nodes, hence enhancing performance dramatically.
High availability – In hadoop data is highly available despite hardware failure. If a machine or few hardware crashes, then we can access data from another path.
Reliable – Data is reliably stored on the cluster despite of machine failure.
Economic – Hadoop runs on a cluster of commodity hardware which is not very expensive.

Hadoop is an open source project from Apache Software Foundation.
It provides a software framework for distributing and running applications on clusters of servers that is inspired by Google’s Map-Reduce programming model as well as its file system(GFS).
Hadoop was originally written for the nutch search engine project.
Hadoop is open source framework written in Java. It efficiently processes large volumes of data on a cluster of commodity hardware.
Hadoop can be setup on single machine , but the real power of Hadoop comes with a cluster of machines , it can be scaled from a single machine to thousands of nodes. Hadoop consists of two key parts,
- Hadoop Distributes File System(HDFS)
- Map-Reduce.

HDFS is a highly fault tolerant, distributed, reliable, scalable file system for data storage.
HDFS stores multiple copies of data on different nodes; a file is split up into blocks (Default 64 MB) and stored across multiple machines.
Hadoop cluster typically has a single namenode and number of datanodes to form the HDFS cluster.

Map-Reduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks.
It is also a paradigm for distributed processing of large data set over a cluster of nodes.

Editor — Mon, 12 Jul 2021 16:09:43 +0000

Apache Spark is an open-source distributed cluster-computing framework.
Spark is a data processing engine developed to provide faster and ease-of-use analytics than Hadoop MapReduce.
Before Apache Software Foundation took possession of Spark, it was under the control of University of California, Berkeley’s AMP Lab.

Apache Hadoop is an open-source framework written in Java that allows us to store and process Big Data in a distributed environment, across various clusters of computers using simple programming constructs.
To do this, Hadoop uses an algorithm called MapReduce, which divides the task into small parts and assigns them to a set of computers.
Hadoop also has its own file system, Hadoop Distributed File System (HDFS), which is based on the Google File System (GFS).
HDFS is designed to run on low-cost hardware.

CRITERIA	SPARK	HADOOP MAPREDUCE
Memory	Let’s save data on memory with the use of RDD’s.	Does not leverage the memory of the hadoop cluster to maximum.
Disk usage	Spark caches data in-memory and ensures low latency.	MapReduce is disk oriented.
Processing	Supports real-time processing through spark streaming.	Only batch processing is supported
Installation	Is not bound to Hadoop.	Is bound to hadoop.
Storage	Leverage exciting	HDFS
Speed	10 – 100X faster.	Fast.
Rsource management	standalone	YARN