What is Apache Spark

  • Apache Spark is an open-source distributed cluster-computing framework.
  • Spark is a data processing engine developed to provide faster and easier-to-use analytics than Hadoop MapReduce.
  • Spark originated at the University of California, Berkeley’s AMPLab before it was donated to the Apache Software Foundation.

What is Apache Hadoop

  • Apache Hadoop is an open-source framework written in Java that allows us to store and process Big Data in a distributed environment, across clusters of computers, using simple programming constructs.
  • To do this, Hadoop uses a programming model called MapReduce, which divides a job into small tasks and assigns them to a set of computers in the cluster.
  • Hadoop also has its own file system, Hadoop Distributed File System (HDFS), which is based on the Google File System (GFS).
  • HDFS is designed to run on low-cost hardware.
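
The divide-and-assign idea behind MapReduce can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Hadoop's actual API: the function names (split_into_chunks, map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a thread pool stands in for the computers of a cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, num_chunks):
    """Divide the input into roughly equal parts, one per simulated worker."""
    return [lines[i::num_chunks] for i in range(num_chunks)]


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in this chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across every chunk."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single result."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines, num_workers=3):
    """Run the map, shuffle, and reduce steps over the input lines."""
    chunks = split_into_chunks(lines, num_workers)
    # Assumption: threads stand in for the machines a real cluster would use.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    sample = [
        "spark and hadoop",
        "hadoop stores data in hdfs",
        "spark caches data in memory",
    ]
    print(word_count(sample))
```

In a real Hadoop job the chunks would be HDFS blocks and the map and reduce tasks would run on different machines; the sketch keeps only the shape of the computation.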

| Criteria | Spark | Hadoop MapReduce |
|---|---|---|
| Memory | Lets you keep data in memory using RDDs. | Does not make full use of the Hadoop cluster's memory. |
| Disk usage | Caches data in memory, which keeps latency low. | Disk-oriented. |
| Processing | Supports near-real-time processing through Spark Streaming. | Supports batch processing only. |
| Installation | Not bound to Hadoop. | Bound to Hadoop. |
| Storage | Can leverage existing HDFS. | |
| Speed | 10 to 100x faster than MapReduce. | Fast. |
| Resource management | Standalone | YARN |

Hadoop Vs Spark
