Apache Spark - Wikitechy

Source: Wikitechy Interview Questions, Apache Spark category, https://www.wikitechy.com/interview-questions/category/apache-spark/ (feed dated Tue, 14 Sep 2021)

What is Shark ?

Shark is a tool developed for people from a database background, giving them access to Spark's capabilities, including MLlib, through a Hive-like SQL interface.
Shark helps data users run Hive on Spark, offering compatibility with the Hive metastore, queries, and data. The project was later discontinued in favor of Spark SQL.
Like Hive queries, Shark queries are written in a SQL-like language called HiveQL, which Shark translates into Spark Directed Acyclic Graphs (DAGs) that are executed on the cluster.
More complex queries are supported through User Defined Functions (UDFs) that can be written in Java and referenced by a HiveQL query.

What is the difference between Spark and Hadoop MapReduce ?

What is Apache Spark

Apache Spark is an open-source distributed cluster-computing framework.
Spark is a data processing engine designed to provide faster and easier-to-use analytics than Hadoop MapReduce.
Before Spark was donated to the Apache Software Foundation, it was developed at the AMPLab at the University of California, Berkeley.

What is Apache Hadoop

Apache Hadoop is an open-source framework written in Java that allows us to store and process Big Data in a distributed environment, across clusters of computers, using simple programming constructs.
To do this, Hadoop uses a programming model called MapReduce, which divides a job into small tasks and assigns them to a set of computers.
Hadoop also has its own file system, the Hadoop Distributed File System (HDFS), which is based on the Google File System (GFS).
HDFS is designed to run on low-cost hardware.
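
The MapReduce model described above can be illustrated with a minimal word-count sketch in plain Python. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical, and everything runs in one process, with each input string standing in for a split that a separate machine would handle.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk stands in for an input split processed on a separate machine.
    mapped = chain.from_iterable(map_phase(c) for c in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical sample input, one string per split.
chunks = ["spark and hadoop", "hadoop stores data", "spark caches data"]
counts = word_count(chunks)
print(counts)
```

In a real cluster the map and reduce tasks run on different nodes and the shuffle moves data over the network; the sketch only shows how the work is divided.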

CRITERIA	SPARK	HADOOP MAPREDUCE
Memory	Keeps data in memory using RDDs.	Does not make full use of the Hadoop cluster's memory.
Disk usage	Caches data in memory, which keeps latency low.	Disk-oriented.
Processing	Supports near-real-time processing through Spark Streaming, in addition to batch.	Supports batch processing only.
Installation	Not bound to Hadoop.	Bound to Hadoop.
Storage	Leverages existing storage systems, such as HDFS.	HDFS
Speed	Up to 10-100x faster on some workloads.	Slower, since intermediate results go to disk.
Resource management	Has its own standalone cluster manager; can also run on YARN.	YARN


What is RDD ?

A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable, distributed collection of objects.
Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
Formally, an RDD is a read-only, partitioned collection of records.
RDDs can be created through deterministic operations on either data on stable storage or other RDDs.
An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
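
The properties listed above (read-only, partitioned, derived by deterministic operations, fault-tolerant) can be sketched with a toy model in plain Python. This is not Spark's API: `ToyRDD` and its methods are hypothetical names, and the sketch assumes fault tolerance comes from recomputing a partition from its lineage rather than from stored copies.

```python
class ToyRDD:
    """Read-only, partitioned collection that remembers how it was derived."""

    def __init__(self, num_partitions, compute, parent=None):
        self.num_partitions = num_partitions
        self._compute = compute  # function: partition index -> list of records
        self.parent = parent     # lineage link; None for data read from storage

    @classmethod
    def from_storage(cls, records, num_partitions):
        data = tuple(records)  # immutable snapshot standing in for stable storage

        def compute(i):
            # Round-robin assignment of records to partitions.
            return list(data[i::num_partitions])

        return cls(num_partitions, compute)

    def map(self, fn):
        # Returns a new ToyRDD; the parent is never modified.
        return ToyRDD(
            self.num_partitions,
            lambda i: [fn(x) for x in self.partition(i)],
            parent=self,
        )

    def filter(self, pred):
        return ToyRDD(
            self.num_partitions,
            lambda i: [x for x in self.partition(i) if pred(x)],
            parent=self,
        )

    def partition(self, i):
        """Recompute partition i from lineage; a lost result can be rebuilt."""
        return self._compute(i)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


base = ToyRDD.from_storage(range(1, 9), num_partitions=3)
squares = base.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Partition 1 can be rebuilt at any time from the deterministic lineage.
print(evens.partition(1))
print(sorted(evens.collect()))
```

Because every transformation is deterministic and each derived dataset keeps a link to its parent, any single partition can be recomputed independently, which is the idea behind computing partitions on different nodes.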
