1. What is Apache Spark?
Apache Spark is an open-source, lightning-fast cluster-computing technology built on the ideas of Hadoop MapReduce, supporting a variety of computational techniques for fast and efficient processing. Spark is best known for its in-memory cluster computing, the main feature responsible for the high processing speed of Spark applications. Spark began as a Hadoop subproject, developed by Matei Zaharia in 2009 at UC Berkeley's AMPLab. It was open-sourced in 2010 under a BSD license and donated to the Apache Software Foundation in 2013. Since 2014, Spark has been a top-level Apache project.
Apache Spark is an open-source distributed processing framework known for its speed and ease of use in big data processing and analysis. It has built-in modules for SQL, streaming, machine learning, and graph processing. The Spark execution engine supports in-memory computation and cyclic data flow; it can run in cluster mode or standalone mode and can access diverse data sources such as HDFS, HBase, and Cassandra.
2. What are the features of Apache Spark?
High Processing Speed: Apache Spark achieves very high data-processing speeds by reducing read-write operations to disk. It is up to 100x faster for in-memory computation and up to 10x faster for disk-based computation than Hadoop MapReduce.
Dynamic Nature: Spark provides around 80 high-level operators that make it easy to develop parallel applications.
In-Memory Computation: Backed by its DAG execution engine, Spark computes in memory, which increases the speed of data processing. It also supports caching, reducing the time required to fetch data from disk (see the caching sketch after this list).
Reusability: Spark code can be reused for batch processing, data streaming, running ad-hoc queries, and more.
Fault Tolerance: Spark provides fault tolerance through RDDs, abstractions designed to handle worker-node failures; lost partitions can be recomputed from their lineage, ensuring zero data loss.
Stream Processing: Spark supports real-time stream processing, overcoming a limitation of the earlier MapReduce framework, which could only process data that already existed.
Lazy Evaluation: Transformations on Spark RDDs are lazy: they do not produce results right away but instead record the creation of new RDDs from existing ones, with computation deferred until an action is called. This lazy evaluation increases system efficiency.
Support for Multiple Languages: Spark supports multiple languages such as R, Scala, Python, and Java, which overcomes Hadoop's limitation of developing applications only in Java.
Hadoop Integration: Spark supports the Hadoop YARN cluster manager, making it flexible to run within existing Hadoop clusters.
Rich Libraries: Spark ships with GraphX for graph-parallel execution, Spark SQL, MLlib for machine learning, and more.
Cost Efficiency: Apache Spark is considered a more cost-efficient solution than Hadoop, since Hadoop requires large amounts of storage and big data centers for data processing and replication.
Active Developer Community: Apache Spark has a large developer base involved in continuous development and is considered one of the most important projects undertaken by the Apache community.
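As a brief illustration of lazy evaluation and in-memory caching, here is a minimal sketch; the application name and the input file events.log are hypothetical placeholders:

```scala
import org.apache.spark.sql.SparkSession

object LazyCacheDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LazyCacheDemo")
      .master("local[*]")
      .getOrCreate()

    // Transformations only record lineage; nothing executes yet.
    val errors = spark.sparkContext
      .textFile("events.log")          // hypothetical input file
      .filter(_.contains("ERROR"))
      .cache()                         // keep the filtered RDD in memory once computed

    // Actions trigger execution; the second action reuses the cached data.
    println(errors.count())
    errors.take(5).foreach(println)

    spark.stop()
  }
}
```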
3. What do you understand by RDD?
RDD stands for Resilient Distributed Dataset. It is a fault-tolerant collection of elements that can be operated on in parallel; the partitioned data of an RDD is distributed and immutable. There are two types of RDDs:
Parallelized collections: existing collections from the driver program, distributed so they can be operated on in parallel.
Hadoop datasets: RDDs created from files in HDFS or other storage systems, performing operations on each file record.
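A minimal sketch of creating each type, assuming an existing SparkSession named `spark`; the HDFS path is hypothetical:

```scala
val sc = spark.sparkContext

// Parallelized collection: distribute a local Scala collection
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Hadoop dataset: an RDD over the records of a file in HDFS
val logs = sc.textFile("hdfs:///data/app/events.log")  // hypothetical path

println(nums.sum())   // operates on the parallelized collection
println(logs.count()) // operates on the Hadoop dataset
```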
4. What does DAG refer to in Apache Spark?
DAG stands for Directed Acyclic Graph: a graph with a finite number of vertices and edges and no directed cycles, in which each edge points from one vertex to another in sequence. In Spark, the vertices represent RDDs and the edges represent the operations to be performed on those RDDs.
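As a rough illustration, the lineage (the DAG Spark records for a chain of transformations) can be inspected with toDebugString; this sketch assumes a SparkSession named `spark` and a hypothetical input file:

```scala
val wordCounts = spark.sparkContext
  .textFile("data.txt")          // hypothetical input file
  .flatMap(_.split(" "))         // each transformation adds a vertex (new RDD)
  .map(word => (word, 1))        // and an edge (the operation) to the graph
  .reduceByKey(_ + _)

// toDebugString prints the recorded lineage, i.e. the DAG of RDDs
println(wordCounts.toDebugString)
```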
5. List the types of Deploy Modes in Spark.
There are 2 deploy modes in Spark. They are:
Client Mode: The deploy mode is said to be client mode when the Spark driver component runs on the machine from which the Spark job is submitted.
The main disadvantage of this mode is that if the machine node fails, the entire job fails.
This mode supports both interactive shells and job-submission commands.
Its performance is the worst of the modes, and it is not preferred in production environments.
Cluster Mode: If the Spark driver component does not run on the machine from which the Spark job was submitted, the deploy mode is said to be cluster mode.
The Spark job launches the driver component within the cluster as a sub-process of the ApplicationMaster.
This mode supports deployment only via the spark-submit command (interactive shell mode is not supported).
Since the driver program runs inside the ApplicationMaster, it is re-instantiated if it fails.
In this mode, a dedicated cluster manager (such as standalone, YARN, Apache Mesos, or Kubernetes) allocates the resources required for the job to run.
Apart from the above two modes, if the application has to run on a local machine for unit testing and development, the deployment mode is called "Local Mode". Here, jobs run in a single JVM on a single machine, which is highly inefficient: sooner or later there will be a shortage of resources, causing jobs to fail. Scaling up resources is also not possible in this mode due to the restricted memory and space.
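As a hedged illustration, the deploy mode is typically selected with the --deploy-mode flag of spark-submit; the application class, JAR name, and YARN master here are placeholders:

```bash
# Client mode: the driver runs on the submitting machine
spark-submit --master yarn --deploy-mode client \
  --class com.example.MyApp myapp.jar

# Cluster mode: the driver runs inside the cluster,
# as part of the ApplicationMaster on YARN
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.MyApp myapp.jar
```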
6. What are receivers in Apache Spark Streaming?
Receivers are the entities that consume data from different data sources and move it into Spark for processing. They are created using streaming contexts as long-running tasks scheduled to operate in a round-robin fashion, with each receiver configured to use a single core. Receivers run on various executors to accomplish the task of data streaming. There are two types of receivers, depending on how the data is sent to Spark:
Reliable receivers: the receiver sends an acknowledgement to the data source once the data has been successfully received and replicated in Spark storage.
Unreliable receivers: no acknowledgement is sent to the data source.
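A minimal receiver-based sketch using the built-in socket source, which runs as an unreliable receiver; the host and port are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// At least 2 local cores: one for the receiver, one for processing
val conf = new SparkConf().setAppName("ReceiverDemo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

// The socket source's receiver sends no acknowledgement to the data source
val lines = ssc.socketTextStream("localhost", 9999)  // placeholder host/port
lines.count().print()

ssc.start()
ssc.awaitTermination()
```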
7. What is the difference between repartition and coalesce?
| Repartition | Coalesce |
| --- | --- |
| Can increase or decrease the number of data partitions. | Can only reduce the number of data partitions. |
| Creates new data partitions and performs a full shuffle, distributing the data evenly. | Reuses existing partitions to reduce the amount of data shuffled, which can leave partitions unevenly sized. |
| Internally calls coalesce with the shuffle parameter set to true, which makes it slower than coalesce. | Faster than repartition, though subsequent processing may be slightly slower if the resulting partitions are unequal in size. |
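A small sketch of both operations, assuming a SparkSession named `spark`:

```scala
val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 8)

// repartition can scale the partition count up or down; it always performs a full shuffle
val wider = rdd.repartition(16)

// coalesce can only scale down; it merges existing partitions without a full shuffle
val narrower = rdd.coalesce(2)

println(s"original=${rdd.getNumPartitions}, " +
  s"repartitioned=${wider.getNumPartitions}, coalesced=${narrower.getNumPartitions}")
```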
8. What are the data formats supported by Spark?
Spark supports both raw files and structured file formats for efficient reading and processing. File formats such as Parquet, JSON, XML, CSV, RC, Avro, and TSV are supported by Spark.
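A hedged sketch of reading a few of these formats with the DataFrame API, assuming a SparkSession named `spark`; the file paths are hypothetical:

```scala
val users   = spark.read.parquet("data/users.parquet")
val events  = spark.read.json("data/events.json")
val ratings = spark.read
  .option("header", "true")   // first CSV line holds column names
  .csv("data/ratings.csv")

users.printSchema()
```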
9. What do you understand by shuffling in Spark?
Shuffling, or repartitioning, is the process of redistributing data across partitions, which may or may not move data across JVM processes or executors on separate machines. A partition is simply a smaller logical division of the data.
Note that Spark has no control over which partition the data gets distributed across; shuffling is triggered by wide transformations such as groupByKey, reduceByKey, and join.
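For instance, a wide transformation such as reduceByKey forces a shuffle, because all records with the same key must end up in the same partition (a minimal sketch assuming a SparkSession named `spark`):

```scala
val sc = spark.sparkContext
val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

// map is a narrow transformation: no shuffle needed
val pairs = words.map(w => (w, 1))

// reduceByKey is a wide transformation: it shuffles records so that
// all values for a key land in the same partition before reduction
val counts = pairs.reduceByKey(_ + _)
counts.collect().foreach(println)
```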
10. What is YARN in Spark?
YARN is one of the key integrations supported by Spark: it provides a central resource management platform for delivering scalable operations across the cluster. YARN is a cluster management technology (part of Hadoop), while Spark is a tool for data processing.