Spark Architecture
Before we dive deep into the Spark architecture, it is good to know a little about distributed computing.
Why Distributed Computing Gained Prominence:
As data volumes increased, processing that data became a huge challenge. A single machine does not have enough power and resources to perform computations on such huge amounts of data. This is where distributed computing comes into the picture.
In distributed computing, a cluster, or group, of computers pools the resources of many machines together, giving us the ability to use all of those resources as if they were a single computer. To achieve this, you need a distributed computing framework that coordinates the work across them.
Spark does just that, managing and coordinating the execution of tasks on data across a cluster of computers.
Here is a more detailed explanation of the advantages of distributed computing:
Scalability
If your data volume, read load, or write load grows bigger than a single machine can handle, you can potentially spread the load across multiple machines.
Fault tolerance/high availability
If your application needs to continue working even if one machine (or several machines, or the network, or an entire datacenter) goes down, you can use multiple machines to give you redundancy. When one fails, another one can take over.
Latency
If you have users around the world, you might want to have servers at various locations worldwide so that each user can be served from a datacenter that is geographically close to them. That spares users from waiting for network packets to travel halfway around the world.
Architecture:
Spark Driver:
At a high level, a Spark application consists of a driver program. The driver process runs your main() function and sits on a node in the cluster.
It is responsible for responding to a user's program or input, and for analyzing, distributing, and scheduling work across the executors.
The driver process is essential: it is the heart of a Spark application, and it maintains all relevant information during the lifetime of the application.
The driver accesses the distributed components in the cluster — the Spark executors and cluster manager — through a SparkSession.
SparkSession:
The SparkSession provides a single, unified entry point to all of Spark's functionality. Through it you can set JVM runtime parameters, define DataFrames and Datasets, read from data sources, access catalog metadata, and issue Spark SQL queries.
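As a minimal Scala sketch of what this looks like in practice, the example below builds a SparkSession, reads from a data source, and issues a Spark SQL query. The application name, file path, and view name are placeholder assumptions, not part of any real dataset.

```scala
import org.apache.spark.sql.SparkSession

object SparkSessionExample {
  def main(args: Array[String]): Unit = {
    // Single unified entry point to Spark's functionality.
    val spark = SparkSession.builder()
      .appName("SparkSessionExample")   // placeholder name
      .master("local[*]")               // run locally; normally set via spark-submit
      .getOrCreate()

    // Read from a data source (the path is a placeholder).
    val df = spark.read
      .option("header", "true")
      .csv("/path/to/events.csv")

    // Register a temp view and issue a Spark SQL query through the session.
    df.createOrReplaceTempView("events")
    spark.sql("SELECT COUNT(*) AS total_rows FROM events").show()

    spark.stop()
  }
}
```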
Cluster Manager:
The cluster manager is responsible for managing and allocating resources for the cluster of nodes on which your Spark application runs. Currently, Spark supports four cluster managers: the built-in standalone cluster manager, Apache Hadoop YARN, Apache Mesos, and Kubernetes.
Spark Executor:
A Spark executor runs on each worker node in the cluster. An executor is a Java Virtual Machine (JVM) process launched on a worker node. The executors communicate with the driver program and are responsible for executing tasks on the workers.
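To make this division of labor concrete, here is a small Scala sketch; the comments mark which parts run on the driver and which work the executors perform. The job itself (counting even numbers) is just an illustrative assumption.

```scala
import org.apache.spark.sql.SparkSession

object DriverVsExecutors {
  def main(args: Array[String]): Unit = {
    // main() runs inside the driver process.
    val spark = SparkSession.builder()
      .appName("DriverVsExecutors")     // placeholder name
      .getOrCreate()

    // The driver only records this transformation plan; nothing runs yet.
    val numbers = spark.range(0, 1000000)
    val evens   = numbers.filter("id % 2 = 0")

    // count() is an action: the driver breaks the job into tasks and
    // schedules them on the executors, which do the actual filtering.
    val result = evens.count()

    // Per-task results are sent back to the driver and aggregated there.
    println(s"even numbers: $result")

    spark.stop()
  }
}
```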
Ways to Run Spark:
Local
Standalone Scheduler
Hadoop YARN
Mesos
Kubernetes
Hadoop YARN and Kubernetes are the most widely used cluster technologies for running Spark. The sketch below shows how the master URL selects among these modes.
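The master URL passed to Spark determines which of these modes is used. In practice it is usually supplied via spark-submit's --master flag rather than hard-coded as below, and the hostnames and ports shown are placeholder assumptions.

```scala
import org.apache.spark.sql.SparkSession

object MasterUrlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MasterUrlExample")
      // Local mode: driver and executors share one JVM, here with 4 threads.
      .master("local[4]")
      // Standalone scheduler: .master("spark://master-host:7077")
      // Hadoop YARN:          .master("yarn")
      // Mesos:                .master("mesos://mesos-host:5050")
      // Kubernetes:           .master("k8s://https://k8s-apiserver:6443")
      .getOrCreate()

    // Confirm which cluster manager the session is bound to.
    println(s"Running against master: ${spark.sparkContext.master}")
    spark.stop()
  }
}
```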