Reading Notes Chapter 1–5
Spark Version 2.1; Scala version 2.11
Chapter 1 Set up Spark development environment
This chapter is actually pretty important: a well configured IDE speeds up development 10x. A couple of takeaways from this chapter:
- Update log4j.properties in the /conf folder to print INFO-level debugging information.
- The book covers how to set up the environment for Eclipse but not for IntelliJ; there are a couple of catches in setting up IntelliJ for Spark development (more on this below).
- Poking around examples inside Spark is fun.
- Java VisualVM is a useful debugging/visualization tool for understanding the status of a Java process.
IntelliJ Setup:
- Build Spark repo:
~/spark/spark (2.1.0)$ mvn install -DskipTests -Pyarn -Dscala-2.11
- When you import the project, choose Maven. After indexing is done, there will be a prompt asking whether you want to use sbt to import the project. DON’T.
- Project Structure:
- Then you can run any example in Spark’s examples folder.
Chapter 2 Design Philosophy and Architecture
Spark Features:
- Spark reduces disk IO by saving intermediate results from Map tasks in memory
- Increase parallelism: Spark groups tasks into stages and enables both serial and parallel stage execution.
- Avoid duplicate computation: when a task in a stage fails, the Spark scheduler filters out the succeeded tasks and reruns only the failed tasks/partitions.
- Flexible shuffle ordering: MapReduce fixes the sort order before shuffling, while Spark is free to choose whether to order data on the map side or the reduce side.
- Flexible memory management: Spark sets up memory regions for storage and execution. The boundary is “soft”, so each side can borrow idle memory from the other (see the sketch below).
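A hedged sketch of the two settings behind that soft boundary, assuming the unified memory manager (0.6 and 0.5 are, as far as I recall, the defaults around Spark 2.x):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.memory.fraction", "0.6")        // share of usable heap split between execution and storage
  .set("spark.memory.storageFraction", "0.5") // portion protected for storage; execution can borrow the rest, and idle storage memory can be borrowed back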
Other features:
- Checkpoint: lineage allows lost partitions to be recomputed, and checkpointing avoids replaying long lineages so recovery stays efficient.
- Easy to use: provide different language bindings & interactive shell.
- SQL support.
- Streaming support.
- A wide range of data source connectors.
- Abundant supported file types to read from.
Spark concepts:
- RDD: resilient distributed dataset.
- Partition: a slice of the dataset; the number of partitions determines the number of tasks.
- NarrowDependency: each partition of the child RDD depends on one or a fixed, small set of partitions of the parent RDD.
- ShuffleDependency: each partition of the child RDD may depend on all partitions of the parent RDD.
- Job: transformations are lazy; an action on an RDD submits a job, and a job is composed of multiple tasks.
- Stage: a job is split into stages at shuffle boundaries; each stage is a set of tasks that can run in parallel.
- Task: the smallest unit of execution, one per partition (see the sketch below).
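A tiny sketch to make these concrete, assuming an existing SparkContext named sc (e.g. from spark-shell): map stays inside a stage (NarrowDependency), reduceByKey introduces a ShuffleDependency and hence a stage boundary, and the collect action submits one job with one task per partition per stage:
val words  = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 2) // 2 partitions -> 2 tasks per stage
val pairs  = words.map(w => (w, 1))    // NarrowDependency: stays in the same stage
val counts = pairs.reduceByKey(_ + _)  // ShuffleDependency: stage boundary here
counts.collect()                       // action: submits 1 job (2 stages in this case)
println(counts.toDebugString)          // prints the lineage, including the shuffle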
Spark Modules:
1. Infrastructure
SparkConf, Spark RPC framework (based on netty), Spark Event Bus, Spark Metrics.
2. SparkContext
Spark context hides networking, messaging, storage, computation, metrics.. details from users.
3. SparkEnv
The runtime environment for tasks; Executors depend on it (covered in Chapter 5).
4. Storage System
How Spark uses memory & disk.
5. Scheduler System
DAGScheduler and TaskScheduler.
6. Computing Engine
MemoryManager, Tungsten, TaskMemoryManager, Task, ExternalSorter, ShuffleManager etc.
Chapter 3 Spark Infrastructure
Parts of this chapter are boring.. it primarily introduces some components, and most of them aren’t worth the pages. For example SparkConf: I don’t understand why the authors paste code on how to read, write, and clone a SparkConf. A brief summary below:
SparkConf: Nothing to talk about. A configuration map.
Spark RPC: just a typical RPC framework implemented with Netty. The book describes how it works in terms of pipeline, bootstrap, channel, and channel handler, which are all Netty basics..
ListenerBus: a mini Kafka, similar to Android’s EventBus (see the toy sketch below).
Metrics system: conceptually a Kafka broker, moving data from Sources to Sinks.
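A toy sketch of the ListenerBus pattern (my own simplification, not Spark’s code): keep a list of listeners and deliver every posted event to each of them, which is roughly what ListenerBus does minus the asynchronous queue that LiveListenerBus adds:
import scala.collection.mutable.ArrayBuffer

trait ToyListener { def onEvent(event: String): Unit }

class ToyListenerBus {
  private val listeners = ArrayBuffer.empty[ToyListener]
  def addListener(l: ToyListener): Unit = listeners += l
  def post(event: String): Unit = listeners.foreach(_.onEvent(event)) // deliver to every registered listener
}

val bus = new ToyListenerBus
bus.addListener(new ToyListener { def onEvent(e: String): Unit = println(s"got $e") })
bus.post("jobStart")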
Chapter 4 Initialization of SparkContext
SparkContext is the engine of the whole Spark framework. It includes the following components (a minimal construction sketch follows the list):
- SparkEnv: the runtime environment; the Executor depends on SparkEnv. SparkEnv includes multiple components, e.g. SerializerManager, RpcEnv, BlockManager, MapOutputTracker..
- LiveListenerBus: receive events and invoke SparkListener.
- SparkUI: SparkUI reads from different SparkListeners and presents data in the web interface.
- SparkStatusTracker: monitoring information on jobs and stages.
- DAGScheduler: job creation, splitting RDDs into stages, stage submission.
- TaskScheduler: schedule tasks on the resources allocated by the resource management system.
- HeartbeatReceiver: receive heartbeat information from executors and pass to TaskScheduler.
- ContextCleaner: clean up crap.
- JobProgressListener: monitor job progress.
- EventLoggingListener: persist events to disk.
- ExecutorAllocationManager: dynamic management of executors.
- ShutdownHookManager: manage shutdown hooks.
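For reference, a minimal construction sketch from user code; building the SparkContext is what creates the components listed above (the app name and master here are arbitrary):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("reading-notes").setMaster("local[2]")
val sc = new SparkContext(conf) // creates SparkEnv, LiveListenerBus, SparkUI, DAGScheduler, TaskScheduler, ...
// ... run jobs ...
sc.stop()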
SparkEnv Creation
SparkEnv is needed wherever a task needs to be executed. The createDriverEnv method takes the following information from SparkConf (a config sketch follows the list):
- bindAddress: host of driver instance.
- advertiseAddress: host name advertised externally.
- port: port of the driver instance.
- ioEncryptionKey: I/O encryption key.
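As far as I can tell these correspond to the driver settings below (a hedged sketch; the host and port values are made up, and the encryption key itself is generated internally when I/O encryption is enabled rather than set directly):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.bindAddress", "0.0.0.0")     // bindAddress: interface the driver binds to
  .set("spark.driver.host", "driver.example.com") // advertiseAddress: host name advertised to executors
  .set("spark.driver.port", "40000")              // port of the driver's RpcEnv
  .set("spark.io.encryption.enabled", "true")     // enables generation of the ioEncryptionKey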
SparkUI Implementation
Message passing through RPC invocation does not scale well: most calls are synchronous and occupy threads, which limits the scalability and throughput of the system. A decoupled, asynchronous producer/consumer design greatly improves scalability.
Sources of SparkListenerEvent: DAGScheduler, SparkContext, DriverEndpoint, BlockManagerMasterEndpoint, LocalSchedulerBackend.
ListenerBus: ListenerBus matches each SparkListenerEvent to the SparkListeners that should receive it.
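A small example of hooking into this from user code, assuming an existing SparkContext sc: subclass SparkListener, override the callbacks you care about, and register it so the bus delivers matching events to it:
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

class JobLoggingListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"Job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stages")
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
}

sc.addSparkListener(new JobLoggingListener) // events flow: producers -> LiveListenerBus -> this listener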
WebUI Framework: not very interesting.
Heartbeat Receiver Instantiation: driver creates HeartbeatReceiver to track executor status.
Scheduling System Instantiation
TaskScheduler is responsible for requesting resources from the resource management system (YARN, Mesos), which is level-1 scheduling, and for scheduling tasks onto the executors, which is level-2 scheduling.
DAGScheduler prepares the tasks (creates the job, splits the RDD into stages, submits stages) and passes them to TaskScheduler.
BlockManager Instantiation
This includes all the functionality and components of the storage system.
Metrics System
The metrics system encapsulates Sources and Sinks and delivers data from Sources to Sinks.
Event Logging Listener
EventLoggingListener persists filtered events to logs.
ExecutorAllocationManager
ExecutorAllocationManager dynamically adds and removes executors based on the workload.
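The relevant settings, as a hedged sketch (dynamic allocation also needs the external shuffle service so that removed executors don’t take their shuffle files with them):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "50")
  .set("spark.shuffle.service.enabled", "true") // external shuffle service keeps serving shuffle files after an executor is removed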
ContextCleaner
ContextCleaner cleans up RDDs, map output status, metadata, checkpoints, etc. that are no longer in use.
Chapter 5 Spark Execution Environment
This book is not well edited.. some of its sections are not well structured. The content is still relevant, though.
This chapter primarily covers SparkEnv. Initially I thought it was just environment setup, but it turns out to be a runtime. Not sure why it isn’t just named SparkRuntime 🤔
Components of SparkEnv
SecurityManager
SecurityManager manages accounts, permissions, and authentication (it generates a secret key and saves it to the Hadoop UGI when running on YARN).
RPC environment
The Netty-based RpcEnv replaced Akka (Akka was removed as of Spark 2.x).
- RpcEndpoint: replaces Actor in Akka.
- RpcEndpointRef: replaces ActorRef in Akka.
- RPC message delivery rules: at most once, at least once, exactly once. (Medium still has no support for nested lists? wth?)
- TransportConf: configuration for transporting.
- Dispatcher: the message dispatching system. It has a pool of message-processing threads; a thread is stopped via a PoisonPill type of message (see the toy sketch after this list).
- TransportContext: holds the RpcHandler, whose implementation here is NettyRpcHandler.
- TransportClientFactory: creates clients that send requests to remote servers.
- TransportServer: accepts requests and responds.
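A toy, self-contained illustration of the PoisonPill idea mentioned above (my own sketch, not Spark’s Dispatcher): the worker thread loops on a blocking queue and exits only when it sees the special message:
import java.util.concurrent.LinkedBlockingQueue

object PoisonPillDemo extends App {
  case object PoisonPill
  val queue = new LinkedBlockingQueue[Any]()

  val worker = new Thread(new Runnable {
    def run(): Unit = {
      var running = true
      while (running) queue.take() match {
        case PoisonPill => running = false             // graceful shutdown signal
        case msg        => println(s"processing $msg") // normal message handling
      }
    }
  })
  worker.start()

  queue.put("RpcMessage-1")
  queue.put("RpcMessage-2")
  queue.put(PoisonPill) // tells the worker to stop after draining earlier messages
  worker.join()
}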
SerializerManager
SparkEnv has two components for serialization — SerializerManager and ClosureSerializer.
- CompressionCodec: compresses data; defaults to lz4.
BroadcastManager
Saves configuration, RDDs, jobs, ShuffleDependencies, etc. to local storage and can replicate them to other nodes for disaster recovery (a usage sketch follows).
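The user-facing side of this, assuming an existing SparkContext sc: broadcast a read-only value once instead of shipping it inside every task closure:
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2)) // distributed to executors via BroadcastManager
val mapped = sc.parallelize(Seq("a", "b", "c"))
  .map(k => lookup.value.getOrElse(k, 0))          // tasks read the broadcast value locally
println(mapped.collect().mkString(","))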
MapOutputTracker
Tracks the output status of map tasks so that reduce tasks can locate the addresses of the map tasks’ output. Every map or reduce task has a unique id. Each reduce task pulls blocks from the nodes of one or more map tasks; this process is called shuffle, and every shuffle has a unique shuffle id.
In Driver:
Create MapOutputTrackerMaster and a MapOutputTrackerMasterEndpoint, and register the endpoint with the Dispatcher by name.
In Executor:
Create MapOutputTrackerWorker, get MapOutputTrackerMasterEndpoint’s reference from Dispatcher.
MapOutputTrackerWorker sends tracking information about map tasks to MapOutputTrackerMaster via the master’s RpcEndpointRef. The master is responsible for collecting and maintaining all map output tracking information.
The master receives two types of requests:
- obtain the intermediate output status of map tasks
- stop tracking the intermediate output of map tasks
Start Storage System
Includes ShuffleManager, MemoryManager, BlockTransferService, BlockManagerMaster, DiskBlockManager, BlockInfoManager and BlockManager.
Bootstrap Metrics System
MetricsConfig: stores the configuration of the MetricsSystem (a config sketch follows the method list below).
Common Methods in MetricsSystem
- buildRegistryName: builds the registry name for a source before registering it with the MetricRegistry.
- registerSource: registers a source with the MetricsSystem.
- Get the ServletContextHandlers (used by the web UI to expose metrics).
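As a hedged sketch, sinks can also be configured through SparkConf keys with the spark.metrics.conf. prefix (equivalent to entries in metrics.properties), e.g. a console sink:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.metrics.conf.*.sink.console.class", "org.apache.spark.metrics.sink.ConsoleSink")
  .set("spark.metrics.conf.*.sink.console.period", "10")    // report every 10 ...
  .set("spark.metrics.conf.*.sink.console.unit", "seconds") // ... seconds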
OutputCommitCoordinator
The OutputCommitCoordinator (OCC) arbitrates whether a task attempt is allowed to commit its output to HDFS.
Thoughts
Spark is a mature computing engine with a well designed architecture. I wish I had read more about Spark before creating the TonY project. A lot of the components and ideas in Spark could be learned from and referenced when building another level-2 orchestration system and computing engine.