Reading Notes Chapter 1–5
Spark Version 2.1; Scala version 2.11
Chapter 1 Set up Spark development environment
This chapter is actually pretty important: a well configured IDE speeds up development 10x. A couple of takeaways from this chapter:
- Update log4j.properties in the /conf folder to print INFO-level debugging information.
- The book covers how to set up the environment for Eclipse but not for IntelliJ; there are a couple of catches in setting up IntelliJ for Spark development (more on this below).
- Poking around examples inside Spark is fun.
- Java VisualVM is a useful debugging/visualization tool for understanding the status of a Java process.
IntelliJ Setup:
- Build Spark repo:
~/spark/spark (2.1.0)$ mvn install -DskipTests -Pyarn -Dscala-2.11
- When you import the project, choose Maven. After indexing is done, there will be a prompt asking whether you want to use sbt to import the project. DON’T.
- Project Structure:
- Then you can run any example in Spark’s examples folder.
Chapter 2 Design Philosophy and Architecture
Spark Features:
- Spark reduces disk IO by saving intermediate results from Map tasks in memory
- Increase parallelism: Spark groups tasks into stages and enables both serial and parallel stage execution.
- Avoid duplicate computation: when a task in a stage fails, the Spark scheduler filters out the succeeded tasks and reruns only the failed tasks/partitions.
- Flexible shuffle ordering: MapReduce fixes the sort order before shuffling, while Spark is free to choose whether to order data on the map side or the reduce side.
- Flexible memory management: Spark sets up memory regions for storage and execution. The boundary is “soft”, so each side can borrow idle memory from the other (see the sketch below).
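A hedged sketch of the two settings behind that soft boundary, assuming the unified memory manager (0.6 and 0.5 are, as far as I recall, the defaults around Spark 2.x):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.memory.fraction", "0.6")        // share of usable heap split between execution and storage
  .set("spark.memory.storageFraction", "0.5") // portion protected for storage; execution can borrow the rest, and idle storage memory can be borrowed back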
Other features:
- Checkpoint: lineage allows lost partitions to be recomputed, and checkpointing avoids replaying long lineages so recovery stays efficient.
- Easy to use: provide different language bindings & interactive shell.
- SQL support.
- Streaming support.
- A wide range of data source connectors.
- Abundant supported file types to read from.
Spark concepts:
- RDD: resilient distributed dataset.
- Partition: a slice of the dataset; the number of partitions determines the number of tasks.
- NarrowDependency: each partition of the child RDD depends on one or a fixed, small set of partitions of the parent RDD.
- ShuffleDependency: each partition of the child RDD may depend on all partitions of the parent RDD.
- Job: transformations are lazy; an action on an RDD submits a job, and a job is composed of multiple tasks.
- Stage: a job is split into stages at shuffle boundaries; each stage is a set of tasks that can run in parallel.
- Task: the smallest unit of execution, one per partition (see the sketch below).
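A tiny sketch to make these concrete, assuming an existing SparkContext named sc (e.g. from spark-shell): map stays inside a stage (NarrowDependency), reduceByKey introduces a ShuffleDependency and hence a stage boundary, and the collect action submits one job with one task per partition per stage:
val words  = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 2) // 2 partitions -> 2 tasks per stage
val pairs  = words.map(w => (w, 1))    // NarrowDependency: stays in the same stage
val counts = pairs.reduceByKey(_ + _)  // ShuffleDependency: stage boundary here
counts.collect()                       // action: submits 1 job (2 stages in this case)
println(counts.toDebugString)          // prints the lineage, including the shuffle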
Spark Modules:
1. Infrastructure
SparkConf, Spark RPC framework (based on netty), Spark Event Bus, Spark Metrics.
2. SparkContext
Spark context hides networking, messaging, storage, computation, metrics.. details from users.
3. SparkEnv
The runtime environment for tasks; Executors depend on it (covered in Chapter 5).
4. Storage System
How Spark uses memory & disk.
5. Scheduler System
DAGScheduler and TaskScheduler.
6. Computing Engine
MemoryManager, Tungsten, TaskMemoryManager, Task, ExternalSorter, ShuffleManager etc.
Chapter 3 Spark Infrastructure
Parts of this chapter are boring.. it primarily introduces some components, and most of them aren’t worth the pages. For example SparkConf: I don’t understand why the authors paste code on how to read, write, and clone a SparkConf. A brief summary below:
SparkConf: Nothing to talk about. A configuration map.
Spark RPC: just a typical RPC framework implemented with Netty. The book describes how it works in terms of pipeline, bootstrap, channel, and channel handler, which are all Netty basics..
ListenerBus: a mini Kafka, similar to Android’s EventBus (see the toy sketch below).
Metrics system: conceptually a Kafka broker, moving data from Sources to Sinks.
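A toy sketch of the ListenerBus pattern (my own simplification, not Spark’s code): keep a list of listeners and deliver every posted event to each of them, which is roughly what ListenerBus does minus the asynchronous queue that LiveListenerBus adds:
import scala.collection.mutable.ArrayBuffer

trait ToyListener { def onEvent(event: String): Unit }

class ToyListenerBus {
  private val listeners = ArrayBuffer.empty[ToyListener]
  def addListener(l: ToyListener): Unit = listeners += l
  def post(event: String): Unit = listeners.foreach(_.onEvent(event)) // deliver to every registered listener
}

val bus = new ToyListenerBus
bus.addListener(new ToyListener { def onEvent(e: String): Unit = println(s"got $e") })
bus.post("jobStart")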
Chapter 4 Initialization of SparkContext
SparkContext is the engine of the whole Spark framework. It includes the following components (a minimal construction sketch follows the list):
- SparkEnv: the runtime environment; the Executor depends on SparkEnv. SparkEnv includes multiple components, e.g. SerializerManager, RpcEnv, BlockManager, MapOutputTracker..
- LiveListenerBus: receive events and invoke SparkListener.
- SparkUI: SparkUI reads from different SparkListeners and presents data in the web interface.
- SparkStatusTracker: monitoring information on jobs and stages.
- DAGScheduler: job creation, splitting RDDs into stages, stage submission.
- TaskScheduler: schedule tasks on the resources allocated by the resource management system.
- HeartbeatReceiver: receive heartbeat information from executors and pass to TaskScheduler.
- ContextCleaner: clean up crap.
- JobProgressListener: monitor job progress.
- EventLoggingListener: persist events to disk.
- ExecutorAllocationManager: dynamic management of executors.
- ShutdownHookManager: manage shutdown hooks.
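For reference, a minimal construction sketch from user code; building the SparkContext is what creates the components listed above (the app name and master here are arbitrary):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("reading-notes").setMaster("local[2]")
val sc = new SparkContext(conf) // creates SparkEnv, LiveListenerBus, SparkUI, DAGScheduler, TaskScheduler, ...
// ... run jobs ...
sc.stop()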
SparkEnv Creation
SparkEnv is needed wherever a task needs to be executed. The createDriverEnv method takes the following information from SparkConf (a config sketch follows the list):
- bindAddress: host of driver instance.
- advertiseAddress: host name advertised externally.
- port: port of the driver instance.
- ioEncryptionKey: I/O encryption key.
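As far as I can tell these correspond to the driver settings below (a hedged sketch; the host and port values are made up, and the encryption key itself is generated internally when I/O encryption is enabled rather than set directly):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.bindAddress", "0.0.0.0")     // bindAddress: interface the driver binds to
  .set("spark.driver.host", "driver.example.com") // advertiseAddress: host name advertised to executors
  .set("spark.driver.port", "40000")              // port of the driver's RpcEnv
  .set("spark.io.encryption.enabled", "true")     // enables generation of the ioEncryptionKey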
SparkUI Implementation
Message passing through RPC invocation does not scale well: most calls are synchronous and occupy threads, which limits the scalability and throughput of the system. A decoupled, asynchronous producer/consumer design greatly improves scalability.
Sources of SparkListenerEvent: DAGScheduler, SparkContext, DriverEndpoint, BlockManagerMasterEndpoint, LocalSchedulerBackend.
ListenerBus: ListenerBus matches each SparkListenerEvent to the SparkListeners that should receive it.
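A small example of hooking into this from user code, assuming an existing SparkContext sc: subclass SparkListener, override the callbacks you care about, and register it so the bus delivers matching events to it:
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

class JobLoggingListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"Job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stages")
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
}

sc.addSparkListener(new JobLoggingListener) // events flow: producers -> LiveListenerBus -> this listener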
WebUI Framework: not very interesting.
Heartbeat Receiver Instantiation: driver creates HeartbeatReceiver to track executor status.
Scheduling System Instantiation
TaskScheduler is responsible for requesting resources from the resource management system (YARN, Mesos), which is level-1 scheduling, and for scheduling tasks onto the executors, which is level-2 scheduling.
DAGScheduler prepares the tasks (creates the job, splits the RDD into stages, submits stages) and passes them to TaskScheduler.
BlockManager Instantiation
This includes all the functionality and components of the storage system.
Metrics System
The metrics system encapsulates Sources and Sinks and delivers data from Sources to Sinks.
Event Logging Listener
EventLoggingListener persists filtered events to logs.
ExecutorAllocationManager
ExecutorAllocationManager dynamically adds and removes executors based on the workload.
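The relevant settings, as a hedged sketch (dynamic allocation also needs the external shuffle service so that removed executors don’t take their shuffle files with them):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "50")
  .set("spark.shuffle.service.enabled", "true") // external shuffle service keeps serving shuffle files after an executor is removed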
ContextCleaner
ContextCleaner cleans up RDDs, map output status, metadata, checkpoints, etc. that are no longer in use.
Chapter 5 Spark Execution Environment
This book is not well edited.. some of its sections are not well structured. The content is still relevant, though.
This chapter primarily covers SparkEnv. Initially I thought it was just environment setup, but it turns out to be a runtime. Not sure why it isn’t just named SparkRuntime 🤔
Components of SparkEnv
SecurityManager
SecurityManager manages accounts, permissions, and authentication (it generates a secret key and saves it to the Hadoop UGI when running on YARN).
RPC environment
The Netty-based RpcEnv replaced Akka (Akka was removed as of Spark 2.x).
- RpcEndpoint: replaces Actor in Akka.
- RpcEndpointRef: replaces ActorRef in Akka.
- RPC message delivery rules: at most once, at least once, exactly once. (Medium still has no support for nested lists? wth?)
- TransportConf: configuration for transporting.
- Dispatcher: the message dispatching system. It has a pool of message-processing threads; a thread is stopped via a PoisonPill type of message (see the toy sketch after this list).
- TransportContext: holds the RpcHandler, whose implementation here is NettyRpcHandler.
- TransportClientFactory: creates clients that send requests to remote servers.
- TransportServer: accepts requests and responds.
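A toy, self-contained illustration of the PoisonPill idea mentioned above (my own sketch, not Spark’s Dispatcher): the worker thread loops on a blocking queue and exits only when it sees the special message:
import java.util.concurrent.LinkedBlockingQueue

object PoisonPillDemo extends App {
  case object PoisonPill
  val queue = new LinkedBlockingQueue[Any]()

  val worker = new Thread(new Runnable {
    def run(): Unit = {
      var running = true
      while (running) queue.take() match {
        case PoisonPill => running = false             // graceful shutdown signal
        case msg        => println(s"processing $msg") // normal message handling
      }
    }
  })
  worker.start()

  queue.put("RpcMessage-1")
  queue.put("RpcMessage-2")
  queue.put(PoisonPill) // tells the worker to stop after draining earlier messages
  worker.join()
}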
SerializerManager
SparkEnv has two components for serialization — SerializerManager and ClosureSerializer.
- CompressionCodec: compresses data; defaults to lz4.
BroadcastManager
Saves configuration, RDDs, jobs, ShuffleDependencies, etc. to local storage and can replicate them to other nodes for disaster recovery (a usage sketch follows).
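The user-facing side of this, assuming an existing SparkContext sc: broadcast a read-only value once instead of shipping it inside every task closure:
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2)) // distributed to executors via BroadcastManager
val mapped = sc.parallelize(Seq("a", "b", "c"))
  .map(k => lookup.value.getOrElse(k, 0))          // tasks read the broadcast value locally
println(mapped.collect().mkString(","))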
MapOutputTracker
Tracks the output status of map tasks so that reduce tasks can locate the addresses of the map tasks’ output. Every map or reduce task has a unique id. Each reduce task pulls blocks from the nodes of one or more map tasks; this process is called shuffle, and every shuffle has a unique shuffle id.
In Driver:
Create MapOutputTrackerMaster and a MapOutputTrackerMasterEndpoint, and register the endpoint with the Dispatcher by name.
In Executor:
Create MapOutputTrackerWorker, get MapOutputTrackerMasterEndpoint’s reference from Dispatcher.
MapOutputTrackerWorker sends tracking information about map tasks to MapOutputTrackerMaster via the master’s RpcEndpointRef. The master is responsible for collecting and maintaining all map output tracking information.
The master receives two types of requests:
- obtain the intermediate output status of map tasks
- stop tracking the intermediate output of map tasks
Start Storage System
Includes ShuffleManager, MemoryManager, BlockTransferService, BlockManagerMaster, DiskBlockManager, BlockInfoManager and BlockManager.
Bootstrap Metrics System
MetricsConfig: stores the configuration of the MetricsSystem (a config sketch follows the method list below).
Common Methods in MetricsSystem
- buildRegistryName: builds the registry name for a source before registering it with the MetricRegistry.
- registerSource: registers a source with the MetricsSystem.
- Get the ServletContextHandlers (used by the web UI to expose metrics).
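As a hedged sketch, sinks can also be configured through SparkConf keys with the spark.metrics.conf. prefix (equivalent to entries in metrics.properties), e.g. a console sink:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.metrics.conf.*.sink.console.class", "org.apache.spark.metrics.sink.ConsoleSink")
  .set("spark.metrics.conf.*.sink.console.period", "10")    // report every 10 ...
  .set("spark.metrics.conf.*.sink.console.unit", "seconds") // ... seconds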
OutputCommitCoordinator
The OutputCommitCoordinator (OCC) arbitrates whether a task attempt is allowed to commit its output to HDFS.
Thoughts
Spark is a mature computing engine with a well designed architecture. I wish I had read more about Spark before creating the TonY project. A lot of the components and ideas in Spark could be learned from and referenced when building another level-2 orchestration system and computing engine.