Video Course
Mastering Spark Internals
Understand how Spark distributes workload and optimizes performance.
Prerequisites:
- Spark Programming Fundamentals
The best video course on Spark internals you'll find.
This video course is the result of years of research, studying source code, and working with Spark in professional settings. I have put everything I learned into this course to help you develop a profound understanding of Spark.
Outperform 90% of Spark developers out there.
Develop a profound understanding of how Spark works.
👩‍💻 Write better Spark code and reason about design decisions.
🧑‍🏫 Become an expert yourself.
📈 Lay an important foundation for optimizing performance.
💰 Be confident in theoretical interview questions.
Become a Pro-Level Spark developer.
Developing a deep understanding of internals will help you ...
... develop your skills in Spark even further,
... land your next job in Data Engineering,
... and radiate confidence and expertise when talking to colleagues.
What's in the box?
Deep-dive into Spark Core's execution model.
We will develop a profound understanding of how Spark executes workloads in a distributed manner.
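As a taste of the pipelining idea covered in the course, here is a conceptual Python sketch (plain Python, not Spark code): like Spark's narrow transformations, chained generators process one element at a time without materializing intermediate results.

```python
# Conceptual sketch only -- plain Python, not Spark. It mimics how Spark
# pipelines narrow transformations: each element flows through the whole
# chain lazily, with no intermediate collection materialized.

def spark_like_filter(predicate, partition):
    for element in partition:
        if predicate(element):
            yield element

def spark_like_map(func, partition):
    for element in partition:
        yield func(element)

# Build a pipeline over one "partition"; nothing runs until we consume it.
partition = range(10)
pipeline = spark_like_map(lambda x: x * 2,
                          spark_like_filter(lambda x: x % 2 == 0, partition))

# Consuming the iterator triggers the whole pipelined computation at once,
# analogous to calling an action on an RDD.
result = list(pipeline)  # [0, 4, 8, 12, 16]
```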
Learn how Spark automatically optimizes applications.
We will explore in detail how SparkSQL's powerful query optimization engine works.
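To hint at what a rule-based optimizer does, here is a deliberately tiny Python sketch (an illustration, not Catalyst's actual implementation): a single constant-folding rule is applied to an expression tree until the plan stops changing, which mirrors how Catalyst applies batches of rules to logical plans.

```python
# Toy illustration of rule-based optimization -- not Catalyst's real code.
# A "plan" is a nested tuple: ("add", left, right) or ("lit", value).

def constant_folding(plan):
    """Fold ("add", ("lit", a), ("lit", b)) into ("lit", a + b), bottom-up."""
    if plan[0] == "lit":
        return plan
    _, left, right = plan
    left, right = constant_folding(left), constant_folding(right)
    if left[0] == "lit" and right[0] == "lit":
        return ("lit", left[1] + right[1])
    return ("add", left, right)

def optimize(plan):
    # Like Catalyst, apply rules repeatedly until a fixed point is reached.
    while True:
        optimized = constant_folding(plan)
        if optimized == plan:
            return optimized
        plan = optimized

expr = ("add", ("lit", 1), ("add", ("lit", 2), ("lit", 3)))
optimized = optimize(expr)  # ("lit", 6)
```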
Understand resource allocation and execution on clusters.
We will cover in depth what happens when Spark executes distributed applications on a cluster.
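As a flavor of the scheduling topics in the outline below, here is a conceptual Python sketch (plain Python, no cluster involved): partitions become tasks that a pool of workers executes in parallel, loosely analogous to how the driver hands tasks to executor slots.

```python
# Conceptual sketch only: partitions processed as parallel tasks by a pool
# of workers, loosely analogous to tasks running in executor slots.
from concurrent.futures import ThreadPoolExecutor

def run_task(partition):
    # Each "task" processes exactly one partition of the data.
    return sum(x * x for x in partition)

partitions = [range(0, 5), range(5, 10), range(10, 15)]

# The pool plays the role of a fixed number of executor cores.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_results = list(pool.map(run_task, partitions))

# Combine per-partition results, as the driver would after an action.
total = sum(partial_results)
```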
Find a detailed outline below.
Your journey to mastering Spark internals ...
- High-level architecture (4:05)
- Important terminology (2:30)
- Understanding the local deployment mode (1:22)
- How is parallelism achieved? (5:36)
- Understanding Spark's foundation: MapReduce (7:39)
- How does Spark relate to MapReduce? (2:19)
- Introduction to RDDs (Resilient Distributed Datasets) (5:17)
- Spark's implementation of RDDs (3:02)
- DAG: The directed acyclic graph (4:23)
- Understanding narrow and wide dependencies (2:38)
- Dependencies vs. transformations (2:55)
- Spark Core's optimization: Pipelining (2:29)
- Physical planning in Spark Core (6:52)
- Tasks: The unit of execution (2:48)
- Scheduling of tasks (3:22)
- Task execution on executors (2:38)
- Memory management in Spark (10:43)
- Introduction to SparkSQL (3:02)
- Our example use-case (2:36)
- Implementation of the example use-case (1:23)
- Jupyter notebook, implementation & exploring plans (12:48)
- Catalyst: SparkSQL's optimization engine (3:04)
- What's a logical plan? (3:54)
- Planning step 1: Analysis (4:17)
- Planning step 2: Logical planning (6:02)
- Planning step 3: Physical planning (13:43)
- Planning step 4: Code generation (2:31)
Your Teacher
Philipp Brunenberg
- Bachelor's and Master's degree in computer science
- Almost a decade of experience as a freelance big data software engineer
- Expert-level experience in the distributed data processing framework Apache Spark
- Publishing content on his blog and YouTube channel
- Conference speaker
- Helped many of his students become professional Spark developers