Course Agenda

Module 1: Getting Familiar with Spark

  • Apache Spark in the Big Data landscape and the purpose of Spark
  • Apache Spark vs. Hadoop MapReduce
  • Components of the Spark Stack
  • Downloading and installing Spark
  • Launching Spark

Module 2: Working with Resilient Distributed Datasets (RDDs)

  • Transformations and Actions on RDDs
  • Loading and Saving Data in RDDs
  • Key-Value Pair RDDs
  • MapReduce and Pair RDD Operations
  • Playing with Sequence Files
  • Using Partitioners and their impact on performance (see the sketch below)
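
To make the hands-on topics above concrete, a minimal sketch is shown here, assuming Spark 2.x+ with the Java API and a local master; the class name and sample data are illustrative only. It chains lazy transformations, builds a key-value pair RDD MapReduce-style, applies an explicit HashPartitioner, and finishes with an action:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.HashPartitioner;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class RddBasics {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("RddBasics").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Transformations are lazy; nothing runs until an action is called.
            JavaRDD<String> lines = sc.parallelize(Arrays.asList("a b", "b c", "c a"));
            JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());

            // A key-value pair RDD built MapReduce-style: map to (word, 1), reduce by key.
            JavaPairRDD<String, Integer> counts = words
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            // An explicit partitioner controls how keys are spread across partitions,
            // which helps avoid repeated shuffles in later key-based operations.
            JavaPairRDD<String, Integer> partitioned = counts.partitionBy(new HashPartitioner(4));

            // collect() is an action: it triggers the computation and returns results to the driver.
            List<Tuple2<String, Integer>> result = partitioned.collect();
            result.forEach(System.out::println);

            sc.stop();
        }
    }

Partitioning up front pays off when the same pair RDD feeds several key-based operations, since later joins and reductions can reuse the existing partitioning instead of reshuffling.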

Module 3: Spark Application Programming

  • Master SparkContext
  • Initialize Spark with Java
  • Create and Run a Real-Time Project with Spark
  • Pass functions to Spark (see the sketch below)
  • Submit Spark applications to the cluster
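
As a preview of the application skeleton this module works toward (Java API assumed; class names and sample values are placeholders), the sketch below initializes Spark through SparkConf and JavaSparkContext and passes functions to Spark both as a named class and as a lambda:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;

    public class SparkInit {

        // A named function class is one way to pass behavior to Spark; it is
        // serialized and shipped to the executors together with the task.
        static class Square implements Function<Integer, Integer> {
            @Override
            public Integer call(Integer x) {
                return x * x;
            }
        }

        public static void main(String[] args) {
            // SparkContext is the entry point of a Spark application; SparkConf
            // carries the application name and the master URL.
            SparkConf conf = new SparkConf()
                    .setAppName("SparkInit")
                    .setMaster("local[2]");   // replaced by the real master when run on a cluster
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

            // Passing functions to Spark: a named class for map, a lambda for filter.
            long evenSquares = numbers.map(new Square())
                                      .filter(x -> x % 2 == 0)
                                      .count();

            System.out.println("Even squares: " + evenSquares);
            sc.stop();
        }
    }

Packaged as a JAR, the same program is what gets handed to spark-submit together with the main class and the master URL when the application is sent to the cluster.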

Module 4: Spark Libraries

Module 5: Spark Configuration, Monitoring, and Tuning

  • Understand the various components of a Spark cluster
  • Configure Spark by modifying the following (sketched after this list):
    • Spark properties
    • environment variables
    • logging properties
  • Visualizing Jobs and DAGs
  • Monitor Spark using the web UIs, metrics, and external instrumentation
  • Understand performance tuning requirements
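
As a preview of the configuration topics (Java API assumed; the property values shown are arbitrary examples), the sketch below sets Spark properties in code and prints the effective configuration, which also appears in the Environment tab of the web UI:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class ConfigDemo {
        public static void main(String[] args) {
            // Spark properties can come from spark-defaults.conf, spark-submit
            // options, or code; values set directly on SparkConf take precedence.
            SparkConf conf = new SparkConf()
                    .setAppName("ConfigDemo")
                    .setMaster("local[*]")
                    .set("spark.executor.memory", "2g")
                    .set("spark.default.parallelism", "8")
                    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");

            JavaSparkContext sc = new JavaSparkContext(conf);

            // The effective configuration is also visible in the Environment tab of
            // the web UI (http://localhost:4040 while this local job is running).
            for (Tuple2<String, String> entry : sc.getConf().getAll()) {
                System.out.println(entry._1() + " = " + entry._2());
            }

            sc.stop();
        }
    }

Environment variables are typically set in conf/spark-env.sh and logging through the log4j configuration shipped under conf/, which the module covers alongside these in-code properties.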

Module 6: Spark Streaming

  • Understanding the Streaming Architecture – DStreams and RDD batches
  • Receivers
  • Common transformations and actions on DStreams (see the sketch below)
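
A minimal sketch of the DStream model follows, assuming the Java streaming API and a plain text source on a local socket (the host, port, and batch interval are placeholders): each batch interval the received data becomes one RDD, and the transformations below are applied to every batch:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    import scala.Tuple2;

    public class StreamingSketch {
        public static void main(String[] args) throws InterruptedException {
            // The receiver occupies one core, so at least two local threads are needed.
            SparkConf conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]");

            // Every 5-second batch of input becomes one RDD inside the DStream.
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

            // A socket receiver; other receiver-based sources follow the same pattern.
            JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

            // Transformations on a DStream are applied to every RDD batch it produces.
            JavaPairDStream<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            // print() is an output operation, run once per batch.
            counts.print();

            jssc.start();
            jssc.awaitTermination();
        }
    }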

Module 7: MLlib and GraphX