
Big Data Processing with Apache Spark

Overview

Organizations across diverse disciplines are inundated with data. The ability to process large datasets effectively and efficiently is critical to making data-driven business decisions and to building data-intensive services such as recommendations, predictions, and diagnostics.

In this course, you will learn to use the Apache Spark framework for big data management and analysis, focusing on Spark's fundamental concepts, overall architecture, components, APIs, and language interfaces. The emphasis is on learning through practical examples and use cases drawn from real-world applications.

Apache Spark is an open-source cluster computing framework. While the well-established Hadoop platform relies on the disk-based MapReduce paradigm, Spark was designed from the ground up to exploit aggregate cluster memory for small- to medium-sized datasets and to scale gracefully to large ones. Spark offers a unified stack of tightly integrated components, making it exceptionally well suited to building applications that seamlessly combine different processing models. The framework caters to a variety of applications involving streaming, structured, unstructured, and graph data.


Who is this course for?

Big Data Processing with Apache Spark is for any data enthusiast who wants to derive value from datasets. It is an ideal course for aspiring data engineers and data scientists alike.


Prerequisites

Programming

This course is open to both beginner and experienced programmers; familiarity with programming in Java, C++, or Python is encouraged, and experience with Java or Scala is a plus. Students are expected to understand the basics of object-oriented programming and of compiling and running programs, to be familiar with basic Unix/Linux command-line utilities, and to be able to install open-source software on their laptops.

Source Code Management

We will use GitHub throughout the course for sharing and maintaining code, so students should familiarize themselves with common GitHub workflows: committing code, forking and cloning repositories, creating branches, and opening pull requests.

Data Processing

Beginner-level experience with SQL and related database query processing is recommended.


Outcomes

Upon completion of this course, students will:

Have an in-depth understanding of the fundamental concepts, design principles, and system architecture of Apache Spark.
Have hands-on experience tackling problems in processing and analyzing big data sets.
Be well prepared for Spark certification exams from leading companies like Databricks and Cloudera.


Course Structure and Syllabus

Week 1

History and Fundamentals

Introduce big data processing frameworks. Learn the history and use cases of Hadoop, including what brought us to Spark. Also learn the fundamentals of Scala programming.
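For a first taste of Scala, here is a minimal, self-contained sketch; the object name and values are illustrative, not taken from the course materials:

```scala
// Minimal Scala sketch: immutable values, type inference, and a simple method.
object Basics {
  def main(args: Array[String]): Unit = {
    val greeting: String = "Hello, Spark"  // immutable value with an explicit type
    val numbers = List(1, 2, 3, 4)         // type inferred as List[Int]

    // Methods are defined with `def`; the last expression is the return value.
    def square(x: Int): Int = x * x

    println(greeting)
    println(numbers.map(square))           // List(1, 4, 9, 16)
  }
}
```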

Week 2

Object-Oriented and Functional Programming

Learn object-oriented programming and functional programming in Scala. We cover important aspects of functional programming (higher-order functions, closures, collections, currying, etc.) that will be useful in writing good code with Spark.
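As a rough illustration of these ideas, the following sketch (names and values are ours, not the course's) shows a higher-order function, a closure, and a curried function:

```scala
// Sketch of this week's functional-programming concepts in Scala.
object FunctionalBasics {
  def main(args: Array[String]): Unit = {
    val xs = List(1, 2, 3, 4, 5)

    // Higher-order function: map takes another function as its argument.
    val doubled = xs.map(_ * 2)            // List(2, 4, 6, 8, 10)

    // Closure: the predicate captures `threshold` from the enclosing scope.
    val threshold = 3
    val large = xs.filter(_ > threshold)   // List(4, 5)

    // Currying: a function taking its arguments one parameter list at a time.
    def add(a: Int)(b: Int): Int = a + b
    val addTen: Int => Int = add(10)       // partial application

    println(doubled)
    println(large)
    println(addTen(5))                     // 15
  }
}
```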

Week 3

Basics of Spark

Introduce the basics of Spark, including the resilient distributed dataset (RDD) data model, the operations RDDs support, and working with key-value pair datasets. We deep-dive into dependencies, lineage, and stages to understand the architecture behind Spark's scalability and fault tolerance.
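A classic word count illustrates these concepts. The sketch below is a hypothetical example (the input path and app name are placeholders) showing RDD transformations over key-value pairs:

```scala
// Hypothetical word-count sketch illustrating RDDs and key-value operations.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // An RDD is an immutable, partitioned collection; each transformation
    // extends its lineage, which Spark uses to recompute lost partitions.
    val lines = sc.textFile("input.txt")       // path is illustrative
    val counts = lines
      .flatMap(_.split("\\s+"))                // RDD[String]
      .map(word => (word, 1))                  // RDD[(String, Int)]: key-value pairs
      .reduceByKey(_ + _)                      // the shuffle here marks a stage boundary

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```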

Week 4

Advanced Concepts in Spark

Continue the exploration of Spark by diving into the more advanced concepts of partitioning and broadcasts. Cover topics relevant to debugging and performance tuning.
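The following sketch, with made-up data and names, illustrates both techniques: pre-partitioning a pair RDD and shipping a read-only lookup table to executors as a broadcast variable:

```scala
// Sketch of partitioning and broadcast variables (data and names are illustrative).
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object TuningSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Tuning").setMaster("local[*]"))

    // Pre-partitioning a pair RDD co-locates keys, so later joins or
    // reduceByKey calls that reuse the partitioner avoid a full shuffle.
    val events = sc.parallelize(Seq(("user1", 3), ("user2", 5), ("user1", 7)))
      .partitionBy(new HashPartitioner(4))
      .cache()

    // A broadcast variable ships a read-only lookup table to each executor
    // once, instead of serializing it with every task.
    val countryByUser = sc.broadcast(Map("user1" -> "US", "user2" -> "DE"))
    val tagged = events.map { case (user, n) =>
      (user, countryByUser.value.getOrElse(user, "unknown"), n)
    }

    tagged.collect().foreach(println)
    sc.stop()
  }
}
```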

Week 5

Spark Libraries

Move up the level of abstraction by discussing Spark libraries, focusing on SQL-style processing with Spark DataFrames and Datasets. We build machine learning and predictive models with MLlib, using a recommendation engine as the running use case.
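As a sketch of what this might look like (the file name and column names are assumptions for illustration), one could combine DataFrame queries with MLlib's ALS recommender:

```scala
// Sketch: SQL-style processing with DataFrames plus an ALS-based recommender.
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.recommendation.ALS

object RecommenderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("Recommender").master("local[*]").getOrCreate()
    import spark.implicits._

    // DataFrame: an untyped, SQL-style API over structured data.
    val ratings = spark.read
      .option("header", "true").option("inferSchema", "true")
      .csv("ratings.csv")   // assumed columns: userId, movieId, rating

    ratings.filter($"rating" >= 4.0).groupBy($"movieId").count().show(5)

    // MLlib's ALS learns latent user/item factors for collaborative filtering.
    val model = new ALS()
      .setUserCol("userId").setItemCol("movieId").setRatingCol("rating")
      .fit(ratings)
    model.recommendForAllUsers(3).show(5, truncate = false)

    spark.stop()
  }
}
```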

Week 6

Streaming Data, Graph Data, and Future Directions

Learn to process streaming data (e.g., tweets from Twitter) and graph-structured data using Spark Streaming and GraphX/Bagel, respectively. Conclude by discussing future directions and potential next steps.
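For a flavor of the streaming model, here is a minimal Spark Streaming word count; the socket source, host, and port are illustrative placeholders:

```scala
// Minimal Spark Streaming sketch: word count over 10-second micro-batches.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Streaming").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))   // batch interval

    // Each batch of lines read from the socket becomes an RDD in the DStream.
    val lines = ssc.socketTextStream("localhost", 9999) // placeholder source
    val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```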