Course Abstract

Training duration: 90 minutes

The speed at which intelligent decision making can take place is changing the very fabric of modern industry. From autonomous routing decisions for real-time traffic avoidance to intelligent integrated systems for customer service and assisted analysis or forecasting, the time to a correct decision can be a huge differentiator. During this course, you will learn how to harness the power of Apache Spark to build a data ingestion pipeline that can process and learn from streaming data. You will then apply those techniques to make simple decisions with Spark SQL in a parallel streaming system. The workshop material draws on lessons learned building mission-critical real-time analytics systems at Twilio.
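The pipeline idea above rests on the micro-batch model behind Spark Structured Streaming: an unbounded event stream is processed as a sequence of small batches, with an aggregate state updated incrementally after each one. The toy sketch below illustrates that model in plain Python, with no Spark dependency; all function names and the sample stream are illustrative, not part of the course material.

```python
def micro_batches(events, batch_size):
    """Yield successive micro-batches from an event stream."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch

def running_counts(events, batch_size=3):
    """Incrementally update per-key counts, one micro-batch at a time,
    the way a streaming aggregation updates its state per trigger."""
    state = {}
    for batch in micro_batches(events, batch_size):
        for key in batch:
            state[key] = state.get(key, 0) + 1
    return state

# e.g. a stream of call-quality labels
stream = ["good", "poor", "good", "good", "poor", "good", "good"]
print(running_counts(stream))  # {'good': 5, 'poor': 2}
```

In real Structured Streaming the same pattern is expressed declaratively (a `groupBy(...).count()` over a streaming DataFrame) and Spark manages the state, fault tolerance, and triggers for you.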


Learning Objectives

  • How to write Apache Spark Structured Streaming applications

  • How to architect collaborative streaming applications that can run 24/7/365

  • How to model event data, and best practices for making your data work for you


Instructor Bio:

Scott Haines

Principal Software Engineer | Twilio

Scott Haines is a full-stack engineer with a current focus on real-time, highly available, trustworthy analytics systems. He works at Twilio as a Principal Software Engineer on the Voice Insights team, where he helped drive Spark adoption, shape streaming pipeline architectures, and build out a massive stream-processing platform. Prior to Twilio, he wrote backend Java APIs for Yahoo! Games, as well as the real-time game ranking/rating engine (built on Storm) that provided personalized recommendations and page views for 10 million customers. He finished his tenure at Yahoo! working for Flurry Analytics, where he wrote the alerts and notifications system for mobile.

Background knowledge

  • This course is for current or aspiring Data Engineers, Machine Learning Engineers, and Software Engineers

  • Knowledge of the following tools and concepts is useful:

      • Python and pandas DataFrames

      • Spark (Scala) or PySpark basics

      • Understanding of how to form SQL queries
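As a refresher on the last prerequisite, the kind of SQL the course assumes is grouped aggregation. The sketch below runs one such query with Python's stdlib sqlite3 module purely for illustration; in the workshop the same style of SQL would run through Spark SQL instead, and the table and column names here are made up.

```python
import sqlite3

# Build a small in-memory table of (region, quality_score) rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (region TEXT, quality_score REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [("us-east", 4.5), ("us-east", 3.0), ("eu-west", 4.0)],
)

# A grouped aggregation: average quality score per region.
rows = conn.execute(
    "SELECT region, AVG(quality_score) FROM calls "
    "GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('eu-west', 4.0), ('us-east', 3.75)]
```

If you can read and write a query like this, you have the SQL background the course expects.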

Real-world applications

  • Data science teams at advertising companies use Apache Spark for advertising channel analysis.

  • Apache Spark is used in distributed IoT applications for tracking and data management.

  • In the e-commerce industry, real-time transaction information can be fed to a streaming clustering algorithm through Apache Spark.