Course Abstract

Training duration: 1 hour 30 min (Hands-on)

One major problem encountered in the data science world is scalability. Working on a single computer limits how much and how fast you can process data. Many real-world datasets are larger than a single computer can process, so learning a parallel computing framework is increasingly necessary to stay productive. In this session, you will learn how to work, hands-on, with the Dask framework to build scalable transformations that support analytic applications.

DIFFICULTY LEVEL: INTERMEDIATE

Learning Objectives

  • Understand the types of problems solved with parallel computing

  • Identify the major components of Dask: Collection Types and Scheduler

  • Be familiar with types of parallel processing provided by Dask

  • Understand how graphs represent tasks with dependencies

  • Be able to explain the difference between Pandas and Dask DataFrames

  • Know how to examine graph processes using the scheduler dashboard

Instructor

Instructor Bio:

Data Science Consultant | Yerrington Consulting

David Yerrington

At the age of 8, David began learning the BASIC programming language while living in Alaska's outskirts. He studied music performance but began his career building a small software and consulting company in the late '90s. David's career spans almost 20 years, including several startups where he worked as a lead engineer building scalable data services from prototype to production. During his time at Sony/Gracenote, he led the implementation of prototypes featured at the Consumer Electronics Show, spanning recommendation, content classification, and profiling projects. David also held roles as a data scientist at a YC-backed dating app company and at an analytics startup, researching and building scalable recommendation pipelines. While working at General Assembly as a Lead Global Data Science Instructor, David helped architect the first significant versions of their data science immersive curriculum. He also piloted many of the hybrid Data Science Immersive programs still taught today. Currently, David consults and contracts full-time for various clients on projects ranging from NLP, recommendation, and big data to professional training for large and small teams. When not working, David enjoys playing the cello in orchestras and in a small group that performs classic video game covers.

Course Outline

Module 1:  Intro to Parallel Computing and Dask

In this module, we will briefly examine the concept of parallel computing and explore how Dask works, hands-on, using Jupyter notebooks.

- Understand the types of problems solved with parallel computing

- Identify the major components of Dask:  Collection Types and Scheduler

- Be familiar with types of parallel processing provided by Dask

- Understand how graphs represent tasks with dependencies
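
As a rough illustration of how a graph represents tasks with dependencies, the following minimal sketch uses `dask.delayed` to build a small graph lazily and then run it. It assumes Dask is installed; the function names `load`, `total`, and `add` are hypothetical and chosen only for this example, not taken from the course materials.

```python
from dask import delayed


@delayed
def load(n):
    # Stand-in for reading a chunk of data (hypothetical helper).
    return list(range(n))


@delayed
def total(values):
    # Depends on the output of one load() task.
    return sum(values)


@delayed
def add(a, b):
    # Depends on two total() tasks, which can run in parallel.
    return a + b


# Calling delayed functions only records tasks; nothing runs yet.
graph = add(total(load(3)), total(load(4)))

# compute() hands the graph to the scheduler, which respects dependencies.
result = graph.compute()
print(result)
```

The two `load`/`total` branches do not depend on each other, so the scheduler is free to run them concurrently before the final `add` step.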


Module 2: Pandas vs Dask

One of the most useful aspects of Dask is its DataFrame data type, which is modeled after the Pandas API.  We will work together on a few examples that show how Dask and Pandas are similar and how to use them together effectively.

- Be able to explain the difference between Pandas and Dask DataFrames

- Know how to examine graph processes using the scheduler dashboard

Background knowledge

  • Strong understanding of Python and Pandas required

  • Knowledge of Pandas aggregation and core data transformation methods

  • Ability to configure a Python environment and install packages

  • Familiarity with Jupyter Notebooks