Data Science Consultant | Yerrington Consulting
Understand the types of problems solved with parallel computing
Identify the major components of Dask: Collection Types and Scheduler
Be familiar with types of parallel processing provided by Dask
Understand how graphs represent tasks with dependencies
Explain the difference between Pandas and Dask DataFrames
Know how to examine graph processes using the scheduler dashboard
One major problem encountered in the data science world is scalability. Working on a single computer limits how much data you can process and how fast you can process it. Many real-world datasets are larger than a single machine's memory can hold, so learning a parallel computing framework becomes increasingly necessary to stay productive. In this session, you will work hands-on with the Dask framework to build scalable transformations that support analytic applications.
Module 1: Intro to Parallel Computing
This module will briefly examine the concept of parallel computing and which ideas are most relevant to how Dask works.
- Understand the types of problems solved with parallel computing
- Identify the major components of Dask: Collection Types and its Scheduler
- Be familiar with types of parallel processing provided by Dask
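As a rough illustration of how a collection and a scheduler fit together, the sketch below builds a small lazy computation with `dask.delayed` and then runs it under two of Dask's built-in local schedulers. The function names `square` and `add_all` are hypothetical and chosen only for this example; it assumes Dask is installed.

```python
from dask import delayed

@delayed
def square(x):
    # Each call becomes one task in the graph; nothing runs yet.
    return x * x

@delayed
def add_all(values):
    # Depends on every `square` task passed in.
    return sum(values)

# Build the computation lazily: four independent tasks feeding one reducer.
squares = [square(i) for i in range(4)]
total = add_all(squares)

# The scheduler decides how the tasks are executed.
result_threads = total.compute(scheduler="threads")      # thread pool
result_sync = total.compute(scheduler="synchronous")     # single thread, handy for debugging
```

The four `square` tasks do not depend on each other, so a parallel scheduler is free to run them at the same time; only `add_all` has to wait for all of them.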
Module 2: Intro to Dask
This module works through more specific, hands-on coding examples in Jupyter notebooks. It explores a few cases that illustrate the underlying data types provided by Dask and gives an overview of their trade-offs.
- Understand how graphs represent tasks with dependencies
- Examine tasks in real time using the Dask dashboard
- Assess trade-offs between various Dask data types
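To make the idea of a task graph concrete, here is a minimal pure-Python sketch written in the style of Dask's documented dictionary graph format, where each key maps either to a value or to a tuple of a function followed by its arguments. The evaluator `toy_get` is a hypothetical helper for illustration only; it is not Dask's scheduler and does nothing in parallel.

```python
from operator import add, mul

# Keys name results; a tuple (func, arg1, arg2, ...) is a task.
# String arguments that match a key are dependencies on that key.
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),          # depends on "x" and "y"
    "product": (mul, "sum", 10),     # depends on "sum"
}

def toy_get(graph, key):
    """Resolve `key` by recursively computing its dependencies first."""
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        resolved = [
            toy_get(graph, a) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        return func(*resolved)
    return task  # a plain value, not a task

result = toy_get(graph, "product")  # (1 + 2) * 10
```

Because the dependencies are explicit in the graph, a real scheduler can work out which tasks are independent and run those concurrently.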
Module 3: Pandas + Dask
One of the most useful aspects of Dask is its DataFrame data type, whose API closely mirrors that of Pandas. We will work together on a few examples that show where Dask and Pandas are similar, where they differ, and how to use them together effectively.
- Be able to explain the difference between Pandas and Dask DataFrames
- Become familiar with storage options
- Understand nuances with schema
This course is geared to data scientists, data engineers, machine learning engineers and software engineers of all levels who wish to gain a deep understanding of Parallel Computing with Dask and Pandas and how to apply it to real-world situations.
Strong understanding of Python and Pandas required.
Knowledge of Pandas aggregation and core data transformation methods.
Ability to configure a Python environment and install packages.
Familiarity with Jupyter Notebooks.
Access to live training and a Q&A session with the instructor
Access to the on-demand recording
Certificate of completion