Course Curriculum
1. Data Analytics at Scale: A Four-legged Stool
2. Why you can’t Apply Common Software Best Practices Directly to Data Workflows, and What you can do About it
3. Orchestrating Data Assets instead of Tasks, with Dagster
4. Cloud Directions, MLOps and Production Data Science

Data Analytics at Scale: A Four-legged Stool
In this talk, I will discuss four tactics that enable successful enterprise analytics efforts. The first concerns data integration. Because essentially all enterprise data resides in data silos, an integration effort is required before meaningful cross-silo analysis is possible. Data science practitioners routinely report spending at least 80% of their time on “data preparation” (aka data munging). I describe why this activity is hard and tactics that can make it less costly. Once one has clean cross-silo data, two further tactics entail using an analytics suite and an information discovery tool. The first is required to do data analytics, while the second is necessary when one doesn’t know what analysis to perform. I discuss desired features of each tool and make some comments about machine learning. The fourth tactic concerns data lakes and lakehouses. Please put everything in a DBMS, so that the integration challenge of data lakes is as manageable as possible.
Michael Stonebraker, PhD, Adjunct Professor @ MIT
Why you can’t Apply Common Software Best Practices Directly to Data Workflows, and What you can do About it
To derive the most value from data, data professionals must be able to set up their workflows in a way that maximizes not only their own efficiency and productivity but also data reproducibility. This presentation will outline the specific challenges to adopting software engineering best practices for data and analytics workflows, why those challenges exist, and how data scientists can craft environments that address common pitfalls and encourage reproducibility. We will cover specific actions leaders can take and offer real-life examples and use cases. Attendees will walk away with a deeper understanding of how to avoid common pitfalls and how to improve team collaboration and reproducibility in data workflows.
Anna Filippova, Director, Community & Data @ DBT
Orchestrating Data Assets instead of Tasks, with Dagster
Data practitioners use orchestrators to schedule and run the computations that keep data assets, like datasets and ML models, up to date. Traditional orchestrators think in terms of “tasks”. This talk discusses an alternative, declarative approach to data orchestration that puts data assets at the center. This approach, called “software-defined assets”, is implemented in Dagster, an open-source data orchestrator.
Sandy Ryza, Lead Engineer - Dagster Project @ Elementl
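The contrast between task-centric and asset-centric orchestration in the abstract above can be illustrated with a toy sketch. This is not Dagster's API: the `asset` decorator, the `ASSETS` registry, `materialize_all`, and the three example assets are all hypothetical names, written with only the Python standard library. The idea shown is that each function declares the asset it produces and the assets it depends on, and the run order is derived from those declarations rather than written out as a list of tasks.

```python
import inspect
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical registry mapping an asset name to the function that computes it.
ASSETS = {}


def asset(fn):
    """Register fn as the definition of the asset named after it (toy decorator)."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_orders():
    # Illustrative in-memory data standing in for a source table.
    return [{"id": 1, "amount": 30}, {"id": 2, "amount": 70}]


@asset
def order_total(raw_orders):
    # Depends on raw_orders; the dependency is declared by the parameter name.
    return sum(row["amount"] for row in raw_orders)


@asset
def report(order_total):
    return f"total revenue: {order_total}"


def materialize_all():
    """Compute every registered asset in dependency order and return the values."""
    # Assumption of this sketch: parameter names are the names of upstream assets.
    graph = {
        name: set(inspect.signature(fn).parameters)
        for name, fn in ASSETS.items()
    }
    values = {}
    # static_order() yields each asset only after all of its upstream assets.
    for name in TopologicalSorter(graph).static_order():
        upstream = {dep: values[dep] for dep in graph[name]}
        values[name] = ASSETS[name](**upstream)
    return values


if __name__ == "__main__":
    print(materialize_all()["report"])
```

Nothing in the sketch says "run step A, then step B"; the order falls out of what each asset says it needs, which is the declarative shift the abstract describes.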
Cloud Directions, MLOps and Production Data Science
Recent trends in cloud technology, including serverless computing, promise new approaches for abstracting away infrastructure. Unfortunately, these offerings fall short of the challenge of MLOps. In this talk we will cover some of the important promises and weaknesses of current cloud offerings, and describe research from Berkeley's RISELab and the resulting open-source Aqueduct system, which are putting Production Data Science at the fingertips of anyone working with data and models.
Joseph M. Hellerstein, Jim Gray Professor of Computer Science @ UC Berkeley