Course curriculum

  • 1

    What's new in Apache Airflow 2.3?

    • Abstract and Bio

    • What's new in Apache Airflow 2.3?

  • 2

    A Systematic Approach for Building Full-Spectrum Model Monitoring

    • Abstract and Bio

    • A Systematic Approach for Building Full-Spectrum Model Monitoring

  • 3

    Scaling Machine Learning with Data Mesh

    • Abstract and Bio

    • Scaling Machine Learning with Data Mesh

  • 4

    Human-Friendly, Production-Ready Data Science with Metaflow

    • Abstract and Bio

    • Human-Friendly, Production-Ready Data Science with Metaflow

  • 5

    Data Science, Meet Data Mesh: What We Can Learn from Bioinformatics about the Power of Standardization in Distributed Systems

    • Abstract and Bio

    • Data Science, Meet Data Mesh: What We Can Learn from Bioinformatics about the Power of Standardization in Distributed Systems

Abstracts and Speakers

What's new in Apache Airflow 2.3?

This session will talk about the awesome new features the community has built that were recently released in Apache Airflow 2.3.

Highlights:
- Dynamic Task Mapping
- First-class support for DB Downgrades
- Pruning old DB records (no need for maintenance DAGs anymore)
- Building Connections using JSON
- UI Improvements
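
As a rough illustration of the "Building Connections using JSON" highlight, the sketch below serializes a connection as JSON for an `AIRFLOW_CONN_*` environment variable, using only the Python standard library. The connection ID, host, and credentials are made up for this example, and the helper `connection_env` is hypothetical, not part of Airflow.

```python
import json
import os

# Hypothetical connection details, for illustration only.
conn = {
    "conn_type": "postgres",
    "host": "db.example.com",
    "login": "analytics",
    "password": "not-a-real-password",
    "port": 5432,
    "schema": "reports",
    "extra": {"sslmode": "require"},
}


def connection_env(conn_id: str, fields: dict) -> tuple[str, str]:
    """Return the (name, value) pair for an AIRFLOW_CONN_* environment variable.

    Hypothetical helper: it only upper-cases the connection ID and
    JSON-encodes the fields; it does not validate them.
    """
    return f"AIRFLOW_CONN_{conn_id.upper()}", json.dumps(fields)


name, value = connection_env("reports_db", conn)
os.environ[name] = value  # visible to processes started from this one

print(name)
print(value)
```

Compared with a connection URI, the JSON form keeps nested `extra` settings readable and avoids URL-encoding special characters in passwords.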

The talk will also cover the growth of the Airflow community over the years and why Airflow is still the de facto tool for workflow orchestration.


  Kaxil Naik, Director of Airflow Engineering @ Astronomer

A Systematic Approach for Building Full-Spectrum Model Monitoring

At many companies, ML models make high-stakes decisions each day, which makes it critical to monitor them to prevent poor decisions. However, there are many technical and organizational challenges in doing so effectively at scale. In this talk, we’ll discuss a systematic framework to build and roll out full-spectrum model monitoring for identifying and preventing problems with models. We’ll do a deep dive into Lyft’s model monitoring architecture (which includes real-time feature validation, performance drift detection, anomaly detection, and model score monitoring), how we leveraged open source, and the cultural change needed to get data scientists to effectively monitor their models. We’ll also discuss why we decided to build vs. buy, our wins and learnings, and why monitoring in itself may not be sufficient for preventing model degradation.


  Mihir Mathur, Product Manager @ Lyft

Scaling Machine Learning with Data Mesh

With the quick rise in popularity of Data Mesh, we now approach new frontiers in the Data Mesh space, solving for more complex scenarios such as model training at scale. This talk will discuss how to architect your Data Mesh platform to create scalable, self-service Machine Learning Data Products. Doing so allows both Data Scientists and Machine Learning Engineers to easily provision and deploy infrastructure, reducing time to market while also gaining all the benefits of Data Mesh.


  Shawn Kyzer, Principal Data Engineer @ Thoughtworks (Spain)

Human-Friendly, Production-Ready Data Science with Metaflow

In this talk, we discuss the problem space and the approach we took to solving it with Metaflow, the open-source framework we developed at Netflix, which now powers hundreds of business-critical ML projects at Netflix and other companies, from bioinformatics and drones to real estate. We wanted to provide the best possible user experience for data scientists, allowing them to focus on the parts they like (modeling using their favorite off-the-shelf libraries) while providing robust built-in solutions for the foundational infrastructure: data, compute, orchestration, and versioning.


  Ville Tuulos, Co-Founder @ Outerbounds

Data Science, Meet Data Mesh: What We Can Learn from Bioinformatics about the Power of Standardization in Distributed Systems

In this talk, we will argue for the large-scale adoption of data mesh principles to advance data science. Specifically, there is a need for domain-specific data standards, including well-defined data structures for key entities in the domain and metadata to support particular use cases. Examples will demonstrate how bioinformaticians create data pipelines that draw from data sources about gene (GenBank) and protein (UniProt) sequences, protein structures (Protein Data Bank), gene expression (Expression Atlas), bioactive molecules (ChEMBL), and metabolic and signaling pathways (KEGG Pathway Database). We will also review an example metadata standard for human pathogen genomic sequences and describe why domain-specific metadata is needed in addition to common metadata standards. The talk will conclude with tips on how to create data products within a data mesh architecture.


  Dan Sullivan, PhD, Principal Data Architect @ 4 Mile Analytics