Description

In this hands-on session, you'll learn how to generate high-quality synthetic data that preserves privacy using differential privacy techniques. We’ll walk through how to train differentially private generative models with MOSTLY AI’s open-source Synthetic Data SDK and explore how this method compares to traditional anonymization approaches in terms of both utility and risk. You’ll gain practical insights into configuring privacy parameters, understanding the impact of privacy budgets, and evaluating synthetic data output.

We’ll also cover how to assess the fidelity of synthetic datasets using predictive and discriminative machine learning models, and how to create hybrid datasets that blend real and synthetic data for improved utility. Through live demonstrations and real-world examples, you’ll develop a strong understanding of the privacy-utility trade-offs and how to confidently apply privacy-safe synthetic data in your own data science workflows.

Session Outline:

Lesson 1: Introduction to Differential Privacy
Get familiar with the core concepts of differential privacy and how it differs from traditional anonymization techniques. By the end of this lesson, you’ll be able to explain what differential privacy is, what a privacy budget (epsilon) means, and why it provides stronger privacy guarantees than pseudonymization or masking.
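To make the privacy budget concrete, here is a minimal sketch of the textbook Laplace mechanism applied to a counting query. It is not the training procedure used by the SDK; it only illustrates how a smaller epsilon forces more noise. The function names and the toy age data are assumptions made for this illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a noisy count. A counting query has sensitivity 1, so the
    Laplace scale is 1 / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(epsilon: float, trials: int = 2000, seed: int = 0) -> float:
    """Average absolute error of the noisy count over repeated releases."""
    rng = random.Random(seed)
    ages = [rng.randint(18, 90) for _ in range(1000)]  # toy data, not a real dataset
    true_count = sum(1 for a in ages if a >= 65)
    errors = [
        abs(private_count(ages, lambda a: a >= 65, epsilon, rng) - true_count)
        for _ in range(trials)
    ]
    return sum(errors) / trials
```

Comparing `mean_abs_error(0.1)` with `mean_abs_error(10.0)` shows the trade-off directly: the stricter budget gives a much larger average error, because the expected absolute Laplace noise equals 1 / epsilon.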

Lesson 2: Setting Up and Using the Synthetic Data SDK
Learn how to install and configure MOSTLY AI’s open-source Synthetic Data SDK to generate synthetic datasets with differential privacy enabled. You’ll run the SDK in LOCAL mode using a prepared dataset, explore the configuration options for privacy settings, and review the structure of the synthetic output.
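As a rough preview of what a LOCAL-mode run might look like, the sketch below builds a generator configuration with differential privacy switched on and then trains and samples from it. The configuration key names (`tabular_model_configuration`, `differential_privacy`, `max_epsilon`, `delta`), the install command, and the default values are assumptions based on the SDK's public documentation and should be checked against the version you install. The helper names `build_dp_config` and `train_and_generate` are hypothetical.

```python
def build_dp_config(data, max_epsilon=5.0, delta=1e-5):
    """Assemble a generator config dict with differential privacy enabled.

    Key names are assumptions to verify against the installed SDK version.
    `data` would typically be a pandas DataFrame or a path to a CSV file.
    """
    return {
        "name": "DP demo generator",  # hypothetical name
        "tables": [
            {
                "name": "data",
                "data": data,
                "tabular_model_configuration": {
                    "differential_privacy": {
                        "max_epsilon": max_epsilon,  # privacy budget cap
                        "delta": delta,
                    },
                },
            }
        ],
    }


def train_and_generate(data, size, **dp_kwargs):
    """Train a DP generator locally and return a synthetic sample.

    Assumes the SDK was installed with local extras, for example:
        pip install -U "mostlyai[local]"
    """
    from mostlyai.sdk import MostlyAI  # imported lazily so the config helper works without the SDK

    mostly = MostlyAI(local=True)  # LOCAL mode: training runs on this machine
    generator = mostly.train(config=build_dp_config(data, **dp_kwargs))
    synthetic = mostly.generate(generator, size=size)
    return synthetic.data()
```

Keeping the configuration in its own function makes it easy to train several generators that differ only in `max_epsilon`, which is the comparison the next lesson relies on.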

Lesson 3: Evaluating Utility vs. Privacy Trade-offs
Compare synthetic datasets generated with different privacy settings to understand how utility is impacted by stricter privacy budgets. By the end of this lesson, you’ll be able to evaluate the usefulness of differentially private synthetic data using predictive models and summary statistics.
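One common way to evaluate utility with a predictive model is "train on synthetic, test on real": fit the same model once on real training data and once on synthetic data, then score both on a real holdout. The sketch below uses a tiny nearest-centroid classifier from the standard library so it runs anywhere. The toy data generator and the noisier "stand-in synthetic" set are assumptions for illustration only and are not SDK output.

```python
import random
from statistics import mean


def fit_centroids(rows):
    """rows: list of (features, label). Returns {label: centroid vector}."""
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append(features)
    return {
        label: [mean(col) for col in zip(*feats)]
        for label, feats in by_label.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((f - c) ** 2 for f, c in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, rows):
    return mean([1.0 if predict(centroids, f) == y else 0.0 for f, y in rows])


def tstr_report(real_train, synthetic, real_holdout):
    """Score a real-trained and a synthetic-trained model on the same real holdout."""
    train_real = accuracy(fit_centroids(real_train), real_holdout)
    train_synthetic = accuracy(fit_centroids(synthetic), real_holdout)
    return {
        "train_real": train_real,
        "train_synthetic": train_synthetic,
        "gap": train_real - train_synthetic,
    }


def toy_rows(n, rng, noise=1.0):
    """Two Gaussian classes in 2-D; purely illustrative data."""
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        rows.append(([rng.gauss(3.0 * y, noise), rng.gauss(-2.0 * y, noise)], y))
    return rows


rng = random.Random(42)
real_train, real_holdout = toy_rows(500, rng), toy_rows(500, rng)
stand_in_synthetic = toy_rows(500, rng, noise=2.0)  # stand-in, NOT generated by the SDK
report = tstr_report(real_train, stand_in_synthetic, real_holdout)
print(report)
```

A gap close to zero suggests the synthetic data supports the predictive task about as well as the real data does. Repeating the report for generators trained under different privacy budgets gives one view of the privacy-utility trade-off.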

Lesson 4: Creating Hybrid Datasets with Real and Synthetic Data
Explore how to combine real and synthetic data to create hybrid datasets that retain utility while improving privacy. You’ll walk through a practical example and learn how to use synthetic data to augment or replace sensitive parts of your dataset.
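The two hybrid patterns mentioned here, augmenting and replacing, can be sketched with plain Python dictionaries. This is one possible reading of "hybrid dataset", not necessarily the approach shown in the session. The function names, the `is_synthetic` flag, and the example `vip` column are hypothetical.

```python
import random


def augment(real_rows, synthetic_rows, n_synthetic, seed=0):
    """Append n_synthetic sampled synthetic rows and tag every row with its origin.

    Raises ValueError if fewer than n_synthetic synthetic rows are available.
    """
    rng = random.Random(seed)
    sampled = rng.sample(synthetic_rows, n_synthetic)
    return (
        [{**row, "is_synthetic": False} for row in real_rows]
        + [{**row, "is_synthetic": True} for row in sampled]
    )


def replace_sensitive(real_rows, synthetic_rows, is_sensitive, seed=0):
    """Drop real rows flagged as sensitive and substitute as many synthetic rows."""
    kept = [row for row in real_rows if not is_sensitive(row)]
    n_removed = len(real_rows) - len(kept)
    return augment(kept, synthetic_rows, n_removed, seed)


real = [
    {"age": 30, "vip": False},
    {"age": 70, "vip": True},
    {"age": 45, "vip": False},
]
synthetic = [{"age": 33, "vip": False}, {"age": 68, "vip": True}]

hybrid = replace_sensitive(real, synthetic, lambda row: row["vip"])
print(hybrid)
```

The origin flag keeps it possible to audit which records are real. Any real rows that remain in the hybrid set are not covered by the differential privacy guarantee of the generator.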

Difficulty: Intermediate

Pre-reqs: 

This tutorial is designed for data engineers, data scientists, ML engineers, and analysts with basic Python skills and familiarity with Jupyter Notebooks. Attendees should have a general understanding of machine learning workflows and of working with tabular datasets (e.g., CSV files or pandas DataFrames). No prior experience with synthetic data is required.

To participate fully in the hands-on exercises, attendees should have Python 3.11+ and Git installed before the session. Installation and setup of the Synthetic Data SDK will be covered as part of the tutorial, but feel free to get started beforehand by visiting https://github.com/mostly-ai/mostlyai.


Local ODSC chapter in NYC, USA

Instructor's Bio

Dr. Michael Platzer

Co-Founder and CTO of MOSTLY AI

Dr. Michael Platzer is co-founder and CTO of MOSTLY AI, a leader in privacy-safe synthetic data generation. He earned degrees in mathematics and in business with distinction and led consumer analytics teams at global technology leaders before starting his venture to pioneer the field of synthetic data. His company's mission is to democratize data access and data insights in a safe and responsible way for everyone.

Webinar

Upcoming webinar: "Differentially-Private Synthetic Data for Everyone" (Ai+ Training)