Course Abstract

Training duration: 90 minutes

Resampling methods like the bootstrap are becoming increasingly common in modern data science. For good reason too; the bootstrap is incredibly powerful. Unlike the t-test, the bootstrap depends on neither a normality assumption nor any arcane formulas. You’re no longer limited to working with well-understood metrics like means. You can easily build tools that compute confidence intervals for an arbitrary metric. What’s the standard error of a median? Who cares! I used the bootstrap. With all of these benefits, the bootstrap begins to look a little magical. That’s dangerous. To understand your tool, you need to understand how it fails, how to spot the failure, and what to do when it does. As it turns out, the bootstrap and the t-test struggle with very similar types of data. We’ll explore how these two methods compare on troublesome data sets and discuss when to use one over the other.
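To make the "standard error of a median" remark concrete, here is a minimal sketch of a percentile bootstrap in Python with NumPy. It is an illustration under stated assumptions, not the course's own code: the function names, the 2,000 resamples, and the exponential example data are all hypothetical choices.

```python
import numpy as np

def bootstrap_replicates(data, statistic, n_resamples=2000, seed=0):
    """Compute `statistic` on n_resamples pseudosamples of `data`."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = data.size
    # Each row of idx is one pseudosample: n indices drawn with replacement.
    idx = rng.integers(0, n, size=(n_resamples, n))
    return np.array([statistic(data[row]) for row in idx])

def bootstrap_summary(data, statistic, alpha=0.05, **kwargs):
    """Return the bootstrap standard error and a percentile interval."""
    reps = bootstrap_replicates(data, statistic, **kwargs)
    se = reps.std(ddof=1)
    lo, hi = np.quantile(reps, [alpha / 2, 1 - alpha / 2])
    return se, (lo, hi)

# Hypothetical example data: 200 draws from a skewed distribution.
sample = np.random.default_rng(42).exponential(scale=1.0, size=200)
se, (lo, hi) = bootstrap_summary(sample, np.median)
print(f"median={np.median(sample):.3f}  se={se:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

Swapping `np.median` for any other function of a sample is all it takes to get an interval for a different metric, which is exactly the convenience (and the temptation) the abstract describes.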

DIFFICULTY LEVEL: BEGINNER

Learning Objectives

  • Explore what types of data the bootstrap has trouble with.

  • Learn how to identify these problems in the wild and how to deal with the problematic data.

  • Explore simulated data, with shared code so you can run the simulations yourself.

  • See that this isn’t just a theoretical problem: explore real Firefox data and discuss how Firefox’s data science team handles this data when analyzing experiments.

  • Spot potential issues in your data and avoid false confidence in your results.

Instructor

Instructor Bio:

Principal Data Scientist | Mozilla

Ryan Harter

Ryan Harter is a Senior-Staff Data Scientist with Mozilla working on Firefox. He has years of experience solving business problems in the technology and energy industries, both as a data scientist and as a data engineer. Ryan shares practical advice for applying data science as a mentor and on his blog.


Course Outline

Module 1: Introduction to Bootstrapping

 - Cover terminology. What is a pseudosample?

 - Introduce the advantages and disadvantages of Bootstrapping 

Module 2: Simulating Bootstrap performance

 - Simulate how the bootstrap compares to the CLT over a few distributions

 - Uniform, Binomial, Pareto
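A simulation of this kind can be sketched as follows. This is a hedged illustration, not the course's code: it estimates how often a nominal 95% interval for the mean actually contains the true mean, using a CLT (normal approximation) interval and a percentile bootstrap interval. The sample size, trial counts, and distribution parameters (Bernoulli with p = 0.05, Pareto with shape 1.5) are assumptions chosen to show one easy case and two harder ones.

```python
from statistics import NormalDist

import numpy as np

def coverage(sampler, true_mean, n=50, n_trials=300, n_resamples=500, seed=0):
    """Estimate coverage of nominal 95% CLT and percentile-bootstrap intervals."""
    rng = np.random.default_rng(seed)
    z = NormalDist().inv_cdf(0.975)  # about 1.96
    clt_hits = 0
    boot_hits = 0
    for _ in range(n_trials):
        x = sampler(rng, n)
        # CLT interval: sample mean +/- z * estimated standard error.
        m = x.mean()
        half = z * x.std(ddof=1) / np.sqrt(n)
        clt_hits += int(m - half <= true_mean <= m + half)
        # Percentile bootstrap: resample indices with replacement.
        idx = rng.integers(0, n, size=(n_resamples, n))
        boot_means = x[idx].mean(axis=1)
        lo, hi = np.quantile(boot_means, [0.025, 0.975])
        boot_hits += int(lo <= true_mean <= hi)
    return clt_hits / n_trials, boot_hits / n_trials

# (sampler, true mean) pairs; all parameters are illustrative assumptions.
cases = {
    "uniform": (lambda rng, n: rng.uniform(0.0, 1.0, n), 0.5),
    "binomial": (lambda rng, n: rng.binomial(1, 0.05, n).astype(float), 0.05),
    # rng.pareto draws from the Lomax form; adding 1 gives a classical Pareto
    # with minimum 1, whose mean is a / (a - 1) = 3 for shape a = 1.5.
    "pareto": (lambda rng, n: 1.0 + rng.pareto(1.5, n), 3.0),
}

results = {name: coverage(sampler, mu) for name, (sampler, mu) in cases.items()}
for name, (clt, boot) in results.items():
    print(f"{name:9s} CLT coverage={clt:.2f}  bootstrap coverage={boot:.2f}")
```

Coverage near 0.95 means the interval behaves as advertised. Comparing the two columns for each distribution shows whether both methods degrade on the same data.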

Module 3: Conclusions 

 - Summarize the learnings

 - Compare some alternate definitions for Bootstrapping

Background knowledge

  • This course is for current or aspiring Data Scientists, Software Developers, and AI Product Managers

  • No specific background is needed, but familiarity with the basics of bootstrapping would help

Real-world applications

  • Bootstrapping can be used for experimentation purposes. For example, Netflix uses bootstrapping in streaming video experimentation to deliver high-quality streaming video at scale to all of its diverse members.

  • Bootstrapping can also be used for exploratory data analysis and to evaluate and enhance machine learning model performance.