Missing Data in Supervised Machine Learning

Supervised Learning 5: Missing Data in Supervised ML

Course Abstract

Training duration: 90 minutes

Datasets are almost never complete, and missing values can introduce various biases into your analysis. Because of these biases, your supervised machine learning model can produce incorrect predictions. The goal of this course is to show why some of the most common approaches for dealing with missing values often introduce some type of bias. I will describe methods and techniques that can help you arrive at an unbiased conclusion in the face of missing data.

DIFFICULTY LEVEL: INTERMEDIATE

Learning Objectives

Describe the three main types of missingness patterns
Evaluate simple approaches for handling missing values
Apply XGBoost to a dataset with missing values
Apply multivariate imputation
Apply the reduced-features model (also called the pattern submodel approach)
Decide which approach is best for your dataset

Instructor Bio:

Andras Zsom, PhD

Lead Data Scientist and Adjunct Lecturer in Data Science | Brown University, Center for Computation and Visualization

Andras Zsom is a Lead Data Scientist in the Center for Computation and Visualization and an Adjunct Lecturer in Data Science at Brown University, Providence, RI, USA. He works with high-level academic administrators to tackle predictive modeling problems and to promote data-driven decision making. He also collaborates with faculty members on data-intensive research projects and teaches a mandatory course in the Data Science Master’s curriculum.

Course Outline

Module 1: Missing data patterns

- MCAR - Missing Completely At Random

- MAR - Missing At Random

- MNAR - Missing Not At Random
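The three mechanisms above differ in what the probability of a value being missing depends on. The following is a minimal sketch on synthetic data, not material from the course: the `age` and `income` variables, the missingness rates, and the logistic link are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: income is correlated with age.
age = rng.normal(40, 10, n)
income = 1000 * age + rng.normal(0, 5000, n)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# MCAR: every income value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance that income is missing depends only on the observed age.
mar_mask = rng.random(n) < 0.4 * sigmoid((age - age.mean()) / age.std())

# MNAR: the chance that income is missing depends on income itself.
mnar_mask = rng.random(n) < 0.4 * sigmoid((income - income.mean()) / income.std())

# Compare the mean of the observed incomes with the mean of the full data.
print("full data mean:", income.mean())
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    print(name, "observed mean:", income[~mask].mean())
```

In this setup the MNAR mask preferentially hides high incomes, so the mean of the observed values tends to fall below the mean of the full data, which is the kind of bias the abstract refers to.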

Module 2: Apply the reduced-features model (also called the pattern submodel approach)

- Reduced-features model (or pattern submodel approach)
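One way to read the reduced-features idea is: for each missingness pattern in the data to be predicted, train a separate model that uses only the features observed in that pattern. The sketch below assumes NumPy arrays with `np.nan` marking missing values and uses a linear regression as a placeholder model; the function name `fit_predict_reduced_features` and the fallback to the training mean are hypothetical choices, not the course's implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def fit_predict_reduced_features(X_train, y_train, X_test):
    """Train one submodel per missingness pattern found in X_test."""
    preds = np.full(X_test.shape[0], np.nan)
    test_mask = np.isnan(X_test)

    # Each unique row of the boolean mask is one missingness pattern.
    for pattern in np.unique(test_mask, axis=0):
        observed_cols = ~pattern
        rows_test = (test_mask == pattern).all(axis=1)

        if not observed_cols.any():
            # Assumption: with no observed features, predict the training mean.
            preds[rows_test] = y_train.mean()
            continue

        # Use only training rows where every needed feature is observed.
        # (A real implementation should also handle the case of no such rows.)
        rows_train = ~np.isnan(X_train[:, observed_cols]).any(axis=1)
        model = LinearRegression().fit(
            X_train[rows_train][:, observed_cols], y_train[rows_train]
        )
        preds[rows_test] = model.predict(X_test[rows_test][:, observed_cols])

    return preds


# Small synthetic example (illustrative data only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 2 * X_demo[:, 0] + 3 * X_demo[:, 1]
X_demo[:50, 1] = np.nan  # the second feature is missing in some training rows
X_new = np.array([[1.0, 2.0], [1.0, np.nan]])
print(fit_predict_reduced_features(X_demo, y_demo, X_new))
```

No values are imputed here: each submodel sees only real, observed values, at the cost of training as many models as there are patterns.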

Module 3: How to determine the patterns?

- A Python implementation
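A possible way to determine the patterns, assuming the data lives in a pandas DataFrame, is to turn each row's missingness mask into a label and count the labels. The DataFrame below and the `"1"`/`"0"` label encoding are illustrative assumptions rather than the course's own code.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical toy dataset with missing values in columns "a" and "b".
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, np.nan],
        "b": [1.0, 2.0, np.nan, 5.0],
        "c": [1.0, 2.0, 3.0, 4.0],
    }
)

# Boolean mask: True where a value is missing.
mask = df.isna().to_numpy()

# Encode each row's pattern as a string, one character per column
# ("1" = missing, "0" = observed), in column order a, b, c.
pattern_labels = ["".join("1" if missing else "0" for missing in row) for row in mask]

# Count how many rows share each pattern.
pattern_counts = Counter(pattern_labels)
for label, count in pattern_counts.items():
    print(label, count)

# Group the rows by pattern, e.g. to train one submodel per group.
pattern_series = pd.Series(pattern_labels, index=df.index)
groups = {label: group for label, group in df.groupby(pattern_series)}
```

The number of distinct patterns gives a rough sense of how many submodels a reduced-features approach would need.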

Module 4: Decide which approach is best for your dataset

- XGB models

- Imputation

- Reduced-features

Background knowledge

Experience with Python and scikit-learn
Knowledge of building a machine learning pipeline (e.g., cross-validation, hyperparameter tuning)

Applicable Use-cases

Customer churn modeling: supervised learning can help identify which customers of a business are likely to stop engaging with it, and why.
Dynamic pricing: marketing campaigns for goods and services rely on pricing data. Airlines and ride-share services have successfully implemented dynamic price optimization strategies using supervised learning.
Handling missing data: addressing missing values properly improves modeling in a variety of business applications, including streaming, finance, and e-commerce.
