How to prepare your data for supervised machine learning

Supervised Learning 2: How to Prepare Your Data

This course is available only as part of a subscription plan

Course Abstract

Training duration: 90 min (Hands-on)

Supervised Learning is a course series that walks through every step of the classical supervised machine learning pipeline. We use Python and packages like scikit-learn, pandas, numpy, and matplotlib. The series covers cross validation and splitting strategies, evaluation metrics, supervised machine learning algorithms (linear and logistic regression, support vector machines, and tree-based methods like random forest, gradient boosting, and XGBoost), and interpretability. You can complete the courses in sequence or pick individual courses based on your interest.

Part 2 of the series is on how to prepare your data for training and evaluating a machine learning model. Two steps are covered: how to split and how to preprocess your data. My experience is that beginner practitioners often make a mistake referred to as data leakage when splitting their dataset. Data leakage means that the model training process uses information that will not be available at prediction time. The unfortunate side effect is that the model seems to perform well during development but performs poorly once deployed. Two modules are dedicated to splitting, with the hope that participants will be well equipped to avoid data leakage once they complete them.

The third module is on preprocessing. There are two driving concepts behind preprocessing: the feature matrix needs to be numerical (no strings or other data types are allowed when using sklearn), and some machine learning models converge faster and perform better if all features are standardized.

DIFFICULTY LEVEL: BEGINNER | INTERMEDIATE

Learning Objectives

Describe why data splitting is necessary in machine learning
Summarize the properties of IID data
List examples of non-IID datasets
Apply IID splitting techniques
Apply non-IID splitting techniques
Identify when a custom splitting strategy is necessary
Describe the two motivating concepts behind preprocessing
Apply various preprocessors to categorical and continuous features
Perform preprocessing with a sklearn pipeline and ColumnTransformer

Instructor Bio:

Andras Zsom, PhD

Lead Data Scientist and Adjunct Lecturer in Data Science | Brown University, Center for Computation and Visualization

Andras Zsom is a Lead Data Scientist in the Center for Computation and Visualization and an Adjunct Lecturer in Data Science at Brown University, Providence, RI, USA. He works with high-level academic administrators to tackle predictive modeling problems and to promote data-driven decision making, collaborates with faculty members on data-intensive research projects, and teaches a mandatory course in the Data Science Master’s curriculum.

Course Outline

Module 1: Split IID data

Review why we split the data (hyperparameter tuning, the bias-variance tradeoff, the generalization error)
The properties of Independent and Identically Distributed (IID) data
The basic approach to split the data into training, validation, and test sets
K-Fold cross validation
How to split imbalanced data in classification: the stratified split
The uncertainty introduced by data splitting and how to measure it
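
As a rough preview of the splitting techniques listed in Module 1, the sketch below builds a small synthetic, imbalanced dataset and splits it with sklearn. The data, the 60/20/20 proportions, and the variable names are assumptions made for illustration; they are not taken from the course materials.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic, imbalanced toy data (an assumption): 100 points, 10% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 10 + [0] * 90)

# Basic split into 60% training, 20% validation, and 20% test data.
# stratify keeps the class ratio (10% positives) the same in every set.
X_train, X_other, y_train, y_other = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_other, y_other, test_size=0.5, stratify=y_other, random_state=0
)

# Stratified K-fold: every validation fold receives its share of positives.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_positive_counts = [
    int(y[fold_val_idx].sum()) for _, fold_val_idx in skf.split(X, y)
]

print(len(X_train), len(X_val), len(X_test))
print(fold_positive_counts)
```

Repeating the split with several different random_state values and comparing the resulting scores is one simple way to gauge the uncertainty that the splitting itself introduces.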

Module 2: Split non-IID data

Examples of non-IID datasets
Guiding questions to ask yourself when coming up with a splitting strategy
Split a dataset with group structure: GroupShuffleSplit and GroupKFold
How to work with time series data: the TimeSeriesSplit
The limitations of sklearn: when you should consider writing your own custom splitting function
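
The splitters named in Module 2 can be sketched on toy data as follows. The group labels stand in for something like hypothetical patient IDs, and the row order is assumed to be chronological for the time series part; both are illustrative assumptions rather than course content.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, GroupShuffleSplit, TimeSeriesSplit

# Toy data (an assumption): 12 rows, 4 groups with 3 rows each.
X = np.arange(24).reshape(12, 2)
y = np.arange(12) % 2
groups = np.repeat([0, 1, 2, 3], 3)  # hypothetical patient IDs

# GroupShuffleSplit: all rows of a group land on the same side of the split.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=groups))
shared_groups = set(groups[train_idx]) & set(groups[test_idx])

# GroupKFold: each group appears in exactly one validation fold.
gkf = GroupKFold(n_splits=4)
group_folds = [
    sorted(set(groups[fold_val_idx].tolist()))
    for _, fold_val_idx in gkf.split(X, y, groups=groups)
]

# TimeSeriesSplit: rows are assumed to be in time order, and training rows
# always precede the validation rows.
tss = TimeSeriesSplit(n_splits=3)
time_folds = [
    (int(fold_train_idx.max()), int(fold_val_idx.min()))
    for fold_train_idx, fold_val_idx in tss.split(X)
]

print(shared_groups)
print(group_folds)
print(time_folds)
```

When a dataset has both group structure and a time component, none of these splitters handles both at once, which is the kind of situation where a custom splitting function becomes worth considering.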

Module 3: Preprocess continuous and categorical features

Review the driving concepts behind preprocessing
Overview of sklearn transformers and methods
Apply the one hot encoder to categorical features
Apply the ordinal encoder to ordinal features
Standardize continuous features
Introduction to sklearn’s ColumnTransformer and pipelines
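
The preprocessing steps in Module 3 might be combined roughly as in the sketch below. The column names, category order, toy values, and the choice of LogisticRegression as the final step are all assumptions made for illustration. Because the preprocessors sit inside the pipeline, they are fit on the training data only, which is one way to avoid leaking test set statistics into training.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy training and test data (assumptions, not course data).
train = pd.DataFrame({
    "city": ["Providence", "Boston", "Providence", "New York"],
    "education": ["high school", "bachelor", "master", "bachelor"],
    "age": [25.0, 35.0, 45.0, 55.0],
})
y_train = [0, 0, 1, 1]
test = pd.DataFrame({"city": ["Chicago"], "education": ["master"], "age": [40.0]})

preprocessor = ColumnTransformer(
    transformers=[
        # Unordered categories: one column per category; unseen ones become all zeros.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["city"]),
        # Ordered categories: integers that follow the order given here.
        ("ordinal", OrdinalEncoder(categories=[["high school", "bachelor", "master"]]), ["education"]),
        # Continuous feature: subtract the mean and divide by the standard deviation.
        ("scale", StandardScaler(), ["age"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

model = Pipeline(steps=[("preprocess", preprocessor), ("classifier", LogisticRegression())])
model.fit(train, y_train)          # preprocessors are fit on the training data only
predictions = model.predict(test)  # test data is only transformed, never fit

X_train_prep = model.named_steps["preprocess"].transform(train)
print(X_train_prep.shape)
```

The resulting feature matrix is fully numerical: three one-hot columns for city, one integer column for education, and one standardized column for age.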

Background knowledge

Python coding experience
Familiarity with pandas and numpy
Prior experience with scikit-learn and matplotlib is a plus but not required

Applicable Use-cases

The dataset can be expressed as a 2D feature matrix with the columns as features and the rows as data points
A continuous or categorical target variable exists and the dataset is IID (the points are independent and identically distributed)
Examples: detecting fraud, predicting whether patients have a certain illness, predicting the selling or rental price of properties, predicting customer satisfaction
