Supervised Learning 2: How to Prepare Your Data
This course is available only as part of a subscription plan
Training duration: 90 min (Hands-on)
Describe why data splitting is necessary in machine learning
Summarize the properties of IID data
List examples of non-IID datasets
Apply IID splitting techniques
Apply non-IID splitting techniques
Identify when a custom splitting strategy is necessary
Describe the two motivating concepts behind preprocessing
Apply various preprocessors to categorical and continuous features
Perform preprocessing with a sklearn pipeline and ColumnTransformer
Andras Zsom, PhD
Module 1: Split IID data
Review why we split the data (hyperparameter tuning, the bias-variance tradeoff, the generalization error)
The properties of Independent and Identically Distributed (IID) data
The basic approach to split the data into training, validation, and test sets
K-Fold cross validation
How to split imbalanced classification data: the stratified split
The uncertainty introduced by data splitting and how to measure it
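The Module 1 topics above can be sketched in a few lines of scikit-learn. This is a minimal illustration rather than the course's own material: the synthetic dataset, the 60/20/20 proportions, the number of folds, and the random seeds are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic, imbalanced toy data (an assumption for illustration):
# 90 points in class 0 and 10 points in class 1.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# Basic split: hold out 20% as the test set, then split the remainder
# into training (60% of all points) and validation (20% of all points).
# stratify=... keeps the class ratio the same in every subset.
X_other, X_test, y_other, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_other, y_other, test_size=0.25, stratify=y_other, random_state=0
)

# Stratified K-fold on the non-test data: every validation fold
# receives the same share of the minority class.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
fold_minority_counts = [
    int(y_other[val_idx].sum()) for _, val_idx in skf.split(X_other, y_other)
]
print(len(X_train), len(X_val), len(X_test), fold_minority_counts)
```

Rerunning the split with several different random_state values and comparing the resulting scores is one simple way to gauge the uncertainty that splitting introduces.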
Module 2: Split non-IID data
Examples of non-IID datasets
Guiding questions to ask yourself when coming up with a splitting strategy
Split a dataset with group structure: GroupShuffleSplit and GroupKFold
How to work with time series data: TimeSeriesSplit
The limitations of sklearn: when you should consider writing your own custom splitting function
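As a rough sketch of the Module 2 splitters named above, the snippet below applies GroupShuffleSplit, GroupKFold, and TimeSeriesSplit to a tiny toy array. The data, the group ids, and the split sizes are hypothetical and serve only to show what each splitter guarantees.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, GroupShuffleSplit, TimeSeriesSplit

# Toy data (an assumption for illustration): 12 time-ordered points
# that belong to 4 hypothetical groups of 3 points each.
X = np.arange(12).reshape(-1, 1)
y = np.zeros(12)
groups = np.repeat([0, 1, 2, 3], 3)

# GroupShuffleSplit: a random hold-out in which whole groups are held out,
# so no group is present in both the training and the test set.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups))
shared_groups = set(groups[train_idx]) & set(groups[test_idx])

# GroupKFold: K-fold cross validation with the same group guarantee.
gkf = GroupKFold(n_splits=4)
gkf_overlaps = [
    set(groups[tr]) & set(groups[te]) for tr, te in gkf.split(X, y, groups)
]

# TimeSeriesSplit: every training index precedes every test index,
# so the model is never trained on the future.
tss = TimeSeriesSplit(n_splits=3)
train_before_test = [tr.max() < te.min() for tr, te in tss.split(X)]
```

When a dataset has both group structure and time ordering, neither guarantee alone may be sufficient, which is one situation where a custom splitting function is worth considering.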
Module 3: Preprocess continuous and categorical features
Review the driving concepts behind preprocessing
Overview of sklearn transformers and methods
Apply the one hot encoder to categorical features
Apply the ordinal encoder to ordinal features
Standardize continuous features
Introduction to sklearn’s ColumnTransformer and pipelines
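The preprocessing steps listed for Module 3 can be combined roughly as follows. This is a sketch under stated assumptions: the column names, the category order for the ordinal feature, and the toy values are invented for the example, and sparse_threshold=0 is set only to force a dense output that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical toy data: one categorical, one ordinal, one continuous feature.
df_train = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],
    "size": ["small", "large", "medium", "small"],
    "weight": [1.0, 2.0, 3.0, 4.0],
})
df_test = pd.DataFrame({
    "color": ["green", "red"],
    "size": ["medium", "large"],
    "weight": [2.5, 5.0],
})

# Each transformer is applied only to the columns listed next to it.
preprocessor = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"]),
        ("ordinal", OrdinalEncoder(categories=[["small", "medium", "large"]]), ["size"]),
        ("scale", StandardScaler(), ["weight"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
pipe = Pipeline(steps=[("preprocess", preprocessor)])

# Fit on the training set only, then reuse the fitted transformers on the
# test set so that no test-set statistics leak into preprocessing.
X_train_prep = pipe.fit_transform(df_train)
X_test_prep = pipe.transform(df_test)
print(X_train_prep.shape, X_test_prep.shape)
```

A model can be appended as a final step of the same Pipeline, so that splitting, preprocessing, and fitting stay consistent across cross-validation folds.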
Python coding experience
Familiarity with pandas and numpy
Prior experience with scikit-learn and matplotlib is a plus but not required