Course Abstract

Training duration: 90 min (Hands-on)

Supervised Learning is a course series that walks through all steps of the classical supervised machine learning pipeline. We use Python and packages like scikit-learn, pandas, numpy, and matplotlib. The course series focuses on topics like cross-validation and splitting strategies, evaluation metrics, supervised machine learning algorithms (like linear and logistic regression, support vector machines, and tree-based methods like random forest, gradient boosting, and XGBoost), and interpretability. You can complete the courses in sequence or take individual courses based on your interests. This course starts with a high-level overview of supervised machine learning focusing on regression and classification problems, what questions can be answered with these tools, and what the ultimate goal of a machine learning pipeline is. Then we walk through the math behind linear and logistic regression models with regularization. Finally, we put together a simple pipeline on a toy dataset to illustrate the bias-variance tradeoff, a key concept in machine learning that drives how models are selected.

DIFFICULTY LEVEL: BEGINNER

Learning Objectives

  • Describe how a task like spam filtering can be solved with explicit coding instructions vs. a machine learning algorithm that learns from examples (training data)

  • Summarize the similarities and differences between supervised and unsupervised ML

  • List the pros and cons of supervised machine learning

  • Define the mathematical model behind linear and logistic regression

  • Explain what the loss function is

  • Describe the two main types of regularization and why regularization is important

  • Perform a simple train/validation/test split on IID data (a minimal sketch follows this list)

  • Apply linear and logistic regression to datasets

  • Tune the regularization hyperparameter

  • Identify models with high bias and high variance

  • Select the best model and measure its performance on a previously unseen dataset, the test set
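
As a preview of the splitting objective above, here is a minimal sketch using scikit-learn's train_test_split; the toy data, the 60/20/20 ratio, and the random seeds are illustrative assumptions, not the course's exact setup.

```python
# A minimal sketch of a 60/20/20 train/validation/test split on IID data;
# the toy data, split ratios, and random seeds are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # toy feature matrix: 100 points, 3 features
y = rng.normal(size=100)       # toy continuous target

# First set aside the test set (20%), then split the rest 75/25,
# which yields 60% train, 20% validation, 20% test overall.
X_other, X_test, y_other, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_other, y_other, test_size=0.25, random_state=0)
```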

Instructor Bio:

Andras Zsom, PhD

Lead Data Scientist and Adjunct Lecturer in Data Science | Brown University, Center for Computation and Visualization

Andras Zsom is a Lead Data Scientist in the Center for Computation and Visualization and an Adjunct Lecturer in Data Science at Brown University, Providence, RI, USA. He works with high-level academic administrators to tackle predictive modeling problems and promote data-driven decision making, collaborates with faculty members on data-intensive research projects, and teaches a mandatory course in the Data Science Master’s curriculum.

Course Outline

Module 1: Intro to Machine Learning (20 min)

  • Motivation: why supervised ML is the most successful area of ML

  • The example of the spam filter: workflow with explicit coding instructions vs. machine learning

  • The feature matrix and the target variable (illustrated in the sketch after this list)

  • Supervised and unsupervised machine learning

  • The pros and cons of supervised ML

  • Automation and predictions 
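
To make the feature matrix and target variable concrete, here is a minimal sketch in the spam-filter setting; the feature names and values are invented for illustration.

```python
# A minimal sketch of a feature matrix and target variable for the spam-filter
# example; the feature names and values are invented for illustration.
import pandas as pd

# One row per email (data point), one column per feature.
emails = pd.DataFrame({
    "n_exclamation_marks": [0, 7, 1],
    "contains_word_free":  [0, 1, 0],
    "sender_in_contacts":  [1, 0, 1],
})
is_spam = pd.Series([0, 1, 0], name="is_spam")  # target: 1 = spam, 0 = not spam

X = emails.to_numpy()   # the 2D feature matrix
y = is_spam.to_numpy()  # the target variable
```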


Module 2: Overview of linear and logistic regression with regularization (30 min)

  • The mathematical models behind linear and logistic regression

  • The cost function

  • Brief description of gradient descent

  • The motivation for regularization

  • L1 (Lasso) and L2 (Ridge) regularization (the models and regularized cost functions are sketched after this list)
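
For reference, the models and regularized cost functions covered in this module can be written as follows; the notation (weights w, bias b, regularization strength α) is a standard choice, not necessarily the course's exact presentation.

```latex
% Linear regression: prediction and mean squared error cost
\hat{y} = w^\top x + b, \qquad
J_{\mathrm{MSE}}(w, b) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

% Logistic regression: sigmoid of a linear score and cross-entropy cost
\hat{p} = \sigma(w^\top x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
J_{\mathrm{CE}}(w, b) = -\frac{1}{n} \sum_{i=1}^{n}
  \left[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \right]

% Regularization: add an L1 (Lasso) or L2 (Ridge) penalty of strength \alpha
J_{\mathrm{L1}}(w, b) = J(w, b) + \alpha \sum_j |w_j|, \qquad
J_{\mathrm{L2}}(w, b) = J(w, b) + \alpha \sum_j w_j^2
```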


Module 3: The bias-variance tradeoff (40 min)

  • Split a dataset into train/validation/test sets

  • Standardize the dataset

  • Train linear models with various regularization strengths

  • Calculate the train and validation scores

  • Plot the scores and the predictions of corresponding models

  • Identify regions of high bias and high variance

  • Select the best regularization strength

  • Calculate the test score (a minimal end-to-end sketch of this workflow follows this list)
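
Below is a minimal end-to-end sketch of this workflow using ridge regression; the synthetic dataset, the R² score, the alpha grid, and the 60/20/20 split are illustrative assumptions, not the course's exact materials, and the plotting step is omitted for brevity.

```python
# A minimal sketch of the Module 3 workflow on synthetic data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Split into 60/20/20 train/validation/test sets.
X_other, X_test, y_other, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_other, y_other, test_size=0.25, random_state=0)

# Standardize using statistics computed on the training set only.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

# Train linear models with various regularization strengths; collect scores.
alphas = np.logspace(-3, 3, 13)
train_scores, val_scores = [], []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    train_scores.append(model.score(X_train, y_train))  # R^2 on the training set
    val_scores.append(model.score(X_val, y_val))        # R^2 on the validation set

# A small alpha with a large train/validation gap suggests high variance;
# a very large alpha with low scores everywhere suggests high bias.
best_alpha = alphas[int(np.argmax(val_scores))]

# Refit with the best alpha and measure performance on the test set once.
best_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
print(f"best alpha: {best_alpha:.3g}, test R^2: {best_model.score(X_test, y_test):.3f}")
```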

Background knowledge

  • Python coding experience

  • Familiarity with pandas and numpy

  • Prior experience with scikit-learn and matplotlib is a plus but not required

Applicable Use-cases

  • The dataset can be expressed as a 2D feature matrix with the columns as features and the rows as data points

  • A continuous or categorical target variable exists and the dataset is IID (the points are independent and identically distributed)

  • The final model will predict that target variable given the features of previously unseen data points. Examples include, but are not limited to, fraud detection, predicting whether patients have a certain illness, predicting the selling or rental price of properties, and predicting customer satisfaction (a minimal classification sketch follows)
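
As a small illustration of such a use case, the sketch below fits a logistic regression classifier to a synthetic dataset (standing in for, say, fraud detection) and predicts the categorical target of previously unseen points; the data and parameter choices are invented for illustration.

```python
# A minimal sketch of predicting a categorical target for unseen points;
# the synthetic data stands in for a real use case such as fraud detection.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_seen, X_unseen, y_seen, y_unseen = train_test_split(X, y, test_size=0.2, random_state=0)

# C is the inverse of the regularization strength in scikit-learn.
clf = LogisticRegression(C=1.0).fit(X_seen, y_seen)
print(clf.predict(X_unseen[:5]))      # predicted classes for unseen points
print(clf.score(X_unseen, y_unseen))  # accuracy on the unseen set
```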