Course Abstract

Training duration: 90 min (Hands-on)

Supervised Learning is a course series that walks through all steps of the classical supervised machine learning pipeline. We use Python and packages like scikit-learn, pandas, numpy, and matplotlib. The course series focuses on topics like cross-validation and splitting strategies, evaluation metrics, supervised machine learning algorithms (like linear and logistic regression, support vector machines, and tree-based methods like random forest, gradient boosting, and XGBoost), and interpretability. You can complete the courses in sequence or complete individual courses based on your interest.

This course starts with a high-level overview of supervised machine learning, focusing on regression and classification problems, what questions can be answered with these tools, and what the ultimate goal of a machine learning pipeline is. Then we will walk through the math behind linear and logistic regression models with regularization. Finally, we put together a simple pipeline using a toy dataset to illustrate the bias-variance tradeoff, a key concept in machine learning that drives how models are selected.


Learning Objectives

  • Describe how a task like spam filtering can be solved with explicit coding instructions vs. a machine learning algorithm that learns from examples (training data)

  • Summarize the similarities and differences between supervised and unsupervised ML

  • List the pros and cons of supervised machine learning

  • Define the mathematical model behind linear and logistic regression

  • Explain what the loss function is

  • Describe the two main types of regularization and why regularization is important

  • Perform a simple train/validation/test split on IID data

  • Apply linear and logistic regression to datasets

  • Tune the regularization hyperparameter

  • Identify models with high bias and high variance

  • Select the best model and measure its performance on a previously unseen dataset, the test set

Instructor Bio:

Andras Zsom, PhD

Lead Data Scientist and Adjunct Lecturer in Data Science | Brown University, Center for Computation and Visualization

Andras Zsom is a Lead Data Scientist in the Center for Computation and Visualization and an Adjunct Lecturer in Data Science at Brown University, Providence, RI, USA. He works with high-level academic administrators to tackle predictive modeling problems and to promote data-driven decision making, collaborates with faculty members on data-intensive research projects, and is the instructor of a mandatory course in the Data Science Master’s curriculum.

Course Outline

Module 1: Intro to Machine Learning (20 min)

  • Motivation: why supervised ML is the most successful area of ML

  • The example of the spam filter: workflow with explicit coding instructions vs. machine learning

  • The feature matrix and the target variable

  • Supervised and unsupervised machine learning

  • The pros and cons of supervised ML

  • Automation and predictions 

Module 2: Overview of linear and logistic regression with regularization (30 min)

  • The mathematical models behind linear and logistic regression

  • The cost function

  • Brief description of gradient descent

  • Motivation for regularization

  • L1 (Lasso) and L2 (Ridge) regularization
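
To give a flavor of the Module 2 topics, here is a minimal sketch (not the course's own material) of gradient descent on a linear regression cost with an L2 (Ridge) penalty. The function names, the synthetic data, and the values of `alpha`, `learning_rate`, and `n_steps` are assumptions chosen for illustration. An L1 (Lasso) penalty would replace the sum of squared weights with the sum of their absolute values.

```python
import numpy as np


def ridge_cost(X, y, w, b, alpha):
    """Mean squared error plus an L2 penalty on the weights (intercept excluded)."""
    residuals = X @ w + b - y
    return np.mean(residuals ** 2) + alpha * np.sum(w ** 2)


def ridge_gradient_descent(X, y, alpha=0.1, learning_rate=0.05, n_steps=2000):
    """Minimize ridge_cost with plain (full-batch) gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_steps):
        residuals = X @ w + b - y
        # Gradient of the MSE term plus gradient of the L2 penalty.
        grad_w = 2.0 / n_samples * (X.T @ residuals) + 2.0 * alpha * w
        grad_b = 2.0 * np.mean(residuals)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical toy data: three features, the last one carries no signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.0])
y = X @ true_w + 0.5 + rng.normal(scale=0.1, size=200)

w_weak, b_weak = ridge_gradient_descent(X, y, alpha=0.0)    # no regularization
w_strong, b_strong = ridge_gradient_descent(X, y, alpha=5.0)  # strong regularization

print("weights without penalty:  ", w_weak)
print("weights with strong penalty:", w_strong)
```

A larger `alpha` shrinks the weights toward zero, which is the mechanism that trades variance for bias.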

Module 3: The bias-variance tradeoff (40 min)

  • Split a dataset into train/validation/test sets

  • Standardize the dataset

  • Train linear models with various regularization strengths

  • Calculate the train and validation scores

  • Plot the scores and the predictions of corresponding models

  • Identify regions of high bias and high variance

  • Select the best regularization strength

  • Calculate the test score
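
The steps above can be sketched roughly as follows with scikit-learn. This is an illustrative outline under assumptions, not the course notebook: the synthetic dataset, the 60/20/20 split, the `Ridge` model, the grid of `alpha` values, and the use of R² as the score are all choices made here for the example. The plotting step is omitted.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy regression data: 20 features, only the first 5 are informative.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 20))
coef = np.zeros(20)
coef[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X @ coef + rng.normal(scale=1.0, size=300)

# Split into train/validation/test sets (60/20/20); valid because the points are IID.
X_other, X_test, y_other, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_other, y_other, test_size=0.25, random_state=0
)

# Standardize: fit the scaler on the training set only, then apply it everywhere.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

# Train one model per regularization strength and record train/validation R^2.
alphas = np.logspace(-3, 4, 15)
train_scores, val_scores = [], []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train_s, y_train)
    train_scores.append(model.score(X_train_s, y_train))
    val_scores.append(model.score(X_val_s, y_val))

# Select the strength with the best validation score, then report the test score once.
best_alpha = alphas[int(np.argmax(val_scores))]
final_model = Ridge(alpha=best_alpha).fit(X_train_s, y_train)
test_score = final_model.score(X_test_s, y_test)

print(f"best alpha: {best_alpha:.3g}, test R^2: {test_score:.3f}")
```

In this setup, very large `alpha` values give low train and validation scores (high bias), while a large gap between a high train score and a lower validation score at small `alpha` values signals high variance.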

Background knowledge

  • Python coding experience

  • Familiarity with pandas and numpy

  • Prior experience with scikit-learn and matplotlib is a plus but not required

Applicable Use-cases

  • The dataset can be expressed as a 2D feature matrix with the columns as features and the rows as data points

  • A continuous or categorical target variable exists and the dataset is IID (the points are independent and identically distributed)

  • The final model will predict that target variable given the features of previously unseen data points. Examples include, but are not limited to, detecting fraud, predicting whether patients have a certain illness, predicting the selling or rental price of properties, and predicting customer satisfaction