Get Ahead with Expert-Led Training in Supervised Learning

Supervised Machine Learning is a 6-part course series that walks through every step of the classical supervised machine learning pipeline. We use Python and packages such as scikit-learn, pandas, numpy, and matplotlib. The series covers cross-validation and splitting strategies, evaluation metrics, supervised machine learning algorithms (linear and logistic regression, support vector machines, and tree-based methods such as random forest, gradient boosting, and XGBoost), and interpretability. You can complete the courses in sequence or take individual courses based on your interests.


Andras Zsom is a Lead Data Scientist in the Center for Computation and Visualization and an Adjunct Lecturer in Data Science at Brown University, Providence, RI, USA. He works with high-level academic administrators to tackle predictive modeling problems and to promote data-driven decision making. He also collaborates with faculty members on data-intensive research projects and teaches a mandatory course in the Data Science Master’s curriculum.

Andras Zsom, PhD

Lead Data Scientist and Adjunct Lecturer in Data Science | Brown University, Center for Computation and Visualization

Supervised Learning 1: Introduction to Machine Learning and the Bias-Variance Tradeoff

Module 1: Intro to Machine Learning

Module 2: Overview of linear and logistic regression with regularization

Module 3: The bias-variance tradeoff
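
The bias-variance tradeoff can be illustrated with a minimal sketch like the one below. It is not taken from the course materials: the sine-shaped target, the noise level, and the polynomial degrees are all assumptions chosen for illustration. The idea is that training error can only go down as the model gets more flexible, while error on held-out data typically stops improving and may get worse.

```python
import numpy as np

rng = np.random.default_rng(42)


def true_fn(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return np.sin(np.pi * x)


def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))


# Small noisy training set and a larger validation set from the same process.
x_train = rng.uniform(-1, 1, size=30)
y_train = true_fn(x_train) + rng.normal(scale=0.3, size=30)
x_val = rng.uniform(-1, 1, size=200)
y_val = true_fn(x_val) + rng.normal(scale=0.3, size=200)

train_mse, val_mse = {}, {}
for degree in [1, 3, 9]:
    # Least-squares polynomial fit; a higher degree means a more flexible model.
    coefs = np.polyfit(x_train, y_train, deg=degree)
    train_mse[degree] = mse(y_train, np.polyval(coefs, x_train))
    val_mse[degree] = mse(y_val, np.polyval(coefs, x_val))
    print(f"degree={degree}  train MSE={train_mse[degree]:.3f}  "
          f"val MSE={val_mse[degree]:.3f}")
```

Comparing the two columns of the printout across degrees is one simple way to spot underfitting (both errors high) and overfitting (training error low, validation error noticeably higher).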

Supervised Learning 2: How to Prepare your Data for Supervised Machine Learning

Module 1: Split IID data (train/validation/test, KFoldCV, stratified splits in classification)

Module 2: Split non-IID data (GroupKFold, TimeSeriesSplit)

Module 3: Preprocess features (OneHotEncoder and OrdinalEncoder for categorical features, StandardScaler for continuous features)
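
As a rough sketch of how the splitting and preprocessing steps above fit together, the example below combines scikit-learn's GroupKFold with a ColumnTransformer. The toy DataFrame and its column names (patient_id, age, smoker) are hypothetical. The point it tries to show is that rows from the same group never appear on both sides of a split, and that the preprocessor is fit on the training fold only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two visits per patient, so rows are not independent.
df = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "age": [34.0, 35.0, 51.0, 52.0, 29.0, 30.0, 62.0, 63.0],
    "smoker": ["yes", "yes", "no", "no", "no", "no", "yes", "yes"],
})
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])

# One-hot encode the categorical column, standardize the continuous one.
preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["smoker"]),
    ("num", StandardScaler(), ["age"]),
])

overlaps, shapes = [], []
splitter = GroupKFold(n_splits=4)
for train_idx, val_idx in splitter.split(df, y, groups=df["patient_id"]):
    # Fit on the training fold only, then apply to the validation fold,
    # so no validation statistics leak into the preprocessing.
    X_train = preprocessor.fit_transform(df.iloc[train_idx])
    X_val = preprocessor.transform(df.iloc[val_idx])

    train_groups = set(df["patient_id"].iloc[train_idx])
    val_groups = set(df["patient_id"].iloc[val_idx])
    overlaps.append(len(train_groups & val_groups))
    shapes.append((X_train.shape, X_val.shape))
```

For time-ordered data, TimeSeriesSplit can be swapped in for GroupKFold in the same loop (without the groups argument), so that validation rows always come after the training rows.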

Supervised Learning 3: Evaluation Metrics in Supervised Machine Learning

Module 1: Hard predictions in classification (the confusion matrix and derived metrics such as accuracy, precision, recall, f_beta score)

Module 2: Working with predicted probabilities in classification (ROC curve, precision-recall curve, AUC, the logloss metric)

Module 3: Regression metrics (MSE, RMSE, MAE, R2 score)
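
To make the metric definitions above concrete, here is a small hand-rolled sketch in plain Python. The labels and predictions are made up, and the helper names are hypothetical; in practice scikit-learn's sklearn.metrics module provides equivalents.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Made-up classification example.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = f_beta(precision, recall, beta=1.0)
f2 = f_beta(precision, recall, beta=2.0)

# Made-up regression example.
r_true = [3.0, 5.0, 2.0, 7.0]
r_pred = [2.0, 5.0, 4.0, 8.0]
n = len(r_true)
mse = sum((t - p) ** 2 for t, p in zip(r_true, r_pred)) / n
rmse = math.sqrt(mse)
mae = sum(abs(t - p) for t, p in zip(r_true, r_pred)) / n
mean_true = sum(r_true) / n
ss_tot = sum((t - mean_true) ** 2 for t in r_true)
r2 = 1 - (mse * n) / ss_tot  # 1 - SS_res / SS_tot
```

Here the classifier has higher precision than recall, so the F2 score, which favors recall, comes out lower than F1.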

Supervised Learning 4: Non-linear Supervised Machine Learning Algorithms

Module 1: K-Nearest Neighbors

Module 2: Support Vector Machines (various kernels, hyperparameters, visualize predictions in simple cases with 1 or 2 features, pros and cons)

Module 3: Random Forests (CART, hyperparameters, visualize step-like predictions in simple cases with 1 or 2 features, pros and cons)

Module 4: XGBoost (hyperparameters, early stopping, missing values, pros and cons)

Supervised Learning 5: Missing Data in Supervised ML

Module 1: Missing Data Patterns

Module 2: Apply the Reduced-Features Model (also called the Pattern Submodel Approach)

Module 3: How to Determine the Patterns?

Module 4: Decide Which Approach is Best for Your Dataset

Supervised Learning 6: Interpretability

Module 1: Global feature importances using the coefficients of linear models

Module 2: Permutation feature importance and algorithm-specific metrics (e.g., gini impurity, XGBoost metrics like weight, cover, gain)

Module 3: Local feature importance with SHAP values
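
The first two ideas above, linear model coefficients and permutation feature importance, can be sketched with scikit-learn as follows. The synthetic data and the coefficient values are assumptions made for this illustration. For brevity the importances are computed on the training data; a held-out set is usually the better choice.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: feature 0 matters most, feature 1 a little, feature 2 not at all.
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# Global importance from coefficients: comparable here only because all
# features are on the same scale (unit-variance normal).
coef_importance = np.abs(model.coef_)

# Permutation importance: the drop in the model's score (R^2 for a regressor)
# when one feature column is shuffled.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]

for i in ranking:
    print(f"feature {i}: |coef|={coef_importance[i]:.3f}  "
          f"permutation importance={result.importances_mean[i]:.3f}")
```

Unlike coefficients, permutation importance does not require a linear model, so the same call works for tree-based models as well.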

Background knowledge

  • Python coding experience

  • Familiarity with pandas and numpy

  • Prior experience with scikit-learn and matplotlib is a plus but not required

Applicable Use-cases

  • The dataset can be expressed as a 2D feature matrix with the columns as features and the rows as data points

  • Detecting fraud, predicting whether patients have a certain illness, predicting the selling or rental price of properties, predicting customer satisfaction

  • Dynamic pricing for marketing campaigns for any goods or services relies on pricing data. Airlines and ride-share services have successfully implemented dynamic price optimization strategies using supervised learning.