Get Ahead with Expert-Led Training in Supervised Learning

Supervised Machine Learning is a 6-part course series that walks through every step of the classical supervised machine learning pipeline. We use Python and packages like scikit-learn, pandas, NumPy, and matplotlib. The series covers cross-validation and splitting strategies, evaluation metrics, supervised machine learning algorithms (linear and logistic regression, support vector machines, and tree-based methods such as random forests, gradient boosting, and XGBoost), and interpretability. You can take the courses in sequence or pick individual courses based on your interests.


Lead Data Scientist and Adjunct Lecturer in Data Science | Brown University, Center for Computation and Visualization

Andras Zsom, PhD

Andras Zsom is a Lead Data Scientist in the Center for Computation and Visualization and an Adjunct Lecturer in Data Science at Brown University, Providence, RI, USA. He works with high-level academic administrators to tackle predictive modeling problems and to promote data-driven decision making. He also collaborates with faculty members on data-intensive research projects and teaches a mandatory course in the Data Science Master’s curriculum.

Supervised Learning 1: Introduction to Machine Learning and the Bias-Variance Tradeoff

Module 1: Intro to Machine Learning

Module 2: Overview of linear and logistic regression with regularization

Module 3: The bias-variance tradeoff

Supervised Learning 2: How to Prepare your Data for Supervised Machine Learning

Module 1: Split IID data (train/validation/test, KFoldCV, stratified splits in classification)

Module 2: Split non-IID data (GroupKFold, TimeSeriesSplit)

Module 3: Preprocess features (OneHotEncoder and OrdinalEncoder for categorical features, StandardScaler for continuous features)
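
The two ideas behind Modules 2 and 3 above, keeping every group inside a single fold and fitting the scaler on training data only, can be sketched without any dependencies. This is a minimal illustration, not how scikit-learn implements GroupKFold or StandardScaler: the helper names `group_folds` and `fit_standardizer` are hypothetical, and the round-robin fold assignment is an assumption made for brevity (it does not balance fold sizes).

```python
from statistics import mean, pstdev


def group_folds(groups, n_splits):
    """Yield (train_idx, test_idx) pairs so that no group spans two folds.

    Assumption: groups are assigned to folds round-robin in sorted order.
    """
    unique = sorted(set(groups))
    fold_of_group = {g: i % n_splits for i, g in enumerate(unique)}
    for k in range(n_splits):
        test_idx = [i for i, g in enumerate(groups) if fold_of_group[g] == k]
        train_idx = [i for i, g in enumerate(groups) if fold_of_group[g] != k]
        yield train_idx, test_idx


def fit_standardizer(train_values):
    """Learn mean and population standard deviation from training data only.

    Returns a function that applies the same shift and scale to any values,
    so validation and test data never influence the statistics.
    """
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # avoid dividing by zero for constant features
    return lambda values: [(v - mu) / sigma for v in values]


# Hypothetical toy data: three patients with two measurements each.
groups = ["a", "a", "b", "b", "c", "c"]
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

for train_idx, test_idx in group_folds(groups, n_splits=3):
    scale = fit_standardizer([feature[i] for i in train_idx])
    test_scaled = scale([feature[i] for i in test_idx])
    print(test_idx, test_scaled)
```

Fitting the scaler inside the loop, on the training indices of each fold, is the design choice that prevents information from the held-out points leaking into preprocessing.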

Supervised Learning 3: Evaluation Metrics in Supervised Machine Learning

Module 1: Hard predictions in classification (the confusion matrix and derived metrics such as accuracy, precision, recall, F-beta score)

Module 2: Working with predicted probabilities in classification (ROC curve, precision-recall curve, AUC, the log loss metric)

Module 3: Regression metrics (MSE, RMSE, MAE, R2 score)
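
Several of the metrics listed above follow directly from their definitions. The sketch below is a dependency-free illustration under stated assumptions: binary labels with 1 as the positive class, a non-constant regression target, and hypothetical helper names (`confusion_counts`, `f_beta`, `regression_metrics`). In practice, `sklearn.metrics` provides functions such as `fbeta_score`, `mean_squared_error`, `mean_absolute_error`, and `r2_score`.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def f_beta(y_true, y_pred, beta=1.0):
    """F-beta score: beta > 1 weights recall more, beta < 1 weights precision more."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R2 (assumes y_true is not constant)."""
    n = len(y_true)
    squared_error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mean_true = sum(y_true) / n
    total_variation = sum((t - mean_true) ** 2 for t in y_true)
    mse = squared_error / n
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": mae,
        "R2": 1 - squared_error / total_variation,
    }


# Hypothetical toy predictions.
print(f_beta([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0], beta=1.0))
print(regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))
```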

Supervised Learning 4: Non-linear Supervised Machine Learning Algorithms

Module 1: K-Nearest Neighbors

Module 2: Support Vector Machines (various kernels, hyperparameters, visualize predictions in simple cases with 1 or 2 features, pros and cons)

Module 3: Random Forests (CART, hyperparameters, visualize step-like predictions in simple cases with 1 or 2 features, pros and cons)

Module 4: XGBoost (hyperparameters, early stopping, missing values, pros and cons)

Supervised Learning 5: Missing Data in Supervised ML

Module 1: Missing Data Patterns

Module 2: Apply the Reduced-Features Model (also called the Pattern Submodel Approach)

Module 3: How to Determine the Patterns?

Module 4: Decide Which Approach is Best for Your Dataset

Supervised Learning 6: Interpretability

Module 1: Global feature importances using the coefficients of linear models

Module 2: Permutation feature importance and algorithm-specific metrics (e.g., Gini impurity, XGBoost metrics like weight, cover, gain)

Module 3: Local feature importance with SHAP values
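
Permutation feature importance, listed in Module 2 above, measures how much a model's score drops when one feature column is shuffled. The sketch below is a minimal, dependency-free illustration, not the course's implementation. The function name and signature are hypothetical, the toy model and data are invented for the example, and the score is assumed to be "higher is better". scikit-learn offers a full version as `sklearn.inspection.permutation_importance`.

```python
import random


def permutation_importance(predict, X, y, score, n_repeats=5, seed=0):
    """Average drop in score when each feature column is shuffled.

    predict: function mapping one row (list of floats) to a prediction
    score:   function (y_true, y_pred) -> float, higher is better
    """
    rng = random.Random(seed)
    baseline = score(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = score(y, [predict(row) for row in X_perm])
            drops.append(baseline - permuted)
        importances.append(sum(drops) / n_repeats)
    return importances


def negative_mse(y_true, y_pred):
    """Negated mean squared error, so that higher values mean a better fit."""
    return -sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy model that uses only the first feature.
def toy_predict(row):
    return 2.0 * row[0]


X = [[1.0, 5.0], [2.0, 3.0], [3.0, 9.0], [4.0, 1.0]]
y = [2.0, 4.0, 6.0, 8.0]
print(permutation_importance(toy_predict, X, y, negative_mse))
```

Because the toy model ignores the second feature, shuffling that column cannot change its predictions, so that feature's importance is zero by construction.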

Background knowledge

  • Python coding experience

  • Familiarity with pandas and numpy

  • Prior experience with scikit-learn and matplotlib is a plus but not required

Applicable Use-cases

  • The dataset can be expressed as a 2D feature matrix with the columns as features and the rows as data points

  • Detecting fraud, predicting whether patients have a certain illness, predicting the selling or rental price of properties, predicting customer satisfaction

  • Dynamic pricing for marketing campaigns for any goods or services relies on pricing data. Airlines and ride-share services have successfully implemented dynamic price optimization strategies using supervised learning