Supervised Machine Learning Series
This series is only available as part of the subscription plans.
Supervised Machine Learning is a 6-part course series that walks through all steps of the classical supervised machine learning pipeline. We use Python and packages such as scikit-learn, pandas, NumPy, and Matplotlib. The series covers cross-validation and splitting strategies, evaluation metrics, supervised machine learning algorithms (such as linear and logistic regression, support vector machines, and tree-based methods like random forests, gradient boosting, and XGBoost), and interpretability. You can complete the courses in sequence or pick individual courses based on your interests.
Andras Zsom, PhD
Lead Data Scientist and Adjunct Lecturer in Data Science | Brown University, Center for Computation and Visualization
Course 1: Machine Learning Foundations
Module 1: Intro to Machine Learning
Module 2: Overview of linear and logistic regression with regularization
Module 3: The bias-variance tradeoff
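The regularized linear models covered in Module 2 can be sketched in a few lines of scikit-learn. This is an illustrative toy example (the data and parameter values are assumptions, not course material): Ridge is linear regression with an L2 penalty of strength alpha, and LogisticRegression applies L2 regularization by default, controlled by the inverse-strength parameter C.

```python
# Toy sketch of regularized linear and logistic regression (not course code).
import numpy as np
from sklearn.linear_model import Ridge, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y_reg = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
y_clf = (y_reg > 0).astype(int)

# Ridge = linear regression with an L2 penalty; alpha sets its strength
ridge = Ridge(alpha=1.0).fit(X, y_reg)

# LogisticRegression uses L2 regularization by default; C is the
# inverse regularization strength (smaller C = stronger penalty)
logreg = LogisticRegression(C=1.0).fit(X, y_clf)
```

Increasing alpha (or decreasing C) shrinks the coefficients toward zero, which is one lever for managing the bias-variance tradeoff discussed in Module 3.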
Course 2: Data Splitting and Preprocessing
Module 1: Split IID data (train/validation/test splits, k-fold cross-validation, stratified splits for classification)
Module 2: Split non-IID data (GroupKFold, TimeSeriesSplit)
Module 3: Preprocess features (OneHotEncoder and OrdinalEncoder for categorical features, StandardScaler for continuous features)
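The splitting and preprocessing tools listed above fit together as sketched below on assumed toy data (not course material). Note that scalers and encoders are fit on the training set only and then applied to the test set, to avoid data leakage.

```python
# Toy sketch: stratified splitting plus feature preprocessing (not course code).
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
X_num = rng.normal(size=(60, 2))                    # continuous features
X_cat = rng.choice(["a", "b", "c"], size=(60, 1))   # categorical feature
y = rng.integers(0, 2, size=60)

# stratify=y preserves the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_num, y, test_size=0.2, stratify=y, random_state=0)

# StratifiedKFold yields class-balanced CV folds on the training set
for tr_idx, val_idx in StratifiedKFold(n_splits=5).split(X_tr, y_tr):
    pass  # fit and evaluate a model per fold here

# fit the scaler on training data only, then transform the test data
scaler = StandardScaler().fit(X_tr)
X_te_scaled = scaler.transform(X_te)

# one-hot encode the categorical feature (toarray() densifies the output)
enc = OneHotEncoder().fit(X_cat)
X_cat_ohe = enc.transform(X_cat).toarray()
```

For non-IID data (Module 2), GroupKFold and TimeSeriesSplit replace StratifiedKFold with the same split(...) interface.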
Course 3: Evaluation Metrics
Module 1: Hard predictions in classification (the confusion matrix and derived metrics such as accuracy, precision, recall, and the F-beta score)
Module 2: Working with predicted probabilities in classification (ROC curve, precision-recall curve, AUC, the logloss metric)
Module 3: Regression metrics (MSE, RMSE, MAE, R2 score)
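The metrics from all three modules are available in sklearn.metrics; a minimal sketch on assumed toy predictions (not course data) shows the key distinction between metrics that take hard predictions and metrics that take predicted probabilities:

```python
# Toy sketch of classification and regression metrics (not course code).
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, log_loss,
                             mean_squared_error, r2_score)

y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])               # hard predictions
y_prob = np.array([0.2, 0.6, 0.9, 0.8, 0.4, 0.1])   # predicted probabilities

# confusion-matrix-derived metrics use hard predictions
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)

# ranking/probability metrics use predicted probabilities
auc = roc_auc_score(y_true, y_prob)
ll = log_loss(y_true, y_prob)

# regression metrics compare continuous targets and predictions
y_reg_true = np.array([1.0, 2.0, 3.0])
y_reg_pred = np.array([1.1, 1.9, 3.2])
rmse = mean_squared_error(y_reg_true, y_reg_pred) ** 0.5
r2 = r2_score(y_reg_true, y_reg_pred)
```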
Course 4: Supervised Machine Learning Algorithms
Module 1: K-Nearest Neighbors
Module 2: Support Vector Machines (various kernels, hyperparameters, visualize predictions in simple cases with 1 or 2 features, pros and cons)
Module 3: Random Forests (CART, hyperparameters, visualize step-like predictions in simple cases with 1 or 2 features, pros and cons)
Module 4: XGBoost (hyperparameters, early stopping, missing values, pros and cons)
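All of the algorithms above share scikit-learn's fit/predict interface, so swapping models is a one-line change. A minimal sketch on assumed toy data with a nonlinear decision boundary (XGBoost is omitted to keep the example dependency-free; its xgboost.XGBClassifier follows the same interface):

```python
# Toy sketch: fitting several classifiers with a shared interface (not course code).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)  # circular boundary

models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "svm_rbf": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "rf": RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0),
}
# training-set accuracy per model (use CV for honest estimates)
scores = {name: m.fit(X, y).score(X, y) for name, m in models.items()}
```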
Course 5: Handling Missing Data
Module 1: Missing Data Patterns
Module 2: Apply the Reduced-Features Model (also called the Pattern Submodel Approach)
Module 3: How to Determine the Patterns?
Module 4: Decide Which Approach is Best for Your Dataset
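The core idea of the reduced-features (pattern submodel) approach from Module 2 can be sketched as follows. This is an illustrative toy implementation on assumed data, not the course's: group the rows by their missingness pattern and fit one submodel per pattern, using only the columns observed in that pattern.

```python
# Toy sketch of the pattern submodel idea (not the course's implementation).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f0", "f1", "f2"])
X.loc[rng.random(200) < 0.3, "f2"] = np.nan  # f2 missing in ~30% of rows
y = (X["f0"] + X["f1"] > 0).astype(int)

# encode each row's missingness pattern as a string, e.g. "001"
patterns = X.isna().astype(int).astype(str).apply("".join, axis=1)

# fit one submodel per pattern on that pattern's observed columns only
submodels = {}
for pattern, idx in X.groupby(patterns).groups.items():
    observed = [c for c, m in zip(X.columns, pattern) if m == "0"]
    submodels[pattern] = LogisticRegression().fit(X.loc[idx, observed], y.loc[idx])
```

At prediction time, each row is routed to the submodel matching its own missingness pattern, so no imputation is needed.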
Course 6: Interpretability
Module 1: Global feature importances using the coefficients of linear models
Module 2: Permutation feature importance and algorithm-specific metrics (e.g., Gini impurity; XGBoost metrics like weight, cover, and gain)
Module 3: Local feature importance with SHAP values
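Two of the global importance measures above can be sketched on assumed toy data (SHAP is omitted because it requires the separate shap package): with standardized features, the magnitude of a linear model's coefficients ranks feature importance, and permutation importance measures how much the score drops when one feature is shuffled.

```python
# Toy sketch of global feature importances (not course code).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# the label depends strongly on feature 0, weakly on feature 2
y = (2.0 * X[:, 0] - 0.1 * X[:, 2]
     + rng.normal(scale=0.1, size=300) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# features are standard normal here, so |coef_| ranks global importance
coef_importance = np.abs(model.coef_[0])

# permutation importance: mean score drop when each feature is shuffled
perm = permutation_importance(model, X, y, n_repeats=10, random_state=0)
```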
Prerequisites
Python coding experience
Familiarity with pandas and numpy
Prior experience with scikit-learn and Matplotlib is a plus but not required