Supervised Machine Learning Series
This series is available only as part of the subscription plans.
Supervised Machine Learning is a 6-part course series that walks through every step of the classical supervised machine learning pipeline. We use Python and packages such as scikit-learn, pandas, numpy, and matplotlib. The series covers cross-validation and splitting strategies, evaluation metrics, supervised machine learning algorithms (such as linear and logistic regression, support vector machines, and tree-based methods like random forests, gradient boosting, and XGBoost), and interpretability. You can complete the courses in sequence or take individual courses based on your interests.
Andras Zsom, PhD
Module 1: Intro to Machine Learning
Module 2: Overview of linear and logistic regression with regularization
Module 3: The bias-variance tradeoff
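The regularization and bias-variance ideas above can be sketched with scikit-learn on synthetic data (the data and the `alpha` value are illustrative, not from the course): ridge regression shrinks coefficients toward zero, accepting a little bias in exchange for lower variance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data: 100 samples, 20 features, only 5 informative
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
coef = np.zeros(20)
coef[:5] = [1.5, -2.0, 1.0, 0.5, -1.0]
y = X @ coef + rng.normal(scale=0.5, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ols = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=10.0).fit(X_train, y_train)  # L2-regularized fit

# Regularization shrinks the coefficient vector, reducing variance
print("OLS coef norm:  ", np.linalg.norm(ols.coef_))
print("Ridge coef norm:", np.linalg.norm(ridge.coef_))
```

Larger `alpha` means stronger shrinkage; the right value is usually chosen with cross-validation.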
Module 1: Split IID data (train/validation/test, KFoldCV, stratified splits in classification)
Module 2: Split non-IID data (GroupKFold, TimeSeriesSplit)
Module 3: Preprocess features (OneHotEncoder and OrdinalEncoder for categorical features, StandardScaler for continuous features)
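A minimal sketch of the splitting and preprocessing workflow above, using a hypothetical DataFrame (the column names and sizes are made up): a stratified train/test split, stratified K-fold indices, and a `ColumnTransformer` that one-hot encodes the categorical column and standardizes the continuous ones, fit on the training split only to avoid leakage.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical tabular data: one categorical and two continuous features
df = pd.DataFrame({
    "color": ["red", "blue", "green", "red", "blue", "green"] * 10,
    "x1": np.random.default_rng(0).normal(size=60),
    "x2": np.random.default_rng(1).normal(size=60),
})
y = np.array([0, 1] * 30)

# Stratified split keeps the class ratio similar in train and test
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, stratify=y, random_state=0)

# 5-fold stratified CV indices, generated from the training split only
n_folds = sum(1 for _ in StratifiedKFold(n_splits=5).split(X_train, y_train))
print("folds:", n_folds)

# Fit encoders/scalers on the training split, then transform both splits
pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ("num", StandardScaler(), ["x1", "x2"]),
])
Xt_train = pre.fit_transform(X_train)
Xt_test = pre.transform(X_test)
print(Xt_train.shape)  # 3 one-hot columns + 2 scaled columns
```

Calling `fit_transform` on the training data and only `transform` on the test data is what prevents test-set information from leaking into the preprocessing step.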
Module 1: Hard predictions in classification (the confusion matrix and derived metrics such as accuracy, precision, recall, and the F-beta score)
Module 2: Working with predicted probabilities in classification (ROC curve, precision-recall curve, AUC, the logloss metric)
Module 3: Regression metrics (MSE, RMSE, MAE, R2 score)
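The classification metrics above can be sketched in a few lines of scikit-learn (the dataset and classifier are illustrative stand-ins): hard predictions feed the confusion matrix and its derived metrics, while predicted probabilities feed ROC AUC and log loss.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, log_loss,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)

# Hard predictions -> confusion matrix and derived metrics
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("accuracy: ", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))

# Predicted probabilities -> threshold-free metrics
y_proba = clf.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, y_proba))
print("logloss:", log_loss(y_test, y_proba))
```

For regression, the analogous calls are `mean_squared_error`, `mean_absolute_error`, and `r2_score` from the same `sklearn.metrics` module.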
Module 1: K-Nearest Neighbors
Module 2: Support Vector Machines (various kernels, hyperparameters, visualize predictions in simple cases with 1 or 2 features, pros and cons)
Module 3: Random Forests (CART, hyperparameters, visualize step-like predictions in simple cases with 1 or 2 features, pros and cons)
Module 4: XGBoost (hyperparameters, early stopping, missing values, pros and cons)
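A minimal comparison of the algorithm families above on a synthetic dataset (sizes and hyperparameters are illustrative). KNN and SVMs are distance/margin based, so they are wrapped in a scaling pipeline; tree ensembles are scale-invariant. XGBoost itself requires the separate `xgboost` package, so it is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    # Distance/margin-based models benefit from standardized features
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
    # Tree ensembles need no scaling
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:14s} {scores[name]:.3f}")
```

Each entry in `models` swaps in cleanly because all scikit-learn estimators share the same `fit`/`predict` interface.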
Module 1: Missing Data Patterns
Module 2: Apply the Reduced-Features Model (also called the Pattern Submodel Approach)
Module 3: How to Determine the Patterns?
Module 4: Decide Which Approach is Best for Your Dataset
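The reduced-features (pattern submodel) idea above can be sketched as follows, on made-up data with one structured missingness pattern: group rows by which columns are observed, then fit a separate model per pattern using only that pattern's observed columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 200 rows, 3 features; column "c" missing in half the rows
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y = (df["a"] + df["b"] + df["c"] > 0).astype(int)
df.loc[:99, "c"] = np.nan

# Each row's missingness pattern is the tuple of which columns are observed
patterns = df.notna().apply(tuple, axis=1)

# Reduced-features model: one submodel per pattern, trained only on the
# columns observed for that pattern (no imputation needed)
submodels = {}
for pat, idx in df.groupby(patterns).groups.items():
    cols = [c for c, observed in zip(df.columns, pat) if observed]
    submodels[pat] = LogisticRegression().fit(df.loc[idx, cols], y.loc[idx])

print("patterns found:", len(submodels))
```

At prediction time, a new row is routed to the submodel matching its own missingness pattern; with many rare patterns, this is typically restricted to the most frequent ones.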
Module 1: Global feature importances using the coefficients of linear models
Module 2: Permutation feature importance and algorithm-specific metrics (e.g., Gini impurity, XGBoost metrics like weight, cover, gain)
Module 3: Local feature importance with SHAP values
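Two of the global importance techniques above can be sketched with scikit-learn alone (SHAP requires the separate `shap` package, so it is omitted; the dataset is an illustrative stand-in): coefficients of a linear model, and model-agnostic permutation importance computed on held-out data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)

# Global importance from linear-model coefficients
# (meaningful when features are on comparable scales)
print("coefficients:", np.round(clf.coef_[0], 2))

# Permutation importance: drop in score when each feature is shuffled,
# measured on held-out data so it reflects generalization
result = permutation_importance(clf, X_test, y_test, n_repeats=10,
                                random_state=0)
print("permutation importances:", np.round(result.importances_mean, 3))
```

Permutation importance works with any fitted estimator, which is why it is a common companion to algorithm-specific measures like Gini impurity or XGBoost's gain.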
Python coding experience
Familiarity with pandas and numpy
Prior experience with scikit-learn and matplotlib is a plus but not required