Course Abstract

Training duration: 90 min (Hands-on)

Supervised Learning is a course series that walks through every step of the classical supervised machine learning pipeline. We use Python and packages such as scikit-learn, pandas, numpy, and matplotlib. The series covers cross-validation and splitting strategies, evaluation metrics, supervised machine learning algorithms (linear and logistic regression, support vector machines, and tree-based methods such as random forest, gradient boosting, and XGBoost), and interpretability. You can complete the courses in sequence or take individual courses based on your interest.

Part 4 of the series reviews four non-linear supervised machine learning algorithms: K-Nearest Neighbors, Support Vector Machines, Random Forests, and XGBoost. When you work on a project, you should generally try as many algorithms as you can on your dataset because it is difficult to know a priori which one will perform best. It is therefore important to understand how the various algorithms work, which hyperparameters need to be tuned, and what the pros and cons of each algorithm are.

We will not cover the in-depth math behind these algorithms as we did with linear and logistic regression in part 1, but you will have a solid intuitive understanding of how they work by the end of this course. We will use a couple of toy datasets and visualizations that I found helpful when learning about the properties of a new algorithm, so you will be well equipped to learn algorithms we do not cover here on your own. I will also share a few insights I have gained about these algorithms over the years that might not be obvious to new users.

DIFFICULTY LEVEL: INTERMEDIATE

Learning Objectives

  • Summarize how each algorithm works

  • Describe which hyperparameters need to be tuned and what range the values should have

  • Apply the algorithms in regression and classification

  • Visualize the predictions on toy datasets

  • Summarize under what circumstances a certain algorithm is expected to perform well or poorly and why

Instructor Bio:

Andras Zsom, PhD

Lead Data Scientist and Adjunct Lecturer in Data Science | Brown University, Center for Computation and Visualization

Andras Zsom is a Lead Data Scientist in the Center for Computation and Visualization and an Adjunct Lecturer in Data Science at Brown University, Providence, RI, USA. He works with high-level academic administrators to tackle predictive modeling problems and promote data-driven decision making, collaborates with faculty members on data-intensive research projects, and teaches a mandatory course in the Data Science Master’s curriculum.

Course Outline

Module 1: KNN 

  • General overview of why we need to train multiple algorithms on the same dataset
  • Introduce the pros and cons summary table we will fill out on each algorithm
  • Describe how KNN works
  • Walk through the hyperparameters and what the range of the values should be
  • Apply it to toy datasets in regression and classification
  • Visualize the predictions to learn about the properties of the model
  • Summarize pros and cons
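
As a rough illustration of the idea behind KNN, the sketch below predicts a new point from its k nearest training points: a majority vote for classification and an average for regression. It uses only the Python standard library rather than scikit-learn. The function name knn_predict and the toy data are hypothetical and are not taken from the course materials.

```python
from collections import Counter
import math


def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict the target of x_new from its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = [math.dist(x, x_new) for x in X_train]
    # Indices of the k closest training points
    nearest = sorted(range(len(X_train)), key=lambda i: distances[i])[:k]
    neighbor_targets = [y_train[i] for i in nearest]
    if task == "classification":
        # Majority vote among the neighbors
        return Counter(neighbor_targets).most_common(1)[0][0]
    # Regression: average of the neighbors' target values
    return sum(neighbor_targets) / k


# Hypothetical toy dataset: two well-separated clusters in 2D
X_train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
y_class = [0, 0, 0, 1, 1, 1]
y_reg = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

label = knn_predict(X_train, y_class, (0.5, 0.5), k=3)
value = knn_predict(X_train, y_reg, (5.5, 5.5), k=3, task="regression")
print(label, value)
```

The number of neighbors k is the main hyperparameter here: a small k follows the training data closely, while a large k averages over more points and gives smoother predictions.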


Module 2: SVM

  • Describe how SVM works, with a focus on the radial basis function (RBF) kernel
  • Walk through the hyperparameters and what the range of the values should be
  • Apply it to toy datasets in regression and classification
  • Visualize the predictions to learn about the properties of the model
  • Summarize pros and cons
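
To make the radial basis function concrete, here is a minimal sketch of the RBF kernel, k(x, z) = exp(-gamma * ||x - z||^2), written with the standard library only. It shows how the gamma hyperparameter controls how quickly the similarity between two points decays with distance. The helper name rbf_kernel and the example points are illustrative assumptions, not part of the course code.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


# Two hypothetical points whose squared distance is 2
x, z = (0.0, 0.0), (1.0, 1.0)

# Larger gamma means the similarity drops off faster with distance
for gamma in (0.01, 1.0, 100.0):
    print(f"gamma={gamma:g}: k(x, z) = {rbf_kernel(x, z, gamma):.4g}")
```

With a small gamma, distant points still look similar to each other and the resulting model tends to be smooth. With a large gamma, each training point influences only its immediate neighborhood, which makes overfitting more likely.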


Module 3: RF

  • Describe how RF works, starting with classification and regression trees (CARTs)
  • Walk through the hyperparameters and what the range of the values should be
  • Apply it to toy datasets in regression and classification
  • Visualize the predictions to learn about the properties of the model
  • Summarize pros and cons
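
A single CART split can be sketched as follows: for one feature, try each candidate threshold and keep the one with the lowest weighted Gini impurity. This is a simplified illustration that assumes one numeric feature and a classification target. The function names and the toy data are hypothetical, and a random forest additionally trains many such trees on bootstrap samples with random feature subsets.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best split on one feature."""
    best = (None, float("inf"))
    unique = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive unique values
    for lo, hi in zip(unique, unique[1:]):
        threshold = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        # Impurity of the two children, weighted by their sizes
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


# Hypothetical toy feature where the classes separate cleanly
feature = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
target = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(feature, target)
print(threshold, impurity)
```

A full tree repeats this search recursively on each child node until a stopping rule, such as a maximum depth, is reached.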


Module 4: XGBoost

  • Describe how XGBoost works and contrast it with other tree-based models
  • Walk through the hyperparameters and what the range of the values should be
  • Apply it to toy datasets in regression and classification
  • Visualize the predictions to learn about the properties of the model
  • Summarize pros and cons
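
The contrast with random forests can be illustrated with a bare-bones gradient boosting loop for regression: trees are built one after another, and each new tree is fit to the residuals of the current ensemble. This sketch is not XGBoost itself, which adds regularization and other refinements. It uses one-split trees (stumps) and the standard library only, and every name and data value in it is a hypothetical choice made for illustration.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree (stump) to the residuals."""
    best = None
    unique = sorted(set(x))
    for lo, hi in zip(unique, unique[1:]):
        t = (lo + hi) / 2
        left = [r for v, r in zip(x, residuals) if v <= t]
        right = [r for v, r in zip(x, residuals) if v > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Sum of squared errors if each side predicts its own mean
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda v: left_mean if v <= t else right_mean


def boost(x, y, n_rounds=50, learning_rate=0.3):
    """Sequentially add stumps, each fit to the current residuals."""
    base = sum(y) / len(y)  # start from the mean of the target
    stumps = []
    preds = [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Shrink each stump's contribution by the learning rate
        preds = [pi + learning_rate * stump(xi) for pi, xi in zip(preds, x)]
    return lambda v: base + learning_rate * sum(s(v) for s in stumps)


def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy dataset with a step-like, non-linear shape
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9, 6.0, 6.2]

model = boost(x, y)
mse_before = mse(y, [sum(y) / len(y)] * len(y))
mse_after = mse(y, [model(v) for v in x])
print(mse_before, mse_after)
```

The number of rounds and the learning rate play the role of the boosting hyperparameters in this sketch: more rounds and a larger learning rate fit the training data more closely, so in practice they are tuned together on validation data.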

Background knowledge

  • Python coding experience

  • Familiarity with pandas and numpy

  • Prior experience with scikit-learn and matplotlib is a plus but not required

Applicable Use-cases

  • The dataset can be expressed as a 2D feature matrix with the columns as features and the rows as data points

  • A continuous or categorical target variable exists

  • Examples include, but are not limited to, fraud detection, predicting whether patients have a certain illness, predicting the selling or rental price of properties, and predicting customer satisfaction