Get Ahead with Expert-Led Training in Supervised and Unsupervised Machine Learning

Supervised and Unsupervised Machine Learning

Supervised Machine Learning is a 6-part course series that walks through all steps of the classical supervised machine learning pipeline. We use Python and packages like scikit-learn, pandas, numpy, and matplotlib. The course series focuses on topics like cross-validation and splitting strategies, evaluation metrics, supervised machine learning algorithms (like linear and logistic regression, support vector machines, and tree-based methods like random forests, gradient boosting, and XGBoost), and interpretability.


 

Unsupervised Machine Learning is a 3-part course series that provides a foundational understanding of one of the major branches of machine learning: unsupervised learning. Most of the world’s data is unlabeled, and applying machine learning to this unlabeled data to solve real-world problems is one of the great challenges of artificial intelligence.

We will show why unsupervised learning is so critical to working with data, especially when the data is not only unlabeled but also very large in scale and volume. We will compare unsupervised learning with supervised learning and later combine the two approaches to develop semi-supervised learning solutions.

This is an applied course, and we will use two simple, production-ready Python frameworks to develop unsupervised learning solutions: scikit-learn and TensorFlow. We will also use pandas, numpy, matplotlib, and other common data science packages.

Using unsupervised learning, we will discover meaningful patterns buried deep in data, patterns that may be nearly impossible for humans to find. We will use unsupervised learning to detect anomalies, perform group segmentation, develop recommender systems, and generate synthetic data such as text and images.

The course series focuses on topics such as dimensionality reduction (principal component analysis, singular value decomposition, random projection, isomap, multidimensional scaling, locally linear embedding, t-SNE, dictionary learning, and independent component analysis), clustering (k-means, hierarchical clustering, DBSCAN, and HDBSCAN), autoencoders, restricted Boltzmann machines, deep belief networks, generative adversarial networks, and time series clustering.


Instructors

Lead Data Scientist and Adjunct Lecturer in Data Science | Brown University, Center for Computation and Visualization

Andras Zsom, PhD

Andras Zsom is a Lead Data Scientist in the Center for Computation and Visualization and an Adjunct Lecturer in Data Science at Brown University, Providence, RI, USA. He works with high-level academic administrators to tackle predictive modeling problems and promote data-driven decision making, collaborates with faculty members on data-intensive research projects, and teaches a mandatory course in the Data Science Master’s curriculum.

Co-founder and Head of Data | Glean

Ankur Patel

Ankur Patel is the co-founder and Head of Data at Glean. Glean uses NLP to extract data from invoices and generate vendor spend intelligence for clients. Ankur is an applied machine learning specialist in both unsupervised learning and natural language processing, and he is the author of Hands-on Unsupervised Learning Using Python: How to Build Applied Machine Learning Solutions from Unlabeled Data and Applied Natural Language Processing in the Enterprise: Teaching Machines to Read, Write, and Understand. Previously, Ankur led teams at 7Park Data, ThetaRay, and R-Squared Macro and began his career at Bridgewater Associates and J.P. Morgan. He is a graduate of Princeton University and currently resides in New York City.

Course Outlines

Supervised Machine Learning

Supervised Learning 1: Introduction to Machine Learning and the Bias-Variance Tradeoff

Module 1: Intro to Machine Learning

Module 2: Overview of linear and logistic regression with regularization

Module 3: The bias-variance tradeoff
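
As a rough illustration of how regularization strength moves a model along the bias-variance tradeoff, the sketch below fits ridge regression on polynomial features of a noisy sine curve for several values of `alpha`. The synthetic data, the polynomial degree, and the `alpha` grid are assumptions chosen for illustration, not material taken from the course.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical toy data: a noisy sine curve with one feature.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=80)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Small alpha: flexible model (low bias, high variance).
# Large alpha: heavily shrunk model (high bias, low variance).
alphas = [1e-4, 1e-2, 1.0, 100.0, 1e4]
train_mse, val_mse = [], []
for alpha in alphas:
    model = make_pipeline(
        PolynomialFeatures(degree=10, include_bias=False),
        StandardScaler(),
        Ridge(alpha=alpha),
    )
    model.fit(X_train, y_train)
    train_mse.append(mean_squared_error(y_train, model.predict(X_train)))
    val_mse.append(mean_squared_error(y_val, model.predict(X_val)))

for alpha, tr, va in zip(alphas, train_mse, val_mse):
    print(f"alpha={alpha:g}  train MSE={tr:.3f}  validation MSE={va:.3f}")
```

Training error can only grow as `alpha` increases, while validation error is typically lowest somewhere in between; that intermediate value is the one a validation set is meant to find.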


Supervised Learning 2: How to Prepare your Data for Supervised Machine Learning

Module 1: Split IID data (train/validation/test, KFoldCV, stratified splits in classification)

Module 2: Split non-IID data (GroupKFold, TimeSeriesSplit)

Module 3: Preprocess features (OneHotEncoder and OrdinalEncoder for categorical features, StandardScaler for continuous features)
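
The splitters and encoders named in these modules could be combined roughly as follows. This is a minimal sketch on a made-up dataset: the column names, the group structure, and the fold counts are hypothetical, and the course may organize these steps differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import (
    GroupKFold,
    StratifiedKFold,
    TimeSeriesSplit,
    train_test_split,
)
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical toy dataset; column names are made up for illustration.
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),                      # continuous
    "color": rng.choice(["red", "green", "blue"], n),  # unordered categorical
    "size": rng.choice(["S", "M", "L"], n),            # ordered categorical
})
y = np.array([0, 1] * (n // 2))        # balanced binary target
groups = np.repeat(np.arange(12), 5)   # e.g. 12 subjects with 5 rows each

# IID data: stratified hold-out test set, then stratified K-fold on the rest.
X_other, X_test, y_other, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=0
)
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
iid_folds = list(skf.split(X_other, y_other))

# Non-IID data: keep each group inside one fold, or respect time order.
group_folds = list(GroupKFold(n_splits=4).split(df, y, groups=groups))
time_folds = list(TimeSeriesSplit(n_splits=4).split(df))

# Preprocessing: fit on the training fold only, then transform both folds.
preprocessor = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ("ordinal", OrdinalEncoder(categories=[["S", "M", "L"]]), ["size"]),
    ("scale", StandardScaler(), ["age"]),
])
train_idx, val_idx = iid_folds[0]
X_train_prep = preprocessor.fit_transform(X_other.iloc[train_idx])
X_val_prep = preprocessor.transform(X_other.iloc[val_idx])
```

Fitting the preprocessor on the training fold alone keeps statistics such as the mean and standard deviation of the validation rows from leaking into training.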


Supervised Learning 3: Evaluation Metrics in Supervised Machine Learning

Module 1: Hard predictions in classification (the confusion matrix and derived metrics such as accuracy, precision, recall, f_beta score)

Module 2: Working with predicted probabilities in classification (ROC curve, precision-recall curve, AUC, the logloss metric)

Module 3: Regression metrics (MSE, RMSE, MAE, R2 score)
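
The metrics listed in these modules are all available in `sklearn.metrics`. The sketch below computes them on small hand-made arrays; the labels, probabilities, the 0.5 threshold, and `beta=2` are assumptions picked for illustration.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    fbeta_score,
    log_loss,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.45, 0.7, 0.9, 0.6])

# Hard predictions require a threshold; 0.5 is an arbitrary choice here.
y_pred = (y_prob >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f2 = fbeta_score(y_true, y_pred, beta=2)  # beta > 1 weights recall more

# Metrics that use the predicted probabilities directly.
auc = roc_auc_score(y_true, y_prob)
logloss = log_loss(y_true, y_prob)

# Regression metrics on hypothetical targets and predictions.
y_reg_true = np.array([3.0, 5.0, 2.5, 7.0])
y_reg_pred = np.array([2.5, 5.0, 4.0, 8.0])
mse = mean_squared_error(y_reg_true, y_reg_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_reg_true, y_reg_pred)
r2 = r2_score(y_reg_true, y_reg_pred)
```

Changing the threshold changes the confusion matrix and every metric derived from it, while the ROC AUC and log loss are unaffected because they are computed from the probabilities.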


Supervised Learning 4: Non-linear Supervised Machine Learning Algorithms

Module 1: K-Nearest Neighbors

Module 2: Support Vector Machines (various kernels, hyperparameters, visualize predictions in simple cases with 1 or 2 features, pros and cons)

Module 3: Random Forests (CART, hyperparameters, visualize step-like predictions in simple cases with 1 or 2 features, pros and cons)

Module 4: XGBoost (hyperparameters, early stopping, missing values, pros and cons)
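
To give a feel for how these non-linear models behave on a problem with two features, the sketch below cross-validates three of them on scikit-learn's synthetic "two moons" data. The dataset and the hyperparameter values are assumptions, and XGBoost is left out because it is a separate package.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half circles: not separable by a straight line.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# KNN and SVM are distance-based, so features are standardized first.
# Tree-based models do not need scaling.
models = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "svm_rbf": make_pipeline(
        StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=100, max_depth=5, random_state=0
    ),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```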


Supervised Learning 5: Missing Data in Supervised ML

Module 1: Missing Data Patterns

Module 2: Apply the Reduced-Features Model (also called the Pattern Submodel Approach)

Module 3: How to Determine the Patterns?

Module 4: Decide Which Approach is Best for Your Dataset
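
One way to read the reduced-features (pattern submodel) idea is: find each distinct pattern of missing features, and for each pattern train a model that uses only the features observed under it. The sketch below follows that reading. The helper names `fit_pattern_submodels` and `predict_pattern_submodels`, the linear model, and the synthetic data are hypothetical, and the course's own implementation may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def fit_pattern_submodels(X, y):
    """Fit one model per missingness pattern, using only observed features."""
    mask = np.isnan(X)
    models = {}
    for pattern in np.unique(mask, axis=0):
        observed = ~pattern
        if not observed.any():
            continue  # no features available for this pattern
        # Train on every row in which all of this pattern's observed
        # features are present (other columns may or may not be missing).
        rows = ~mask[:, observed].any(axis=1)
        model = LinearRegression().fit(X[rows][:, observed], y[rows])
        models[tuple(pattern)] = model
    return models


def predict_pattern_submodels(models, X):
    """Route each row to the submodel that matches its missingness pattern."""
    mask = np.isnan(X)
    y_pred = np.full(X.shape[0], np.nan)  # stays NaN if no submodel matches
    for pattern, model in models.items():
        pattern = np.array(pattern)
        rows = (mask == pattern).all(axis=1)
        if rows.any():
            y_pred[rows] = model.predict(X[rows][:, ~pattern])
    return y_pred


# Hypothetical data: a noise-free linear target, then values knocked out.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2]
X[rng.random(200) < 0.3, 2] = np.nan
X[rng.random(200) < 0.2, 1] = np.nan

models = fit_pattern_submodels(X, y)
y_pred = predict_pattern_submodels(models, X)
```

No value is ever imputed here: rows with a missing feature are simply handled by a model that never saw that feature.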


Supervised Learning 6: Interpretability

Module 1: Global feature importance using the coefficients of linear models

Module 2: Permutation feature importance and algorithm-specific metrics (e.g., gini impurity, XGBoost metrics like weight, cover, gain)

Module 3: Local feature importance with SHAP values
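
Two of the global importance measures named above can be sketched with scikit-learn alone: the magnitude of linear-model coefficients on standardized features, and permutation importance on a held-out set. The synthetic data and feature names are assumptions; SHAP is not shown because it requires a separate package.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one strong feature, one weak feature, one pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=500)
feature_names = ["strong", "weak", "irrelevant"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize so that coefficient magnitudes are comparable across features.
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)
coef_importance = np.abs(model.coef_)

# Permutation importance: the drop in test score when one column is shuffled.
result = permutation_importance(
    model, scaler.transform(X_test), y_test, n_repeats=10, random_state=0
)
perm_importance = result.importances_mean

for name, c, p in zip(feature_names, coef_importance, perm_importance):
    print(f"{name}: |coef|={c:.3f}  permutation importance={p:.3f}")
```

Coefficients are only meaningful as importances for linear models with comparable feature scales, whereas permutation importance can be applied to any fitted estimator.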

Unsupervised Machine Learning

Unsupervised Learning 1: Intro to Unsupervised Learning, Dimensionality Reduction, and Anomaly Detection

Module 1: Introduction to Unsupervised Learning

  • How unsupervised learning fits into the machine learning ecosystem
  • Common problems in machine learning: insufficient labeled data, the curse of dimensionality, and outliers

Module 2: Introduction to Dimensionality Reduction

  • Motivation for dimensionality reduction: reduce the computational complexity of large data, remove non-relevant information and surface salient information, perform anomaly detection, perform clustering
  • Linear dimensionality reduction algorithms
  • Non-linear dimensionality reduction algorithms

Module 3: Application: Anomaly Detection

  • Introduce use case: credit card fraud detection
  • Explore and prepare the data
  • Define evaluation function
  • Apply linear dimensionality reduction and evaluate results
  • Apply non-linear dimensionality reduction and evaluate results
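
A common way to use linear dimensionality reduction for anomaly detection is to score each point by its reconstruction error after projecting onto a few principal components. The sketch below applies this to synthetic data rather than credit card transactions; the data generator, the number of components, and the use of average precision as the evaluation function are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import StandardScaler

# Hypothetical data: normal points lie near a 2-D plane embedded in 10-D,
# while the rare anomalies are scattered in all directions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 2))
mixing = rng.normal(size=(2, 10))
X_normal = latent @ mixing + 0.05 * rng.normal(size=(1000, 10))
X_anomalous = rng.normal(scale=3.0, size=(20, 10))
X = np.vstack([X_normal, X_anomalous])
y = np.r_[np.zeros(1000), np.ones(20)]  # labels used only for evaluation

# Fit PCA without labels, then reconstruct every point from 2 components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2, random_state=0).fit(X_scaled)
X_reconstructed = pca.inverse_transform(pca.transform(X_scaled))

# Anomaly score: squared reconstruction error per row.
anomaly_scores = ((X_scaled - X_reconstructed) ** 2).sum(axis=1)

# Evaluation function (an assumption): average precision of the ranking.
avg_precision = average_precision_score(y, anomaly_scores)
print(f"average precision: {avg_precision:.3f}")
```

Points that the low-dimensional representation cannot reconstruct well are the ones that deviate from the dominant structure of the data, which is why they are ranked as likely anomalies.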


Unsupervised Learning 2: Clustering and Group Segmentation

Module 1: Introduction to Clustering

  • Why the need for clustering exists: the real-world motivation
  • How to find patterns in data with zero or few labels
  • How to efficiently label data when only a few labels are available


Module 2: Overview of Clustering Algorithms

  • K-Means
  • Hierarchical clustering
  • DBSCAN
  • HDBSCAN
  • Apply to MNIST and Fashion MNIST datasets
  • Visualize clusters and evaluate results
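
Three of the clustering algorithms listed above can be compared in a few lines with scikit-learn. The sketch below uses synthetic blobs as a stand-in for MNIST so that it runs without downloading data; the hyperparameters and the adjusted Rand index as the evaluation measure are assumptions, and HDBSCAN is omitted because its availability depends on the installed packages.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset: three well-separated blobs.
X, y_true = make_blobs(
    n_samples=300, centers=3, cluster_std=0.4, random_state=0
)
X = StandardScaler().fit_transform(X)

# K-means and hierarchical clustering need the number of clusters up front;
# DBSCAN infers it from density and labels sparse points as noise (-1).
cluster_labels = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    "hierarchical": AgglomerativeClustering(
        n_clusters=3, linkage="ward"
    ).fit_predict(X),
    "dbscan": DBSCAN(eps=0.3, min_samples=5).fit_predict(X),
}

# The true labels are used only to evaluate, never to fit.
ari = {
    name: adjusted_rand_score(y_true, labels)
    for name, labels in cluster_labels.items()
}
for name, score in ari.items():
    print(f"{name}: adjusted Rand index = {score:.3f}")
```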


Module 3: Application: Group Segmentation

  • Introduce use case: loan applications
  • Explore and prepare the data
  • Define evaluation function
  • Apply clustering algorithms and evaluate results



Unsupervised Learning 3: Deep Unsupervised Learning, Semi-supervised Learning, and Generative Models

Module 1: Introduction to Deep Unsupervised Learning

  • Motivation for representation learning and refresher on neural networks 
  • Compare shallow vs. deep learning and deep learning vs. classical machine learning
  • Explore use cases of deep unsupervised learning today


Module 2: Semi-supervised Learning

  • Intro to automatic feature extraction and autoencoders, including comparison of autoencoders to dimensionality reduction and an overview of complete, undercomplete, and overcomplete autoencoders
  • Intro to semi-supervised learning using autoencoders and how supervised and unsupervised learning complement each other
  • Develop semi-supervised fraud detection application using autoencoders
  • Compare the unsupervised, supervised, and semi-supervised solutions and evaluate results


Module 3: Generative Modeling

  • Intro to generative modeling, including restricted Boltzmann machines (RBMs), deep belief networks (DBNs), and generative adversarial networks (GANs)
  • Deep dive into GANs, including how a generator and a discriminator work together to produce synthetic data
  • Frame how generative modeling and GANs fit into the overall space of unsupervised learning
  • Demonstration of GANs in action using code to produce synthetic data

Background Knowledge

  • Python coding experience

  • Familiarity with pandas, numpy, and scikit-learn

  • Prior experience with matplotlib is a plus but not required

  • Understanding of basic machine learning concepts, including supervised learning

  • Experience with deep learning and frameworks such as TensorFlow or PyTorch is a plus

Applicable Use-cases

  • Fraud Detection: Identify fraud in transactional data such as credit card, ACH, wire, and insurance claims

  • Cybersecurity: Stop malicious activity such as hacking

  • Anti-money Laundering: Detect potential money laundering for banks

  • Machine Maintenance: Monitor sensor data to detect when machines are starting to malfunction

  • Disease Diagnosis: Spot potential disease using healthcare IoT sensor data

  • The dataset can be expressed as a 2D feature matrix with the columns as features and the rows as data points

  • Dynamic pricing for marketing campaigns for any goods or services relies on pricing data. Airlines and ride-share services have successfully implemented dynamic price optimization strategies using supervised learning

  • Fraud detection, predicting whether patients have a certain illness, predicting the selling or rental price of properties, and predicting customer satisfaction