Course Abstract

Training duration: 90 minutes

Gradient Boosted Trees have become a widely used method for prediction on structured data. They often provide the strongest predictive performance, but are sometimes criticized for being "difficult to interpret". To some degree, this criticism is misdirected: rather than being uninterpretable, these models simply have more complicated interpretations, reflecting a more sophisticated picture of how the variables interact. In this course, we will work hands-on using XGBoost with real-world data sets to demonstrate how to approach a data set with the twin goals of prediction and understanding, in such a way that improvements in one yield improvements in the other. Using modern tooling such as Individual Conditional Expectation (ICE) plots and SHAP, as well as a sense of curiosity, we will extract insights that could not be gained from simpler methods.


Learning Objectives

  • How to approach data exploration

  • How to assess the "coherence" of a model

  • How to interpret complicated models (such as those produced by Gradient Boosting or Random Forests)

  • How to ascribe reasons to individual predictions


Instructor Bio:

Principal | Numeristical

Brian Lucena, PhD

Brian Lucena is Principal at Numeristical and the creator of StructureBoost, ML-Insights, and SplineCalib. His mission is to enhance the understanding and application of modern machine learning and statistical techniques. He does this through academic research, open-source software development, and educational content such as live stream classes and interactive Jupyter notebooks. Additionally, he consults for organizations of all sizes from small startups to large public enterprises. In previous roles, he has served as SVP of Analytics at PCCI, Principal Data Scientist at Clover Health, and Chief Mathematician at Guardian Analytics. He has taught at numerous institutions including UC-Berkeley, Brown, USF, and the Metis Data Science Bootcamp.


Course Outline

Module 1: Understanding the overall dynamics of your data and your model

- Using sophisticated modeling packages (like XGBoost) to understand more complicated dynamics in the data 

- How to approach data exploration to understand more complicated relationships between the variables in your data 

- Why the "coherence" of a model is important - arguably, on the same level as its predictive performance 

- How to assess the "coherence" of a model using ICE plots 
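
As a rough illustration of the ICE idea named above, the sketch below computes ICE curves by hand: for each row, one feature is swept over a grid while the others are held fixed. The `predict` function is a hypothetical stand-in for a fitted model (in the course this would be an XGBoost model), and `ice_curves` is an illustrative helper, not part of any library.

```python
def predict(row):
    # Hypothetical toy model standing in for a fitted XGBoost model.
    # The interaction term makes the effect of x1 depend on x2.
    x1, x2 = row
    return 2.0 * x1 + 3.0 * x1 * x2


def ice_curves(predict_fn, rows, feature_idx, grid):
    """For each row, vary one feature over `grid`, holding the others fixed."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = list(row)           # copy so the original row is untouched
            modified[feature_idx] = value  # overwrite only the feature of interest
            curve.append(predict_fn(modified))
        curves.append(curve)
    return curves


rows = [[0.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
grid = [0.0, 0.5, 1.0]
curves = ice_curves(predict, rows, feature_idx=0, grid=grid)

# Averaging the ICE curves pointwise gives the partial dependence curve.
pd_curve = [sum(col) / len(col) for col in zip(*curves)]

for row, curve in zip(rows, curves):
    print(row, curve)
print("partial dependence:", pd_curve)
```

In this toy case the three curves have different slopes, one of them negative, while their average is a single rising line. Looking at the individual curves rather than only the average is what exposes the interaction.
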

Module 2: Understanding and explaining individual predictions from the model 

- How to ascribe "reasons" to individual predictions 

- How to "consolidate" features to make the reasons more coherent and understandable 

- Using visualizations, both built independently and from the SHAP package
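
To make "reasons" for a single prediction concrete, here is a minimal sketch of exact Shapley-value attribution on a toy model. It is not the SHAP package's algorithm: it assumes a single all-zeros baseline row instead of a background data set, and it enumerates every feature ordering, which is only feasible for a handful of features. The names `predict` and `shapley_values` are hypothetical, and summing attributions is just one simple way to "consolidate" related features.

```python
from itertools import permutations


def predict(row):
    # Hypothetical toy model: x1 acts alone, x2 and x3 act only together.
    x1, x2, x3 = row
    return 2.0 * x1 + x2 * x3


def shapley_values(predict_fn, row, baseline):
    """Average each feature's marginal contribution over all orderings."""
    n = len(row)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)   # start from the baseline row
        prev = predict_fn(current)
        for j in order:
            current[j] = row[j]    # switch feature j to its actual value
            new = predict_fn(current)
            totals[j] += new - prev
            prev = new
    return [t / len(orderings) for t in totals]


row = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]  # assumed reference point
phi = shapley_values(predict, row, baseline)

# The attributions sum to the gap between this prediction and the baseline's.
gap = predict(row) - predict(baseline)

# Consolidating x2 and x3 into one "reason" by summing their attributions.
consolidated = {"x1": phi[0], "x2_and_x3": phi[1] + phi[2]}

print(phi, gap, consolidated)
```

Here the product term is split evenly between `x2` and `x3`, so reporting them as one consolidated reason reads more naturally than two separate ones.
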

Background knowledge

  • Background in Python, NumPy, Pandas, and Scikit-learn

Real-world applications

  • Explaining how your model works to your boss or a business stakeholder in any industry.

  • Data scientists can gain insights about the "real world" from finance, healthcare, and other sensitive data, and from the models they have built on it.

  • Data science applications can give meaningful reasons for the predictions they make.