Course Abstract

Training duration: 4 hours (Hands-on)

The Association of Certified Fraud Examiners (ACFE) consistently estimates that organizations lose approximately 5% of their revenues to fraud. Based on world GDP estimates, this would be anywhere from $3-4 trillion annually. Fraud is one of the most interesting problems to try to solve because the people in your data are trying not to be found. Data science techniques are now at the forefront of this industry, helping organizations fight criminals. This course outlines the typical fraud framework at an organization and where data science can play a role. It also lays out how to build an analytically advanced fraud system at an organization. Moving beyond simple rules and anomaly detection, the supervised and unsupervised approaches to fraud modeling covered here will help an organization combat the ever-present problem of fraud. These fraud modeling approaches can also be used in other industries to help organizations find unique customers or problems that might exist in their current systems.

DIFFICULTY LEVEL: INTERMEDIATE

Learning Objectives

  • Use an insurance fraud data set throughout the course to solidify the concepts.

  • Use network analysis to create good features for fraud models, such as centrality and connectivity.

  • Properly oversample or undersample a rare event data set as well as use synthetic sampling techniques like SMOTE.

  • Build a supervised fraud classification model using one of the following: logistic regression, tree-based algorithms, or naive Bayes models.

  • Build a supervised NOT-fraud classification model using one of the above techniques.

  • Interpret a complicated model using LIME.

Instructor Bio:

Aric LaBarr, PhD

Associate Professor of Analytics | Institute for Advanced Analytics at NC State University

A Teaching Associate Professor in the Institute for Advanced Analytics, Dr. Aric LaBarr is passionate about helping people solve challenges using their data. At the Institute, home of the nation's first Master of Science in Analytics degree program, he helps design the innovative program that prepares a modern workforce to communicate wisely and handle a data-driven future. He teaches courses in predictive modeling, forecasting, simulation, financial analytics, and risk management. Previously, he was Director and Senior Scientist at Elder Research, where he mentored and led a team of data scientists and software engineers. As director of the Raleigh, NC office, he worked closely with clients and partners to solve problems in the fields of banking, consumer product goods, healthcare, and government. Dr. LaBarr holds a B.S. in economics, as well as a B.S., M.S., and Ph.D. in statistics, all from NC State University.

Course Outline

1. Review of Fraud 

Lesson 1.1 The Problem of Fraud - How can we analytically define fraud? There are important characteristics of fraud that put the modeling and identification of fraud into better perspective.

Lesson 1.2 Detection and Prevention - The two biggest pieces that any holistic fraud solution should have are detection of previous instances of fraud and prevention of new instances. This section also defines the typical fraud identification process in organizations.

Lesson 1.3 Analytical Solution - Now that we know what fraud is, as well as the organizational structure for dealing with it, we introduce the analytical approaches that help an organization become mature at detecting and preventing fraud.


2. Data Preparation

Lesson 2.1 Review of Feature Engineering - The best way to glean information from data is to develop good features to help detect and identify fraud. We discuss and develop strategies for building good features for anomaly detection. We also briefly review RFM (recency, frequency, monetary) features and categorical feature creation.
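As a rough illustration of the RFM idea, the sketch below computes recency (days since the last event), frequency (number of events), and monetary value (total amount) per customer. The `rfm_features` function and the sample claims are hypothetical and are not taken from the course data set.

```python
from datetime import date

def rfm_features(transactions, as_of):
    """Return {customer_id: (recency_days, frequency, monetary)}.

    transactions: iterable of (customer_id, date, amount) tuples.
    """
    features = {}
    for customer_id, when, amount in transactions:
        last_seen, count, total = features.get(customer_id, (None, 0, 0.0))
        if last_seen is None or when > last_seen:
            last_seen = when
        features[customer_id] = (last_seen, count + 1, total + amount)
    # Convert the last activity date into days since that activity.
    return {
        cid: ((as_of - last_seen).days, count, total)
        for cid, (last_seen, count, total) in features.items()
    }

# Hypothetical claims: (policyholder, claim date, claim amount).
claims = [
    ("A", date(2023, 1, 10), 1200.0),
    ("A", date(2023, 3, 5), 800.0),
    ("B", date(2023, 2, 20), 5000.0),
]
rfm = rfm_features(claims, as_of=date(2023, 3, 31))
```

Here `rfm["A"]` holds the three values for policyholder A, which can then be fed to a model as ordinary numeric features.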

Lesson 2.2 Introduction of Network Approaches - When generating features, we can also incorporate ideas from network analysis into our modeling framework. Who people are connected to could play a major role in detecting instances of fraud as well as complex fraud rings.
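A minimal sketch of two network features, assuming the links between entities are available as pairs. Degree centrality measures how connected a node is, and the size of a node's connected component hints at possible rings. The `network_features` function and the sample links are hypothetical.

```python
from collections import defaultdict, deque

def network_features(edges):
    """Return {node: {"degree_centrality": float, "component_size": int}}."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    features = {}
    seen = set()
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component = []
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        for node in component:
            features[node] = {
                # Degree centrality: share of the other nodes this node touches.
                "degree_centrality": len(neighbors[node]) / (n - 1) if n > 1 else 0.0,
                "component_size": len(component),
            }
    return features

# Hypothetical links: claimants who share a phone number, address, or repair shop.
links = [("c1", "c2"), ("c2", "c3"), ("c1", "c3"), ("c4", "c5")]
net = network_features(links)
```

In practice a graph library would likely be used for richer measures; the point here is only that connectivity can be turned into per-observation model inputs.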

Lesson 2.3 Obtaining Labeled Data - The hardest part about modeling fraud is obtaining labeled cases of fraud. In this section we will talk about using anomaly models, subject matter experts, and/or unsupervised techniques to obtain labels for suspected fraud.
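One simple way an anomaly model could shortlist cases for expert review is a modified z-score based on the median and the median absolute deviation (MAD). This is only a sketch under that assumption. The 3.5 cutoff is a common rule of thumb rather than a course recommendation, and the sample amounts are made up.

```python
import statistics

def modified_z_scores(values):
    """Modified z-score: 0.6745 * (x - median) / MAD, robust to extreme values."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]

# Hypothetical claim amounts; the last one is far from the rest.
amounts = [100, 120, 110, 115, 105, 2000]
z = modified_z_scores(amounts)
# Indices whose score exceeds the cutoff become candidates for expert labeling.
flagged = [i for i, score in enumerate(z) if abs(score) > 3.5]
```

Flagged cases are suspected fraud only; a subject matter expert would still need to confirm them before they are used as labels.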

Lesson 2.4 Sampling Concerns - Fraud is typically, and hopefully, a rare event at a company. However, this poses problems for modeling. In this section we cover oversampling and undersampling as ways to address the rare-event modeling problem. We also introduce the Synthetic Minority Over-sampling Technique (SMOTE).
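The three resampling ideas can be sketched in plain Python as follows. The function names are hypothetical, and `smote_like` is a simplified illustration of SMOTE's interpolation step (a new point placed between a minority point and one of its nearest minority neighbors), not a replacement for a tested library implementation.

```python
import random

def undersample(majority, minority, rng):
    """Keep all minority rows and an equally sized random subset of majority rows."""
    return rng.sample(majority, len(minority)) + list(minority)

def oversample(majority, minority, rng):
    """Keep all majority rows and resample minority rows with replacement."""
    return list(majority) + rng.choices(minority, k=len(majority))

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic minority points by interpolating toward near neighbors."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        # Sort the other minority points by squared distance to the base point.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic

rng = random.Random(0)
# Hypothetical data: 95 legitimate rows and 5 fraud rows, two numeric features each.
legit = [(float(i), float(i % 7)) for i in range(95)]
fraud = [(100.0, 1.0), (102.0, 2.0), (101.0, 3.0), (103.0, 1.5), (104.0, 2.5)]
balanced_down = undersample(legit, fraud, rng)
balanced_up = oversample(legit, fraud, rng)
synthetic = smote_like(fraud, n_new=10, k=3, rng=rng)
```

Resampling is normally applied to the training data only, so that evaluation still reflects the true rarity of fraud.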


3. Supervised Fraud Models

Lesson 3.1 Classification Scoring - This section reviews the concepts of classification models and how they are used to rank and score observations for fraud.
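To make the ranking idea concrete, the sketch below sorts observations by a model score and checks how many of the top-ranked cases are known fraud. The scores, claim ids, and function names are hypothetical.

```python
def rank_by_score(scores):
    """Return observation ids sorted from most to least suspicious."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [obs for obs, _ in ordered]

def precision_at_k(ranked_ids, fraud_ids, k):
    """Share of the top-k ranked observations that are known fraud."""
    top = ranked_ids[:k]
    return sum(1 for obs in top if obs in fraud_ids) / len(top)

# Hypothetical model scores and known fraud cases.
scores = {"claim_1": 0.91, "claim_2": 0.12, "claim_3": 0.67,
          "claim_4": 0.05, "claim_5": 0.80}
known_fraud = {"claim_1", "claim_3"}
ranked = rank_by_score(scores)
top2_precision = precision_at_k(ranked, known_fraud, k=2)
```

A measure like this is useful when investigators can only review a fixed number of cases, since it focuses on the top of the ranked list.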

Lesson 3.2 Logistic Regression - This section reviews the concept of logistic regression, which is a more statistically based and interpretable model for fraud detection.

Lesson 3.3 Tree-Based Algorithms - This section covers the concepts of tree-based models. It starts by focusing on decision trees and their generalization to random forests. We then introduce the concepts of gradient boosting approaches without going into too much mathematical detail.
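Part of what makes logistic regression interpretable is that each coefficient maps to an odds ratio. The sketch below scores one observation and converts coefficients to odds ratios. The intercept, coefficients, and feature names are invented for illustration and do not come from a fitted model.

```python
import math

def fraud_probability(intercept, coefficients, features):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_j * x_j))))."""
    log_odds = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))

def odds_ratios(coefficients):
    """exp(b_j): multiplicative change in the odds for a one-unit increase in x_j."""
    return {name: math.exp(beta) for name, beta in coefficients.items()}

# Hypothetical fitted values, for illustration only.
intercept = -3.0
coefficients = {"claims_last_year": 0.7, "shares_address_with_flagged": 1.2}
p = fraud_probability(
    intercept,
    coefficients,
    {"claims_last_year": 2, "shares_address_with_flagged": 1},
)
ratios = odds_ratios(coefficients)
```

With these made-up numbers, each additional claim in the last year roughly doubles the odds of fraud, which is the kind of statement an investigator can act on.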
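A decision tree is built from repeated splits like the one sketched here, which searches a single numeric feature for the threshold with the lowest weighted Gini impurity. Random forests and gradient boosting combine many trees built from such splits. The data and function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    best = (None, gini(labels))
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical claim amounts and fraud labels (1 = fraud).
claim_amounts = [200, 350, 400, 5000, 7500, 9000]
is_fraud = [0, 0, 0, 1, 0, 1]
threshold, impurity = best_split(claim_amounts, is_fraud)
```

On this toy data the best split separates the three small claims, none of which are fraud, from the three large ones.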

Lesson 3.4 Naive Bayes Model - The naive Bayes model is a great model to use for fraud detection. This section introduces the main ideas and uses for the naive Bayes model.
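The main idea of naive Bayes is to multiply a class prior by per-feature conditional probabilities, treating features as independent given the class. Below is a minimal sketch for categorical features with Laplace smoothing. The function names, features, and rows are hypothetical.

```python
import math
from collections import Counter, defaultdict

def fit_naive_bayes(rows, labels, alpha=1.0):
    """Count classes and per-class feature values; alpha is the Laplace smoothing term."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    levels = defaultdict(set)            # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
            levels[j].add(value)
    return class_counts, value_counts, levels, alpha

def predict_proba(model, row):
    """Return {class: posterior probability} for one row of categorical values."""
    class_counts, value_counts, levels, alpha = model
    total = sum(class_counts.values())
    log_post = {}
    for label, n_label in class_counts.items():
        lp = math.log(n_label / total)  # log prior
        for j, value in enumerate(row):
            count = value_counts[(label, j)][value]
            lp += math.log((count + alpha) / (n_label + alpha * len(levels[j])))
        log_post[label] = lp
    # Normalize in log space to avoid underflow.
    top = max(log_post.values())
    weights = {label: math.exp(lp - top) for label, lp in log_post.items()}
    norm = sum(weights.values())
    return {label: w / norm for label, w in weights.items()}

# Hypothetical features: (time of claim, policy age).
rows = [("night", "new"), ("night", "new"), ("day", "old"),
        ("day", "old"), ("day", "new"), ("night", "old")]
labels = ["fraud", "fraud", "legit", "legit", "legit", "legit"]
model = fit_naive_bayes(rows, labels)
probs = predict_proba(model, ("night", "new"))
```

The smoothing term keeps a feature value that never appeared with a class from forcing that class's probability to zero.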

Lesson 3.5 Supervised NOT-Fraud Model and Model Evaluation - The previous techniques all focus on previous instances of fraud and detecting those again. Here we talk about the important process of identifying new instances of fraud we haven't seen before using the NOT-Fraud model and combining it with the fraud model. We also discuss how to properly evaluate your fraud models.
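One possible way to combine the two models is sketched below, assuming each returns a score between 0 and 1. The `triage` function, its labels, and the 0.5 threshold are hypothetical and are not taken from the course materials. The idea is that observations scoring low on both models resemble neither known fraud nor known legitimate cases.

```python
def triage(fraud_score, not_fraud_score, threshold=0.5):
    """Bucket an observation using scores from a fraud and a NOT-fraud model."""
    looks_fraud = fraud_score >= threshold
    looks_legit = not_fraud_score >= threshold
    if looks_fraud and not looks_legit:
        return "known fraud pattern"
    if looks_legit and not looks_fraud:
        return "looks legitimate"
    if not looks_fraud and not looks_legit:
        # Resembles neither class: a candidate for a new type of fraud.
        return "unfamiliar: review"
    return "conflicting: review"

# Hypothetical (fraud score, NOT-fraud score) pairs.
buckets = [triage(f, n) for f, n in [(0.9, 0.1), (0.1, 0.8), (0.1, 0.2), (0.7, 0.6)]]
```

The "unfamiliar" bucket is the group that would be passed on for further investigation.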


4. Clustering and Implementation

Lesson 4.1 Clustering of Scored Observations - Once you have an idea about which observations don't look like either previous instances of fraud or not fraud, how do you isolate and investigate these? We need to use clustering to help isolate groups of observations that might identify new types of fraud.
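As a sketch of how such observations might be grouped, here is a small k-means (Lloyd's algorithm) in plain Python. The course does not specify a clustering algorithm in this outline, so k-means is an assumption, and the points and starting centroids are made up.

```python
def kmeans(points, centroids, n_iter=20):
    """Lloyd's algorithm: assign points to the nearest centroid, then update centroids."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(n_iter):
        assignment = [
            min(
                range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return assignment, centroids

# Hypothetical unfamiliar observations described by two numeric features.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
assignment, centers = kmeans(points, centroids=[points[0], points[3]])
```

Each resulting cluster can then be handed to investigators as a group, which is usually easier to assess than isolated cases.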

Lesson 4.2 Interpretability - The people who typically investigate cases of fraud are not the data scientists who build the models. We therefore need to make sure our models are interpretable and easy for investigators to use, through scorecards or Local Interpretable Model-agnostic Explanations (LIME).
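For the scorecard option, one common convention borrowed from credit scoring maps a model's odds to points so that a fixed number of points doubles the odds. The sketch below assumes that convention. The `scorecard_points` function and its default values (600 points at 50:1 odds, 20 points to double the odds) are illustrative choices, not values from the course.

```python
import math

def scorecard_points(p_not_fraud, base_points=600, base_odds=50.0, pdo=20):
    """Map the probability of NOT fraud to points.

    points = offset + factor * ln(odds), where factor = pdo / ln(2)
    and odds = p_not_fraud / (1 - p_not_fraud).
    """
    factor = pdo / math.log(2)
    offset = base_points - factor * math.log(base_odds)
    odds = p_not_fraud / (1.0 - p_not_fraud)
    return offset + factor * math.log(odds)

# Odds of 50:1 map to the base score; doubling the odds adds `pdo` points.
points_at_base = scorecard_points(50 / 51)
points_at_double = scorecard_points(100 / 101)
```

Under this scaling a lower score means a more suspicious case, which gives investigators a single number to sort by without needing to read model output.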

Lesson 4.3 Long-term Fraud Strategy - To wrap up our discussion of fraud, we talk about how to continue implementing these pieces within the broader fraud framework at a company. We also talk about how to evaluate an entire fraud system, not just the models themselves, after the models have been introduced.

Background knowledge

  • Introductory R/Python

  • Basic introduction to supervised modeling

  • Basic introduction to classification models like logistic regression, decision trees, etc. (this isn't required, but helpful for understanding)