Course Abstract

Training duration: 4 hours

Many companies generate and store vast amounts of unlabelled data every day. Outside of certain unsupervised applications, data must be accompanied by informative labels for its potential to be maximised. However, data annotation efforts are constrained by the human factor and come with a trade-off: internal annotators (i.e. employees) possess crucial context but do not scale, while external annotators (e.g. crowdsourced marketplaces such as MTurk) scale only at the expense of domain-specific context. In this training course, we will explore how Active (human-in-the-loop) and Semi-Supervised (ML/AI-assisted) Learning frameworks can be combined to develop in-house solutions for executing rapid data labelling projects. We will consider various sampling strategies, query methods, measures of informativeness, and types of learners. By the end of the session, you will be equipped with a range of tools that you can use to scale up your data annotation efforts without losing all-important context.

DIFFICULTY LEVEL: ADVANCED

Learning Objectives

  • Recap of Supervised Learning models and techniques

  • Understand Active Learning and explore human-in-the-loop to scale data annotation

  • Understand Semi-Supervised Learning to attack the data annotation problem

  • Putting Everything Together - A Complete Data Annotation Pipeline

Instructor

Instructor Bio:

Senior Data Scientist | Attest

Gokhan Ciflikli, PhD

Gokhan is a senior data scientist at Attest. He is also a member of the Quanteda Initiative, and a guest lecturer for the LSE summer school course Introduction to Data Science and Machine Learning. As a computational social scientist, his core expertise lies in latent variable analysis, predictive modelling, and causal inference. Prior to industry, he was a postdoctoral researcher in analytic software development at the London School of Economics, where he received his PhD. Previously, he held research positions at UCL and Uppsala University, primarily developing machine learning pipelines and working on large-scale NLP problems.

Course Outline

Module 1: Recap of Supervised Learning 

- Brief recap of the supervised learning paradigm, train-test split procedure; cross-validation; in-sample vs. out-of-sample forecasts; accuracy vs. precision; the bias-variance trade-off 

Module 2: Active Learning 

- Using Active Learning to leverage the least confident predictions of an estimator and expedite its learning by querying their labels from a human annotator

- Explore how the human-in-the-loop can help scale up the data annotation process.
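
As a rough illustration of the idea in Module 2, the sketch below ranks unlabelled items by how unsure an estimator is about them and selects the least confident ones for a human annotator. It uses only the Python standard library; the function name, the `n_queries` parameter, and the probability values are hypothetical and are not taken from the course notebook. Libraries such as modAL provide ready-made query strategies for the same purpose.

```python
def least_confident_indices(probabilities, n_queries):
    """Return indices of the n_queries rows whose top class probability is lowest."""
    # Uncertainty of a prediction = 1 - probability of its most likely class.
    uncertainty = [1.0 - max(row) for row in probabilities]
    # Rank item indices from most to least uncertain.
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: uncertainty[i],
                    reverse=True)
    return ranked[:n_queries]


# Hypothetical class probabilities for four unlabelled items (each row sums to 1).
pool_probabilities = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.70, 0.30],
    [0.51, 0.49],
]

# Items to send to the human annotator first.
query = least_confident_indices(pool_probabilities, n_queries=2)
print(query)
```

In a full active learning loop, the labels returned by the annotator would be added to the training set and the estimator refitted before the next round of queries.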

Module 3: Semi-Supervised Learning 

- How Semi-Supervised Learning attacks the problem of data annotation from the opposite angle 

- Explore the underpinnings of the so-called ML/AI-assisted data annotation 

- How to leverage the most confident predictions of an estimator to label data at scale 
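
The opposite angle described in Module 3 can be sketched in the same way: instead of querying the least confident predictions, the most confident ones are accepted as labels without human review (often called pseudo-labelling or self-training). This is a minimal standard-library sketch; the function name, the `threshold` value of 0.9, and the probabilities are assumptions chosen for illustration, not the course's own implementation.

```python
def pseudo_label(probabilities, threshold=0.9):
    """Return {row index: predicted class} for predictions at or above threshold."""
    labels = {}
    for index, row in enumerate(probabilities):
        confidence = max(row)
        if confidence >= threshold:
            # The predicted class is the position of the highest probability.
            labels[index] = row.index(confidence)
    return labels


# Hypothetical class probabilities for three unlabelled items.
pool_probabilities = [
    [0.97, 0.03],
    [0.60, 0.40],
    [0.08, 0.92],
]

# Items 0 and 2 clear the threshold; item 1 stays unlabelled.
auto_labels = pseudo_label(pool_probabilities, threshold=0.9)
print(auto_labels)
```

The threshold controls the trade-off: a higher value labels fewer items but lowers the risk of training on wrong labels.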

Module 4: Putting Everything Together - A Complete Data Annotation Pipeline 

- Walk-through of an interactive Jupyter notebook

- Demonstration of how the two aforementioned frameworks can be combined to create bespoke data labelling jobs

- Explore a multitude of scenarios by using the individual components in various configurations, and assess their pros and cons
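
One possible way to combine the two frameworks, shown here only as a hedged sketch, is to route each prediction by confidence: confident predictions are labelled automatically, the least confident ones go to a human annotator up to a fixed budget, and the rest wait for the next round. The function name, `auto_threshold`, `human_budget`, and the data are all hypothetical and are not drawn from the course notebook.

```python
def route_predictions(probabilities, auto_threshold=0.9, human_budget=1):
    """Split a pool into auto-labelled, human-queried, and deferred items."""
    auto_labels = {}
    uncertain = []
    for index, row in enumerate(probabilities):
        confidence = max(row)
        if confidence >= auto_threshold:
            # Semi-supervised step: accept the confident prediction as a label.
            auto_labels[index] = row.index(confidence)
        else:
            uncertain.append((confidence, index))
    # Active learning step: the lowest-confidence items go to the human first.
    uncertain.sort()
    to_human = [index for _, index in uncertain[:human_budget]]
    deferred = sorted(index for _, index in uncertain[human_budget:])
    return auto_labels, to_human, deferred


# Hypothetical class probabilities for four unlabelled items.
pool_probabilities = [
    [0.97, 0.03],
    [0.52, 0.48],
    [0.75, 0.25],
    [0.05, 0.95],
]

auto_labels, to_human, deferred = route_predictions(pool_probabilities)
print(auto_labels, to_human, deferred)
```

Varying `auto_threshold` and `human_budget` is one simple way to explore different configurations and compare how much labelling is automated against how much annotator time is spent.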

Background knowledge

  • This course is for current and aspiring Data Scientists, Data Analysts and AI Product Managers

  • Knowledge of the following tools and concepts is useful:

      • Familiarity with Python and Jupyter notebooks (R users should be able to follow the material)

      • Specifically for Python, prior working experience using the numpy, pandas, scikit-learn, and modAL libraries

      • General grasp of supervised learning concepts

Real-world applications

  • Data annotation techniques have been at the forefront of self-driving technologies used by companies like Toyota, Voyage, and Lyft.

  • Robotics and automation companies like OpenAI, Skydio, and even General Motors employ data annotation at scale in data science applications.

  • Data annotation and labelling are widely practised on platforms like Pinterest and Airbnb to recognise images, translate languages, and generate realistic text.