Course Abstract

Training duration: 3 hours (Hands-on)

Machine learning is usually taught through tutorials that use small, clean datasets loaded into data frames and orchestrated with Jupyter notebooks, all in a single, in-memory, local environment. This is a fine style for presenting a new topic and teaching the main ideas, but these patterns are not conducive to delivering real production applications at scale. Real industrial situations involve multiple environments, and data comes from databases or other data stores rather than from files. The resulting applications interact with live production systems and must be coordinated with software delivery teams and product owners. They must be production quality: well designed, well tested, and maintainable. As a result, data scientists often have to choose between the environment they are used to and one that is suitable for delivery to production, followed by an awkward migration from one to the other. In this workshop, we show how to maintain data science productivity while collaborating effectively and delivering value continuously and seamlessly. We demonstrate, and guide participants through, CI/CD practices for machine learning and a new pattern of working that avoids most of the pitfalls of the typical approach.

DIFFICULTY LEVEL: INTERMEDIATE

Learning Objectives

  • How to maintain data science productivity as well as collaborate effectively and deliver value continuously and seamlessly when bringing machine learning models into production

  • Continuous Integration / Continuous Delivery (CI/CD) practices for Machine Learning and a new pattern of working that avoids most of the pitfalls of data scientists working in isolation

  • Small, safe incremental changes to code and models allowing for code to be deployed to production frequently, collecting feedback as we develop

  • How to coordinate data, models, and application code together, rather than application code alone

Instructors

Instructor Bios:

Global Head of Artificial Intelligence | ThoughtWorks, Inc.

Christoph Windheuser, PhD

Christoph Windheuser is the Global Head of Artificial Intelligence at ThoughtWorks, Inc. Before joining ThoughtWorks, he gained more than 20 years of industry experience in several positions at SAP and Capgemini. Prior to that, he completed his Ph.D. in Neural Networks, with a focus on Speech Recognition, at the University of Bonn, Germany; Carnegie Mellon University in Pittsburgh, USA; Waseda University in Tokyo, Japan; and France Télécom (E.N.S.T.) in Paris, France.

Principal Data Scientist | ThoughtWorks, Inc.

David Johnston, PhD

David Johnston is a Principal Data Scientist at ThoughtWorks. David helps clients turn their business problems into problems that can be solved with data science, artificial intelligence, optimization, and similar quantitative, data-driven techniques. He is a proponent of rethinking how we solve business problems and how we best apply data science in realistic situations, and a leader in developing more effective approaches to delivering machine learning that is thoroughly integrated with the larger software application development environment. He has a Ph.D. in physics and has also worked as a researcher in experimental cosmology at top universities, NASA, and US government laboratories.

Lead Data Engineer | ThoughtWorks, Inc.

Eric Nagler

Eric Nagler is a Lead Data Engineer at ThoughtWorks, Inc., with eight years of experience developing innovative batch and real-time data solutions for clients across multiple domains. Eric holds a master's degree in Computer Science with a focus on Parallel and Distributed Algorithms. His data interests include Big Data Application Architecture and Design, Parallel Computation, Big Data ETL, NoSQL Data Storage, Natural Language Processing, Geospatial Analytics, Web Services, and Data Visualization.

Course Outline

Introduction (60 minutes): What is Continuous Delivery for Machine Learning (CD4ML)?
Doing the plumbing (20 minutes): Set up the Jenkins build pipeline and ensure your project is configured correctly
Data Science (20 minutes): Develop the model and test the code using test-driven development (TDD)
Machine Learning Engineering: Improve the model in several steps and monitor the results of your improvement
Continuous Deployment (20 minutes): Set up a performance test of the model that allows automatic deployment only if the model passes a quality threshold
Our app in the wild (20 minutes): Monitor your application in production with Fluentd, Elasticsearch, and Kibana
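
As a rough illustration of the Continuous Deployment step in the outline above, a quality gate could look like the following minimal Python sketch. It is not the workshop's actual code: the choice of accuracy as the metric, the 0.8 threshold, and all function names are assumptions made for illustration.

```python
"""Minimal deployment quality gate (illustrative sketch only)."""

# Assumption: accuracy is the quality metric and 0.8 is the threshold.
# Both are placeholders, not values taken from the workshop.
DEFAULT_THRESHOLD = 0.8


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot score an empty validation set")
    matches = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return matches / len(y_true)


def deployment_allowed(y_true, y_pred, threshold=DEFAULT_THRESHOLD):
    """Return True only if the model's accuracy meets the threshold."""
    return accuracy(y_true, y_pred) >= threshold


def gate_exit_code(y_true, y_pred, threshold=DEFAULT_THRESHOLD):
    """Map the gate decision to a process exit code (0 = deploy, 1 = block)."""
    return 0 if deployment_allowed(y_true, y_pred, threshold) else 1
```

A build step could pass the result of gate_exit_code to sys.exit, so that a non-zero exit code fails the pipeline stage and blocks the automatic deployment.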

Background knowledge

  • Basic knowledge of how to develop a machine learning model

  • Basic skills in Python

  • Familiarity with Docker, the Elasticsearch stack, and Jenkins