Large-scale distributed training has become essential to scaling the productivity of ML engineers. Today's ML models are growing larger and more complex in their compute and memory requirements, and the amount of data we train on at Facebook is enormous. In this talk, we will learn about the Distributed Training Platform that supports large-scale data and model parallelism. We will also touch on distributed training support in PyTorch and how we offer a flexible training platform that helps ML engineers increase their productivity at Facebook scale.
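To make the data-parallelism idea mentioned above concrete, here is a minimal, self-contained sketch of synchronous data-parallel training: each worker computes gradients on its own data shard, an all-reduce averages them, and every worker applies the same update. This is an illustration of the concept only, not the platform described in the talk; real systems such as PyTorch `DistributedDataParallel` run the workers as separate processes and use NCCL/Gloo collectives for the all-reduce.

```python
def local_gradient(weights, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    w = weights[0]
    g = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
    return [g]

def all_reduce_mean(grads_per_worker):
    """Average gradients element-wise across workers (a simulated all-reduce)."""
    n = len(grads_per_worker)
    return [sum(g[i] for g in grads_per_worker) / n
            for i in range(len(grads_per_worker[0]))]

def train_step(weights, shards, lr=0.1):
    # Each worker's gradient would be computed in parallel in practice.
    grads = [local_gradient(weights, shard) for shard in shards]
    avg = all_reduce_mean(grads)  # the synchronization point
    return [w - lr * g for w, g in zip(weights, avg)]

# Toy data for y = 3x, split across two workers.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
weights = [0.0]
for _ in range(200):
    weights = train_step(weights, shards)
print(round(weights[0], 3))  # converges toward 3.0
```

Because every worker sees the same averaged gradient, the model replicas stay in sync after each step, which is the invariant that synchronous data parallelism maintains at scale.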



Session Overview

  • ODSC East 2020: Distributed Training Platform at Facebook

    • Overview and Author Bio

    • Distributed Training Platform at Facebook

Instructor Bio:

Senior Engineering Manager | Facebook

Mohamed Fawzy

Mohamed Fawzy leads the Distributed AI group within Facebook AI Infrastructure. The group focuses on scaling and productionizing machine learning training at Facebook scale by building the platform and infrastructure to support large-scale distributed training on heterogeneous hardware.

Software Engineer | Facebook

Kiuk Chung

Kiuk Chung is a Software Engineer at Facebook leading PyTorch Elastic Training. Before Facebook, he spent six years on various teams at Amazon building cloud-native infrastructure for deep learning and high-performance computing; there he scaled deep learning for product recommendations and search, and worked on launching AWS Batch.