Description

Reinforcement Learning with Human Feedback

"LLMs have proven to be tremendously successful at generating text. A very important step in their fine-tuning involves humans evaluating the output. In order to improve the model with human feedback, RLHF is a widely used method. In this talk, we'll explore several aspects, including: A brief review of reinforcement learning. How RLHF is used to fine-tune large language models. Proximal Policy Optimization (PPO), the reinforcement learning training technique used for RLHF. Direct Preference Optimization (DPO), an alternate method to fine-tune an LLM with human feedback, which doesn't use RL, and has performed quite well."

Instructor's Bio

Luis Serrano, PhD

Author of Grokking Machine Learning and Creator of Serrano Academy

Luis Serrano is a machine learning scientist and popularizer. He is the author of the Amazon bestseller Grokking Machine Learning and the creator of Serrano Academy, a popular educational YouTube channel with over 150K subscribers and millions of views. Luis has a PhD in mathematics from the University of Michigan and worked as a mathematics researcher before venturing into the world of technology. He has worked in AI at Google, Apple, and Cohere, and as a Quantum AI research scientist at Zapata Computing. He has published popular courses on platforms such as Coursera, Udacity, and DeepLearning.ai.

Outline:

  • Large Language Models

  • How to fine-tune them with reinforcement learning

  • Quick intro to reinforcement learning

  • PPO (the reinforcement learning technique used to fine-tune LLMs)

  • DPO (the non-RL technique to fine-tune LLMs with human feedback; a rough sketch of its objective follows below)
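
For a concrete taste of what DPO optimizes ahead of the talk, here is a minimal sketch of the DPO loss for a single preference pair. The function name, the toy log-probabilities, and the beta value are illustrative assumptions, not material from the talk itself.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper for illustration).

    Inputs are sequence log-probabilities of the human-preferred (chosen) and
    dispreferred (rejected) responses under the policy being fine-tuned and
    under a frozen reference model.
    """
    # Log-ratio of policy to reference for each response
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # DPO pushes the chosen response's log-ratio above the rejected one's
    margin = beta * (chosen_logratio - rejected_logratio)
    # Negative log-sigmoid of the margin (a binary preference likelihood)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy numbers: the policy already slightly favors the chosen response
print(dpo_loss(-12.0, -15.0, -13.0, -14.5, beta=0.1))
```

No reward model or RL rollout appears in this sketch, which is the point of DPO: the preference data is used directly as a classification-style objective on the policy and a reference model.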