Image captioning is an interesting problem at the intersection of Computer Vision and Natural Language Processing, and it has attracted great attention from both the Vision and Language communities. This tutorial will focus on some of the recent Vision-and-Language Pretraining approaches for image captioning. We will cover our most recently developed approaches, such as object-semantics aligned pre-training (OSCAR) and visual-vocabulary pre-training (VIVO), and discuss their key principles and how we address the core challenges in image caption generation. Our findings lead to a new image captioning framework that achieves state-of-the-art performance on the NoCaps benchmark and surpasses human performance for the first time.

Instructor's Bio

Lijuan Wang, Principal Research Manager at Microsoft

She received her B.E. from Huazhong University of Science and Technology in 2001 and her Ph.D. from Tsinghua University, China, in 2006. In 2006, she joined the speech group of Microsoft Research Asia, where she was a Lead Researcher. In 2016, she joined Microsoft Research in Redmond. As of 2020, she leads a science team working on both computer vision research and products in the Microsoft Azure Cognitive Services team.

Her research areas include deep learning and machine learning for multimodal perception intelligence. Over the years, she has been a key contributor in developing technologies for vision-language pre-training, image captioning, object detection, printed/handwritten text recognition, avatar animation (talking head), and speech synthesis (TTS)/recognition, which have shipped in Microsoft products such as Cognitive Services, Azure Kinect Body Tracking SDK, Office365, and Seeing AI. She has published 50+ papers in top conferences and journals, and she is the inventor/co-inventor of more than 15 granted/pending U.S. patents. She is a senior member of IEEE.

Kevin Lin, Applied Scientist at Microsoft

Kevin Lin is an applied scientist in the computer vision science group at Microsoft Cloud & AI.

He received a Ph.D. degree in electrical engineering from the University of Washington in 2020 and an M.S. degree in computer science from National Taiwan University in 2014. His current research interests include computer vision and reasoning across vision and language.

Local ODSC chapter in Austin, USA

Use discount code Meetup2020 to get up to an extra 10% off your pass for the ODSC Conference.


Tutorial on Recent Advances in Image Captioning: Describing Images as Well as People Do