The vision transformer is a recent breakthrough in computer vision. While transformer-based models have dominated natural language processing since 2017, CNN-based models have continued to deliver state-of-the-art performance on vision problems. Last year, a group of researchers from Google figured out how to make a transformer work on image recognition. They called it a "vision transformer". Follow-up work by the community has demonstrated superior performance of vision transformers not only in recognition but also in downstream tasks such as detection, segmentation, multi-modal learning, and scene text recognition, to mention a few.
In this talk, we will take a closer look at the model architecture of vision transformers. Most importantly, we will focus on the concept of self-attention and its role in vision. Then, we will present different model implementations that use the vision transformer as the main backbone.
Since self-attention can be applied beyond transformers, we will also discuss a promising direction for building general-purpose model architectures. In particular, such networks can process a variety of data formats, including text, audio, image, and video.
Abstract & Bio
Vision Transformer and its Applications