Academic Journal of Computing & Information Science, 2023, 6(13); doi: 10.25236/AJCIS.2023.061306.

A Survey of Transformers in Video Prediction


Weichen Ji

State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China, 100024


Abstract: The Transformer is an encoder-decoder architecture built on the self-attention mechanism. It captures global information effectively and is particularly strong at modeling long-range dependencies. In recent years, it has become the mainstream architecture in Natural Language Processing (NLP). Inspired by this success, researchers have gradually applied the Transformer to video processing tasks, one of which is video prediction. Video prediction is, in essence, the task of generating future frames from past ones. Because a model for this task needs strong sequence modeling capabilities, the application and integration of the Transformer has become a major research direction. This paper surveys the application of the Transformer to video prediction, reviews representative models, analyzes the ideas behind their improvements, and concludes with a summary and an outlook on future development.


Keywords: Transformer, Self-attention, Video prediction, Computer vision, Survey

