Academic Journal of Computing & Information Science, 2023, 6(13); doi: 10.25236/AJCIS.2023.061306.

A Survey of Transformers in Video Prediction

Author(s)

Weichen Ji

Corresponding Author:
Weichen Ji

Affiliation(s)

State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing 100024, China

Abstract

The Transformer is an encoder-decoder architecture built on the self-attention mechanism; it captures global information effectively and has proven strong at modeling long-range dependencies. In recent years, the Transformer has become the mainstream architecture in Natural Language Processing (NLP). Inspired by its success in NLP, researchers have gradually applied it to video processing tasks, one of which is video prediction. The essence of video prediction is to generate future frames conditioned on past ones, so a model for this task needs strong sequence modeling capabilities, and applying and adapting the Transformer has become a major research direction. This paper surveys applications of the Transformer to video prediction, presents representative models, analyzes the ideas behind their improvements, and concludes with a summary and an outlook on future development.
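
The abstract frames video prediction as conditional sequence generation: given past frames, a causally masked self-attention stack predicts the next frame. The sketch below is a minimal illustration of that setup, not the method of any surveyed model; the class name, the one-token-per-frame encoding, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of Transformer-based next-frame prediction (illustrative only):
# each past frame becomes one token, causal self-attention models temporal
# dependencies, and a linear head regresses the next frame's pixels.
import torch
import torch.nn as nn

class ToyVideoTransformer(nn.Module):
    def __init__(self, frame_dim=64 * 64, d_model=256, n_heads=4,
                 n_layers=2, max_frames=32):
        super().__init__()
        self.embed = nn.Linear(frame_dim, d_model)             # one token per frame
        self.pos = nn.Parameter(torch.zeros(1, max_frames, d_model))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, frame_dim)              # regress next frame

    def forward(self, past_frames):                            # (B, T, H*W)
        T = past_frames.size(1)
        tokens = self.embed(past_frames) + self.pos[:, :T]
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        hidden = self.encoder(tokens, mask=mask)               # causal self-attention
        return self.head(hidden[:, -1])                        # predicted frame T+1

model = ToyVideoTransformer()
past = torch.randn(2, 8, 64 * 64)    # 2 clips, 8 past frames of 64x64 pixels
next_frame = model(past)             # (2, 64*64): predicted ninth frame
```

In practice, Transformer-based predictors rarely use one token per whole frame; frames are typically split into patch tokens or compressed into discrete latent codes first, which keeps the attention cost tractable at higher resolutions.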

Keywords

Transformer, self-attention, video prediction, computer vision, survey

Cite This Paper

Weichen Ji. A Survey of Transformers in Video Prediction. Academic Journal of Computing & Information Science (2023), Vol. 6, Issue 13: 35-46. https://doi.org/10.25236/AJCIS.2023.061306.
