The Frontiers of Society, Science and Technology, 2024, 6(12); doi: 10.25236/FSST.2024.061211.

A Review of the Development of Multimodal Large Models

Author(s)

Zheng Miaomiao1, Gao Yi1,2,3

Corresponding Author:
Gao Yi
Affiliation(s)

1Xizang University for Nationalities, Shaanxi, Xianyang, 712082, China

2The Collaborative Research Center for Language and Writing Education in Ethnic Regions, Shaanxi, Xianyang, 712082, China

3Xizang Key Laboratory of Optical Information Processing and Visualization Technology, Shaanxi, Xianyang, 712082, China

Abstract

With the continuous advancement of deep learning, multimodal large models built on large-scale language models and large-scale vision models have achieved breakthrough results in natural language processing. The pursuit of artificial general intelligence and the explosive popularity of ChatGPT have brought large language models into everyday life. These models are typically based on the Transformer architecture, which allows them to process and generate large amounts of text while demonstrating strong language understanding and generation capabilities. As multimodal large models continue to improve in language understanding and reasoning, techniques such as instruction fine-tuning, in-context learning, and chain-of-thought prompting have seen increasingly widespread use. This paper analyzes the key technologies and development trends of multimodal large models, as well as the many challenges they face.
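
The Transformer architecture mentioned above can be made concrete with a small sketch. The following Python snippet is illustrative only (it is not code from the reviewed work): it implements scaled dot-product attention, the core operation of the Transformer, and the array shapes and toy usage at the end are assumptions chosen for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention (illustrative sketch).

    Q, K: arrays of shape (seq_len, d_k); V: array of shape (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    # Similarity between every query and every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted sum of the value vectors
    return weights @ V

# Toy self-attention over a sequence of 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

In a full Transformer layer, this operation is applied in parallel over several learned projections of the input (multi-head attention) and followed by a position-wise feed-forward network.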

Keywords

Deep learning; Multimodal large models; Transformer; Challenges

Cite This Paper

Zheng Miaomiao, Gao Yi. A Review of the Development of Multimodal Large Models. The Frontiers of Society, Science and Technology (2024), Vol. 6, Issue 12: 63-68. https://doi.org/10.25236/FSST.2024.061211.
