Integrating Natural Language Processing and Audio Generation: A Study on Text-to-Music Generation Models

Bolin Shang

doi:10.25236/AJCIS.2024.071216

Academic Journal of Computing & Information Science, 2024, 7(12); doi: 10.25236/AJCIS.2024.071216.

Author(s)

Bolin Shang

Corresponding Author:

Bolin Shang

Affiliation(s)

Pomfret School, PO Box 128, 398 Pomfret Street, Pomfret, Connecticut 06258, USA

Abstract

With the rapid development of artificial intelligence, text-based music generation opens new possibilities for music creation, which has traditionally been limited by the creator's personal experience. The aim of this research is to develop a text-to-music generator that combines natural language processing and audio generation techniques to promote innovation in music creation. The research uses the pre-trained AudioLDM model together with DreamBooth fine-tuning to generate high-quality audio through diffusion models, and uses Gradio to present the resulting music to creators. Studies have shown that text-guided latent diffusion models can effectively denoise and generate music that fits specific styles and instruments. In addition, the project explores techniques for personalization and style transfer, aiming to customize models with a small number of inputs.
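The denoising idea behind latent diffusion models can be illustrated with a minimal sketch. This is not the paper's implementation: the four-element latent, the linear noise schedule, and the function names are all assumptions chosen for illustration, and the true noise stands in for the prediction that a trained, text-conditioned network would supply.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, noise, alpha_bar):
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, noise)]


def estimate_clean(xt, predicted_noise, alpha_bar):
    """Invert the forward process given an estimate of the added noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * e) / a for x, e in zip(xt, predicted_noise)]


rng = random.Random(0)
latent = [0.5, -1.0, 0.25, 2.0]  # hypothetical clean audio latent
noise = [rng.gauss(0.0, 1.0) for _ in latent]
alpha_bars = alpha_bar_schedule(1000)
t = 500

noisy = add_noise(latent, noise, alpha_bars[t])
# Oracle stand-in: a trained network conditioned on the text prompt
# would predict this noise; here the true noise is reused.
recovered = estimate_clean(noisy, noise, alpha_bars[t])
```

With a perfect noise estimate the clean latent is recovered up to floating-point error. In a real system the estimate is imperfect, so the reverse process is applied over many small steps rather than in a single inversion.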

Keywords

Text-to-music generation, AudioLDM, DreamBooth, Diffusion models, Style transfer

Cite This Paper

Bolin Shang. Integrating Natural Language Processing and Audio Generation: A Study on Text-to-Music Generation Models. Academic Journal of Computing & Information Science (2024), Vol. 7, Issue 12: 111-116. https://doi.org/10.25236/AJCIS.2024.071216.
