Academic Journal of Computing & Information Science, 2023, 6(13); doi: 10.25236/AJCIS.2023.061324.
Jianghao Luo, Zhenhua Tang
School of Physics & Optoelectronic Engineering, Guangdong University of Technology, Guangzhou, China
This work addresses the scarcity of Cantonese speech emotion datasets by introducing a dedicated dataset, together with a feature set designed specifically for Cantonese to capture its emotional expressions. A model based on a self-normalizing neural network improves the efficiency of Cantonese speech emotion recognition. The model reaches an accuracy of 92.3% on the Cantonese dataset and generalizes well to other Chinese and English datasets. These results point to potential applications in Cantonese language education, psychological counseling, and voice assistants. The work also advances the understanding of Cantonese emotional expression and contributes to the preservation of linguistic and cultural heritage. The dataset's coverage and its variety of emotions remain limited. Future work will expand the dataset's breadth, incorporate a wider range of emotional expressions, and investigate multimodal approaches that combine audio, visual, and textual cues, with the aim of a more nuanced understanding of Cantonese emotional communication.
Cantonese emotion recognition, feature set, self-normalizing neural network, multimodal
Jianghao Luo, Zhenhua Tang. Feature Selection and Fusion in Cantonese Speech Emotion Analysis. Academic Journal of Computing & Information Science (2023), Vol. 6, Issue 13: 169-177. https://doi.org/10.25236/AJCIS.2023.061324.