
Academic Journal of Computing & Information Science, 2025, 8(1); doi: 10.25236/AJCIS.2025.080104.

Multimodal conversational emotion recognition based on hierarchical Transformer

Author(s)

Bai Yimin1, Zhang Pengwei1, Zhang Jingze2, Chen Jingxia1

Corresponding Author:
Chen Jingxia
Affiliation(s)

1Shaanxi University of Science & Technology, Xi'an, China

2Xi’an Gaoxin No.1 Experimental High School, Xi'an, China

Abstract

To address the limited representational power of single-modality features, inadequate multimodal feature fusion, and the difficulty of modeling conversation scenarios in multimodal conversational emotion recognition, a hierarchical Transformer-based multimodal conversational emotion recognition model is proposed. The model uses self-attention bidirectional gated recurrent units to capture the contextual dependencies within the text, video, and audio modalities, strengthening the representation of each single-modality feature. Hierarchical gated multi-head attention learns complementary information across modalities and adaptively weights each modality, reducing the noise that redundant information introduces into the fused multimodal features. A hierarchical Transformer then models the conversation scenario, using a masking mechanism to capture dependencies within the conversational context, within speakers, and between speakers, yielding a deeper understanding of speakers' emotional states. On the IEMOCAP and MELD benchmark datasets, the model achieves accuracy and F1 scores of 71.10% and 70.97%, and 67.16% and 66.11%, respectively, outperforming comparable methods in accuracy.
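The paper's implementation is not reproduced on this page; the snippet below is a minimal, illustrative PyTorch-style sketch (an assumption, not the authors' released code) of two components the abstract names: a self-attention bidirectional GRU encoder for a single modality, and a gated fusion that adaptively weights text, audio, and video features. All class names, dimensions, and hyperparameters are hypothetical.

```python
# Illustrative sketch only: assumed shapes and module choices, not the authors' code.
import torch
import torch.nn as nn

class SelfAttnBiGRU(nn.Module):
    """Bidirectional GRU over an utterance sequence, refined by self-attention."""
    def __init__(self, in_dim, hid_dim, heads=4):
        super().__init__()
        # 2 * hid_dim must be divisible by the number of attention heads.
        self.gru = nn.GRU(in_dim, hid_dim, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hid_dim, heads, batch_first=True)

    def forward(self, x):                       # x: (batch, seq_len, in_dim)
        h, _ = self.gru(x)                      # contextualized unimodal features
        ctx, _ = self.attn(h, h, h)             # self-attention over the dialogue
        return ctx                              # (batch, seq_len, 2 * hid_dim)

class GatedFusion(nn.Module):
    """Gate that adaptively weights each modality before summing them."""
    def __init__(self, dim, n_modalities=3):
        super().__init__()
        # All modality features are assumed to be projected to the same dim first.
        self.gate = nn.Linear(n_modalities * dim, n_modalities)

    def forward(self, feats):                   # list of (batch, seq_len, dim) tensors
        stacked = torch.stack(feats, dim=-2)    # (batch, seq_len, M, dim)
        weights = torch.softmax(
            self.gate(torch.cat(feats, dim=-1)), dim=-1
        ).unsqueeze(-1)                         # per-modality weights, (batch, seq_len, M, 1)
        return (weights * stacked).sum(dim=-2)  # fused features, (batch, seq_len, dim)
```

In the full model described by the abstract, the fused utterance features would then pass to the hierarchical Transformer, whose masked attention separately models contextual, intra-speaker, and inter-speaker dependencies; that stage is omitted from this sketch.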

Keywords

Emotion Recognition in Conversation (ERC), Multimodality, Transformer, Gated Fusion, Multi-Head Attention

Cite This Paper

Bai Yimin, Zhang Pengwei, Zhang Jingze, Chen Jingxia. Multimodal conversational emotion recognition based on hierarchical Transformer. Academic Journal of Computing & Information Science (2025), Vol. 8, Issue 1: 19-31. https://doi.org/10.25236/AJCIS.2025.080104.
