
Academic Journal of Computing & Information Science, 2021, 4(8); doi: 10.25236/AJCIS.2021.040803.

GME-Dialogue-NET: Gated Multi-modal Sentiment Analysis Model Based on Fusion Mechanism


Meng Yang, Yegang Li, Hao Zhang

Corresponding Author:
Meng Yang

School of Computer Science and Technology, Shandong University of Technology, Zibo 255049, Shandong, China


Compared with a single modality, using multi-modal information from text, video and audio can lead to more accurate sentiment analysis. This paper proposes GME-Dialogue-NET, a gated multi-modal sentiment analysis model for multi-modal emotion prediction and sentiment analysis. The model uses Gated Multi-modal Embedding (GME) to judge whether the audio or video modality is noise, and accepts or rejects that modality's information accordingly. It applies an attention mechanism over context vectors so that context more relevant to the current sentence receives more attention. GME-Dialogue-NET divides the dialogue participants into speaker and listener to better capture the dependence between emotion and state. It also proposes a fusion mechanism, Circulant-Pairwise Attention (CPA), which attends to the different modalities to different degrees in order to obtain more useful emotional and sentiment representations for emotion prediction and sentiment analysis. Compared with current models, both the weighted accuracy and the F1 score of emotion prediction improve, especially for the three emotions of sadness, anger and excitement. In the sentiment regression task, compared with the current advanced model Multilogue-Net, the mean absolute error (MAE) of GME-Dialogue-NET decreases by 0.1 percentage points and its Pearson correlation coefficient (R) increases by 0.11 percentage points.
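
To make the gating and context-attention ideas in the abstract concrete, the following is a minimal sketch only, not the authors' implementation. It assumes a hard sigmoid gate with a fixed threshold and plain dot-product softmax attention; the function names (`gated_embedding`, `context_attention`), the gate weights and the threshold are all hypothetical choices made for illustration, and the paper's actual GME and attention formulations may differ.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gated_embedding(text_vec, modal_vec, gate_weights, gate_bias, threshold=0.5):
    """Accept or reject a non-text modality (audio or video).

    Assumption: the gate is a sigmoid over the concatenated text and
    modality features, compared with a fixed threshold. Returns the
    modality vector unchanged if accepted, a zero vector if it is
    judged to be noise, together with the gate score.
    """
    score = sigmoid(dot(gate_weights, text_vec + modal_vec) + gate_bias)
    if score >= threshold:
        return list(modal_vec), score
    return [0.0] * len(modal_vec), score


def context_attention(query, context):
    """Weight earlier context vectors by relevance to the current utterance.

    Assumption: relevance is a dot product followed by a softmax.
    Returns the attention weights and the weighted sum of the context.
    """
    scores = [dot(query, c) for c in context]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * c[i] for w, c in zip(weights, context))
              for i in range(len(query))]
    return weights, pooled


# Hypothetical toy inputs: a 2-d text vector and two 2-d audio vectors.
text = [0.2, -0.1]
gate_w = [0.0, 0.0, 1.0, 1.0]  # illustrative weights, not learned values
print(gated_embedding(text, [1.0, 0.5], gate_w, 0.0))    # gate opens
print(gated_embedding(text, [-2.0, -1.5], gate_w, 0.0))  # gate closes

# Three earlier utterances; the first is most similar to the query.
history = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(context_attention([1.0, 0.0], history))
```

In this sketch a rejected modality contributes a zero vector, so any later fusion step sees no signal from it, and the attention weights always sum to one, with the most relevant context vector receiving the largest share.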


Natural language processing, Multi-modal sentiment analysis, Multi-modal fusion mechanism

Cite This Paper

Meng Yang, Yegang Li, Hao Zhang. GME-Dialogue-NET: Gated Multi-modal Sentiment Analysis Model Based on Fusion Mechanism. Academic Journal of Computing & Information Science (2021), Vol. 4, Issue 8: 10-18. https://doi.org/10.25236/AJCIS.2021.040803.

