
Academic Journal of Computing & Information Science, 2023, 6(3); doi: 10.25236/AJCIS.2023.060306.

Self-attentive Residual TCN Speech Emotion Recognition with Fused Acoustic Features


Bin Zhang, Yanping Zhu

Corresponding Author:
Yanping Zhu

School of Microelectronics and Control Engineering, Changzhou University, Changzhou, China


Abstract:
Speech emotion recognition systems are limited by two problems: a single speech feature carries insufficient emotion information, and recognition models do not make full use of the emotion information contained in the features. To address both problems and improve overall recognition performance, this paper proposes a self-attentive residual temporal convolutional network (S-ResTCN) that fuses Mel-frequency cepstral coefficients (MFCCs) with rhythmic features. First, rhythmic features and MFCCs are extracted from speech in the EMO-DB and CASIA databases, and their statistical functions are computed to form 128-dimensional fused acoustic features. Next, the S-ResTCN network is designed and built: a residual temporal convolutional network models the dependencies between feature elements, and a self-attention mechanism makes the network focus on the parameters most related to the emotional state, producing a self-attention feature matrix. Finally, the softmax function is used for classification. The results show that S-ResTCN improves accuracy by 1.52%-14.12% over existing networks on the EMO-DB database and by 1.27%-6.53% over existing networks on the CASIA database.
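The pipeline summarized in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The choice of four statistical functions (mean, standard deviation, minimum, maximum) over 32 frame-level descriptors, the kernel size, channel count, dilation rates, mean pooling before the classifier, and all function names are assumptions made only so that the dimensions work out to 128. The weights are random and untrained, so the output probabilities carry no meaning beyond showing the data flow. The seven output classes correspond to the seven emotion categories of EMO-DB.

```python
import numpy as np

def fuse_statistics(frames):
    """Collapse frame-level features (T, D) into one utterance-level vector (4*D,).
    Assumption: the four functionals used here are illustrative only."""
    return np.concatenate([frames.mean(0), frames.std(0), frames.min(0), frames.max(0)])

def causal_dilated_conv(x, w, dilation):
    """Causal dilated 1-D convolution. x: (L, C_in), w: (K, C_in, C_out)."""
    k = w.shape[0]
    pad = (k - 1) * dilation
    xp = np.vstack([np.zeros((pad, x.shape[1])), x])  # left padding keeps the length at L
    out = np.zeros((x.shape[0], w.shape[2]))
    for j in range(k):
        # tap j looks (k - 1 - j) * dilation positions into the past
        out += xp[j * dilation : j * dilation + x.shape[0]] @ w[j]
    return out

def residual_tcn_block(x, w1, w2, w_skip, dilation):
    """Two dilated convolutions with ReLU, plus a 1x1-projected residual connection."""
    h = np.maximum(causal_dilated_conv(x, w1, dilation), 0.0)
    h = np.maximum(causal_dilated_conv(h, w2, dilation), 0.0)
    return h + x @ w_skip

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(h, wq, wk, wv):
    """Scaled dot-product self-attention over the sequence positions of h: (L, C)."""
    q, k, v = h @ wq, h @ wk, h @ wv
    weights = softmax(q @ k.T / np.sqrt(k.shape[1]))
    return weights @ v, weights

def init_params(rng, channels=8, kernel=3, dilations=(1, 2, 4), n_classes=7):
    """Random, untrained weights (hypothetical sizes)."""
    blocks, c_in = [], 1
    for d in dilations:
        blocks.append((rng.normal(0, 0.3, (kernel, c_in, channels)),
                       rng.normal(0, 0.3, (kernel, channels, channels)),
                       rng.normal(0, 0.3, (c_in, channels)),
                       d))
        c_in = channels
    attn = [rng.normal(0, 0.3, (channels, channels)) for _ in range(3)]
    return {"blocks": blocks, "attn": attn, "out": rng.normal(0, 0.3, (channels, n_classes))}

def classify(frames, params):
    x = fuse_statistics(frames)[:, None]  # (128, 1): feature elements treated as a sequence
    for w1, w2, w_skip, d in params["blocks"]:
        x = residual_tcn_block(x, w1, w2, w_skip, d)
    attended, weights = self_attention(x, *params["attn"])
    probs = softmax(attended.mean(axis=0) @ params["out"])  # pool positions, then classify
    return probs, weights

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 32))  # stand-in for 200 frames of 32 acoustic descriptors
probs, attn_weights = classify(frames, init_params(rng))
print(probs.shape, attn_weights.shape)
```

In this sketch the residual connection lets each block pass its input through unchanged when the convolutions contribute little, while the growing dilation rates widen the span of feature elements each output position can see. The attention weight matrix has one row per feature position, and each row sums to 1.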


Keywords:
speech emotion recognition, temporal convolutional network, self-attention mechanism, Mel-frequency cepstral coefficients, rhythmic features

Cite This Paper

Bin Zhang, Yanping Zhu. Self-attentive Residual TCN Speech Emotion Recognition with Fused Acoustic Features. Academic Journal of Computing & Information Science (2023), Vol. 6, Issue 3: 42-51. https://doi.org/10.25236/AJCIS.2023.060306.

