The Frontiers of Society, Science and Technology, 2024, 6(7); doi: 10.25236/FSST.2024.060704.
Zhichao Zhou1,2, Chaofan Hu1,2
1School of Mechanical and Electrical Engineering, Guilin University of Electronic Technology, 541004 Guilin, China
2Guangxi Key Laboratory of Manufacturing System & Advanced Manufacturing Technology, Guilin University of Electronic Technology, 541004 Guilin, China
Abstract: Speech is the main medium of human communication, carrying both the speaker's message and the speaker's emotion. A variety of applications can harness emotion in speech to serve human needs more effectively. Deep learning algorithms are a practical solution to speech recognition, which is by nature a classification problem, and various such algorithms have been widely applied to voice data with remarkable performance. In real life, however, the testing data may follow a different distribution from the training data, which causes the out-of-distribution (OOD) problem. This article proposes a new domain generalization method for speech classification based on Stable Learning (StableNet) to address the OOD problem. StableNet removes dependence between features by learning weights for training samples, which encourages deep models to learn truly discriminative features rather than spurious correlations between features and labels. We evaluate the proposed method by conducting speech classification experiments on voice datasets, investigate the importance of various features for speech classification in noisy environments, and assess the effect of the proposed method on speech recognition performance.
Keywords: Voice Recognition; Domain Generalization; Stable Learning; Deep Learning; Signal Processing
Zhichao Zhou, Chaofan Hu. An Intelligent Speech Recognition Method Based on Stable Learning. The Frontiers of Society, Science and Technology (2024), Vol. 6, Issue 7: 17-26. https://doi.org/10.25236/FSST.2024.060704.
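The core idea described in the abstract, learning per-sample weights so that features become statistically decorrelated, can be illustrated with a small sketch. The snippet below is not the authors' StableNet implementation (which uses deep features and Random Fourier Features); it is a simplified, hypothetical illustration that learns a weight distribution over standardized training samples by gradient descent, penalizing the off-diagonal entries of the weighted feature covariance.

```python
import numpy as np

def learn_sample_weights(X, n_iters=500, lr=1e-4):
    """Simplified sketch of sample reweighting for feature decorrelation.

    Assumes X is standardized (zero mean, unit variance per feature).
    Learns weights w (a distribution over samples) that shrink the
    off-diagonal entries of the weighted covariance sum_k w_k x_k x_k^T,
    i.e. the spurious correlations between features.
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n)  # start from uniform weights
    for _ in range(n_iters):
        cov = (X * w[:, None]).T @ X           # weighted second-moment matrix (d x d)
        off = cov - np.diag(np.diag(cov))      # off-diagonal correlations to suppress
        # gradient of sum of squared off-diagonal entries w.r.t. each w_k:
        # dL/dw_k = 2 * x_k^T (off) x_k
        grad = 2.0 * np.einsum('ki,ij,kj->k', X, off, X)
        w -= lr * grad
        w = np.clip(w, 1e-8, None)             # keep weights positive
        w /= w.sum()                           # keep weights a distribution
    return w
```

Once such weights are learned, a classifier can be trained with a sample-weighted loss, so that the model no longer benefits from correlations that only hold in the training distribution.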