Speaker recognition system based on MFCC feature extraction CNN architecture

<p>Zhiyi Ji<sup>1</sup>, Guanghao Cheng<sup>2</sup>, Tianyu Lu<sup>3</sup>, Zhiqi Shao<sup>4</sup></p>

doi:10.25236/AJCIS.2024.070707

Academic Journal of Computing & Information Science, 2024, 7(7); doi: 10.25236/AJCIS.2024.070707.


Corresponding Author:

Zhiyi Ji

Affiliation(s)

¹Wuxi Taihu University, Wuxi, China

²Central South University, Changsha, China

³Tianjin University of Technology, Tianjin, China

⁴Shandong Institute of Petroleum and Chemical Technology, Dongying, China


Abstract

This project adopts a self-designed neural network architecture to develop a concise and efficient speaker identification system. The system has two main components. The first is MFCC (Mel-Frequency Cepstral Coefficient) feature extraction, which captures the speaker's distinctive voice characteristics from the audio signal. The second is a convolutional neural network (CNN) composed of multiple convolutional layers, pooling layers, and a fully connected layer, which learns from the extracted features to identify the speaker. Trained and tested on a self-built data set, the system achieved an accuracy of 89%. Because it is simple and efficient, it is suitable for deployment on edge devices without relying on powerful central servers, which enables quick response.
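As an illustration of the first component, the following is a minimal sketch of a textbook MFCC pipeline (pre-emphasis, framing, Hamming window, power spectrum, mel filterbank, log, DCT) using only the Python standard library. It is not the authors' implementation: the function names, the frame length of 256 samples, the hop of 128, the 20 mel filters, the 13 coefficients, and the 0.97 pre-emphasis factor are all assumptions chosen for the example, and the naive DFT is used only to keep the code self-contained.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Naive DFT power spectrum of one frame (bins 0..N/2)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(abs(acc) ** 2 / n)
    return spec

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(mel_max * i / (n_filters + 1)) for i in range(n_filters + 2)]
    edges = [int(math.floor((n_fft + 1) * f / sample_rate)) for f in edges_hz]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = [0.0] * n_bins
        for k in range(lo, mid):      # rising slope
            row[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):      # falling slope
            row[k] = (hi - k) / (hi - mid)
        bank.append(row)
    return bank

def dct2(values, n_out):
    """Unnormalized DCT-II, keeping the first n_out coefficients."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n_out)
    ]

def mfcc(signal, sample_rate, frame_len=256, hop=128,
         n_filters=20, n_coeffs=13, pre_emph=0.97):
    """Return one list of n_coeffs cepstral coefficients per frame.

    All default parameter values are assumptions for this sketch.
    """
    # Pre-emphasis boosts high frequencies before analysis.
    emphasized = [signal[0]] + [
        signal[i] - pre_emph * signal[i - 1] for i in range(1, len(signal))
    ]
    window = [
        0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
        for i in range(frame_len)
    ]
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    features = []
    for start in range(0, len(emphasized) - frame_len + 1, hop):
        frame = [emphasized[start + i] * window[i] for i in range(frame_len)]
        spec = power_spectrum(frame)
        energies = [sum(w * p for w, p in zip(row, spec)) for row in bank]
        # Floor the energies so that log() never sees zero.
        log_energies = [math.log(max(e, 1e-10)) for e in energies]
        features.append(dct2(log_energies, n_coeffs))
    return features

# Example input: a synthetic 440 Hz tone, 0.25 s at 8 kHz (hypothetical data).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(2000)]
features = mfcc(tone, sample_rate)
print(len(features), len(features[0]))
```

The result is a frames-by-coefficients matrix, which is the kind of two-dimensional input a CNN with convolutional and pooling layers could consume. In practice an FFT-based library routine would replace the naive DFT.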

Keywords

MFCC feature extraction; Convolutional Neural Network (CNN); Speaker recognition; Identity recognition; Audio processing

Cite This Paper

Zhiyi Ji, Guanghao Cheng, Tianyu Lu, Zhiqi Shao. Speaker recognition system based on MFCC feature extraction CNN architecture. Academic Journal of Computing & Information Science (2024), Vol. 7, Issue 7: 47-59. https://doi.org/10.25236/AJCIS.2024.070707.
