Huihui Jin1, Longyin Luo1, Xinyi Wang2, Xiaoqi Zhu3, Lian Qian4, Zhice Zhang4
1Wenzhou-Kean University, Wenzhou, Zhejiang Province, China
2Sichuan University, Chengdu, Sichuan, China
3Australian National University, ACT, Australia
4The Affiliated High School to Hangzhou Normal University, Hangzhou, Zhejiang, China
These authors contributed equally to this work
Evaluating a borrower's potential default risk and estimating the probability of default before a loan is issued is a foundation of credit risk management in modern financial institutions. This paper analyzes historical loan data from banks and other financial institutions, treats default prediction as an imbalanced data classification problem, and builds loan default prediction models with machine learning algorithms (rather than traditional statistical methods) such as random forest, logistic regression, and decision tree. The experimental results show that the neural network and random forest algorithms outperform the decision tree and logistic regression classifiers in prediction performance. In addition, ranking feature importance with the random forest algorithm identifies the features that most strongly influence default, which supports more effective judgment of loan risk in the financial sector.
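To illustrate why loan default prediction is usually framed as an imbalanced classification problem, the following minimal Python sketch compares accuracy with recall and precision on a toy portfolio. The label counts, the "always repaid" baseline, and the example predictions are hypothetical values chosen for illustration; they are not taken from the paper's data or results.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating label 1 (default) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Share of all loans that were classified correctly."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def recall(y_true, y_pred):
    """Share of actual defaults that were flagged (0.0 if there are no defaults)."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(y_true, y_pred):
    """Share of flagged loans that actually defaulted (0.0 if nothing was flagged)."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical toy portfolio: 95 repaid loans (label 0) followed by 5 defaults (label 1).
y_true = [0] * 95 + [1] * 5

# Baseline that predicts "repaid" for every borrower.
always_repaid = [0] * 100

# Hypothetical classifier: 10 false alarms among repaid loans, 4 of 5 defaults caught.
model_pred = [1] * 10 + [0] * 85 + [1] * 4 + [0] * 1

for name, pred in [("always repaid", always_repaid), ("toy model", model_pred)]:
    print(
        f"{name}: accuracy={accuracy(y_true, pred):.2f}, "
        f"recall={recall(y_true, pred):.2f}, "
        f"precision={precision(y_true, pred):.2f}"
    )
```

In this toy setting the baseline reaches higher accuracy than the toy model while detecting no defaults at all, which is why evaluation on imbalanced loan data typically looks at recall and precision for the default class and not at accuracy alone.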
Random Forest, Bank Credit, Loan Default Prediction, Data Mining
Huihui Jin, Longyin Luo, Xinyi Wang, Xiaoqi Zhu, Lian Qian, Zhice Zhang. Financial Credit Default Forecast Based on Big Data Analysis. Academic Journal of Business & Management (2021) Vol. 3, Issue 8: 51-56. https://doi.org/10.25236/AJBM.2021.030810.