
Academic Journal of Business & Management, 2023, 5(3); doi: 10.25236/AJBM.2023.050317.

Identifying the Optimal Machine Learning Model for Predicting Car Insurance Claims: A Comparative Study Utilising Advanced Techniques


Xiaonan Li

Corresponding Author:
Xiaonan Li

Central University of Finance and Economics, Beijing, China


This study investigates machine learning techniques for predicting car insurance claims, with a specific focus on identifying the optimal model for the task. Advanced techniques, namely SMOTEENN resampling together with ANOVA and Chi-squared tests, were used to address the challenges of processing imbalanced data and identifying relevant features. Our evaluation of five popular and effective models, namely logistic regression (LR), random forest (RF), support vector machine (SVM), multi-layer perceptron (MLP), and extreme gradient boosting (XGBoost), yields results that demonstrate the superiority of the RF model in predicting car insurance claims. Furthermore, our study illustrates the advantages of machine learning algorithms in handling large and complex datasets, predicting future insurance claims, and adapting to changing circumstances, which makes them valuable tools for practitioners in the insurance industry.
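To make the resampling step named in the abstract more concrete, the following is a minimal pure-Python sketch of the general idea behind combining SMOTE oversampling with edited-nearest-neighbours (ENN) cleaning. It is not the study's implementation: the function names (`smote`, `enn`, `nearest`), the toy data, and the choice of `k=3` are illustrative assumptions, and binary class labels are assumed. In practice a library implementation, such as the `SMOTEENN` class in imbalanced-learn, would normally be used instead.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(idx, points, k):
    """Indices of the (up to) k points closest to points[idx], excluding idx."""
    others = [j for j in range(len(points)) if j != idx]
    others.sort(key=lambda j: sq_dist(points[idx], points[j]))
    return others[:k]


def smote(X, y, minority, k=3, seed=0):
    """Add synthetic minority samples until both classes have equal counts.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Assumes binary labels and at least two minority samples.
    """
    rng = random.Random(seed)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_new = (len(y) - len(minority_pts)) - len(minority_pts)
    X_out, y_out = list(X), list(y)
    for _ in range(max(n_new, 0)):
        i = rng.randrange(len(minority_pts))
        j = rng.choice(nearest(i, minority_pts, k))
        gap = rng.random()  # interpolation factor in [0, 1)
        a, b = minority_pts[i], minority_pts[j]
        X_out.append(tuple(p + gap * (q - p) for p, q in zip(a, b)))
        y_out.append(minority)
    return X_out, y_out


def enn(X, y, k=3):
    """Drop every sample whose label disagrees with the majority vote
    of its k nearest neighbours (edited nearest neighbours)."""
    keep = []
    for i in range(len(X)):
        votes = Counter(y[j] for j in nearest(i, X, k))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Made-up toy data: 8 "no claim" points (label 0) and 3 "claim" points (label 1).
X = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.1), (0.2, 1.0), (0.8, 0.9), (1.2, 1.1),
     (0.1, 0.6), (0.9, 0.4), (5.0, 5.0), (5.5, 5.2), (5.1, 5.8)]
y = [0] * 8 + [1] * 3

X_bal, y_bal = smote(X, y, minority=1)   # oversample the minority class
X_res, y_res = enn(X_bal, y_bal)         # then remove samples that look like noise
print(Counter(y), Counter(y_bal), Counter(y_res))
```

The oversampling step balances the class counts, and the cleaning step then removes samples that sit inside the other class's neighbourhood, which is why the two are usually applied in that order.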


machine learning, car insurance claims, LR, RF, SVM, MLP, XGBoost

Cite This Paper

Xiaonan Li. Identifying the Optimal Machine Learning Model for Predicting Car Insurance Claims: A Comparative Study Utilising Advanced Techniques. Academic Journal of Business & Management (2023) Vol. 5, Issue 3: 112-120. https://doi.org/10.25236/AJBM.2023.050317.

