Academic Journal of Computing & Information Science, 2021, 4(6); doi: 10.25236/AJCIS.2021.040606.

K-means clustering for the analysis of incomplete business data

Author(s)

Qi An1, XiMing Ma2

Corresponding Author:
Qi An
Affiliation(s)

1School of Mathematical Sciences, Nanjing Normal University, Nanjing, China

2College of Data Science and Application, Inner Mongolia University of Technology, Huhhot, China

Abstract

Missing values can significantly reduce the accuracy and availability of business data. When incomplete data are clustered, records with missing values are usually deleted and only the complete records are analyzed. However, this often causes substantial information loss or bias. This paper studies how unsupervised machine learning techniques can be used to handle missing values. Combining an imputation method with a clustering technique yields a new way of handling missing values that helps to overcome the missing-data problem. We propose a strategy that pairs each of three clustering algorithms (K-means, big data K-means, and p-k-means) with each of three imputation methods (mean imputation, singular value decomposition imputation, and k-nearest neighbor imputation), giving nine combined methods. We compare the performance of these nine methods in experiments on four benchmark business data sets. The results verify the effectiveness of combining K-means clustering with imputation on different data sets and suggest that the approach has practical application prospects.
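
As a rough illustration of the impute-then-cluster idea described in the abstract, the following minimal sketch pairs mean imputation with a plain Lloyd-style K-means. It is not the authors' implementation: the function names `mean_impute` and `kmeans`, the toy data, and the choice of `k = 2` are all assumptions made for this example, and the SVD and k-nearest neighbor imputation methods and the big data K-means and p-k-means variants are not shown.

```python
import math
import random


def mean_impute(rows):
    """Replace each None entry with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd-style K-means (illustrative only); returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy data: None marks a missing value.
data = [
    [1.0, 2.0], [1.2, None], [0.8, 1.9], [None, 2.1],
    [8.0, 9.0], [8.2, 9.1], [None, 8.8], [7.9, None],
]

imputed = mean_impute(data)           # step 1: fill in missing values
labels, centroids = kmeans(imputed, k=2)  # step 2: cluster the completed data
print(labels)
```

Unlike deleting incomplete records, this keeps all eight rows in the clustering. Mean imputation pulls the filled-in values toward the overall column mean, which is one reason to compare it against alternatives such as SVD or k-nearest neighbor imputation.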

Keywords

Missing data, Imputation, Clustering, Business

Cite This Paper

Qi An, XiMing Ma. K-means clustering for the analysis of incomplete business data. Academic Journal of Computing & Information Science (2021), Vol. 4, Issue 6: 35-42. https://doi.org/10.25236/AJCIS.2021.040606.
