*Academic Journal of Computing & Information Science*, 2022, 5(1); doi: 10.25236/AJCIS.2022.050107.

Sufen Chen^{1}, Xueqiang Zeng^{2}

Xueqiang Zeng

^{1}School of Information Engineering, Nanchang Institute of Technology, Nanchang, Jiangxi Province 330099, P.R. China

^{2}School of Computer & Information Engineering, Jiangxi Normal University, Nanchang, Jiangxi Province 330022, P.R. China

In the era of big data, machine learning-based data analysis has been integrated into almost all walks of modern life. Before machine learning can be applied, an algorithm and suitable hyper-parameters have to be chosen, which requires substantial machine learning expertise and many manual iterations. To popularize machine learning and allow non-professionals to use it to solve problems, automatic machine learning model selection is particularly important. Among existing automatic model selection methods, Progressive Sampling-based Bayesian Optimization (PSBO) is one of the most efficient and effective. However, PSBO performs progressive sampling with a traditional random sampling strategy, which does not consider the importance of individual samples. Based on the idea that more important and effective samples lead to better model training results, this paper proposes Sample Importance Guided Progressive Sampling-based Bayesian Optimization (SIG-PSBO) for automatic machine learning. SIG-PSBO defines the importance of a sample by how difficult it is to distinguish its category in a PCA feature space. Samples with higher importance are then more likely to be selected for subsequent model training. Extensive experimental results show that SIG-PSBO significantly shortens the search time and reduces the classification error rate compared with the original PSBO method.
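The idea of importance-guided sampling in a PCA feature space can be illustrated with a minimal sketch. The abstract does not give the exact importance formula, so the score used here (distance to a sample's own class centroid relative to the nearest other-class centroid) is an assumption made for illustration, as are the function names `pca_project`, `sample_importance`, and `importance_sample`. It should not be read as the authors' definition.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project X onto its top principal components via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T


def sample_importance(Z, y):
    """Hypothetical importance score (an assumption, not the paper's formula).

    For each sample: d_own / (d_own + d_other), where d_own is the distance
    to its own class centroid and d_other the distance to the nearest
    centroid of another class. Larger values mean the sample is harder to
    tell apart from another class.
    """
    classes = np.unique(y)
    centroids = np.stack([Z[y == c].mean(axis=0) for c in classes])
    # dists[i, k] = distance from sample i to the centroid of class k
    dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
    own = y[:, None] == classes[None, :]
    d_own = dists[own]  # exactly one True per row, so one value per sample
    d_other = np.where(own, np.inf, dists).min(axis=1)
    return d_own / (d_own + d_other + 1e-12)


def importance_sample(X, y, n_samples, rng, n_components=2):
    """Draw n_samples distinct indices, favoring samples with higher importance."""
    Z = pca_project(X, n_components)
    w = sample_importance(Z, y) + 1e-6  # keep every probability strictly positive
    p = w / w.sum()
    idx = rng.choice(len(X), size=n_samples, replace=False, p=p)
    return idx, p
```

In a progressive scheme, a function like `importance_sample` would be called with a growing `n_samples` at each round, so that the harder-to-distinguish samples tend to enter the training subset earlier than they would under uniform random sampling.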

Automatic machine learning, Sample importance sampling, Principal component analysis, Progressive sampling

Sufen Chen, Xueqiang Zeng. Sample Importance Guided Progressive Sampling-Based Bayesian Optimization for Automatic Machine Learning. Academic Journal of Computing & Information Science (2022), Vol. 5, Issue 1: 32-39. https://doi.org/10.25236/AJCIS.2022.050107.
