Academic Journal of Computing & Information Science, 2023, 6(5); doi: 10.25236/AJCIS.2023.060512.
Chengge Duan1, Minze Wang2, Xin Lu2, Junming Wang3
1Suzhou Public Security Bureau, Suzhou, Jiangsu, China
2School of Social Computing, Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu, China
3The Third Research Institute of Ministry of Public Security, Shanghai, China
In today's Internet age, phishing is a common form of cyberattack. Most existing URL-based anti-phishing techniques are simple and effective but lag behind new attacks, whereas approaches based on machine learning and deep learning can improve detection efficiency. This study proposes preprocessing website data with TF-IDF and then applying a random forest model to classify websites by their phishing-related features. The experimental results show that the random forest model identifies phishing websites with high accuracy and provides strong anti-phishing capability.
Phishing Website Detection; Random Forest Model; TF-IDF; Machine Learning
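The TF-IDF preprocessing step named in the abstract can be illustrated with a minimal, standard-library-only sketch. The abstract does not specify the tokenization or the TF-IDF variant used, so the choices here are assumptions: tokens are lowercase alphanumeric runs, term frequency is normalized by document length, and IDF is the unsmoothed log(N / df). The function names `tokenize` and `tfidf` and the example URLs are hypothetical and do not come from the paper's dataset.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text (here, a URL) into lowercase alphanumeric tokens."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def tfidf(documents):
    """Return one {token: weight} dict per document.

    Assumed variant (not confirmed by the paper):
      tf  = occurrences of token in document / tokens in document
      idf = log(N / df), df = number of documents containing the token
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: count each token once per document.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            tok: (cnt / total) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return vectors


# Hypothetical example inputs, for illustration only.
urls = [
    "http://example.com/login",
    "http://example.com/about",
    "http://secure-login.example.net/verify/account",
]
vectors = tfidf(urls)
```

Tokens that occur in every document, such as `http` here, receive a weight of zero, while tokens unique to one document receive the largest weights. In a full pipeline, weight vectors like these would be passed as features to a random forest classifier, for example scikit-learn's `RandomForestClassifier`; that step is omitted from this sketch.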
Chengge Duan, Minze Wang, Xin Lu, Junming Wang. A phishing website detection system based on machine learning methods. Academic Journal of Computing & Information Science (2023), Vol. 6, Issue 5: 91-94. https://doi.org/10.25236/AJCIS.2023.060512.