Academic Journal of Computing & Information Science, 2022, 5(4); doi: 10.25236/AJCIS.2022.050403.
School of Information Management, Beijing Information Science & Technology University, Beijing, China
Intelligent web page parsing is an essential part of data collection. News web pages contain a large amount of content that has little relevance to the topic, which makes it difficult to locate the body text directly and quickly during data collection. This paper proposes an intelligent web page parsing algorithm based on text density and symbol density. In an empirical study on pages from mainstream news websites in China, the algorithm extracted the body text of news web pages quickly and accurately.
Web page extraction; intelligent page parsing; text density; symbol density
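The abstract does not state the formulas behind the two densities, so the following Python sketch only illustrates the general idea and should not be read as the paper's algorithm. It assumes that text density is the number of non-link characters per tag, that symbol density is the number of punctuation marks per tag, and that the two are combined as text density times log(1 + symbol density). The punctuation set, the candidate tag list, and every name in the code (`Node`, `TreeBuilder`, `score`, `extract_main_text`) are hypothetical choices made for illustration.

```python
import math
from html.parser import HTMLParser

# Assumed punctuation set (ASCII plus common Chinese marks); not taken from the paper.
PUNCTUATION = set(",.!?;:，。！？；：、")
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
IGNORED_TAGS = {"script", "style"}
# Assumed set of container tags that may hold the article body.
CANDIDATE_TAGS = {"body", "div", "article", "section", "td"}


class Node:
    """A minimal DOM node: child nodes and text fragments kept in document order."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.items = tag, parent, []

    def text(self, skip=()):
        """Concatenate text under this node, skipping subtrees whose tag is in `skip`."""
        parts = []
        for item in self.items:
            if isinstance(item, str):
                parts.append(item)
            elif item.tag not in skip:
                parts.append(item.text(skip))
        return "".join(parts)

    def descendants(self):
        for item in self.items:
            if isinstance(item, Node):
                yield item
                yield from item.descendants()


class TreeBuilder(HTMLParser):
    """Build a rough tree from HTML; tolerant of stray or missing end tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.items.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore unmatched end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in IGNORED_TAGS:
            self.current.items.append(data)


def score(node):
    """Assumed score: text density * log(1 + symbol density), on non-link text only."""
    body = node.text(skip=("a",))
    chars = sum(1 for ch in body if not ch.isspace())
    symbols = sum(1 for ch in body if ch in PUNCTUATION)
    # Count the node itself plus its non-link descendants, so the divisor is never zero.
    tags = 1 + sum(1 for d in node.descendants() if d.tag != "a")
    text_density = chars / tags
    symbol_density = symbols / tags
    return text_density * math.log(1 + symbol_density)


def extract_main_text(html):
    """Return the whitespace-normalized text of the highest-scoring candidate block."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    candidates = [n for n in builder.root.descendants() if n.tag in CANDIDATE_TAGS]
    if not candidates:
        return ""
    return " ".join(max(candidates, key=score).text().split())


SAMPLE_HTML = """
<html><body>
<div id="nav"><a href="/">Home</a><a href="/world">World</a><a href="/sports">Sports</a></div>
<div id="content">
<p>The city council approved the new transit plan on Monday, after months of debate.</p>
<p>Officials said construction will begin next spring; the first line should open in 2026.</p>
</div>
<div id="footer"><a href="/about">About us</a> <span>Copyright</span></div>
</body></html>
"""

print(extract_main_text(SAMPLE_HTML))
```

In this made-up sample, the navigation block consists only of link text and the footer contains no punctuation, so both score zero under these assumptions, while the block holding the two paragraphs scores highest. Real news pages are noisier, and a practical scorer would likely need further signals and tuning.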
Junyu Xie. Webpage Intelligent Parsing Algorithm Based on Text and Symbol Density. Academic Journal of Computing & Information Science (2022), Vol. 5, Issue 4: 18-21. https://doi.org/10.25236/AJCIS.2022.050403.