Academic Journal of Computing & Information Science, 2022, 5(4); doi: 10.25236/AJCIS.2022.050403.

Webpage Intelligent Parsing Algorithm Based on Text and Symbol Density


Junyu Xie

School of Information Management, Beijing Information Science & Technology University, Beijing, China


Intelligent web page parsing is an essential part of data collection. News web pages contain a large amount of content that has little relevance to the topic, which makes it difficult to locate the body text directly and quickly during data collection. This paper proposes an intelligent web page parsing algorithm based on text density and symbol density. Empirical tests on pages from mainstream news websites in China show that the algorithm extracts the body text of news web pages quickly and accurately.


Web page extraction; intelligent page parsing; text density; symbol density

Junyu Xie. Webpage Intelligent Parsing Algorithm Based on Text and Symbol Density. Academic Journal of Computing & Information Science (2022), Vol. 5, Issue 4: 18-21. https://doi.org/10.25236/AJCIS.2022.050403.


