The Frontiers of Society, Science and Technology, 2024, 6(8); doi: 10.25236/FSST.2024.060815.

Analysis of Unstructured Document Data Extraction Technology

Author(s)

Shiguang Sun1, Jiaxin Ye2

Corresponding Author

Shiguang Sun

Affiliation(s)

1School of Innovation and Entrepreneurship, Liaoning University, Shenyang, Liaoning, China

2Asia-Australia Business College, Liaoning University, Shenyang, Liaoning, China

Abstract

A considerable quantity of unstructured documents exists in practice. These documents are characterized by a low degree of structure, a large share of overall data storage, abundant information, and redundant data, all of which make managing them extremely complex. Yet unstructured documents contain a large amount of valuable data. The key to extracting information from them lies in overcoming the difficulties of traditional information extraction techniques, such as limited model generalization ability, the complexity of context understanding, and the difficulty of data annotation. Analyzing unstructured document data and automatically extracting the useful information it contains poses challenges for technologies in pattern recognition, machine learning, and deep learning. This paper examines the structural characteristics of unstructured documents and collates and analyzes the existing technologies and methods for extracting information from them, with the objective of better serving the digitization of archives.

Keywords

Unstructured Document, DLA, Deep Learning, RNN, CNN

Cite This Paper

Shiguang Sun, Jiaxin Ye. Analysis of Unstructured Document Data Extraction Technology. The Frontiers of Society, Science and Technology (2024), Vol. 6, Issue 8: 84-88. https://doi.org/10.25236/FSST.2024.060815.
