Publication Entry

Diploma and Master Theses (authored and supervised):

M. Aigner:
"Extraction of Tabular Information from PDF Documents - A graph-based Approach";
Supervisors: P. Knees, F. Preis; Institut für Information Systems Engineering, 2020; final examination: 2020-11-16.

English abstract:

A structured content representation is the foundation for many systems, whether for automated document processing in intelligent workflows or for general queries on the World Wide Web. However, vast amounts of information are stored in unstructured formats, a widespread example being the Portable Document Format (PDF). Deriving structure from such formats therefore holds considerable untapped potential. Important information in documents is often stored in tables, which provide a dense representation of content. Detecting tables is difficult, however, because of the practically unbounded variety of their visual and structural representations. Most current state-of-the-art systems for this task use neural networks, which require large amounts of training data and have difficulty handling data unlike that seen during training. In this work, a system is developed to detect and extract tables from PDFs while generalizing across all kinds of layouts. This goal is directly tied to the research question: how do graph-based, unsupervised approaches compare to recently developed neural networks for table recognition? First, textual content is extracted directly from the PDF documents. Then a visibility graph is built from the extracted layout elements to capture their relationships. A clustering procedure on this graph segments each page into logical groups, and tables are then identified among the resulting segments by classification using multimodal features. The evaluation shows that the developed system performs well on different and unseen layouts, irrespective of how tables are visually represented. This is achieved without layout-specific parameter tuning, although the system supports such adaptations, which can further improve performance on known table layouts.
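
The visibility-graph step mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the `Box` type, the function names, and the blocking rule (any box lying fully in the gap between two boxes and overlapping their shared band blocks the view) are assumptions made for illustration only. Two text elements are connected if they face each other horizontally or vertically with nothing in between.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of one text element (hypothetical type)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def _spans(box, axis):
    """Return the (along, across) intervals of a box for axis 'h' or 'v'."""
    xs, ys = (box.x0, box.x1), (box.y0, box.y1)
    return (xs, ys) if axis == "h" else (ys, xs)


def _overlap(p, q):
    """Return the overlap of two intervals, or None if they do not overlap."""
    lo, hi = max(p[0], q[0]), min(p[1], q[1])
    return (lo, hi) if lo < hi else None


def visible(a, b, boxes, axis):
    """True if a and b face each other along `axis` with no box in between."""
    (a_along, a_across), (b_along, b_across) = _spans(a, axis), _spans(b, axis)
    band = _overlap(a_across, b_across)  # shared strip perpendicular to axis
    if band is None:
        return False
    first, second = sorted((a_along, b_along))
    gap = (first[1], second[0])  # free space between the two boxes
    if gap[0] > gap[1]:
        return False  # the boxes overlap along the axis
    for c in boxes:
        if c is a or c is b:
            continue
        c_along, c_across = _spans(c, axis)
        # Simplified rule (assumption): c blocks if it lies inside the gap
        # and touches the shared band, even if part of the view stays open.
        if _overlap(c_across, band) and gap[0] <= c_along[0] and c_along[1] <= gap[1]:
            return False
    return True


def visibility_graph(boxes):
    """Return the set of index pairs (i, j), i < j, that can see each other."""
    edges = set()
    for i, j in combinations(range(len(boxes)), 2):
        if any(visible(boxes[i], boxes[j], boxes, axis) for axis in "hv"):
            edges.add((i, j))
    return edges


# Toy page: a header row of three cells and one cell below the first.
boxes = [
    Box("Name", 0, 0, 10, 5),
    Box("Qty", 20, 0, 30, 5),
    Box("Total", 40, 0, 50, 5),
    Box("Apple", 0, 10, 10, 15),
]
edges = visibility_graph(boxes)
print(sorted(edges))  # [(0, 1), (0, 3), (1, 2)]
```

In this toy page "Name" and "Total" are not connected because "Qty" sits between them. A clustering step such as the one the abstract describes would then operate on these edges; how the thesis weights or prunes them is not stated here.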

German abstract (translated into English):

A structured representation of document content is the basis for many systems, whether for automated document processing by means of intelligent workflows or for data queries on the World Wide Web in general. However, a large number of documents are stored in unstructured formats. A widespread example of such a format is the Portable Document Format (PDF). Deriving structured information from unstructured file formats therefore holds great untapped potential. A large share of important information is stored in tables, since they provide a very dense representation of content. Because of the practically unbounded number of different designs, recognizing tabular structures is a difficult task. Today, most systems used for this purpose are based on neural networks, which require large training sets and also have difficulty with data not present in them. In this thesis, a system is developed that both detects and extracts tables from PDFs while generalizing across a wide variety of designs. This goal relates directly to the research question, namely to what extent the performance of graph-based and unsupervised-learning approaches is comparable to that of neural networks for table recognition. Textual elements are extracted directly from PDF documents. A visibility graph is then created that represents the relationships between these elements. A clustering algorithm is subsequently applied to this graph, dividing the pages of a document into logical segments. From the resulting segments, tables are classified using multimodal attributes.
An evaluation of the results shows that the developed system provides good recognition on different formats, including formats not included in the training process, regardless of the form in which tables are visually presented. This was achieved without special parameter adaptations, although such adaptations are supported and can accordingly lead to better results for specific table formats.

Keywords:

Table Detection; Table Extraction; Layout Analysis; Document Segmentation; Graph-based Approach

"Official" electronic version of the publication (accessed through its Digital Object Identifier - DOI)

http://dx.doi.org/10.34726/hss.2020.79244

Electronic version of the publication:

https://repositum.tuwien.at/bitstream/20.500.12708/16174/2/Extraction%20of%20Tabular%20Information%20from%20PDF%20Documents%20-%20A%20graph-based%20Approach.pdf

Created from the Publication Database of the Vienna University of Technology.