Talks and Poster Presentations (with Proceedings-Entry):
"Object-Level Document Analysis of PDF Files";
Talk: The 9th ACM Symposium on Document Engineering (DocEng'09),
- 2009-09-18; in: "Proc. of the 2009 ACM Symposium on Document Engineering",
U. M. Borghoff, B. Chidlovskii, S. Rönnau (ed.);
The PDF format is commonly used for the exchange of documents on the Web and there is a growing need to understand and extract or repurpose data held in PDF documents. Many systems for processing PDF files use algorithms designed for scanned documents, which analyse a page based on its bitmap representation. We believe this approach to be inefficient. Not only does the rasterization step cost processing time, but information is also lost and errors can be introduced.
Inspired primarily by the need to facilitate machine extraction of data from PDF documents, we have developed methods to extract textual and graphic content directly from the PDF content stream and represent it as a list of "objects" at a level of granularity suitable for structural understanding of the document. These objects are then grouped into lines, paragraphs and higher-level logical structures using a novel bottom-up segmentation algorithm based on visual perception principles. Experimental results demonstrate the viability of our approach, which is currently used as a basis for HTML conversion and data extraction methods.
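
The idea of grouping extracted objects bottom-up can be illustrated with a small sketch. This is not the segmentation algorithm from the paper, whose details the abstract does not give. It is a simple greedy heuristic that merges text fragments into lines when they overlap vertically and sit close together horizontally. The names TextFragment, vertical_overlap, group_into_lines and line_text, the thresholds, and the sample coordinates are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextFragment:
    """A hypothetical text object with a bounding box in PDF user space
    (origin at the bottom left, so y1 is the top edge)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def vertical_overlap(a: TextFragment, b: TextFragment) -> float:
    """Vertical overlap of two boxes as a fraction of the shorter box's height."""
    overlap = min(a.y1, b.y1) - max(a.y0, b.y0)
    shortest = min(a.y1 - a.y0, b.y1 - b.y0)
    return overlap / shortest if shortest > 0 else 0.0


def group_into_lines(fragments, min_overlap=0.5, max_gap=10.0):
    """Greedily merge fragments into lines (assumed thresholds, not from the paper).

    Fragments are visited left to right. Each one joins the first existing
    line whose last fragment overlaps it vertically and lies within max_gap
    units to its left; otherwise it starts a new line.
    """
    lines = []
    for frag in sorted(fragments, key=lambda f: f.x0):
        for line in lines:
            last = line[-1]
            gap = frag.x0 - last.x1
            if vertical_overlap(last, frag) >= min_overlap and gap <= max_gap:
                line.append(frag)
                break
        else:
            lines.append([frag])
    # Order the resulting lines from the top of the page downwards.
    lines.sort(key=lambda line: -max(f.y1 for f in line))
    return lines


def line_text(line) -> str:
    """Join the fragments of one line with single spaces."""
    return " ".join(f.text for f in line)


# Made-up sample data: two text lines plus a distant fragment on the second baseline.
sample = [
    TextFragment("Object-Level", 72, 700, 130, 712),
    TextFragment("Document", 134, 700, 180, 712),
    TextFragment("Analysis", 184, 701, 225, 713),
    TextFragment("of", 72, 684, 82, 696),
    TextFragment("PDF", 86, 684, 105, 696),
    TextFragment("Files", 300, 684, 325, 696),
]

for line in group_into_lines(sample):
    print(line_text(line))
```

In this sketch the fragment "Files" shares a baseline with "of PDF" but stays separate because the horizontal gap exceeds max_gap, which is roughly how a proximity rule would keep two columns apart. Grouping lines into paragraphs could reuse the same pattern with vertical gaps in place of horizontal ones.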
Project Head Reinhard Pichler:
GraphWrap - Graph-Based Wrapping from PDF Documents
Created from the Publication Database of the Vienna University of Technology.