Talks and Poster Presentations (with Proceedings-Entry):
"GraphWrap - A System for Interactive Wrapping of PDF Documents Using Graph Matching Techniques";
Poster: The 9th ACM Symposium on Document Engineering (DocEng'09),
- 2009-09-18; in: "Proc. of the 2009 ACM Symposium on Document Engineering",
U. M. Borghoff, B. Chidlovskii, S. Rönnau (ed.);
We present GraphWrap, a novel and innovative approach to wrapping PDF documents. The PDF format is often used to publish large amounts of structured data, such as product specifications, measurements, prices or contact information. As the PDF format is unstructured, it is very difficult to use this data in machine processing applications. Wrapping is the process of navigating the data source, semiautomatically extracting the data and transforming it into a structured form. GraphWrap enables a non-expert user to create such data extraction programs for almost any PDF file in an intuitive and interactive manner. We show how a wrapper can be created by selecting an example instance and interacting with the graph representation to set conditions and choose which data items to extract. In the background, the corresponding instances are found using an algorithm based on subgraph isomorphism. The resulting wrapper can then be run on other pages and documents which exhibit a similar visual structure.
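To illustrate what finding corresponding instances by subgraph matching can look like, here is a minimal Python sketch. It is not GraphWrap's implementation: it is a generic backtracking matcher, and the node labels ("name", "price"), the relation names ("left_of", "above") and the function find_matches are hypothetical names chosen for this example. It checks that every edge of the pattern is present in the page graph (non-induced matching), which is one common reading of the subgraph isomorphism problem.

```python
from typing import Dict, Iterator, Tuple

Nodes = Dict[str, str]              # node id -> label (hypothetical, e.g. "name")
Edges = Dict[Tuple[str, str], str]  # (source, target) -> relation (hypothetical)


def find_matches(p_nodes: Nodes, p_edges: Edges,
                 t_nodes: Nodes, t_edges: Edges) -> Iterator[Dict[str, str]]:
    """Yield each injective mapping of pattern nodes onto target nodes that
    preserves node labels and every labelled pattern edge."""
    order = list(p_nodes)

    def consistent(mapping: Dict[str, str], p: str, t: str) -> bool:
        # Labels must agree and a target node may be used only once.
        if p_nodes[p] != t_nodes[t] or t in mapping.values():
            return False
        trial = {**mapping, p: t}
        # Every pattern edge whose endpoints are both mapped must exist
        # in the target with the same relation.
        for (a, b), rel in p_edges.items():
            if a in trial and b in trial:
                if t_edges.get((trial[a], trial[b])) != rel:
                    return False
        return True

    def extend(mapping: Dict[str, str], depth: int) -> Iterator[Dict[str, str]]:
        if depth == len(order):
            yield dict(mapping)
            return
        p = order[depth]
        for t in t_nodes:
            if consistent(mapping, p, t):
                mapping[p] = t
                yield from extend(mapping, depth + 1)
                del mapping[p]  # backtrack

    yield from extend({}, 0)


# Example instance selected by the user: a name with a price to its right.
pattern_nodes = {"n": "name", "p": "price"}
pattern_edges = {("n", "p"): "left_of"}

# Hypothetical page graph: two rows, each a name followed by a price.
page_nodes = {"t1": "name", "t2": "price", "t3": "name", "t4": "price"}
page_edges = {
    ("t1", "t2"): "left_of",
    ("t3", "t4"): "left_of",
    ("t1", "t3"): "above",
    ("t2", "t4"): "above",
}

matches = list(find_matches(pattern_nodes, pattern_edges, page_nodes, page_edges))
```

Each mapping in matches corresponds to one record on the page (here, one name/price row). A plain backtracking search like this is exponential in the worst case, which is acceptable only because example patterns are typically small.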
Project Head Reinhard Pichler:
GraphWrap - Graph-Based Wrapping from PDF Documents
Created from the Publication Database of the Vienna University of Technology.