Talks and Poster Presentations (with Proceedings-Entry):
K. Duretec, A. Rauber, C. Becker:
"A Text Extraction Software Benchmark Based on a Synthesized Dataset";
Talk: 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL),
Toronto, ON, Canada;
2017-06-23; in: "2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL)",
Text extraction plays an important role in data processing workflows in digital libraries. For example, it is a crucial prerequisite for evaluating the quality of migrated textual documents. Complex file formats make the extraction process error-prone and have made it very challenging to verify the correctness of extraction components. Based on digital preservation and information retrieval scenarios, three quality requirements for the effectiveness of text extraction tools are identified: 1) is a certain text snippet correctly extracted from a document, 2) does the extracted text appear in the right order relative to other elements, and 3) is the structure of the text preserved. A number of text extraction tools are available that fulfill these three quality requirements to varying degrees. However, systematic benchmarks to evaluate those tools are still missing, mainly due to the lack of datasets with accompanying ground truth. The contribution of this paper is two-fold. First, we describe a dataset generation method based on model-driven engineering principles and use it to synthesize a dataset and its ground truth directly from a model. Second, we define a benchmark for text extraction tools and conduct an experiment to calculate performance measures for several tools covering the three quality requirements. The results demonstrate the benefits of the approach in terms of scalability and effectiveness in generating ground truth for the content and structure of text elements.
"Official" electronic version of the publication (accessed through its Digital Object Identifier - DOI)
Created from the Publication Database of the Vienna University of Technology.