Publication Entry

Talks and Poster Presentations (with Proceedings-Entry):

B. Krüpl, M. Herzog, W. Gatterbauer:
"Using Visual Cues for Extraction of Tabular Data from Arbitrary HTML Documents";
Poster: The 14th International World Wide Web Conference (WWW2005), Chiba, Japan; 2005-05-10 - 2005-05-14; in: "Special Interest Tracks and Posters", ACM, (2005), ISBN: 1-59593-051-5; 1000 - 1001.

English abstract:

We describe a method to extract tabular data from web pages. Rather than just analyzing the DOM tree, we also exploit visual cues in the rendered version of the document to extract data from tables which are not explicitly marked with an HTML <table> element. To detect tables, we rely on a variant of the well-known X-Y cut algorithm as used in the OCR community. We implemented the system by directly accessing Mozilla's box model that contains the positional data for all HTML elements of a given web page.
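
Illustrative sketch (not part of the original abstract): the following Python code shows a plain recursive X-Y cut over element bounding boxes, to make the idea in the abstract concrete. It is a minimal textbook-style version, not the authors' variant. It assumes the rendered positions are already available as (left, top, right, bottom) tuples, and the names split_on_axis, xy_cut and min_gap are hypothetical choices made for this example.

```python
from typing import List, Tuple

# Assumed representation of a rendered element: (left, top, right, bottom).
Box = Tuple[float, float, float, float]


def split_on_axis(boxes: List[Box], axis: int, min_gap: float) -> List[List[Box]]:
    """Group boxes that are separated by empty gaps of at least min_gap.

    axis = 0 projects onto the x axis (vertical cuts),
    axis = 1 projects onto the y axis (horizontal cuts).
    """
    lo, hi = axis, axis + 2
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # farthest extent covered so far along this axis
    for box in ordered[1:]:
        if box[lo] - reach >= min_gap:
            groups.append([box])  # wide enough gap: start a new block
        else:
            groups[-1].append(box)
        reach = max(reach, box[hi])
    return groups


def xy_cut(boxes: List[Box], min_gap: float = 5.0) -> List[List[Box]]:
    """Recursively cut the layout and return the leaf blocks."""
    if not boxes:
        return []
    if len(boxes) == 1:
        return [boxes]
    for axis in (1, 0):  # try horizontal cuts first, then vertical cuts
        groups = split_on_axis(boxes, axis, min_gap)
        if len(groups) > 1:
            leaves: List[List[Box]] = []
            for group in groups:
                leaves.extend(xy_cut(group, min_gap))
            return leaves
    return [boxes]  # no cut possible: the boxes form one block


# Four cells laid out as a 2 x 2 grid without any table markup.
cells = [(0, 0, 40, 10), (50, 0, 90, 10), (0, 20, 40, 30), (50, 20, 90, 30)]
blocks = xy_cut(cells)
```

In this sketch the grid is first cut into two rows and each row is then cut into two cells, so the order of the returned blocks follows the row and column structure. Overlapping or touching boxes are never separated, because a cut is only made where the projection leaves a gap of at least min_gap.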

Online library catalogue of the TU Vienna:

http://aleph.ub.tuwien.ac.at/F?base=tuw01&func=find-c&ccl_term=AC05936167

Related Projects:

Project Head Reinhard Pichler:
Know it All and Know it Right: High Quality Mining in the Web

Created from the Publication Database of the Vienna University of Technology.