

Talks and Poster Presentations (with Proceedings Entry):

B. Krüpl, M. Herzog, W. Gatterbauer:
"Using Visual Cues for Extraction of Tabular Data from Arbitrary HTML Documents";
Poster: The 14th International World Wide Web Conference (WWW2005), Chiba, Japan; 10 May 2005 - 14 May 2005; in: "Special Interest Tracks and Posters", ACM, (2005), ISBN: 1-59593-051-5; pp. 1000 - 1001.



Abstract (English):
We describe a method to extract tabular data from web pages. Rather than just analyzing the DOM tree, we also exploit visual cues in the rendered version of the document to extract data from tables which are not explicitly marked with an HTML <table> element. To detect tables, we rely on a variant of the well-known X-Y cut algorithm as used in the OCR community. We implemented the system by directly accessing Mozilla's box model that contains the positional data for all HTML elements of a given web page.
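
For readers unfamiliar with the X-Y cut idea mentioned in the abstract, the following is a minimal sketch of the classic recursive version applied to rendered bounding boxes. It is not the variant used by the authors, which the abstract does not specify. The box format `(left, top, right, bottom)`, the function names, and the `min_gap` threshold are assumptions made for illustration; in a real system the boxes would come from the browser's layout data rather than a hand-written list.

```python
from typing import List, Tuple

# Assumed box format: (left, top, right, bottom) in page coordinates.
Box = Tuple[float, float, float, float]


def _split(boxes: List[Box], lo: int, hi: int, min_gap: float) -> List[List[Box]]:
    """Group boxes along one axis, cutting wherever an empty gap of at
    least min_gap separates consecutive boxes (lo/hi index the axis)."""
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # furthest extent covered by the current group
    for box in ordered[1:]:
        if box[lo] - reach >= min_gap:
            groups.append([box])      # gap found: start a new group
        else:
            groups[-1].append(box)    # overlaps or is too close: same group
        reach = max(reach, box[hi])
    return groups


def xy_cut(boxes: List[Box], min_gap: float = 5.0) -> List[List[Box]]:
    """Recursively partition boxes into leaf blocks that cannot be
    separated by a horizontal or vertical whitespace cut."""
    if len(boxes) <= 1:
        return [boxes] if boxes else []
    # Try horizontal cuts (gaps along y) first, then vertical cuts (gaps along x).
    for lo, hi in ((1, 3), (0, 2)):
        groups = _split(boxes, lo, hi, min_gap)
        if len(groups) > 1:
            leaves: List[List[Box]] = []
            for group in groups:
                leaves.extend(xy_cut(group, min_gap))
            return leaves
    return [boxes]  # no cut possible: this is a leaf block


# Four boxes laid out as a 2x2 grid without any table markup.
cells = [(0, 0, 40, 10), (60, 0, 100, 10), (0, 20, 40, 30), (60, 20, 100, 30)]
blocks = xy_cut(cells)
```

In this sketch the grid is first cut into two rows and each row into two cells, so the sequence of cuts reflects a row and column structure even though no table element is involved.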


Online library catalog of TU Wien:
http://aleph.ub.tuwien.ac.at/F?base=tuw01&func=find-c&ccl_term=AC05936167



Associated projects:
Project lead Reinhard Pichler:
Know it All and Know it Right: High Quality Mining in the Web


Generated from the publication database of the Technische Universität Wien (Vienna University of Technology).