Publication Entry

Diploma and Master Theses (authored and supervised):

A. Mehlführer:
"Web Scraping: A Tool Evaluation";
Supervisor: C. Huemer; Institut für Softwaretechnik und Interaktive Systeme, 2009.

English abstract:

The WWW has grown into one of the most important communication and information media. Companies can use the Internet for various tasks, one of which is information acquisition. Because the Internet provides such a huge amount of information, it is necessary to distinguish between relevant and irrelevant data. Web scrapers can be used to gather defined content from the Internet. Many scrapers and crawlers are available, so we decided to carry out a case study with a company and to analyze the available programs that are applicable in this domain.

The company is the leading provider of online gaming entertainment. It offers sports betting, poker, casino games, soft games and skill games. For the sports betting platform it uses data about events (fixtures and results) that are partly supplied by external feed providers. Accordingly, the company depends on these providers. Another problem is that smaller sports and minor leagues are not covered by the providers. This approach requires cross-checks, that is, manual checks against websites when provider data differ, and data input users have to define a primary source (the source that is used when providers deliver differing data). Data about fixtures and results should instead be delivered by a company-owned application.
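
The cross-check described above can be illustrated with a minimal sketch. It is not taken from the thesis: the function name resolve_event, the provider names and the data layout are hypothetical, and it only assumes that each provider delivers one value per event and that one provider has been designated as the primary source.

```python
def resolve_event(reports, primary_source):
    """Return (value, needs_manual_check) for a single event.

    reports maps a provider name to the value that provider delivered,
    for example a result string. All names here are illustrative.
    """
    if primary_source not in reports:
        raise ValueError("primary source delivered no data for this event")

    distinct_values = set(reports.values())
    if len(distinct_values) == 1:
        # All providers agree, so no cross-check is required.
        return reports[primary_source], False

    # Providers disagree: use the primary source and flag the event
    # so that a data input user can check it against a website.
    return reports[primary_source], True


agreeing = {"feed_a": "2:1", "feed_b": "2:1"}
differing = {"feed_a": "2:1", "feed_b": "1:1"}
print(resolve_event(agreeing, "feed_a"))
print(resolve_event(differing, "feed_b"))
```

In this sketch every disagreement still ends in a manual check, which is the effort the company-owned application is meant to remove.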

This application should be a domain-specific web crawler that gathers information from well-defined sources. This means bwin will no longer depend on other providers and can cover more leagues, so the coverage of the data feed integration tool will increase. Furthermore, it eliminates the cross-checks between different sources. The first aim of this master thesis is to compose a functional specification for using robots to gather event data via the Internet and to integrate the gathered information into the existing systems. Eight selected web scrapers will be evaluated against a benchmark catalogue that is created according to the functional specification. The catalogue and the selection are designed to be reused in further projects of the company. The evaluation should result in a recommendation of which web scraper best fits the requirements of the given domain.

German abstract:

Das WWW wurde zu einem der wichtigsten Kommunikations- und Informationsmedien. Firmen können das Internet für die verschiedensten Anwendungen einsetzen. Ein mögliches Einsatzgebiet ist die Informationsbeschaffung. Da das Internet aber eine so große Menge an Information anbietet, ist es notwendig, zwischen relevant und irrelevant zu unterscheiden. Web Scraper können verwendet werden, um nur bestimmte Inhalte aus dem Internet zu sammeln. Es gibt sehr viele verschiedene Scraper, daher haben wir uns entschieden, eine Case Study mit einem Unternehmen durchzuführen. Wir haben verschiedene Programme analysiert, die in diesem Bereich einsetzbar wären.

Das Unternehmen ist ein führender Anbieter von Online-Spielunterhaltung. Es bietet Sportwetten, Poker, Casinospiele, Soft- und Skillgames an. Für die Sportwettenplattform werden unter anderem Ereignisdaten von externen Feedprovidern verwendet. Daraus resultiert eine starke Abhängigkeit von diesen Providern. Ein weiteres Problem ist, dass weniger populäre Sportarten und Unterligen von diesen Providern größtenteils nicht zur Verfügung gestellt werden. Durch Dateninkonsistenzen der einzelnen Feedprovider sind aufwändige, händische Überprüfungen notwendig. Die Daten sollten idealerweise von einem unternehmensinternen Programm bereitgestellt werden, sodass die Abhängigkeit gelöst werden kann.

Eine solche Möglichkeit bietet der Einsatz eines bereichsspezifischen Web Scrapers. Dieser durchsucht das Internet automatisiert nach bestimmten Informationen. Für das Unternehmen würde das also bedeuten, dass es eine größere Abdeckung der Ligen anbieten könnte und nicht mehr von anderen Providern abhängig wäre. Das Datenteam soll sich so die Zeit für die manuellen Überprüfungen einsparen. Das erste Ziel der Diplomarbeit ist es, eine funktionale Spezifikation für die Verwendung eines Web Crawlers zu erstellen. Acht Web Crawler werden dann ausgewählt und weiter bewertet. Die Bewertung erfolgt anhand des auf der funktionalen Spezifikation aufbauenden Benchmark-Katalogs. Der Katalog ist so konzipiert, dass er auch für weitere Projekte im Unternehmen verwendet werden kann. Die Evaluierung liefert als Ergebnis eine Empfehlung, welcher Web Crawler sich am besten für den Einsatz in diesem Bereich eignet.

Created from the Publication Database of the Vienna University of Technology.