S. Bashir, A. Rauber:
"Improving Retrievability and Recall by Automatic Corpus Partitioning";
Transactions on Large-Scale Data- and Knowledge Centered Systems,
With increasing volumes of data, much effort has been devoted
to finding the most suitable answer to an information need. However,
in many domains, the question of whether any specific information
item can be found at all via a reasonable set of queries is essential. This
concept of Retrievability of information has evolved into an important
evaluation measure of IR systems in recall-oriented application domains.
While several studies have evaluated retrieval bias in systems, solid validation
of the impact of retrieval bias and the development of methods to
counter low retrievability of certain document types would be desirable.
This paper provides an in-depth study of retrievability characteristics
over queries of different length in a large benchmark corpus, validating
previous studies. It analyzes the possibility of automatically categorizing
documents into low- and high-retrievability classes based on document
properties rather than complex retrievability analysis. We furthermore
show that this classification can be used to improve overall retrievability
of documents by treating these classes as separate document corpora and
combining their individual retrieval results. Experiments are validated on 1.2
million patents of the TREC Chemical Retrieval Track.
Information Retrieval, Text Classification, Retrievability, Corpus Partitioning, Patent Retrieval
Created from the publication database of the Technische Universität Wien.