A. Taha:

"Addressing metric challenges: Bias and Selection - Efficient Computation - Hubness Explanation and Estimation";

Supervisor, Reviewer: A. Rauber, J. Benois-Pineau; Software Technology and Interactive Systems, 2015; oral examination: 2015-12-22.

In machine learning, data mining, and information retrieval, the feature space is a central construct: the underlying objects are represented as feature vectors, which provide the infrastructure for building the data models at the core of these fields. Such models are based on the relations between feature vectors, and these relations are measured by metrics. Metrics, such as distances and similarities, are defined to reflect relations between individual feature vectors (e.g. the distance between two vectors) or between groups of vectors (e.g. the similarity between two classifications or images). However, three main problems arise in this regard:

The first is that a metric measures only particular aspects of these relations, so different metrics represent different views of reality: they have different sensitivities and biases toward particular properties, aspects, and contexts. This calls for measuring bias and sensitivity, which in turn enables a formal way of selecting the most suitable metrics depending on the nature of the underlying objects and the subjective goals of the user.

Regarding metric bias and sensitivity, we first provide a comprehensive discussion and analysis of 20 metrics for evaluating 3D medical image segmentations. Based on this analysis, we give guidelines for metric selection that take into account the properties of the individual metrics, the properties of the segmentations, and the segmentation goal. In a second step, we propose a novel formal method that automatically measures the bias of metrics, based on the properties of the underlying objects and on constraints determined by the user's preferences. On top of this method, we build a formal method for selecting evaluation metrics.
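To illustrate how two metrics can score the same pair of segmentations differently, here is a minimal sketch (illustrative names, not the thesis implementation) of two common overlap-based segmentation metrics, the Dice coefficient and the Jaccard index, over binary masks:

```python
import numpy as np

def dice(a, b):
    """Dice coefficient of two binary masks: 2|A∩B| / (|A| + |B|)."""
    a, b = a.astype(bool), b.astype(bool)
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def jaccard(a, b):
    """Jaccard index of two binary masks: |A∩B| / |A∪B|."""
    a, b = a.astype(bool), b.astype(bool)
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()
```

For the same pair of masks the two scores differ (Dice is always at least as large as Jaccard), which is exactly the kind of metric-specific behaviour that motivates a formal selection method.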

The second problem is the efficiency of calculating computationally intensive metrics, such as the Hausdorff distance, especially when comparing two very large point sets. One example is calculating the Hausdorff distance between medical volumes (3D images): whole-body medical volumes can contain up to 100 million voxels. Metrics defined over distances between all pairs of points become extremely inefficient when applied to such volumes.

Concerning the calculation efficiency of computationally intensive metrics, we propose a novel algorithm that calculates the exact Hausdorff distance in linear time. The algorithm is general and makes no assumptions about the underlying objects being compared.
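The full linear-time algorithm is given in the thesis; one idea it exploits, early breaking, can be sketched on top of the naive quadratic computation as follows (the thesis algorithm additionally uses further techniques such as randomizing the scan order; the names here are illustrative):

```python
import numpy as np

def directed_hausdorff(A, B):
    """Directed Hausdorff distance max_{a in A} min_{b in B} ||a - b||.
    Early break: once the running minimum for a point a drops below the
    current overall maximum, a can no longer raise the result, so the
    scan over B stops early."""
    cmax = 0.0
    for a in A:
        cmin = np.inf
        for b in B:
            d = np.linalg.norm(a - b)
            if d < cmin:
                cmin = d
            if cmin < cmax:   # a cannot change the maximum: stop scanning B
                break
        if cmin > cmax:
            cmax = cmin
    return cmax

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))
```

With favourable point orderings the inner loop terminates after a few iterations, which is what randomization makes likely on average.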

The third problem relates to the curse of dimensionality, i.e. the difficulties that arise in high-dimensional feature spaces, including sparsity, distance concentration, and hubness. These difficulties can also be seen as a special case of metric bias (asymptotic bias) that emerges when the dimensionality is sufficiently high. Some state-of-the-art methods address them through feature selection, which reduces the dimensionality of the feature space by restricting the model to a subset of the features.
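Distance concentration can be observed directly: as the dimensionality of i.i.d. random data grows, the spread of distances from a query point shrinks relative to their magnitude. A small illustrative experiment (not from the thesis; names are hypothetical):

```python
import numpy as np

def relative_contrast(n_points, dim, rng):
    """(d_max - d_min) / d_min for distances from a random query point
    to n_points uniform random points; shrinks as dim grows."""
    X = rng.random((n_points, dim))
    q = rng.random(dim)
    d = np.linalg.norm(X - q, axis=1)
    return (d.max() - d.min()) / d.min()

rng = np.random.default_rng(0)
for dim in (2, 20, 200, 2000):
    print(f"dim={dim:5d}  relative contrast={relative_contrast(1000, dim, rng):.3f}")
```

As the dimension increases, the printed contrast drops toward zero: nearest and farthest neighbours become almost equidistant, which undermines distance-based models.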

Regarding high-dimensional spaces, we propose a novel explanation of the cause of hubness (a common aspect of the curse of dimensionality) based on sparsity and distance concentration. From this explanation, we derive a novel estimator of hubness that runs in linear time using statistics of the distance distribution, and we provide a hubness-reduction method based on the same explanation.
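The thesis estimator itself works from distance-distribution statistics in linear time; for context, the standard direct measure of hubness, which the estimator approximates, is the skewness of the k-occurrence distribution N_k and requires all pairwise distances (quadratic time). A sketch with illustrative names:

```python
import numpy as np

def k_occurrence_skewness(X, k=10):
    """Skewness of N_k, where N_k(x) counts how often x appears among the
    k nearest neighbours of the other points. Strong positive skewness
    signals hubness. Quadratic time: computes all pairwise distances."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)            # exclude self-neighbourship
    nk = np.zeros(n)
    for i in range(n):
        for j in np.argsort(D[i])[:k]:     # k nearest neighbours of point i
            nk[j] += 1
    mu, sigma = nk.mean(), nk.std()        # mu is always exactly k
    return ((nk - mu) ** 3).mean() / sigma ** 3
```

A linear-time estimator avoids the pairwise distance matrix entirely, which is what makes hubness measurement feasible on large data sets.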

http://publik.tuwien.ac.at/files/PubDat_247742.pdf

Created from the Publication Database of the Vienna University of Technology.