
2.5 Geometry of High-Dimensional Small Sample Size Scenarios

2.5.2 Distance Concentration

Another well-known effect is that, as the dimensionality is increased towards infinity, a finite set of points takes on a specific deterministic topology. In the limit, the points are located on the vertices of a regular simplex [Hall et al., 2005], i.e. all samples have nearly the same distance to the origin as well as among each other, and they are pairwise orthogonal. This is referred to as distance concentration. Additionally, zero-mean samples taken from a Gaussian distribution are typically not located near the origin. These properties were shown for multivariate standard normal distributions with zero mean and identity covariance matrix, but they hold under much weaker assumptions, as shown in [Ahn et al., 2007]. There, the authors derive a condition under which a dataset of fixed size behaves, for d → ∞, as if it was drawn from a distribution with identity covariance matrix. This condition is based on the sphericity measure

$$\varepsilon = \frac{\left(\sum_{i=1}^{d} \lambda_i\right)^2}{d \sum_{i=1}^{d} \lambda_i^2}$$

where λ_i denotes the i-th eigenvalue of the covariance matrix. If the eigenvalues are sufficiently diffused, i.e. if

$$\lim_{d \to \infty} d \cdot \varepsilon = \lim_{d \to \infty} \frac{\left(\sum_{i=1}^{d} \lambda_i\right)^2}{\sum_{i=1}^{d} \lambda_i^2} = \infty,$$

then the dataset will show the same unintuitive behaviour as datasets with the identity covariance matrix (see the figure below for an example using random normally distributed data with identity covariance matrix). Thus, any method that relies on measuring distances between data points may become meaningless. Nearest neighbour based methods have been analysed with respect to such distance concentration with application to high-dimensional databases [Aggarwal et al., 2001a, Beyer et al., 1999]. In such applications, we seek, for a given query point, the data point with minimum distance. However, as the dimensionality increases, the distance to the nearest and to the farthest data point become more and more equal [Beyer et al., 1999] due to distance concentration, even in cases where the dimensions are correlated or the variance of the newly added dimensions converges to zero. Thus, nearest neighbour methods may become meaningless or unstable from as few as 10 to 20 dimensions upwards.
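As a concrete illustration, the sphericity measure and the condition d · ε → ∞ can be checked numerically. The sketch below is only an illustration under our own assumptions (NumPy, toy covariance matrices, and a helper named sphericity that is not taken from the cited works); it contrasts an identity covariance, which satisfies the condition, with a covariance dominated by a single eigenvalue, which does not:

    import numpy as np

    def sphericity(cov):
        # eps = (sum_i lambda_i)^2 / (d * sum_i lambda_i^2), computed from the eigenvalues
        lam = np.linalg.eigvalsh(cov)
        return lam.sum() ** 2 / (len(lam) * np.sum(lam ** 2))

    for d in (10, 100, 1000):
        identity = np.eye(d)                                # eps = 1, so d * eps = d grows without bound
        spiked = np.diag([float(d)] + [1.0] * (d - 1))      # one dominant eigenvalue: d * eps stays bounded
        print(d, d * sphericity(identity), d * sphericity(spiked))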

Figure: Distance concentration in high-dimensional spaces. The effects of distance concentration can be reproduced in a very simple way: a small, fixed number of normally distributed data points is sampled in spaces of increasing dimensionality, and the mean (solid) and the extreme values (dashed) of various properties are plotted, averaged over repeated runs. The distances to the origin (top left) as well as the pairwise distances (top right) concentrate, all pairwise angles (bottom left) converge to 90°, and the eigenvalues of the covariance matrix (bottom right) converge to a common value. Thus, distances, angles, and eigenvalues all become the same, although the data was sampled randomly.
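The distance and angle panels of this experiment can be sketched in a few lines. The following is only an illustrative single run under assumed settings (NumPy, a sample size of 20, and an arbitrary set of dimensionalities, none of which are necessarily those used for the figure):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 20                                          # a small, fixed number of samples

    for d in (2, 10, 100, 1000, 10000):
        X = rng.standard_normal((n, d))             # n points from N(0, I_d)

        norms = np.linalg.norm(X, axis=1)           # distances to the origin
        diffs = X[:, None, :] - X[None, :, :]
        pair = np.linalg.norm(diffs, axis=-1)[np.triu_indices(n, k=1)]   # pairwise distances

        cosines = (X @ X.T) / np.outer(norms, norms)
        angles = np.degrees(np.arccos(np.clip(cosines[np.triu_indices(n, k=1)], -1.0, 1.0)))

        # scaled by sqrt(d), the norms concentrate around 1 and the pairwise distances
        # around sqrt(2), while all pairwise angles approach 90 degrees
        print(d,
              (norms / np.sqrt(d)).min(), (norms / np.sqrt(d)).max(),
              (pair / np.sqrt(d)).min(), (pair / np.sqrt(d)).max(),
              angles.min(), angles.max())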

Most nearest neighbour methods apply the Euclidean norm as the distance measure; however, other metrics are possible and influence how meaningful distances remain in high-dimensional spaces [Aggarwal et al., 2001a]. The L_p norm is more susceptible to distance concentration for large values of p. Thus, the best choice with respect to meaningfulness in high-dimensional spaces would be p = 1, often referred to as the Manhattan metric. Even values between 0 and 1 could be used; however, such fractional distance measures are no longer metrics in the strict mathematical sense, as the triangle inequality is not fulfilled. Nevertheless, theoretical and empirical results show that using fractional distance measures improves the performance of nearest neighbour methods significantly, at least on uniformly distributed data [Aggarwal et al., 2001a]. Distance concentration in fractional distance measures may be quantified in terms of relative concentration. Let x be a random vector with each feature drawn from some distribution F. Then,

$$\mathrm{RV}_{F,p} = \frac{\sqrt{\operatorname{var}\left(\lVert x \rVert_p\right)}}{\operatorname{E}\left(\lVert x \rVert_p\right)}$$

is a measure of the relative concentration of the norm. Low values indicate a high degree of concentration, while high values correspond to a wider distribution of distances. Thus, all distributions and L_p metrics are prone to distance concentration [François et al., 2007], as

$$\lim_{d \to \infty} \frac{\sqrt{\operatorname{var}\left(\lVert x \rVert_p\right)}}{\operatorname{E}\left(\lVert x \rVert_p\right)} = 0.$$

However, the impact depends on the distribution F, and the choice of p needs to be validated for each dataset individually. In summary, nearest neighbour methods are prone to the phenomenon of distance concentration, but there is some evidence that using the L_1 norm for measuring distances alleviates this phenomenon to some extent.
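The relative concentration can also be estimated empirically. The sketch below is only an illustration under assumed settings (NumPy, F chosen as the uniform distribution on [0, 1], and a Monte Carlo helper named relative_concentration of our own); it compares a fractional measure, the Manhattan metric, and the Euclidean norm across dimensionalities:

    import numpy as np

    rng = np.random.default_rng(0)

    def relative_concentration(d, p, n_samples=10000):
        # Monte Carlo estimate of RV_{F,p} = sqrt(var(||x||_p)) / E(||x||_p)
        X = rng.uniform(size=(n_samples, d))                  # features drawn i.i.d. from F = U(0, 1)
        norms = np.sum(np.abs(X) ** p, axis=1) ** (1.0 / p)   # L_p norm (a fractional measure for p < 1)
        return np.sqrt(norms.var()) / norms.mean()

    for d in (10, 100, 1000):
        # RV shrinks with d for every p, but for a fixed d it remains
        # larger (i.e. less concentration) for smaller values of p
        print(d, [round(relative_concentration(d, p), 4) for p in (0.5, 1.0, 2.0)])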

2.5.3 Hubness

Distance concentration is closely related to hubness, another high-dimensional artefact that may affect machine learning methods. Hubness refers to the effect that, in high-dimensional spaces, some data points appear among the nearest neighbours of other points far more frequently than the remaining points do.

Given a dataset D, N_k(x) refers to the number of times x is among the k nearest neighbours of all other points in D. In low-dimensional scenarios, N_k converges to a Poisson distribution with mean k, while in the high-dimensional case the distribution of N_k becomes skewed with a long tail to the right [Radovanović et al., 2010]. Thus some data points, the hubs, occur much more frequently in the lists of the k nearest neighbours than others. Hubs have a strong tendency to be close to the mean of the data distribution; in multimodal distributions they appear close to the means of the individual unimodal components. Hubness may occur even after dimensionality reduction if a distance-preserving method is used and the number of features exceeds the intrinsic dimensionality. Bad hubs, i.e. hubs with a high probability of not having the same class label as the query point, describe the boundary of the classes and thus have a significant impact on classification performance. However, their contribution depends on the induction algorithm. A k-nearest-neighbour classifier can be improved significantly if the contribution of these bad hubs is downweighted, as the classifier aims to describe the interior of a class rather than its boundary. In contrast, a support vector machine models the separation surface between the classes, and thus removing bad hubs causes a significant performance drop.
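The skewness of N_k can be observed directly on synthetic data. The sketch below is only an illustration under assumed settings (NumPy, an arbitrary sample size, k = 10, and Gaussian data); it counts the k-occurrences N_k and reports their skewness, which grows with the dimensionality, together with the size of the largest hub:

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 500, 10

    for d in (3, 30, 300):
        X = rng.standard_normal((n, d))
        # squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
        sq_norms = (X ** 2).sum(axis=1)
        sq = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
        np.fill_diagonal(sq, np.inf)                  # a point is not its own neighbour
        knn = np.argsort(sq, axis=1)[:, :k]           # indices of the k nearest neighbours of each point

        N_k = np.bincount(knn.ravel(), minlength=n)   # k-occurrence counts; their mean is always k
        skew = ((N_k - N_k.mean()) ** 3).mean() / N_k.std() ** 3
        print(d, round(skew, 2), N_k.max())           # skewness and the largest hub count grow with d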