
To summarize: based on our findings, we state that the choice of parameters naturally affects the efficiency of the online phase. An inappropriate selection of parameters may lead to increased runtimes. For instance, an unsuitable parameter choice may increase the number of generated microclusters, which in turn leads to a higher number of necessary distance computations when assigning incoming data objects to existing microclusters. Note that the increased computational cost does not necessarily lead to a significant improvement of the overall clustering quality. On the other hand, an unfavorable combination of the buff_size parameter with the other parameters might prevent microclusters from being initialized at all.

However, this strongly depends on the data distribution that is given by the underlying data generating process.

Detecting Linear Correlated Clusters on Streams using Parameter Space Transform

The work presented in this chapter is going to appear in the Proceedings of the 24th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2020.

5.1 Introduction

Data clustering is an established and widely used technique for explorative data analysis and, in general, a decent approach for tackling many unsupervised problems. However, when facing high-dimensional data, clustering algorithms in particular quickly reach the boundaries of their usefulness, as most of these methods are not designed to deal with the curse of dimensionality.

Due to the inherent sparsity of high-dimensional data, distances between objects tend to become meaningless, since the distances between any two objects measured in the full-dimensional space tend to become the same for all pairs of objects. This is a serious problem for most clustering algorithms since they largely rely on distance computations to distinguish similar from dissimilar objects. Furthermore, clusters often appear within lower dimensional subspaces, with the subspaces comprising various dimensions and dimensionalities.

Therefore, it may not be useful to search for clusters in the full-dimensional data space: even if clusters are detected there, they hardly provide any insights for the user due to a lack of interpretability. To overcome these issues, several subspace clustering algorithms have been developed in the past. All subspace clustering algorithms generally have the objective to (1) identify meaningful subspaces and (2) detect clusters within these subspaces. They can be categorized into two groups: projected clustering and oriented subspace clustering algorithms (cf. Section 3.1). Projected clustering algorithms restrict themselves to the detection of axis-parallel subspace clusters. Oriented subspace clustering algorithms allow the combination of features (i.e., the original dimensions of the data space) to identify new (and interesting) dimensions which may form a lower dimensional subspace in which clusters can be identified easily by applying conventional clustering approaches¹. While clusters found in projected subspaces are generally easier to interpret for the end user, oriented subspace clustering provides a better clustering in many applications, since the assumption that features are independent from each other (as inherently made by projected clustering approaches) usually does not hold.

In this chapter, we present a novel oriented subspace clustering algorithm that is able to detect arbitrarily oriented subspace clusters in data streams.

As discussed in the previous chapter, data streams imply the challenge that the data cannot be stored entirely, and hence there is a general demand for suitable data handling strategies for clustering algorithms such that the data can be processed within a single scan. In contrast to the CorrStream algorithm from Chapter 4, the method presented here relies on the Hough transform and finds global, arbitrarily oriented subspace clusters rather than local correlation clusters derived from neighborhood sets. Renouncing the usage of neighborhood sets has the big advantage of being less dependent on outliers, resp. noise data, which might appear in vast numbers within neighborhood sets, especially when considering high-dimensional data. In general, the method presented here can be understood as a streaming variant of the CASH algorithm [3], which has been designed for robust correlation clustering on static data. However, when looking for relevant subspaces, CASH performs a top-down, grid-based data space division strategy with the idea to prune sparse grid cells. This is inappropriate when considering a data stream where the data distribution may change over time, since pruned sparse grid cells may become dense later on; on the other hand, dense grid cells may become sparse and thus irrelevant over time, too. Therefore, we propose a batched variant that is able to deal with these challenges. The key idea is to load chunks of data into memory, to derive so-called concepts as summary structures, and to apply a decay mechanism to downgrade the relevance of stale data. Our experimental evaluation demonstrates the usefulness of the presented method and shows that the used heap space is drastically reduced without losses in terms of runtime and accuracy.
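To give a first, purely illustrative impression of such a decay mechanism (the concrete concept structure and decay function are defined in Section 5.3; the exponential form and the parameter name lam below are assumptions on my part), a stale batch summary could be downgraded as follows:

```python
import math

def decayed_weight(weight, t_now, t_last, lam=0.01):
    """Downgrade the relevance of a stale batch summary ("concept").

    Assumes an exponential decay with rate lam; the decay mechanism actually used
    by the proposed algorithm is introduced later in this chapter and may differ.
    """
    return weight * math.exp(-lam * (t_now - t_last))

# A concept last updated 100 time steps ago retains only ~0.7% of its weight at lam = 0.05.
print(decayed_weight(1.0, t_now=200, t_last=100, lam=0.05))
```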

It is noteworthy that the complexity discussion even states that the runtime can be reduced for sufficiently large datasets.

¹These are typically partitioning or density-based clustering techniques.

The remainder of this chapter is organized as follows. In Section 5.2, we recapitulate the basic principles behind correlation clustering using the Hough transform, since our proposed method heavily relies on the techniques used there for identifying correlation clusters. Then, in Section 5.3, we propose our algorithm that is able to follow this paradigm in a streaming environment. The experimental evaluation is presented in Section 5.4, and Section 5.5 finally discusses our findings and concludes the chapter.

5.2 Correlation Clustering Using Parameter Space Transformation

Hough Transform

The Hough transform was originally introduced as a useful tool for image processing in [218]. The basic idea is to translate every pixel from image space into a straight line in parameter space; every intersection of two such lines in parameter space then indicates that the two corresponding pixels lie on a common straight line in image space. This way, one can determine straight lines and, more broadly, linear segments in the data space simply by detecting areas where many lines intersect in parameter space, which is an important task in image processing.

More formally, for a data point p = (x, y) in a two-dimensional data space D, one can define a (again two-dimensional) parameter space P with axes m and t, such that p is represented by the straight line t = −xm + y, with −x being the slope and y denoting the axis intercept. This way, if there is a point S = (s_x, s_y) ∈ P where several parameter space lines l_i intersect, there is a common line y = s_x·x + s_y in D that goes through all of the inverse image points of the lines l_i. The idea is visualized in Figure 5.1: given the three data points A, B and C in data space (left), we can transform them into linear functions in parameter space (right). Since A, B and C are perfectly correlated, the lines in parameter space intersect at one point. Reconstructing this point in the data space yields a linear function on which all data points A, B and C lie (grey line).
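The following minimal Python sketch (purely illustrative; the point coordinates and function names are my own) reproduces this situation: three collinear points are mapped to parameter-space lines t = −x·m + y, and all three lines pass through one common point (m, t) that encodes the shared data-space line.

```python
import numpy as np

def to_param_line(point):
    """Map a 2D data point (x, y) to the parameter-space line t = -x*m + y."""
    x, y = point
    return lambda m: -x * m + y

# Three collinear points on y = 0.5*x + 1 (the grey line in Figure 5.1).
A, B, C = (0.0, 1.0), (2.0, 2.0), (4.0, 3.0)
lines = [to_param_line(p) for p in (A, B, C)]

# Intersection of the lines for A and B: -x_A*m + y_A = -x_B*m + y_B.
m_s = (B[1] - A[1]) / (B[0] - A[0])   # slope of the common data-space line
t_s = lines[0](m_s)                   # its axis intercept

# All three parameter-space lines pass through (m_s, t_s), so A, B and C are collinear.
assert all(np.isclose(f(m_s), t_s) for f in lines)
print(f"common data-space line: y = {m_s}*x + {t_s}")
```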

Figure 5.1: Left: data space, right: parameter space.

However, using straight lines in parameter space has a significant drawback: the slopes of the lines are unbounded, i.e., they may become infinite, and therefore the possible intersection points of lines in the parameter space cannot be controlled. A solution to this has been introduced in [83], where polar coordinates are used in the parameter space. In fact, using angles and radii as parameters, and trigonometric functions instead of straight lines, effectively avoids the problem stated above. For a data point p = (x, y) in a two-dimensional data space, the point is mapped into parameter space by the trigonometric function δ = x cos(α) + y sin(α), with axes α and δ. Note that the free parameter α is bounded within the interval [0, π). These functions will be called sinusoids in the following. Analogously to the originally proposed parameter space, it holds that if such sinusoids intersect in a point S = (α_s, δ_s) in parameter space P, the inverse image points lie on a straight line l_S in the image space. Again, the corresponding line in image space can easily be derived from S: δ_s is the distance of the line to the origin, and α_s corresponds to the angle between the x-axis of D and the perpendicular of l_S through the origin. Similar to above, Figure 5.2 depicts the procedure when using the parameter space spanned by the parameters of the polar coordinate representation; in the right image, the functions are this time sinusoids rather than linear functions.
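Analogously, here is a small sketch for the polar form δ = x·cos(α) + y·sin(α) (again illustrative, with a simple numerical search for the intersection instead of an analytic solution):

```python
import numpy as np

def sinusoid(point):
    """Map a 2D data point (x, y) to its parameter-space sinusoid delta(alpha)."""
    x, y = point
    return lambda alpha: x * np.cos(alpha) + y * np.sin(alpha)

# The same collinear points as before, lying on y = 0.5*x + 1.
points = [(0.0, 1.0), (2.0, 2.0), (4.0, 3.0)]
curves = [sinusoid(p) for p in points]

# For a line with normal angle alpha_s and distance delta_s to the origin,
# all sinusoids of points on that line pass through (alpha_s, delta_s).
alphas = np.linspace(0.0, np.pi, 10_000, endpoint=False)
spread = np.ptp([f(alphas) for f in curves], axis=0)   # max - min over the curves
alpha_s = alphas[np.argmin(spread)]
delta_s = curves[0](alpha_s)
print(f"approximate intersection: alpha = {alpha_s:.3f}, delta = {delta_s:.3f}")
```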

Using Hough for CASH

The sinusoidal parameterization function can be extended to the case of higher dimensional data. Given a d-dimensional data space D ⊆ R^d and a point p = (p_1, ..., p_d)^T ∈ D, the parameterization function f_p : [0, π)^{d−1} → R is defined as

$$ f_p(\alpha_1, \ldots, \alpha_{d-1}) = \sum_{i=1}^{d} p_i \cdot \left( \prod_{j=1}^{i-1} \sin(\alpha_j) \right) \cdot \cos(\alpha_i), \qquad \text{with } \alpha_d = 0. $$

This corresponds directly to the generalized polar coordinates of a vector [3].

Figure 5.2: Left: data space, right: parameter space.

The basis of the resulting parameter space is defined by the d−1 angles α_1, ..., α_{d−1} and δ = f_p(α_1, ..., α_{d−1}). Precisely, this means that any point p from the data space can be mapped to a function in the d-dimensional parameter space, with each point finally being represented by the angles of the normal vectors defining the hyperplanes in Hessian normal form, i.e., α_1, ..., α_{d−1}, and their distances to the origin, i.e., δ. In particular, this also means that a point S = (α_1, ..., α_{d−1}, δ) in parameter space stands for a (d−1)-dimensional hyperplane in the data space, while a function in parameter space corresponds to all possible (d−1)-dimensional hyperplanes that contain the data point p. The following important conclusion can be made: if the parametrization functions of two data points intersect in parameter space, the intersection point represents a hyperplane in data space containing both points. The same holds for any number of points.
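For illustration, a direct (naive, non-vectorized) implementation of f_p could look as follows; the function name and the 2D sanity check are my own additions:

```python
import numpy as np

def f_p(p, alphas):
    """Evaluate the d-dimensional parameterization function for point p at
    angles alphas = (alpha_1, ..., alpha_{d-1}); alpha_d is fixed to 0."""
    p = np.asarray(p, dtype=float)
    d = p.shape[0]
    assert len(alphas) == d - 1
    alphas = np.append(np.asarray(alphas, dtype=float), 0.0)  # alpha_d = 0
    delta = 0.0
    for i in range(d):
        delta += p[i] * np.prod(np.sin(alphas[:i])) * np.cos(alphas[i])
    return delta

# Sanity check in 2D: f_p reduces to x*cos(alpha_1) + y*sin(alpha_1).
x, y, a = 2.0, 3.0, 0.7
assert np.isclose(f_p([x, y], [a]), x * np.cos(a) + y * np.sin(a))
```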

Taking this into account, intuition already suggests the idea of CASH, i.e., to search for dense areas in the parameter space in order to find data points with common hyperplanes. A dense area is an area where many parametrization functions intersect each other or, to relax this a little, a small partition of the parameter space that is intersected by many parameterization functions². Recalling that CASH employs a top-down, grid-based space partitioning strategy, the density criterion also introduces the first input parameter of the algorithm: minPoints or m. It specifies how many intersections are required for a partition of the parameter space to be considered dense. The other user-specified input parameter is the size of the partitions of the parameter space, i.e., maxSplits or s. Since it is nearly impossible to calculate all possible intersection points of all parametrization functions, it is necessary to discretize the parameter space by a grid. The resulting grid cells, also called cuboids, act as the small partitions of the parameter space. Obviously, it is necessary to know the range of all axes for such a grid-based strategy.

²Note that intersection points correspond to perfectly correlated subspace clusters. However, by relaxing this restriction to small areas in the parameter space, we allow the algorithm to identify correlation clusters that are not perfectly correlated, too.
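To fix the terminology, here is a hypothetical sketch of such a cuboid as a data structure, together with the density criterion based on m; the field names are my own and need not match the original implementation:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Cuboid:
    """A grid cell of parameter space: (d-1) angle intervals plus a delta interval.

    Illustrative structure only; the original CASH implementation may organize
    this differently."""
    alpha_intervals: List[Tuple[float, float]]   # one (lo, hi) per angle axis
    delta_interval: Tuple[float, float]          # (lo, hi) along the delta axis
    splits: int = 0                              # how often this cuboid has been divided
    members: list = field(default_factory=list)  # ids of intersecting parameterization functions

    def is_dense(self, min_points: int) -> bool:
        # Density criterion: more than m parameterization functions intersect the cell.
        return len(self.members) > min_points
```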

For the α angles, the bounds are 0 and π. For the δ-axis, the minimum and maximum are equal to the minimum and the maximum over all extrema of all parametrization functions. Since every parametrization function f_p is a sinusoid with period 2π, there is a global extremum α̃ = (α̃_1, ..., α̃_{d−1}) in [0, π)^{d−1}, which can be calculated by using the Hessian matrix of f_p. Depending on whether this is a maximum or a minimum, the opposite extreme value of the sinusoid on the domain can be calculated (see [3] for details). Finally, given the global minimum and maximum for every f_p, the boundaries [d_min, d_max) of the δ-axis are defined by

$$ d_{\max} = \max_{p \in D} \left( \max_{\tilde{\alpha} \in [0, \pi)^{d-1}} f_p(\tilde{\alpha}) \right) \quad \text{and} \quad d_{\min} = \min_{p \in D} \left( \min_{\tilde{\alpha} \in [0, \pi)^{d-1}} f_p(\tilde{\alpha}) \right), $$

and we can consider the domain of the parameter space as [0, π)^{d−1} × [d_min, d_max]. If this domain is discretized into a grid of cuboids, the next step is to determine whether a sinusoid f_p intersects a cuboid C. This can be done by calculating the minimum f_p^min(C) and maximum f_p^max(C) of the sinusoid within the angle boundaries of the cuboid, and checking whether the resulting interval overlaps the δ-interval [d^C_min, d^C_max] of the cuboid, i.e., whether f_p^min(C) ≤ d^C_max and f_p^max(C) ≥ d^C_min. If both conditions are met, the corresponding parametrization function intersects the cuboid C. However, this procedure may become computationally expensive, especially for higher dimensional data, where grid-based approaches generally tend to be inefficient in terms of computational cost. In fact, exhaustively searching for dense grid cells in a d-dimensional data space would require (2n)^d grid cells to be examined, with n being the number of splits per dimension.
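The following sketch illustrates this intersection test; note that, for brevity, it approximates f_p^min(C) and f_p^max(C) by sampling the cuboid's angle intervals instead of using the analytic extrema from [3], so it is only a rough numerical stand-in:

```python
import itertools
import numpy as np

def f_p(p, alphas):
    """d-dimensional parameterization function (cf. above), with alpha_d = 0."""
    alphas = np.append(np.asarray(alphas, dtype=float), 0.0)
    return sum(p[i] * np.prod(np.sin(alphas[:i])) * np.cos(alphas[i])
               for i in range(len(p)))

def intersects(p, alpha_intervals, delta_interval, samples=16):
    """Numerically check whether the sinusoid of p intersects the cuboid spanned
    by alpha_intervals and delta_interval (sampling-based approximation)."""
    grids = [np.linspace(lo, hi, samples) for lo, hi in alpha_intervals]
    values = [f_p(p, alphas) for alphas in itertools.product(*grids)]
    f_min, f_max = min(values), max(values)   # approximations of f_p^min(C), f_p^max(C)
    d_lo, d_hi = delta_interval
    return f_min <= d_hi and f_max >= d_lo    # interval overlap test

# Example: a 3D point tested against one cuboid of the 3D parameter space.
p = np.array([1.0, 2.0, 3.0])
print(intersects(p, alpha_intervals=[(0.0, np.pi / 2), (0.0, np.pi / 2)],
                 delta_interval=(0.0, 2.0)))
```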

Therefore, CASH uses a division strategy that prunes grid cells to decrease the search space and hence reduce the complexity.

Division strategy

Starting with the full-dimensional cuboid C = [0, π)^{d−1} × [d_min, d_max] ⊆ R^d and the predefined division order δ, α_1, ..., α_{d−1}, the algorithm first splits the cuboid into two halves along the δ-dimension. For the resulting two cuboids, the number of intersecting sinusoids is calculated. If the intersection count of one of the cuboids is higher than the input parameter m, this cuboid is divided recursively according to the division order. If the count of intersections is above m for both cuboids, the second cuboid is pushed into a queue. If the number of intersections is less than or equal to m, the cuboid can be discarded.

Once a cuboid reaches the maximum split threshold s, the data objects of the corresponding sinusoids are considered to form a subspace cluster, and these sinusoids are no longer considered for the recursive search within the other cuboids. If the division process of a cuboid terminates (either because the maximum split threshold is reached or because the cuboid is discarded due to sparsity), the next cuboid is taken from the queue and split recursively.
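A condensed sketch of this division loop in Python follows (a simplification of my own: it uses a single work queue instead of the recursive descent with a secondary queue described above, and the helper callables split and intersecting are assumed to be provided):

```python
from collections import deque

def divide_and_prune(root, all_points, m, s, split, intersecting):
    """Grid division with pruning, roughly following the strategy described above.

    split(cuboid)           -> the two halves along the next axis of the division
                               order delta, alpha_1, ..., alpha_{d-1}
    intersecting(cuboid, P) -> subset of point ids in P whose sinusoids intersect cuboid
    """
    clusters = []
    remaining = set(all_points)          # sinusoids not yet assigned to a cluster
    queue = deque([root])
    while queue:
        cuboid = queue.popleft()
        members = set(intersecting(cuboid, remaining))
        if len(members) <= m:
            continue                     # sparse cuboid: discard
        if cuboid.splits >= s:
            clusters.append(members)     # dense after s splits: report a subspace cluster
            remaining -= members         # exclude these sinusoids from the further search
            continue
        for half in split(cuboid):       # halve along the predefined division order
            queue.append(half)
    return clusters
```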

Finding lower dimensional clusters and hierarchies of clusters

Having found a cluster, i.e., a dense cuboid C ⊆ R^d after s divisions, means that the corresponding points form a cluster within a (d−1)-dimensional subspace. However, the cluster itself might be lower-dimensional, or another lower-dimensional cluster might be embedded within the found subspace cluster. Therefore, the sinusoids that form the (d−1)-dimensional cluster are transformed back into the data space and projected onto the orthonormal basis that can be derived from cuboid C. More precisely, given the boundary intervals of C, the normal vector of the corresponding hyperplane in polar coordinates is defined by the means of the cuboid's boundary intervals. This normal vector is transformed back into the Cartesian data space, and finally a (d−1)-dimensional orthonormal basis, which defines the subspace hyperplane³, is derived from it. To detect subspace clusters of even lower dimensions, the CASH algorithm is applied recursively to the resulting (d−1)-dimensional data set until no more clusters can be found.
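To make the projection step concrete, here is a hedged sketch (my own formulation, not the original implementation) of how the Cartesian normal vector can be obtained from a cuboid's mean angles and how cluster points can then be projected onto the resulting (d−1)-dimensional basis:

```python
import numpy as np

def normal_from_angles(alpha_means):
    """Convert the mean angles (alpha_1, ..., alpha_{d-1}) of a cuboid into the
    Cartesian unit normal vector of the corresponding hyperplane."""
    alphas = np.append(np.asarray(alpha_means, dtype=float), 0.0)  # alpha_d = 0
    d = len(alphas)
    return np.array([np.prod(np.sin(alphas[:i])) * np.cos(alphas[i]) for i in range(d)])

def project_onto_hyperplane(points, normal):
    """Project points onto a (d-1)-dimensional orthonormal basis orthogonal to normal."""
    d = len(normal)
    # Orthonormal basis of the hyperplane: the null space of the normal vector.
    _, _, vt = np.linalg.svd(normal.reshape(1, d))
    basis = vt[1:]                     # (d-1) x d matrix of basis vectors
    return points @ basis.T            # coordinates within the lower dimensional subspace

# Example: project 3D cluster points onto the plane defined by some mean angles.
n = normal_from_angles([0.3, 1.1])
X = np.random.default_rng(0).normal(size=(5, 3))
print(project_onto_hyperplane(X, n).shape)    # -> (5, 2)
```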

It is worth noting that this procedure creates an implicit dimensional cluster hierarchy.
