
Stream Clustering


Mining data streams has attracted considerable attention in recent years, as more and more data is produced continuously by a plethora of different devices and mechanisms, e.g., mobile phones, sensors, or user behavior monitoring mechanisms. As this thesis partially focuses on oriented subspace clustering algorithms for high-throughput data, we briefly review established methods for dealing with potentially unbounded sequences of data, i.e., data streams, in general. Then, related work in the field of data stream clustering, especially subspace clustering on data streams, is reviewed in detail.

In the broad majority of the presented techniques, data stream clustering consists of an online and a separate offline step. While the online step, also called the data abstraction step, tackles the main challenge that stream algorithms typically face, i.e., properly handling the data in a single scan, the offline phase usually aims at partitioning the data. More precisely, the online step processes the data stream by aggregating the data seen so far into summary structures, such that the key statistics needed for the offline step are retained.

One important aspect here is that more recent data is usually more important for the knowledge discovery process since, in most applications, e.g., machine monitoring or market/trend analysis on data provided by social media platforms, stale data quickly loses relevance for the task at hand. This particularly holds in applications where the underlying data generating processes exhibit non-stationary distributions and hence show gradually appearing concept drifts or suddenly appearing concept shifts. Note that the mechanism of “forgetting” stale data is called aging. To make the algorithmic procedure concentrate on up-to-date data, several distinct window approaches have been proposed. Given a data stream, the landmark window model picks an object of the stream as landmark element, either depending on time or on the number of objects seen since the previous landmark element. The landmark elements are essentially used to partition the data stream into disjoint data chunks.

Typically, the landmark approach proceeds as follows: once a new landmark is reached, all objects that are kept in the current window are aggregated within some summary structure and can subsequently be discarded. Then, new incoming objects are stored until the next landmark appears. When this happens, the relevant statistics of the current chunk are again aggregated and the raw data objects are discarded, and so on; a minimal sketch of this scheme is given below.
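To make the chunk-wise aggregation concrete, consider the following minimal Python sketch; the function name and the choice of count and linear sum as summary statistics are ours, standing in for a richer summary structure:

    # Minimal sketch of the landmark window model: every chunk_size
    # objects form one chunk, which is reduced to a small summary
    # (count and linear sum here) before the raw objects are discarded.
    def landmark_stream(stream, chunk_size):
        buffer, summaries = [], []
        for obj in stream:
            buffer.append(obj)
            if len(buffer) == chunk_size:              # landmark reached
                summaries.append((len(buffer), sum(buffer)))
                buffer.clear()                         # discard raw objects
        return summaries

    # Example: a stream of scalars with a landmark every 4 objects.
    print(landmark_stream(iter(range(12)), chunk_size=4))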

The next commonly used approach is the sliding window model. The idea is to load a certain number of data objects into a data queue that follows the first-in-first-out principle. Once the queue is filled, the summary structure keeping the relevant statistics of the data chunk is computed and kept in memory. As new objects arrive and are added to the queue, the first-inserted (and hence oldest) objects are removed, and the summary structure is updated accordingly. Another well-known technique is the damped window model, where each data object is associated with a weighting factor that degrades over time. This yields the desired effect that the older an object becomes, the lower its weight, which corresponds to the idea that more recent objects are considered more important. A widely used weighting function is the exponential decay

f(t) = 2^(−λt),

with t denoting the current time stamp and λ > 0 being the decay rate. It holds that the lower the decay rate λ, the more importance is given to stale data, and vice versa.
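The following small sketch illustrates this behavior; the variable names are ours, and the weight of an object arriving at time t0 is assumed to be f evaluated at the object's age:

    # Exponential decay f(t) = 2^(-lam * t), applied to an object's age:
    # the older an object, the smaller its weight.
    def decayed_weight(t0, now, lam):
        return 2.0 ** (-lam * (now - t0))

    now = 10.0
    for t0 in (0.0, 5.0, 9.0):
        # a small decay rate keeps stale objects relevant for longer
        print(t0, decayed_weight(t0, now, lam=0.1),
                  decayed_weight(t0, now, lam=2.0))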

For summarizing relevant statistics of data provided by streams, a common technique used in many of the related works is the so-called (clustering) feature vector (CF) data structure, originally presented in connection with the BIRCH algorithm [278]. The key characteristics of this data structure (and of similar abstractions that follow the same objective) are its incrementality and additivity. Incrementality is the property that a feature vector can be updated by inserting new data objects; additivity means that two disjoint feature vectors can be merged into a new feature vector by summing up their single components. In the case of BIRCH, a CF vector is composed of the number of data objects, the linear sum of the data objects, and the squared sum of the data objects. Importantly, these components fulfill the required incrementality and additivity properties and additionally allow the computation of the centroid, the radius, and the diameter of the corresponding cluster. Those measures in turn can be used in the offline step to define the final clustering model. The Scalable k-means algorithm [53] makes use of the idea of CF vectors by using those aggregation structures to compress the statistics of data objects that are unlikely to change their cluster membership. This way, an efficient computation of k-means clusterings on very large data sets is enabled.

This idea is carried on in Single-pass k-means [93] by compressing fully disjoint data chunks and keeping only the relevant k-means statistics in memory. CluStream [14] extends the originally proposed form of clustering feature vectors by incorporating temporal information. The resulting data structure is called microcluster, and the proposed algorithm employs a k-means based clustering on the microclusters. Another method that uses a slightly modified variant of the original CF structure can be found in [280]. In [60], the authors propose a density-based counterpart to CluStream called DenStream. By following the density-based clustering paradigm, they manage to get rid of the parameter k, which requires the user to predefine the number of microclusters. Two similar approaches that mainly focus on the slightly related task of anytime clustering, i.e., stream clustering approaches that are capable of providing (possibly less accurate) results whenever clustering results are requested, have been presented in [145, 114].
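To illustrate the CF idea more concretely, the following is a minimal sketch of a BIRCH-style feature vector; the class and method names are ours, but the three components and the incrementality/additivity operations follow the description above:

    import numpy as np

    class CF:
        # BIRCH-style clustering feature: number of objects N,
        # linear sum LS, and per-dimension squared sum SS.
        def __init__(self, dim):
            self.n = 0
            self.ls = np.zeros(dim)
            self.ss = np.zeros(dim)

        def insert(self, x):      # incrementality: absorb one object
            self.n += 1
            self.ls += x
            self.ss += x * x

        def merge(self, other):   # additivity: component-wise sums
            self.n += other.n
            self.ls += other.ls
            self.ss += other.ss

        def centroid(self):
            return self.ls / self.n

        def radius(self):
            # root mean squared distance of the members to the centroid
            return np.sqrt(max(0.0, self.ss.sum() / self.n
                                    - (self.centroid() ** 2).sum()))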

A somewhat less sophisticated, but in many applications sufficient, compression technique that is widely employed for stream clustering in general is to only keep track of cluster representatives. The basic idea is to represent entire chunks of data solely in the form of cluster representatives, e.g., cluster centroids. The STREAM algorithm [104] loads a user-specified number of data objects into memory and represents each chunk by 2k representatives, employing a k-medoid-like approach on the data. Once a certain number of representatives has been accumulated, these are themselves further clustered into 2k representatives, and the algorithm proceeds with newly incoming data arriving from the stream. Likewise, the streaming variant of the LocalSearch algorithm [198] uses such a divide-and-conquer processing scheme in combination with k-medoids fully in-memory.
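The following skeleton conveys this divide-and-conquer scheme under simplifying assumptions: cluster_fn stands in for any k-medoid-like routine returning k representatives, and the weighting of representatives by the number of objects they stand for is omitted:

    # Simplified STREAM-style processing: compress each chunk to 2k
    # representatives and re-cluster the representatives themselves
    # once too many of them have accumulated.
    def stream_cluster(chunks, k, cluster_fn, budget):
        reps = []
        for chunk in chunks:
            reps.extend(cluster_fn(chunk, 2 * k))   # per-chunk compression
            if len(reps) > budget:
                reps = cluster_fn(reps, 2 * k)      # second-level compression
        return cluster_fn(reps, k)                  # final k representatives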

Further techniques for tackling the challenge of summarizing data streams are the coreset tree structure as proposed in [8] and the family of methods that use a grid-based summary strategy, e.g., as used in [61, 66]. However, the stream clustering algorithms presented in this thesis use variants of the former two methods, i.e., CF-like summaries and cluster representatives, and thus we refer to the survey in [235] for an overview of previously presented methods using the latter two summary techniques.

Subspace Clustering on Data Streams

As we focus on subspace clustering on data streams in this thesis, the following provides a somewhat more detailed review of related work tackling the problem of clustering high-dimensional data streams.

The first work able to properly cluster high-dimensional data streams was proposed in [15]. The HPStream algorithm is a projected subspace clustering method.

Precisely, it is a k-means based approach that uses an adapted form of CF vectors to represent the relevant cluster statistics. This data structure not only fulfills the additivity and incrementality properties but also has the property of temporal multiplicity, since the CF vectors, respectively fading cluster structures, are subject to an exponential decay over time. The key idea behind the stream clustering scheme is that each fading cluster structure is associated with a binary vector containing 1-entries for preferred dimensions and 0-entries otherwise. If a new data object arrives from the data stream, HPStream uses these binary vectors to calculate the projected distance to the closest cluster structure for assigning the data object. To make the assignment more robust, the projected clustering is iteratively refined by recomputing the preferred dimensions of each cluster, at least temporarily, for incoming data objects. Data objects are only absorbed if they lie within a specific radius of the chosen cluster. It is also noteworthy that the entire stream clustering process requires an initialization step to create an initial set of clusters.
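A minimal sketch of this assignment criterion, in our own formulation (HPStream's actual distance may differ, e.g., in how it normalizes over the number of preferred dimensions), is the following:

    import numpy as np

    # Distance between an object and a cluster centroid, computed only
    # over the cluster's preferred dimensions (binary vector: 1 = preferred).
    def projected_distance(x, centroid, preferred):
        diff = (x - centroid)[preferred.astype(bool)]
        return np.sqrt(np.sum(diff ** 2))

    # Assign x to the cluster with the smallest projected distance.
    def assign(x, centroids, preferences):
        dists = [projected_distance(x, c, p)
                 for c, p in zip(centroids, preferences)]
        return int(np.argmin(dists))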

The IncPreDeCon algorithm [147] is an incremental version of the density-based projected clustering algorithm PreDeCon. Although IncPreDeCon is designed to handle dynamic data in the sense that the projected clustering can be adapted incrementally, it does not support any form of aging and hence cannot deal with streaming data directly. Nonetheless, this algorithm already presents a solution to the incrementality property and is therefore a preliminary step towards the density-based projected stream clustering algorithms PreDeConStream [115] and HDDStream [197]. PreDeConStream and HDDStream have been developed simultaneously and both rely on the basic ideas of the PreDeCon, respectively IncPreDeCon, algorithm. During the online phase, both methods aggregate incoming data objects within different microcluster structures, i.e., core, potential, and outlier microclusters, and retrieve the final clustering during the offline phase by following (slightly different) variants of the density-based clustering scheme proposed in [41].

The SiblingTree method presented in [202] is a grid-based subspace clustering approach that aims at detecting all low-dimensional clusters in all subspaces. The idea is to start monitoring the data distributions in one-dimensional subspaces, maintaining the grid cells in a so-called sibling list, and splitting cells once they reach a certain density. Splitting the grid cells corresponds to considering higher-dimensional subspaces, and by doing this iteratively, the algorithm builds up a tree-like structure in order to support efficient maintenance during the online phase.
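The split-on-density mechanics can be conveyed by the following toy sketch; it is not the actual SiblingTree structure (which splits cells towards higher-dimensional subspaces), as it merely refines a one-dimensional interval once it becomes dense:

    # Toy grid cell: counts objects in [lo, hi) and splits into two
    # child cells once the count reaches a density threshold.
    class Cell:
        def __init__(self, lo, hi, threshold):
            self.lo, self.hi, self.threshold = lo, hi, threshold
            self.count, self.children = 0, None

        def insert(self, x):
            if self.children is not None:          # already split: delegate
                self.children[x >= (self.lo + self.hi) / 2].insert(x)
                return
            self.count += 1
            if self.count >= self.threshold:       # dense: refine the grid
                mid = (self.lo + self.hi) / 2
                self.children = (Cell(self.lo, mid, self.threshold),
                                 Cell(mid, self.hi, self.threshold))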

A vaguely related work that mainly focuses on the task of feature selection in high-dimensional data streams can be found in [123]. By maintaining low-rank approximations of the observed data coming from the data stream, the presented approach identifies the most interesting features. This can be interpreted as a global approach to projected subspace clustering; however, it neither detects locally dense subspace clusters nor is able to retrieve precise information about the different subsets of dimensions in which subspace clusters may exist.
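One plausible reading of the underlying idea, in a hedged sketch that is not the actual method of [123], is to score features by their loadings in a truncated SVD of the centered observed data:

    import numpy as np

    # Score each feature by how strongly it loads on the top singular
    # directions of the data; high scores mark variance-carrying features.
    def feature_scores(X, rank):
        Xc = X - X.mean(axis=0)
        _, s, vt = np.linalg.svd(Xc, full_matrices=False)
        return ((vt[:rank] ** 2) * (s[:rank, None] ** 2)).sum(axis=0)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    X[:, 3] += 5 * rng.normal(size=200)   # feature 3 carries extra variance
    print(np.argsort(feature_scores(X, rank=2))[::-1][:3])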
