
FINDING CLUSTERS IN PARALLEL UNIVERSES

DAVID PATTERSON and MICHAEL R. BERTHOLD

Tripos Inc., Data Analysis Institute

601 Gateway Blvd., Suite 720, South San Francisco, CA 94080, USA

eMail: {pat,berthold}@tripos.com

Abstract

Many clustering algorithms have been proposed in recent years. Most methods operate in an iterative manner and aim to optimize a specific energy function. We present an algorithm that directly finds a set of cluster centers based on an analysis of the distribution of patterns in the local neighborhood of each potential cluster center through the use of so-called Neighborgrams. In addition, this analysis can be carried out in several feature spaces in parallel, effectively finding the optimal set of features for each cluster independently. We demonstrate how the algorithm works on an artificial data set and show its usefulness using a well-known benchmark data set.

Keywords

Clustering, multiple feature spaces, neighborgrams.

1 Introduction

Clustering has been one of the most widely used methods for the analysis of large data sets over the past decade [1].

Most algorithms attempt to find an optimal set of cluster centers by adjusting their position and size iteratively, through subsequent small adjustments of the parameters. Many variants of such algorithms exist; the most prominent example is probably Kohonen's Learning Vector Quantization [2]. Many variants using imprecise notions of cluster membership have also been proposed [3].
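For illustration, the iterative adjustment scheme underlying LVQ-style algorithms can be sketched as a single update step. This sketch is not taken from the cited work; the function and parameter names (`lvq_step`, `rate`) are hypothetical:

```python
import numpy as np

def lvq_step(centers, center_labels, x, label, rate=0.05):
    """One LVQ1-style update: move the closest center toward the
    pattern x if their labels agree, away from it otherwise."""
    dists = np.linalg.norm(centers - x, axis=1)
    i = int(np.argmin(dists))                      # best-matching center
    sign = 1.0 if center_labels[i] == label else -1.0
    centers[i] += sign * rate * (x - centers[i])   # small iterative adjustment
    return centers
```

Repeating such small steps over many presentations of the data is exactly the kind of gradual parameter adjustment the Neighborgram approach described below avoids.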

Adjusting the cluster parameters iteratively (usually by means of a gradient descent procedure) makes sense in the case of vast amounts of data with positive and negative examples. Approaches that try to find cluster centers more directly have also been proposed; many examples can be found in algorithms that train local basis function networks, such as a constructive training algorithm for Probabilistic Neural Networks [4]. This algorithm iterates over the training instances and, whenever it needs a new cluster to model a newly encountered pattern, introduces a new cluster center at its position. Such an approach depends on the order of training examples and is therefore not guaranteed to find an optimal set of cluster centers either.

However, if the data set is highly skewed and the analysis aims to extract a model that explains the minority class with respect to all other examples, a more direct approach can be taken. Such data is available in many applications; a prominent example is drug discovery research, where the focus lies on identifying the few promising active compounds in a vast collection of available chemical structures that mostly show no activity or are otherwise regarded as useless.

In this paper we describe an algorithm that directly finds an optimal set of cluster centers by computing so-called Neighborgrams for each positive example. These neighborgrams summarize the distribution of positive and negative examples in the vicinity of each positive example. We can then extract the set of best cluster centers from the set of neighborgrams, according to some optimality criterion. The algorithm relies on the fact that the class of interest (the positive examples) has a reasonably small size (usually up to several thousand examples), whereas the opposing examples can be as numerous as desired.
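The idea can be sketched in a few lines: for each positive example, record the class labels of its nearest neighbors ordered by distance (its neighborgram), and then pick the example with the best score. The purity-run criterion used here is a simplified stand-in, not the paper's actual optimality criterion, and all names are hypothetical:

```python
import numpy as np

def neighborgram(x, X, y, k=8):
    """Labels of the k nearest neighbors of x (including x itself),
    ordered by Euclidean distance -- a minimal neighborgram."""
    order = np.argsort(np.linalg.norm(X - x, axis=1))
    return y[order[:k]]

def best_cluster_center(X, y, k=8):
    """Return the positive example whose neighborgram has the longest
    prefix of purely positive labels, plus the length of that prefix."""
    best, best_len = None, -1
    for i in np.where(y == 1)[0]:
        run = 0
        for lab in neighborgram(X[i], X, y, k):
            if lab != 1:
                break
            run += 1
        if run > best_len:
            best, best_len = int(i), run
    return best, best_len
```

A candidate sitting deep inside a region of positive examples thus scores higher than one surrounded by negatives, without any iterative refinement of the center's position.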

In addition, we describe an extension of this algorithm that finds clusters in parallel universes. This becomes enormously useful in cases where different ways to describe entities exist. In drug discovery, for example, various ways to describe molecules are used, and it is often unclear which of these descriptors is optimal for any given task. It is therefore desirable that the clustering algorithm does not require a certain descriptor to be chosen a priori. Clustering in parallel universes solves this problem by finding cluster centers in these different descriptor spaces, in effect parallelizing feature (space) selection with clustering.
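The parallel-universe extension can be sketched as follows: the same neighborhood score is computed independently in each descriptor space, and each cluster candidate keeps the universe in which it scores best. The purity-run score is again a simplified assumption, and all names are hypothetical:

```python
import numpy as np

def purity_run(x, X, y, k=8):
    """Length of the purely positive prefix of x's neighborgram in one
    feature space (neighbors ordered by Euclidean distance)."""
    order = np.argsort(np.linalg.norm(X - x, axis=1))
    run = 0
    for lab in y[order[:k]]:
        if lab != 1:
            break
        run += 1
    return run

def best_universe_per_positive(universes, y, k=8):
    """For each positive example, pick the universe in which its
    neighborgram scores best. `universes` is a list of (n, d_u) arrays
    describing the same n entities in different descriptor spaces."""
    result = {}
    for i in np.where(y == 1)[0]:
        scores = [purity_run(U[i], U, y, k) for U in universes]
        result[int(i)] = int(np.argmax(scores))   # index of best universe
    return result
```

Because the choice is made per candidate, two clusters of the same data set can end up living in different descriptor spaces, which is precisely the coupling of feature-space selection with clustering described above.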

First publ. in: E-systems and e-man for cybernetics in cyberspace (Vol. 1) : 2001 IEEE International Conference on Systems, Man, and Cybernetics, Tucson, Arizona, October 7 - 10, 2001. Piscataway, NJ : IEEE Service Center, 2001, pp. 123-128

Konstanzer Online-Publikations-System (KOPS) URL: http://www.ub.uni-konstanz.de/kops/volltexte/2008/6646/

URN: http://nbn-resolving.de/urn:nbn:de:bsz:352-opus-66464

