
K-means clustering

by Philipp Düren

1 Derivation of the algorithm's essential idea

A random variable $X$ with values in $\mathbb{R}$ exhibiting two clusters is modelled by assuming that it has a probability distribution that is a mixture of two Gaussians. For initial purposes we assume the samples from the individual Gaussians to be labelled accordingly, i.e. each sample is in state $\kappa = 1$ or $\kappa = 2$, where $\kappa$ is called the label of the sample. Then the generation process of $X$ is as follows:

1. Choose either the first or the second Gaussian by using the (given) distribution of the label, $P(\kappa = i)$, $i = 1, 2$. For brevity, we write $P(\kappa = i) = p_i$, where $p_1 + p_2 = 1$.

2. Generate a sample according to this Gaussian using the common probability distribution function (see the sketch below).
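To make this generative process concrete, here is a minimal sampling sketch in Python (the numerical values of $p_k$, $\mu_k$ and $\sigma$ are arbitrary illustrative choices, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (arbitrary choices): mixing weights p_1, p_2,
# component means mu_1, mu_2 and a shared standard deviation sigma.
p = np.array([0.4, 0.6])    # P(kappa = 1), P(kappa = 2); must sum to 1
mu = np.array([-2.0, 3.0])  # mu_1, mu_2
sigma = 1.0

N = 1000
# Step 1: draw the label kappa_n for every sample from the discrete distribution p.
labels = rng.choice([0, 1], size=N, p=p)
# Step 2: draw x_n from the Gaussian selected by its label.
x = rng.normal(loc=mu[labels], scale=sigma)
```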

Hence the density of the random variable $X$ is
\[
  f_{X\mid\theta}(x) \;=\; \sum_{k=1}^{2} p_k\, \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\left( -\frac{(x-\mu_k)^2}{2\sigma^2} \right) \;=\; \sum_{k=1}^{2} P(\kappa = k)\, f_{X\mid\theta,\,\kappa=k}(x),
\]
where $\theta = (\mu_1, \mu_2, \sigma)$ is the parameter vector and we denote $\mu = (\mu_1, \mu_2)$.

We would like to get a grip on the probability distribution of the parameters $\mu_1$ and $\mu_2$ (we regard $\sigma$ as fixed). For that matter, we assume a prior distribution $f_\mu$ and accept the following black-box Bayesian formulas for densities:

If $Y$ and $Z$ are continuously distributed according to their densities $f_Y$ and $f_Z$, then
\[
  f_{Y\mid Z} = \frac{f_{Z\mid Y}\, f_Y}{f_Z}.
\]

If $W$ is discretely distributed with discrete probabilities $P(W = w_i) = \pi_i$ and $Y$ is continuously distributed according to its density $f_Y$ and also the conditional density $f_{Y\mid W}$ is known, then
\[
  P(W = w_i \mid Y = y) = \frac{f_{Y\mid W = w_i}(y)\, \pi_i}{f_Y(y)}.
\]

Then we can write

\[
\begin{aligned}
P(\kappa = 1 \mid \theta, X = x) &= \frac{f_{X\mid\theta,\,\kappa=1}(x)\, p_1}{f_{X\mid\theta}(x)}
  = \frac{P(\kappa = 1)\, f_{X\mid\theta,\,\kappa=1}(x)}{\sum_{k=1}^{2} P(\kappa = k)\, f_{X\mid\theta,\,\kappa=k}(x)} \\[4pt]
  &= \frac{1}{1 + \exp\!\left( \dfrac{x(\mu_2 - \mu_1)}{\sigma^2} + \dfrac{\mu_1^2 - \mu_2^2}{2\sigma^2} + \log\dfrac{p_2}{p_1} \right)}, \\[8pt]
P(\kappa = 2 \mid \theta, X = x) &= \frac{1}{1 + \exp\!\left( -\dfrac{x(\mu_2 - \mu_1)}{\sigma^2} - \dfrac{\mu_1^2 - \mu_2^2}{2\sigma^2} - \log\dfrac{p_2}{p_1} \right)}.
\end{aligned}
\]

For brevity we denote
\[
  p_{k\mid x} := P(\kappa = k \mid \theta, X = x).
\]
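As an illustration (my own sketch, not part of the original text), these responsibilities can be evaluated directly with the logistic expression just derived:

```python
import numpy as np

def responsibilities(x, mu, sigma, p):
    """p_{k|x} for a two-component 1D Gaussian mixture with shared sigma.

    x can be a scalar or a NumPy array; mu = (mu_1, mu_2), p = (p_1, p_2).
    """
    x = np.asarray(x, dtype=float)
    # Logistic form derived above:
    # P(kappa=1 | x) = 1 / (1 + exp( x(mu_2-mu_1)/sigma^2
    #                                + (mu_1^2-mu_2^2)/(2 sigma^2) + log(p_2/p_1) ))
    z = (x * (mu[1] - mu[0]) / sigma**2
         + (mu[0]**2 - mu[1]**2) / (2 * sigma**2)
         + np.log(p[1] / p[0]))
    p1 = 1.0 / (1.0 + np.exp(z))
    return np.stack([p1, 1.0 - p1], axis=-1)  # columns: p_{1|x}, p_{2|x}
```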


Now we assume that the parameters $\mu_k$ are unknown and we wish to infer them from the sample $\{x_n\}_{n=1}^N$. We can derive

\[
  f_{\mu \mid X_n = \{x_n\}_{n=1}^N}(\mu_1, \mu_2) = \frac{f_{X_n \mid \mu = (\mu_1,\mu_2)}\!\left(\{x_n\}_{n=1}^N\right)\, f_\mu(\mu_1, \mu_2)}{f_{X_n}\!\left(\{x_n\}_{n=1}^N\right)}
\]
and
\[
  f_{X_n \mid \mu = (\mu_1,\mu_2)}\!\left(\{x_n\}_{n=1}^N\right) = \prod_{n=1}^{N} f_{X\mid\theta}(x_n).
\]

We can reason that the most probable guess for $\mu$ is the maximum of the product in the numerator of the fraction above. If we assume a non-committal prior distribution $f_\mu(\mu_1, \mu_2)$, we need to maximize the conditional density of $X_n$ given $\mu$, i.e. the likelihood of $\mu$. It will be easier to maximize the natural logarithm of this quantity, and we denote for brevity

\[
  L(\mu) := \log\!\left( f_{X_n \mid \mu = (\mu_1,\mu_2)}\!\left(\{x_n\}_{n=1}^N\right) \right).
\]
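Numerically, $L(\mu)$ can be evaluated with a log-sum-exp for stability; a small sketch under the same two-component, shared-$\sigma$ setting (function and variable names are my own):

```python
import numpy as np
from scipy.special import logsumexp

def log_likelihood(mu, x, sigma, p):
    """L(mu) = sum_n log( sum_k p_k N(x_n; mu_k, sigma^2) )."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # Log of each weighted component density, shape (N, 2).
    log_comp = (-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2
                - 0.5 * np.log(2 * np.pi * sigma**2)
                + np.log(p)[None, :])
    return logsumexp(log_comp, axis=1).sum()
```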

To find the maximum of $L$, we use the Newton method on the gradient of $L$. For that we have to find the gradient and the Hessian of $L$ first.

\[
\begin{aligned}
\frac{\partial}{\partial \mu_k} L(\mu) &= \sum_{n=1}^{N} p_{k\mid x_n}\, \frac{x_n - \mu_k}{\sigma^2}, \\[4pt]
\frac{\partial^2}{\partial \mu_k^2} L(\mu) &= -\sum_{n=1}^{N} p_{k\mid x_n}\, \frac{1}{\sigma^2} + \sum_{n=1}^{N} \left( \frac{\partial}{\partial \mu_k} p_{k\mid x_n} \right) \frac{x_n - \mu_k}{\sigma^2} \\[4pt]
 &\approx -\sum_{n=1}^{N} p_{k\mid x_n}\, \frac{1}{\sigma^2},
\end{aligned}
\]
where we will use the approximation in the last line, and the Hessian is thus (approximately) the $2 \times 2$ diagonal matrix
\[
  H \approx \operatorname{diag}\!\left( -\sum_{n=1}^{N} p_{1\mid x_n}\, \frac{1}{\sigma^2},\; -\sum_{n=1}^{N} p_{2\mid x_n}\, \frac{1}{\sigma^2} \right).
\]

The Newton method thus is
\[
  \mu' = \mu - H^{-1} \left( \sum_{n=1}^{N} p_{k\mid x_n}\, \frac{x_n - \mu_k}{\sigma^2} \right)_{\!k=1,2},
\]
which componentwise reads
\[
  \mu_k' = \mu_k + \frac{\sum_{n=1}^{N} p_{k\mid x_n}\, x_n}{\sum_{n=1}^{N} p_{k\mid x_n}} - \mu_k\, \frac{\sum_{n=1}^{N} p_{k\mid x_n}}{\sum_{n=1}^{N} p_{k\mid x_n}} = \frac{\sum_{n=1}^{N} p_{k\mid x_n}\, x_n}{\sum_{n=1}^{N} p_{k\mid x_n}}.
\]

Intuitively, this means putting the new cluster centers at the probabilistically weighted center of mass of all data points. The weighting is according to the responsibility of a cluster for a data point, i.e. data points that are regarded as unrelated to a cluster will not have much influence on its new center.

The algorithm consists of two parts: First, we need to calculate the likelihood that the data set is a result of the current guess of the parameters. This means getting the values of all $p_{k\mid x_n}$. We can interpret the $p_{k\mid x_n}$ as responsibilities: of course, $p_{1\mid x_n}$ and $p_{2\mid x_n}$ add to 1, so if one of them is near 1, we say that this cluster takes high responsibility for $x_n$. This step is also called assignment, as we fuzzily assign clusters (we will in general not have $p_{1\mid x_n} = 1$ and $p_{2\mid x_n} = 0$ for most samples, so the responsibility is fuzzy).

Then we need to update our current guess for $\mu$ by the formula above. This step is called update.



In practice, we iterate these two steps until our system does not change anymore.
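Combining the two steps, a minimal sketch of the resulting iteration for the simple two-cluster, fixed-$\sigma$ case could look as follows (function names, initialization and the stopping threshold are my own choices):

```python
import numpy as np

def soft_kmeans_1d(x, mu_init, sigma=1.0, p=(0.5, 0.5), tol=1e-8, max_iter=1000):
    """Iterate assignment and update for a two-component 1D mixture with
    fixed sigma and fixed mixing weights p, as derived above."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu_init, dtype=float).copy()
    p = np.asarray(p, dtype=float)
    for _ in range(max_iter):
        # Assignment: responsibilities p_{k|x_n}, shape (N, 2).
        log_comp = (-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2
                    + np.log(p)[None, :])
        r = np.exp(log_comp - log_comp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # Update: mu_k <- sum_n p_{k|x_n} x_n / sum_n p_{k|x_n}.
        mu_new = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
        if np.max(np.abs(mu_new - mu)) < tol:
            return mu_new
        mu = mu_new
    return mu
```

For instance, with the sample generated in the first sketch, `soft_kmeans_1d(x, mu_init=[0.0, 1.0])` should move the two centers toward the two modes near $-2$ and $3$.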

2 Improvements and Generalizations

Now we model our data set $\{x_n\}_{n=1}^N$, where $x_n \in \mathbb{R}^d$, as a result of a superposition of $K$ multivariate Gaussians $Y_i \sim \mathcal{N}\!\left(\mu^{(i)}, \Sigma^{(i)}\right)$ with mean $\mu^{(i)} \in \mathbb{R}^d$ and covariance matrix $\Sigma^{(i)} \in \mathbb{R}^{d\times d}$:

\[
  f_{X\mid\theta}(x) = \sum_{k=1}^{K} p_k\, \frac{1}{\sqrt{(2\pi)^d \det \Sigma^{(k)}}}\, \exp\!\left( -\tfrac{1}{2}\left(x - \mu^{(k)}\right)^{\!\top} \left(\Sigma^{(k)}\right)^{-1} \left(x - \mu^{(k)}\right) \right) = \sum_{k=1}^{K} P(\kappa = k)\, f_{X\mid\theta,\,\kappa=k}(x).
\]
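As a sanity check of this density (an illustrative sketch with invented parameters, not values from the text), one can evaluate the mixture with SciPy's multivariate normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, p, mus, covs):
    """f_{X|theta}(x) = sum_k p_k N(x; mu^(k), Sigma^(k)) for a point x in R^d."""
    return sum(p_k * multivariate_normal(mean=mu_k, cov=cov_k).pdf(x)
               for p_k, mu_k, cov_k in zip(p, mus, covs))

# Illustrative two-component example in R^2 (arbitrary numbers):
p = [0.5, 0.5]
mus = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[1.0, 0.3], [0.3, 2.0]])]
print(mixture_density(np.array([1.0, 1.0]), p, mus, covs))
```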

Hence the assignment step consists of calculating
\[
  p_{k\mid x_n} = P(\kappa = k \mid \theta, X = x_n) = \frac{p_k\, \dfrac{1}{\sqrt{(2\pi)^d \det \Sigma^{(k)}}}\, \exp\!\left( -\tfrac{1}{2}\left(x_n - \mu^{(k)}\right)^{\!\top} \left(\Sigma^{(k)}\right)^{-1} \left(x_n - \mu^{(k)}\right) \right)}{f_{X\mid\theta}(x_n)}.
\]

After assigning points, the cluster sizes will change: perhaps cluster 1 lost a lot of samples to cluster 2. This should find its expression in the weighting of the distributions, i.e. the coefficients $p_k$. For that we first introduce the notation

\[
  R^{(k)} = \sum_{n=1}^{N} p_{k\mid x_n},
\]

i.e. the total responsibility of cluster $k$. Normalizing these numbers, we get a measure of the proportion of the data that cluster $k$ claims for itself:

\[
  p_k = \frac{R^{(k)}}{\sum_{k'=1}^{K} R^{(k')}}.
\]

Then, using the same arguments as in the simple example, the update step for the cluster centers is

\[
  \mu^{(k)\prime} = \frac{\sum_{n=1}^{N} p_{k\mid x_n}\, x_n}{R^{(k)}}.
\]

It is reasonable to adapt the covariance as well. This can be done using a standard covariance estimator over all data points, again weighted by their responsibilities:

\[
  \Sigma^{(k)\prime} = \frac{\sum_{n=1}^{N} p_{k\mid x_n} \left(x_n - \mu^{(k)}\right)\left(x_n - \mu^{(k)}\right)^{\!\top}}{\sum_{n=1}^{N} p_{k\mid x_n}}.
\]
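Putting the multivariate assignment and the three update formulas together, one assignment/update cycle might be sketched as follows (a plain implementation of the formulas above; function names are my own, and nothing here is optimized or guarded against the degeneracies discussed in the next section):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(x, p, mus, covs):
    """One assignment + update step for a K-component multivariate Gaussian mixture.

    x: (N, d) data, p: (K,) mixing weights, mus: (K, d) means, covs: (K, d, d) covariances.
    """
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    mus = np.asarray(mus, dtype=float)
    covs = np.asarray(covs, dtype=float)
    N, d = x.shape
    K = len(p)
    # Assignment: responsibilities p_{k|x_n}, shape (N, K).
    dens = np.column_stack([multivariate_normal(mean=mus[k], cov=covs[k]).pdf(x)
                            for k in range(K)])
    r = dens * p[None, :]
    r /= r.sum(axis=1, keepdims=True)
    # Total responsibilities R^(k) and new mixing weights p_k.
    R = r.sum(axis=0)                 # shape (K,)
    p_new = R / R.sum()
    # New means: responsibility-weighted centers of mass.
    mus_new = (r.T @ x) / R[:, None]  # shape (K, d)
    # New covariances: responsibility-weighted covariance estimator.
    covs_new = np.empty_like(covs)
    for k in range(K):
        diff = x - mus_new[k]
        covs_new[k] = (r[:, k, None] * diff).T @ diff / R[k]
    return p_new, mus_new, covs_new
```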

3 Conclusion, Caveats and Citations

K-means clustering works well for a reasonable set of problems where the data really comes from Gaussian distributions. Of course, a crescent-shaped sample will not be modelled appropriately, nor will a set of strings of data points which a human can easily make out as being clustered intuitively.

Also, K-means can blow up: once a $\mu^{(k)}$ lies exactly on one data point $x_n$, the likelihood of that match becomes perfect, yielding a covariance matrix $\to 0$. This is a typical flaw of maximum likelihood methods: overfitting of the data is not excluded and can lead to pathological results.
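A common pragmatic safeguard, not discussed in the text but standard practice, is to add a small ridge $\varepsilon\,\mathrm{Id}$ to every estimated covariance so it never becomes singular:

```python
import numpy as np

def regularize_covariance(cov, eps=1e-6):
    """Add a small ridge eps * I so the covariance stays invertible even when a
    cluster collapses onto a single data point (the blow-up described above)."""
    cov = np.asarray(cov, dtype=float)
    return cov + eps * np.eye(cov.shape[0])
```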

This short overview was shamelessly C&P-ed from [Mac03].



Figure 1. An example where K-means clustering will not work

Bibliography

[Mac03] David J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
