
K-means clustering

by Philipp Düren

1 Derivation of the algorithm's essential idea

A random variable $X$ with values in $\mathbb{R}$ exhibiting two clusters is modelled by assuming that it has a probability distribution that is a mixture of two Gaussians. For initial purposes we assume the samples from the individual Gaussians to be labelled accordingly, i.e. each sample is in state $\kappa = 1$ or $\kappa = 2$, where $\kappa$ is called the label of the sample. Then the generation process of $X$ is as follows:

1. Choose either the first or the second Gaussian by using the (given) distribution of the label, $P(\kappa = i)$, $i = 1, 2$. For brevity, we write $P(\kappa = i) = p_i$, where $p_1 + p_2 = 1$.

2. Generate a sample according to this Gaussian using the common probability distribution function (see the sketch below).
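To make this generative process concrete, here is a minimal sampling sketch in Python (the numerical values of $p_k$, $\mu_k$ and $\sigma$ are arbitrary illustrative choices, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (arbitrary choices): mixing weights p_1, p_2,
# component means mu_1, mu_2 and a shared standard deviation sigma.
p = np.array([0.4, 0.6])    # P(kappa = 1), P(kappa = 2); must sum to 1
mu = np.array([-2.0, 3.0])  # mu_1, mu_2
sigma = 1.0

N = 1000
# Step 1: draw the label kappa_n for every sample from the discrete distribution p.
labels = rng.choice([0, 1], size=N, p=p)
# Step 2: draw x_n from the Gaussian selected by its label.
x = rng.normal(loc=mu[labels], scale=sigma)
```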

Hence the density of the random variable $X$ is
\[
  f_{X\mid\theta}(x) \;=\; \sum_{k=1}^{2} p_k\, \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\left( -\frac{(x-\mu_k)^2}{2\sigma^2} \right) \;=\; \sum_{k=1}^{2} P(\kappa = k)\, f_{X\mid\theta,\,\kappa=k}(x),
\]
where $\theta = (\mu_1, \mu_2, \sigma)$ is the parameter vector and we denote $\mu = (\mu_1, \mu_2)$.

We would like to get a grip on the probability distribution of the parameters $\mu_1$ and $\mu_2$ (we regard $\sigma$ as fixed). For that matter, we assume a prior distribution $f_\mu$ and accept the following black-box Bayesian formulas for densities:

If $Y$ and $Z$ are continuously distributed according to their densities $f_Y$ and $f_Z$, then
\[
  f_{Y\mid Z} = \frac{f_{Z\mid Y}\, f_Y}{f_Z}.
\]

If $W$ is discretely distributed with discrete probabilities $P(W = w_i) = \pi_i$ and $Y$ is continuously distributed according to its density $f_Y$ and also the conditional density $f_{Y\mid W}$ is known, then
\[
  P(W = w_i \mid Y = y) = \frac{f_{Y\mid W = w_i}(y)\, \pi_i}{f_Y(y)}.
\]

Then we can write

\[
\begin{aligned}
P(\kappa = 1 \mid \theta, X = x) &= \frac{f_{X\mid\theta,\,\kappa=1}(x)\, p_1}{f_{X\mid\theta}(x)}
  = \frac{P(\kappa = 1)\, f_{X\mid\theta,\,\kappa=1}(x)}{\sum_{k=1}^{2} P(\kappa = k)\, f_{X\mid\theta,\,\kappa=k}(x)} \\[4pt]
  &= \frac{1}{1 + \exp\!\left( \dfrac{x(\mu_2 - \mu_1)}{\sigma^2} + \dfrac{\mu_1^2 - \mu_2^2}{2\sigma^2} + \log\dfrac{p_2}{p_1} \right)}, \\[8pt]
P(\kappa = 2 \mid \theta, X = x) &= \frac{1}{1 + \exp\!\left( -\dfrac{x(\mu_2 - \mu_1)}{\sigma^2} - \dfrac{\mu_1^2 - \mu_2^2}{2\sigma^2} - \log\dfrac{p_2}{p_1} \right)}.
\end{aligned}
\]

For brevity we denote
\[
  p_{k\mid x} := P(\kappa = k \mid \theta, X = x).
\]
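As an illustration (my own sketch, not part of the original text), these responsibilities can be evaluated directly with the logistic expression just derived:

```python
import numpy as np

def responsibilities(x, mu, sigma, p):
    """p_{k|x} for a two-component 1D Gaussian mixture with shared sigma.

    x can be a scalar or a NumPy array; mu = (mu_1, mu_2), p = (p_1, p_2).
    """
    x = np.asarray(x, dtype=float)
    # Logistic form derived above:
    # P(kappa=1 | x) = 1 / (1 + exp( x(mu_2-mu_1)/sigma^2
    #                                + (mu_1^2-mu_2^2)/(2 sigma^2) + log(p_2/p_1) ))
    z = (x * (mu[1] - mu[0]) / sigma**2
         + (mu[0]**2 - mu[1]**2) / (2 * sigma**2)
         + np.log(p[1] / p[0]))
    p1 = 1.0 / (1.0 + np.exp(z))
    return np.stack([p1, 1.0 - p1], axis=-1)  # columns: p_{1|x}, p_{2|x}
```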


Now we assume that the parameters $\mu_k$ are unknown and we wish to infer them from the sample $\{x_n\}_{n=1}^N$. We can derive

\[
  f_{\mu \mid X_n = \{x_n\}_{n=1}^N}(\mu_1, \mu_2) = \frac{f_{X_n \mid \mu = (\mu_1,\mu_2)}\!\left(\{x_n\}_{n=1}^N\right)\, f_\mu(\mu_1, \mu_2)}{f_{X_n}\!\left(\{x_n\}_{n=1}^N\right)}
\]
and
\[
  f_{X_n \mid \mu = (\mu_1,\mu_2)}\!\left(\{x_n\}_{n=1}^N\right) = \prod_{n=1}^{N} f_{X\mid\theta}(x_n).
\]

We can reason that the most probable guess for $\mu$ is the maximum of the product in the numerator of the fraction above. If we assume a non-committal prior distribution $f_\mu(\mu_1, \mu_2)$, we need to maximize the conditional density of $X_n$ given $\mu$, i.e. the likelihood of $\mu$. It will be easier to maximize the natural logarithm of this quantity, and we denote for brevity

\[
  L(\mu) := \log\!\left( f_{X_n \mid \mu = (\mu_1,\mu_2)}\!\left(\{x_n\}_{n=1}^N\right) \right).
\]
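Numerically, $L(\mu)$ can be evaluated with a log-sum-exp for stability; a small sketch under the same two-component, shared-$\sigma$ setting (function and variable names are my own):

```python
import numpy as np
from scipy.special import logsumexp

def log_likelihood(mu, x, sigma, p):
    """L(mu) = sum_n log( sum_k p_k N(x_n; mu_k, sigma^2) )."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # Log of each weighted component density, shape (N, 2).
    log_comp = (-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2
                - 0.5 * np.log(2 * np.pi * sigma**2)
                + np.log(p)[None, :])
    return logsumexp(log_comp, axis=1).sum()
```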

To find the maximum of $L$, we use the Newton method on the gradient of $L$. For that we have to find the gradient and the Hessian of $L$ first.

\[
\begin{aligned}
\frac{\partial}{\partial \mu_k} L(\mu) &= \sum_{n=1}^{N} p_{k\mid x_n}\, \frac{x_n - \mu_k}{\sigma^2}, \\[4pt]
\frac{\partial^2}{\partial \mu_k^2} L(\mu) &= -\sum_{n=1}^{N} p_{k\mid x_n}\, \frac{1}{\sigma^2} + \sum_{n=1}^{N} \left( \frac{\partial}{\partial \mu_k} p_{k\mid x_n} \right) \frac{x_n - \mu_k}{\sigma^2} \\[4pt]
 &\approx -\sum_{n=1}^{N} p_{k\mid x_n}\, \frac{1}{\sigma^2},
\end{aligned}
\]
where we will use the approximation in the last line, and the Hessian is thus (approximately) the $2 \times 2$ diagonal matrix
\[
  H \approx \operatorname{diag}\!\left( -\sum_{n=1}^{N} p_{1\mid x_n}\, \frac{1}{\sigma^2},\; -\sum_{n=1}^{N} p_{2\mid x_n}\, \frac{1}{\sigma^2} \right).
\]

The Newton method thus is
\[
  \mu' = \mu - H^{-1} \left( \sum_{n=1}^{N} p_{k\mid x_n}\, \frac{x_n - \mu_k}{\sigma^2} \right)_{\!k=1,2},
\]
which componentwise reads
\[
  \mu_k' = \mu_k + \frac{\sum_{n=1}^{N} p_{k\mid x_n}\, x_n}{\sum_{n=1}^{N} p_{k\mid x_n}} - \mu_k\, \frac{\sum_{n=1}^{N} p_{k\mid x_n}}{\sum_{n=1}^{N} p_{k\mid x_n}} = \frac{\sum_{n=1}^{N} p_{k\mid x_n}\, x_n}{\sum_{n=1}^{N} p_{k\mid x_n}}.
\]

Intuitively, this means putting the new cluster centers at the probabilistically weighted center of mass of all data points. The weighting is according to the responsibility of a cluster for a data point, i.e. data points that are regarded as unrelated to a cluster will not have much influence on its new center.

The algorithm consists of two parts: First, we need to calculate the likelihood that the data set is a result of the current guess of the parameters. This means getting the values of all $p_{k\mid x_n}$. We can interpret the $p_{k\mid x_n}$ as responsibilities: of course, $p_{1\mid x_n}$ and $p_{2\mid x_n}$ add to 1, so if one of them is near 1, we say that this cluster takes high responsibility for $x_n$. This step is also called assignment, as we fuzzily assign clusters (we will in general not have $p_{1\mid x_n} = 1$ and $p_{2\mid x_n} = 0$ for most samples, so the responsibility is fuzzy).

Then we need to update our current guess for $\mu$ by the formula above. This step is called update.



In practice, we iterate these two steps until our system does not change anymore.
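Combining the two steps, a minimal sketch of the resulting iteration for the simple two-cluster, fixed-$\sigma$ case could look as follows (function names, initialization and the stopping threshold are my own choices):

```python
import numpy as np

def soft_kmeans_1d(x, mu_init, sigma=1.0, p=(0.5, 0.5), tol=1e-8, max_iter=1000):
    """Iterate assignment and update for a two-component 1D mixture with
    fixed sigma and fixed mixing weights p, as derived above."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu_init, dtype=float).copy()
    p = np.asarray(p, dtype=float)
    for _ in range(max_iter):
        # Assignment: responsibilities p_{k|x_n}, shape (N, 2).
        log_comp = (-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2
                    + np.log(p)[None, :])
        r = np.exp(log_comp - log_comp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # Update: mu_k <- sum_n p_{k|x_n} x_n / sum_n p_{k|x_n}.
        mu_new = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
        if np.max(np.abs(mu_new - mu)) < tol:
            return mu_new
        mu = mu_new
    return mu
```

For instance, with the sample generated in the first sketch, `soft_kmeans_1d(x, mu_init=[0.0, 1.0])` should move the two centers toward the two modes near $-2$ and $3$.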

2 Improvements and Generalizations

Now we model our data set $\{x_n\}_{n=1}^N$, where $x_n \in \mathbb{R}^d$, as a result of a superposition of $K$ multivariate Gaussians $Y_i \sim \mathcal{N}\!\left(\mu^{(i)}, \Sigma^{(i)}\right)$ with mean $\mu^{(i)} \in \mathbb{R}^d$ and covariance matrix $\Sigma^{(i)} \in \mathbb{R}^{d\times d}$:

\[
  f_{X\mid\theta}(x) = \sum_{k=1}^{K} p_k\, \frac{1}{\sqrt{(2\pi)^d \det \Sigma^{(k)}}}\, \exp\!\left( -\tfrac{1}{2}\left(x - \mu^{(k)}\right)^{\!\top} \left(\Sigma^{(k)}\right)^{-1} \left(x - \mu^{(k)}\right) \right) = \sum_{k=1}^{K} P(\kappa = k)\, f_{X\mid\theta,\,\kappa=k}(x).
\]
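As a sanity check of this density (an illustrative sketch with invented parameters, not values from the text), one can evaluate the mixture with SciPy's multivariate normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, p, mus, covs):
    """f_{X|theta}(x) = sum_k p_k N(x; mu^(k), Sigma^(k)) for a point x in R^d."""
    return sum(p_k * multivariate_normal(mean=mu_k, cov=cov_k).pdf(x)
               for p_k, mu_k, cov_k in zip(p, mus, covs))

# Illustrative two-component example in R^2 (arbitrary numbers):
p = [0.5, 0.5]
mus = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[1.0, 0.3], [0.3, 2.0]])]
print(mixture_density(np.array([1.0, 1.0]), p, mus, covs))
```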

Hence the assignment step consists of calculating
\[
  p_{k\mid x_n} = P(\kappa = k \mid \theta, X = x_n) = \frac{p_k\, \dfrac{1}{\sqrt{(2\pi)^d \det \Sigma^{(k)}}}\, \exp\!\left( -\tfrac{1}{2}\left(x_n - \mu^{(k)}\right)^{\!\top} \left(\Sigma^{(k)}\right)^{-1} \left(x_n - \mu^{(k)}\right) \right)}{f_{X\mid\theta}(x_n)}.
\]

After assigning points, the cluster sizes will change: perhaps cluster 1 lost a lot of samples to cluster 2. This should find its expression in the weighting of the distributions, i.e. the coefficients $p_k$. For that we first introduce the notation

\[
  R^{(k)} = \sum_{n=1}^{N} p_{k\mid x_n},
\]

i.e. the total responsibility of cluster $k$. Normalizing these numbers, we get a measure of the proportion of the data that cluster $k$ claims for itself:

\[
  p_k = \frac{R^{(k)}}{\sum_{k'=1}^{K} R^{(k')}}.
\]

Then, using the same arguments as in the simple example, the update step for the cluster centers is

\[
  \mu^{(k)\prime} = \frac{\sum_{n=1}^{N} p_{k\mid x_n}\, x_n}{R^{(k)}}.
\]

It is reasonable to adapt the covariance as well. This can be done using a standard covariance estimator over all data points, again weighted by their responsibilities:

\[
  \Sigma^{(k)\prime} = \frac{\sum_{n=1}^{N} p_{k\mid x_n} \left(x_n - \mu^{(k)}\right)\left(x_n - \mu^{(k)}\right)^{\!\top}}{\sum_{n=1}^{N} p_{k\mid x_n}}.
\]
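Putting the multivariate assignment and the three update formulas together, one assignment/update cycle might be sketched as follows (a plain implementation of the formulas above; function names are my own, and nothing here is optimized or guarded against the degeneracies discussed in the next section):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(x, p, mus, covs):
    """One assignment + update step for a K-component multivariate Gaussian mixture.

    x: (N, d) data, p: (K,) mixing weights, mus: (K, d) means, covs: (K, d, d) covariances.
    """
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    mus = np.asarray(mus, dtype=float)
    covs = np.asarray(covs, dtype=float)
    N, d = x.shape
    K = len(p)
    # Assignment: responsibilities p_{k|x_n}, shape (N, K).
    dens = np.column_stack([multivariate_normal(mean=mus[k], cov=covs[k]).pdf(x)
                            for k in range(K)])
    r = dens * p[None, :]
    r /= r.sum(axis=1, keepdims=True)
    # Total responsibilities R^(k) and new mixing weights p_k.
    R = r.sum(axis=0)                 # shape (K,)
    p_new = R / R.sum()
    # New means: responsibility-weighted centers of mass.
    mus_new = (r.T @ x) / R[:, None]  # shape (K, d)
    # New covariances: responsibility-weighted covariance estimator.
    covs_new = np.empty_like(covs)
    for k in range(K):
        diff = x - mus_new[k]
        covs_new[k] = (r[:, k, None] * diff).T @ diff / R[k]
    return p_new, mus_new, covs_new
```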

3 Conclusion, Caveats and Citations

K-means clustering works well for a reasonable set of problems where the data really comes from Gaussian distributions. Of course, a crescent-shaped sample will not be modelled appropriately, nor will a set of strings of data points which a human can easily make out as being clustered intuitively.

Also, K-means can blow up: once a $\mu^{(k)}$ lies exactly on one data point $x_n$, the likelihood of that match becomes perfect, yielding a covariance matrix $\to 0$. This is a typical flaw of maximum likelihood methods: overfitting of the data is not excluded and can lead to pathological results.
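A common pragmatic safeguard, not discussed in the text but standard practice, is to add a small ridge $\varepsilon\,\mathrm{Id}$ to every estimated covariance so it never becomes singular:

```python
import numpy as np

def regularize_covariance(cov, eps=1e-6):
    """Add a small ridge eps * I so the covariance stays invertible even when a
    cluster collapses onto a single data point (the blow-up described above)."""
    cov = np.asarray(cov, dtype=float)
    return cov + eps * np.eye(cov.shape[0])
```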

This short overview was shamelessly C&P-ed from [Mac03].



Figure 1. An example where K-means clustering will not work

Bibliography

[Mac03] David J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
