EDV in Medizin und Biologie 12 (4), 124-126, ISSN 0300-8282
© Eugen Ulmer GmbH & Co., Stuttgart / Gustav Fischer Verlag KG, Stuttgart

On the Nontermination of the K-Means Clustering Algorithm for Certain Data Sets

C. Möbus

Summary

There are some well accepted standards for the formulation of algorithms and computer programs (BAUER and WÖSSNER, 1981): (a) efficiency, (b) finiteness of description and (c) finiteness of operation. It will be demonstrated that the popular K-means clustering algorithm loops infinitely for some data sets if the computer program is only a formula translation of its mathematically correct form. There is a contradiction to standard (c) due to the finite precision of computer arithmetic. Besides its theoretical aspects the article has a practical purpose. It shows ways of correcting some published FORTRAN programs so that termination is achieved for all data sets.

Zusammenfassung

In computer science there are some generally accepted standards for the formulation of algorithms and computer programs (BAUER und WÖSSNER, 1981): (a) efficiency, (b) finiteness of the description and (c) finiteness of the execution. It can be shown that the popular K-means cluster algorithm falls into an endless loop for certain data sets if the program has emerged merely from a formula translation of the correct mathematical form. Such programs violate standard (c). Besides the theoretical considerations the article also has a practical aspect. Readers are given hints on how some FORTRAN programs published in textbooks can be modified so that finiteness of execution is guaranteed for all data sets.

Method

The k-means algorithm (MACQUEEN, 1967) is a simple non-hierarchical method which seeks to minimize an error function E² by assigning each case optimally to a cluster. Many textbooks on clustering contain complete FORTRAN codes (ANDERBERG, 1973; HARTIGAN, 1975; HERDEN and STEINHAUSEN, 1979; SPÄTH, 1975, 1980; STEINHAUSEN, 1977).

A cluster with number p is defined as a nonempty set of indices C_p ⊆ {1, ..., N}. It contains N_p objects x_i (with Σ_p N_p = N, i ∈ C_p). A partition of length m of these N objects is defined as a collection of m nonempty sets of indices such that

(1a)  C_1 \cup C_2 \cup \ldots \cup C_m = \{1, \ldots, N\}

(1b)  C_j \cap C_k = \emptyset for all j \neq k  (\emptyset = empty set)

A partition of length m is characterized by a membership vector p of length N. If p_i = j then object i belongs to cluster j.

In k-means the number m is fixed and p has to be preselected. The algorithm tries to change the sets of indices C_1, ..., C_m in a stepwise manner so that the sets remain nonempty and the error function reaches a minimum:

(2)  E^2 = \sum_{p=1}^{m} \sum_{i \in C_p} (x_i - \bar{x}_p)(x_i - \bar{x}_p)'

with:
x_i = row vector of data values of object i
\bar{x}_p = row vector with group means (centroid) of cluster p
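To make the definitions concrete, the following minimal sketch computes the error function (2) from a membership vector. It is written in modern free-form Fortran rather than the FORTRAN of the cited textbook programs, and the data values, array names and dimensions are invented for illustration only.

program error_function
  implicit none
  integer, parameter :: n = 6, nvar = 2, m = 2
  real    :: x(n, nvar)          ! data matrix, one row per case
  integer :: p(n)                ! membership vector, p(i) = cluster of case i
  real    :: cent(m, nvar)       ! centroids
  integer :: np(m)               ! cluster sizes
  real    :: e2
  integer :: i, j

  x = reshape([1.0, 2.0, 3.0, 7.0, 8.0, 9.0,  &
               1.0, 1.0, 2.0, 8.0, 9.0, 9.0], [n, nvar])
  p = [1, 1, 1, 2, 2, 2]

  ! centroids: mean of the cases assigned to each cluster
  np = 0
  cent = 0.0
  do i = 1, n
     np(p(i)) = np(p(i)) + 1
     cent(p(i), :) = cent(p(i), :) + x(i, :)
  end do
  do j = 1, m
     cent(j, :) = cent(j, :) / real(np(j))
  end do

  ! E^2 = sum of squared distances of each case to its own centroid, eq. (2)
  e2 = 0.0
  do i = 1, n
     e2 = e2 + sum((x(i, :) - cent(p(i), :))**2)
  end do
  write (*, '(a, f10.4)') 'E^2 = ', e2
end program error_function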

If cluster p contains N_p cases the data values can be arranged in a N_p × n data matrix X_p (n = number of variables). The centroid is

(3)  \bar{x}_p = \frac{1}{N_p} \mathbf{1}' X_p    (\mathbf{1}' = row vector with N_p ones)

The dispersion of the p-th cluster can be measured by the variation around the centroid:

(4)  E_p^2 = \sum_{i \in C_p} (x_i - \bar{x}_p)(x_i - \bar{x}_p)' = \sum_{i \in C_p} x_i x_i' - N_p \bar{x}_p \bar{x}_p'
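The two forms of (4) can be checked numerically with a small sketch of the same kind (again modern Fortran with invented data; all variable names are assumptions):

program dispersion_check
  implicit none
  integer, parameter :: np = 4, nvar = 2
  real :: xp(np, nvar), xbar(nvar), e2_direct, e2_shortcut
  integer :: i

  xp = reshape([1.0, 2.0, 4.0, 5.0,  &
                1.0, 3.0, 3.0, 5.0], [np, nvar])

  xbar = sum(xp, dim=1) / real(np)                      ! eq. (3)

  e2_direct = 0.0
  do i = 1, np
     e2_direct = e2_direct + sum((xp(i, :) - xbar)**2)  ! left side of eq. (4)
  end do
  e2_shortcut = sum(xp**2) - real(np) * sum(xbar**2)    ! right side of eq. (4)

  write (*, '(a, 2f10.4)') 'E^2_p both ways: ', e2_direct, e2_shortcut
end program dispersion_check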

Now a new case k is assigned to the old cluster p on a trial and error basis. The updated centroid is

(5)  \bar{x}_{p+} = \frac{1}{N_p + 1} (N_p \bar{x}_p + x_k)

\bar{x}_p = old centroid
x_k = data vector of case k
\bar{x}_{p+} = new centroid

and the updated error (dispersion of cluster p ∪ {k}) is

(6)  E_{p+}^2 = E_p^2 + \frac{N_p}{N_p + 1} (x_k - \bar{x}_p)(x_k - \bar{x}_p)' = E_p^2 + \frac{N_p}{N_p + 1} d_{kp}^2

E_p^2 = dispersion of old cluster p
E_{p+}^2 = dispersion of new cluster p ∪ {k}
d_{kp}^2 = square of distance from data point k to the old centroid \bar{x}_p

The updating of the error is analogous when a case k leaves the cluster q. The new error is

(7)  E_{q-}^2 = E_q^2 - \frac{N_q}{N_q - 1} (x_k - \bar{x}_q)(x_k - \bar{x}_q)' = E_q^2 - \frac{N_q}{N_q - 1} d_{kq}^2
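Equations (5) to (7) lend themselves to small update routines. The following sketch (modern Fortran; the cluster statistics and all names are invented and not taken from the published programs) moves one case k from cluster q to cluster p and updates both centroids and dispersions incrementally:

program update_demo
  implicit none
  real    :: centp(2), centq(2), xk(2), e2p, e2q
  integer :: np, nq

  ! invented current statistics of two clusters p and q
  centp = [1.0, 1.0];  np = 5;  e2p = 4.0
  centq = [4.0, 4.0];  nq = 6;  e2q = 9.0
  xk    = [2.0, 3.0]                      ! data vector of case k, currently in q

  call remove_case(xk, centq, nq, e2q)    ! eq. (7): k leaves cluster q
  call add_case(xk, centp, np, e2p)       ! eqs. (5) and (6): k joins cluster p

  write (*, '(a, 2f8.3, a, f8.3)') 'cluster p: centroid', centp, '   E^2 =', e2p
  write (*, '(a, 2f8.3, a, f8.3)') 'cluster q: centroid', centq, '   E^2 =', e2q

contains
  ! eqs. (5)/(6): add case xk to a cluster, updating centroid and dispersion
  subroutine add_case(xk, cent, nmem, e2)
    real,    intent(in)    :: xk(:)
    real,    intent(inout) :: cent(:), e2
    integer, intent(inout) :: nmem
    real :: d2
    d2   = sum((xk - cent)**2)                         ! d^2 to the old centroid
    e2   = e2 + real(nmem) / real(nmem + 1) * d2       ! eq. (6)
    cent = (real(nmem) * cent + xk) / real(nmem + 1)   ! eq. (5)
    nmem = nmem + 1
  end subroutine add_case

  ! eq. (7): remove case xk from a cluster
  subroutine remove_case(xk, cent, nmem, e2)
    real,    intent(in)    :: xk(:)
    real,    intent(inout) :: cent(:), e2
    integer, intent(inout) :: nmem
    real :: d2
    d2   = sum((xk - cent)**2)                         ! d^2 to the old centroid
    e2   = e2 - real(nmem) / real(nmem - 1) * d2       ! eq. (7)
    cent = (real(nmem) * cent - xk) / real(nmem - 1)   ! inverse of eq. (5)
    nmem = nmem - 1
  end subroutine remove_case
end program update_demo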

The case k (k = 1, ..., N) is reassigned to cluster p if

(8a)  E_{new}^2 < E_{old}^2

where E_{old}^2 = E_p^2 + E_q^2 and E_{new}^2 = E_{p+}^2 + E_{q-}^2 = E_p^2 + E_q^2 + B - A, with B = \frac{N_p}{N_p + 1} d_{kp}^2 and A = \frac{N_q}{N_q - 1} d_{kq}^2, or equivalently if

(8b)  B - A < 0

(8c)  B < A

Reallocation of objects is finished if no k, p, q can be found so that (8) is true. The condition (8) can be regarded as the core of the algorithm.

If the condition (8) is programmed for a digital computer (e.g. in FORTRAN like: IF (B .LT. A) ...), k-means will not terminate for certain data sets, as will be shown now.
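A direct formula translation of condition (8c) into a reallocation pass might look like the following sketch. It is modern free-form Fortran with invented data and is not the organization of the published textbook programs; for simplicity it recomputes centroids after every accepted move instead of using the incremental updates (5) to (7). The comparison B .LT. A in the inner loop is exactly the naive test whose consequences are examined next.

program reallocation_pass
  implicit none
  integer, parameter :: n = 6, nvar = 2, m = 2
  real    :: x(n, nvar), cent(m, nvar), d2(m), a, b
  integer :: p(n), np(m), i, j, q
  logical :: moved

  ! invented data: six cases in R^2 and a preselected membership vector
  x = reshape([1.0, 2.0, 3.0, 7.0, 8.0, 5.0,  &
               1.0, 1.0, 2.0, 8.0, 9.0, 5.0], [n, nvar])
  p = [1, 1, 1, 2, 2, 1]

  moved = .true.
  do while (moved)                              ! repeat passes until no case moves
     moved = .false.
     call centroids()
     do i = 1, n
        q = p(i)
        if (np(q) .eq. 1) cycle                 ! clusters must remain nonempty
        do j = 1, m
           d2(j) = sum((x(i, :) - cent(j, :))**2)
        end do
        a = real(np(q)) / real(np(q) - 1) * d2(q)      ! loss when i leaves q, eq. (7)
        do j = 1, m
           if (j .eq. q) cycle
           b = real(np(j)) / real(np(j) + 1) * d2(j)   ! gain when i joins j, eq. (6)
           if (b .lt. a) then                   ! naive formula translation of (8c)
              p(i) = j
              moved = .true.
              call centroids()                  ! refresh statistics after the move
              exit
           end if
        end do
     end do
  end do
  write (*, '(a, 6i3)') 'final membership vector: ', p

contains
  subroutine centroids()
    integer :: ii, jj
    np = 0
    cent = 0.0
    do ii = 1, n
       np(p(ii)) = np(p(ii)) + 1
       cent(p(ii), :) = cent(p(ii), :) + x(ii, :)
    end do
    do jj = 1, m
       cent(jj, :) = cent(jj, :) / real(np(jj))
    end do
  end subroutine centroids
end program reallocation_pass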

The nontermination of k-means

Suppose we have a data set for which A = B. So there is the mathematical equality

(9)  \frac{N_p}{N_p + 1} d_{kp}^2 = \frac{N_q}{N_q - 1} d_{kq}^2

and k-means should not assign k to cluster p if k is in q. If we further assume that N_p = 5 and N_q = 6, the equality is not violated:

(10)  \frac{5}{6} d_{kp}^2 = \frac{6}{5} d_{kq}^2

However, if the precision of the digital computer is e.g. 6 decimals, it calculates 5/6 = 0.833333. Now we have the inequality

(11)  0.833333 \, d_{kp}^2 < \frac{6}{5} d_{kq}^2

in the internal representation of the computer. The consequence of this is the ugly result that case k is being shifted between cluster p and q without end! The only method to avoid infinite looping is to reformulate the condition (8).

An opportunity could be the following FORTRAN statement

(12)  IF ((A - B) .GT. EPS) ...

with EPS as a small machine-dependent constant.
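To make the argument of (9) to (12) concrete, the following sketch assumes d²_kp = 36 and d²_kq = 25, so that mathematically A = B = 30, and simulates the 6-decimal machine of the article by writing 5/6 as the truncated constant 0.833333; the value of EPS is an assumption. The naive test (8c) accepts the spurious move, while the reformulated test (12) rejects it:

program termination_test
  implicit none
  real :: dkp2, dkq2, a, b, eps

  dkp2 = 36.0               ! squared distance of case k to the centroid of p
  dkq2 = 25.0               ! squared distance of case k to the centroid of q
  eps  = 1.0e-4             ! small machine dependent constant for test (12)

  b = 0.833333 * dkp2       ! N_p/(N_p+1) = 5/6, truncated to 6 decimals
  a = 6.0 / 5.0 * dkq2      ! N_q/(N_q-1) = 6/5; mathematically a = b = 30

  ! naive formula translation of (8c): the spurious move is accepted
  if (b .lt. a) write (*, *) 'naive test (8c): case k would move from q to p'

  ! reformulated condition (12): the spurious move is rejected
  if ((a - b) .gt. eps) then
     write (*, *) 'guarded test (12): case k would move from q to p'
  else
     write (*, *) 'guarded test (12): case k stays in cluster q'
  end if
end program termination_test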

Example

To check our statement, we used a small data set (table 1) and a program published by SPÄTH (1975, 1980). There are N = 40 cases in R². The partition length is m = 10. The membership vector p was preselected as in column 3 of table 1. The error function E² is 6807.67 and cannot be reduced further; object k = 40 is shifted back and forth between cluster 6 and 7.


References

ANDERBERG, M. R. (1973): Cluster Analysis for Applications. New York: Academic Press.
BAUER, F. L. und H. WÖSSNER (1981): Algorithmische Sprache und Programmentwicklung. Berlin: Springer Verlag.
HARTIGAN, J. A. (1975): Clustering Algorithms. New York: John Wiley & Sons.

Table 1. Sample data set for termination test of k-means
p = membership vector; x(1) = first coordinate in R²; x(2) = second coordinate in R².
(The table lists the N = 40 cases with their preselected cluster memberships p_i and their coordinates x(1), x(2).)
