EDV in Medizin und Biologie 12 (4), 124-126, 1981, ISSN 0300-8282
© Eugen Ulmer GmbH & Co., Stuttgart; Gustav Fischer Verlag KG, Stuttgart
On the Nontermination of the K-Means Clustering Algorithm for Certain Data Sets
C. Möbus
Summary
There are some well-accepted standards for the formulation of algorithms and computer programs (BAUER and WÖSSNER, 1981): (a) efficiency, (b) finiteness of description and (c) finiteness of operation. It will be demonstrated that the popular K-Means clustering algorithm loops infinitely for some data sets if the computer program is only a formula translation of its mathematically correct form. This contradicts standard (c) and is due to the finite precision of computer arithmetic. Besides its theoretical aspects the article has a practical purpose: it shows ways of correcting some published FORTRAN programs so that termination is achieved for all data sets.

Zusammenfassung (English translation)
In computer science there are some generally accepted standards for the formulation of algorithms and computer programs (BAUER and WÖSSNER, 1981): (a) efficiency, (b) finiteness of description and (c) finiteness of execution. It can be shown that the popular K-Means cluster algorithm falls into an endless loop for certain data sets if the program is merely a formula translation of the correct mathematical form. Such programs violate standard (c). Besides the theoretical considerations the article also has a practical aspect. It gives some hints on how several FORTRAN programs published in textbooks can be modified so that finiteness of execution is guaranteed for all data sets.
Method
The k-means algorithm (MACQUEEN, 1967) is a simple non-hierarchical method which seeks to minimize an error function E by assigning each case optimally to a cluster. Many textbooks on clustering contain complete FORTRAN codes (ANDERBERG, 1973; HARTIGAN, 1975; HERDEN and STEINHAUSEN, 1979; SPÄTH, 1975, 1980; STEINHAUSEN, 1977).
A cluster with number p is defined as a nonempty set of indices $C_p \subset \{1, \ldots, N\}$. It contains $N_p$ objects $x_i$ (with $\sum_p N_p = N$, $i \in C_p$).

A partition of length m of these N objects is defined as a collection of m nonempty sets of indices such that

$$C_1 \cup C_2 \cup \ldots \cup C_m = \{1, \ldots, N\} \tag{1a}$$

$$C_j \cap C_k = \emptyset \quad (j \neq k;\ \emptyset = \text{empty set}) \tag{1b}$$

A partition of length m is characterized by a membership vector p of length N. If $p_i = j$ then object i belongs to cluster j. In k-means the number m is fixed and p has to be preselected. The algorithm tries to change the sets of indices $C_1, \ldots, C_m$ in a stepwise manner so that the sets remain nonempty and the error function reaches a minimum:
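$$E(C_1, \ldots, C_m) = \sum_{j=1}^{m} \sum_{i \in C_j} \lVert x_i - \bar{x}_j \rVert^2 \tag{2}$$

with:
$x_i$ = row vector of data values of object i
$\bar{x}_j$ = row vector with group means (centroid)

If cluster p contains $N_p$ cases, the data values can be arranged in an $N_p \times r$ data matrix $X_p$ (r = number of variables). The centroid is

$$\bar{x}_p = \frac{1}{N_p}\, \mathbf{1}' X_p \tag{3}$$

($\mathbf{1}'$ = row vector of ones with $N_p$ elements).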
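The definitions (1) to (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the four-point toy data set are made up here and do not come from the article or from any of the cited FORTRAN programs.

```python
def clusters_from_membership(p, m):
    """Index sets C_1..C_m (1-based object numbers) from membership vector p."""
    clusters = [[] for _ in range(m)]
    for i, j in enumerate(p, start=1):
        clusters[j - 1].append(i)
    return clusters


def centroid(rows):
    """Equation (3): column means of the cluster's data matrix."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def error_function(data, p, m):
    """Equation (2): squared distances of all objects to their own centroid."""
    total = 0.0
    for members in clusters_from_membership(p, m):
        if not members:
            raise ValueError("every cluster must be nonempty")
        rows = [data[i - 1] for i in members]
        c = centroid(rows)
        total += sum((a - b) ** 2 for x in rows for a, b in zip(x, c))
    return total


data = [(0.0, 0.0), (2.0, 0.0), (0.0, 4.0), (0.0, 6.0)]  # hypothetical toy data
p = [1, 1, 2, 2]                                          # membership vector
E = error_function(data, p, m=2)  # centroids are (1, 0) and (0, 5)
```

Each of the four toy points lies at squared distance 1 from its centroid, so the error function of this partition is 4.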
The dispersion of the p-th cluster can be measured by the variation around the centroid

$$E_p = \sum_{i \in C_p} (x_i - \bar{x}_p)(x_i - \bar{x}_p)' = \operatorname{tr}(X_p X_p') - N_p\, \bar{x}_p \bar{x}_p' \tag{4}$$
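A quick numerical check of the two forms of the dispersion may help here. The sketch below uses exact rational arithmetic and a made-up cluster of three points; the variable names are assumptions chosen for this illustration.

```python
from fractions import Fraction

rows = [(1, 2), (3, 2), (5, 8)]  # hypothetical cluster: N_p = 3, r = 2
n = len(rows)
mean = [Fraction(sum(col), n) for col in zip(*rows)]  # centroid (3, 4)

# First form: sum of squared deviations from the centroid.
deviation_form = sum((x - c) ** 2 for row in rows for x, c in zip(row, mean))

# Second form: trace of X X' minus N_p times the squared length of the centroid.
trace_xx = sum(x * x for row in rows for x in row)
trace_form = trace_xx - n * sum(c * c for c in mean)
```

Both expressions give 32 for this cluster (107 minus 3 times 25 in the second form).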
Now a new case k is assigned to the old cluster p on a trial-and-error basis. The updated centroid is

$$\bar{x}_p^{+} = \frac{1}{N_p + 1}\,\left(N_p\, \bar{x}_p + x_k\right) \tag{5}$$

with:
$\bar{x}_p$ = old centroid
$x_k$ = data vector of case k
$\bar{x}_p^{+}$ = new centroid

and the updated error (dispersion of cluster $p^{+}$) is

$$E_p^{+} = E_p + \frac{N_p}{N_p + 1}\, d_{kp}^2 = E_p + \frac{N_p}{N_p + 1}\,(x_k - \bar{x}_p)(x_k - \bar{x}_p)' \tag{6}$$

with:
$E_p$ = dispersion of old cluster
$E_p^{+}$ = dispersion of new cluster
$d_{kp}^2$ = square of distance from data point k to old centroid

The updating of the error is analogous when a case k leaves the cluster q. The new error is

$$E_q^{-} = E_q - \frac{N_q}{N_q - 1}\, d_{kq}^2 = E_q - \frac{N_q}{N_q - 1}\,(x_k - \bar{x}_q)(x_k - \bar{x}_q)' \tag{7}$$
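The update formulas (5) to (7) can be checked against a direct recomputation. The following sketch assumes a made-up two-point cluster and a made-up case x_k; the helper names are illustrative only, and exact fractions are used so that the comparison is not disturbed by rounding.

```python
from fractions import Fraction


def centroid(rows):
    return [Fraction(sum(col), len(rows)) for col in zip(*rows)]


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def dispersion(rows):
    c = centroid(rows)
    return sum(sq_dist(x, c) for x in rows)


def centroid_after_adding(c_p, n_p, x_k):
    """Equation (5)."""
    return [(n_p * c + x) / (n_p + 1) for c, x in zip(c_p, x_k)]


def error_after_adding(e_p, n_p, d2_kp):
    """Equation (6)."""
    return e_p + Fraction(n_p, n_p + 1) * d2_kp


def error_after_removing(e_q, n_q, d2_kq):
    """Equation (7)."""
    return e_q - Fraction(n_q, n_q - 1) * d2_kq


old = [(0, 0), (2, 0)]  # hypothetical cluster
x_k = (4, 3)            # hypothetical case joining it
new = old + [x_k]

moved_centroid = centroid_after_adding(centroid(old), len(old), x_k)
added = error_after_adding(dispersion(old), len(old), sq_dist(x_k, centroid(old)))
removed = error_after_removing(dispersion(new), len(new), sq_dist(x_k, centroid(new)))
```

Adding the case raises the dispersion from 2 to 14, and removing it again with (7) returns exactly 2, which matches recomputing each dispersion from scratch.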
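The case k (k = 1, ..., N) is reassigned to cluster p if

$$E_{\text{new}} < E_{\text{old}} \tag{8a}$$

where

$$E_{\text{old}} = E_p + E_q, \qquad E_{\text{new}} = E_p^{+} + E_q^{-} = E_p + B + E_q - A,$$

with $B = \frac{N_p}{N_p + 1}\, d_{kp}^2$ the increase from (6) and $A = \frac{N_q}{N_q - 1}\, d_{kq}^2$ the decrease from (7), or equivalently if

$$B - A < 0 \tag{8b}$$

$$B < A \tag{8c}$$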
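In exact arithmetic the test (8c) is a one-line comparison. The sketch below is an illustration, not code from any of the cited programs; the function name and the sample distances are assumptions.

```python
from fractions import Fraction


def should_reassign(n_p, d2_kp, n_q, d2_kq):
    """Condition (8c): move case k from cluster q to cluster p if B < A."""
    if n_q < 2:
        raise ValueError("cluster q must stay nonempty after k leaves")
    b = Fraction(n_p, n_p + 1) * d2_kp  # increase of E_p when k joins p
    a = Fraction(n_q, n_q - 1) * d2_kq  # decrease of E_q when k leaves q
    return b < a


closer_to_p = should_reassign(n_p=5, d2_kp=4, n_q=6, d2_kq=25)  # B = 10/3, A = 30
tie = should_reassign(n_p=5, d2_kp=36, n_q=6, d2_kq=25)         # B = A = 30
```

With exact fractions the tie is recognised as a tie, so the case is left where it is.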
Reallocation of objects is finished if no k, p, q can be found such that (8) is true. The condition (8) can be regarded as the core of the k-means algorithm. If the condition (8) is programmed for a digital computer (e.g. in FORTRAN like: IF (B .LT. A) ...), k-means will not terminate for certain data sets, as will be shown now.

The nontermination of k-means

Suppose we have a data set for which A = B. So there is the mathematical equality

$$\frac{N_p}{N_p + 1}\, d_{kp}^2 = \frac{N_q}{N_q - 1}\, d_{kq}^2 \tag{9}$$

and k-means should not assign k to cluster p if k is in q. If we further assume that $N_p = 5$ and $N_q = 6$, the equality is not violated:

$$\frac{5}{6}\, d_{kp}^2 = \frac{6}{5}\, d_{kq}^2 \tag{10}$$
However, if the precision of the digital computer is e.g. 6 decimals, it calculates 5/6 = 0.833333. Now we have the inequality

$$0.833333\, d_{kp}^2 < \frac{6}{5}\, d_{kq}^2 \tag{11}$$

in the internal representation of the computer. The consequence of this is the ugly result that case k is shifted between clusters p and q without end!

Table 1. Sample data set for termination test of k-means

The only method to avoid infinite looping is to reformulate the condition (8). One possibility is the following FORTRAN statement:

      IF ((A-B) .GT. EPS) ...          (12)

with EPS as a small machine-dependent constant.
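The effect can be imitated with Python's decimal module. The sketch below rests on several assumptions that are not in the article: a machine with 6 significant decimal digits that chops instead of rounding, the made-up squared distances 36 and 25 (chosen so that A = B = 30 exactly), the value 0.001 for EPS, and a sweep limit that exists only to stop the demonstration.

```python
from decimal import Context, Decimal, ROUND_DOWN
from fractions import Fraction

# Assumption: 6 significant decimal digits, results chopped.
SIX = Context(prec=6, rounding=ROUND_DOWN)


def to_decimal(value):
    f = Fraction(value)
    return SIX.divide(Decimal(f.numerator), Decimal(f.denominator))


def weighted(num, den, d2):
    """Evaluate (num / den) * d2 with 6-digit arithmetic."""
    return SIX.multiply(SIX.divide(Decimal(num), Decimal(den)), to_decimal(d2))


def count_moves(eps=None, max_sweeps=20):
    """Count how often case k changes cluster within max_sweeps tests."""
    n_own, d2_own = 6, Fraction(25)      # cluster holding k (q in the text)
    n_other, d2_other = 5, Fraction(36)  # candidate cluster (p in the text)
    moves = 0
    for _ in range(max_sweeps):
        a = weighted(n_own, n_own - 1, d2_own)
        b = weighted(n_other, n_other + 1, d2_other)
        if eps is None:
            move = b < a                     # IF (B .LT. A)
        else:
            move = SIX.subtract(a, b) > eps  # IF ((A-B) .GT. EPS)
        if not move:
            break
        # Exact geometry after the move: the squared distance of k to the
        # centroid it joins shrinks by (n/(n+1))^2, and the squared distance
        # to the centroid it leaves grows by (n/(n-1))^2.
        joined = Fraction(n_other, n_other + 1) ** 2 * d2_other
        left = Fraction(n_own, n_own - 1) ** 2 * d2_own
        n_own, n_other = n_other + 1, n_own - 1
        d2_own, d2_other = joined, left
        moves += 1
    return moves


naive_moves = count_moves()                        # plain comparison
guarded_moves = count_moves(eps=Decimal("0.001"))  # comparison with EPS
```

Under these assumptions B is evaluated as 29.9999 and A as 30.0, and after each move the situation is mirrored exactly, so the plain comparison moves the case in every sweep until the limit is hit, while the guarded comparison never moves it. How large EPS has to be on a real machine depends on its arithmetic and on the scale of the data.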
Example
To check our statement, we used a small data set (table 1) and a program published by SPÄTH (1975, 1980). There are N = 40 cases in R². The partition length is m = 10. The membership vector p was preselected like column 3 in table 1. The error function E is 6807.67 and cannot be reduced further. Object k = 40 is shifted back and forth between clusters 6 and 7.
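The tie behind this behaviour can be reproduced in exact arithmetic from the coordinates of table 1. One assumption is needed, because the membership column is not used here: clusters 6 and 7 are taken to be cases 26 to 30 and cases 31 to 35, with case 40 currently attached to the first of them. The helper names are illustrative.

```python
from fractions import Fraction

# Coordinates as read from table 1; the cluster assignment is an assumption.
cluster_q = [(298, 202), (299, 201), (300, 200), (301, 199), (302, 198)]  # cases 26-30
cluster_p = [(398, 202), (399, 201), (400, 200), (401, 199), (402, 198)]  # cases 31-35
x_k = (350, 250)                                                          # case 40


def centroid(rows):
    return [Fraction(sum(col), len(rows)) for col in zip(*rows)]


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


with_k = cluster_q + [x_k]              # case 40 currently in cluster q
n_q, n_p = len(with_k), len(cluster_p)  # 6 and 5, as in equation (10)
a = Fraction(n_q, n_q - 1) * sq_dist(x_k, centroid(with_k))
b = Fraction(n_p, n_p + 1) * sq_dist(x_k, centroid(cluster_p))
```

Under this assumption A and B are both exactly 12500/3, so the data set realises the tie of equation (9) with N_p = 5 and N_q = 6.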
References
ANDERBERG, M. R. (1973): Cluster Analysis for Applications. New York: Academic Press.
BAUER, F. L. und H. WÖSSNER (1981): Algorithmische Sprache und Programmentwicklung. Berlin: Springer Verlag.
HARTIGAN, J. A. (1975): Clustering Algorithms. New York: John Wiley & Sons.

p = membership vector
x(1) = first coordinate in R²
x(2) = second coordinate in R²
[column p is illegible in this copy apart from two entries of 10]

case    x(1)    x(2)
   1   202.0   398.0
   2   201.0   399.0
   3   200.0   400.0
   4   199.0   401.0
   5   198.0   402.0
   6   298.0   398.0
   7   299.0   399.0
   8   300.0   400.0
   9   301.0   401.0
  10   302.0   402.0
  11   402.0   398.0
  12   401.0   399.0
  13   400.0   400.0
  14   399.0   401.0
  15   398.0   402.0
  16   298.0   298.0
  17   299.0   299.0
  18   300.0   300.0
  19   301.0   301.0
  20   302.0   302.0
  21   398.0   298.0
  22   399.0   299.0
  23   400.0   300.0
  24   401.0   301.0
  25   402.0   302.0
  26   298.0   202.0
  27   299.0   201.0
  28   300.0   200.0
  29   301.0   199.0
  30   302.0   198.0
  31   398.0   202.0
  32   399.0   201.0
  33   400.0   200.0
  34   401.0   199.0
  35   402.0   198.0
  36   250.0   350.0
  37   350.0   350.0
  38   250.0   250.0
  39   200.0   200.0
  40   350.0   250.0