EDV in Medizin und Biologie 12 (4), 124-126, ISSN 0300-8282
© Eugen Ulmer GmbH & Co., Stuttgart / Gustav Fischer Verlag KG, Stuttgart

On the Nontermination of the K-Means Clustering Algorithm for Certain Data Sets

C. Möbus

Summary

There are some well accepted standards for the formulation of algorithms and computer programs (BAUER and WÖSSNER, 1981): (a) efficiency, (b) finiteness of description and (c) finiteness of operation. It will be demonstrated that the popular K-means clustering algorithm loops infinitely for some data sets if the computer program is only a formula translation of its mathematically correct form. There is a contradiction to standard (c) due to the finite precision of computer arithmetic. Besides its theoretical aspects the article has a practical purpose. It shows ways of correcting some published FORTRAN programs so that termination is achieved for all data sets.

Zusammenfassung

In computer science there are some generally accepted standards for the formulation of algorithms and computer programs (BAUER und WÖSSNER, 1981): (a) efficiency, (b) finiteness of the description and (c) finiteness of the execution. It can be shown that the popular K-means cluster algorithm falls into an endless loop for certain data sets if the program has emerged merely from a formula translation of the correct mathematical form. Such programs violate standard (c). Besides the theoretical considerations the article also has a practical aspect. Readers are given hints on how some FORTRAN programs published in textbooks can be modified so that finiteness of execution is guaranteed for all data sets.

Method

The k-means algorithm (MACQUEEN, 1967) is a simple non-hierarchical method which seeks to minimize an error function E² by assigning each case optimally to a cluster. Many textbooks on clustering contain complete FORTRAN codes (ANDERBERG, 1973; HARTIGAN, 1975; HERDEN and STEINHAUSEN, 1979; SPÄTH, 1975, 1980; STEINHAUSEN, 1977).

A cluster with number p is defined as a nonempty set of indices C_p ⊆ {1, ..., N}. It contains N_p objects x_i (with Σ_p N_p = N, i ∈ C_p). A partition of length m of these N objects is defined as a collection of m nonempty sets of indices such that

(1a)  C_1 \cup C_2 \cup \ldots \cup C_m = \{1, \ldots, N\}

(1b)  C_j \cap C_k = \emptyset for all j \neq k  (\emptyset = empty set)

A partition of length m is characterized by a membership vector p of length N. If p_i = j then object i belongs to cluster j.

In k-means the number m is fixed and p has to be preselected. The algorithm tries to change the sets of indices C_1, ..., C_m in a stepwise manner so that the sets remain nonempty and the error function reaches a minimum:

(2)  E^2 = \sum_{p=1}^{m} \sum_{i \in C_p} (x_i - \bar{x}_p)(x_i - \bar{x}_p)'

with:
x_i = row vector of data values of object i
\bar{x}_p = row vector with group means (centroid) of cluster p
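To make the definitions concrete, the following minimal sketch computes the error function (2) from a membership vector. It is written in modern free-form Fortran rather than the FORTRAN of the cited textbook programs, and the data values, array names and dimensions are invented for illustration only.

program error_function
  implicit none
  integer, parameter :: n = 6, nvar = 2, m = 2
  real    :: x(n, nvar)          ! data matrix, one row per case
  integer :: p(n)                ! membership vector, p(i) = cluster of case i
  real    :: cent(m, nvar)       ! centroids
  integer :: np(m)               ! cluster sizes
  real    :: e2
  integer :: i, j

  x = reshape([1.0, 2.0, 3.0, 7.0, 8.0, 9.0,  &
               1.0, 1.0, 2.0, 8.0, 9.0, 9.0], [n, nvar])
  p = [1, 1, 1, 2, 2, 2]

  ! centroids: mean of the cases assigned to each cluster
  np = 0
  cent = 0.0
  do i = 1, n
     np(p(i)) = np(p(i)) + 1
     cent(p(i), :) = cent(p(i), :) + x(i, :)
  end do
  do j = 1, m
     cent(j, :) = cent(j, :) / real(np(j))
  end do

  ! E^2 = sum of squared distances of each case to its own centroid, eq. (2)
  e2 = 0.0
  do i = 1, n
     e2 = e2 + sum((x(i, :) - cent(p(i), :))**2)
  end do
  write (*, '(a, f10.4)') 'E^2 = ', e2
end program error_function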

If cluster p contains N_p cases the data values can be arranged in a N_p × n data matrix X_p (n = number of variables). The centroid is

(3)  \bar{x}_p = \frac{1}{N_p} \mathbf{1}' X_p    (\mathbf{1}' = row vector with N_p ones)

The dispersion of the p-th cluster can be measured by the variation around the centroid:

(4)  E_p^2 = \sum_{i \in C_p} (x_i - \bar{x}_p)(x_i - \bar{x}_p)' = \sum_{i \in C_p} x_i x_i' - N_p \bar{x}_p \bar{x}_p'
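The two forms of (4) can be checked numerically with a small sketch of the same kind (again modern Fortran with invented data; all variable names are assumptions):

program dispersion_check
  implicit none
  integer, parameter :: np = 4, nvar = 2
  real :: xp(np, nvar), xbar(nvar), e2_direct, e2_shortcut
  integer :: i

  xp = reshape([1.0, 2.0, 4.0, 5.0,  &
                1.0, 3.0, 3.0, 5.0], [np, nvar])

  xbar = sum(xp, dim=1) / real(np)                      ! eq. (3)

  e2_direct = 0.0
  do i = 1, np
     e2_direct = e2_direct + sum((xp(i, :) - xbar)**2)  ! left side of eq. (4)
  end do
  e2_shortcut = sum(xp**2) - real(np) * sum(xbar**2)    ! right side of eq. (4)

  write (*, '(a, 2f10.4)') 'E^2_p both ways: ', e2_direct, e2_shortcut
end program dispersion_check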

Now a new case k is assigned to the old cluster p on a trial and error basis. The updated centroid is

(5)  \bar{x}_{p+} = \frac{1}{N_p + 1} (N_p \bar{x}_p + x_k)

\bar{x}_p = old centroid
x_k = data vector of case k
\bar{x}_{p+} = new centroid

and the updated error (dispersion of cluster p ∪ {k}) is

(6)  E_{p+}^2 = E_p^2 + \frac{N_p}{N_p + 1} (x_k - \bar{x}_p)(x_k - \bar{x}_p)' = E_p^2 + \frac{N_p}{N_p + 1} d_{kp}^2

E_p^2 = dispersion of old cluster p
E_{p+}^2 = dispersion of new cluster p ∪ {k}
d_{kp}^2 = square of distance from data point k to the old centroid \bar{x}_p

The updating of the error is analogous when a case k leaves the cluster q. The new error is

(7)  E_{q-}^2 = E_q^2 - \frac{N_q}{N_q - 1} (x_k - \bar{x}_q)(x_k - \bar{x}_q)' = E_q^2 - \frac{N_q}{N_q - 1} d_{kq}^2
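Equations (5) to (7) lend themselves to small update routines. The following sketch (modern Fortran; the cluster statistics and all names are invented and not taken from the published programs) moves one case k from cluster q to cluster p and updates both centroids and dispersions incrementally:

program update_demo
  implicit none
  real    :: centp(2), centq(2), xk(2), e2p, e2q
  integer :: np, nq

  ! invented current statistics of two clusters p and q
  centp = [1.0, 1.0];  np = 5;  e2p = 4.0
  centq = [4.0, 4.0];  nq = 6;  e2q = 9.0
  xk    = [2.0, 3.0]                      ! data vector of case k, currently in q

  call remove_case(xk, centq, nq, e2q)    ! eq. (7): k leaves cluster q
  call add_case(xk, centp, np, e2p)       ! eqs. (5) and (6): k joins cluster p

  write (*, '(a, 2f8.3, a, f8.3)') 'cluster p: centroid', centp, '   E^2 =', e2p
  write (*, '(a, 2f8.3, a, f8.3)') 'cluster q: centroid', centq, '   E^2 =', e2q

contains
  ! eqs. (5)/(6): add case xk to a cluster, updating centroid and dispersion
  subroutine add_case(xk, cent, nmem, e2)
    real,    intent(in)    :: xk(:)
    real,    intent(inout) :: cent(:), e2
    integer, intent(inout) :: nmem
    real :: d2
    d2   = sum((xk - cent)**2)                         ! d^2 to the old centroid
    e2   = e2 + real(nmem) / real(nmem + 1) * d2       ! eq. (6)
    cent = (real(nmem) * cent + xk) / real(nmem + 1)   ! eq. (5)
    nmem = nmem + 1
  end subroutine add_case

  ! eq. (7): remove case xk from a cluster
  subroutine remove_case(xk, cent, nmem, e2)
    real,    intent(in)    :: xk(:)
    real,    intent(inout) :: cent(:), e2
    integer, intent(inout) :: nmem
    real :: d2
    d2   = sum((xk - cent)**2)                         ! d^2 to the old centroid
    e2   = e2 - real(nmem) / real(nmem - 1) * d2       ! eq. (7)
    cent = (real(nmem) * cent - xk) / real(nmem - 1)   ! inverse of eq. (5)
    nmem = nmem - 1
  end subroutine remove_case
end program update_demo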

The case k (k = 1, ..., N) is reassigned to cluster p if

(8a)  E_{new}^2 < E_{old}^2

where E_{old}^2 = E_p^2 + E_q^2 and E_{new}^2 = E_{p+}^2 + E_{q-}^2 = E_p^2 + E_q^2 + B - A, with B = \frac{N_p}{N_p + 1} d_{kp}^2 and A = \frac{N_q}{N_q - 1} d_{kq}^2, or equivalently if

(8b)  B - A < 0

(8c)  B < A

Reallocation of objects is finished if no k, p, q can be found so that (8) is true. The condition (8) can be regarded as the core of the algorithm.

If the condition (8) is programmed for a digital computer (e.g. in FORTRAN like: IF (B .LT. A) ...), k-means will not terminate for certain data sets, as will be shown now.
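A direct formula translation of condition (8c) into a reallocation pass might look like the following sketch. It is modern free-form Fortran with invented data and is not the organization of the published textbook programs; for simplicity it recomputes centroids after every accepted move instead of using the incremental updates (5) to (7). The comparison B .LT. A in the inner loop is exactly the naive test whose consequences are examined next.

program reallocation_pass
  implicit none
  integer, parameter :: n = 6, nvar = 2, m = 2
  real    :: x(n, nvar), cent(m, nvar), d2(m), a, b
  integer :: p(n), np(m), i, j, q
  logical :: moved

  ! invented data: six cases in R^2 and a preselected membership vector
  x = reshape([1.0, 2.0, 3.0, 7.0, 8.0, 5.0,  &
               1.0, 1.0, 2.0, 8.0, 9.0, 5.0], [n, nvar])
  p = [1, 1, 1, 2, 2, 1]

  moved = .true.
  do while (moved)                              ! repeat passes until no case moves
     moved = .false.
     call centroids()
     do i = 1, n
        q = p(i)
        if (np(q) .eq. 1) cycle                 ! clusters must remain nonempty
        do j = 1, m
           d2(j) = sum((x(i, :) - cent(j, :))**2)
        end do
        a = real(np(q)) / real(np(q) - 1) * d2(q)      ! loss when i leaves q, eq. (7)
        do j = 1, m
           if (j .eq. q) cycle
           b = real(np(j)) / real(np(j) + 1) * d2(j)   ! gain when i joins j, eq. (6)
           if (b .lt. a) then                   ! naive formula translation of (8c)
              p(i) = j
              moved = .true.
              call centroids()                  ! refresh statistics after the move
              exit
           end if
        end do
     end do
  end do
  write (*, '(a, 6i3)') 'final membership vector: ', p

contains
  subroutine centroids()
    integer :: ii, jj
    np = 0
    cent = 0.0
    do ii = 1, n
       np(p(ii)) = np(p(ii)) + 1
       cent(p(ii), :) = cent(p(ii), :) + x(ii, :)
    end do
    do jj = 1, m
       cent(jj, :) = cent(jj, :) / real(np(jj))
    end do
  end subroutine centroids
end program reallocation_pass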

The nontermination of k-means

Suppose we have a data set for which A = B. So there is the mathematical equality

(9)  \frac{N_p}{N_p + 1} d_{kp}^2 = \frac{N_q}{N_q - 1} d_{kq}^2

and k-means should not assign k to cluster p if k is in q. If we further assume that N_p = 5 and N_q = 6, the equality is not violated:

(10)  \frac{5}{6} d_{kp}^2 = \frac{6}{5} d_{kq}^2

However, if the precision of the digital computer is e.g. 6 decimals, it calculates 5/6 = 0.833333. Now we have the inequality

(11)  0.833333 \, d_{kp}^2 < \frac{6}{5} d_{kq}^2

in the internal representation of the computer. The consequence of this is the ugly result that case k is being shifted between cluster p and q without end! The only method to avoid infinite looping is to reformulate the condition (8).

An opportunity could be the following FORTRAN statement

(12)  IF ((A - B) .GT. EPS) ...

with EPS as a small machine-dependent constant.
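To make the argument of (9) to (12) concrete, the following sketch assumes d²_kp = 36 and d²_kq = 25, so that mathematically A = B = 30, and simulates the 6-decimal machine of the article by writing 5/6 as the truncated constant 0.833333; the value of EPS is an assumption. The naive test (8c) accepts the spurious move, while the reformulated test (12) rejects it:

program termination_test
  implicit none
  real :: dkp2, dkq2, a, b, eps

  dkp2 = 36.0               ! squared distance of case k to the centroid of p
  dkq2 = 25.0               ! squared distance of case k to the centroid of q
  eps  = 1.0e-4             ! small machine dependent constant for test (12)

  b = 0.833333 * dkp2       ! N_p/(N_p+1) = 5/6, truncated to 6 decimals
  a = 6.0 / 5.0 * dkq2      ! N_q/(N_q-1) = 6/5; mathematically a = b = 30

  ! naive formula translation of (8c): the spurious move is accepted
  if (b .lt. a) write (*, *) 'naive test (8c): case k would move from q to p'

  ! reformulated condition (12): the spurious move is rejected
  if ((a - b) .gt. eps) then
     write (*, *) 'guarded test (12): case k would move from q to p'
  else
     write (*, *) 'guarded test (12): case k stays in cluster q'
  end if
end program termination_test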

Example

To check our statement, we used a small data set (table 1) and a program published by SPÄTH (1975, 1980). There are N = 40 cases in R². The partition length is m = 10. The membership vector p was preselected as in column 3 of table 1. The error function E² is 6807.67 and cannot be reduced further; object k = 40 is shifted back and forth between cluster 6 and 7.


References

ANDERBERG, M. R. (1973): Cluster Analysis for Applications. New York: Academic Press.
BAUER, F. L. und H. WÖSSNER (1981): Algorithmische Sprache und Programmentwicklung. Berlin: Springer Verlag.
HARTIGAN, J. A. (1975): Clustering Algorithms. New York: John Wiley & Sons.

Table 1. Sample data set for termination test of k-means
p = membership vector; x(1) = first coordinate in R²; x(2) = second coordinate in R².
(The table lists the N = 40 cases with their preselected cluster memberships p_i and their coordinates x(1), x(2).)
