The Effectiveness of Lloyd-type Methods for the k-Means Problem

Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, Chaitanya Swamy§

Abstract

We investigate variants of Lloyd’s heuristic for clustering high dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd’s heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances. This is the first performance guarantee for a variant of Lloyd’s heuristic. The provision of a guarantee on output quality does not come at the expense of speed: some of our algorithms are candidates for being faster in practice than currently used variants of Lloyd’s method. In addition, our other algorithms are faster on well-clusterable instances than recently proposed approximation algorithms, while maintaining similar guarantees on clustering quality. Our main algorithmic contribution is a novel probabilistic seeding process for the starting configuration of a Lloyd-type iteration.

1 Introduction

Overview. There is presently a wide and unsatisfactory gap between the practical and theoretical clustering literatures. For decades, practitioners have been using heuristics of great speed but uncertain merit; the latter should not be surprising since the problem is NP-hard in almost any formulation. However, in the last few years, algorithms researchers have made considerable innovations, and even obtained polynomial-time approximation schemes (PTAS's) for some of the most popular clustering formulations. Yet these contributions have not had a noticeable impact on practice. Practitioners instead continue to use a variety of heuristics (Lloyd, EM, agglomerative methods, etc.) that have no known performance guarantees.

There are two ways to approach this disjuncture. The most obvious is to continue developing new techniques until they are so good—down to the implementations—that they displace entrenched methods. The other is to look toward popular heuristics and ask whether there are reasons that justify their extensive use, but elude the standard theoretical criteria; and in addition, whether theoretical scrutiny suggests improvements in their application. This is the approach we take in this paper.

As in other prominent cases [47, 41], such an analysis typically involves some abandonment of the worst-case inputs criterion. (In fact, part of the challenge is to identify simple conditions on the input that allow one to prove a performance guarantee of wide applicability.) Our starting point is the notion that (as discussed in [45]) one should be concerned with k-clustering data that possesses a meaningful k-clustering.

rafail@cs.ucla.edu. Computer Science Department, University of California at Los Angeles, 90095, USA. Supported in part by an IBM Faculty Award, a Xerox Innovation Group Award, a gift from Teradata, an Intel equipment grant, and NSF Cybertrust grant no. 0430254.

rabani@cs.technion.ac.il. Computer Science Department, Technion — Israel Institute of Technology, Haifa 32000, Israel. Part of this work was done while visiting UCLA and Caltech. Supported in part by ISF 52/03, BSF 2002282, and the Fund for the Promotion of Research at the Technion.

schulman@caltech.edu. Caltech, Pasadena, CA 91125. Supported in part by NSF CCF-0515342, NSA H98230-06-1-0074, and NSF ITR CCR-0326554.

§cswamy@math.uwaterloo.ca. Dept. of Combinatorics and Optimization, Univ. Waterloo, Waterloo, ON N2L 3G1.

Research supported partially by NSERC grant 32760-06. Work done while the author was a postdoctoral scholar at Caltech.


What does it mean for the data to have a meaningful k-clustering? Here are two examples of settings where one would intuitively not consider the data to possess a meaningful k-clustering. If nearly optimum cost can be achieved by two very different k-way partitions of the data, then the identity of the optimal partition carries little meaning (for example, if the data was generated by random sampling from a source, then the optimal cluster regions might shift drastically upon resampling). Alternatively, if a near-optimal k-clustering can be achieved by a partition into fewer than k clusters, then that smaller value of k should be used to cluster the data. If near-optimal k-clusterings are hard to find only when they provide ambiguous classification or marginal benefit (i.e., in the absence of a meaningful k-clustering), then such hardness should not be viewed as an acceptable obstacle to algorithm development. Instead, the performance criteria should be revised.

Specifically, we consider the k-means formulation of clustering: given a finite set $X \subseteq \mathbb{R}^d$, find $k$ points ("centers") to minimize the sum over all points $x \in X$ of the squared distance between $x$ and the center to which it is assigned. In an optimal solution, each center is assigned the data in its Voronoi region and is located at the center of mass of this data. Perhaps the most popular heuristic used for this problem is Lloyd's method, which consists of the following two phases: (a) "seed" the process with some initial centers (the literature contains many competing suggestions of how to do this); (b) iterate the following Lloyd step until the clustering is "good enough": cluster all the data in the Voronoi region of a center together, and then move the center to the centroid of its cluster.
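For concreteness, the following is a minimal sketch of the generic Lloyd iteration just described (our own illustration, not pseudocode from the paper); the seeding in phase (a) is left to the caller, since that is exactly the part this paper focuses on.

```python
import numpy as np

def lloyd(X, init_centers, n_iter=100, tol=1e-9):
    """Generic Lloyd iteration: start from the given seed centers, then
    repeat the Lloyd step (Voronoi assignment + recentering) until the
    k-means cost stops improving.  X is (n, d); init_centers is (k, d)."""
    X = np.asarray(X, dtype=float)
    centers = np.array(init_centers, dtype=float)
    prev_cost = np.inf
    for _ in range(n_iter):
        # assign every point to its nearest center (Voronoi regions)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        cost = d2[np.arange(len(X)), labels].sum()
        if prev_cost - cost <= tol:
            break
        prev_cost = cost
        # move each center to the centroid of its Voronoi region
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels, cost
```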

Although Lloyd-style methods are widely used, to our knowledge, there is no known mathematical analysis that attempts to explain or predict the performance of these heuristics. In this paper, we take the first step in this direction. We show that if the data is well-clusterable according to a certain "clusterability" or "separation" condition (that we introduce and discuss below), then various Lloyd-style methods do indeed perform well and return a provably near-optimal clustering. Our contributions are threefold:

(a) We introduce a separation condition and justify it as a reasonable abstraction of well-clusterability for the analysis of k-means clustering algorithms. Our condition is simple, and abstracts a notion of well-clusterability alluded to earlier: letting $\Delta_k^2(X)$ denote the cost of an optimal k-means solution of input $X$, we say that $X$ is $\epsilon$-separated for k-means if $\Delta_k^2(X)/\Delta_{k-1}^2(X) \le \epsilon^2$. (A similar condition for $k = 2$ was used for $\ell_2^2$ edge-cost clustering in [45].)

Our motivation for proposing this condition is that a significant drop in the k-clustering cost is already used by practitioners as a diagnostic for choosing the value of $k$ ([14], §10.10); a minimal code sketch of this diagnostic appears after this list. Furthermore, we show that: (i) The data satisfies our separation condition if and only if it satisfies the other intuitive notion of well-clusterability suggested earlier, namely that any two low-cost k-clusterings disagree on only a small fraction of the data; and (ii) The condition is robust under noisy (even adversarial) perturbation of the data. In Section 5 we prove rigorous versions of (i) and (ii).

(b) We present a novel and efficient sampling process for seeding Lloyd’s method with initial centers, which allows us to prove the effectiveness of these methods.

(c) We demonstrate the effectiveness of (our variants of) the Lloyd heuristic under the separation condition. Specifically: (i) Our simplest variant uses only the new seeding procedure, requires a single Lloyd-type descent step, and achieves a constant-factor approximation in time linear in $|X|$. This algorithm has success probability exponentially small in $k$, but we show that (ii) a slightly more complicated seeding process based on our sampling procedure yields a constant-factor approximation guarantee with constant probability, again in linear time. Since only one run of seeding+descent is required in both algorithms, these are candidates for being faster in practice than currently used Lloyd variants, which are used with multiple re-seedings and many Lloyd steps per re-seeding. (iii) We also give a PTAS by combining our seeding process with a sampling procedure of Kumar, Sabharwal and Sen [30], whose running time is linear in $|X|$ and exponential in $k$. This PTAS is significantly faster, and also simpler, than the PTAS of Kumar et al. [30] (applying the separation condition to both algorithms; the latter does not run faster under the condition).
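The diagnostic mentioned in item (a) can be sketched as follows. This is our own hedged illustration, not the paper's procedure: it only estimates $\Delta_k^2(X)$ with repeated Lloyd runs (the separation condition is defined in terms of the optimal costs), so small ratios are suggestive rather than conclusive.

```python
import numpy as np

def separation_profile(X, k_max, n_init=10, n_iter=50, rng=None):
    """Estimate Delta_k^2(X) for k = 1..k_max with repeated randomly seeded
    Lloyd runs and return the ratios Delta_k^2 / Delta_{k-1}^2; a small ratio
    at some k suggests the data may be epsilon-separated for k-means."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)

    def lloyd_cost(k):
        best = np.inf
        for _ in range(n_init):
            centers = X[rng.choice(len(X), size=k, replace=False)].copy()
            for _ in range(n_iter):
                d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
                labels = d2.argmin(axis=1)
                for j in range(k):
                    if np.any(labels == j):
                        centers[j] = X[labels == j].mean(axis=0)
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            best = min(best, d2.min(axis=1).sum())
        return best

    costs = [lloyd_cost(k) for k in range(1, k_max + 1)]
    return {k: costs[k - 1] / costs[k - 2] for k in range(2, k_max + 1)}
```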


Literature and problem formulation. Let $X \subseteq \mathbb{R}^d$ be the given point set and $n = |X|$. In the k-means problem, the objective is to partition $X$ into $k$ clusters $\bar{X}_1, \ldots, \bar{X}_k$ and assign each point in every cluster $\bar{X}_i$ to a common center $\bar{c}_i \in \mathbb{R}^d$, so as to minimize the "k-means cost" $\sum_{i=1}^{k} \sum_{x \in \bar{X}_i} \|x - \bar{c}_i\|^2$, where $\|\cdot\|$ denotes the $\ell_2$ norm. We let $\Delta_k^2(X)$ denote the optimum k-means cost. Observe that given the centers $\bar{c}_1, \ldots, \bar{c}_k$, it is easy to determine the best clustering corresponding to these centers: cluster $\bar{X}_i$ simply consists of all points $x \in X$ for which $\bar{c}_i$ is the nearest center (breaking ties arbitrarily). Conversely, given a clustering $\bar{X}_1, \ldots, \bar{X}_k$, the best centers corresponding to this clustering are obtained by setting $\bar{c}_i$ to be the center of mass (centroid) of cluster $\bar{X}_i$, that is, setting $\bar{c}_i = \frac{1}{|\bar{X}_i|} \sum_{x \in \bar{X}_i} x$. It follows that both of these properties simultaneously hold in an optimal solution, that is, $\bar{c}_i$ is the centroid of cluster $\bar{X}_i$, and each point in $\bar{X}_i$ has $\bar{c}_i$ as its nearest center.

The problem of minimizing the k-means cost is one of the earliest and most intensively studied formulations of the clustering problem, both because of its mathematical elegance and because it bears closely on statistical estimation of mixture models of $k$ point sources under spherically symmetric Gaussian noise.

We briefly survey the most relevant literature here. The k-means problem seems to have been first considered by Steinhaus in 1956 [48]. A simple greedy iteration to minimize cost was suggested in 1957 by Lloyd [32] (and less methodically in the same year by Cox [9]; also apparently by psychologists between 1959-67 [49]). This and similar iterative descent methods soon became the dominant approaches to the problem [35, 33, 12, 31] (see also [19, 20, 24] and the references therein); they remain so today, and are still being improved [1, 42, 44, 28]. Lloyd's method (in any variant) converges only to local optima, however, and is sensitive to the choice of the initial centers [38]. Consequently, a lot of research has been directed toward seeding methods that try to start off Lloyd's method with a good initial configuration [18, 29, 17, 23, 46, 5, 36, 43]. Very few theoretical guarantees are known about Lloyd's method or its variants. The convergence rate of Lloyd's method has recently been investigated in [10, 22, 2]; in particular, [2] shows that Lloyd's method can require a superpolynomial number of iterations to converge.

The k-means problem is NP-hard even for $k = 2$ [13]. Recently there has been substantial progress in developing approximation algorithms for this problem. Matoušek [34] gave the first PTAS for this problem, with running time polynomial in $n$, for a fixed $k$ and dimension. Subsequently a succession of algorithms have appeared [40, 4, 11, 15, 16, 21, 30] with varying runtime dependence on $n$, $k$ and the dimension. The most recent of these is the algorithm of Kumar, Sabharwal and Sen [30], which gives a linear time PTAS for a fixed $k$. There are also various constant-factor approximation algorithms for the related k-median problem [26, 7, 6, 25, 37], which also yield approximation algorithms for k-means, and have running time polynomial in $n$, $k$ and the dimension; recently Kanungo et al. [27] adapted the k-median algorithm of [3] to obtain a $(9+\epsilon)$-approximation algorithm for k-means.

However, none of these methods match the simplicity and speed of the popular Lloyd's method. Researchers concerned with the runtime of Lloyd's method bemoan the need for $n$ nearest-neighbor computations in each descent step [28]! Interestingly, the last reference provides a data structure that provably speeds up the nearest-neighbor calculations of Lloyd descent steps, under the condition that the optimal clusters are well-separated. (This is unrelated to providing performance guarantees for the outcome.) Their data structure may be used in any Lloyd variant, including ours, and is well suited to the conditions under which we prove performance of our method; however, ironically, it may not be worthwhile to precompute their data structure since our method requires so few descent steps.

2 Preliminaries

We use the following notation throughout. For a point set $S$, we use $\mathrm{ctr}(S)$ to denote the center of mass of $S$. Let the partition $X_1 \cup \cdots \cup X_k = X$ be an optimal k-means clustering of the input $X$, and let $c_i = \mathrm{ctr}(X_i)$ and $c = \mathrm{ctr}(X)$. So $\Delta_k^2(X) = \sum_{i=1}^{k} \sum_{x \in X_i} \|x - c_i\|^2 = \sum_{i=1}^{k} \Delta_1^2(X_i)$. Let $n_i = |X_i|$, $n = |X|$, and $r_i^2 = \Delta_1^2(X_i)/n_i$, that is, $r_i^2$ is the "mean squared error" in cluster $X_i$. Define $D_i = \min_{j \neq i} \|c_j - c_i\|$. We assume throughout that $X$ is $\epsilon$-separated for k-means, that is, $\Delta_k^2(X) \le \epsilon^2 \Delta_{k-1}^2(X)$, where $0 < \epsilon \le \epsilon_0$ with $\epsilon_0$ being a suitably small constant. We use the following basic lemmas quite frequently.

Lemma 2.1 For every $x$, $\sum_{y \in X} \|x - y\|^2 = \Delta_1^2(X) + n\|x - c\|^2$. Hence $\sum_{\{x,y\} \subseteq X} \|x - y\|^2 = n\Delta_1^2(X)$.
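As a quick numerical sanity check of Lemma 2.1 (our own snippet, not part of the paper), both identities can be verified on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                  # arbitrary point set
n, c = len(X), X.mean(axis=0)
delta1 = np.sum((X - c) ** 2)                 # Delta_1^2(X)

x = X[7]                                      # any fixed point x
lhs = np.sum((X - x) ** 2)                    # sum over y of ||x - y||^2
assert np.isclose(lhs, delta1 + n * np.sum((x - c) ** 2))

# sum over unordered pairs {x, y} of ||x - y||^2 equals n * Delta_1^2(X)
pair_sum = 0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum()
assert np.isclose(pair_sum, n * delta1)
```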

Lemma 2.2 Consider any set $S \subseteq \mathbb{R}^d$ and any partition $S_1 \cup S_2$ of $S$ with $S_1 \neq \emptyset$. Let $s, s_1, s_2$ denote respectively $\mathrm{ctr}(S), \mathrm{ctr}(S_1), \mathrm{ctr}(S_2)$. Then, (i) $\Delta_1^2(S) = \Delta_1^2(S_1) + \Delta_1^2(S_2) + \frac{|S_1||S_2|}{|S|}\|s_1 - s_2\|^2$, and (ii) $\|s_1 - s\|^2 \le \frac{\Delta_1^2(S)}{|S|} \cdot \frac{|S_2|}{|S_1|}$.

Proof: Let $a = |S_1|$ and $b = |S_2| = |S| - |S_1|$. We have
\[
\Delta_1^2(S) = \sum_{x \in S_1} \|x - s\|^2 + \sum_{x \in S_2} \|x - s\|^2 = \bigl(\Delta_1^2(S_1) + a\|s_1 - s\|^2\bigr) + \bigl(\Delta_1^2(S_2) + b\|s_2 - s\|^2\bigr) \quad\text{(by Lemma 2.1)}
\]
\[
= \Delta_1^2(S_1) + \Delta_1^2(S_2) + \frac{ab}{a+b} \cdot \|s_1 - s_2\|^2.
\]
The second equality follows from Lemma 2.1 by noting that $s$ is also the center of mass of the point set where $a$ points are located at $s_1$ and $b$ points are located at $s_2$, and so the optimal 1-means cost of this point set is given by $a\|s_1 - s\|^2 + b\|s_2 - s\|^2$. This proves part (i). Part (ii) follows by substituting $\|s_1 - s\| = \|s_1 - s_2\| \cdot b/(a+b)$ in part (i) and dropping the $\Delta_1^2(S_1)$ and $\Delta_1^2(S_2)$ terms.

3 The 2-means problem

We first consider the 2-means case. We assume that the input $X$ is $\epsilon$-separated for 2-means. We present an algorithm that returns a solution of cost at most $(1 + f(\epsilon))\Delta_2^2(X)$ in linear time, for a suitably defined function $f$ that satisfies $\lim_{\epsilon \to 0} f(\epsilon) = 0$. An appealing feature of our algorithm is its simplicity, both in description and analysis. In Section 4, where we consider the k-means case, we will build upon this algorithm to obtain both a linear time constant-factor (of the form $1 + f(\epsilon)$) approximation algorithm and a PTAS with running time exponential in $k$, but linear in $n, d$.

The chief algorithmic novelty in our 2-means algorithm is a non-uniform sampling process to pick two seed centers. Our sampling process is very simple: we pick the pair $x, y \in X$ with probability proportional to $\|x - y\|^2$. This biases the distribution towards pairs that contribute a large amount to $\Delta_1^2(X)$ (noting that $n\Delta_1^2(X) = \sum_{\{x,y\} \subseteq X} \|x - y\|^2$). We emphasize that, as improving the seeding is the only way to get Lloyd's method to find a high-quality clustering, the topic of picking the initial seed centers has received much attention in the experimental literature (see, e.g., [43] and references therein). However, to the best of our knowledge, this simple and intuitive seeding method is new to the vast literature on the k-means problem. By putting more weight on pairs that contribute a lot to $\Delta_1^2(X)$, the sampling process aims to pick the initial centers from the cores of the two optimal clusters. We define the core of a cluster precisely later, but loosely speaking, it consists of points in the cluster that are significantly closer to this cluster-center than to any other center. Lemmas 3.1 and 3.2 make the benefits of this approach precise. Thus, in essence, we are able to leverage the separation condition to nearly isolate the optimal centers. Once we have the initial centers within the cores of the two optimal clusters, we show that a simple Lloyd-like step, which is also simple to analyze, yields a good performance guarantee: we consider a suitable ball around each center and move the center to the centroid of this ball to obtain the final centers. This "ball-k-means" step is adopted from Effros and Schulman [16], where it is shown that if the k-means cost of the current solution is small compared to $\Delta_{k-1}^2(X)$ (which holds for us since the initial centers lie in the cluster-cores) then a Lloyd step followed by a ball-k-means step yields a clustering of cost close to $\Delta_k^2(X)$. In our case, we are able to eliminate the Lloyd step, and show that the ball-k-means step alone guarantees a good clustering.

1. Sampling. Randomly select a pair of points from the set $X$ to serve as the initial centers, picking the pair $x, y \in X$ with probability proportional to $\|x - y\|^2$. Let $\hat{c}_1, \hat{c}_2$ denote the two picked centers.

2. "Ball-k-means" step. For each $\hat{c}_i$, consider the ball of radius $\|\hat{c}_1 - \hat{c}_2\|/3$ around $\hat{c}_i$ and compute the centroid $\bar{c}_i$ of the portion of $X$ in this ball. Return $\bar{c}_1, \bar{c}_2$ as the final centers.

Running time. The entire algorithm runs in time $O(nd)$. Step 2 clearly takes only $O(nd)$ time. We show that the sampling step can be implemented to run in $O(nd)$ time. Consider the following two-step sampling procedure: (a) first pick center $\hat{c}_1$ by choosing a point $x \in X$ with probability equal to $\frac{\sum_{y \in X} \|x - y\|^2}{\sum_{x,y \in X} \|x - y\|^2} = \frac{\Delta_1^2(X) + n\|x - c\|^2}{2n\Delta_1^2(X)}$ (using Lemma 2.1); (b) pick the second center by choosing point $y \in X$ with probability equal to $\|y - \hat{c}_1\|^2 / \bigl(\Delta_1^2(X) + n\|c - \hat{c}_1\|^2\bigr)$. This two-step sampling procedure is equivalent to the sampling process in step 1, that is, it picks the pair $x_1, x_2 \in X$ with probability $\frac{\|x_1 - x_2\|^2}{\sum_{\{x,y\} \subseteq X} \|x - y\|^2}$. Each step takes only $O(nd)$ time since $\Delta_1^2(X)$ can be precomputed in $O(nd)$ time.

Analysis. The analysis hinges on the important fact that under the separation condition, the radius $r_i$ of each optimal cluster is substantially smaller than the inter-cluster separation $\|c_1 - c_2\|$ (Lemma 3.1). This allows us to show in Lemma 3.2 that with high probability, each initial center $\hat{c}_i$ lies in the core (suitably defined) of a distinct optimal cluster, say $X_i$, and hence $\|c_1 - c_2\|$ is much larger than the distances $\|\hat{c}_i - c_i\|$ for $i = 1, 2$. Assuming that $\hat{c}_1, \hat{c}_2$ lie in the cores of the clusters, we prove in Lemma 3.3 that the ball around $\hat{c}_i$ contains only, and most of, the mass of cluster $X_i$, and therefore the centroid $\bar{c}_i$ of this ball is very "close" to $c_i$. This in turn implies that the cost of the clustering around $\bar{c}_1, \bar{c}_2$ is small.

Lemma 3.1 $\max(r_1^2, r_2^2) \le \frac{\epsilon^2}{1-\epsilon^2}\|c_1 - c_2\|^2 = O(\epsilon^2)\|c_1 - c_2\|^2$.

Proof: By part (i) of Lemma 2.2 we have $\Delta_1^2(X) = \Delta_2^2(X) + \frac{n_1 n_2}{n} \cdot \|c_1 - c_2\|^2$, which is equivalent to $\frac{n}{n_1 n_2} \cdot \Delta_2^2(X) = \|c_1 - c_2\|^2 \cdot \frac{\Delta_2^2(X)}{\Delta_1^2(X) - \Delta_2^2(X)}$. This implies that $r_1^2 \cdot \frac{n}{n_2} + r_2^2 \cdot \frac{n}{n_1} \le \frac{\epsilon^2}{1-\epsilon^2}\|c_1 - c_2\|^2$.

Let $\rho = \frac{100\epsilon^2}{1-\epsilon^2}$. We require that $\rho < 1$. We define the core of cluster $X_i$ as the set $X_i^{cor} = \bigl\{x \in X_i : \|x - c_i\|^2 \le \frac{r_i^2}{\rho}\bigr\}$. By Markov's inequality, $|X_i^{cor}| \ge (1-\rho)n_i$ for $i = 1, 2$.

Lemma 3.2 $\Pr\bigl[\{\hat{c}_1, \hat{c}_2\} \cap X_1^{cor} \neq \emptyset \text{ and } \{\hat{c}_1, \hat{c}_2\} \cap X_2^{cor} \neq \emptyset\bigr] = 1 - O(\rho)$.

Proof: To simplify our expressions, we assume that all the points are scaled by $\frac{1}{\|c_1 - c_2\|}$ (so $\|c_1 - c_2\| = 1$). By part (i) of Lemma 2.2, we have $\Delta_1^2(X) = \Delta_2^2(X) + \frac{n_1 n_2}{n} \cdot \|c_1 - c_2\|^2$, which implies that $\Delta_1^2(X) \le \frac{n_1 n_2}{n(1-\epsilon^2)}$. Let $c'_i$ denote the center of mass of $X_i^{cor}$. Applying part (ii) of Lemma 2.2 (taking $S = X_i$ and $S_1 = X_i^{cor}$) we get that $\|c'_i - c_i\|^2 \le \frac{\rho}{1-\rho} \cdot r_i^2$. The probability of the event in the lemma is $A/B$ where $A = \sum_{x \in X_1^{cor}} \sum_{y \in X_2^{cor}} \|x - y\|^2 = |X_1^{cor}|\,\Delta_1^2(X_2^{cor}) + |X_2^{cor}|\,\Delta_1^2(X_1^{cor}) + |X_1^{cor}||X_2^{cor}|\,\|c'_1 - c'_2\|^2$, and $B = \sum_{\{x,y\} \subseteq X} \|x - y\|^2 = n\Delta_1^2(X) \le \frac{n_1 n_2}{1-\epsilon^2}$. By the above bounds on $\|c'_i - c_i\|$ and Lemma 3.1, we get $\|c'_1 - c'_2\| \ge 1 - 2\epsilon\sqrt{\frac{\rho}{(1-\rho)(1-\epsilon^2)}}$. So $A = \bigl(1 - O(\rho)\bigr)n_1 n_2$, and $A/B = 1 - O(\rho)$.

So we may assume that each initial center $\hat{c}_i$ lies in $X_i^{cor}$. Let $\hat{d} = \|\hat{c}_1 - \hat{c}_2\|$ and $B_i = \{x \in X : \|x - \hat{c}_i\| \le \hat{d}/3\}$. Recall that $\bar{c}_i$ is the centroid of $B_i$, and we return $\bar{c}_1, \bar{c}_2$ as our final solution.


Lemma 3.3 For each $i$, we have $X_i^{cor} \subseteq B_i \subseteq X_i$. Hence, $\|\bar{c}_i - c_i\|^2 \le \frac{\rho}{1-\rho} \cdot r_i^2$.

Proof: By Lemma 3.1 and the definition of $X_i^{cor}$, we know that $\|\hat{c}_i - c_i\| \le \theta\|c_1 - c_2\|$ for $i = 1, 2$, where $\theta = \frac{\epsilon}{\sqrt{\rho(1-\epsilon^2)}} \le \frac{1}{10}$. So $\frac{4}{5} \le \frac{\hat{d}}{\|c_1 - c_2\|} \le \frac{6}{5}$. For any $x \in B_i$ we have $\|x - c_i\| \le \frac{\hat{d}}{3} + \|\hat{c}_i - c_i\| \le \frac{\|c_1 - c_2\|}{2}$, so $x \in X_i$. Also for any $x \in X_i^{cor}$, $\|x - \hat{c}_i\| \le 2\theta\|c_1 - c_2\| \le \frac{\hat{d}}{3}$, so $x \in B_i$. Now by part (ii) of Lemma 2.2, with $S = X_i$ and $S_1 = B_i$, we obtain that $\|\bar{c}_i - c_i\|^2 \le \frac{\rho}{1-\rho} \cdot r_i^2$ since $|B_i| \ge |X_i^{cor}|$ for each $i$.

Theorem 3.4 The above algorithm returns a clustering of cost at most $\frac{\Delta_2^2(X)}{1-\rho}$ with probability at least $1 - O(\rho)$ in time $O(nd)$, where $\rho = \Theta(\epsilon^2)$.

Proof: The cost of the solution is at most $\sum_{i} \sum_{x \in X_i} \|x - \bar{c}_i\|^2 = \sum_{i} \bigl(\Delta_1^2(X_i) + n_i\|\bar{c}_i - c_i\|^2\bigr) \le \frac{\Delta_2^2(X)}{1-\rho}$.

4 The k-means problem

We now consider the k-means setting. We assume that $\Delta_k^2(X) \le \epsilon^2\Delta_{k-1}^2(X)$. We describe a linear time constant-factor approximation algorithm, and a PTAS that returns a $(1+\omega)$-optimal solution in time $O\bigl(2^{O(k/\omega)}nd\bigr)$. The algorithms consist of various ingredients, which we describe separately first for ease of understanding, before gluing them together to obtain the final algorithm.

Conceptually both algorithms proceed in two stages. The first stage is a seeding stage, which performs the bulk of the work and guarantees that at the end of this stage there are $k$ seed centers positioned at nearly the right locations. By this we mean that if we consider distances at the scale of the inter-cluster separation, then at the end of this stage, each optimal center has a (distinct) initial center located in close proximity — this is precisely the leverage that we obtain from the k-means separation condition (as in the 2-means case). We shall employ three simple seeding procedures with varying time vs. quality guarantees that will exploit this condition and seed the $k$ centers at locations very close to the optimal centers. In Section 4.1.1, we consider a natural generalization of the sampling procedure used for the 2-means case, and show that this picks the $k$ initial centers from the cores of the optimal clusters. This sampling procedure runs in linear time but it succeeds with probability that is exponentially small in $k$. In Section 4.1.2, we present a very simple deterministic greedy deletion procedure, where we start off with all points in $X$ as the centers and then greedily delete points (and move centers) until there are $k$ centers left. The running time here is $O(n^3 d)$. Our deletion procedure is similar to the reverse greedy algorithm proposed by Chrobak, Kenyon and Young [8] for the k-median problem. Chrobak et al. show that their reverse greedy algorithm attains an approximation ratio of $O(\log n)$, which is tight up to a factor of $\log\log n$. In contrast, for the k-means problem, if $\Delta_k^2(X) \le \epsilon^2\Delta_{k-1}^2(X)$, we show that our greedy deletion procedure followed by a clean-up step (in the second stage) yields a $(1 + f(\epsilon))$-approximation algorithm. Finally, in Section 4.1.3 we combine the sampling and deletion procedures to obtain an $O(nkd + k^3 d)$-time initialization procedure. We sample $O(k)$ centers, which ensures that every cluster has an initial center in a slightly expanded version of the core, and then run the deletion procedure on an instance of size $O(k)$ derived from the sampled points to obtain the $k$ seed centers.

Once the initial centers have been positioned sufficiently close to the optimal centers, we can proceed in two ways in the second stage (Section 4.2). One option is to use a ball-k-means step, as in 2-means, which yields a clustering of cost $(1 + f(\epsilon))\Delta_k^2(X)$ due to exactly the same reasons as in the 2-means case. Thus, combined with the initialization procedure of Section 4.1.3, this yields a constant-factor approximation algorithm with running time $O(nkd + k^3 d)$. The entire algorithm is summarized in Section 4.3.
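A minimal sketch of such a ball-k-means step for $k$ centers is shown below (our own code; we reuse the 2-means choice of one third of the distance to the nearest other seed as the ball radius, which may differ from the exact radius used in the analysis of Section 4.2):

```python
import numpy as np

def ball_kmeans_step(X, centers, shrink=3.0):
    """For each seed center, take the ball of radius (distance to the nearest
    other seed) / shrink and move the center to the centroid of the points of
    X inside that ball."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    new_centers = centers.copy()
    for i, ci in enumerate(centers):
        others = np.delete(centers, i, axis=0)
        radius = np.min(np.linalg.norm(others - ci, axis=1)) / shrink
        mask = np.linalg.norm(X - ci, axis=1) <= radius
        if mask.any():                       # keep the old seed if the ball is empty
            new_centers[i] = X[mask].mean(axis=0)
    return new_centers
```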

The other option, which yields a PTAS, is to use a sampling idea of Kumar et al. [30]. For each initial center, we compute a list of candidate centers for the corresponding optimal cluster as follows: we sample a small set of points uniformly at random from a slightly expanded Voronoi region of the initial center, and consider the centroid of every subset of the sampled set of a certain size as a candidate. We exhaustively search for the $k$ candidates (picking one candidate per initial center) that yield the least cost solution, and output these as our final centers. The fact that each optimal center $c_i$ has an initial center in close proximity allows us to argue that the entire optimal cluster $X_i$ is contained in the expanded Voronoi region of this initial center, and moreover that $|X_i|$ is a significant fraction of the total mass in this region. Given this property, as argued by Kumar et al. (Lemma 2.3 in [30]), a random sample from the expanded Voronoi region also (essentially) yields a random sample from $X_i$, which allows us to compute a good estimate of the centroid of $X_i$, and hence of $\Delta_1^2(X_i)$. We obtain a $(1+\omega)$-optimal solution in time $O\bigl(2^{O(k/\omega)}nd\bigr)$ with constant probability. Since we incur an exponential dependence on $k$ anyway, we just use the simple sampling procedure of Section 4.1.1 in the first stage to pick the $k$ initial centers. Although the running time is exponential in $k$, it is significantly better than the running time of $O\bigl(2^{(k/\omega)^{O(1)}}nd\bigr)$ incurred by the algorithm of Kumar et al.; we also obtain a simpler PTAS. Both of these features can be traced to the separation condition, which enables us to nearly isolate the positions of the optimal centers in the first stage.

Kumar et al. do not have any such facility, and therefore need to sequentially “guess” (i.e., exhaustively search) the various centroids, incurring a corresponding increase in the run time. This PTAS is described in Section 4.4.

4.1 Seeding procedures used in stage I

4.1.1 Sampling

We pick $k$ initial centers as follows: first pick two centers $\hat{c}_1, \hat{c}_2$ as in the 2-means case, that is, choose $x, y \in X$ with probability proportional to $\|x - y\|^2$. Suppose we have already picked $i$ centers $\hat{c}_1, \ldots, \hat{c}_i$, where $2 \le i < k$. Now pick a random point $x \in X$ with probability proportional to $\min_{j \in \{1,\ldots,i\}} \|x - \hat{c}_j\|^2$ and set that as center $\hat{c}_{i+1}$.

Running time. The sampling procedure consists of $k$ iterations, each of which takes $O(nd)$ time. This is because after sampling a new point $\hat{c}_{i+1}$, we can update the quantity $\min_{j \in \{1,\ldots,i+1\}} \|x - \hat{c}_j\|$ for each point $x$ in $O(d)$ time. So the overall running time is $O(nkd)$.
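A minimal sketch of this seeding procedure (our own code, not the paper's) follows; it maintains the current minimum squared distances so that each of the $k$ iterations costs $O(nd)$:

```python
import numpy as np

def sample_seed_centers(X, k, rng=None):
    """Pick k seed centers: the first two as in the 2-means case (via the
    two-step procedure of Section 3), each further center being a point x
    chosen with probability proportional to min_j ||x - c_j||^2."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    n = len(X)

    # first center: Pr[x] proportional to Delta_1^2(X) + n * ||x - c||^2
    c = X.mean(axis=0)
    sq = np.sum((X - c) ** 2, axis=1)
    w = sq.sum() + n * sq
    centers = [X[rng.choice(n, p=w / w.sum())]]

    # second center: Pr[y] proportional to ||y - c_1||^2
    d2 = np.sum((X - centers[0]) ** 2, axis=1)         # current min sq. distances
    centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    d2 = np.minimum(d2, np.sum((X - centers[-1]) ** 2, axis=1))

    # remaining centers: Pr[x] proportional to min_j ||x - c_j||^2
    while len(centers) < k:
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
        d2 = np.minimum(d2, np.sum((X - centers[-1]) ** 2, axis=1))
    return np.array(centers)
```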

Analysis. Let $\epsilon^2 \le \rho < 1$ be a parameter that we will set later. As in the 2-means case, we define the core of cluster $X_i$ as $X_i^{cor} = \bigl\{x \in X_i : \|x - c_i\|^2 \le \frac{r_i^2}{\rho}\bigr\}$. We show that under our separation assumption, the above sampling procedure will pick the $k$ initial centers to lie in the cores of the clusters $X_1, \ldots, X_k$ with probability $(1 - O(\rho))^k$. We also show in Lemma 4.4 that if more than $k$, but still $O(k)$, points are sampled, then with constant probability, every cluster will contain a sampled point that lies in a somewhat larger core, that we call the outer core of the cluster. This analysis will be useful in Section 4.1.3.

Lemma 4.1 With probability $1 - O(\rho)$, the first two centers $\hat{c}_1, \hat{c}_2$ lie in the cores of different clusters, that is, $\Pr\bigl[\bigcup_{i \neq j} (\hat{c}_1 \in X_i^{cor} \text{ and } \hat{c}_2 \in X_j^{cor})\bigr] = 1 - O(\rho)$.

Proof: The key observation is that for any pair of distinct clusters $X_i, X_j$, the 2-means separation condition holds, that is, $\Delta_2^2(X_i \cup X_j) = \Delta_1^2(X_i) + \Delta_1^2(X_j) \le \epsilon^2\Delta_1^2(X_i \cup X_j)$. This is because
\[
\Delta_{k-1}^2(X) \le \sum_{\ell \neq i,j} \Delta_1^2(X_\ell) + \Delta_1^2(X_i \cup X_j) = \Delta_k^2(X) + \bigl(\Delta_1^2(X_i \cup X_j) - \Delta_2^2(X_i \cup X_j)\bigr).
\]


So $\Delta_1^2(X_i \cup X_j) - \Delta_2^2(X_i \cup X_j) \ge \bigl(\frac{1}{\epsilon^2} - 1\bigr)\Delta_k^2(X) \ge \bigl(\frac{1}{\epsilon^2} - 1\bigr)\Delta_2^2(X_i \cup X_j)$. So using Lemma 3.2 we obtain that $\sum_{x \in X_i^{cor},\, y \in X_j^{cor}} \|x - y\|^2 = \bigl(1 - O(\rho)\bigr)\sum_{\{x,y\} \subseteq X_i \cup X_j} \|x - y\|^2$. Summing over all pairs $i, j$ yields the lemma.

Now inductively suppose that the first $i$ centers picked, $\hat{c}_1, \ldots, \hat{c}_i$, lie in the cores of clusters $X_{j_1}, \ldots, X_{j_i}$. We show that conditioned on this event, center $\hat{c}_{i+1}$ lies in the core of some cluster $X_\ell$ where $\ell \notin \{j_1, \ldots, j_i\}$ with probability $1 - O(\rho)$. Given a set $S$ of points, we use $d(x, S)$ to denote $\min_{y \in S} \|x - y\|$.

Lemma 4.2 $\Pr\bigl[\hat{c}_{i+1} \in \bigcup_{\ell \notin \{j_1,\ldots,j_i\}} X_\ell^{cor} \,\bigm|\, \hat{c}_1, \ldots, \hat{c}_i \text{ lie in the cores of } X_{j_1}, \ldots, X_{j_i}\bigr] = 1 - O(\rho)$.

Proof: For notational convenience, re-index the clusters so that $\{j_1, \ldots, j_i\} = \{1, \ldots, m\}$. Let $\hat{C} = \{\hat{c}_1, \ldots, \hat{c}_i\}$. For any cluster $X_j$, let $p_j \in \{1, \ldots, i\}$ be the index such that $d(c_j, \hat{C}) = \|c_j - \hat{c}_{p_j}\|$. Let $A = \sum_{j=m+1}^{k} \sum_{x \in X_j^{cor}} d(x, \hat{C})^2$, and $B = \sum_{j=1}^{k} \sum_{x \in X_j} d(x, \hat{C})^2$. Observe that the probability of the event stated in the lemma is exactly $A/B$. Let $\alpha$ denote the maximum over all $j \ge m+1$ of the quantity $\max_{x \in X_j^{cor}} \|x - c_j\| / d(c_j, \hat{C})$. For any point $x \in X_j^{cor}$, $j \ge m+1$, we have $d(x, \hat{C}) \ge (1-\alpha)d(c_j, \hat{C})$. Note that by Lemma 3.1, $\alpha \le \frac{\epsilon/\sqrt{\rho(1-\epsilon^2)}}{1 - \epsilon/\sqrt{\rho(1-\epsilon^2)}} \le \frac{2\epsilon}{\sqrt{\rho(1-\epsilon^2)}} < 1$ for a small enough $\epsilon$. Therefore,
\[
A = \sum_{j=m+1}^{k} \sum_{x \in X_j^{cor}} d(x, \hat{C})^2 \ge \sum_{j=m+1}^{k} (1-\rho)(1-\alpha)^2 n_j d(c_j, \hat{C})^2 \ge (1-\rho-2\alpha)\sum_{j=m+1}^{k} n_j d(c_j, \hat{C})^2.
\]

On the other hand, for any point $x \in X_j$, $j = 1, \ldots, k$, we have $d(x, \hat{C}) \le \|x - \hat{c}_{p_j}\|$. Also note that for $j = 1, \ldots, m$, $\hat{c}_{p_j}$ lies in $X_j^{cor}$, so $\|c_j - \hat{c}_{p_j}\| \le \frac{r_j}{\sqrt{\rho}}$. Therefore,
\[
B \le \sum_{j=1}^{k} \sum_{x \in X_j} \|x - \hat{c}_{p_j}\|^2 = \sum_{j=1}^{k} \bigl(\Delta_1^2(X_j) + n_j\|c_j - \hat{c}_{p_j}\|^2\bigr) \le \Bigl(1 + \frac{1}{\rho}\Bigr)\Delta_k^2(X) + \sum_{j=m+1}^{k} n_j d(c_j, \hat{C})^2.
\]

Finally, for any $j = m+1, \ldots, k$, if we assign all the points in cluster $X_j$ to the point $\hat{c}_{p_j}$, then the increase in cost is exactly $n_j\|c_j - \hat{c}_{p_j}\|^2$ and at least $\Delta_{k-1}^2(X) - \Delta_k^2(X)$. Therefore $\bigl(\frac{1}{\epsilon^2} - 1\bigr)\Delta_k^2(X) \le n_j d(c_j, \hat{C})^2$ for any $j = m+1, \ldots, k$, and $B \le \bigl(1 + \frac{(1+1/\rho)\epsilon^2}{1-\epsilon^2}\bigr)\sum_{j=m+1}^{k} n_j d(c_j, \hat{C})^2$. Comparing with $A$ and plugging in the value of $\alpha$, we get that $A = \bigl(1 - O(\rho + \epsilon/\sqrt{\rho})\bigr)B$. If we set $\rho = \Omega(\epsilon^{2/3})$, we obtain $A/B = 1 - O(\rho)$.

Next, we analyze the case when more than $k$ points are sampled. Let $\rho_1 = \rho^3$. Define the outer core of $X_i$ to be $X_i^{out} = \bigl\{x \in X_i : \|x - c_i\|^2 \le \frac{r_i^2}{\rho_1}\bigr\}$. Note that $X_i^{cor} \subseteq X_i^{out}$. Let $N = \frac{2k}{1-5\rho} + \frac{2\ln(2/\delta)}{(1-5\rho)^2}$, where $0 < \delta < 1$ is a desired error tolerance. We prove in Lemma 4.3 that at every sampling step, there is a constant probability that the sampled point lies in the core of some cluster whose outer core does not contain a previously sampled point. The crucial difference between this lemma and Lemma 4.2 is that Lemma 4.2 only shows that the "good" event happens conditioned on the fact that previous samples were also "good", whereas here we give an unconditional bound. Using this, Lemma 4.4 shows that if we sample $N$ points from $X$, then with some constant probability, each outer core $X_i^{out}$ will contain a sampled point. The proof is based on a straightforward martingale analysis.

Lemma 4.3 Suppose that we have sampled $i$ points $\{\hat{x}_1, \ldots, \hat{x}_i\}$ from $X$. Let $X_1, \ldots, X_m$ be all the clusters whose outer cores contain some sampled point $\hat{x}_j$. Then $\Pr[\hat{x}_{i+1} \in \bigcup_{j=m+1}^{k} X_j^{cor}] \ge 1 - 5\rho$.

Proof: For $i = 0, 1$ this follows from Lemma 4.1. We mimic the proof of Lemma 4.2. Let $\hat{C} = \{\hat{x}_1, \ldots, \hat{x}_i\}$. We have $X_j^{out} \cap \hat{C} \neq \emptyset$ for $j = 1, \ldots, m$ and $X_j^{out} \cap \hat{C} = \emptyset$ for $j = m+1, \ldots, k$. Let $\alpha$ denote the maximum over all $j \ge m+1$ of the quantity $(\max_{x \in X_j^{cor}} \|x - c_j\|)/d(c_j, \hat{C})$. Here we have $\alpha \le \sqrt{\rho_1/\rho} < 1$. Then for any point $x \in X_j^{cor}$, $j \ge m+1$, we have $d(x, \hat{C}) \ge (1-\alpha)d(c_j, \hat{C})$ and as in Lemma 4.2, $A = \sum_{j=m+1}^{k} \sum_{x \in X_j^{cor}} d(x, \hat{C})^2 \ge (1-\rho-2\alpha)\sum_{j=m+1}^{k} n_j d(c_j, \hat{C})^2$. On the other hand, again arguing as in Lemma 4.2, we have $B = \sum_{j=1}^{k} \sum_{x \in X_j} d(x, \hat{C})^2 \le \bigl(1 + \frac{(1+1/\rho_1)\epsilon^2}{1-\epsilon^2}\bigr)\sum_{j=m+1}^{k} n_j d(c_j, \hat{C})^2$. Therefore $A/B \ge 1 - \bigl(\rho + 2\sqrt{\rho_1/\rho} + \frac{\epsilon^2}{\rho_1} + \epsilon^2\bigr)$. Since $\rho_1 = \rho^3$, taking $\rho = \sqrt{\epsilon}$ gives $A/B \ge 1 - 5\rho$.

Lemma 4.4 Suppose we sample $N$ points $\hat{x}_1, \ldots, \hat{x}_N$ from $X$ using the above sampling procedure. Then $\Pr[\forall j = 1, \ldots, k,\ \text{there exists some } \hat{x}_i \in X_j^{out}] \ge 1 - \delta$.

Proof: Let $Y_t$ be a random variable that denotes the number of clusters that do not contain a sampled point in their outer cores, after $t$ points have been sampled. We want to bound $\Pr[Y_N > 0]$. Consider the following random walk on the line with $W_t$ denoting the (random) position after $t$ time steps: $W_0 = k$, and $W_{t+1} = W_t$ with probability $5\rho$ and $W_t - 1$ with probability $1-5\rho$. Notice that $\Pr[Y_N > 0] \le \Pr[W_N > 0]$, because as long as $W_t > 0$, any outcome that leads to a left move in the random walk can be mapped to an outcome (in the probability space corresponding to the sampling process) where the outer core of a new cluster is hit by the currently sampled point. So we bound $\Pr[W_N > 0]$. Define $Z_t = W_t + t(1-5\rho)$. Then $\mathrm{E}\bigl[Z_{t+1} \mid Z_1, \ldots, Z_t\bigr] \le Z_t$, so $Z_0, Z_1, \ldots$ forms a supermartingale. Clearly $|Z_{t+1} - Z_t| \le 1$ for all $t$. So by Azuma's inequality (see, e.g., [39]), $\Pr[Z_N - Z_0 > \sqrt{2N\ln(2/\delta)}] \le \delta$, which implies that $W_N \le k + \sqrt{2N\ln(2/\delta)} - N(1-5\rho)$ with probability at least $1-\delta$. Plugging in the value of $N$ shows that $N(1-5\rho) - \sqrt{2N\ln(2/\delta)} \ge k$.
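For completeness, here is a short check (not spelled out in the text) that the stated value of $N$ satisfies this last inequality. Writing $L = \ln(2/\delta)$, we have $N(1-5\rho) = 2k + \frac{2L}{1-5\rho}$ and
\[
2NL = \frac{4kL}{1-5\rho} + \frac{4L^2}{(1-5\rho)^2} \le \Bigl(k + \frac{2L}{1-5\rho}\Bigr)^2,
\]
so $\sqrt{2NL} \le k + \frac{2L}{1-5\rho}$ and hence $N(1-5\rho) - \sqrt{2N\ln(2/\delta)} \ge 2k + \frac{2L}{1-5\rho} - \bigl(k + \frac{2L}{1-5\rho}\bigr) = k$.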

Corollary 4.5 (i) If we sample $k$ points $\hat{c}_1, \ldots, \hat{c}_k$, then with probability $(1 - O(\rho))^k$, where $\rho = \Omega(\epsilon^{2/3})$, for each $i$ there is a distinct center $\hat{c}_i \in X_i^{cor}$, that is, $\|\hat{c}_i - c_i\| \le r_i/\sqrt{\rho}$.

(ii) If we sample $N$ points $\hat{x}_1, \ldots, \hat{x}_N$, where $N = \frac{2k}{1-5\rho} + \frac{2\ln(2/\rho)}{(1-5\rho)^2}$ and $\rho = \sqrt{\epsilon}$, then with probability $1 - O(\rho)$, for each $i$ there is a distinct point $\hat{x}_i \in X_i^{out}$, that is, $\|\hat{x}_i - c_i\| \le r_i/\sqrt{\rho^3}$.

4.1.2 Greedy deletion procedure

We maintain a set of centers $\hat{C}$ that are currently used to cluster $X$. For any point $x \in \mathbb{R}^d$, let $R(x) \subseteq X$ denote the points of $X$ in the Voronoi region of $x$ (given the set of centers $\hat{C}$). We refer to $R(x)$ as the Voronoi set of $x$. Initialize $\hat{C} \leftarrow X$. Repeat the following steps until $|\hat{C}| = k$.

B1. Compute $T$ = cost of clustering $X$ around the centers in $\hat{C}$ = $\sum_{x \in \hat{C}} \sum_{y \in R(x)} \|y - x\|^2$. Also for every $x \in \hat{C}$, compute $T_x$ = cost of clustering $X$ around $\hat{C} \setminus \{x\}$ = $\sum_{z \in \hat{C} \setminus \{x\}} \sum_{y \in R_{-x}(z)} \|y - z\|^2$, where $R_{-x}(z)$ denotes the Voronoi set of $z$ given the center set $\hat{C} \setminus \{x\}$.

B2. Pick the center $y \in \hat{C}$ for which $T_y - T$ is minimum and set $\hat{C} \leftarrow \hat{C} \setminus \{y\}$.

B3. Recompute the Voronoi sets $R(x) = R_{-y}(x) \subseteq X$ for each (remaining) center $x \in \hat{C}$. Now we "move" the centers to the centroids of their respective (new) Voronoi sets, that is, for every set $R(x)$, we update $\hat{C} \leftarrow \hat{C} \setminus \{x\} \cup \{\mathrm{ctr}(R(x))\}$.

Running time. There are $n-k$ iterations of the B1-B3 loop. Each iteration takes $O(n^2 d)$ time: computing $T$ and the sets $R(x)$ for each $x$ takes $O(n^2 d)$ time, and we can then compute each $T_x$ in $O(|R(x)|d)$ time (since while computing $T$, we can also compute for each point its second-nearest center in $\hat{C}$). Therefore the overall running time is $O(n^3 d)$.
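The following is a deliberately naive sketch of the deletion procedure B1-B3 (our own code; it recomputes assignments from scratch for every candidate deletion and uses quadratic memory, so it does not achieve the $O(n^3 d)$ bound obtained with the bookkeeping described above, and is meant only to make the steps concrete):

```python
import numpy as np

def greedy_deletion_seed(X, k):
    """Start with every point of X as a center, repeatedly delete the center
    whose removal increases the clustering cost the least (B1-B2), and move
    the remaining centers to the centroids of their Voronoi sets (B3)."""
    X = np.asarray(X, dtype=float)
    centers = X.copy()

    def assign(cs):
        # nearest-center assignment and total squared-distance cost
        d2 = ((X[:, None, :] - cs[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        return labels, d2[np.arange(len(X)), labels].sum()

    while len(centers) > k:
        # B1/B2: minimizing T_y - T is the same as minimizing T_y, since T is fixed
        best_j, best_cost = None, np.inf
        for j in range(len(centers)):
            _, cost_without_j = assign(np.delete(centers, j, axis=0))
            if cost_without_j < best_cost:
                best_j, best_cost = j, cost_without_j
        centers = np.delete(centers, best_j, axis=0)

        # B3: move each remaining center to the centroid of its new Voronoi set
        labels, _ = assign(centers)
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers
```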


Analysis. Let $\rho$ be a parameter such that $\rho \le \frac{1}{10}$ and $\epsilon/\sqrt{\rho(1-\epsilon^2)} \le \frac{1}{14}$. Recall that $D_i = \min_{j \neq i} \|c_j - c_i\|$. Define $d_i^2 = \Delta_k^2(X)/n_i$. We will use a different notion of a cluster-core here, but the notion will still capture the fact that the core consists of points that are quite close to the cluster-center compared to the inter-cluster distance, and contains most of the mass of the cluster. Let $B(x, r) = \{y \in \mathbb{R}^d : \|x - y\| \le r\}$ denote the ball of radius $r$ centered at $x$. Define the kernel of $X_i$ to be the ball $Z_i = B(c_i, d_i/\sqrt{\rho})$ and the core of $X_i$ as $X_i^{cor} = X_i \cap Z_i$. Observe that $r_i \le d_i$, so by Markov's inequality $|X_i^{cor}| \ge (1-\rho)n_i$. Also, since $\Delta_{k-1}^2(X) - \Delta_k^2(X) \le n_i D_i^2$, we have that $d_i^2 \le D_i^2 \cdot \frac{\epsilon^2}{1-\epsilon^2}$. Therefore, $X_i^{cor} = X \cap Z_i$. We prove that, at the start of every iteration, for every $i$, there is a (distinct) center $x \in \hat{C}$ that lies in $Z_i$. (*) Clearly (*) holds at the beginning, since $\hat{C} = X$ and $X_i^{cor} \neq \emptyset$ for every cluster $X_i$. First we show (Lemma 4.6) that if $x \in \hat{C}$ is the only center that lies in a slightly enlarged version of the ball $Z_i$ for some $i$, then $x$ is not deleted. Lemma 4.7 then makes the crucial observation that even after a center $y$ is deleted, if the new Voronoi region $R_{-y}(x)$ of a center $x \in \hat{C}$ captures points from $X_i^{cor}$, then $R_{-y}(x)$ cannot "extend" too far into some other cluster $X_j$, that is, for $x' \in R_{-y}(x) \cap X_j$ where $j \neq i$, $\|x' - c_i\|$ is not much larger than $\|x' - c_j\|$. It will then follow that invariant (*) is maintained.

Lemma 4.6 Suppose (*) holds at the start of an iteration, and $x \in \hat{C}$ is the only center in $B\bigl(c_i, \frac{4d_i}{\sqrt{\rho}}\bigr)$ for some cluster $X_i$; then $x \in \hat{C}$ after step B2.

Proof: Since property (*) holds, we also know that $x \in Z_i$ and so $X_i^{cor} \subseteq R(x) \cap X_i$. If $x$ is deleted in step B2 then all points in $X_i^{cor}$ will be reassigned to a center at least $\frac{4d_i}{\sqrt{\rho}}$ away from $c_i$. So the cost-increase $T_x - T$ is at least $A = \frac{5(1-\rho)}{\rho} \cdot n_i d_i^2 = \frac{5(1-\rho)}{\rho} \cdot \Delta_k^2(X)$. Now since $|\hat{C}| > k$, there is some $j$ ($j$ could be $i$) such that the Voronoi region of $c_j$ (with respect to the optimal center-set) contains at least two centers from $\hat{C}$. We will show that deleting one of these centers will be less expensive than deleting $x$. Let $z_\ell \in \hat{C}$ be the center closest to $c_\ell$ for $\ell = 1, \ldots, k$. Note that $z_\ell \in Z_\ell$. Let $y \in \hat{C}$, $y \neq z_j$, be another center in the Voronoi region of $c_j$. Suppose we delete $y$. We can upper bound the cost-increase $T_y - T$ by the cost-increase due to the reassignment where we assign all points in $R(y) \cap X_\ell$ to $z_\ell$ for $\ell = 1, \ldots, k$. For any $x' \in R(y) \cap X_\ell$ we have $\|x' - z_\ell\| \le \|x' - c_\ell\| + \|c_\ell - z_\ell\| \le \|x' - c_\ell\| + \frac{d_\ell}{\sqrt{\rho}}$. For $\ell \neq j$, we also have $D_\ell \le \|c_j - y\| + \|y - c_\ell\| \le 2\|y - c_\ell\|$ since $c_j$ is closer to $y$ than $c_\ell$, and
\[
\|y - c_\ell\| \le \|x' - c_\ell\| + \|x' - y\| \le \|x' - c_\ell\| + \|x' - z_\ell\| \le 2\|x' - c_\ell\| + \|c_\ell - z_\ell\| \le 2\|x' - c_\ell\| + \frac{d_\ell}{\sqrt{\rho}}.
\]
Therefore, $D_\ell \le 4\|x' - c_\ell\| + \frac{2d_\ell}{\sqrt{\rho}}$, which implies that $\bigl(\frac{\sqrt{1-\epsilon^2}}{\epsilon} - \frac{2}{\sqrt{\rho}}\bigr)d_\ell \le 4\|x' - c_\ell\|$. Combining this with the bound on $\|x' - z_\ell\|$, we get that for $\ell \neq j$, $\|x' - z_\ell\| \le \beta\|x' - c_\ell\|$ where $\beta = 1 + \frac{4\epsilon}{\sqrt{\rho(1-\epsilon^2)} - 2\epsilon}$. Hence, the cost-increase of the reassignment is at most
\[
B = \sum_{\ell=1}^{k} \sum_{x' \in R(y) \cap X_\ell} \|x' - z_\ell\|^2 \le \sum_{x' \in R(y) \cap X_j} 2\Bigl(\|x' - c_j\|^2 + \frac{d_j^2}{\rho}\Bigr) + \sum_{\ell \neq j} \sum_{x' \in R(y) \cap X_\ell} \beta^2\|x' - c_\ell\|^2 \le \max(2, \beta^2)\Delta_k^2(X) + \frac{2}{\rho} \cdot n_j d_j^2 = \Bigl(\max(2, \beta^2) + \frac{2}{\rho}\Bigr)\Delta_k^2(X).
\]
Any $\rho$ satisfying the bounds stated in Section 4.1.2 ensures that $A > B$ (since $\beta < \frac{4}{3}$ and $\rho < \frac{3}{7}$). Thus, $x$ is not the cheapest center to delete, which completes the proof.


Lemma 4.7 Suppose center $y \in \hat{C}$ is deleted in step B2. Let $x \in \hat{C} \setminus \{y\}$ be such that $R_{-y}(x) \cap X_j^{cor} \neq \emptyset$ for some $j$. Then for any $x' \in R_{-y}(x) \cap X_\ell$, $\ell \neq j$, we have $\|x' - c_j\| \le \|x' - c_\ell\| + \frac{\max(d_\ell + 6d_j,\ 4d_\ell + 3d_j)}{\sqrt{\rho}}$.

Proof: Suppose that $y$ lies in the Voronoi region of center $c_i$ (wrt. the optimal centers). Let $\hat{C}' = \hat{C} \setminus \{y\}$. There must be a center $z_i \in \hat{C}'$ such that $\|z_i - c_i\| \le \frac{4d_i}{\sqrt{\rho}}$. If $y \notin Z_i$, this follows from property (*); otherwise this follows from Lemma 4.6. For any $\ell \neq i$, we know by property (*) that there is some center $z_\ell \in \hat{C}'$ that lies in $Z_\ell$. Let $x''$ be a point in $R_{-y}(x) \cap X_j^{cor}$. Then,
\[
\|x - c_j\| \le \|x - x''\| + \|x'' - c_j\| \le \|x'' - z_j\| + \|x'' - c_j\| \le \|z_j - c_j\| + 2\|x'' - c_j\| \le \|z_j - c_j\| + \frac{2d_j}{\sqrt{\rho}}.
\]
Now considering the point $x'$, we have
\[
\|x' - c_j\| \le \|x' - x\| + \|x - c_j\| \le \|x' - z_\ell\| + \|x - c_j\| \le \|x' - c_\ell\| + \|z_\ell - c_\ell\| + \|x - c_j\| \le \|x' - c_\ell\| + \|z_\ell - c_\ell\| + \|z_j - c_j\| + \frac{2d_j}{\sqrt{\rho}}.
\]
If $j = i$, then we get that $\|x' - c_i\| \le \|x' - c_\ell\| + \frac{d_\ell + 6d_i}{\sqrt{\rho}}$. For any other $j$, we have that $\|x' - c_j\| \le \|x' - c_\ell\| + \frac{4d_\ell + 3d_j}{\sqrt{\rho}}$ (since it could be that $\ell = i$).

Lemma 4.8 Suppose that property (*) holds at the beginning of some iteration in the deletion phase. Then (*) also holds at the end of the iteration, i.e., after step B3.

Proof: Suppose that we delete center $y \in \hat{C}$ that lies in the Voronoi region of center $c_i$ (wrt. the optimal centers) in step B2. Let $\hat{C}' = \hat{C} \setminus \{y\}$ and $R'(x) = R_{-y}(x)$ for any $x \in \hat{C}'$. Fix a cluster $X_j$. Let $S = \{x \in \hat{C}' : R'(x) \cap X_j^{cor} \neq \emptyset\}$ and $Y = \bigcup_{x \in S} R'(x)$. We show that there is some set $R'(x)$, $x \in \hat{C}'$, whose centroid $\mathrm{ctr}(R'(x))$ lies in the ball $Z_j$, which will prove the lemma. By Lemma 4.7 and noting that $d_\ell^2 \le \frac{\epsilon^2}{1-\epsilon^2} \cdot D_\ell^2$ for every $\ell$, for any $x' \in Y \cap X_\ell$ where $\ell \neq j$, we have $\|x' - c_j\| \le \|x' - c_\ell\| + \frac{\epsilon}{\sqrt{\rho(1-\epsilon^2)}} \cdot \max(D_\ell + 6D_j,\ 4D_\ell + 3D_j)$. Also $D_j, D_\ell \le \|c_j - c_\ell\| \le \|x' - c_j\| + \|x' - c_\ell\|$. Substituting for $D_j, D_\ell$ we get that $\|x' - c_j\| \le \beta\|x' - c_\ell\|$ where $\beta = \frac{1 + 7\epsilon/\sqrt{\rho(1-\epsilon^2)}}{1 - 7\epsilon/\sqrt{\rho(1-\epsilon^2)}}$. Using this we obtain that $A = \sum_{x' \in Y} \|x' - c_j\|^2 \le \beta^2 \sum_{\ell=1}^{k} \sum_{x' \in Y \cap X_\ell} \|x' - c_\ell\|^2 \le \beta^2\Delta_k^2(X)$. We also have
\[
A = \sum_{x \in S} \sum_{x' \in R'(x)} \|x' - c_j\|^2 = \sum_{x \in S} \bigl(\Delta_1^2(R'(x)) + |R'(x)|\,\|\mathrm{ctr}(R'(x)) - c_j\|^2\bigr) \ge |Y| \min_{x \in S} \|\mathrm{ctr}(R'(x)) - c_j\|^2.
\]
Since $X_j^{cor} \subseteq Y$ we have $|Y| \ge (1-\rho)n_j$, so we obtain that $\min_{x \in S} \|\mathrm{ctr}(R'(x)) - c_j\| \le \frac{\beta}{\sqrt{1-\rho}} \cdot d_j$. The bounds on $\rho$ ensure that $\frac{\rho\beta^2}{1-\rho} \le 1$, so that $\min_{x \in S} \|\mathrm{ctr}(R'(x)) - c_j\| \le \frac{d_j}{\sqrt{\rho}}$.

Corollary 4.9 After the deletion phase, for every $i$, there is a center $\hat{c}_i \in \hat{C}$ with $\|\hat{c}_i - c_i\| \le \frac{\epsilon}{\sqrt{\rho(1-\epsilon^2)}} \cdot D_i$.
