The Effectiveness of Lloyd-type Methods for the k-Means Problem

Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, Chaitanya Swamy§

Abstract

We investigate variants of Lloyd’s heuristic for clustering high dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd’s heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances. This is the first performance guarantee for a variant of Lloyd’s heuristic. The provision of a guarantee on output quality does not come at the expense of speed: some of our algorithms are candidates for being faster in practice than currently used variants of Lloyd’s method. In addition, our other algorithms are faster on well-clusterable instances than recently proposed approximation algorithms, while maintaining similar guarantees on clustering quality. Our main algorithmic contribution is a novel probabilistic seeding process for the starting configuration of a Lloyd-type iteration.

1 Introduction

Overview. There is presently a wide and unsatisfactory gap between the practical and theoretical clustering literatures. For decades, practitioners have been using heuristics of great speed but uncertain merit; the latter should not be surprising since the problem is NP-hard in almost any formulation. However, in the last few years, algorithms researchers have made considerable innovations, and even obtained polynomial-time approximation schemes (PTAS's) for some of the most popular clustering formulations. Yet these contributions have not had a noticeable impact on practice. Practitioners instead continue to use a variety of heuristics (Lloyd, EM, agglomerative methods, etc.) that have no known performance guarantees.

There are two ways to approach this disjuncture. The most obvious is to continue developing new techniques until they are so good—down to the implementations—that they displace entrenched methods. The other is to look toward popular heuristics and ask whether there are reasons that justify their extensive use, but elude the standard theoretical criteria; and in addition, whether theoretical scrutiny suggests improvements in their application. This is the approach we take in this paper.

As in other prominent cases [47, 41], such an analysis typically involves some abandonment of the worst-case inputs criterion. (In fact, part of the challenge is to identify simple conditions on the input that allow one to prove a performance guarantee of wide applicability.) Our starting point is the notion that (as discussed in [45]) one should be concerned with k-clustering data that possesses a meaningful k-clustering.

rafail@cs.ucla.edu. Computer Science Department, University of California at Los Angeles, 90095, USA. Supported in part by an IBM Faculty Award, a Xerox Innovation Group Award, a gift from Teradata, an Intel equipment grant, and NSF Cybertrust grant no. 0430254.

rabani@cs.technion.ac.il. Computer Science Department, Technion — Israel Institute of Technology, Haifa 32000, Israel. Part of this work was done while visiting UCLA and Caltech. Supported in part by ISF 52/03, BSF 2002282, and the Fund for the Promotion of Research at the Technion.

schulman@caltech.edu. Caltech, Pasadena, CA 91125. Supported in part by NSF CCF-0515342, NSA H98230-06-1-0074, and NSF ITR CCR-0326554.

§cswamy@math.uwaterloo.ca. Dept. of Combinatorics and Optimization, Univ. Waterloo, Waterloo, ON N2L 3G1.

Research supported partially by NSERC grant 32760-06. Work done while the author was a postdoctoral scholar at Caltech.


What does it mean for the data to have a meaningful k-clustering? Here are two examples of settings where one would intuitively not consider the data to possess a meaningful k-clustering. If nearly optimum cost can be achieved by two very different k-way partitions of the data, then the identity of the optimal partition carries little meaning (for example, if the data was generated by random sampling from a source, then the optimal cluster regions might shift drastically upon resampling). Alternatively, if a near-optimal k-clustering can be achieved by a partition into fewer than k clusters, then that smaller value of k should be used to cluster the data. If near-optimal k-clusterings are hard to find only when they provide ambiguous classification or marginal benefit (i.e., in the absence of a meaningful k-clustering), then such hardness should not be viewed as an acceptable obstacle to algorithm development. Instead, the performance criteria should be revised.

Specifically, we consider the k-means formulation of clustering: given a finite set $X \subseteq \mathbb{R}^d$, find $k$ points ("centers") to minimize the sum over all points $x \in X$ of the squared distance between $x$ and the center to which it is assigned. In an optimal solution, each center is assigned the data in its Voronoi region and is located at the center of mass of this data. Perhaps the most popular heuristic used for this problem is Lloyd's method, which consists of the following two phases: (a) "seed" the process with some initial centers (the literature contains many competing suggestions of how to do this); (b) iterate the following Lloyd step until the clustering is "good enough": cluster all the data in the Voronoi region of a center together, and then move the center to the centroid of its cluster.
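For concreteness, the following is a minimal sketch of the generic Lloyd iteration just described (our own illustration, not pseudocode from the paper); the seeding in phase (a) is left to the caller, since that is exactly the part this paper focuses on.

```python
import numpy as np

def lloyd(X, init_centers, n_iter=100, tol=1e-9):
    """Generic Lloyd iteration: start from the given seed centers, then
    repeat the Lloyd step (Voronoi assignment + recentering) until the
    k-means cost stops improving.  X is (n, d); init_centers is (k, d)."""
    X = np.asarray(X, dtype=float)
    centers = np.array(init_centers, dtype=float)
    prev_cost = np.inf
    for _ in range(n_iter):
        # assign every point to its nearest center (Voronoi regions)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        cost = d2[np.arange(len(X)), labels].sum()
        if prev_cost - cost <= tol:
            break
        prev_cost = cost
        # move each center to the centroid of its Voronoi region
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels, cost
```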

Although Lloyd-style methods are widely used, to our knowledge, there is no known mathematical analysis that attempts to explain or predict the performance of these heuristics. In this paper, we take the first step in this direction. We show that if the data is well-clusterable according to a certain "clusterability" or "separation" condition (that we introduce and discuss below), then various Lloyd-style methods do indeed perform well and return a provably near-optimal clustering. Our contributions are threefold:

(a) We introduce a separation condition and justify it as a reasonable abstraction of well-clusterability for the analysis of k-means clustering algorithms. Our condition is simple, and abstracts a notion of well-clusterability alluded to earlier: letting $\Delta_k^2(X)$ denote the cost of an optimal k-means solution of input $X$, we say that $X$ is $\epsilon$-separated for k-means if $\Delta_k^2(X)/\Delta_{k-1}^2(X) \le \epsilon^2$. (A similar condition for $k = 2$ was used for $\ell_2^2$ edge-cost clustering in [45].)

Our motivation for proposing this condition is that a significant drop in the k-clustering cost is already used by practitioners as a diagnostic for choosing the value of $k$ ([14], §10.10); a minimal code sketch of this diagnostic appears after this list. Furthermore, we show that: (i) The data satisfies our separation condition if and only if it satisfies the other intuitive notion of well-clusterability suggested earlier, namely that any two low-cost k-clusterings disagree on only a small fraction of the data; and (ii) The condition is robust under noisy (even adversarial) perturbation of the data. In Section 5 we prove rigorous versions of (i) and (ii).

(b) We present a novel and efficient sampling process for seeding Lloyd’s method with initial centers, which allows us to prove the effectiveness of these methods.

(c) We demonstrate the effectiveness of (our variants of) the Lloyd heuristic under the separation condition. Specifically: (i) Our simplest variant uses only the new seeding procedure, requires a single Lloyd-type descent step, and achieves a constant-factor approximation in time linear in $|X|$. This algorithm has success probability exponentially small in $k$, but we show that (ii) a slightly more complicated seeding process based on our sampling procedure yields a constant-factor approximation guarantee with constant probability, again in linear time. Since only one run of seeding+descent is required in both algorithms, these are candidates for being faster in practice than currently used Lloyd variants, which are used with multiple re-seedings and many Lloyd steps per re-seeding. (iii) We also give a PTAS by combining our seeding process with a sampling procedure of Kumar, Sabharwal and Sen [30], whose running time is linear in $|X|$ and exponential in $k$. This PTAS is significantly faster, and also simpler, than the PTAS of Kumar et al. [30] (applying the separation condition to both algorithms; the latter does not run faster under the condition).
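The diagnostic mentioned in item (a) can be sketched as follows. This is our own hedged illustration, not the paper's procedure: it only estimates $\Delta_k^2(X)$ with repeated Lloyd runs (the separation condition is defined in terms of the optimal costs), so small ratios are suggestive rather than conclusive.

```python
import numpy as np

def separation_profile(X, k_max, n_init=10, n_iter=50, rng=None):
    """Estimate Delta_k^2(X) for k = 1..k_max with repeated randomly seeded
    Lloyd runs and return the ratios Delta_k^2 / Delta_{k-1}^2; a small ratio
    at some k suggests the data may be epsilon-separated for k-means."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)

    def lloyd_cost(k):
        best = np.inf
        for _ in range(n_init):
            centers = X[rng.choice(len(X), size=k, replace=False)].copy()
            for _ in range(n_iter):
                d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
                labels = d2.argmin(axis=1)
                for j in range(k):
                    if np.any(labels == j):
                        centers[j] = X[labels == j].mean(axis=0)
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            best = min(best, d2.min(axis=1).sum())
        return best

    costs = [lloyd_cost(k) for k in range(1, k_max + 1)]
    return {k: costs[k - 1] / costs[k - 2] for k in range(2, k_max + 1)}
```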


Literature and problem formulation. Let $X \subseteq \mathbb{R}^d$ be the given point set and $n = |X|$. In the k-means problem, the objective is to partition $X$ into $k$ clusters $\bar{X}_1, \ldots, \bar{X}_k$ and assign each point in every cluster $\bar{X}_i$ to a common center $\bar{c}_i \in \mathbb{R}^d$, so as to minimize the "k-means cost" $\sum_{i=1}^{k} \sum_{x \in \bar{X}_i} \|x - \bar{c}_i\|^2$, where $\|\cdot\|$ denotes the $\ell_2$ norm. We let $\Delta_k^2(X)$ denote the optimum k-means cost. Observe that given the centers $\bar{c}_1, \ldots, \bar{c}_k$, it is easy to determine the best clustering corresponding to these centers: cluster $\bar{X}_i$ simply consists of all points $x \in X$ for which $\bar{c}_i$ is the nearest center (breaking ties arbitrarily). Conversely, given a clustering $\bar{X}_1, \ldots, \bar{X}_k$, the best centers corresponding to this clustering are obtained by setting $\bar{c}_i$ to be the center of mass (centroid) of cluster $\bar{X}_i$, that is, setting $\bar{c}_i = \frac{1}{|\bar{X}_i|} \sum_{x \in \bar{X}_i} x$. It follows that both of these properties simultaneously hold in an optimal solution, that is, $\bar{c}_i$ is the centroid of cluster $\bar{X}_i$, and each point in $\bar{X}_i$ has $\bar{c}_i$ as its nearest center.

The problem of minimizing the k-means cost is one of the earliest and most intensively studied formulations of the clustering problem, both because of its mathematical elegance and because it bears closely on statistical estimation of mixture models of $k$ point sources under spherically symmetric Gaussian noise.

We briefly survey the most relevant literature here. The k-means problem seems to have been first considered by Steinhaus in 1956 [48]. A simple greedy iteration to minimize cost was suggested in 1957 by Lloyd [32] (and less methodically in the same year by Cox [9]; also apparently by psychologists between 1959-67 [49]). This and similar iterative descent methods soon became the dominant approaches to the problem [35, 33, 12, 31] (see also [19, 20, 24] and the references therein); they remain so today, and are still being improved [1, 42, 44, 28]. Lloyd's method (in any variant) converges only to local optima, however, and is sensitive to the choice of the initial centers [38]. Consequently, a lot of research has been directed toward seeding methods that try to start off Lloyd's method with a good initial configuration [18, 29, 17, 23, 46, 5, 36, 43]. Very few theoretical guarantees are known about Lloyd's method or its variants. The convergence rate of Lloyd's method has recently been investigated in [10, 22, 2]; in particular, [2] shows that Lloyd's method can require a superpolynomial number of iterations to converge.

The k-means problem is NP-hard even for $k = 2$ [13]. Recently there has been substantial progress in developing approximation algorithms for this problem. Matoušek [34] gave the first PTAS for this problem, with running time polynomial in $n$, for a fixed $k$ and dimension. Subsequently a succession of algorithms have appeared [40, 4, 11, 15, 16, 21, 30] with varying runtime dependence on $n$, $k$ and the dimension. The most recent of these is the algorithm of Kumar, Sabharwal and Sen [30], which gives a linear time PTAS for a fixed $k$. There are also various constant-factor approximation algorithms for the related k-median problem [26, 7, 6, 25, 37], which also yield approximation algorithms for k-means, and have running time polynomial in $n$, $k$ and the dimension; recently Kanungo et al. [27] adapted the k-median algorithm of [3] to obtain a $(9+\epsilon)$-approximation algorithm for k-means.

However, none of these methods match the simplicity and speed of the popular Lloyd's method. Researchers concerned with the runtime of Lloyd's method bemoan the need for $n$ nearest-neighbor computations in each descent step [28]! Interestingly, the last reference provides a data structure that provably speeds up the nearest-neighbor calculations of Lloyd descent steps, under the condition that the optimal clusters are well-separated. (This is unrelated to providing performance guarantees for the outcome.) Their data structure may be used in any Lloyd variant, including ours, and is well suited to the conditions under which we prove performance of our method; however, ironically, it may not be worthwhile to precompute their data structure since our method requires so few descent steps.

2 Preliminaries

We use the following notation throughout. For a point set $S$, we use $\mathrm{ctr}(S)$ to denote the center of mass of $S$. Let the partition $X_1 \cup \cdots \cup X_k = X$ be an optimal k-means clustering of the input $X$, and let $c_i = \mathrm{ctr}(X_i)$ and $c = \mathrm{ctr}(X)$. So $\Delta_k^2(X) = \sum_{i=1}^{k} \sum_{x \in X_i} \|x - c_i\|^2 = \sum_{i=1}^{k} \Delta_1^2(X_i)$. Let $n_i = |X_i|$, $n = |X|$, and $r_i^2 = \Delta_1^2(X_i)/n_i$, that is, $r_i^2$ is the "mean squared error" in cluster $X_i$. Define $D_i = \min_{j \neq i} \|c_j - c_i\|$. We assume throughout that $X$ is $\epsilon$-separated for k-means, that is, $\Delta_k^2(X) \le \epsilon^2 \Delta_{k-1}^2(X)$, where $0 < \epsilon \le \epsilon_0$ with $\epsilon_0$ being a suitably small constant. We use the following basic lemmas quite frequently.

Lemma 2.1 For every $x$, $\sum_{y \in X} \|x - y\|^2 = \Delta_1^2(X) + n\|x - c\|^2$. Hence $\sum_{\{x,y\} \subseteq X} \|x - y\|^2 = n\Delta_1^2(X)$.
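As a quick numerical sanity check of Lemma 2.1 (our own snippet, not part of the paper), both identities can be verified on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                  # arbitrary point set
n, c = len(X), X.mean(axis=0)
delta1 = np.sum((X - c) ** 2)                 # Delta_1^2(X)

x = X[7]                                      # any fixed point x
lhs = np.sum((X - x) ** 2)                    # sum over y of ||x - y||^2
assert np.isclose(lhs, delta1 + n * np.sum((x - c) ** 2))

# sum over unordered pairs {x, y} of ||x - y||^2 equals n * Delta_1^2(X)
pair_sum = 0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum()
assert np.isclose(pair_sum, n * delta1)
```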

Lemma 2.2 Consider any set $S \subseteq \mathbb{R}^d$ and any partition $S_1 \cup S_2$ of $S$ with $S_1 \neq \emptyset$. Let $s, s_1, s_2$ denote respectively $\mathrm{ctr}(S), \mathrm{ctr}(S_1), \mathrm{ctr}(S_2)$. Then, (i) $\Delta_1^2(S) = \Delta_1^2(S_1) + \Delta_1^2(S_2) + \frac{|S_1||S_2|}{|S|}\|s_1 - s_2\|^2$, and (ii) $\|s_1 - s\|^2 \le \frac{\Delta_1^2(S)}{|S|} \cdot \frac{|S_2|}{|S_1|}$.

Proof: Let $a = |S_1|$ and $b = |S_2| = |S| - |S_1|$. We have
\[
\Delta_1^2(S) = \sum_{x \in S_1} \|x - s\|^2 + \sum_{x \in S_2} \|x - s\|^2 = \bigl(\Delta_1^2(S_1) + a\|s_1 - s\|^2\bigr) + \bigl(\Delta_1^2(S_2) + b\|s_2 - s\|^2\bigr) \quad\text{(by Lemma 2.1)}
\]
\[
= \Delta_1^2(S_1) + \Delta_1^2(S_2) + \frac{ab}{a+b} \cdot \|s_1 - s_2\|^2.
\]
The second equality follows from Lemma 2.1 by noting that $s$ is also the center of mass of the point set where $a$ points are located at $s_1$ and $b$ points are located at $s_2$, and so the optimal 1-means cost of this point set is given by $a\|s_1 - s\|^2 + b\|s_2 - s\|^2$. This proves part (i). Part (ii) follows by substituting $\|s_1 - s\| = \|s_1 - s_2\| \cdot b/(a+b)$ in part (i) and dropping the $\Delta_1^2(S_1)$ and $\Delta_1^2(S_2)$ terms.

3 The 2-means problem

We first consider the 2-means case. We assume that the input $X$ is $\epsilon$-separated for 2-means. We present an algorithm that returns a solution of cost at most $(1 + f(\epsilon))\Delta_2^2(X)$ in linear time, for a suitably defined function $f$ that satisfies $\lim_{\epsilon \to 0} f(\epsilon) = 0$. An appealing feature of our algorithm is its simplicity, both in description and analysis. In Section 4, where we consider the k-means case, we will build upon this algorithm to obtain both a linear time constant-factor (of the form $1 + f(\epsilon)$) approximation algorithm and a PTAS with running time exponential in $k$, but linear in $n, d$.

The chief algorithmic novelty in our 2-means algorithm is a non-uniform sampling process to pick two seed centers. Our sampling process is very simple: we pick the pair $x, y \in X$ with probability proportional to $\|x - y\|^2$. This biases the distribution towards pairs that contribute a large amount to $\Delta_1^2(X)$ (noting that $n\Delta_1^2(X) = \sum_{\{x,y\} \subseteq X} \|x - y\|^2$). We emphasize that, as improving the seeding is the only way to get Lloyd's method to find a high-quality clustering, the topic of picking the initial seed centers has received much attention in the experimental literature (see, e.g., [43] and references therein). However, to the best of our knowledge, this simple and intuitive seeding method is new to the vast literature on the k-means problem. By putting more weight on pairs that contribute a lot to $\Delta_1^2(X)$, the sampling process aims to pick the initial centers from the cores of the two optimal clusters. We define the core of a cluster precisely later, but loosely speaking, it consists of points in the cluster that are significantly closer to this cluster-center than to any other center. Lemmas 3.1 and 3.2 make the benefits of this approach precise. Thus, in essence, we are able to leverage the separation condition to nearly isolate the optimal centers. Once we have the initial centers within the cores of the two optimal clusters, we show that a simple Lloyd-like step, which is also simple to analyze, yields a good performance guarantee: we consider a suitable ball around each center and move the center to the centroid of this ball to obtain the final centers. This "ball-k-means" step is adopted from Effros and Schulman [16], where it is shown that if the k-means cost of the current solution is small compared to $\Delta_{k-1}^2(X)$ (which holds for us since the initial centers lie in the cluster-cores) then a Lloyd step followed by a ball-k-means step yields a clustering of cost close to $\Delta_k^2(X)$. In our case, we are able to eliminate the Lloyd step, and show that the ball-k-means step alone guarantees a good clustering.

1. Sampling. Randomly select a pair of points from the set $X$ to serve as the initial centers, picking the pair $x, y \in X$ with probability proportional to $\|x - y\|^2$. Let $\hat{c}_1, \hat{c}_2$ denote the two picked centers.

2. "Ball-k-means" step. For each $\hat{c}_i$, consider the ball of radius $\|\hat{c}_1 - \hat{c}_2\|/3$ around $\hat{c}_i$ and compute the centroid $\bar{c}_i$ of the portion of $X$ in this ball. Return $\bar{c}_1, \bar{c}_2$ as the final centers.

Running time. The entire algorithm runs in time $O(nd)$. Step 2 clearly takes only $O(nd)$ time. We show that the sampling step can be implemented to run in $O(nd)$ time. Consider the following two-step sampling procedure: (a) first pick center $\hat{c}_1$ by choosing a point $x \in X$ with probability equal to $\frac{\sum_{y \in X} \|x - y\|^2}{\sum_{x,y \in X} \|x - y\|^2} = \frac{\Delta_1^2(X) + n\|x - c\|^2}{2n\Delta_1^2(X)}$ (using Lemma 2.1); (b) pick the second center by choosing point $y \in X$ with probability equal to $\|y - \hat{c}_1\|^2 / \bigl(\Delta_1^2(X) + n\|c - \hat{c}_1\|^2\bigr)$. This two-step sampling procedure is equivalent to the sampling process in step 1, that is, it picks the pair $x_1, x_2 \in X$ with probability $\frac{\|x_1 - x_2\|^2}{\sum_{\{x,y\} \subseteq X} \|x - y\|^2}$. Each step takes only $O(nd)$ time since $\Delta_1^2(X)$ can be precomputed in $O(nd)$ time.

Analysis. The analysis hinges on the important fact that under the separation condition, the radius $r_i$ of each optimal cluster is substantially smaller than the inter-cluster separation $\|c_1 - c_2\|$ (Lemma 3.1). This allows us to show in Lemma 3.2 that with high probability, each initial center $\hat{c}_i$ lies in the core (suitably defined) of a distinct optimal cluster, say $X_i$, and hence $\|c_1 - c_2\|$ is much larger than the distances $\|\hat{c}_i - c_i\|$ for $i = 1, 2$. Assuming that $\hat{c}_1, \hat{c}_2$ lie in the cores of the clusters, we prove in Lemma 3.3 that the ball around $\hat{c}_i$ contains only, and most of, the mass of cluster $X_i$, and therefore the centroid $\bar{c}_i$ of this ball is very "close" to $c_i$. This in turn implies that the cost of the clustering around $\bar{c}_1, \bar{c}_2$ is small.

Lemma 3.1 $\max(r_1^2, r_2^2) \le \frac{\epsilon^2}{1-\epsilon^2}\|c_1 - c_2\|^2 = O(\epsilon^2)\|c_1 - c_2\|^2$.

Proof: By part (i) of Lemma 2.2 we have $\Delta_1^2(X) = \Delta_2^2(X) + \frac{n_1 n_2}{n} \cdot \|c_1 - c_2\|^2$, which is equivalent to $\frac{n}{n_1 n_2} \cdot \Delta_2^2(X) = \|c_1 - c_2\|^2 \cdot \frac{\Delta_2^2(X)}{\Delta_1^2(X) - \Delta_2^2(X)}$. This implies that $r_1^2 \cdot \frac{n}{n_2} + r_2^2 \cdot \frac{n}{n_1} \le \frac{\epsilon^2}{1-\epsilon^2}\|c_1 - c_2\|^2$.

Let $\rho = \frac{100\epsilon^2}{1-\epsilon^2}$. We require that $\rho < 1$. We define the core of cluster $X_i$ as the set $X_i^{cor} = \bigl\{x \in X_i : \|x - c_i\|^2 \le \frac{r_i^2}{\rho}\bigr\}$. By Markov's inequality, $|X_i^{cor}| \ge (1-\rho)n_i$ for $i = 1, 2$.

Lemma 3.2 $\Pr\bigl[\{\hat{c}_1, \hat{c}_2\} \cap X_1^{cor} \neq \emptyset \text{ and } \{\hat{c}_1, \hat{c}_2\} \cap X_2^{cor} \neq \emptyset\bigr] = 1 - O(\rho)$.

Proof: To simplify our expressions, we assume that all the points are scaled by $\frac{1}{\|c_1 - c_2\|}$ (so $\|c_1 - c_2\| = 1$). By part (i) of Lemma 2.2, we have $\Delta_1^2(X) = \Delta_2^2(X) + \frac{n_1 n_2}{n} \cdot \|c_1 - c_2\|^2$, which implies that $\Delta_1^2(X) \le \frac{n_1 n_2}{n(1-\epsilon^2)}$. Let $c'_i$ denote the center of mass of $X_i^{cor}$. Applying part (ii) of Lemma 2.2 (taking $S = X_i$ and $S_1 = X_i^{cor}$) we get that $\|c'_i - c_i\|^2 \le \frac{\rho}{1-\rho} \cdot r_i^2$. The probability of the event in the lemma is $A/B$ where $A = \sum_{x \in X_1^{cor}} \sum_{y \in X_2^{cor}} \|x - y\|^2 = |X_1^{cor}|\,\Delta_1^2(X_2^{cor}) + |X_2^{cor}|\,\Delta_1^2(X_1^{cor}) + |X_1^{cor}||X_2^{cor}|\,\|c'_1 - c'_2\|^2$, and $B = \sum_{\{x,y\} \subseteq X} \|x - y\|^2 = n\Delta_1^2(X) \le \frac{n_1 n_2}{1-\epsilon^2}$. By the above bounds on $\|c'_i - c_i\|$ and Lemma 3.1, we get $\|c'_1 - c'_2\| \ge 1 - 2\epsilon\sqrt{\frac{\rho}{(1-\rho)(1-\epsilon^2)}}$. So $A = \bigl(1 - O(\rho)\bigr)n_1 n_2$, and $A/B = 1 - O(\rho)$.

So we may assume that each initial center $\hat{c}_i$ lies in $X_i^{cor}$. Let $\hat{d} = \|\hat{c}_1 - \hat{c}_2\|$ and $B_i = \{x \in X : \|x - \hat{c}_i\| \le \hat{d}/3\}$. Recall that $\bar{c}_i$ is the centroid of $B_i$, and we return $\bar{c}_1, \bar{c}_2$ as our final solution.


Lemma 3.3 For each $i$, we have $X_i^{cor} \subseteq B_i \subseteq X_i$. Hence, $\|\bar{c}_i - c_i\|^2 \le \frac{\rho}{1-\rho} \cdot r_i^2$.

Proof: By Lemma 3.1 and the definition of $X_i^{cor}$, we know that $\|\hat{c}_i - c_i\| \le \theta\|c_1 - c_2\|$ for $i = 1, 2$, where $\theta = \frac{\epsilon}{\sqrt{\rho(1-\epsilon^2)}} \le \frac{1}{10}$. So $\frac{4}{5} \le \frac{\hat{d}}{\|c_1 - c_2\|} \le \frac{6}{5}$. For any $x \in B_i$ we have $\|x - c_i\| \le \frac{\hat{d}}{3} + \|\hat{c}_i - c_i\| \le \frac{\|c_1 - c_2\|}{2}$, so $x \in X_i$. Also for any $x \in X_i^{cor}$, $\|x - \hat{c}_i\| \le 2\theta\|c_1 - c_2\| \le \frac{\hat{d}}{3}$, so $x \in B_i$. Now by part (ii) of Lemma 2.2, with $S = X_i$ and $S_1 = B_i$, we obtain that $\|\bar{c}_i - c_i\|^2 \le \frac{\rho}{1-\rho} \cdot r_i^2$ since $|B_i| \ge |X_i^{cor}|$ for each $i$.

Theorem 3.4 The above algorithm returns a clustering of cost at most $\frac{\Delta_2^2(X)}{1-\rho}$ with probability at least $1 - O(\rho)$ in time $O(nd)$, where $\rho = \Theta(\epsilon^2)$.

Proof: The cost of the solution is at most $\sum_{i} \sum_{x \in X_i} \|x - \bar{c}_i\|^2 = \sum_{i} \bigl(\Delta_1^2(X_i) + n_i\|\bar{c}_i - c_i\|^2\bigr) \le \frac{\Delta_2^2(X)}{1-\rho}$.

4 The k-means problem

We now consider the k-means setting. We assume that $\Delta_k^2(X) \le \epsilon^2\Delta_{k-1}^2(X)$. We describe a linear time constant-factor approximation algorithm, and a PTAS that returns a $(1+\omega)$-optimal solution in time $O\bigl(2^{O(k/\omega)}nd\bigr)$. The algorithms consist of various ingredients, which we describe separately first for ease of understanding, before gluing them together to obtain the final algorithm.

Conceptually both algorithms proceed in two stages. The first stage is a seeding stage, which performs the bulk of the work and guarantees that at the end of this stage there are $k$ seed centers positioned at nearly the right locations. By this we mean that if we consider distances at the scale of the inter-cluster separation, then at the end of this stage, each optimal center has a (distinct) initial center located in close proximity — this is precisely the leverage that we obtain from the k-means separation condition (as in the 2-means case). We shall employ three simple seeding procedures with varying time vs. quality guarantees that will exploit this condition and seed the $k$ centers at locations very close to the optimal centers. In Section 4.1.1, we consider a natural generalization of the sampling procedure used for the 2-means case, and show that this picks the $k$ initial centers from the cores of the optimal clusters. This sampling procedure runs in linear time but it succeeds with probability that is exponentially small in $k$. In Section 4.1.2, we present a very simple deterministic greedy deletion procedure, where we start off with all points in $X$ as the centers and then greedily delete points (and move centers) until there are $k$ centers left. The running time here is $O(n^3 d)$. Our deletion procedure is similar to the reverse greedy algorithm proposed by Chrobak, Kenyon and Young [8] for the k-median problem. Chrobak et al. show that their reverse greedy algorithm attains an approximation ratio of $O(\log n)$, which is tight up to a factor of $\log\log n$. In contrast, for the k-means problem, if $\Delta_k^2(X) \le \epsilon^2\Delta_{k-1}^2(X)$, we show that our greedy deletion procedure followed by a clean-up step (in the second stage) yields a $(1 + f(\epsilon))$-approximation algorithm. Finally, in Section 4.1.3 we combine the sampling and deletion procedures to obtain an $O(nkd + k^3 d)$-time initialization procedure. We sample $O(k)$ centers, which ensures that every cluster has an initial center in a slightly expanded version of the core, and then run the deletion procedure on an instance of size $O(k)$ derived from the sampled points to obtain the $k$ seed centers.

Once the initial centers have been positioned sufficiently close to the optimal centers, we can proceed in two ways in the second stage (Section 4.2). One option is to use a ball-k-means step, as in 2-means, which yields a clustering of cost $(1 + f(\epsilon))\Delta_k^2(X)$ due to exactly the same reasons as in the 2-means case. Thus, combined with the initialization procedure of Section 4.1.3, this yields a constant-factor approximation algorithm with running time $O(nkd + k^3 d)$. The entire algorithm is summarized in Section 4.3.
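A minimal sketch of such a ball-k-means step for $k$ centers is shown below (our own code; we reuse the 2-means choice of one third of the distance to the nearest other seed as the ball radius, which may differ from the exact radius used in the analysis of Section 4.2):

```python
import numpy as np

def ball_kmeans_step(X, centers, shrink=3.0):
    """For each seed center, take the ball of radius (distance to the nearest
    other seed) / shrink and move the center to the centroid of the points of
    X inside that ball."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    new_centers = centers.copy()
    for i, ci in enumerate(centers):
        others = np.delete(centers, i, axis=0)
        radius = np.min(np.linalg.norm(others - ci, axis=1)) / shrink
        mask = np.linalg.norm(X - ci, axis=1) <= radius
        if mask.any():                       # keep the old seed if the ball is empty
            new_centers[i] = X[mask].mean(axis=0)
    return new_centers
```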

The other option, which yields a PTAS, is to use a sampling idea of Kumar et al. [30]. For each initial center, we compute a list of candidate centers for the corresponding optimal cluster as follows: we sample a small set of points uniformly at random from a slightly expanded Voronoi region of the initial center, and consider the centroid of every subset of the sampled set of a certain size as a candidate. We exhaustively search for the $k$ candidates (picking one candidate per initial center) that yield the least cost solution, and output these as our final centers. The fact that each optimal center $c_i$ has an initial center in close proximity allows us to argue that the entire optimal cluster $X_i$ is contained in the expanded Voronoi region of this initial center, and moreover that $|X_i|$ is a significant fraction of the total mass in this region. Given this property, as argued by Kumar et al. (Lemma 2.3 in [30]), a random sample from the expanded Voronoi region also (essentially) yields a random sample from $X_i$, which allows us to compute a good estimate of the centroid of $X_i$, and hence of $\Delta_1^2(X_i)$. We obtain a $(1+\omega)$-optimal solution in time $O\bigl(2^{O(k/\omega)}nd\bigr)$ with constant probability. Since we incur an exponential dependence on $k$ anyway, we just use the simple sampling procedure of Section 4.1.1 in the first stage to pick the $k$ initial centers. Although the running time is exponential in $k$, it is significantly better than the running time of $O\bigl(2^{(k/\omega)^{O(1)}}nd\bigr)$ incurred by the algorithm of Kumar et al.; we also obtain a simpler PTAS. Both of these features can be traced to the separation condition, which enables us to nearly isolate the positions of the optimal centers in the first stage.

Kumar et al. do not have any such facility, and therefore need to sequentially “guess” (i.e., exhaustively search) the various centroids, incurring a corresponding increase in the run time. This PTAS is described in Section 4.4.

4.1 Seeding procedures used in stage I

4.1.1 Sampling

We pick $k$ initial centers as follows: first pick two centers $\hat{c}_1, \hat{c}_2$ as in the 2-means case, that is, choose $x, y \in X$ with probability proportional to $\|x - y\|^2$. Suppose we have already picked $i$ centers $\hat{c}_1, \ldots, \hat{c}_i$, where $2 \le i < k$. Now pick a random point $x \in X$ with probability proportional to $\min_{j \in \{1,\ldots,i\}} \|x - \hat{c}_j\|^2$ and set that as center $\hat{c}_{i+1}$.

Running time. The sampling procedure consists of $k$ iterations, each of which takes $O(nd)$ time. This is because after sampling a new point $\hat{c}_{i+1}$, we can update the quantity $\min_{j \in \{1,\ldots,i+1\}} \|x - \hat{c}_j\|$ for each point $x$ in $O(d)$ time. So the overall running time is $O(nkd)$.
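A minimal sketch of this seeding procedure (our own code, not the paper's) follows; it maintains the current minimum squared distances so that each of the $k$ iterations costs $O(nd)$:

```python
import numpy as np

def sample_seed_centers(X, k, rng=None):
    """Pick k seed centers: the first two as in the 2-means case (via the
    two-step procedure of Section 3), each further center being a point x
    chosen with probability proportional to min_j ||x - c_j||^2."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    n = len(X)

    # first center: Pr[x] proportional to Delta_1^2(X) + n * ||x - c||^2
    c = X.mean(axis=0)
    sq = np.sum((X - c) ** 2, axis=1)
    w = sq.sum() + n * sq
    centers = [X[rng.choice(n, p=w / w.sum())]]

    # second center: Pr[y] proportional to ||y - c_1||^2
    d2 = np.sum((X - centers[0]) ** 2, axis=1)         # current min sq. distances
    centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    d2 = np.minimum(d2, np.sum((X - centers[-1]) ** 2, axis=1))

    # remaining centers: Pr[x] proportional to min_j ||x - c_j||^2
    while len(centers) < k:
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
        d2 = np.minimum(d2, np.sum((X - centers[-1]) ** 2, axis=1))
    return np.array(centers)
```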

Analysis. Let $\epsilon^2 \le \rho < 1$ be a parameter that we will set later. As in the 2-means case, we define the core of cluster $X_i$ as $X_i^{cor} = \bigl\{x \in X_i : \|x - c_i\|^2 \le \frac{r_i^2}{\rho}\bigr\}$. We show that under our separation assumption, the above sampling procedure will pick the $k$ initial centers to lie in the cores of the clusters $X_1, \ldots, X_k$ with probability $(1 - O(\rho))^k$. We also show in Lemma 4.4 that if more than $k$, but still $O(k)$, points are sampled, then with constant probability, every cluster will contain a sampled point that lies in a somewhat larger core, that we call the outer core of the cluster. This analysis will be useful in Section 4.1.3.

Lemma 4.1 With probability $1 - O(\rho)$, the first two centers $\hat{c}_1, \hat{c}_2$ lie in the cores of different clusters, that is, $\Pr\bigl[\bigcup_{i \neq j} (\hat{c}_1 \in X_i^{cor} \text{ and } \hat{c}_2 \in X_j^{cor})\bigr] = 1 - O(\rho)$.

Proof: The key observation is that for any pair of distinct clusters $X_i, X_j$, the 2-means separation condition holds, that is, $\Delta_2^2(X_i \cup X_j) = \Delta_1^2(X_i) + \Delta_1^2(X_j) \le \epsilon^2\Delta_1^2(X_i \cup X_j)$. This is because
\[
\Delta_{k-1}^2(X) \le \sum_{\ell \neq i,j} \Delta_1^2(X_\ell) + \Delta_1^2(X_i \cup X_j) = \Delta_k^2(X) + \bigl(\Delta_1^2(X_i \cup X_j) - \Delta_2^2(X_i \cup X_j)\bigr).
\]


So $\Delta_1^2(X_i \cup X_j) - \Delta_2^2(X_i \cup X_j) \ge \bigl(\frac{1}{\epsilon^2} - 1\bigr)\Delta_k^2(X) \ge \bigl(\frac{1}{\epsilon^2} - 1\bigr)\Delta_2^2(X_i \cup X_j)$. So using Lemma 3.2 we obtain that $\sum_{x \in X_i^{cor},\, y \in X_j^{cor}} \|x - y\|^2 = \bigl(1 - O(\rho)\bigr)\sum_{\{x,y\} \subseteq X_i \cup X_j} \|x - y\|^2$. Summing over all pairs $i, j$ yields the lemma.

Now inductively suppose that the first $i$ centers picked, $\hat{c}_1, \ldots, \hat{c}_i$, lie in the cores of clusters $X_{j_1}, \ldots, X_{j_i}$. We show that conditioned on this event, center $\hat{c}_{i+1}$ lies in the core of some cluster $X_\ell$ where $\ell \notin \{j_1, \ldots, j_i\}$ with probability $1 - O(\rho)$. Given a set $S$ of points, we use $d(x, S)$ to denote $\min_{y \in S} \|x - y\|$.

Lemma 4.2 $\Pr\bigl[\hat{c}_{i+1} \in \bigcup_{\ell \notin \{j_1,\ldots,j_i\}} X_\ell^{cor} \,\bigm|\, \hat{c}_1, \ldots, \hat{c}_i \text{ lie in the cores of } X_{j_1}, \ldots, X_{j_i}\bigr] = 1 - O(\rho)$.

Proof: For notational convenience, re-index the clusters so that $\{j_1, \ldots, j_i\} = \{1, \ldots, m\}$. Let $\hat{C} = \{\hat{c}_1, \ldots, \hat{c}_i\}$. For any cluster $X_j$, let $p_j \in \{1, \ldots, i\}$ be the index such that $d(c_j, \hat{C}) = \|c_j - \hat{c}_{p_j}\|$. Let $A = \sum_{j=m+1}^{k} \sum_{x \in X_j^{cor}} d(x, \hat{C})^2$, and $B = \sum_{j=1}^{k} \sum_{x \in X_j} d(x, \hat{C})^2$. Observe that the probability of the event stated in the lemma is exactly $A/B$. Let $\alpha$ denote the maximum over all $j \ge m+1$ of the quantity $\max_{x \in X_j^{cor}} \|x - c_j\| / d(c_j, \hat{C})$. For any point $x \in X_j^{cor}$, $j \ge m+1$, we have $d(x, \hat{C}) \ge (1-\alpha)d(c_j, \hat{C})$. Note that by Lemma 3.1, $\alpha \le \frac{\epsilon/\sqrt{\rho(1-\epsilon^2)}}{1 - \epsilon/\sqrt{\rho(1-\epsilon^2)}} \le \frac{2\epsilon}{\sqrt{\rho(1-\epsilon^2)}} < 1$ for a small enough $\epsilon$. Therefore,
\[
A = \sum_{j=m+1}^{k} \sum_{x \in X_j^{cor}} d(x, \hat{C})^2 \ge \sum_{j=m+1}^{k} (1-\rho)(1-\alpha)^2 n_j d(c_j, \hat{C})^2 \ge (1-\rho-2\alpha)\sum_{j=m+1}^{k} n_j d(c_j, \hat{C})^2.
\]

On the other hand, for any point $x \in X_j$, $j = 1, \ldots, k$, we have $d(x, \hat{C}) \le \|x - \hat{c}_{p_j}\|$. Also note that for $j = 1, \ldots, m$, $\hat{c}_{p_j}$ lies in $X_j^{cor}$, so $\|c_j - \hat{c}_{p_j}\| \le \frac{r_j}{\sqrt{\rho}}$. Therefore,
\[
B \le \sum_{j=1}^{k} \sum_{x \in X_j} \|x - \hat{c}_{p_j}\|^2 = \sum_{j=1}^{k} \bigl(\Delta_1^2(X_j) + n_j\|c_j - \hat{c}_{p_j}\|^2\bigr) \le \Bigl(1 + \frac{1}{\rho}\Bigr)\Delta_k^2(X) + \sum_{j=m+1}^{k} n_j d(c_j, \hat{C})^2.
\]

Finally, for any $j = m+1, \ldots, k$, if we assign all the points in cluster $X_j$ to the point $\hat{c}_{p_j}$, then the increase in cost is exactly $n_j\|c_j - \hat{c}_{p_j}\|^2$ and at least $\Delta_{k-1}^2(X) - \Delta_k^2(X)$. Therefore $\bigl(\frac{1}{\epsilon^2} - 1\bigr)\Delta_k^2(X) \le n_j d(c_j, \hat{C})^2$ for any $j = m+1, \ldots, k$, and $B \le \bigl(1 + \frac{(1+1/\rho)\epsilon^2}{1-\epsilon^2}\bigr)\sum_{j=m+1}^{k} n_j d(c_j, \hat{C})^2$. Comparing with $A$ and plugging in the value of $\alpha$, we get that $A = \bigl(1 - O(\rho + \epsilon/\sqrt{\rho})\bigr)B$. If we set $\rho = \Omega(\epsilon^{2/3})$, we obtain $A/B = 1 - O(\rho)$.

Next, we analyze the case when more than $k$ points are sampled. Let $\rho_1 = \rho^3$. Define the outer core of $X_i$ to be $X_i^{out} = \bigl\{x \in X_i : \|x - c_i\|^2 \le \frac{r_i^2}{\rho_1}\bigr\}$. Note that $X_i^{cor} \subseteq X_i^{out}$. Let $N = \frac{2k}{1-5\rho} + \frac{2\ln(2/\delta)}{(1-5\rho)^2}$, where $0 < \delta < 1$ is a desired error tolerance. We prove in Lemma 4.3 that at every sampling step, there is a constant probability that the sampled point lies in the core of some cluster whose outer core does not contain a previously sampled point. The crucial difference between this lemma and Lemma 4.2 is that Lemma 4.2 only shows that the "good" event happens conditioned on the fact that previous samples were also "good", whereas here we give an unconditional bound. Using this, Lemma 4.4 shows that if we sample $N$ points from $X$, then with some constant probability, each outer core $X_i^{out}$ will contain a sampled point. The proof is based on a straightforward martingale analysis.

Lemma 4.3 Suppose that we have sampled $i$ points $\{\hat{x}_1, \ldots, \hat{x}_i\}$ from $X$. Let $X_1, \ldots, X_m$ be all the clusters whose outer cores contain some sampled point $\hat{x}_j$. Then $\Pr[\hat{x}_{i+1} \in \bigcup_{j=m+1}^{k} X_j^{cor}] \ge 1 - 5\rho$.

Proof: For $i = 0, 1$ this follows from Lemma 4.1. We mimic the proof of Lemma 4.2. Let $\hat{C} = \{\hat{x}_1, \ldots, \hat{x}_i\}$. We have $X_j^{out} \cap \hat{C} \neq \emptyset$ for $j = 1, \ldots, m$ and $X_j^{out} \cap \hat{C} = \emptyset$ for $j = m+1, \ldots, k$. Let $\alpha$ denote the maximum over all $j \ge m+1$ of the quantity $(\max_{x \in X_j^{cor}} \|x - c_j\|)/d(c_j, \hat{C})$. Here we have $\alpha \le \sqrt{\rho_1/\rho} < 1$. Then for any point $x \in X_j^{cor}$, $j \ge m+1$, we have $d(x, \hat{C}) \ge (1-\alpha)d(c_j, \hat{C})$ and as in Lemma 4.2, $A = \sum_{j=m+1}^{k} \sum_{x \in X_j^{cor}} d(x, \hat{C})^2 \ge (1-\rho-2\alpha)\sum_{j=m+1}^{k} n_j d(c_j, \hat{C})^2$. On the other hand, again arguing as in Lemma 4.2, we have $B = \sum_{j=1}^{k} \sum_{x \in X_j} d(x, \hat{C})^2 \le \bigl(1 + \frac{(1+1/\rho_1)\epsilon^2}{1-\epsilon^2}\bigr)\sum_{j=m+1}^{k} n_j d(c_j, \hat{C})^2$. Therefore $A/B \ge 1 - \bigl(\rho + 2\sqrt{\rho_1/\rho} + \frac{\epsilon^2}{\rho_1} + \epsilon^2\bigr)$. Since $\rho_1 = \rho^3$, taking $\rho = \sqrt{\epsilon}$ gives $A/B \ge 1 - 5\rho$.

Lemma 4.4 Suppose we sample $N$ points $\hat{x}_1, \ldots, \hat{x}_N$ from $X$ using the above sampling procedure. Then $\Pr[\forall j = 1, \ldots, k,\ \text{there exists some } \hat{x}_i \in X_j^{out}] \ge 1 - \delta$.

Proof: Let $Y_t$ be a random variable that denotes the number of clusters that do not contain a sampled point in their outer cores, after $t$ points have been sampled. We want to bound $\Pr[Y_N > 0]$. Consider the following random walk on the line with $W_t$ denoting the (random) position after $t$ time steps: $W_0 = k$, and $W_{t+1} = W_t$ with probability $5\rho$ and $W_t - 1$ with probability $1-5\rho$. Notice that $\Pr[Y_N > 0] \le \Pr[W_N > 0]$, because as long as $W_t > 0$, any outcome that leads to a left move in the random walk can be mapped to an outcome (in the probability space corresponding to the sampling process) where the outer core of a new cluster is hit by the currently sampled point. So we bound $\Pr[W_N > 0]$. Define $Z_t = W_t + t(1-5\rho)$. Then $\mathrm{E}\bigl[Z_{t+1} \mid Z_1, \ldots, Z_t\bigr] \le Z_t$, so $Z_0, Z_1, \ldots$ forms a supermartingale. Clearly $|Z_{t+1} - Z_t| \le 1$ for all $t$. So by Azuma's inequality (see, e.g., [39]), $\Pr[Z_N - Z_0 > \sqrt{2N\ln(2/\delta)}] \le \delta$, which implies that $W_N \le k + \sqrt{2N\ln(2/\delta)} - N(1-5\rho)$ with probability at least $1-\delta$. Plugging in the value of $N$ shows that $N(1-5\rho) - \sqrt{2N\ln(2/\delta)} \ge k$.
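For completeness, here is a short check (not spelled out in the text) that the stated value of $N$ satisfies this last inequality. Writing $L = \ln(2/\delta)$, we have $N(1-5\rho) = 2k + \frac{2L}{1-5\rho}$ and
\[
2NL = \frac{4kL}{1-5\rho} + \frac{4L^2}{(1-5\rho)^2} \le \Bigl(k + \frac{2L}{1-5\rho}\Bigr)^2,
\]
so $\sqrt{2NL} \le k + \frac{2L}{1-5\rho}$ and hence $N(1-5\rho) - \sqrt{2N\ln(2/\delta)} \ge 2k + \frac{2L}{1-5\rho} - \bigl(k + \frac{2L}{1-5\rho}\bigr) = k$.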

Corollary 4.5 (i) If we sample $k$ points $\hat{c}_1, \ldots, \hat{c}_k$, then with probability $(1 - O(\rho))^k$, where $\rho = \Omega(\epsilon^{2/3})$, for each $i$ there is a distinct center $\hat{c}_i \in X_i^{cor}$, that is, $\|\hat{c}_i - c_i\| \le r_i/\sqrt{\rho}$.

(ii) If we sample $N$ points $\hat{x}_1, \ldots, \hat{x}_N$, where $N = \frac{2k}{1-5\rho} + \frac{2\ln(2/\rho)}{(1-5\rho)^2}$ and $\rho = \sqrt{\epsilon}$, then with probability $1 - O(\rho)$, for each $i$ there is a distinct point $\hat{x}_i \in X_i^{out}$, that is, $\|\hat{x}_i - c_i\| \le r_i/\sqrt{\rho^3}$.

4.1.2 Greedy deletion procedure

We maintain a set of centers $\hat{C}$ that are currently used to cluster $X$. For any point $x \in \mathbb{R}^d$, let $R(x) \subseteq X$ denote the points of $X$ in the Voronoi region of $x$ (given the set of centers $\hat{C}$). We refer to $R(x)$ as the Voronoi set of $x$. Initialize $\hat{C} \leftarrow X$. Repeat the following steps until $|\hat{C}| = k$.

B1. Compute $T$ = cost of clustering $X$ around the centers in $\hat{C}$ = $\sum_{x \in \hat{C}} \sum_{y \in R(x)} \|y - x\|^2$. Also for every $x \in \hat{C}$, compute $T_x$ = cost of clustering $X$ around $\hat{C} \setminus \{x\}$ = $\sum_{z \in \hat{C} \setminus \{x\}} \sum_{y \in R_{-x}(z)} \|y - z\|^2$, where $R_{-x}(z)$ denotes the Voronoi set of $z$ given the center set $\hat{C} \setminus \{x\}$.

B2. Pick the center $y \in \hat{C}$ for which $T_y - T$ is minimum and set $\hat{C} \leftarrow \hat{C} \setminus \{y\}$.

B3. Recompute the Voronoi sets $R(x) = R_{-y}(x) \subseteq X$ for each (remaining) center $x \in \hat{C}$. Now we "move" the centers to the centroids of their respective (new) Voronoi sets, that is, for every set $R(x)$, we update $\hat{C} \leftarrow \hat{C} \setminus \{x\} \cup \{\mathrm{ctr}(R(x))\}$.

Running time. There are $n-k$ iterations of the B1-B3 loop. Each iteration takes $O(n^2 d)$ time: computing $T$ and the sets $R(x)$ for each $x$ takes $O(n^2 d)$ time, and we can then compute each $T_x$ in $O(|R(x)|d)$ time (since while computing $T$, we can also compute for each point its second-nearest center in $\hat{C}$). Therefore the overall running time is $O(n^3 d)$.
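The following is a deliberately naive sketch of the deletion procedure B1-B3 (our own code; it recomputes assignments from scratch for every candidate deletion and uses quadratic memory, so it does not achieve the $O(n^3 d)$ bound obtained with the bookkeeping described above, and is meant only to make the steps concrete):

```python
import numpy as np

def greedy_deletion_seed(X, k):
    """Start with every point of X as a center, repeatedly delete the center
    whose removal increases the clustering cost the least (B1-B2), and move
    the remaining centers to the centroids of their Voronoi sets (B3)."""
    X = np.asarray(X, dtype=float)
    centers = X.copy()

    def assign(cs):
        # nearest-center assignment and total squared-distance cost
        d2 = ((X[:, None, :] - cs[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        return labels, d2[np.arange(len(X)), labels].sum()

    while len(centers) > k:
        # B1/B2: minimizing T_y - T is the same as minimizing T_y, since T is fixed
        best_j, best_cost = None, np.inf
        for j in range(len(centers)):
            _, cost_without_j = assign(np.delete(centers, j, axis=0))
            if cost_without_j < best_cost:
                best_j, best_cost = j, cost_without_j
        centers = np.delete(centers, best_j, axis=0)

        # B3: move each remaining center to the centroid of its new Voronoi set
        labels, _ = assign(centers)
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers
```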


Analysis. Let $\rho$ be a parameter such that $\rho \le \frac{1}{10}$ and $\epsilon/\sqrt{\rho(1-\epsilon^2)} \le \frac{1}{14}$. Recall that $D_i = \min_{j \neq i} \|c_j - c_i\|$. Define $d_i^2 = \Delta_k^2(X)/n_i$. We will use a different notion of a cluster-core here, but the notion will still capture the fact that the core consists of points that are quite close to the cluster-center compared to the inter-cluster distance, and contains most of the mass of the cluster. Let $B(x, r) = \{y \in \mathbb{R}^d : \|x - y\| \le r\}$ denote the ball of radius $r$ centered at $x$. Define the kernel of $X_i$ to be the ball $Z_i = B(c_i, d_i/\sqrt{\rho})$ and the core of $X_i$ as $X_i^{cor} = X_i \cap Z_i$. Observe that $r_i \le d_i$, so by Markov's inequality $|X_i^{cor}| \ge (1-\rho)n_i$. Also, since $\Delta_{k-1}^2(X) - \Delta_k^2(X) \le n_i D_i^2$, we have that $d_i^2 \le D_i^2 \cdot \frac{\epsilon^2}{1-\epsilon^2}$. Therefore, $X_i^{cor} = X \cap Z_i$. We prove that, at the start of every iteration, for every $i$, there is a (distinct) center $x \in \hat{C}$ that lies in $Z_i$. (*) Clearly (*) holds at the beginning, since $\hat{C} = X$ and $X_i^{cor} \neq \emptyset$ for every cluster $X_i$. First we show (Lemma 4.6) that if $x \in \hat{C}$ is the only center that lies in a slightly enlarged version of the ball $Z_i$ for some $i$, then $x$ is not deleted. Lemma 4.7 then makes the crucial observation that even after a center $y$ is deleted, if the new Voronoi region $R_{-y}(x)$ of a center $x \in \hat{C}$ captures points from $X_i^{cor}$, then $R_{-y}(x)$ cannot "extend" too far into some other cluster $X_j$, that is, for $x' \in R_{-y}(x) \cap X_j$ where $j \neq i$, $\|x' - c_i\|$ is not much larger than $\|x' - c_j\|$. It will then follow that invariant (*) is maintained.

Lemma 4.6 Suppose (*) holds at the start of an iteration, and $x \in \hat{C}$ is the only center in $B\bigl(c_i, \frac{4d_i}{\sqrt{\rho}}\bigr)$ for some cluster $X_i$; then $x \in \hat{C}$ after step B2.

Proof: Since property (*) holds, we also know that $x \in Z_i$ and so $X_i^{cor} \subseteq R(x) \cap X_i$. If $x$ is deleted in step B2 then all points in $X_i^{cor}$ will be reassigned to a center at least $\frac{4d_i}{\sqrt{\rho}}$ away from $c_i$. So the cost-increase $T_x - T$ is at least $A = \frac{5(1-\rho)}{\rho} \cdot n_i d_i^2 = \frac{5(1-\rho)}{\rho} \cdot \Delta_k^2(X)$. Now since $|\hat{C}| > k$, there is some $j$ ($j$ could be $i$) such that the Voronoi region of $c_j$ (with respect to the optimal center-set) contains at least two centers from $\hat{C}$. We will show that deleting one of these centers will be less expensive than deleting $x$. Let $z_\ell \in \hat{C}$ be the center closest to $c_\ell$ for $\ell = 1, \ldots, k$. Note that $z_\ell \in Z_\ell$. Let $y \in \hat{C}$, $y \neq z_j$, be another center in the Voronoi region of $c_j$. Suppose we delete $y$. We can upper bound the cost-increase $T_y - T$ by the cost-increase due to the reassignment where we assign all points in $R(y) \cap X_\ell$ to $z_\ell$ for $\ell = 1, \ldots, k$. For any $x' \in R(y) \cap X_\ell$ we have $\|x' - z_\ell\| \le \|x' - c_\ell\| + \|c_\ell - z_\ell\| \le \|x' - c_\ell\| + \frac{d_\ell}{\sqrt{\rho}}$. For $\ell \neq j$, we also have $D_\ell \le \|c_j - y\| + \|y - c_\ell\| \le 2\|y - c_\ell\|$ since $c_j$ is closer to $y$ than $c_\ell$, and
\[
\|y - c_\ell\| \le \|x' - c_\ell\| + \|x' - y\| \le \|x' - c_\ell\| + \|x' - z_\ell\| \le 2\|x' - c_\ell\| + \|c_\ell - z_\ell\| \le 2\|x' - c_\ell\| + \frac{d_\ell}{\sqrt{\rho}}.
\]
Therefore, $D_\ell \le 4\|x' - c_\ell\| + \frac{2d_\ell}{\sqrt{\rho}}$, which implies that $\bigl(\frac{\sqrt{1-\epsilon^2}}{\epsilon} - \frac{2}{\sqrt{\rho}}\bigr)d_\ell \le 4\|x' - c_\ell\|$. Combining this with the bound on $\|x' - z_\ell\|$, we get that for $\ell \neq j$, $\|x' - z_\ell\| \le \beta\|x' - c_\ell\|$ where $\beta = 1 + \frac{4\epsilon}{\sqrt{\rho(1-\epsilon^2)} - 2\epsilon}$. Hence, the cost-increase of the reassignment is at most
\[
B = \sum_{\ell=1}^{k} \sum_{x' \in R(y) \cap X_\ell} \|x' - z_\ell\|^2 \le \sum_{x' \in R(y) \cap X_j} 2\Bigl(\|x' - c_j\|^2 + \frac{d_j^2}{\rho}\Bigr) + \sum_{\ell \neq j} \sum_{x' \in R(y) \cap X_\ell} \beta^2\|x' - c_\ell\|^2 \le \max(2, \beta^2)\Delta_k^2(X) + \frac{2}{\rho} \cdot n_j d_j^2 = \Bigl(\max(2, \beta^2) + \frac{2}{\rho}\Bigr)\Delta_k^2(X).
\]
Any $\rho$ satisfying the bounds stated in Section 4.1.2 ensures that $A > B$ (since $\beta < \frac{4}{3}$ and $\rho < \frac{3}{7}$). Thus, $x$ is not the cheapest center to delete, which completes the proof.


Lemma 4.7 Suppose center $y \in \hat{C}$ is deleted in step B2. Let $x \in \hat{C} \setminus \{y\}$ be such that $R_{-y}(x) \cap X_j^{cor} \neq \emptyset$ for some $j$. Then for any $x' \in R_{-y}(x) \cap X_\ell$, $\ell \neq j$, we have $\|x' - c_j\| \le \|x' - c_\ell\| + \frac{\max(d_\ell + 6d_j,\ 4d_\ell + 3d_j)}{\sqrt{\rho}}$.

Proof: Suppose that $y$ lies in the Voronoi region of center $c_i$ (wrt. the optimal centers). Let $\hat{C}' = \hat{C} \setminus \{y\}$. There must be a center $z_i \in \hat{C}'$ such that $\|z_i - c_i\| \le \frac{4d_i}{\sqrt{\rho}}$. If $y \notin Z_i$, this follows from property (*); otherwise this follows from Lemma 4.6. For any $\ell \neq i$, we know by property (*) that there is some center $z_\ell \in \hat{C}'$ that lies in $Z_\ell$. Let $x''$ be a point in $R_{-y}(x) \cap X_j^{cor}$. Then,
\[
\|x - c_j\| \le \|x - x''\| + \|x'' - c_j\| \le \|x'' - z_j\| + \|x'' - c_j\| \le \|z_j - c_j\| + 2\|x'' - c_j\| \le \|z_j - c_j\| + \frac{2d_j}{\sqrt{\rho}}.
\]
Now considering the point $x'$, we have
\[
\|x' - c_j\| \le \|x' - x\| + \|x - c_j\| \le \|x' - z_\ell\| + \|x - c_j\| \le \|x' - c_\ell\| + \|z_\ell - c_\ell\| + \|x - c_j\| \le \|x' - c_\ell\| + \|z_\ell - c_\ell\| + \|z_j - c_j\| + \frac{2d_j}{\sqrt{\rho}}.
\]
If $j = i$, then we get that $\|x' - c_i\| \le \|x' - c_\ell\| + \frac{d_\ell + 6d_i}{\sqrt{\rho}}$. For any other $j$, we have that $\|x' - c_j\| \le \|x' - c_\ell\| + \frac{4d_\ell + 3d_j}{\sqrt{\rho}}$ (since it could be that $\ell = i$).

Lemma 4.8 Suppose that property (*) holds at the beginning of some iteration in the deletion phase. Then (*) also holds at the end of the iteration, i.e., after step B3.

Proof: Suppose that we delete center $y \in \hat{C}$ that lies in the Voronoi region of center $c_i$ (wrt. the optimal centers) in step B2. Let $\hat{C}' = \hat{C} \setminus \{y\}$ and $R'(x) = R_{-y}(x)$ for any $x \in \hat{C}'$. Fix a cluster $X_j$. Let $S = \{x \in \hat{C}' : R'(x) \cap X_j^{cor} \neq \emptyset\}$ and $Y = \bigcup_{x \in S} R'(x)$. We show that there is some set $R'(x)$, $x \in \hat{C}'$, whose centroid $\mathrm{ctr}(R'(x))$ lies in the ball $Z_j$, which will prove the lemma. By Lemma 4.7 and noting that $d_\ell^2 \le \frac{\epsilon^2}{1-\epsilon^2} \cdot D_\ell^2$ for every $\ell$, for any $x' \in Y \cap X_\ell$ where $\ell \neq j$, we have $\|x' - c_j\| \le \|x' - c_\ell\| + \frac{\epsilon}{\sqrt{\rho(1-\epsilon^2)}} \cdot \max(D_\ell + 6D_j,\ 4D_\ell + 3D_j)$. Also $D_j, D_\ell \le \|c_j - c_\ell\| \le \|x' - c_j\| + \|x' - c_\ell\|$. Substituting for $D_j, D_\ell$ we get that $\|x' - c_j\| \le \beta\|x' - c_\ell\|$ where $\beta = \frac{1 + 7\epsilon/\sqrt{\rho(1-\epsilon^2)}}{1 - 7\epsilon/\sqrt{\rho(1-\epsilon^2)}}$. Using this we obtain that $A = \sum_{x' \in Y} \|x' - c_j\|^2 \le \beta^2 \sum_{\ell=1}^{k} \sum_{x' \in Y \cap X_\ell} \|x' - c_\ell\|^2 \le \beta^2\Delta_k^2(X)$. We also have
\[
A = \sum_{x \in S} \sum_{x' \in R'(x)} \|x' - c_j\|^2 = \sum_{x \in S} \bigl(\Delta_1^2(R'(x)) + |R'(x)|\,\|\mathrm{ctr}(R'(x)) - c_j\|^2\bigr) \ge |Y| \min_{x \in S} \|\mathrm{ctr}(R'(x)) - c_j\|^2.
\]
Since $X_j^{cor} \subseteq Y$ we have $|Y| \ge (1-\rho)n_j$, so we obtain that $\min_{x \in S} \|\mathrm{ctr}(R'(x)) - c_j\| \le \frac{\beta}{\sqrt{1-\rho}} \cdot d_j$. The bounds on $\rho$ ensure that $\frac{\rho\beta^2}{1-\rho} \le 1$, so that $\min_{x \in S} \|\mathrm{ctr}(R'(x)) - c_j\| \le \frac{d_j}{\sqrt{\rho}}$.

Corollary 4.9 After the deletion phase, for every $i$, there is a center $\hat{c}_i \in \hat{C}$ with $\|\hat{c}_i - c_i\| \le \frac{\epsilon}{\sqrt{\rho(1-\epsilon^2)}} \cdot D_i$.
