
2.4.5 Separation Conditions and Well-Clusterable Data

Clustering suffers from a general gap between theoretical study and practical application: clustering objectives are usually NP-hard to optimize, and even NP-hard to approximate to arbitrary precision. On the other hand, heuristics like Lloyd’s algorithm, which can produce arbitrarily bad solutions, are known to work well or reasonably well in practice.

One way of interpreting this situation is that data often has properties that make the problem computationally easier. Indeed, for clustering it is very natural to assume that the data has some structure – otherwise, what do we hope to achieve with our clustering?

The challenge is to find good measures of structure that characterize what makes clustering easy (but non-trivial).

Many notions of clusterability have been introduced in the literature, and there are also different ways to measure the quality of a clustering. While traditionally a clustering is evaluated on the basis of an objective function (e.g., the k-means objective function), there has recently been increased interest in studying which notions of clusterability make it feasible to (partially) recover a target clustering, some true clustering of the data. For this, the niceness conditions imposed on the input data are usually some form of separation condition on the clusters of the target clustering. We study the effect of five well-studied clusterability notions on the quality of the solution computed by Ward’s method.

δ-center separation and α-center proximity. First we study the notions of δ-center separation and α-center proximity, which have been introduced by Ben-David and Haghtalab [13] and by Awasthi, Blum, and Sheffet [6], respectively.

Definition 2.35 ([13]). An input P ⊂ Rd satisfies δ-center separation with respect to some target clustering C1, . . . , Ck if there exist centers c1, . . . , ck ∈ Rd such that ||cj − ci|| ≥ δ · max_{ℓ∈[k]} max_{x∈Cℓ} ||x − cℓ|| for all i ≠ j. We say the input satisfies weak δ-center separation if for each cluster Cj with j ∈ [k] and for all i ≠ j, ||cj − ci|| ≥ δ · max_{x∈Cj} ||x − cj||.
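To make the condition concrete, here is a minimal sketch of a checker for both variants of Definition 2.35, assuming the clustering is given as a list of numpy arrays (one (n_i, d) array per cluster) and the centers as a (k, d) array; the function names are ours, chosen for illustration.

```python
import numpy as np

def radius(cluster, center):
    """max_{x in C} ||x - c||: the radius of a cluster around its center."""
    return np.max(np.linalg.norm(cluster - center, axis=1))

def delta_center_separation(clusters, centers, delta):
    """Definition 2.35: every pair of distinct centers is at least
    delta times the largest cluster radius apart."""
    max_radius = max(radius(C, c) for C, c in zip(clusters, centers))
    k = len(centers)
    return all(np.linalg.norm(centers[i] - centers[j]) >= delta * max_radius
               for i in range(k) for j in range(i + 1, k))

def weak_delta_center_separation(clusters, centers, delta):
    """Weak variant: center c_j is delta times the radius of its own
    cluster C_j away from every other center."""
    k = len(centers)
    return all(np.linalg.norm(centers[j] - centers[i]) >= delta * radius(C, centers[j])
               for j, C in enumerate(clusters) for i in range(k) if i != j)
```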

Kushagra, Samadi, and Ben-David [36] show that single linkage and a pruning technique are sufficient to find the target clustering under the condition that the data satisfies δ-center separation for δ ≥ 3.

While the goal of Ben-David and Haghtalab [13] is to recover a target clustering, we focus on approximating the k-means objective function. Hence, in the following we will always assume that the target clustering C1, . . . , Ck is an optimal k-means clustering (which we usually denote by O1, . . . , Ok) and that the centers c1, . . . , ck ∈ Rd are the optimal k-means centers for this clustering. We will also make this assumption for all other notions of clusterability that are based on a target clustering and that we introduce in the following.

Definition 2.36 ([6]). An instance P satisfies α-center proximity if there exists an optimal k-means clustering O1, . . . , Ok with centers c1, . . . , ck ∈ Rd such that for all j ≠ i, i, j ∈ [k], and for any point x ∈ Oi it holds that ||x − cj|| ≥ α · ||x − ci||.
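A checker for α-center proximity follows the same pattern; again only a sketch, under the same input conventions as above.

```python
import numpy as np

def alpha_center_proximity(clusters, centers, alpha):
    """Definition 2.36: for every x in O_i and every j != i,
    ||x - c_j|| >= alpha * ||x - c_i||."""
    for i, O in enumerate(clusters):
        own = np.linalg.norm(O - centers[i], axis=1)       # ||x - c_i||
        for j in range(len(centers)):
            if j != i:
                other = np.linalg.norm(O - centers[j], axis=1)  # ||x - c_j||
                if np.any(other < alpha * own):
                    return False
    return True
```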

Awasthi, Blum, and Sheffet [6] introduced the notion of α-perturbation resilience and showed that it implies α-center proximity. They show that for α ≥ 3, the optimal clustering can be recovered if the data is α-perturbation resilient. This was improved by Balcan and Liang [10] and finally by Makarychev and Makarychev [42], who show that it is possible to completely recover the optimal clustering for α = 2. The latter paper shows that the results even hold for a weaker property called metric perturbation resilience. We show that for large enough δ and α, Ward’s method computes a 2-approximation if the data satisfies δ-center separation or α-center proximity.

Theorem 2.37. Let P ⊂ Rd be an instance that satisfies weak (2 + 2√2 + ε)-center separation or (3 + 2√2 + ε)-center proximity for some ε > 0 and some number k of clusters. Then the k-clustering computed by Ward on P is a 2-approximation with respect to the k-means objective function.

We also show that on instances that satisfy (2 + 2√(2ν) + ε)-center separation and for which all clusters Oi and Oj in the optimal clustering satisfy |Oj| ≥ |Oi|/ν, Ward even recovers the optimal clustering.

It is interesting to note that the example proposed by Arthur and Vassilvitskii [5], which shows that the famous k-means++ algorithm has an approximation ratio of Ω(log k), satisfies δ-center separation and α-center proximity for large values of δ and α, and has balanced clusters, i.e., ν = 1.

Observation 2.38. There is a family of examples where k-means++ has an expected approximation ratio of Ω(log k), while Ward computes an optimal solution.

In contrast, we will see that the instances that we use to prove our exponential lower bound on the approximation factor of Ward’s method (Theorem 2.19) satisfy δ-center separation and α-center proximity for δ ≤ 1 + √2 and α ≤ 1 + √2. We will also see that even for arbitrarily large δ and α there are instances that satisfy δ-center separation and α-center proximity on which Ward’s method does not compute an optimal solution.

Strict separation property. Balcan, Blum, and Vempala [9] introduce the strict separation property. In [1], this property is defined as follows¹.

¹ In [9], the property is defined for similarity measures.

Figure 2.12: Redrawing of Figure 16 in [11]. There are three groups of points A, B, and C of sizes 4m, m, and m at the locations a, b, and c, respectively; the distance between a and b is 5 and the distance between b and c is 6. The total number of points is n = 6m.

Definition 2.39 ([1, 9]). An instance P with target clustering C1, . . . , Ck satisfies the strict separation property if for all x, y ∈ Ci, i ∈ [k], and every z ∈ Cj for j ∈ [k] with j ≠ i, ||x − y|| < ||y − z||. It satisfies ν-strict separation if there is a subset of P of size at least (1 − ν)|P| for which the property is satisfied.
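A direct, quadratic-time check of Definition 2.39 could look as follows (a sketch, using the same input conventions as the earlier ones):

```python
import numpy as np

def strict_separation(clusters):
    """Definition 2.39: every point y is strictly closer to all points
    of its own cluster than to any point of another cluster."""
    if len(clusters) < 2:
        return True
    for i, Ci in enumerate(clusters):
        rest = np.vstack([C for j, C in enumerate(clusters) if j != i])
        for y in Ci:
            intra = np.max(np.linalg.norm(Ci - y, axis=1))    # farthest same-cluster point
            inter = np.min(np.linalg.norm(rest - y, axis=1))  # nearest foreign point
            if intra >= inter:
                return False
    return True
```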

Balcan et al. show that if the clusters in the target clustering are of sufficient size and the instance is ν-strictly separated, then one can correctly classify all but νn points with respect to the target clustering. Indeed, if all points satisfy the property, then the target clustering can be completely recovered.

Balcan, Liang, and Gupta [11] have already studied Ward’s method on strictly separated instances. They present the instance shown in Figure 2.12 (Figure 16 in [11]). Given the target clustering A ∪ B, C, the example clearly satisfies strict separation for all points.

However, Ward will compute the clustering A, B ∪ C: it will start by merging all points at the same location, resulting in three weighted points pa, pb, and pc at a, b, and c. But then it will merge pb and pc because this is cheaper than merging pa and pb: the former merge costs (m/2) · 6² = 18m, while the alternative merge costs (4m/5) · 5² = 20m. The resulting clustering is then judged to be very bad since it misclassifies m = n/6 points.
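These numbers are instances of Ward’s merge cost D(A, B) = (|A||B|/(|A| + |B|)) · ||µ(A) − µ(B)||². The following sketch reproduces the two candidate merges numerically; placing a, b, and c on a line is our own choice here, since only the pairwise distances 5 and 6 matter.

```python
import numpy as np

def ward_merge_cost(size_a, mu_a, size_b, mu_b):
    """Increase in k-means cost when merging clusters A and B:
    |A||B| / (|A| + |B|) * ||mu(A) - mu(B)||^2."""
    diff = np.asarray(mu_a, dtype=float) - np.asarray(mu_b, dtype=float)
    return size_a * size_b / (size_a + size_b) * np.dot(diff, diff)

m = 1.0
a, b, c = np.array([0.0]), np.array([5.0]), np.array([11.0])  # d(a,b)=5, d(b,c)=6
print(ward_merge_cost(m, b, m, c))       # (m/2) * 36 = 18m -> Ward merges B and C
print(ward_merge_cost(4 * m, a, m, b))   # (4m/5) * 25 = 20m
```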

Now one can argue that this judgment is overly critical because Ward actually achieves its design goal on this instance: it computes a solution with minimal k-means cost. Hence, in this thesis we study the behavior of Ward when the target clustering is an optimal k-means clustering. We will see that also in this case the strict separation property does not help Ward to compute a good clustering.

ORSS-separation. Another line of work on niceness conditions for clustering investigates conditions that help to find a low-cost clustering with respect to the k-means objective function, usually a (1 + ε)-approximation. For this area, conditions that ensure a cost separation between different solutions are helpful. We will see that the strongest among these conditions, namely ε-separation [44] (we will also use the term ε-ORSS-separation), does not help Ward to avoid the worst-case example from Theorem 2.19.

Definition 2.40 ([44]). An instance P satisfies the ε-ORSS-separation property for some number of clusters k if optk(P)/optk−1(P) ≤ ε².

Ostrovsky et al. [44] show that a variant of Lloyd’s method with the right seeding computes a (1 + ε)-approximation for the k-means problem on instances that satisfy ε-ORSS-separation.
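Since optk(P) is NP-hard to compute, Definition 2.40 cannot be checked exactly in practice; the following sketch uses the inertia of scikit-learn’s KMeans as a stand-in for the optimal costs and is therefore only a heuristic proxy, not a certificate.

```python
from sklearn.cluster import KMeans

def orss_separated_proxy(P, k, eps):
    """Heuristic check of eps-ORSS-separation (Definition 2.40).
    KMeans inertia only over-estimates opt_k(P) and opt_{k-1}(P),
    so the returned verdict is approximate."""
    cost_k = KMeans(n_clusters=k, n_init=10).fit(P).inertia_
    cost_k_minus_1 = KMeans(n_clusters=k - 1, n_init=10).fit(P).inertia_
    return cost_k / cost_k_minus_1 <= eps ** 2
```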

AS-separation. For the following separation condition, it is convenient to denote the point set by a matrix A, where row i contains the point Ai. Let C1, . . . , Ck be a target clustering for A and let µ1, . . . , µk be the corresponding centroids, i.e., µi = µ(Ci). Assume that C ∈ Rn×d is the matrix whose ith row contains the centroid of the cluster that Ai belongs to. Then ||A − C||F² is the k-means cost of the clustering C1, . . . , Ck, where || · ||F denotes the Frobenius norm. Let || · || denote the spectral norm.
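The identity between ||A − C||F² and the k-means cost is easy to verify numerically; a small sketch (representing the clustering by a label vector is our own convention):

```python
import numpy as np

def kmeans_cost(A, labels):
    """||A - C||_F^2, where row i of C is the centroid of the cluster
    that A_i belongs to; this equals the k-means cost."""
    C = np.empty_like(A, dtype=float)
    for ell in np.unique(labels):
        mask = labels == ell
        C[mask] = A[mask].mean(axis=0)
    return np.linalg.norm(A - C, 'fro') ** 2
```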

In a seminal paper, Kumar and Kannan [35] defined a proximity condition and showed that if all points satisfy this condition, then the target clustering can be reconstructed (and if only a fraction satisfies it, then the target clustering can be mostly recovered). The proximity condition states that the projection of a point onto the line joining its cluster center µi with another cluster center µj is closer to µi than to µj by at least a value ∆ij, where ∆ij depends on the number of points in the two clusters and on k||A − C|| (and a large constant). Here, we consider the weaker center-based condition due to Awasthi and Sheffet [8], which was developed in follow-up work. We call it AS-center separation to distinguish it from the above δ-center separation.

Definition 2.41 ([8]). Let A and C be as defined above and define

∆i = (1/√|Ci|) · min{√k · ||A − C||, ||A − C||F}.

Then the instance A satisfies AS-center separation with respect to the target clustering C1, . . . , Ck if for all i ≠ j, i, j ∈ [k], it holds that

||µi − µj|| ≥ c · (∆i + ∆j),

where c is a fixed constant.
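For illustration, a sketch that evaluates Definition 2.41 with numpy; the concrete value of the constant c is left as a parameter, since we only treat it as a fixed constant here.

```python
import numpy as np

def as_center_separation(A, labels, c):
    """Definition 2.41: ||mu_i - mu_j|| >= c * (Delta_i + Delta_j), with
    Delta_i = min{sqrt(k) * ||A - C||, ||A - C||_F} / sqrt(|C_i|)."""
    ids = np.unique(labels)
    k = len(ids)
    mus = np.array([A[labels == ell].mean(axis=0) for ell in ids])
    C = mus[np.searchsorted(ids, labels)]      # row i holds A_i's centroid
    spectral = np.linalg.norm(A - C, 2)        # spectral norm ||A - C||
    frob = np.linalg.norm(A - C, 'fro')        # Frobenius norm ||A - C||_F
    sizes = np.array([np.sum(labels == ell) for ell in ids])
    delta = min(np.sqrt(k) * spectral, frob) / np.sqrt(sizes)
    return all(np.linalg.norm(mus[i] - mus[j]) >= c * (delta[i] + delta[j])
               for i in range(k) for j in range(i + 1, k))
```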

Again, if all points satisfy AS-center separation, then the target clustering can be recovered [8]. We will see that the exponential lower bound instances satisfy AS-center separation when the target clustering is the optimal k-means clustering.

Corollary 2.42. For any ε > 0, there is a family of point sets (Pd)d∈N with Pd ⊂ Rd that are ε-separated and that satisfy (1 + √2)-center separation, (1 + √2)-center proximity, the strict separation property, and the AS-center separation property, where Wardk(Pd) ∈ Ω((3/2)^d · optk(Pd)) for k = 2^d. Furthermore, for any δ > 1 and any α > 1, there exists a point set that satisfies δ-center separation and α-center proximity and for which Ward does not compute an optimal solution.