
5.2.2 Comparison to Other Clustering Objectives and Deviation Bounds

As mentioned already, the main difference between cost-based clustering and other clustering problems is that we operate on two sets $X$ and $Y$ with a cost function $c \colon X \times Y \to \mathbb{R}_+$ instead of a distance function $d \colon X \times X \to \mathbb{R}_+$ on a single space $X$. However, even if the optimal transport instance at hand takes place in $\mathbb{R}^D$ for $D \in \mathbb{N}$ with $c(x, y) = \|x - y\|^p$ as cost function, the objective functions differ from those of other clustering problems, such as $k$-means clustering. This means that while there are established algorithms for computing clusterings, they might not yield clusterings well suited for our purpose of minimizing the error bounds.

For example, most clustering algorithms aggregate points that are close together. While nearby points likely have similar distances to the points in the other set, the same might be true for points that are far from each other.

The following example illustrates an extreme case of this effect, achieved by symmetry.

Example 5.2.6. Suppose we have two sets with four points each as shown in Figure 5.1, left, both equipped with uniform distributions, and $c(x_i, y_j) = \|x_i - y_j\|^2$ for all $x_i, y_j$. Now suppose we want to find two clusters for $X$, while the trivial clustering $\{\{y_1\}, \{y_2\}, \{y_3\}, \{y_4\}\}$ of $Y$ is fixed. Most proximity-based clustering algorithms find the clusters $\{x_1, x_2\}$ and $\{x_3, x_4\}$, as indicated by the ellipses in Figure 5.1, right, which we would agree with judging by geometric intuition. However, because of the symmetry, the cost matrix

\[
C = \begin{pmatrix} 2 & 1 & 2 & 5 \\ 5 & 2 & 1 & 2 \\ 2 & 1 & 2 & 5 \\ 5 & 2 & 1 & 2 \end{pmatrix}
\]

has two pairs of identical rows: the first and third, as well as the second and fourth. This means the clustering $\{x_1, x_3\}$ and $\{x_2, x_4\}$ has a gap matrix with only zero entries, which makes it the optimal clustering with respect to the objective functions in the cost-based clustering problem. In fact, since the objective value is zero, no information is lost due to the clustering, and the optimal transport instance on the clusters has the same optimal cost as the original instance.
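This can be checked numerically. The following is a minimal sketch in Python, assuming (as described above) that the gap matrix entry of a cluster of $X$ against a point $y_l$ is the difference between the maximal and the minimal cost within the cluster; the formal definition of the gap matrix is the one given earlier in Section 5.2.

```python
import numpy as np

# Cost matrix from Example 5.2.6 (rows x_1..x_4, columns y_1..y_4).
C = np.array([[2, 1, 2, 5],
              [5, 2, 1, 2],
              [2, 1, 2, 5],
              [5, 2, 1, 2]], dtype=float)

def gap_matrix(C, clusters):
    # Entry (k, l): max minus min cost from cluster X_k to the point y_l.
    return np.array([C[idx].max(axis=0) - C[idx].min(axis=0) for idx in clusters])

geometric = [[0, 1], [2, 3]]  # {x_1, x_2}, {x_3, x_4}
symmetric = [[0, 2], [1, 3]]  # {x_1, x_3}, {x_2, x_4}

print(gap_matrix(C, geometric))  # nonzero entries: information is lost
print(gap_matrix(C, symmetric))  # all zeros, since rows 1/3 and 2/4 coincide
```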

Of course, this is a small example constructed only to show this effect, and it relies on some kind of (approximate) symmetry. Usually, geometric clustering algorithms produce clusterings that are reasonable in terms of the objective values of the gap functions in CBC. There are some situations, however, in which a symmetry effect is in place. For example, if $X$ and $Y$ are concentrated on lower-dimensional affine subspaces $A$ and $B$ of $\mathbb{R}^D$ that are orthogonal to each other, optimal clusterings have concentric spherical clusters around the intersection point between $A$ and $B$. Such clusterings are typically not found by traditional geometric clustering algorithms. This is discussed further in Section 5.5.

Figure 5.1: This example illustrates that aggregating close points can lead to a suboptimal clustering with respect to the objective functions of the cost-based clustering problem.

A different error bound is given by the Wasserstein metric if we assume a metric space $(X, d)$ with probability measures $\mu$ and $\nu$ with finite support in $X$, and $c(x, y) = d^p(x, y)$. The optimal transport cost $c$ between $\mu$ and $\nu$ is then the $p$-th power of the Wasserstein distance between $\mu$ and $\nu$,
\[
c = W_p^p(\mu, \nu).
\]

Any clusterings of the supports of $\mu$ and $\nu$, together with a representative in $X$ for each cluster, define approximate probability measures $\hat{\mu}$ and $\hat{\nu}$ with
\[
\bar{c} = W_p^p(\hat{\mu}, \hat{\nu}).
\]

Due to the triangle inequality of the Wasserstein metric, we have
\[
|W_p(\mu, \nu) - W_p(\hat{\mu}, \hat{\nu})| \le W_p(\mu, \hat{\mu}) + W_p(\nu, \hat{\nu}),
\]
and hence
\[
|c^{1/p} - \bar{c}^{1/p}| \le W_p(\mu, \hat{\mu}) + W_p(\nu, \hat{\nu}).
\]
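This bound is easy to check numerically on a small discrete instance. The following sketch is illustrative only: the data are made up, and the transport costs are computed by a generic linear-programming formulation via scipy rather than by any method discussed in this thesis.

```python
import numpy as np
from scipy.optimize import linprog

def ot_cost(mu, nu, C):
    """Optimal transport cost between discrete measures as a linear program."""
    m, n = C.shape
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0  # row sums of the plan equal mu
    for j in range(n):
        A_eq[m + j, j::n] = 1.0           # column sums of the plan equal nu
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]),
                  bounds=(0, None), method="highs")
    return res.fun

p = 2
cost = lambda a, b: np.abs(a[:, None] - b[None, :]) ** p

x = np.array([0.0, 1.0, 4.0, 5.0]); mu = np.full(4, 0.25)
y = np.array([0.0, 2.0, 6.0, 8.0]); nu = np.full(4, 0.25)
# Clusters {x_1, x_2}, {x_3, x_4} and {y_1, y_2}, {y_3, y_4}, with the
# midpoints as representatives, define the coarsened measures.
x_hat = np.array([0.5, 4.5]); mu_hat = np.full(2, 0.5)
y_hat = np.array([1.0, 7.0]); nu_hat = np.full(2, 0.5)

c     = ot_cost(mu, nu, cost(x, y))                  # c     = W_p^p(mu, nu)
c_bar = ot_cost(mu_hat, nu_hat, cost(x_hat, y_hat))  # c_bar = W_p^p(mu_hat, nu_hat)

lhs = abs(c ** (1 / p) - c_bar ** (1 / p))
rhs = (ot_cost(mu, mu_hat, cost(x, x_hat)) ** (1 / p)
       + ot_cost(nu, nu_hat, cost(y, y_hat)) ** (1 / p))
print(lhs <= rhs)  # True: roughly 0.068 <= 1.5
```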

The distances $W_p(\mu, \hat{\mu})$ are often easily computed. Let, for example, $\operatorname{supp}(\mu) = \{x_1, \ldots, x_N\} \subseteq X$ with clusters $X_1, \ldots, X_n$ and representatives $\hat{x}_1, \ldots, \hat{x}_n \in X$. If each point $x_i$ is contained in the cluster $X_k$ with the closest representative with respect to $d$, then
\[
W_p^p(\mu, \hat{\mu}) = \sum_{k=1}^{n} \sum_{x_i \in X_k} \mu(x_i)\, d^p(x_i, \hat{x}_k),
\]

which is the case for the $k$-means algorithm or for the standard grid-based aggregation in multiscale schemes on images. Note that this is an error bound between the $p$-th roots of the two optimal objective values, while the gap value given by the cost-based clustering provides an error bound on the absolute deviation of the optimal cost values. The next example compares the two bounds in the grid-based case.
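Before turning to it, note that the formula above is straightforward to evaluate; the following is a minimal sketch (the function name and data are illustrative, not from this thesis):

```python
import numpy as np

def wpp_to_coarsening(points, weights, labels, reps, p=2):
    """W_p^p(mu, mu_hat) when each point lies in the cluster with the closest
    representative: every point simply sends its mass to its representative."""
    d = np.linalg.norm(points - reps[labels], axis=1)
    return float(np.sum(weights * d ** p))

# Four points in two clusters, with the centroids as representatives.
pts  = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]])
w    = np.full(4, 0.25)
lab  = np.array([0, 0, 1, 1])
reps = np.array([[0.5, 0.0], [4.5, 0.0]])
print(wpp_to_coarsening(pts, w, lab, reps))  # 0.25
```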

Example 5.2.7. Let $X = Y = \{1, \ldots, R\}^2 \subseteq \mathbb{R}^2$ for some even number $R \in 2\mathbb{N}$ with $d(x, y) = \|x - y\|$, and consider the clustering $X_1, \ldots, X_n$ of $X$ with $n = R^2/4$, where four adjoining points are aggregated and the representative $\hat{x}_k$ of each cluster is the center of its four points. This defines $\hat{\mu}$ and $\hat{\nu}$ on $X$ for probability measures $\mu$ and $\nu$ on $X$, as detailed in Section 2.4 under the paragraph Coarsening. Since $d^p(x_i, \hat{x}_k) = 2^{-p/2}$ for all $X_k$ and $x_i \in X_k$, we have
\[
|c^{1/p} - \bar{c}^{1/p}| \le W_p(\mu, \hat{\mu}) + W_p(\nu, \hat{\nu}) = 2 \cdot \bigl(2^{-p/2}\bigr)^{1/p} = \sqrt{2}.
\]

The error bounds of the cost-based clustering are more difficult to compute and often depend on $\mu$ and $\nu$, since the values in the gap matrix range between $d^p((1,1),(2,2)) = 2^{p/2}$ and
\[
d^p\bigl((1,1),(R,R)\bigr) - d^p\bigl((2,2),(R-1,R-1)\bigr) = \bigl((R-1)\sqrt{2}\bigr)^p - \bigl((R-3)\sqrt{2}\bigr)^p = 2^{p/2}\bigl((R-1)^p - (R-3)^p\bigr).
\]

For $p = 1$, however, we can compute $G_{\mathrm{row}}$ and $G_{\mathrm{col}}$ directly. For every cluster $X_k$ there exists another cluster $X_l$ that is diagonally aligned with it (assuming $R \ge 4$), such that the difference between the maximal and the minimal distances is precisely twice the diameter of the clusters, thus $g_{k,l} = 2\sqrt{2}$. This is in line with the more general maximal gap matrix entry given above. Therefore,
\[
G_{\mathrm{row}} = G_{\mathrm{col}} = 2\sqrt{2}.
\]
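For small grids this is easy to confirm numerically; the sketch below again uses the max-minus-min description of the gap entries and checks that the maximal entry equals $2\sqrt{2}$ for $p = 1$:

```python
import numpy as np
from itertools import product

R = 8  # grid side length; assumed even and at least 4
points = np.array(list(product(range(1, R + 1), repeat=2)), dtype=float)

# Cluster label of each point: 2x2 blocks of four adjoining points.
block = (((points[:, 0] - 1) // 2) * (R // 2) + (points[:, 1] - 1) // 2).astype(int)
n = R * R // 4

# Pairwise Euclidean distances; for p = 1 the cost is the distance itself.
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

# Gap entry g_{k,l}: maximal minus minimal cost between clusters X_k and Y_l.
gaps = np.array([[D[np.ix_(block == k, block == l)].max()
                  - D[np.ix_(block == k, block == l)].min()
                  for l in range(n)] for k in range(n)])

print(np.isclose(gaps.max(), 2 * np.sqrt(2)))  # True
```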

$G_{\min}$ depends on $\mu$ and $\nu$, with the best case being $\sqrt{2}$ and the worst case being $2\sqrt{2}$, depending on how the masses of $\mu$ and $\nu$ are distributed. In any case,
\[
|c - \bar{c}| \le W_1(\mu, \hat{\mu}) + W_1(\nu, \hat{\nu}) \le G_{\min} \le G_{\mathrm{row}} = G_{\mathrm{col}},
\]
meaning that the Wasserstein bound is the tightest of the bounds, $G_{\mathrm{row}}$ and $G_{\mathrm{col}}$ are the least tight, and $G_{\min}$ lies in between, depending on $\mu$ and $\nu$.

For $p > 1$ these bounds cannot be compared directly, since the Wasserstein bound applies to $|c^{1/p} - \bar{c}^{1/p}|$, whereas the deviation bound from the cost-based clustering applies to $|c - \bar{c}|$.