5.3 Clustering Algorithms

5.3.1 Agglomerative Clustering

One of the heuristic clustering methods that can be adapted to the cost-based clustering problem is agglomerative clustering. It is a hierarchical bottom-up clustering approach, where every element starts as its own cluster and in each step two current clusters are selected and merged, until the desired number of clusters is reached. The selection happens greedily with respect to a linkage criterion, which is usually tied to an objective function or a dissimilarity function between the elements.

Agglomerative clustering methods go back to Florek et al. in 1951 [27], who suggested a nearest-neighbor rule, which is now known as single-linkage.

Later, more efficient clustering algorithms for single-linkage [85] and complete-linkage [22] were developed in 1973 and 1977, respectively. The distance between clusters is evaluated by the shortest distance between elements in single-linkage, and by the farthest distance between elements in complete-linkage.
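For concreteness, the two classical linkage rules can be stated in a few lines of Python; the clusters of one-dimensional points below are illustrative values, not data from this chapter.

```python
import numpy as np

def single_linkage(A, B):
    # Cluster distance = shortest distance between any pair of elements.
    return min(np.linalg.norm(a - b) for a in A for b in B)

def complete_linkage(A, B):
    # Cluster distance = farthest distance between any pair of elements.
    return max(np.linalg.norm(a - b) for a in A for b in B)

# Two clusters of points on the real line (illustrative values).
A = [np.array([0.0]), np.array([1.0])]
B = [np.array([3.0]), np.array([5.0])]

print(single_linkage(A, B))    # 2.0 (between 1.0 and 3.0)
print(complete_linkage(A, B))  # 5.0 (between 0.0 and 5.0)
```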

Unfortunately, those and many other linkage criteria require a distance or dissimilarity function d: X × X → R+, also called a dissimilarity coefficient, whereas we only have access to a cost (or dissimilarity) function c: X × Y → R+ between elements in X and Y. This is why we suggest linkage criteria based on the objective functions Grow, Gcol, Gmin and Gprod of the cost-based clustering problem.

Starting with X = {{x1}, . . . , {xN}} and Y = {{y1}, . . . , {yM}}, the gap matrix is the zero matrix. Whenever we fuse two clusters of X (or Y), we eliminate one row (column) of the gap matrix and another row (column) increases its values. This leads to an increase of the gap objective function value that depends on the choice G ∈ {Grow, Gcol, Gmin, Gprod}. We can compute this increase in advance in order to choose the two clusters whose fusion leads to the least increase of the function. This process is iterated until the desired number of clusters is reached. The two sets X and Y are clustered in succession. Since this linkage criterion depends on the choice of G, it has to be chosen beforehand.

In each step of clustering X (and likewise when clustering Y, with different notation), assuming N current clusters, the fusion matrix is an N × N matrix F with nonnegative entries, such that fk,l is the increase in the chosen objective function if Xk and Xl are fused. Since F is symmetric with zero diagonal, we only need to keep track of its upper triangular part. In most cases, if we fuse two clusters Xk and Xl, the other entries do not change, and we only have to compute the entries for the new cluster Xk ∪ Xl. This leads to Algorithm 2 below.

However, if we cluster X with respect to Gcol or Y with respect to Grow, then due to the column-wise (or row-wise) maximum of the gap matrix in the objective function, all of the values in the fusion matrix might change after the aggregation of two clusters. This means we either have to recalculate the entire fusion matrix after each step (Algorithm 3) or accept that the remaining values in the matrix are inaccurate (Algorithm 2). Another option is to recalculate the matrix whenever the number of clusters has halved, or at similar intervals. This leads to variants of Algorithms 2 and 3.

Some details on the implementation:

• In addition to G we always keep track of Cmin and Cmax during the clustering. The reason is simply that they are needed to compute the fusion values fi,j, and it would be inefficient to recalculate the required values each time. For Grow (Gcol) we also keep track of the row-wise (column-wise) maximum of G, that is, Gk,· = max_l gk,l or G·,l = max_k gk,l.


Algorithm 2: Agglomerative cost-based clustering
    Input: X, Y, µ, ν, C, n, m, G
    Output: Clusterings X, Y with gap matrix G
    Initialize X ← {{x1}, . . . , {xN}}, Y ← {{y1}, . . . , {yM}}, G ← 0 ∈ R^{N×M}
    Initialize F (N × N fusion matrix for X) by computing fi,j for
        i = 1, . . . , N − 1, j = i + 1, . . . , N with respect to G;
        for all other (i, j): fi,j ← ∞
    while |X| > n do
        (k0, l0) ← argmin_{k,l} fk,l
        Xk0 ← Xk0 ∪ Xl0
        Xl0 ← ∅ (remove from X)
        Update G, µ and F
    end
    Initialize F (M × M fusion matrix for Y)
    while |Y| > m do
        (k0, l0) ← argmin_{k,l} fk,l
        Yk0 ← Yk0 ∪ Yl0
        Yl0 ← ∅ (remove from Y)
        Update G, ν and F
    end
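A minimal Python sketch of the fusion loop in Algorithm 2, assuming a caller-supplied function fusion_increase that plays the role of the criterion-specific fi,j; the measure updates and the gap matrix are omitted for brevity.

```python
import numpy as np

def agglomerate(n_start, n_target, fusion_increase):
    """Greedy bottom-up fusion loop in the spirit of Algorithm 2.

    fusion_increase(clusters, i, j) must return the objective increase
    f_{i,j} caused by fusing clusters i and j (i < j); it is a
    caller-supplied stand-in for the criterion-specific formulas.
    """
    clusters = [{i} for i in range(n_start)]   # every element starts alone
    F = np.full((n_start, n_start), np.inf)    # upper-triangular fusion matrix
    for i in range(n_start - 1):
        for j in range(i + 1, n_start):
            F[i, j] = fusion_increase(clusters, i, j)

    while len(clusters) > n_target:
        k, l = np.unravel_index(np.argmin(F), F.shape)  # cheapest fusion, k < l
        clusters[k] |= clusters[l]                      # fuse l into k
        del clusters[l]
        F = np.delete(np.delete(F, l, axis=0), l, axis=1)
        for j in range(len(clusters)):                  # only the fused cluster's
            if j != k:                                  # row/column changes
                a, b = (k, j) if k < j else (j, k)
                F[a, b] = fusion_increase(clusters, a, b)
    return clusters

# Toy usage: cluster 1-D points, taking the diameter of the fused cluster
# as a (hypothetical) fusion cost.
points = [0.0, 0.1, 5.0, 5.1, 10.0]

def diameter(clusters, i, j):
    vals = [points[p] for p in clusters[i] | clusters[j]]
    return max(vals) - min(vals)

result = agglomerate(len(points), 2, diameter)
print(result)  # [{0, 1}, {2, 3, 4}]
```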

Algorithm 3: Agglomerative cost-based clustering with recalculation of the fusion matrix (only relevant for Grow and Gcol)
    Input: X, Y, µ, ν, C, n, m, G
    Output: Clusterings X, Y with gap matrix G
    if G = Gcol then
        Perform this algorithm for Y, X, ν, µ, C^t, m, n, Grow
    else
        Initialize X ← {{x1}, . . . , {xN}}, Y ← {{y1}, . . . , {yM}}, G ← 0 ∈ R^{N×M}
        Fuse the clusters of X as in Algorithm 2
        for K = M, . . . , m + 1 do
            Calculate the K × K fusion matrix F for Y
            (k0, l0) ← argmin_{k,l} fk,l
            Yk0 ← Yk0 ∪ Yl0
            Yl0 ← ∅ (remove from Y)
            Update G and ν
        end
    end

• When updating F in Algorithm 2 after the fusion of Xi and Xj, we remove rows and columns i and j from F, then add a new row and column for the new cluster Xi ∪ Xj.

• We compute fi,j for two clusters Xi, Xj depending on G as follows.

for Grow:
\[
f_{i,j} := \mu(X_i \cup X_j) \cdot \max_{l=1,\dots,M} \Bigl( \max\{c^{\max}_{i,l}, c^{\max}_{j,l}\} - \min\{c^{\min}_{i,l}, c^{\min}_{j,l}\} \Bigr) - \mu(X_i)\,G_{i,\cdot} - \mu(X_j)\,G_{j,\cdot}
\]

for Gcol:
\[
f_{i,j} := \sum_{l=1}^{M} \nu(Y_l) \cdot \Bigl( \max\bigl\{ \max\{c^{\max}_{i,l}, c^{\max}_{j,l}\} - \min\{c^{\min}_{i,l}, c^{\min}_{j,l}\},\; G_{\cdot,l} \bigr\} - G_{\cdot,l} \Bigr)
\]

for Gmin:
\[
f_{i,j} := \sum_{l=1}^{M} \Bigl( \min\{\mu(X_i \cup X_j), \nu(Y_l)\} \cdot \bigl( \max\{c^{\max}_{i,l}, c^{\max}_{j,l}\} - \min\{c^{\min}_{i,l}, c^{\min}_{j,l}\} \bigr) - \min\{\mu(X_i), \nu(Y_l)\}\, g_{i,l} - \min\{\mu(X_j), \nu(Y_l)\}\, g_{j,l} \Bigr)
\]

for Gprod:
\[
f_{i,j} := \sum_{l=1}^{M} \nu(Y_l) \cdot \Bigl( \mu(X_i \cup X_j) \bigl( \max\{c^{\max}_{i,l}, c^{\max}_{j,l}\} - \min\{c^{\min}_{i,l}, c^{\min}_{j,l}\} \bigr) - \mu(X_i)\, g_{i,l} - \mu(X_j)\, g_{j,l} \Bigr)
\]
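As a sanity check, the Gmin and Gprod fusion values can be computed directly with NumPy. The array names g, cmin, cmax, mu, nu are this sketch's assumptions for the current gap matrix, the cluster-wise cost minima and maxima, and the cluster weights; the singleton-cluster example below uses illustrative values.

```python
import numpy as np

def fusion_value_gprod(i, j, g, cmin, cmax, mu, nu):
    # Increase of Gprod when clusters X_i and X_j are fused.
    # g: gap matrix, cmin/cmax: cluster-wise min/max costs, mu/nu: weights
    # (all array names are this sketch's choices).
    spread = np.maximum(cmax[i], cmax[j]) - np.minimum(cmin[i], cmin[j])
    return float(np.sum(nu * ((mu[i] + mu[j]) * spread
                              - mu[i] * g[i] - mu[j] * g[j])))

def fusion_value_gmin(i, j, g, cmin, cmax, mu, nu):
    # Increase of Gmin when clusters X_i and X_j are fused.
    spread = np.maximum(cmax[i], cmax[j]) - np.minimum(cmin[i], cmin[j])
    return float(np.sum(np.minimum(mu[i] + mu[j], nu) * spread
                        - np.minimum(mu[i], nu) * g[i]
                        - np.minimum(mu[j], nu) * g[j]))

# Singleton clusters: the gap matrix is zero and cmin = cmax = C.
C = np.array([[0.0, 1.0],
              [2.0, 3.0]])
g = np.zeros_like(C)
mu = np.array([0.5, 0.5])
nu = np.array([0.5, 0.5])

print(fusion_value_gprod(0, 1, g, C, C, mu, nu))  # 2.0
print(fusion_value_gmin(0, 1, g, C, C, mu, nu))   # 2.0
```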

• In the update steps of G, when Xi and Xj (with i < j) are fused, we remove row j and update row i of G. The same holds for columns in the clustering process of Y.

• We update µ by µi := µi + µj and remove entry j. The same for ν.

• Usually, Algorithm 2 is used for clustering. For the linkage criteria Grow and Gcol, a Boolean option accel indicates whether Algorithm 2 (accel = true) or Algorithm 3 (accel = false) is used.

• The clustering for (X, Y, µ, ν, C, n, m, Grow) via the agglomerative clustering method is the same as the clustering for (Y, X, ν, µ, C^t, m, n, Gcol). This allows us to always do the unproblematic clustering first in Algorithm 3, so that the clustering where the entire matrix is recalculated in each step is performed with an already reduced number of clusters for the other set. For G ∈ {Gmin, Gprod} the clustering for (X, Y, µ, ν, C, n, m, G) is the same as the clustering for (Y, X, ν, µ, C^t, m, n, G), since Gmin and Gprod are both symmetric in X and Y.

In order to analyze the runtime of the agglomerative clustering method, we focus on the clustering of X first and observe that each computation of fi,j is done in O(M) time, independent of the choice of G. In Algorithm 2 the fusion matrix F is initialized with N^2 computations of fi,j, so the initialization step is done in O(MN^2) time. Within the while loop, for K = |X|, choosing the argmin is done in O(K^2) time, and in the update of F one row is computed, which amounts to K computations of fi,j. The other operations are insignificant in runtime. The values of K range from N down to n + 1, hence the runtime of the while loop, and consequently of the whole clustering of X, is Σ_{K=n+1}^{N} (K^2 + MK) = O(N^3 + MN^2).
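The last sum has an elementary closed form, which confirms the stated bound:

\[
\sum_{K=n+1}^{N} \bigl(K^2 + MK\bigr) \;\le\; \sum_{K=1}^{N} K^2 + M \sum_{K=1}^{N} K \;=\; \frac{N(N+1)(2N+1)}{6} + M\,\frac{N(N+1)}{2} \;=\; O\bigl(N^3 + MN^2\bigr).
\]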

The clustering of Y has only one difference, namely the computation of one value fi,j takes O(n) instead of O(N) time, since |X| has previously been reduced to n. The rest still holds with N and M interchanged, thus the runtime of the clustering of Y is O(M^3 + nM^2). As a whole, Algorithm 2 has an O(N^3 + MN^2 + nM^2 + M^3) runtime, and is thus cubic in max{M, N}.

For Algorithm 3 let us fix the linkage criterion Grow (for Gcol we obtain the same with X and Y exchanged). The first clustering (i.e., of X) is the same as before, thus done in O(N^3 + MN^2) time. The clustering of Y requires the fusion matrix F to be computed within the loop, so the runtime is Σ_{K=m+1}^{M} K^2 · n = O(M^3 n), which yields O(N^3 + MN^2 + nM^3) in total. For N = M and n = m, for example, this is O(N^3 n), which is already prohibitively slow for the cost-based clustering as a multiscale method for optimal transport: the original problem can already be solved in O(N^3 log(N)) time in the worst case, so in order to remain under this order, one would have to choose n = o(log(N)). Thankfully, it is not necessary at all to use Algorithm 3 for Gmin and Gprod, and there are alternatives to it for Grow and Gcol as well, albeit with less accurate fusion matrices.