
Column interpretability and a motivation for Convex cone

Overview

For CX-type column selection, the interpretability advantage over PCA relies on the fact that single columns are selected, as opposed to computing principal components that are linear combinations of many columns.

With regard to improving the interpretability of CX, non-negative CX (NNCX) has been proposed [115] (Section 2.1.4); its non-negativity constraint additionally eliminates a further interpretability problem of PCA (Section 1.2.1). NNCX has so far received little attention, motivating the evaluation of column selection strategies with respect to the NNCX norm error (Section 5.1.1), as well as the development of dedicated NNCX algorithms, such as Convex cone.

Interpretability through column selection is the typical motivation for CX-type column selection (summarised by Mahoney & Drineas in [135]). This section sets out to extend the concept of column interpretability, arguing that a column selection constraint alone does not guarantee interpretability and that, in fact, some columns can be "more interpretable" than others.

In the following, Sections 5.1.2-5.1.5 develop the concept of an interpretable column as an extreme vector. Then, Section 5.1.6 points out that such extreme vectors are also suitable basis columns for NNCX.

The considerations from this section motivate the Convex cone algorithm developed later in Section 5.2, an algorithm that aims to select extreme columns in order to achieve interpretability and to optimise the NNCX norm error criterion.

5.1.1 Find algorithms that minimise the NNCX norm error

Many of the column selection strategies reviewed in Section 2.3 have been proposed in the context of unconstrained CX (Section 2.1.1). Regarding norm error minimisation for CX, some have been evaluated empirically [67, 201], e.g. LeverageScoreSampling (Section 2.3.1), while others, e.g. VolumeSampling (Section 2.3.6), have so far remained theoretical concepts.

Empirical results suggest that an algorithm that explicitly optimises the NNCX norm error by a search heuristic (Section 2.3.4) can achieve better NNCX norm errors than a CX algorithm (LeverageScoreSampling) that was turned into an NNCX algorithm by post-hoc optimisation of a non-negative X_0+ [115]. However, a larger-scale comparison that also includes other column selection strategies in the NNCX scenario has not yet been performed.

One aspect of the empirical evaluation in Chapter 7 is thus to evaluate common column selection strategies and the new Convex cone algorithm (Section 5.2) with respect to the NNCX norm error.

5.1 Column interpretability and a motivation for Convex cone

5.1.2 Mixed signals are a structural property of (NN)CX

Below, pure and mixed signal columns play a role in defining column interpretability. In the case of (NN)CX, matrix X encodes mixed signals ("cluster overlap"). This is a structural property that is characteristic of (NN)CX and that distinguishes it from "crisp" clustering with binary cluster membership indicators (see Section 3.4).

Recall that, if such a crisp clustering is cast into CX notation (as in Equation 3.16 in Section 3.4), then matrix C contains the cluster centres and X_unary contains the binary cluster membership indicators. The columns of X_unary are unary in the sense that they contain a single non-zero entry that identifies the cluster centre with which the column is associated. In contrast, (NN)CX allows for mixed signals in the sense that the jth column of X_real can contain multiple non-zero and real-valued entries that serve to reconstruct the jth column of A as a mixture of the c columns in C:

A(j) ≈ C(1) X^real_1j + . . . + C(c) X^real_cj    (5.1)

5.1.3 Generative NNCX mixture model

NNCX requires selecting c columns from a general matrix A. To define column interpretability, it is helpful to assume a special case where A has been generated from an NNCX mixture model, i.e. by linear combination (with non-negative coefficients) of s of its columns. Then, a column selection algorithm could be employed to recover these s columns from A.

Donoho & Stodden [62] and Arora et al. [10] (see Section 3.1.2) have proposed such models for NMF (Section 4.6) with the goal of defining when NMF has a unique solution or when it admits an exact solution in polynomial time.

To formalise the mixture model for NNCX, consider the m×n matrix A that is constructed by mixing columns from S. Matrix S is m×s, and it contains l = 1, . . . , s "generating" or "source signal" columns S(l), which are pure, non-mixed signals.

S is a minimal generator of A in the sense that it must not contain redundant columns or columns that are linear combinations (with coefficients α_r) of other columns from S: if S(x) ∈ S and S(y) ∈ S, then (α_x S(x) + α_y S(y)) ∉ S.

The j = 1, . . . , n columns A(j) of A are either (1) pure signal columns directly chosen from S, or (2) mixed signal columns, i.e. linear combinations of (w.l.o.g.) two of the pure signal columns:

A(j) := { (1) S(l),                      l ∈ {1, . . . , s}
        { (2) α_l1 S(l1) + α_l2 S(l2),   l_r ∈ {1, . . . , s}; α_lr > 0    (5.2)

Assume further that all S(l) have been inserted into A, i.e. for each l we have S(l) = A(j) for at least one j.
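As an illustration, the generative model of Equation 5.2 can be sketched in a few lines of NumPy; the matrix sizes and mixing coefficients below are arbitrary choices for this example, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical sizes: m = 4 rows, s = 3 pure "source signal" columns.
S = rng.random((4, 3))

# Construct A following Equation 5.2:
# case (1): every pure column S(l) appears at least once in A,
# case (2): mixed columns are conic combinations of two source columns.
cols = [S[:, 0], S[:, 1], S[:, 2]]           # case (1), all l covered
cols.append(0.7 * S[:, 0] + 1.3 * S[:, 2])   # case (2), alpha > 0
cols.append(2.0 * S[:, 1] + 0.4 * S[:, 2])   # case (2), alpha > 0
A = np.column_stack(cols)

print(A.shape)  # (4, 5): n = 5 columns generated from s = 3 sources
```

Here the first three columns of A are pure signal columns, while the last two are mixed signal columns.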

5 Column interpretability and NNCX: The Convex cone algorithm

Regarding nomenclature, each A(j) that is assigned an S(l) is called a pure signal column. There can be multiple A(j) that are assigned the same S(l). The unique columns of S are called generating columns or source signals, and they are also pure signal columns.

5.1.4 Interpretable columns are pure signal columns

Interpretability is usually stated as the main motivation for preferring (NN)CX over PCA [115, 135, 55, 201]. How is interpretability achieved?

1. Interpretability by selecting columns. This is what column selection algorithms have concentrated on so far [135]. A column that is selected from a data matrix can be more interpretable than a principal component, simply because the selected column represents a single entity or data point. Often, the single column has a label that makes it intuitively understandable for a domain scientist.

2. Interpretability by selecting pure signal columns. Once mixtures enter the game, not all columns are equally interpretable. Mixed signal columns (Equation 5.2, case 2) are linear combinations of other columns, just like the principal components whose interpretability is in question.

Thus, in a collection of pure and mixed signal columns, only the pure signal columns are interpretable. The pure signal columns (Equation 5.2, case 1) allow us to understand how A was constructed.

5.1.5 Interpretable columns are extreme columns

Assuming the second concept of interpretability from Section 5.1.4 above, an interpretable column is a pure signal column. By construction of matrix A (Equation 5.2), the pure signal columns are the extreme columns of A, and hence an interpretable column is an extreme column.

To see this, consider that the columns A(j) ∈ A for which A(j) ∈ S (Equation 5.2, case 1) span a convex cone (Definition 1, Section 2.1.5), and all conic (linear with non-negative coefficients) combinations of the columns from S (case 2) are contained in this convex cone.

A column A(j) ∈ A that is a pure signal column (A(j) ∈ S) is also an extreme column (Definition 2, Section 2.1.5) of A in the sense that there exist no columns A(x), A(y) ∈ A (A(x), A(y) ≠ A(j)) such that A(j) = α_x A(x) + α_y A(y) (α_x, α_y ≥ 0).

Selecting all s extreme columns would recover the entire generating matrix S. Furthermore, extreme columns can support data interpretation also for c ≠ s: principal components are often hard to interpret as they are obtained by linear combination of many columns, which is, by definition, not the case for the extreme vectors.
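A naive way to test the extremeness condition above is non-negative least squares: a column is extreme if no conic combination of the remaining columns reproduces it. The following sketch (not the Convex cone algorithm itself, and with `is_extreme` and the tolerance as illustrative choices) assumes SciPy's nnls solver:

```python
import numpy as np
from scipy.optimize import nnls

def is_extreme(A, j, tol=1e-8):
    """Column j of A is extreme if no conic (non-negative) combination
    of the other columns of A reproduces it (residual stays above tol)."""
    others = np.delete(A, j, axis=1)
    _, residual = nnls(others, A[:, j])  # residual = min ||A(j) - others @ x||_2, x >= 0
    return bool(residual > tol)

# Two extreme columns and one conic mixture of them.
A = np.column_stack([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])

print([is_extreme(A, j) for j in range(3)])  # [True, True, False]
```

Exhaustively testing every column this way is quadratic in n and only meant to make the definition concrete; Section 5.2 develops a practical selection strategy.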


5.1.6 Extreme column solution for NNCX

The objective criterion for NNCX is to minimise ||A − P_Cone(C) A||^2_Fr (Section 2.1.5), i.e. we should select C such that the cone defined by its columns contains as much of A as possible.

More precisely, the NNCX norm reconstruction error can also be written in terms of the distance of the columns A(j) of A to the nearest v (found by the "min" function) within the cone V [143, chap. 5]:

||A − C X_0+||^2_Fr = Σ_{j=1,...,n} min_{v ∈ V} ||A(j) − v||^2_2

The norm reconstruction error thus depends on the projection error of the data points (columns) that lie outside of the cone. Selecting the set of extreme columns of A, the minimal generator of the convex cone (Theorem 1, Section 2.1.5), into C leads to the smallest subset of columns that leaves no data points outside of the cone defined by it. This hints at the possibility of an algorithm for NNCX: enumerate the extreme columns of A.

For c approaching the number of extreme columns, the NNCX norm error achieved by such a strategy approaches zero, and, additionally, the selected columns are interpretable according to the considerations above. With a fixed starting point for the extreme vector enumeration, we can furthermore achieve a nested sequence of columns (cp. Section 2.5.2).
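The NNCX norm error of a candidate C can be evaluated along these lines; below is a minimal sketch, assuming the cone projection is computed column-wise by non-negative least squares (the function name `nncx_norm_error` is an illustrative choice):

```python
import numpy as np
from scipy.optimize import nnls

def nncx_norm_error(A, C):
    """||A - C X_0+||_Fr^2: project each column of A onto Cone(C) via
    non-negative least squares and accumulate the squared residuals."""
    error = 0.0
    for j in range(A.shape[1]):
        _, residual = nnls(C, A[:, j])  # residual = ||A(j) - C @ x||_2, x >= 0
        error += residual ** 2
    return error

# If C contains all extreme columns, the cone contains every column of A
# and the NNCX norm error vanishes (up to numerical precision).
C = np.column_stack([[1.0, 0.0], [0.0, 1.0]])
A = np.column_stack([[1.0, 0.0], [0.0, 1.0], [0.3, 0.7]])
print(nncx_norm_error(A, C))
```

Only columns outside the cone contribute to the error, which is exactly the behaviour described by the min-formulation above.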

5.1.7 Extreme columns vs. central columns

Relying on extremal data points for data understanding is an established concept. For example, archetypal analysis [47] (see also [198]) and convex matrix factorisation [58] find data points that lie on the convex hull. These methods do not, however, select extreme columns; instead, they compute sparse (few non-zero coefficients) linear combinations with non-negative coefficients, as opposed to principal components, which are dense (many non-zero coefficients) linear combinations with mixed-sign coefficients.

An extreme vector solution can also be seen as the anti-concept to PCA (Section 3.3.5): the (first) principal component is the line that goes through the mean of the data cloud. Hence, a principal component is a central vector.

Aiming to select columns that are by some criterion close to the principal components (see the algorithms in Section 2.3: LeverageScoreSampling, D CX, GreedySpaceApproximation) leads to a subset of central columns in C, and thus to a very different solution than a subset of extreme columns in C, as e.g. computed by Convex cone or SiVM (Section 2.3).

In practice, similar CX norm errors can be obtained by both central column and extreme column methods, but the solutions can differ considerably with respect to the NNCX norm error and with respect to the number of pure (interpretable) and mixed signal (not interpretable) columns in C (cp. the evaluation in Chapter 7).
