
6 Sparse Coding for feature selection

6.2 Sparse Coding

Sparse coding is an algorithm that, like the PCA, aims to identify the directions of greatest variance in the data. However, in contrast to the PCA, these directions are not constrained to be orthogonal.

Each data point can be represented as a linear combination of vectors from a dictionary $C = (c_1, \dots, c_M)$, with a coefficient vector $a \in \mathbb{R}^M$. Additionally, we impose the constraint that at most $k$ vectors from the dictionary may be used.

The coefficient vector $a$ is hence a sparse vector, which gives rise to the name sparse coding.

Given a set of data points $X = (x_1, \dots, x_N)$, $x_i \in \mathbb{R}^D$, we find the dictionary $C$ and a coefficient vector $a_i$ for each data point by minimizing the following objective


function [43]:

$$\min_{C,\, a_1, \dots, a_N} \sum_{i=1}^{N} \| x_i - C a_i \|_2^2 \quad \text{subject to} \quad \| a_i \|_0 \le k \,, \qquad (6.2)$$

where $\|a_i\|_0$ denotes the number of non-zero entries in the coefficient vector $a_i$.
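To make the objective concrete, here is a minimal numpy sketch of how it can be evaluated; the function names and array layout are our own choices, not from [43]:

```python
import numpy as np

def sparse_coding_objective(X, C, A):
    """Summed reconstruction error sum_i ||x_i - C a_i||_2^2.

    X: (D, N) data matrix, C: (D, M) dictionary, A: (M, N) coefficients.
    """
    return float(np.sum((X - C @ A) ** 2))

def satisfies_sparsity(A, k):
    """Check the constraint ||a_i||_0 <= k for every coefficient vector a_i."""
    return bool(np.all(np.count_nonzero(A, axis=0) <= k))
```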

We will describe an algorithm by Labusch et al. [43] for computing this minimum; before presenting the algorithm in the general case, we will first consider two easier special cases.

Special case 1: $|C| = 1$

We will first consider the case where $|C| = 1$, i.e. the dictionary contains only one vector $c$. This implies that, instead of coefficient vectors $a_i$, we have scalar coefficients $a_i$, and hence the number $k$ of non-zero entries in the coefficient vector is necessarily 1. The optimization problem simplifies to the following:

$$\min_{c,\, a_1, \dots, a_N} \sum_{i=1}^{N} \| x_i - a_i c \|_2^2 \quad \text{subject to} \quad \|c\|_2^2 = 1 \,. \qquad (6.3)$$

For a fixed $c$ with $\|c\|_2 = 1$, the optimal scalar coefficient is the projection $a_i = c^T x_i$; substituting it back yields $\|x_i - a_i c\|_2^2 = \|x_i\|_2^2 - (c^T x_i)^2$. Since $x_i$ is fixed, we obtain the final optimization problem:

$$\max_{c} \sum_{i=1}^{N} (c^T x_i)^2 \quad \text{subject to} \quad \|c\|_2^2 = 1 \,. \qquad (6.6)$$

This expression corresponds to the variance of the data along the direction of $c$; hence, Equation (6.3) is minimized if we identify the basis vector $c$ that corresponds to the direction of greatest variance. This direction of greatest variance can be found using Oja's learning rule [52], as discussed in Chapter 3.2.2.
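As an illustration, the following is a minimal numpy sketch of Oja's rule for finding this direction; the constant learning rate and epoch count are illustrative choices, and the data are assumed to be centred:

```python
import numpy as np

def oja_first_direction(X, n_epochs=50, lr=0.01, seed=0):
    """Estimate the direction of greatest variance of centred data with Oja's rule.

    X: (N, D) array of centred data points x_i.
    """
    rng = np.random.default_rng(seed)
    c = rng.normal(size=X.shape[1])
    c /= np.linalg.norm(c)
    for _ in range(n_epochs):
        for x in X[rng.permutation(len(X))]:
            y = c @ x                  # projection y = c^T x
            c += lr * y * (x - y * c)  # Oja's rule: Hebbian term with decay
        c /= np.linalg.norm(c)         # re-normalize for numerical stability
    return c
```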

Special case 2: $k = 1$

So far, we have discussed the simple case $|C| = 1$. We will now see how we can determine a dictionary $C$ with more than one element but will, for the time being, still impose the constraint $k = 1$, i.e. that the coefficient vector may have only one non-zero entry.

We can solve this variant of the optimization problem using a combination of Oja's rule and the Neural Gas (NG) algorithm [43]. The Neural Gas algorithm is discussed in Chapter 3.2.1.

The algorithm iteratively updates the dictionary and the coefficient vectors until convergence is achieved.

In each iteration, we choose a random data point $x$. We rank the dictionary vectors by their scalar product with the data point $x$, in decreasing order:

$$(c_{l_0}^T x)^2 \ge \dots \ge (c_{l_j}^T x)^2 \ge \dots \ge (c_{l_{M-1}}^T x)^2 \,. \qquad (6.7)$$

We update each dictionary vector by moving it a certain distance towards the current data point according to its rank:

$$\Delta c_{l_j} = \alpha_t \, e^{-j/\lambda_t} \, y \, (x - y\, c_{l_j}) \,, \qquad (6.8)$$

where $y = c_{l_j}^T x$, $\alpha_t$ is the learning rate, and $\lambda_t$ is the neighbourhood size. Both $\alpha_t$ and $\lambda_t$ are exponentially decreased in each iteration $t$ as follows:

$$\alpha_t = \alpha_0 \,(\alpha_{\text{final}}/\alpha_0)^{t/t_{\max}} \,, \qquad (6.9)$$

$$\lambda_t = \lambda_0 \,(\lambda_{\text{final}}/\lambda_0)^{t/t_{\max}} \,. \qquad (6.10)$$

The basis vectors are normalized in each learning step.
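Putting Equations (6.7)-(6.10) together, a compact sketch of this $k = 1$ variant might look as follows; the hyper-parameter defaults are illustrative, not the values used in [43]:

```python
import numpy as np

def scng_k1(X, M, t_max=10_000, a0=0.5, a_final=0.001, l_final=0.01, seed=0):
    """Sparse Coding Neural Gas for the special case k = 1.

    X: (N, D) data matrix; M: number of dictionary vectors.
    """
    rng = np.random.default_rng(seed)
    l0 = M / 2.0                                  # initial neighbourhood size
    C = rng.normal(size=(M, X.shape[1]))
    C /= np.linalg.norm(C, axis=1, keepdims=True)
    for t in range(t_max):
        a_t = a0 * (a_final / a0) ** (t / t_max)  # Eq. (6.9)
        l_t = l0 * (l_final / l0) ** (t / t_max)  # Eq. (6.10)
        x = X[rng.integers(len(X))]               # random data point
        ranking = np.argsort(-(C @ x) ** 2)       # Eq. (6.7), decreasing order
        for j, lj in enumerate(ranking):
            y = C[lj] @ x
            C[lj] += a_t * np.exp(-j / l_t) * y * (x - y * C[lj])  # Eq. (6.8)
            C[lj] /= np.linalg.norm(C[lj])        # normalize in each step
    return C
```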

The coefficient vectors $a_i$ are determined as described in Chapter 3.1.

General case

So far, we have only allowed $x_i$ to be represented by an arbitrary single element of the dictionary $C$ that is scaled by the coefficient vector $a_i$, with $\|a_i\|_0 \le 1$. The general version of the sparse coding algorithm uses multiple (at most $k$) elements of the dictionary $C$ to represent the data point $x_i$. This corresponds to the general case of the objective function (Equation 6.2) without any further constraints.

The algorithm for solving this general problem is known as Generalized Sparse Coding Neural Gas [43]. Again, the algorithm iteratively updates the dictionary and the coefficient vectors until convergence is achieved. In each iteration, we again choose a random data point $x$. We use the optimized orthogonal matching pursuit (OOMP) algorithm (see Chapter 3.1) to compute an approximation to $x$ using $k$ of the current dictionary vectors; in every step of the OOMP algorithm, we update the dictionary vectors using a Neural Gas learning rule as before. An additional twist is that, in each step of the OOMP algorithm, we remove the projection of the winning dictionary vector from the dictionary and the current data point.

This general idea is implemented as follows: let $R = (r_1, \dots, r_M) = C$ denote the temporary copy of the dictionary that is made at the beginning of the OOMP algorithm; let $e = x$ denote the residual; and let $U = \emptyset$ denote the (initially empty) set of indexes of the dictionary vectors selected by OOMP.

In each iteration of the OOMP algorithm, we update $C$ and $R$ according to a combination of Oja's rule and Neural Gas, as before. However, the learning rule is modified to minimize only the distance to the current residual. As before, we rank the dictionary vectors $r_i$ by their scalar product with the residual $e$:

$$(r_{l_0}^T e)^2 \ge \dots \ge (r_{l_j}^T e)^2 \ge \dots \ge (r_{l_{M-|U|-1}}^T e)^2 \,, \qquad l_0, \dots, l_{M-|U|-1} \notin U \,. \qquad (6.11)$$

Note that the ranking is now based on the subspace that is orthogonal to $C_U$, where $C_U$ are the dictionary elements that have been used so far; also, we consider only those dictionary vectors that have not yet been selected by OOMP. The learning rule is now

$$c_{l_j} = c_{l_j} + \Delta_{l_j} \,, \qquad r_{l_j} = r_{l_j} + \Delta_{l_j} \,, \qquad (6.12)$$

$$\Delta_{l_j} = \alpha_t \, e^{-j/\lambda_t} \, y \, (e - y\, r_{l_j}) \,, \qquad (6.13)$$

where $y = r_{l_j}^T e$, and $\alpha_t$ and $\lambda_t$ are the learning rate and neighbourhood size, as before.
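The sketch below shows one iteration of this procedure for a single data point. It is our own simplified rendering of the step described above, so details such as the exact deflation order are assumptions rather than a transcription of [43]:

```python
import numpy as np

def gscng_step(C, x, k, a_t, l_t):
    """One OOMP-based dictionary update for a single data point x.

    C: (M, D) dictionary, modified in place; k: max number of selected atoms.
    """
    M = len(C)
    R = C.copy()                # temporary copy of the dictionary
    e = x.astype(float).copy()  # residual
    U = []                      # indexes selected by OOMP so far
    for _ in range(k):
        avail = [i for i in range(M) if i not in U]
        order = sorted(avail, key=lambda i: -(R[i] @ e) ** 2)  # Eq. (6.11)
        for j, lj in enumerate(order):
            y = R[lj] @ e
            delta = a_t * np.exp(-j / l_t) * y * (e - y * R[lj])
            C[lj] += delta                     # Eqs. (6.12), (6.13)
            R[lj] += delta
            C[lj] /= np.linalg.norm(C[lj])
            R[lj] /= np.linalg.norm(R[lj])
        win = order[0]
        U.append(win)
        e -= (R[win] @ e) * R[win]             # deflate the residual
        for i in avail:
            if i != win:                       # deflate the unused atoms
                R[i] -= (R[win] @ R[i]) * R[win]
                n = np.linalg.norm(R[i])
                if n > 1e-12:
                    R[i] /= n
    return C, U
```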

Using Sparse Coding to rank SNPs

As for the PCA, we learn two separate dictionaries, one for the cases and one for the controls [2]. We refer to the dictionary of the cases (class 1) by $C^1 = (c_1^1, \dots, c_M^1)$ and to the dictionary of the controls (class $-1$) by $C^{-1} = (c_1^{-1}, \dots, c_M^{-1})$.

The larger the absolute value of an entry of a dictionary vector $c_i$, the larger the contribution of the corresponding SNP to the orientation of this dictionary vector.

Hence, we again rank the SNPs in the same way as for the PCA:

$$s_j = \left| \max_i \left| (c_i^1)_j \right| - \max_i \left| (c_i^{-1})_j \right| \right| \,. \qquad (6.14)$$

The SNPs are then sorted according to the score $s_j$ in descending order. We will in the following refer to this score as the scscore.

In contrast to PCA, Sparse Coding has two parameters that have to be chosen by the user: $M$, the number of dictionary vectors, and $k$, the number of non-zero entries in the coefficient vectors $a_i$.
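Under the reconstruction of Equation (6.14) given above, the scscore can be computed in a few lines; this is a sketch, and the (M, D) array layout is our assumption:

```python
import numpy as np

def sc_score(C_cases, C_controls):
    """Per-SNP scscore from the two dictionaries, following Eq. (6.14).

    C_cases, C_controls: (M, D) dictionaries for class 1 and class -1;
    returns a length-D array with one score per SNP.
    """
    return np.abs(np.abs(C_cases).max(axis=0) - np.abs(C_controls).max(axis=0))

# SNPs sorted by descending scscore:
# ranking = np.argsort(-sc_score(C1, Cm1))
```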


6.3 Results and discussion

To evaluate sparse coding for feature selection, we tested the performance on simulated as well as on real data sets and compared it with state-of-the-art algorithms: selection according to the p-values, the svmscore (see Chapter 4) and the pcascore.

For the PCA and sparse coding algorithms, we have to define the number $M$ of principal components (for PCA) or dictionary vectors (for sparse coding). We tested different choices for $M$ and finally used $M = 5$ for both algorithms. For the sparse coding algorithm, we also have to choose the number $k$ of non-zero entries in the coefficient vectors. As for $M$, we tested different choices for $k$ and finally used $k = 4$.

6.3.1 Performance on simulated data set

Simulated data sets are of great use for evaluating algorithms because the ground truth is known and can be compared to the result of the algorithm, and because the exact properties of the data set can be controlled. In our case, this means that we can incorporate patterns of SNPs in the data set and can easily evaluate how well the algorithms identify these SNPs.

Comparing the selection strategies

We evaluated the performance of the four different feature selection approaches on data sets simulated as described in Chapter 2.1. The first data set we will use contains 1000 individuals, of which 500 are cases and 500 controls. The total number of SNPs is 10,000.

In addition to incorporating specific patterns, we also incorporate disease-unspecific patterns to account for structures induced by linkage disequilibrium (see Chapters 2.1 and 3.4). In total, we incorporated 6 patterns in the data, of which one is disease-specific and 5 are disease-unspecific. Each disease-unspecific pattern consists of 100 individuals of each class, whereas the disease-specific pattern consists of 100 cases only. The number of SNPs in each pattern is 30. The patterns do not overlap.
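To illustrate the kind of data this produces, here is a small sketch; the genotype coding and the index choices are our assumptions, not the exact simulation procedure of Chapter 2.1:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(1000, 10_000))  # genotypes in {0, 1, 2}; 500 cases first

def embed_pattern(G, individuals, snps, rng):
    """Give a group of individuals a shared genotype block (a 'pattern')."""
    G[np.ix_(individuals, snps)] = rng.integers(0, 3, size=len(snps))

# disease-specific pattern: 100 cases, 30 SNPs
embed_pattern(G, np.arange(100), np.arange(30), rng)
# one of the 5 unspecific patterns: 100 cases and 100 controls, 30 other SNPs
embed_pattern(G, np.r_[100:200, 500:600], np.arange(30, 60), rng)
```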

To assess the consistency of the results, several simulated data sets with the same parameters were created, with similar results in each run. The results on one representative data set are shown in Figure 6.2. Each subfigure plots the SNP scores obtained by one algorithm. The red circles mark the SNPs of the disease-specific pattern.

Figure 6.2a shows the results for the p-value ranking. We see that the SNPs with the highest-ranked p-values do not overlap with the SNPs of the disease-specific pattern. In other words, if we select SNPs according to the p-value ranking, we do not identify the relevant SNPs. The reason for this is that the p-value ranking only assigns high values to SNPs with an individual effect. In this data set, however, none of the SNPs that form the disease-specific pattern are individually associated with the disease.



Figure 6.2: Evaluation of four feature selection approaches (p-values, svmscore, pcascore and scscore) on a 10,000-dimensional data set with one disease-specific pattern and 5 disease-unspecific patterns. The SNPs in the disease-specific pattern are marked with red circles. For the p-value and svmscore rankings, the SNPs with the highest ranks do not overlap with the SNPs of the disease-specific pattern. In contrast, for the pcascore and scscore rankings, the SNPs of the disease-specific pattern stand out. The number of principal components for the pcascore was $M = 5$. For the scscore, the number of dictionary vectors was also $M = 5$, and the number $k$ of non-zero entries in the coefficient vectors $a_i$ was 4.


total number of SNPs              15,000   20,000   25,000   30,000

pattern size                          10       10       10       10
(1 disease-specific pattern,          20       20       20       20
1 unspecific pattern)                 30       30       30       30
                                      40       40       40       40
                                      50       50       50       50

pattern size                          10       10       10       10
(1 disease-specific pattern,          20       20       20       20
5 unspecific patterns)                30       30       30       30

Table 6.1: Simulated data sets with 1 disease-specific pattern and 1 or 5 disease-unspecific patterns.

Figure 6.2b shows the results for the svmscore ranking. Again, the SNPs ranked highest according to the svmscore do not overlap with the SNPs of the disease-specific pattern. Hence, the svmscore also misses the disease-relevant SNPs. This happens despite the fact that the SVM should be able to take advantage of disease-specific patterns. However, in our simulated data set, the disease-specific pattern is very small: the pattern is only specific for 1/5 of the cases, while 4/5 of the cases cannot be distinguished from the controls. Without any strong disease-specific patterns or SNPs that enable a classification, the SVM classification plane is mainly based on random structures, and the pattern-specific SNPs may be missed. For this reason, the svmscore is not useful for selecting SNPs that are specific only for a small subgroup of individuals when the other SNPs contain only noise.

The results of the selection according to the pcascore and scscore are shown in Figures 6.2c,d. We see that, in contrast to the p-values and svmscore, the pcascore and scscore identify the disease-specific SNPs. Hence, both of these scores are clearly suitable selection strategies on this data set. The reason that PCA and sparse coding can identify the patterns is that they detect the changes in the covariances that the patterns cause.

6.3.2 Sparse Coding versus PCA on high-dimensional data

In the previous section, we saw that only the PCA and sparse coding algorithms identify the SNPs of the disease-specific patterns. Hence, in the following, we will compare the performance of only these two approaches in more detail. To do this, we simulated different data sets with a varying total number of SNPs from 15,000 to 30,000. In addition, we also varied the pattern size from 10 to 50 SNPs in steps of 10. The data sets contained one disease-specific pattern and either one or five disease-unspecific patterns. The parameters of the simulated data sets are shown in Table 6.1.

Figure 6.3: Percentage of pattern-specific SNPs among the 100 SNPs ranked highest by the PCA and sparse coding algorithms. Error bars show standard deviations. The data sets contain one disease-specific and one unspecific pattern.

To explore the effect of random variations in the data, a total of 10 random data sets were generated. We assessed the quality of the results by counting the number of disease-specific SNPs identified by the respective algorithm among the 100 highest-ranked SNPs.
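This evaluation measure is simple to compute; a sketch (the names are our own):

```python
import numpy as np

def pattern_snps_in_top(scores, pattern_snps, top=100):
    """Count disease-specific SNPs among the `top` highest-ranked SNPs."""
    top_ranked = np.argsort(-scores)[:top]
    return len(set(top_ranked.tolist()) & set(pattern_snps))
```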

Figure 6.3 shows the results on the data sets containing one disease-unspecific pattern.

Overall, the performance of the two algorithms declines as we increase the total number of SNPs or decrease the size of the disease-specific pattern. The reason for this is probably the increased amount of noise, which makes it harder for the algorithms to identify the true patterns. Sparse coding performs better than the PCA on the harder data sets, suggesting that the sparse coding algorithm is less influenced by noise than the PCA.

In addition, we see that if the pattern size is too small (10 SNPs), both algorithms fail to identify the relevant SNPs. The performance at this pattern size is nearly equal for all data sets. We conclude that, to obtain good performance, the pattern needs to exceed a critical size. On the other hand, we see that for large patterns, both algorithms identify nearly all SNPs on all data sets.

Figure 6.4: Percentage of pattern-specific SNPs among the 100 SNPs ranked highest by the PCA and sparse coding algorithms. Error bars show standard deviations. The data sets contain one disease-specific and five unspecific patterns.

We now turn to the data sets with five disease-unspecific patterns. Here, we only used pattern sizes from 10 to 30 SNPs to evaluate the algorithms at their limits.

The results are shown in Figure 6.4. Overall, the mean performance of the pcascore ranking is worse than on the data sets with only one disease-unspecific pattern. The performance of the scscore ranking is not affected as much. This means that the advantage of the scscore over the pcascore is more pronounced than on the previous data sets.

The reason for the superior performance of the sparse coding algorithm is probably that, unlike the PCA, it does not force the dictionary vectors to be orthogonal. Hence, if we have several non-orthogonal structures in the data, sparse coding may identify all of them, whereas the PCA may fail to identify even one (for an illustration of this, see Figure 6.1).

6.3.3 Conclusion simulated data sets

We have evaluated the performance of four different feature selection approaches: p-values, svmscore, pcascore and scscore. To evaluate the algorithms, we simulated data sets with disease-specific and disease-unspecific patterns.


Except for the disease-specific pattern, there are no other variables that discriminate between the cases and the controls. In addition, only 1/5 of the cases are associated with the pattern and can hence be identified at all.

The p-value and svmscore rankings do not give high ranks to the disease-specific SNPs. For the p-values, the reason is obvious: they only assign high ranks to SNPs with individual effects. For the svmscore, the reason is more subtle: the SVM aims to find an optimal separating hyperplane between the two classes. However, in this data set, 4/5 of the cases cannot be distinguished from the controls, and so the SVM apparently overfits to random structures in the data.

PCA and sparse coding clearly outperform the other two approaches. PCA as a feature selection algorithm yields good results for large pattern sizes with only one disease-unspecific pattern. If we introduce five disease-unspecific patterns, the performance of the PCA deteriorates compared to the results with one unspecific pattern.

Sparse coding performed consistently better than the PCA, particularly on high-dimensional data sets with multiple unspecific patterns. This suggests that sparse coding is a promising feature selection approach for identifying disease-specific SNP patterns that only occur for a small number of individuals.

6.3.4 Performance on GWA data

Evaluating the performance on simulated data can help us to better understand the advantages, disadvantages and limits of the tested algorithms. But it is not sufficient to evaluate an algorithm only on simulated data sets. Real data sets have several properties that go beyond what can be simulated, since we have only limited knowledge about the complex genetic relationships between the SNPs and we know neither the size nor the number of disease-specific and disease-unspecific patterns.

In the previous section, we evaluated the performance of different feature selection approaches by counting the number of identified disease-specific SNPs. In these tests, the SNPs formed disease-specific patterns, which are clearly difficult to identify using standard approaches, such as a selection according to the p-values. However, the results on the simulated data sets suggest that the scscore can identify these SNPs. But does the scscore also improve classification on real data sets?

On the simulated data set, we evaluated the performance by counting the number of identified disease-specific SNPs. Since we do not know these SNPs in real GWA data, we instead used the SVM classification performance as our evaluation measure.

We applied the algorithm to the two data sets used previously: the inflammatory bowel disease (IBD) data set [36] and the coronary artery disease (CAD) data set [33].

The data sets are described in detail in Chapter 2.2. To explore random variations in the data, we performed a 5-fold cross-validation.
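A sketch of this evaluation protocol with scikit-learn follows; the function and variable names are our own, and in the actual experiments the ranking would be computed on the training data of each fold rather than passed in precomputed:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def cv_auc(G, y, ranking, n_snps, n_folds=5):
    """Mean AUC of a linear SVM on the n_snps highest-ranked SNPs."""
    subset = ranking[:n_snps]
    aucs = []
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train, test in folds.split(G, y):
        clf = SVC(kernel="linear").fit(G[np.ix_(train, subset)], y[train])
        scores = clf.decision_function(G[np.ix_(test, subset)])
        aucs.append(roc_auc_score(y[test], scores))
    return float(np.mean(aucs))
```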



Figure 6.5: IBD data set: performance (AUC) of a linear SVM on the scscore ranking compared with: a) pcascore ranking, b) p-value ranking, c) svmscore ranking, d) all algorithms. A logarithmic (base 10) scale is used for the x-axis.

6.3.5 Performance on the IBD data set

The IBD data set [36] is described in Chapter 2.2.2. In brief, the data set contains 1719 individuals, of which 789 are cases and 930 are controls. The total number of SNPs is 292,162 after quality filtering.

We first compared the performance of the linear kernel SVM (LSVM) on the pcascore and scscore rankings (see Figure 6.5a). The LSVM on the scscore ranking reaches its best performance before all SNPs are included, with 253,869 SNPs and an AUC of 0.679.

The LSVM on the pcascore ranking reaches its best performance when all SNPs are included (AUC of 0.677). In other words, the pcascore cannot identify a subset that improves the classification over all SNPs. In contrast, the scscore can identify a subset of SNPs that enables better classification. Even if the performance is only marginally improved, this is still a notable result, since a substantially smaller number of SNPs is required.

If we evaluate the performance of the two selection approaches in more detail, we see that initially the performance of the LSVM gradually improves with increasing subset size. For subsets of up to around 400 SNPs, the scscore ranking clearly outperforms the pcascore ranking. Then, the pcascore selection yields the better results, up to a subset size of over 1800 SNPs, after which both selection strategies yield comparable results throughout the remaining subsets.

Next, we compared the performance of the scscore with the p-value ranking (see Figure 6.5b). The p-value ranking is superior for smaller subset sizes, whereas, for subsets with more than 4800 SNPs, the scscore yields the better results. The p-value selection cannot identify a subset that performs better than the LSVM on all SNPs.

Finally, we compared the scscore with the svmscore (see Figure 6.5c). Overall, the svmscore is clearly superior. However, like the p-value ranking and the pcascore, the svmscore cannot identify a subset that performs better than the classification on all SNPs.

This means that the best performance overall is reached when using the scscore.

As discussed in the previous chapters (see Chapter 4), a Gaussian instead of a linear kernel may improve the classification performance. Hence, we repeated the tests with
