
Predicting the risk of a disease is of great importance in clinical medicine. So far, the risk of a disease is mainly estimated from environmental factors. However, genetic factors also play an important role in the pathogenesis of most of today’s diseases. If we identify the genetic factors that influence a disease, we may broaden our understanding of the mechanisms underlying the disease as well as of potential therapeutic targets.

Hence, genetic risk prediction opens up a new area of personalized medicine. The principle of personalized medicine is to adjust the clinical treatment and preventive action to the genetic pattern of each individual. Genetic testing is already in common use for Mendelian diseases, where the underlying genetic factor is known [8,28]. However, genetic tests are not yet available for complex diseases. This is because of their complex structures, where multiple loci work together to increase or decrease the risk of a disease.

GWA studies aim to identify the genetic risk factors that influence the risk of a disease. Most GWA studies have so far concentrated on individual SNPs. Each SNP can be seen as a weak classifier, which can be used to estimate the risk of the disease.

In other words, each SNP in isolation contributes a small amount of information about the risk of the disease.

This knowledge is gained by looking at the distribution of the genotypes in known cases and comparing it to the distribution in control individuals. From what we observe, we can then estimate the risk of the disease. In other words, we learn from the data at hand and use this knowledge to predict the condition of unseen subjects.
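To make this concrete, the following minimal sketch in Python (using a hypothetical toy genotype vector coded 0/1/2 and a binary case/control phenotype, not real data) tabulates the empirical case proportion per genotype; this per-SNP estimate is exactly the kind of weak classifier described above.

```python
import numpy as np

# Hypothetical toy data: genotypes coded as 0, 1, 2 (copies of the minor
# allele), phenotype coded as 1 (case) and 0 (control).
genotypes = np.array([0, 1, 2, 2, 1, 0, 2, 1, 1, 0])
phenotype = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0])

def single_snp_risk(genotypes, phenotype):
    """Estimate P(case | genotype) for each genotype class of one SNP."""
    risk = {}
    for g in (0, 1, 2):
        carriers = genotypes == g
        # Empirical case proportion among carriers of genotype g
        risk[g] = phenotype[carriers].mean() if carriers.any() else float("nan")
    return risk

print(single_snp_risk(genotypes, phenotype))  # {0: 0.0, 1: 0.5, 2: 1.0} here
```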

To be useful in clinical medicine, a genetic model for risk prediction needs to have a certain minimum predictive value beyond the prediction made using traditional risk factors. Most SNPs identified to date have only a small effect on risk. Hence, so far, a single SNP is not suitable for predicting an individual’s genetic risk of disease. An application in personalized medicine is thus not possible.

If we base the risk prediction on a combination of SNPs taken together, the prediction may improve; we call this a multilocus approach. Figure 3.2 illustrates classification based on single SNPs versus the multilocus approach. Each axis represents a single SNP and the distribution of its genotypes for the two classes. If we only look at one SNP at a time, we are not able to distinguish between cases and controls, since each genotype is equally frequent in both classes. However, if we combine the two SNPs, we can clearly see a disease-specific pattern.
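The following toy sketch (with two artificial SNPs whose joint pattern is XOR-like, in the spirit of Figure 3.2, not real data) illustrates why the multilocus view can help: each SNP on its own carries no information about the class, while the two SNPs together separate cases and controls perfectly.

```python
import numpy as np

# Artificial XOR-like example: an individual is a case exactly when the
# two SNP genotypes differ.
snp1 = np.array([0, 0, 2, 2, 0, 0, 2, 2])
snp2 = np.array([0, 2, 0, 2, 0, 2, 0, 2])
case = (snp1 != snp2).astype(int)

# Marginal view: the case rate is 0.5 for every genotype of either SNP,
# so neither SNP alone separates cases from controls.
for name, snp in (("SNP1", snp1), ("SNP2", snp2)):
    for g in (0, 2):
        print(name, "genotype", g, "case rate:", case[snp == g].mean())

# Joint view: a rule over both SNPs classifies every individual correctly.
joint_prediction = (snp1 != snp2).astype(int)
print("joint rule accuracy:", (joint_prediction == case).mean())  # 1.0
```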

Classification based on a combination of features is a general task in most machine learning applications. Machine learning algorithms aim to build a model based on existing data which can help us to predict the outcome, in our case the phenotype, on unseen data. Classification algorithms typically take the input, in our case the SNPs of a single individual, and assign it to a discrete class, the phenotype. The prediction is only based on the SNPs, and the relationship between these SNPs and the phenotype is modeled based on a dataset where we know the phenotype for each individual.
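To fix a data layout for the sketches in this chapter (the coding is an assumption for illustration, not necessarily the one used later), genotypes can be arranged as a matrix X with one row per individual and one column per SNP, coded as the number of minor alleles, together with a phenotype vector y; a model is then learned on individuals with known phenotype and evaluated on unseen ones.

```python
import numpy as np

# Assumed layout (toy, randomly generated data for illustration only):
# X: genotype matrix, one row per individual, one column per SNP,
#    entries coded as 0, 1, 2 (number of minor alleles).
# y: phenotype vector, 1 = case, 0 = control.
rng = np.random.default_rng(0)
n_individuals, n_snps = 200, 1000
X = rng.integers(0, 3, size=(n_individuals, n_snps))
y = rng.integers(0, 2, size=n_individuals)

# The model is learned on individuals with known phenotype (training set)
# and its predictions are evaluated on unseen individuals (test set).
n_train = 150
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]
```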


Figure 3.2: Schematic view of the classification problem. Both axes illustrate the distribution of the genotypes of a given SNP. A single SNP is not sufficient to separate the classes due to the equal frequencies of the genotypes. However, by looking at the two SNPs at the same time, we can differentiate between the cases and the controls.

Several studies have aimed to combine multiple genetic variants to assess the risk of a common complex disease [23,53,56,57]. However, the consensus of these studies is rather disappointing. Many of them have combined the SNPs validated so far to see whether risk prediction can be improved. One of the most widely used approaches is risk prediction based on the genotype score (GS) [23,53]. The GS is an additive model that counts the number of risk alleles to make a risk prediction.
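As an illustrative sketch of the GS idea (the exact weighting schemes differ between studies; the function below is a generic, hypothetical implementation), the score is simply a (possibly weighted) count of the risk alleles an individual carries across the validated SNPs:

```python
import numpy as np

# Toy genotype matrix (individuals x validated SNPs), coded as the number
# of risk alleles (0, 1, 2) that each individual carries at each SNP.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 12)).astype(float)

def genotype_score(X, weights=None):
    """Additive genotype score: (weighted) count of risk alleles per individual.

    weights could be per-SNP effect estimates, e.g. log odds ratios;
    the unweighted GS simply sums the risk alleles.
    """
    if weights is None:
        weights = np.ones(X.shape[1])
    return X @ weights

scores = genotype_score(X)
# A crude risk prediction then thresholds the score, e.g. at its median.
predicted_high_risk = (scores > np.median(scores)).astype(int)
```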

However, each of the risk factors known so far has only low discriminatory and predictive value, and even when combined, these variants explain only a very small proportion of the genetic contribution. The predictive value reported in these studies is hence far from being useful in a clinical setting.

The traditional risk prediction strategies (e.g. the GS) have several weak points: the small number of SNPs on which classification is based, the strategy used to select them, and the linear form of the classifiers. Let us discuss each of these weak points more closely.

First, most approaches classify based on the risk variants identified so far, and these are only small in number. Moreover, these SNPs generally all have marginal effects, making classification a difficult task. Second, selecting only the validated SNPs may not be the best choice, since these SNPs explain too little of the genetic risk of a disease. Third, most of the studies only allow for linear main effects and ignore interactions and more complex relationships between variants [28]. Interactions between SNPs may, however, increase the predictive value, since interactions are expected to have larger effect sizes than individual SNPs and their additive effects. In addition to SNP interactions in the traditional genetic sense, there may also be other, more general types of disease-specific structures in the data, which the algorithm will be able to exploit to improve classification further.

The drawbacks of these standard approaches can, however, easily be avoided by applying well-known classification methods from the field of machine learning. First, since we do not expect all susceptibility loci to have been identified yet, we believe that there may be better strategies than selecting only the SNPs that have been validated so far; instead, we aim to incorporate large numbers of SNPs into the model. Second, we expect there to be more complex relationships, such as interactions between the SNPs; hence, we aim to identify these structures and thereby improve classification.

We will apply two approaches that yield classifiers which are based on large ensembles of SNPs. The first of these approaches is the well-known support vector machine (SVM). (While this work was being prepared, other groups also started using the SVM on GWA data [28,58,59].) The advantage of the SVM is that it utilizes all features and is applicable to very large datasets. A drawback of the SVM is, however, that the number of SNPs used for classification can only be controlled through the number of features in the dataset. If a classification based on a smaller number of SNPs is desired, the SNPs have to be selected beforehand. This is generally done by selecting a subset based on the individual significance values, the p-values [23,28,57].
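The sketch below shows, purely as an illustration and not as the exact pipeline used in this work, how a linear SVM could be trained on a genotype matrix with scikit-learn, once on all SNPs and once after a preselection of top-ranked SNPs (here a chi-squared score serves as a stand-in for single-SNP p-values):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data: 300 individuals, 5000 SNPs coded 0/1/2, binary phenotype.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(300, 5000)).astype(float)
y = rng.integers(0, 2, size=300)

# Variant 1: linear SVM on all SNPs (the SVM can use every feature directly).
svm_all = LinearSVC(C=0.01, dual=False, max_iter=10000).fit(X, y)

# Variant 2: preselect the k "most significant" SNPs (chi-squared score as a
# stand-in for single-SNP p-values), then train the SVM on that subset.
svm_topk = make_pipeline(
    SelectKBest(chi2, k=100),
    LinearSVC(C=0.01, dual=False, max_iter=10000),
).fit(X, y)

predictions = svm_topk.predict(X[:10])  # predicted phenotypes for 10 individuals
```

The pipeline variant makes explicit that, with the SVM, controlling the number of SNPs requires such an external filtering step.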

Second, we will apply a boosting algorithm modified for use on GWA data. The main idea of boosting is to combine several weak classifiers, in our case SNPs, into one strong classifier. The advantage of the boosting algorithm is that it selects SNPs one after another in order to obtain a set of SNPs that successively improve the classification. Hence, the number of SNPs can be controlled more directly by the boosting algorithm than by the SVM, since no preselection of SNPs is needed.
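As a rough illustration of the boosting idea (standard AdaBoost over decision stumps, not the modified algorithm developed in this work), each weak learner below splits on a single SNP, so the ensemble picks SNPs one after another and the number of selected SNPs is bounded by the number of boosting rounds:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data as before: genotypes coded 0/1/2, binary phenotype.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(300, 5000)).astype(float)
y = rng.integers(0, 2, size=300)

# Each weak learner is a decision stump splitting on a single SNP; the number
# of boosting rounds therefore bounds the number of SNPs used for classification.
# (In scikit-learn versions before 1.2 the keyword is base_estimator.)
boost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    learning_rate=1.0,
).fit(X, y)

# SNPs actually picked across the boosting rounds (at most n_estimators of them).
selected_snps = {
    int(stump.tree_.feature[0])
    for stump in boost.estimators_
    if stump.tree_.feature[0] >= 0
}
predictions = boost.predict(X[:10])
```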


4 Classification of GWA data with the