
5.3 Low level analysis

5.3.7 Cluster analysis

Cluster analysis is the most frequently used exploratory technique when there is no prior knowledge about the data (DATTA and DATTA 2006, DRAGHICI 2011, BRAUN 2014). It is based on the idea that genes interact with each other and form groups with similar expression profiles (DATTA and DATTA 2006, BRAUN 2014). In general, clustering genes is used to identify groups of co-regulated genes or typical spatial and temporal expression patterns (Figure 5.11). Clustering samples is moreover suitable for identifying distinct phenotypes, classes or stages of a disease, as shown in Figure 5.11. Clustering methods are unsupervised machine learning algorithms (DRAGHICI 2011). Expression profiles are grouped according to their similarity, where

CHAPTER 5. MICROARRAY-BASED GENE EXPRESSION

the measure of similarity is called the distance (DRAGHICI 2011). Multiple ways of calculating the distance exist. The Euclidean distance is the most commonly applied measure. The squared Euclidean distance, a variation of it, gives more weight to outliers, whereas the Manhattan distance dampens their influence (DRAGHICI 2011).
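The three distance measures named above can be written out in a few lines. The following is a minimal sketch in plain Python; the function names and the three toy profiles are hypothetical and serve only to show how each measure responds to a single deviating measurement compared with a uniform shift.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    """Euclidean distance without the square root."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical log2 expression profiles over four samples (illustration only).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [1.0, 2.0, 3.0, 7.0]  # equal to gene_a except for one deviating value
gene_c = [2.0, 3.0, 4.0, 5.0]  # gene_a shifted by 1 in every sample

for name, dist in [("euclidean", euclidean),
                   ("squared euclidean", squared_euclidean),
                   ("manhattan", manhattan)]:
    print(name, dist(gene_a, gene_b), dist(gene_a, gene_c))
```

In this toy example the single deviating value dominates the squared Euclidean distance (9 versus 4), while under the Manhattan distance the uniformly shifted profile is the more distant one (4 versus 3).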

The Pearson correlation distance focuses on whether the coordinates of two points change in the same way. Outliers are problematic because they lower the overall correlation. The jackknife correlation is more robust to one or a few erroneous measurements (DRAGHICI 2011). Which distance measure to use depends on the structure of the data, and there is no consensus in the literature. Importantly, clustering algorithms are not necessarily deterministic: the same algorithm applied to the same dataset may produce different results because of nondeterministic components (DRAGHICI 2011). The most widely used methods are hierarchical clustering, k-means and self-organizing maps (QUACKENBUSH 2006, BRAUN 2014).
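The effect of a single erroneous measurement on the Pearson correlation, and the leave-one-out idea behind the jackknife correlation, can be illustrated as follows. This is a sketch under stated assumptions: the correlation distance is taken as 1 - r, the profiles are invented, and only the individual leave-one-out correlations are computed, because the rule for combining them into one jackknife value differs between authors and is not specified here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_distance(x, y):
    """Assumed convention: distance = 1 - r, so identical shapes give 0."""
    return 1.0 - pearson(x, y)


def leave_one_out_correlations(x, y):
    """Correlation recomputed with each sample removed in turn."""
    return [pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
            for i in range(len(x))]


# Hypothetical profiles: gene_b follows gene_a exactly, gene_c is gene_b
# with one erroneous last measurement.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.0, 4.0, 6.0, 8.0, 10.0]
gene_c = [2.0, 4.0, 6.0, 8.0, 0.0]

print(correlation_distance(gene_a, gene_b))        # close to 0
print(correlation_distance(gene_a, gene_c))        # close to 1
print(leave_one_out_correlations(gene_a, gene_c))  # last entry close to 1
```

Removing the erroneous sample restores the perfect correlation, which is the property a jackknife-type measure exploits.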

Hierarchical clustering was the first clustering algorithm applied to microarray data (EISEN et al. 1998) and is still the most popular one (DATTA and DATTA 2006). It is a deterministic method and can be applied to all data: genes, samples, and time points. Hierarchical clustering produces a cluster tree, also known as a dendrogram, whose leaves are the single genes or samples (Figure 5.11). It is often displayed intuitively as a heat map (Figure 5.11; EISEN et al. 1998). The distance between clusters can be defined as the distance between nearest neighbors (single linkage), between furthest neighbors (complete linkage), between the centers of the clusters (centroid linkage) or as the average distance over all pairs (average linkage). Although computationally more demanding, centroid linkage often represents the structure of the data more accurately than single linkage (KERR et al. 2008, DRAGHICI 2011). Thus, single-linkage clustering is generally not recommended (D’HAESELEER 2005, JASKOWIAK et al. 2014). To divide the data into meaningful clusters, the dendrogram has to be cut at a certain level (Figure 5.11).
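The agglomerative procedure and three of the linkage rules described above can be sketched in plain Python. This is a naive illustration, not the implementation used by any particular tool: the function names and the toy profiles are hypothetical, Euclidean distance is assumed, centroid linkage is omitted for brevity, and cutting the dendrogram is approximated by stopping the merging once a requested number of clusters remains.

```python
import math
from itertools import combinations


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cluster_distance(c1, c2, profiles, linkage):
    """Distance between two clusters given as lists of profile indices."""
    pairwise = [euclidean(profiles[i], profiles[j]) for i in c1 for j in c2]
    if linkage == "single":      # nearest neighbors
        return min(pairwise)
    if linkage == "complete":    # furthest neighbors
        return max(pairwise)
    if linkage == "average":     # mean over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage: " + linkage)


def agglomerative(profiles, n_clusters, linkage="complete"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(profiles))]  # start with singletons
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(
                clusters[pair[0]], clusters[pair[1]], profiles, linkage),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Hypothetical two-sample expression profiles forming two obvious groups.
profiles = [[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]

print(agglomerative(profiles, n_clusters=2, linkage="complete"))
```

Because every step picks the closest pair by a fixed rule, repeated runs on the same data give the same result, which reflects the deterministic character of the method. For real data sets an optimized library routine would normally be preferred over this quadratic-per-step sketch.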

K-means clustering is a simple and fast method and therefore the most widely used nondeterministic algorithm (DRAGHICI 2011). The number of clusters has to be chosen in advance by the user, and determining it is a major difficulty (ZHAO and KARYPIS 2005). If the samples fall into several known classes, it is suggested to use the known number of classes (DRAGHICI 2011). If the cluster analysis is exploratory, it is recommended to repeat the analysis with different numbers of clusters and compare the results (ZHAO and KARYPIS 2005, DRAGHICI 2011). Some authors recommend tracking the decrease in intra-cluster distance as clusters are added and choosing the number of clusters at which the decrease begins to stagnate (ZHAO and KARYPIS 2005); others recommend using principal component analysis (PCA) to determine the number of clusters (QUACKENBUSH 2006). To assess the trustworthiness of a cluster, its size can be compared to the distance to the nearest cluster. If the distance between the clusters is larger than the size of the


Figure 5.11: Hierarchical cluster analysis

Heatmap displaying the expression profiles of 780 differentially expressed ProbeSets from the distemper data set (Figure 5.5; ULRICH et al. 2014b). Each row represents a ProbeSet; each column represents a sample, which was histologically categorized as control or as acute, subacute or chronic canine distemper virus leukoencephalitis. For a better graphical representation of the expression values (color scale ranging from 4-fold down-regulation in green to 4-fold up-regulation in red), the log2-transformed individual fold changes were used. Hierarchical cluster analysis was performed with Euclidean distance and complete linkage using MultiExperiment Viewer (MeV; http://www.tm4.org/mev.html). The tree was cut according to the graphical representation of the data, where 4 distinct expression profiles were identifiable (Cluster 1-4), following the recommendations described in Chapter 4.4 (DRAGHICI 2011).
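The k-means procedure and the tracking of intra-cluster distance over an increasing number of clusters, as described in the text on k-means clustering, can be sketched as follows. This is a minimal illustration in plain Python under several assumptions: squared Euclidean distance, random selection of initial centroids from the data, a fixed set of restarts, and invented toy profiles. All names are hypothetical.

```python
import random


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, k, seed, max_iter=100):
    """One k-means run; returns labels and the intra-cluster sum of squares."""
    rng = random.Random(seed)            # the seed is the nondeterministic part
    centroids = rng.sample(points, k)    # random initial centroids
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:         # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                  # an empty cluster keeps its centroid
                centroids[c] = mean(members)
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, wcss


def best_wcss(points, k, restarts=10):
    """Smallest intra-cluster sum of squares over several random starts."""
    return min(kmeans(points, k, seed)[1] for seed in range(restarts))


# Hypothetical two-sample expression profiles forming three groups.
points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
          [5.0, 5.0], [5.1, 4.8], [4.9, 5.2],
          [9.0, 1.0], [9.2, 1.1], [8.8, 0.9]]

curve = {k: best_wcss(points, k) for k in range(1, 6)}
for k, value in curve.items():
    print(k, round(value, 3))
```

Plotting or inspecting `curve` shows where adding a further cluster no longer reduces the intra-cluster distance appreciably. Running `kmeans` with different seeds may yield different partitions, which is why several restarts are compared here.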


The self-organizing map (SOM) is a nondeterministic neural network technique. In contrast to the two previously mentioned algorithms, SOM provides information about the relationships between the patterns (TAMAYO et al. 1999, ZHAO and KARYPIS 2005, QUACKENBUSH 2006). Depending on the distance metric, SOM is robust to noise and outliers (KERR et al. 2008). The Self-Organizing Tree Algorithm (SOTA) is a more flexible combination of SOM and a top-down hierarchical clustering algorithm (YIN et al. 2006, KERR et al. 2008). Biclustering simultaneously clusters both rows and columns to find local patterns (DRAGHICI 2011).

Nonetheless, not only the choice of clustering algorithm is essential, but also whether all genes or only a subset should be used and whether to use data on the original scale or on the log scale (REIMERS 2005). The log scale amplifies noise, especially in genes with low expression values; therefore, if low-abundance genes are included in the analysis, it is recommended to use the original-scale data (REIMERS 2005). Genes with an expression value within the background noise should be discarded (KERR et al. 2008).
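Why the log scale amplifies noise at low expression values can be seen from a small calculation. The sketch below assumes a purely additive noise term of fixed size on the raw intensity scale; the intensity values are invented for illustration.

```python
import math


def log2_shift(signal, noise):
    """Change in log2 expression when `noise` is added to `signal`."""
    return math.log2(signal + noise) - math.log2(signal)


noise = 20.0  # hypothetical additive fluctuation in raw intensity units
for signal in (40.0, 400.0, 4000.0):
    print(signal, round(log2_shift(signal, noise), 4))
```

The same fluctuation of 20 units shifts a signal of 40 by about 0.58 log2 units but a signal of 4000 by only about 0.007, so after log transformation low-abundance genes contribute disproportionately to the distances used for clustering.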

Table 5.4 provides a list of useful freely available clustering tools.