Learning from the integrated data - Integrating heterogeneous data sets related to Alzheimer’s

4. Integrating heterogeneous data sets related to Alzheimer’s disease (Pub-

4.4. Learning from the integrated data

To study whether novel information about the disease can be inferred from HENA, we attempted to identify genes associated with Alzheimer’s disease using information about the genes and about the different types of interactions between pairs of genes. We also aimed to understand whether a network representation of heterogeneous data would be useful for this task. The study workflow is depicted in Figure 21.

Figure 21. Identification of genes potentially associated with Alzheimer’s disease in the HENA data set. The diagram shows the individual steps of the analysis for the identification of disease-related genes.

The most straightforward way to identify genes that are associated with Alzheimer’s disease would be to use a supervised machine learning method, where genes are labeled based on the association with Alzheimer’s disease. Using the labeled set of genes we can train a model to find a decision boundary between two classes, and apply it to predict the association for the rest of the genes.

The problem is that such a set of confirmed positive and negative associations between genes and Alzheimer’s disease is not yet well defined. Because of technological limitations, current Alzheimer’s disease research cannot provide sufficient data to define the difference between patients and healthy individuals at the level of molecular interactions. One example of this issue is the inability to measure the presence or absence of an interaction between proteins in regions of a living brain.

4.4.1. Defining a node class

Genes, and proteins as their products, can be defined as associated with the disease based on genome-wide association studies and on curated knowledge. Despite the substantial number of studies that have been carried out in the field of Alzheimer’s disease, there is no clear agreement between the studies on the set of genes that are clearly associated with Alzheimer’s disease [111, 229, 230]. It is even more challenging to define the genes that are not related to the disease, which makes it impossible to define the negative class, i.e. genes that are clearly not associated with Alzheimer’s disease. As the results of the HENA study demonstrate, the noisiness of the gene labelling poses a serious challenge.

Therefore, we adopt the following strategy:

• We create three classes of genes: positive, negative and unknown. We use information about the nodes from HENA to assemble a set of genes associated with the disease, based on the GWAS data set and the Alzheimer’s-related PPI data set collected in HENA. The negative label corresponds to genes that are defined as essential non-disease genes in the evolutionary study by Spataro et al. [231]. The remaining genes are labeled as unknown.

• Next, we explore a one-class (anomaly detection) formulation of the problem in order to assess the quality of class separation.

• We use different feature sets, i.e. biological features and graph structural features, and apply two supervised approaches to classify the nodes whose association with the disease is unknown.

• We apply the supervised models, i.e. Random Forest and HinSAGE, to the set of genes labelled as unknown and suggest the most likely candidates for Alzheimer’s-associated genes. Additionally, we perform a qualitative analysis by exploring the existing body of research on the suggested genes.
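The three-class labelling from the first step of this strategy can be illustrated with a minimal sketch. The function name, the gene identifiers and the rule that positive evidence takes precedence when a gene appears in both source sets are assumptions made for illustration; the text does not state how such overlaps were handled.

```python
def label_genes(all_genes, disease_genes, essential_non_disease_genes):
    """Assign each gene to 'positive', 'negative' or 'unknown'.

    disease_genes: genes from the GWAS and Alzheimer's-related PPI sets.
    essential_non_disease_genes: genes used as the negative class.
    """
    labels = {}
    for gene in all_genes:
        if gene in disease_genes:
            # Assumption: positive evidence wins if a gene is in both sets.
            labels[gene] = "positive"
        elif gene in essential_non_disease_genes:
            labels[gene] = "negative"
        else:
            labels[gene] = "unknown"
    return labels


# Toy example with hypothetical gene identifiers.
all_genes = ["G1", "G2", "G3", "G4"]
labels = label_genes(all_genes, {"G1"}, {"G1", "G2"})
```

In this toy example only G1 is positive and only G2 is negative; G3 and G4 form the unknown set to which a trained classifier would later be applied.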

4.4.2. Full graph exploration

Because we focus on genes and their relation to the disease, we excluded IGRs and the corresponding IGRIs from this analysis. We also kept only strong co-expression interactions in disease-associated brain regions, i.e. those with a Spearman correlation coefficient ≥ 0.5. A summary of the resulting graph is shown in Table 4.3.

                          # nodes    # edges
full graph                  24825    9740721
PPI subgraph                10445      52003
co-expression subgraph      14634    9671535
epistasis subgraph          13881      17183

Table 4.3. Number of nodes and edges for each subgraph in HENA data set.
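The filtering that produces the graph summarized in Table 4.3 might look roughly as follows. This is a sketch under stated assumptions: the edge record layout, the type strings "IGRI", "IGR" and "coexpression", and the toy scores are hypothetical, and the threshold is applied to the signed coefficient exactly as written in the text (≥ 0.5).

```python
def filter_edges(edges, node_type, min_rho=0.5):
    """Keep gene-gene edges, dropping weak co-expression edges.

    edges: list of dicts with 'source', 'target', 'type', 'score'.
    node_type: dict mapping node id -> 'gene' or 'IGR' (hypothetical labels).
    """
    kept = []
    for edge in edges:
        # Drop IGR interactions and any edge touching an IGR node.
        if edge["type"] == "IGRI":
            continue
        if "IGR" in (node_type.get(edge["source"]), node_type.get(edge["target"])):
            continue
        # Keep only strong co-expression (Spearman coefficient >= min_rho).
        if edge["type"] == "coexpression" and edge["score"] < min_rho:
            continue
        kept.append(edge)
    return kept


# Toy data with hypothetical identifiers and scores.
node_type = {"g1": "gene", "g2": "gene", "g3": "gene", "g4": "gene",
             "r1": "IGR", "r2": "IGR"}
edges = [
    {"source": "g1", "target": "g2", "type": "PPI", "score": 0.9},
    {"source": "g1", "target": "g3", "type": "coexpression", "score": 0.72},
    {"source": "g2", "target": "g3", "type": "coexpression", "score": 0.31},
    {"source": "r1", "target": "r2", "type": "IGRI", "score": 0.8},
    {"source": "g1", "target": "g4", "type": "coexpression", "score": 0.5},
]
kept = filter_edges(edges, node_type)
```

Here the weak co-expression edge (0.31) and the IGRI edge are removed, while the edge with a coefficient of exactly 0.5 is kept because the threshold is inclusive.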

Figure 22 shows the whole graph. It consists of one large connected component in the center and many small components populating the "ring" closest to the central component. Alzheimer’s genes, which are the minority, are shown as larger, red nodes. They appear to be uniformly distributed across the central component as well as the rings, which demonstrates that classifying genes according to their association with the disease is not trivial.

Figure 22. Visualization of the HENA data set using the Gephi platform [232]. Colors represent node types: red corresponds to genes associated with Alzheimer’s disease, yellow indicates genes that have no direct association with the disease, and purple indicates genes with no information about their association with the disease.
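The component structure described for Figure 22 (one large component plus many small ones, with disease genes spread across them) can also be inspected numerically. The following is a minimal sketch using a breadth-first search over an undirected edge list; the toy edges and the set of positive nodes are hypothetical, and nodes without any edge are not represented.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return connected components (as sets), largest first."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from start.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Toy graph: one larger component and two small ones.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f"), ("g", "h")]
positives = {"a", "e"}  # hypothetical disease-associated nodes

components = connected_components(toy_edges)
sizes = [len(c) for c in components]
positives_per_component = [len(c & positives) for c in components]
```

Comparing `sizes` with `positives_per_component` gives a rough picture of whether positive genes concentrate in particular components or are spread across them.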

4.4.3. Community detection analysis

In order to explore the connectivity of the nodes and identify potentially functionally related groups of nodes, we performed clustering of the whole graph.

We applied the Infomap clustering method [233, 234] to identify structurally related groups of nodes. By observing the resulting Infomap communities, we noted that Alzheimer’s genes are often connected via one or two hops.
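The one-hop or two-hop observation can be quantified with a bounded breadth-first search from each Alzheimer’s gene to the nearest other Alzheimer’s gene. The sketch below is one possible way to do this, not the procedure used in the study; the adjacency dictionary and node names are hypothetical.

```python
from collections import deque


def hops_to_nearest_positive(adj, source, positives, max_hops=2):
    """Return the hop distance from source to the nearest other positive
    node, or None if none is reachable within max_hops."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adj.get(node, ()):
            if neighbour in seen:
                continue
            if neighbour in positives:
                return dist + 1
            seen.add(neighbour)
            queue.append((neighbour, dist + 1))
    return None


# Toy undirected graph given as an adjacency dictionary.
adj = {
    "p1": {"x"}, "x": {"p1", "p2"}, "p2": {"x", "p3"}, "p3": {"p2"},
    "p4": {"y"}, "y": {"p4", "z"}, "z": {"y", "w"}, "w": {"z", "p5"},
    "p5": {"w"},
}
positives = {"p1", "p2", "p3", "p4", "p5"}
hops = {g: hops_to_nearest_positive(adj, g, positives) for g in sorted(positives)}
```

In the toy graph, p1 reaches p2 in two hops, p2 and p3 are direct neighbours, and p4 and p5 are four hops apart, so they fall outside the two-hop limit and map to `None`.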

The analysis of communities also suggests that there are several big clusters that mostly have internal connections of one particular type, for example, a community of genes with PPI connections, or a separate cluster of co-expressions.
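One way to check whether a community is dominated by a single interaction type is to count the types of its internal edges. The sketch below assumes edges are given as (source, target, type) triples and that community membership is already known; the function name and the toy data are hypothetical.

```python
from collections import Counter


def dominant_edge_type(community, edges):
    """Return (most common internal edge type, its share of internal edges).

    Only edges with both endpoints inside the community are counted.
    """
    members = set(community)
    counts = Counter(
        edge_type
        for source, target, edge_type in edges
        if source in members and target in members
    )
    if not counts:
        return None, 0.0
    edge_type, n = counts.most_common(1)[0]
    return edge_type, n / sum(counts.values())


# Toy edge list with hypothetical nodes.
typed_edges = [
    ("a", "b", "PPI"), ("b", "c", "PPI"), ("a", "c", "PPI"),
    ("c", "d", "coexpression"),  # crosses communities, so never counted
    ("d", "e", "coexpression"), ("e", "f", "coexpression"),
    ("d", "f", "epistasis"),
]
ppi_like = dominant_edge_type({"a", "b", "c"}, typed_edges)
mixed = dominant_edge_type({"d", "e", "f"}, typed_edges)
```

A share close to 1.0, as for the first toy community, corresponds to a cluster held together by a single interaction type; the second community is mixed, with co-expression accounting for two of its three internal edges.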

This observation is consistent with previous knowledge that biological networks are modular, small-world networks. In biological terms, this means that nodes sharing biological functionality are located in close proximity in the network [235, 236]. Taking into account that the clustering method disregards the heterogeneity of the edges, we hypothesized that the way the genes are connected reflects the type of interaction they are involved in.

Based on the results of community detection analysis we collected additional node features that rely on graph structure.

In document: ELENA SÜGIS, Integration Methods for Heterogeneous Biological Data (pages 68-71)