
6. Discussion

6.1.1. Coarse-graining Procedure

I studied different clustering methods and found that Louvain-type algorithms were the most suitable for my approach because they allowed me to integrate the graph structure into their calculations. In particular, I chose the Louvain algorithm with the Potts model [39] due to its comprehensible interpretation. With the Potts model, the clustering algorithm tries to maximize the number of internal edges within communities while keeping their sizes relatively small. In other words, whether a big cluster is split into two smaller ones depends on the density of links between those two smaller clusters (see Section 3.4). This interpretation of clustering fits my graph-based model, in which I also assumed that internal connections within rigid domains are more frequent than those between domains.
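As a minimal sketch of this coarse-graining step, the snippet below partitions a toy graph with the Louvain implementation shipped in networkx. Note the assumption: networkx's Louvain optimizes modularity with a resolution knob, which only approximates the Potts/CPM-style quality function of [39]; the function name `coarse_grain` is illustrative, not the thesis code.

```python
# Hedged sketch: coarse-graining a small graph with networkx's Louvain.
# The resolution parameter loosely stands in for the Potts-model resolution.
import networkx as nx

def coarse_grain(graph, resolution=1.0, seed=0):
    """Partition a graph into communities via the Louvain algorithm."""
    return nx.community.louvain_communities(graph, resolution=resolution, seed=seed)

# Toy example: two dense triangles joined by one weak link should split in two,
# mirroring the "split depends on inter-cluster link density" interpretation.
G = nx.Graph()
G.add_edges_from([(0, 1), (0, 2), (1, 2),   # clique A (a "rigid body")
                  (3, 4), (3, 5), (4, 5),   # clique B
                  (2, 3)])                  # sparse inter-clique connection
communities = coarse_grain(G)
print(sorted(sorted(c) for c in communities))  # [[0, 1, 2], [3, 4, 5]]
```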

Even though the coarse-graining process on the protein graph occasionally created wrong clusters, it significantly enhanced the mean-variance signals used by the quality function. In the following subsections, I discuss and analyze the two main effects of the coarse graining: the introduction of an inconsistency error and the enhancement of the signal.

Inconsistency Error

I introduce a metric, called the inconsistency error, to measure the efficiency of the protein graph construction and the coarse graining. This metric quantifies the heterogeneity of communities in the coarse-grained graph, weighted by their sizes. For the formal definition, let G = (V, E) be a graph with a set V of N vertices. For every i-th vertex v_i ∈ V, its label is denoted as σ_i. In addition, let C = {C_k} be the partition of vertices into communities C_k ⊂ V resulting from the coarse graining. I define the inconsistency error of the coarse-graining procedure as

\[
\mathrm{IE}(\mathcal{C} \mid G) = \sum_{C_k \in \mathcal{C}} \frac{|C_k|}{N} \cdot \frac{2}{|C_k|\,(|C_k| - 1)} \sum_{\substack{i<j \\ v_i, v_j \in C_k}} |\sigma_i \neq \sigma_j| \tag{6.1.1}
\]

where |C_k| is the cardinality of a community C_k and |σ_i ≠ σ_j| is 1 if σ_i ≠ σ_j, or 0 otherwise.

The above quantity is the average number of labelling mismatches within a cluster, weighted by the cluster size. The quantity IE(C|G) is zero if the coarse graining yields only homogeneous clusters. However, a zero inconsistency error is not all I aim for: the coarse graining could trivially achieve zero inconsistency by assigning each vertex to its own cluster, which produces no benefit. Thus, it is important to control the coarse graining such that it produces a small inconsistency while the number of clusters remains significantly smaller than the number of vertices. On the other hand, the error approaches one if the coarse graining produces entirely wrong clusters. This extreme only occurs when every vertex has a distinct label but all vertices are grouped into a single big cluster.
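The definition in Equation 6.1.1 can be computed directly; the sketch below assumes clusters are given as lists of vertex indices and `labels[i]` holds the reference label σ_i (both names are illustrative).

```python
# Sketch of the inconsistency error (Equation 6.1.1): the size-weighted average
# fraction of mismatched label pairs inside each cluster.
from itertools import combinations

def inconsistency_error(clusters, labels):
    n = sum(len(ck) for ck in clusters)
    ie = 0.0
    for ck in clusters:
        if len(ck) < 2:
            continue  # a singleton cluster has no pairs and contributes zero
        mismatches = sum(1 for i, j in combinations(ck, 2) if labels[i] != labels[j])
        # average pairwise mismatch within the cluster, weighted by cluster size
        ie += (len(ck) / n) * 2.0 * mismatches / (len(ck) * (len(ck) - 1))
    return ie

labels = ["A", "A", "A", "B", "B", "B"]
print(inconsistency_error([[0, 1, 2], [3, 4, 5]], labels))  # 0.0: homogeneous
print(inconsistency_error([[0, 1, 2, 3, 4, 5]], labels))    # 0.6: 9 of 15 pairs mismatch
```

Note that the trivial all-singleton partition also scores 0.0 here, which is exactly why the number of clusters has to be controlled alongside this error.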

I first studied the different ways to construct protein graphs from multiple conformations mentioned in Section 4.1.1. In short, in the disjunction-based protein graph construction, denoted as type (I), I created an edge between two vertices if their distance is smaller than a cutoff in at least one conformation. The edge weight is the number of such conformations.

In contrast, in the conjunction-based protein graph construction, denoted as type (II), an edge between two vertices is created if and only if their distance is smaller than a cutoff in all conformations. Additionally, I assigned each edge a weight given by its reciprocal exponentiated variance computed over all conformations (see Equation 4.1.3). This weight assignment follows the idea that low-variance edges receive a weight close to one and high-variance edges receive a weight close to zero.
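The two rules can be sketched as follows, assuming `coords` has shape (n_conformations, n_residues, 3). The function name, the decay constant `alpha`, and the exact exponential form `exp(-alpha * var)` are assumptions standing in for Equation 4.1.3, not the thesis implementation.

```python
# Hedged sketch of the type (I) and type (II) protein graph constructions.
import numpy as np

def build_edges(coords, cutoff=10.0, alpha=1.0):
    n_conf, n_res, _ = coords.shape
    # pairwise residue distances per conformation, shape (n_conf, n_res, n_res)
    d = np.linalg.norm(coords[:, :, None, :] - coords[:, None, :, :], axis=-1)
    close = d < cutoff
    type1, type2 = {}, {}
    for i in range(n_res):
        for j in range(i + 1, n_res):
            n_close = int(close[:, i, j].sum())
            if n_close > 0:                       # type (I): close in >= 1 conformation
                type1[(i, j)] = n_close           # weight = number of such conformations
            if n_close == n_conf:                 # type (II): close in ALL conformations
                var = float(d[:, i, j].var())
                type2[(i, j)] = np.exp(-alpha * var)  # low variance -> weight near 1
    return type1, type2

# Toy example: three residues in two conformations; residue 2 moves away.
coords = np.array([[[0., 0, 0], [1, 0, 0], [20, 0, 0]],
                   [[0., 0, 0], [1, 0, 0], [5, 0, 0]]])
t1, t2 = build_edges(coords)
print(t1)   # every pair is close in at least one conformation
print(t2)   # only the rigid pair (0, 1) survives, with weight 1.0
```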

Figure 6.1 shows that the second protein graph construction consistently outperformed the first type in terms of the inconsistency error. A possible explanation is as follows. The second construction rule produces a sparser protein graph, in which an edge between two vertices is created only when one is certain that these two are close in all conformations. Consequently, rigid bodies in such a protein graph tend to contain more internal edges than edges between bodies, and the coarse-graining procedure therefore incurs a smaller inconsistency error.

Additionally, I examined varying values of the edge cutoff, ranging from 7.5 to 13.5 Å. According to the results in Figure 6.1, there was a small but not significant improvement of the inconsistency error for larger cutoff values.

Figure 6.1.: Histogram of inconsistency error from graph construction types I and II and their varying cutoff values, respectively.

Signal Enhancement

As mentioned in Section 4.1.4, the efficiency of the quality function (Equation 4.1.6) depends on the separability between the mean-variance values of inter- and intra-vertices or edges in the line graph.

Given an array of mean-variance values calculated from inter- and intra-vertices in the line graph, I used the area under the ROC curve (AUC) to measure how well the mean-variance can distinguish between these two groups of vertices. Likewise, to measure the separability between inter- and intra-edges in the line graph, I applied the AUC in the identical manner.
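This separability measure can be sketched without any ML library via the Mann-Whitney formulation: the AUC equals the probability that a randomly drawn inter value exceeds a randomly drawn intra value (ties counting one half). The mean-variance numbers below are made up for illustration.

```python
# Sketch: AUC as the probability that an inter mean-variance value
# exceeds an intra mean-variance value (Mann-Whitney formulation).
def auc(intra, inter):
    wins = sum(1.0 if b > a else 0.5 if b == a else 0.0
               for a in intra for b in inter)
    return wins / (len(intra) * len(inter))

intra_mv = [0.10, 0.15, 0.20, 0.30]   # mean-variance of intra-vertices
inter_mv = [0.40, 0.60, 0.80, 0.90]   # mean-variance of inter-vertices
print(auc(intra_mv, inter_mv))        # 1.0: perfectly separable
print(auc(intra_mv, intra_mv))        # 0.5: indistinguishable groups
```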

The AUCs calculated on vertices and edges in the line graph derived from the coarse-grained graph (red bars in Panels (A) and (B) of Figure 6.2) are significantly larger than those calculated on vertices and edges in the line graph derived directly from the protein graph (blue bars in Panels (A) and (B) of Figure 6.2). Thus, Figure 6.2 provides strong evidence for the advantage of using the coarse graining in my methods.

Overall, in my study, I adjusted the resolution parameter of the Louvain algorithm so as to produce about twenty clusters of medium size. Clusters that are too big could increase the inconsistency error because amino acids in hinge regions tend to be merged together.

Figure 6.2.: Histograms of the area under the ROC curve (AUC) evaluated on 487 proteins in the DynDom dataset. (A) Histograms of the AUC calculated from the inter- and intra-vertices in the line graphs derived from the protein graph (blue histogram) and from the coarse-grained graph (red histogram). (B) Histograms of the AUC calculated from the inter- and intra-edges in the line graphs derived from the protein graph (blue histogram) and from the coarse-grained graph (red histogram).

Clusters that are too small, on the other hand, tend to have smaller inconsistency errors, at the cost of an insignificant mean-variance difference between two clusters.

6.1.2. Line Graph Transformation

Here, I present my insights and motivations concerning the line graph transformation. First, I explain why I needed to modify the construction of the line graph to make it suitable for my study of rigid-domain estimation in proteins. Second, I present the motivation behind the formula of the feature functions defined on vertices in the line graph (Ψ(1) in Equation 4.1.7).

Afterward, I discuss the ideas behind the formula of the feature functions defined on edges in the line graph (Ψ(2) in Equations 4.1.8, 4.1.9 and 4.1.10) as well as their limitations.

Modified Line Graph Construction

In the original line graph construction [79], any pair of incident edges in the original graph becomes an edge in the line graph. If the two outer end vertices of such a pair of incident edges also form an edge themselves, the mean-variance of these two vertices is used twice in the quality function, once as a vertex feature and once as an edge feature. To avoid such duplication, I modified the line graph construction by eliminating those line-graph edges whose two outer end vertices are linked in the original graph.

Additionally, such edge pruning in the line graph produces a sparser graph, which saves computational resources in the calculation of the generalized Viterbi algorithm.
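The modified construction can be sketched as follows: incident edges (u, v) and (v, w) of the original graph become adjacent in the line graph unless u and w are themselves linked, i.e. the three vertices form a triangle. The function name is illustrative.

```python
# Sketch of the modified line-graph construction with triangle pruning.
import networkx as nx

def modified_line_graph(graph):
    L = nx.Graph()
    # each edge of the original graph becomes a vertex of the line graph
    L.add_nodes_from(frozenset(e) for e in graph.edges())
    for v in graph.nodes():
        nbrs = list(graph.neighbors(v))
        for i, u in enumerate(nbrs):
            for w in nbrs[i + 1:]:
                if not graph.has_edge(u, w):  # prune: skip if u-w are linked
                    L.add_edge(frozenset((u, v)), frozenset((v, w)))
    return L

# In a triangle, all three candidate line-graph edges are pruned away,
# so the (u, w) mean-variance is never counted twice.
tri = nx.Graph([(0, 1), (1, 2), (0, 2)])
print(modified_line_graph(tri).number_of_edges())   # 0

path = nx.Graph([(0, 1), (1, 2)])
print(modified_line_graph(path).number_of_edges())  # 1
```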

Feature functions on vertices in the line graph

Given a protein with known rigid domains, the left panel of Figure 6.3 shows that the mean-variance values of inter-vertices in the line graph constructed from the coarse-grained graph tend to be bigger than those of the intra-vertices. This observation fits very well with the rigidity definition. However, an optimal threshold dividing the mean-variance of inter- and intra-vertices is protein-dependent, and no universal threshold exists. Yet I noticed that if I consider the mean-variance values of inter-vertices as outliers, I can identify almost all of them via outlier detection. Thus, in the feature functions on vertices, I reward the quality function when either the mean-variance of a vertex is an outlier (probably an inter-vertex) and the predicted label of that vertex is −1, or the mean-variance of a vertex is a non-outlier (probably an intra-vertex) and its predicted label is +1.
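A minimal sketch of this reward scheme is shown below. The IQR rule stands in for whatever outlier detector the thesis actually uses, and the function names and reward constant are illustrative assumptions.

```python
# Hedged sketch of the vertex feature: reward labelings in which outlier
# mean-variance vertices carry label -1 (inter) and the rest +1 (intra).
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag high mean-variance values via the interquartile-range rule."""
    q1, q3 = np.percentile(values, [25, 75])
    return values > q3 + k * (q3 - q1)   # only high variances count as outliers

def vertex_feature(values, labels, reward=1.0):
    outlier = iqr_outliers(np.asarray(values))
    labels = np.asarray(labels)
    # agreement: (outlier and label -1) or (non-outlier and label +1)
    agree = (outlier & (labels == -1)) | (~outlier & (labels == +1))
    return reward * agree.sum()

mv = [0.10, 0.12, 0.11, 0.09, 0.95]              # last vertex is a clear outlier
print(vertex_feature(mv, [+1, +1, +1, +1, -1]))  # 5.0: every label agrees
print(vertex_feature(mv, [+1, +1, +1, +1, +1]))  # 4.0: the outlier is mislabeled
```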

Feature functions on edges in the line graph

With a similar observation, the right panel of Figure 6.3 shows that the mean-variance values of inter- and intra-edges in the line graph seem to follow two distinct but overlapping distributions. Still, it is problematic to estimate these two distributions because there are not enough samples for a maximum-likelihood estimator such as expectation maximization (EM). However, by applying the same outlier-detection trick as mentioned above, I could recover a decent number of inter-edges based on their mean-variance. As shown in the right panel of Figure 6.3, the outliers indicated by big red dots on the x-axis cover many of the mean-variance values calculated from the inter-edges.

Figure 6.3.: Histograms of mean-variance values calculated from vertices and edges in a line graph. (Left) The left panel shows the frequency distribution of the mean-variance calculated on intra-vertices (red bars) and inter-vertices (blue bars) in the line graph. The big red dots on the x-axis indicate the outliers according to the outlier detection. (Right) Similarly, the right panel illustrates the frequency distribution of the mean-variance for the intra- and inter-edges in the line graph. The big red dots are again the outliers according to the outlier detection.

From these observations, I designed the feature function for an edge in the line graph such that it favors vertex labels according to the mean-variance of the edge and of its two corresponding vertices. The design of the feature function on edges, Ψ(2), is based on the inferences illustrated in Figure 6.4. In the cases of Figures 6.4.A to 6.4.D, a preferable labeling can be read off directly from the mean-variance of the edges and their two vertices. For instance, in Figure 6.4.A, a high mean-variance between the two outer vertices, together with a low and a high mean-variance between the common vertex and the two outer vertices, implies a positive and a negative label for the respective vertices in the line graph. The labeling inferences for the other cases (B), (C) and (D) follow similar reasoning. However, there are two cases (Figures 6.4.E and F) where the method cannot unambiguously infer the labeling.

These ambiguous cases happen when the signals from two pairs of vertices indicate that they belong to one domain, while the signal from the remaining pair indicates that they do not. Nevertheless, when I examined those two cases, there were substantial differences between them.

In the case of γ_e = −1 and γ_{v1 v2} = +1 (Figure 6.4.E), the two outer vertices probably belong to different domains and the common vertex lies in the hinge region.

To decide which domain this common vertex belongs to, the mean-variance values of v_1 and v_2 are compared, as shown in Equation 4.1.9. In contrast, Figure 6.4.F shows another contradictory case, for which I set the feature value to zero so that it cannot interfere with the labeling inference of the generalized Viterbi algorithm.
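The tie-break for the hinge case can be sketched as a one-line comparison. This is only an illustration of the idea behind Equation 4.1.9, assuming the common vertex is assigned to the side with the lower mean-variance (the more rigid connection); the function name and sign convention are hypothetical.

```python
# Illustrative tie-break for the ambiguous case of Figure 6.4.E: the common
# vertex in the hinge region sides with the less variable (more rigid) neighbor.
def resolve_hinge(mv_v1, mv_v2):
    """Return +1 if the common vertex should side with v1, -1 for v2."""
    return +1 if mv_v1 < mv_v2 else -1

print(resolve_hinge(0.2, 0.7))  # +1: the v1 side varies less
print(resolve_hinge(0.9, 0.3))  # -1: the v2 side varies less
```

The contradictory case of Figure 6.4.F, by contrast, simply contributes zero, so no such comparison is needed there.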