Rigid Segmentation Benchmark - Probabilistic Models to Detect Important Sites in Proteins

5. Results

5.1.2. Rigid Segmentation Benchmark

Following Nguyen et al. [19], I assessed my graph-based method on the large DynDom benchmark data set [86], a collection of proteins that each have two different conformational states. To remove redundancy, I filtered out all proteins whose average TM-score [55] with the other proteins was greater than or equal to 0.5, the threshold above which, according to Zhang et al. [87], proteins tend to have a similar structure. Moreover, this study only considers medium to large conformational changes, so I also removed protein structure pairs whose RMSD was less than 5.0 Å. As a result, I obtained 487 proteins for the assessment. Additionally, I removed domains that consisted of fewer than ten amino acids.
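The filtering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input layout (a dictionary of per-protein RMSDs and a dictionary of pairwise TM-scores), the function names, the order of the two protein-level filters, and the choice to mark residues of small domains as `None` are all hypothetical, and the toy values are not taken from DynDom.

```python
from collections import Counter
from statistics import mean


def select_benchmark(rmsd_by_id, tm_scores, tm_cutoff=0.5, rmsd_cutoff=5.0):
    """Return the protein IDs that pass both filters.

    rmsd_by_id : dict, protein ID -> RMSD (in Å) between its two conformations
    tm_scores  : dict, frozenset({id_a, id_b}) -> pairwise TM-score
    """
    kept = []
    for pid, rmsd in rmsd_by_id.items():
        others = [tm_scores[frozenset((pid, q))] for q in rmsd_by_id if q != pid]
        # Redundancy filter: drop proteins whose average TM-score is >= cutoff.
        if others and mean(others) >= tm_cutoff:
            continue
        # Motion-size filter: drop pairs with a small conformational change.
        if rmsd < rmsd_cutoff:
            continue
        kept.append(pid)
    return kept


def drop_small_domains(labels, min_size=10):
    """Mark residues of domains with fewer than min_size residues as None
    (an assumed convention for 'removed')."""
    sizes = Counter(labels)
    return [lab if sizes[lab] >= min_size else None for lab in labels]


# Toy input (hypothetical IDs and values, for illustration only).
rmsd_by_id = {"A": 6.2, "B": 3.1, "C": 8.4}
tm_scores = {
    frozenset(("A", "B")): 0.2,
    frozenset(("A", "C")): 0.6,
    frozenset(("B", "C")): 0.7,
}
print(select_benchmark(rmsd_by_id, tm_scores))
```

In this toy input, protein "C" fails the TM-score filter and protein "B" fails the RMSD filter.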

For the numerical evaluation, I used two metrics, error and overlap, as defined in [19]. In the following, I explain each metric and illustrate it with an example.

Error

The error measures how often two segmentations disagree on whether a pair of amino acids belongs to the same domain. For instance, consider a protein of ten amino acids with the two segmentations given in Table 5.1.

Table 5.1.: An example of a protein with two segmentations.

1st segmentation   0 0 0 1 1 1 2 2 0 0
2nd segmentation   1 1 2 2 2 0 0 3 3 3

The first segmentation has three domains, while the second has four. To represent the inter/intra-domain relationship of every pair of amino acids in a segmentation, I use a 10×10 square matrix (Tables 5.2 and 5.3).

Table 5.2.: Inter-intra relationships over all pairs of amino acids in the 1st segmentation.

    0 0 0 1 1 1 2 2 0 0
0     0 0 1 1 1 1 1 0 0
0       0 1 1 1 1 1 0 0
0         1 1 1 1 1 0 0
1           0 0 1 1 1 1
1             0 1 1 1 1
1               1 1 1 1
2                 0 1 1
2                   1 1
0                     0

Table 5.3.: Inter-intra relationships over all pairs of amino acids in the 2nd segmentation.

    1 1 2 2 2 0 0 3 3 3
1     0 1 1 1 1 1 1 1 1
1       1 1 1 1 1 1 1 1
2         0 0 1 1 1 1 1
2           0 1 1 1 1 1
2             1 1 1 1 1
0               0 1 1 1
0                 1 1 1
3                   0 0
3                     0

Because of the symmetry, only the upper triangle of each table is needed. The diagonal entries are always zero and thus of no interest. An entry is "0" if, according to the segmentation, its two corresponding vertices (amino acids) belong to the same domain, and "1" otherwise.

The error between the two segmentations is the number of disagreements between the two tables above, normalized by the number of pairs of amino acids. In other words, the disagreements are obtained by an element-wise exclusive-or (XOR) of Tables 5.2 and 5.3, which yields Table 5.4, and the error is

$$\text{error} = \frac{2 \times \operatorname{sum}(\text{Table 5.4})}{10 \times 9} = \frac{2 \times 16}{10 \times 9} = 0.356.$$

Table 5.4.: Disagreements between Tables 5.2 and 5.3.

0 1 0 0 0 0 0 1 1
  1 0 0 0 0 0 1 1
    1 1 0 0 0 1 1
      0 1 0 0 0 0
        1 0 0 0 0
          1 0 0 0
            1 0 0
              1 1
                0
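The error computation can be reproduced with a short Python sketch. The function name `segmentation_error` is a hypothetical choice, and the code works directly on the two label sequences of Table 5.1 rather than on explicit 10×10 matrices; each pair of amino acids is flagged as "different domain" in either segmentation, and a disagreement is the exclusive-or of the two flags.

```python
from itertools import combinations


def segmentation_error(seg_a, seg_b):
    """Fraction of amino-acid pairs on which two segmentations disagree."""
    if len(seg_a) != len(seg_b):
        raise ValueError("segmentations must have the same length")
    n = len(seg_a)
    if n < 2:
        raise ValueError("at least two amino acids are required")
    disagreements = 0
    for i, j in combinations(range(n), 2):  # upper triangle only
        split_a = seg_a[i] != seg_a[j]  # entry "1" in the first table
        split_b = seg_b[i] != seg_b[j]  # entry "1" in the second table
        disagreements += split_a != split_b  # exclusive-or of the two entries
    # n * (n - 1) / 2 pairs in total, hence the factor 2 in the numerator.
    return 2 * disagreements / (n * (n - 1))


first = [0, 0, 0, 1, 1, 1, 2, 2, 0, 0]
second = [1, 1, 2, 2, 2, 0, 0, 3, 3, 3]
print(round(segmentation_error(first, second), 3))  # 0.356
```

Because only the "same domain or not" relation enters the computation, the result does not depend on how the domains are numbered in either segmentation.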

Overlap

The overlap counts the matches between two segmentations after their domain labels have been matched so that the agreement is maximal; this matching is obtained by solving a low-dimensional linear assignment problem. Let us use the same example to illustrate how the overlap between the two segmentations is calculated. First, we compute the overlap matrix (Table 5.5).

Table 5.5.: The overlap table between two segmentations.

                      Segmentation 2
                      0   1   2   3
Segmentation 1   0    0   2   1   2
                 1    1   0   2   0
                 2    1   0   0   1

The first column and first row of Table 5.5 contain the domain indexes of the first and second segmentation, respectively. Taking the first segmentation as the reference, each row counts how many amino acids of that domain fall into each domain of the second segmentation, computed from the labels in Table 5.1. The largest entry in each row is the maximum agreement that domain can reach (here 2, 2, and 1). In this example these row maxima can be chosen in distinct columns, so they also solve the assignment problem. Thus, the overlap between the two segmentations is the sum of the maximal agreements divided by the length of the sequence, which is (2+2+1)/10 = 0.50.
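The overlap computation can be sketched as follows. This is a minimal illustration, not the implementation of [19]: the function names are hypothetical, the counts are derived directly from the label sequences of Table 5.1, and the assignment problem is solved by brute force over permutations, which is only practical because the number of domains is small. For larger problems a Hungarian-type solver such as `scipy.optimize.linear_sum_assignment` could be substituted.

```python
from itertools import permutations


def overlap_matrix(seg_a, seg_b):
    """Count, for every pair of domains, the amino acids they share."""
    labels_a = sorted(set(seg_a))
    labels_b = sorted(set(seg_b))
    counts = [[0] * len(labels_b) for _ in labels_a]
    for a, b in zip(seg_a, seg_b):
        counts[labels_a.index(a)][labels_b.index(b)] += 1
    return counts


def overlap(seg_a, seg_b):
    """Maximal fraction of matching amino acids over one-to-one domain matchings."""
    if len(seg_a) != len(seg_b) or not seg_a:
        raise ValueError("segmentations must be non-empty and of equal length")
    counts = overlap_matrix(seg_a, seg_b)
    # Make sure there are at least as many columns as rows.
    if len(counts) > len(counts[0]):
        counts = [list(col) for col in zip(*counts)]
    n_rows, n_cols = len(counts), len(counts[0])
    # Brute-force linear assignment: each row gets its own distinct column.
    best = max(
        sum(counts[row][col] for row, col in enumerate(cols))
        for cols in permutations(range(n_cols), n_rows)
    )
    return best / len(seg_a)


first = [0, 0, 0, 1, 1, 1, 2, 2, 0, 0]
second = [1, 1, 2, 2, 2, 0, 0, 3, 3, 3]
for row in overlap_matrix(first, second):
    print(row)
print(overlap(first, second))
```

Requiring distinct columns is what distinguishes the assignment from simply summing row maxima: two domains of the reference cannot both be matched to the same domain of the other segmentation.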

Although error and overlap are calculated differently, they are strongly negatively correlated.

Benchmark Assessment

Figure 5.2 shows the histograms of the error and overlap between my segmentations and DynDom's on the 487 entries of the DynDom data set, using an edge cutoff of 7.5 Å. The median error and overlap are 0.038 and 0.972, respectively. In particular, around 30% of my labelings agree closely with the annotation provided by DynDom (overlap ≥ 0.99). Occasionally, however, my method was unable to compute a reasonable segmentation, for two possible reasons. First, the coarse-graining step failed to produce homogeneous communities, i.e., communities in which most amino acids belong to the same domain. Second, the mean-variance-based signals calculated from inter- and intra-domain vertices/edges in the line graph were indistinguishable. This confused the scoring function, so that the most probable labeling did not coincide with the actual one.

Figure 5.2.: Histograms of the error and overlap evaluated on 487 proteins in the DynDom data set.

I investigated examples whose segmentations according to my method disagreed with DynDom. I observed that my method sometimes suggested a more reasonable labeling than the annotations from DynDom. For example, consider the human importin subunit beta-1 protein, an entry in the DynDom data set. Panel (B) of Figure 5.3 shows the result of my algorithm, run on the open and closed states of this protein (PDB code 3lww, chains A and C). My method produced two separate rigid domains with RMSDs of 2.228 and 1.003 Å. The DynDom annotation, on the other hand, suggests three rigid domains with RMSDs of 6.843, 4.321, and 2.106 Å (Panel (A) of Figure 5.3). Notably, the first domain of the DynDom annotation (dark green) is small, fragmented, and has a relatively large RMSD. In addition, a significant portion of the second domain (dark red) is intertwined with the third domain (dark blue). Overall, my segmentation of the human importin subunit beta-1 protein seems more reasonable than the one from DynDom, judged both by the RMSD metric and by visual inspection.

Figure 5.3.: Protein graph of the human importin subunit beta-1 protein. (A) Segmentation suggested by DynDom: three rigid domains colored dark green, red, and blue. (B) My segmentation: two rigid domains colored light green and blue.

I also studied the influence of the edge cutoff used in the construction of the protein graphs by varying its value. Table 5.6 summarizes the results, reporting the mean and median of the error and overlap on the 487 proteins of the DynDom data set for several edge cutoffs. The overlap seems to be unaffected by the choice of edge cutoff, whereas the error drops slightly with larger cutoffs. I suggest two probable explanations. First, a large edge cutoff produces a denser protein graph, which seems to yield a better coarse-grained graph (see Discussion, Section 6.1.1). Second, a denser protein graph results in a denser coarse-grained graph, which seems to enhance the mean-variance-driven signals for the scoring function in the line graph. Nonetheless, I restricted the cutoff to smaller values because of the computational cost of the generalized Viterbi algorithm.

Table 5.6.: The performance of my graph-based method with varying edge cutoff values, assessed on the DynDom data set.

Cutoff    Median overlap   Mean overlap   Median error   Mean error
7.5 Å     0.972            0.924          0.038          0.086
10.5 Å    0.977            0.924          0.034          0.083
13.5 Å    0.972            0.926          0.033          0.081
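How the edge cutoff controls the density of a protein graph can be illustrated with a small Python sketch. It rests on assumptions that are not stated in this section: each amino acid is represented by a single 3D point (for example its Cα atom), two amino acids are connected when their distance does not exceed the cutoff, and the coordinates are a toy straight chain rather than a real structure. The function name `contact_edges` is hypothetical.

```python
from itertools import combinations
from math import dist


def contact_edges(coords, cutoff):
    """Return all index pairs (i, j), i < j, whose points lie within cutoff (Å)."""
    return [
        (i, j)
        for i, j in combinations(range(len(coords)), 2)
        if dist(coords[i], coords[j]) <= cutoff
    ]


# Toy coordinates: ten points spaced 3.8 Å apart on a straight line
# (illustration only, not taken from any protein structure).
coords = [(3.8 * i, 0.0, 0.0) for i in range(10)]

for cutoff in (7.5, 10.5, 13.5):
    print(cutoff, len(contact_edges(coords, cutoff)))
```

On this toy chain the edge count grows from 9 to 17 to 24 as the cutoff increases, which mirrors the trade-off described above: a denser graph carries more signal but enlarges the problem handed to the later, more expensive steps.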
