
Material and Methods

5. Automated Assignment of Human Readable Descriptions (AHRD)

5.3. Implementation, Evaluation and Optimization

AHRD 2.0 has been written in Java version 1.5 (java.com) and requires Apache Ant (ant.apache.org) for compilation. We designed it using the “test driven development” approach with the framework JUnit (junit.org). AHRD is configured using YAML files (yaml.org), which allow adaptation of parameters and inclusion of an arbitrary number of reference databases. The blacklists are configurable and given as lists of regular expressions. The three different sets of proteins used to evaluate and optimize AHRD were obtained as follows: 1419 manually curated, expert-annotated proteins from the recently completed Blumeria graminis fungal genome (Spanu, Abbott, Amselem, et al. 2010) were selected as the “B. graminis” test set. To generate the “swissprot” test set, 1000 proteins with a creation date in July 2011 were randomly extracted from the UniprotKB/Swissprot database version July 2011 (Boeckmann, Bairoch, Apweiler, et al. 2003; Bairoch and Apweiler 2000). Finally, the “tomato” test set contains 1132 manually curated, expert-annotated proteins from the recently published tomato genome. The genes in the tomato set mainly encode proteins involved in pathogen resistance.
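To illustrate how such a configuration could be read, the following is a minimal sketch, not AHRD's actual implementation: it parses a small illustrative YAML snippet with SnakeYAML (the choice of library is an assumption) and applies blacklist regular expressions to a candidate description. All key names and patterns shown are hypothetical examples, not AHRD's real configuration schema.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import org.yaml.snakeyaml.Yaml;

public class BlacklistSketch {

    // Hypothetical configuration snippet; AHRD's real key names may differ.
    private static final String CONFIG =
          "blast_dbs:\n"
        + "  swissprot:\n"
        + "    weight: 100\n"
        + "description_blacklist:\n"
        + "  - \"(?i)unknown protein\"\n"
        + "  - \"(?i)uncharacterized\"\n"
        + "  - \"(?i)whole genome shotgun sequence\"\n";

    public static void main(String[] args) {
        @SuppressWarnings("unchecked")
        Map<String, Object> config = (Map<String, Object>) new Yaml().load(CONFIG);
        @SuppressWarnings("unchecked")
        List<String> blacklist = (List<String>) config.get("description_blacklist");

        String candidate = "Uncharacterized protein At1g01010";
        boolean rejected = false;
        // Reject the candidate description if any blacklist pattern matches it.
        for (String regex : blacklist) {
            if (Pattern.compile(regex).matcher(candidate).find()) {
                rejected = true;
                break;
            }
        }
        System.out.println(candidate + " rejected: " + rejected);
    }
}
```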


5.3.1. Reference sets’ characteristics

To infer the description diversity found in a given reference protein set, I divided the number of distinct descriptions by the number of contained proteins. Furthermore, the frequency of each distinct protein description was assessed after the descriptions had been blacklisted and filtered using the described procedure (section 5.1, page 33). Subsequently, the number of most common descriptions was computed as the minimum number of descriptions that accounted for a quarter of the proteins in the reference set.

Afterwards, those most common descriptions, found to account for the first quarter of proteins in a given reference set, were ignored, and the next most common ones, accounting for the second, third, and finally fourth quarter of the references, were measured iteratively. These measures were assessed to answer the question whether more diverse reference sets prefer different optimal parameters than less diverse ones do.
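The following is a minimal sketch of this diversity measurement, assuming the filtered descriptions of a reference set are already available as a list of strings; it illustrates the procedure described above and is not the code actually used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DescriptionDiversitySketch {

    public static void main(String[] args) {
        // Toy input: one (already blacklisted and filtered) description per reference protein.
        List<String> descriptions = Arrays.asList(
                "protein kinase", "protein kinase", "protein kinase",
                "glucosidase", "glucosidase", "cytochrome p450",
                "abc transporter", "heat shock protein");

        // Count how often each distinct description occurs.
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String d : descriptions) {
            Integer c = counts.get(d);
            counts.put(d, c == null ? 1 : c + 1);
        }

        // Diversity: number of distinct descriptions divided by number of proteins.
        double diversity = (double) counts.size() / descriptions.size();
        System.out.println("diversity = " + diversity);

        // Sort frequencies in descending order and report, for each quarter of the
        // proteins, the minimum number of most common descriptions covering it.
        List<Integer> freqs = new ArrayList<Integer>(counts.values());
        Collections.sort(freqs, Collections.reverseOrder());
        int covered = 0, used = 0, quarter = 1;
        for (int f : freqs) {
            covered += f;
            used++;
            while (quarter <= 4 && covered >= quarter * descriptions.size() / 4.0) {
                System.out.println("quarter " + quarter + " covered by " + used + " descriptions");
                quarter++;
            }
        }
    }
}
```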

Subsequently, it was inferred whether the proteins of the three reference sets were drawn from frequently annotated and frequently studied proteins or covered a broader spectrum of functions, as found, for example, in complete eukaryotic angiosperm proteomes. In this context, sequence similarity searches were carried out separately for each reference set in the three public protein databases UniprotKB/Swissprot, UniprotKB/trEMBL, and TAIR10. The following comparison of the results revealed which reference set had more hits of high sequence similarity in each of the public databases. UniprotKB/Swissprot entries undergo a manual revision by expert curators before they are added to the public database (Boeckmann, Bairoch, Apweiler, et al. 2003). Because the confidence of an expert curator in a candidate protein annotation surely increases the more reference proteins of high sequence similarity share the candidate function, a tendency can be expected to find more Swissprot entries of frequently annotated functions and currently favoured research interests. Thus, a reference set whose results showed such a tendency was interpreted as being drawn from frequently annotated proteins and as resembling less the protein function spectrum expected from a random selection of proteins from a complete proteome, like for example that of A. thaliana. Finally, in order to obtain a measure of how alike, according to their function, any two proteins in the B. graminis reference set are, pairwise sequence identities were measured using BLAST (McGinnis and Madden 2004) with an E-value cutoff of 10.0. After self matches had been excluded, the distribution of these pairwise sequence identities was examined.
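As a minimal sketch of the last step, the following assumes an all-against-all blastp search of the B. graminis proteins whose results are available in tabular format (output format 6, in which the third column is the percent identity); the file name and this column layout are assumptions made for illustration, not part of the original pipeline.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PairwiseIdentitySketch {

    public static void main(String[] args) throws IOException {
        // Hypothetical file: all-against-all blastp results in tabular format
        // (qseqid, sseqid, pident, length, ..., evalue, bitscore).
        String blastTable = "bgraminis_vs_bgraminis.blastp.tab";

        List<Double> identities = new ArrayList<Double>();
        BufferedReader in = new BufferedReader(new FileReader(blastTable));
        String line;
        while ((line = in.readLine()) != null) {
            String[] cols = line.split("\t");
            String query = cols[0], hit = cols[1];
            // Exclude self matches before looking at the identity distribution.
            if (query.equals(hit)) {
                continue;
            }
            identities.add(Double.valueOf(cols[2]));
        }
        in.close();

        // Summarize the distribution of pairwise sequence identities.
        double sum = 0.0, max = 0.0;
        for (double id : identities) {
            sum += id;
            max = Math.max(max, id);
        }
        System.out.println("pairs: " + identities.size()
                + ", mean identity: " + (sum / identities.size())
                + "%, max identity: " + max + "%");
    }
}
```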

5.3.2. Reference set curation

The sequence similarity searches for proteins in the above three reference sets, B. graminis, swissprot, and tomato, were done with “blastp” (version 2.2.21) (Altschul, Madden, Schaffer, et al. 1997; McGinnis and Madden 2004) with an e-value threshold of 0.0001. For each query protein in these test sets we searched three different protein databases for similar sequences: UniprotKB/Swissprot (version July 2011) (Boeckmann, Bairoch, Apweiler, et al. 2003; Bairoch and Apweiler 2000), UniprotKB/trEMBL (version July 2011) (Boeckmann, Bairoch, Apweiler, et al. 2003; Bairoch and Apweiler 2000) and TAIR10 (Huala, Dickerman, Garcia-Hernandez, et al. 2001; Poole 2007). From these, to avoid self matches, we removed all proteins belonging to the species Solanum lycopersicum and all proteins contained in the swissprot test set. Because the Blumeria graminis genome had not yet been published, none of its proteins were contained in the three searched protein databases. Gene Ontology term annotations (Ashburner, Ball, Blake, et al. 2000) were obtained by matching InterProScan (version 4.5) (Apweiler, Attwood, Bairoch, et al. 2000; Zdobnov and Apweiler 2001) results to the InterPro2GO mappings (file version March 2nd 2011) (Apweiler, Attwood, Bairoch, et al. 2000; Zdobnov and Apweiler 2001) and by using our in-house pipeline PhyloFun (version 1.0) based on Sifter, version 1.2 (Engelhardt, Jordan, Muratore, and Brenner 2005; Engelhardt, Jordan, Srouji, and Brenner 2011).
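To make the InterPro2GO step concrete, the following minimal sketch parses lines of the publicly distributed interpro2go mapping file (lines of the form “InterPro:IPR000719 Protein kinase domain > GO:protein kinase activity ; GO:0004672”) and assigns the mapped GO terms to a protein's InterProScan hits. The file name and the toy protein's hits are illustrative assumptions, not taken from the original pipeline.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class InterPro2GoSketch {

    public static void main(String[] args) throws IOException {
        // Parse the interpro2go mapping into InterPro-ID -> set of GO-IDs.
        Map<String, Set<String>> iprToGo = new HashMap<String, Set<String>>();
        BufferedReader in = new BufferedReader(new FileReader("interpro2go")); // assumed file name
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("!")) {   // skip comment lines
                continue;
            }
            String ipr = line.substring("InterPro:".length(), line.indexOf(' '));
            String go = line.substring(line.lastIndexOf(';') + 1).trim(); // e.g. "GO:0004672"
            Set<String> gos = iprToGo.get(ipr);
            if (gos == null) {
                gos = new HashSet<String>();
                iprToGo.put(ipr, gos);
            }
            gos.add(go);
        }
        in.close();

        // Toy example: InterProScan hits of a single query protein (hypothetical).
        List<String> interProHits = Arrays.asList("IPR000719", "IPR011009");
        Set<String> goAnnotation = new HashSet<String>();
        for (String ipr : interProHits) {
            if (iprToGo.containsKey(ipr)) {
                goAnnotation.addAll(iprToGo.get(ipr));
            }
        }
        System.out.println("GO terms: " + goAnnotation);
    }
}
```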


5.3.3. Competitors and Quality-Assessment

AHRD’s annotations were compared to two competing methods. The first was the Blast2GO suite “b2g4pipe” version 2.5.0 (Conesa and Gotz 2008; Conesa, Gotz, García-Gómez, et al. 2005), which enables execution on the command line. The required Blast2GO database was downloaded and set up according to the provided documentation with the latest data available in July 2011. The second competing method took protein descriptions from the best BLAST hits (Altschul, Madden, Schaffer, et al. 1997; McGinnis and Madden 2004) of the above three independent sequence similarity searches. We assessed AHRD’s performance by averaging the F2-score (Rijsbergen 1979) over every Human Readable Description (HRD) assigned by our program. The F2-score is calculated as the weighted harmonic mean of the two statistics precision and recall, both of which are based on counting shared words, ignoring case, in the reference (REF) and the assigned HRD. Treating the reference and the assigned description as mathematical sets of words, precision and recall can be calculated as follows:

\[
\mathrm{precision} = \frac{|REF \cap HRD|}{|HRD|}, \qquad (5.9)
\]

\[
\mathrm{recall} = \frac{|REF \cap HRD|}{|REF|}, \qquad (5.10)
\]

where $|\cdot|$ is the set cardinality.
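Since the F2-score is the F-measure with β = 2, weighting recall higher than precision, it can be written as F2 = 5 · precision · recall / (4 · precision + recall). The following is a minimal sketch of this word-overlap evaluation, assuming descriptions are tokenized simply by splitting on whitespace; it illustrates the metric, not the exact tokenization used by AHRD.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class F2ScoreSketch {

    /** Lower-cased word set of a description, split on whitespace. */
    static Set<String> words(String description) {
        return new HashSet<String>(Arrays.asList(description.toLowerCase().split("\\s+")));
    }

    /** F2-score of an assigned HRD against a reference description. */
    static double f2(String reference, String assigned) {
        Set<String> ref = words(reference);
        Set<String> hrd = words(assigned);
        Set<String> shared = new HashSet<String>(ref);
        shared.retainAll(hrd);                                     // REF ∩ HRD
        if (shared.isEmpty()) {
            return 0.0;
        }
        double precision = (double) shared.size() / hrd.size();   // eq. 5.9
        double recall = (double) shared.size() / ref.size();      // eq. 5.10
        return 5.0 * precision * recall / (4.0 * precision + recall);
    }

    public static void main(String[] args) {
        System.out.println(f2("Receptor-like protein kinase", "Protein kinase family protein"));
    }
}
```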

AHRD’s performance, measured as the average F2-score, was compared with that of the two other competing annotation methods explained above.

5.3.4. Parameter optimization

Using this mean F2-score as the objective function, we were able to optimize AHRD’s parameters and assess its robustness. To achieve this we implemented a simulated annealing approach (Kirkpatrick, Gelatt, and Vecchi 1983) and ran it on the three above test sets. In order to avoid overfitting, the parameters found to be optimal for one test set were cross-validated on the other two, respectively. In detail, during each iteration of the optimization the mean F2-score was calculated for the currently evaluated parameters, which were then accepted if the score had improved compared to that of the currently accepted parameters. Worse performing parameters could also be accepted, with a probability $p_{acpt}$ depending on the current temperature $T_c$ and the constant scaling factor $k$:

\[
p_{acpt} = e^{-\left(F_2(a) - F_2(c)\right)\cdot \frac{k}{T_c}}, \qquad (5.11)
\]

where $F_2(\cdot)$ is the mean F2-score, $a$ is the accepted parameter set, $c$ is the currently evaluated parameter set, $T_c$ is the current temperature, and $k$ is the scale parameter.
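A minimal sketch of this acceptance rule follows, assuming the reconstruction of equation 5.11 above (score difference scaled by k and divided by the current temperature); the temperature values and F2-scores used are placeholders for illustration.

```java
import java.util.Random;

public class AcceptanceSketch {

    /** Probability of accepting the currently evaluated parameter set (eq. 5.11). */
    static double acceptanceProbability(double f2Accepted, double f2Current,
                                        double temperature, double k) {
        if (f2Current >= f2Accepted) {
            return 1.0;                      // improvements are always accepted
        }
        return Math.exp(-(f2Accepted - f2Current) * k / temperature);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double k = 7000000.0;                // scaling factor as in, e.g., run six
        double f2Accepted = 0.6021, f2Current = 0.6010;
        // Accepting a worse parameter set becomes less likely as the temperature cools.
        for (double temperature = 10000.0; temperature >= 2500.0; temperature -= 2500.0) {
            double p = acceptanceProbability(f2Accepted, f2Current, temperature, k);
            boolean accept = rng.nextDouble() < p;
            System.out.println("T=" + temperature + "  p=" + p + "  accepted=" + accept);
        }
    }
}
```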

Each iteration of this simulated annealing implementation cooled down its temperature by 1 degree, after which a neighboring set of the accepted parameters was generated and evaluated in the next iteration. This neighbor generation was achieved by slightly mutating a randomly selected parameter


by a value $M$ based on two custom parameters $c_1$, $c_2$ and a Gaussian distributed random value $r$ with mean 0 and standard deviation 1:

\[
M = (r \cdot c_1) + c_2, \qquad (5.12)
\]

where $c_1$, $c_2$ are configurable weights.
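The neighbor generation could look like the following sketch, which mutates one randomly chosen parameter by $M = (r \cdot c_1) + c_2$; the toy parameter array and the clamping to [0, 1] are simplifying assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Random;

public class NeighborSketch {

    static final Random RNG = new Random();

    /**
     * Returns a neighboring parameter set by mutating one randomly selected
     * parameter by M = (r * c1) + c2, with r drawn from a standard normal
     * distribution (eq. 5.12). Clamping to [0, 1] is an assumption made here
     * to keep the toy parameters inside a valid range.
     */
    static double[] neighbor(double[] parameters, double c1, double c2) {
        double[] mutated = Arrays.copyOf(parameters, parameters.length);
        int i = RNG.nextInt(mutated.length);       // randomly selected parameter
        double m = RNG.nextGaussian() * c1 + c2;   // mutation value M
        mutated[i] = Math.min(1.0, Math.max(0.0, mutated[i] + m));
        return mutated;
    }

    public static void main(String[] args) {
        double[] weights = { 0.5, 0.3, 0.2 };      // toy parameter set
        System.out.println(Arrays.toString(neighbor(weights, 0.25, 0.25)));
    }
}
```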

We executed eight separate optimization runs with different start temperatures, different numbers of starting parameter sets, and different values for $c_1$, $c_2$, and $k$, as well as some differences in implementation.

Of these optimization runs, the first six still used a different formula to compute AHRD’s “old overlap score”. This formula takes into account the coverage on the query sequence only, while the “new overlap score” (formula 5.1, page 34) considers the coverage on both the query and the hit (subject) sequence:

\[
o_i = \frac{QueryEnd - QueryStart + 1}{QueryLength}, \qquad (5.13)
\]

where $Query$ is the query protein’s amino acid sequence, $QueryStart$ and $QueryEnd$ refer to the query sequence positions in the BLAST alignment, and $QueryLength$ is the query sequence length.

We also increased the likelihood of the simulated annealing approach following the mean F2-score slope uphill. This was achieved by introducing a probability of again mutating the parameter whose change led to an increase in mean F2-score during the last iteration. This probability $p_h$ was termed “hill climbing probability” and computed as follows:

\[
p_h = \frac{e^{-(1-d)} + s}{e^{0} + s}, \qquad (5.14)
\]

where $d$ is the increase in mean F2-score achieved in the last iteration, and $s$ is a scaling factor set to 0.7.

Its distribution is plotted in figure 5.1 (page 39).


[Figure 5.1.: Simulated annealing “Hill climbing probability” distribution. The plot shows P(“mutate same parameter again”) (y-axis, approx. 0.7 to 1.0) as a function of the increase in mean F2-score achieved in the last iteration (x-axis, 0.0 to 1.0).]
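The following sketch evaluates this probability over the range of possible F2-score increases; it assumes the reconstruction of equation 5.14 given above.

```java
public class HillClimbingProbabilitySketch {

    /** "Hill climbing" probability (eq. 5.14) with scaling factor s. */
    static double hillClimbingProbability(double d, double s) {
        return (Math.exp(-(1.0 - d)) + s) / (Math.exp(0.0) + s);
    }

    public static void main(String[] args) {
        // With s = 0.7 the probability is about 0.63 for d = 0 and rises to 1.0 for d = 1.
        for (double d = 0.0; d <= 1.0; d += 0.25) {
            System.out.println("d=" + d + "  p_h=" + hillClimbingProbability(d, 0.7));
        }
    }
}
```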

The parameters and method used in each simulated annealing run are summarized in table 5.1.

Table 5.1.: Simulated annealing parameters

Run    No. start points   Start temperature   c1     c2     k          Use “Hill climbing”
One    1000               10000               1.5    1.5    3500000    No
Two    1000               30000               1.5    1.5    3500000    No
Three  1000               30000               0.25   0.25   15000000   Yes
Four   1000               30000               0.25   0.25   9250000    Yes
Five   10000              10000               0.25   0.25   7000000    No
Six    1000               10000               0.25   0.25   7000000    Yes
Seven  709571             1                   0      0      0          No
Eight  6                  50000               0.25   0.25   7000000    No

In order to estimate the proportion of the parameter space that had been evaluated, we first approximated its size by limiting its axes to the interval from zero to one for all weight parameters, or from zero to a hundred for the database weights, respectively. Then we discretized each axis into a hundred distinct values and computed the size of the parameter space as $100^{n_a}$, where $n_a$ denotes the number of different parameters subjected to optimization and evaluates to 10, yielding $100^{10} = 10^{20}$ grid points. Furthermore, each set of parameters generated and evaluated during simulated annealing was compared with all others in order to measure how many pairwise distinct parameter sets had been tested. To further estimate the coverage of the parameter space and the performance of the optimization itself, the fractions of parameter mutations that yielded an increase, decrease, or no change in F2-score, respectively, were assessed, as well as the Euclidean distances walked in parameter space by each simulated annealing run. Finally, we assessed the influence of the temperature on the used implementation of simulated annealing; specifically, the current rates of accepting or rejecting mutated parameter sets were measured on intervals of 1000 degrees and plotted together with the F2-scores of the currently accepted and of all evaluated parameter sets.
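A minimal sketch of two of these measures follows: counting pairwise distinct parameter sets and summing the Euclidean distances walked between consecutively evaluated parameter sets. The toy trajectory is a made-up example, not data from the actual runs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ParameterSpaceCoverageSketch {

    /** Euclidean distance between two parameter sets of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Toy trajectory of evaluated parameter sets (made-up values).
        List<double[]> trajectory = Arrays.asList(
                new double[] { 0.5, 0.3, 0.2 },
                new double[] { 0.6, 0.3, 0.2 },
                new double[] { 0.6, 0.3, 0.2 },   // duplicate evaluation
                new double[] { 0.6, 0.4, 0.1 });

        // Number of pairwise distinct parameter sets.
        Set<String> distinct = new HashSet<String>();
        for (double[] p : trajectory) {
            distinct.add(Arrays.toString(p));
        }

        // Total Euclidean distance walked along the trajectory.
        double walked = 0.0;
        for (int i = 1; i < trajectory.size(); i++) {
            walked += distance(trajectory.get(i - 1), trajectory.get(i));
        }

        // Size of the discretized parameter space: 100 values on each of 10 axes.
        double spaceSize = Math.pow(100, 10);   // = 1e20
        System.out.println("distinct sets: " + distinct.size()
                + ", distance walked: " + walked
                + ", fraction of space evaluated: " + distinct.size() / spaceSize);
    }
}
```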

Optimization runs seven and eight were done after switching to the newly introduced overlap score and discontinuing the use of the “old overlap score” (see section 5.3.4, page 38), which made a new optimization necessary. After assessing six high-scoring parameter sets in run seven, we submitted those to further optimization in run eight.