

5.1.1 Whole genome phylogeny reconstruction

Alignment-free sequence comparison has become a vibrant field of research in bioinformatics, and a large variety of approaches have been proposed over the past decades. Most traditional alignment-free methods are based either on word counts (see Section 1.1.2) or on match lengths (see Section 1.1.3). Recently, a new idea for estimating distances emerged that is based on micro-alignments (see Section 1.1.4). I discussed the advantages of these methods compared to other alignment-free approaches but also pointed out that current approaches are limited to closely related organisms.

In my research, I focused on the development of a new alignment-free method for phylogeny reconstruction that overcomes the limitation of current approaches (see Chapter 2).

Similar to CO-Phylog [133] and andi [38], FSWM is based on micro-alignments. Both competing approaches use a minimum-length criterion on the matching word pairs to distinguish between background and homologous hits. This approach, however, becomes increasingly imprecise the longer the sequences are. The relationship between the sequence length and the number of homologous and background matches was pointed out in Section 1.1.2. To alleviate this problem, both approaches additionally use a uniqueness criterion, as described in Section 1.1.4. Its effectiveness is, however, limited. Instead of filtering matches by length and uniqueness, I proposed a new filtering procedure which calculates a score for each spaced-word match and discards matches with a score below a threshold. To investigate which threshold should be used, I plotted the number of spaced-word matches against their score. We called these plots spamograms (spaced-word match histograms) (see Chapter 4). These spamograms showed that a threshold of 0 separates most homologous matches from background matches (see Chapter 2 Figure 1), and I therefore used this value as the default parameter. If background matches are filtered out, patterns with fewer match positions can be used without the signal between the sequences being lost in noise. This is especially important if the sequences are more distantly related, because the number of homologous word matches decreases with increasing sequence divergence while the number of background matches remains the same. Consequently, the FSWM approach is competitive with CO-Phylog and andi for closely related sequences but outperforms both for more distantly related sequences. I showed that FSWM is able to estimate substitution rates up to about 0.9 and is less influenced by insertions and deletions than other approaches (see Chapter 2 Figure 2). Moreover, the phylogenies based on distances calculated with FSWM were more accurate than the phylogenies based on distances calculated by other alignment-free approaches (see Chapter 2 Figure 3).
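The score-based filtering described above can be illustrated with a minimal Python sketch. It is not the FSWM implementation. The scoring scheme (+1 for a match, -1 for a mismatch at a don't care position) is an assumption made only for this example, as are the function names, the binary pattern notation ('1' for match positions, '0' for don't care positions) and the choice to keep only scores strictly above the threshold.

```python
from collections import Counter, defaultdict

# Assumed scoring scheme (not from the text): +1 per matching and -1 per
# mismatching nucleotide at the don't care positions.
MATCH_SCORE = 1
MISMATCH_SCORE = -1


def spaced_word(seq, start, pattern):
    """Characters of seq at the match positions ('1') of the pattern."""
    return "".join(seq[start + k] for k, c in enumerate(pattern) if c == "1")


def find_matches(seq1, seq2, pattern):
    """All (i, j) where seq1 at i and seq2 at j agree at every match position."""
    index = defaultdict(list)
    for j in range(len(seq2) - len(pattern) + 1):
        index[spaced_word(seq2, j, pattern)].append(j)
    return [
        (i, j)
        for i in range(len(seq1) - len(pattern) + 1)
        for j in index.get(spaced_word(seq1, i, pattern), [])
    ]


def score_match(seq1, i, seq2, j, pattern):
    """Sum the per-position scores over the don't care positions ('0')."""
    return sum(
        MATCH_SCORE if seq1[i + k] == seq2[j + k] else MISMATCH_SCORE
        for k, c in enumerate(pattern)
        if c == "0"
    )


def spamogram(scores):
    """Histogram mapping each score to the number of matches with that score."""
    return Counter(scores)


def filter_matches(scored, threshold=0):
    """Keep (i, j, score) triples whose score exceeds the threshold."""
    return [m for m in scored if m[2] > threshold]
```

Plotting the counts returned by `spamogram` against the scores gives the kind of histogram described above, and `filter_matches` with the default threshold of 0 keeps only the matches with a positive score.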

One notorious problem associated with whole-genome comparison is the presence of repetitive regions and gene duplications. For example, a spaced-word can occur at position i in sequence S1 while the same spaced-word occurs in sequence S2 at positions j and j′, i.e. there are two matches, and both matches have a positive score. In this scenario, one could assume that only one match is due to orthology and the other is due to duplication.

For phylogeny reconstruction, the orthologs are used to estimate distances and must be identified first [98, 126, 45]. To this end, I proposed a greedy one-to-one mapping of spaced-word matches, which works as follows: first, all spaced-word matches with a positive score are stored in a list. Then, the spaced-word match with the highest score is picked. Next, the number of matching and non-matching nucleotides at the don’t care positions of this spaced-word match is determined, and this match and all other matches that start at the same position in S1 or S2 are removed from the list. This procedure is repeated with the next highest-scoring spaced-word match in the list until the list is empty. CO-Phylog and andi, on the other hand, deal with duplications implicitly through their respective uniqueness criteria, i.e. word matches for which no clear assignment is possible are ignored.
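The greedy one-to-one mapping can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the FSWM code: the tuple layout `(i, j, score, mismatches)` and the function name are hypothetical, and ties between equally scored matches are broken by input order, which the description above does not specify. Sorting once and skipping start positions that are already taken has the same effect as repeatedly picking the best match and deleting all matches that share a start position with it.

```python
def greedy_one_to_one(matches, dont_care):
    """Select a one-to-one subset of spaced-word matches greedily by score.

    matches:   iterable of (i, j, score, mismatches) tuples, where i and j are
               start positions in S1 and S2 (hypothetical layout).
    dont_care: number of don't care positions in the pattern.
    Returns (selected, n_match, n_mismatch).
    """
    # Only matches with a positive score enter the list.
    candidates = sorted(
        (m for m in matches if m[2] > 0), key=lambda m: m[2], reverse=True
    )
    used_i, used_j = set(), set()
    selected = []
    n_match = n_mismatch = 0
    for i, j, score, mismatches in candidates:
        # A match whose start position in S1 or S2 is already taken would
        # have been removed from the list when the earlier match was picked.
        if i in used_i or j in used_j:
            continue
        used_i.add(i)
        used_j.add(j)
        selected.append((i, j))
        # Count nucleotides at the don't care positions of the chosen match.
        n_mismatch += mismatches
        n_match += dont_care - mismatches
    return selected, n_match, n_mismatch
```

In the duplication scenario described above, two matches that share the start position i in S1 compete with each other, and only the higher-scoring one contributes to the match and mismatch counts.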

FSWM is fast enough to compare pairs of large sequences in reasonable time. For example, it took FSWM about one and a half days to calculate all pairwise distances for a large eukaryotic data set consisting of 14 taxa with a total size of 4.8 GB. In the study presented in Chapter 2, FSWM was considerably slower than andi but faster than CO-Phylog. Complexity-wise, however, CO-Phylog should be much faster than FSWM, but the published implementation leaves room for improvement. The run time of FSWM is discussed in Section 5.2.

Web server

To make the FSWM approach accessible to a broad range of researchers, I developed a user-friendly web server. Users can upload a set of sequences in Multi-FASTA format and specify the number of match positions of the pattern, called the weight. The job is then executed on a powerful server, and the user can track its progress via the assigned ID. After the calculations are finished, the spamograms are visualized on the website and the user can switch between them. The visualization of the spamograms helps the user to identify a good cut-off value between random and homologous matches if the default threshold does not separate them properly. The threshold can be changed independently for each spamogram with the scroll bar at the bottom. If the threshold is changed for one sequence pair, the resulting distance is dynamically recalculated and shown on the website. Once the user is satisfied with the selected thresholds, the distance matrix and the corresponding tree, calculated with Neighbor-Joining, can be downloaded. The spamograms can also be saved as images.

Figure 2: Screenshot of the FSWM web server, available at http://fswm.gobics.de
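How a distance might be recalculated when the threshold for one sequence pair changes can be sketched as follows. This is a simplified illustration, not the web server's code. It assumes that the distance is obtained from the mismatch frequency at the don't care positions of the retained matches via the Jukes-Cantor correction, which this section does not state, and it leaves out the one-to-one mapping step. The function names and the `(score, mismatches)` tuple layout are hypothetical.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance for a mismatch frequency p (assumed model)."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_for_threshold(matches, threshold, dont_care):
    """Recompute the distance of one sequence pair for a given threshold.

    matches:   list of (score, mismatches) tuples, one per spaced-word match.
    dont_care: number of don't care positions in the pattern.
    Returns None if no match passes the threshold.
    """
    kept = [m for m in matches if m[0] > threshold]
    if not kept:
        return None
    mismatches = sum(m[1] for m in kept)
    p = mismatches / (len(kept) * dont_care)
    return jukes_cantor(p)
```

Raising the threshold discards low-scoring matches, which typically contain more mismatches, so the estimated distance changes along with the cut-off chosen in the spamogram.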

There are some restrictions on the web server to prevent server overload. The total file size is limited to 512 MB and the number of sequences is restricted to 100. For large-scale data sets, or for researchers familiar with Linux, it is preferable to use the command-line version of FSWM. The command-line tool, however, does not provide a visualization of the spamograms. One solution to this problem could be to upload a smaller subset of the sequences to the web server to assess which threshold works best for most sequence pairs.