
2.3 Data analysis methods prior to phylogenetic tree reconstruction

Data analyses were conducted in close cooperation with several co-workers of the molecular laboratory at the ZFMK in Bonn (namely P. KÜCK, R. STOCSITS and H. LETSCH) and at the University of Hamburg (B. MISOF), who programmed the software. Thanks to this direct collaboration, the newly developed software tools could be used and further improved by discussing the results of the analyses (table 2.1).

Table 2.1: Main analyses included, with the marker genes used and the main focus of each analysis.

Analysis   Marker genes    Main focus of analysis
[A]        16S, 18S, COI   Standard vs. secondary structure guided alignments
[B]        18S, 28S        Time-homogeneous vs. time-heterogeneous processes, secondary structure
[C]        EST sequences   Orthologous gene selection, relative information content of genes

Because this thesis was conducted within the framework of the DFG (Deutsche Forschungsgemeinschaft) priority program “Deep Metazoan Phylogeny”, a close cooperation existed with collaborators of the other three arthropod groups involved in this program. This cooperation was mainly with K. MEUSEMANN.

The methods and programs applied to reconstruct trees for this thesis are explained in the sections on the specific analyses. Note that the EST analysis procedure generally followed the same principles as shown in figure 2.6, but differed slightly owing to the differences between phylogenomic and single-gene data. A detailed flowchart for the EST analysis is given in chapter 2.6. For that phylogenomic analysis some bioinformatics and computational tasks had to be outsourced to the bioinformatics group (V. HÄSELER, Vienna, [...]

[...] sequences. Therefore sequence errors cannot be discovered in these data. Processed sequence data are prealigned [2] with multiple sequence alignment programs. For rRNA genes, a secondary structure-based alignment optimization follows. To gain a first impression of the information in the data, its structure is evaluated by phylogenetic network reconstruction. The final alignment evaluation and processing follows [4]. For each gene, ALISCORE (MISOF & MISOF 2009) is run to identify randomly similar aligned positions, and ALICUT excludes the positions found by ALISCORE (masking process). The single masked alignments are concatenated into the final alignment by a Perl script. For most analyses it is useful to compare the data structure before and after the alignment process in a network reconstruction [5]. The last step is the phylogenetic tree reconstruction [6].
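The concatenation step mentioned above can be illustrated with a minimal Python sketch. It is not the Perl script used in this thesis; the function name, the example taxa and the choice to pad taxa missing from a gene with "N" are assumptions made for illustration only.

```python
def concatenate_alignments(alignments, fill="N"):
    """Concatenate per-gene alignments into one supermatrix.

    alignments: list of dicts mapping taxon name -> aligned sequence.
    Taxa missing from a gene are padded with `fill` (an assumption of
    this sketch, not necessarily what the original Perl script does).
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("all sequences of one alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            # Use the aligned sequence if present, otherwise a block of fill characters
            parts[taxon].append(aln.get(taxon, fill * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Hypothetical masked single-gene alignments
gene_18s = {"taxon1": "ACGU-A", "taxon2": "ACGUUA"}
gene_28s = {"taxon1": "GGC", "taxon3": "GAC"}
final = concatenate_alignments([gene_18s, gene_28s])
for taxon, seq in final.items():
    print(taxon, seq)
```

Every row of the result has the same length, so the output can be written directly to any alignment format expected by the tree reconstruction software.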

2.3.1 Sequence processing and quality control

All resulting sequence electropherograms were analyzed and assembled using the software programs SeqMan (DNASTAR, Lasergene), CEQ 8000 (BECKMAN COULTER) or BioEdit 7.0 (HALL 1999). Unfortunately, most published sequences are not linked to their trace files, so the quality of their electropherograms cannot be assessed. All final sequences and composed fragments were searched against NCBI using BLASTN, MEGABLAST and BLAST2SEQUENCES to exclude contaminations. This is the final but eminently important (and very often ignored) step that completes the laboratory work. Ambiguous sequences, whether own or published, were always excluded from the analyses.

2.3.2 Multiple sequence alignment

Sequence pre-alignments were performed for each gene separately with the commonly applied alignment programs MUSCLE (EDGAR 2004A; 2004B) and MAFFT (KATOH ET AL. 2002).

For a comparison of MUSCLE and MAFFT alignments see section 2.4. Tests of MAFFT have indicated that its L-INS-i algorithm is more reliable for rRNA genes. These often contain expansion segments and ambiguous regions with variable length polymorphisms, which require a different approach to estimating and placing gaps (KATOH & TOH 2008).

In general, several different MSA programs were tested in parallel for this study. In addition to the software cited above, for example the new version of CLUSTALX (LARKIN ET AL. 2007; THOMPSON ET AL. 1997; THOMPSON ET AL. 1994) and T-COFFEE (NOTREDAME ET AL. 2000) were applied, but MUSCLE and MAFFT outperformed these and other programs regarding time and efficiency. In the end, the standard settings were used for all alignment programs.
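For orientation, the following Python sketch shows how pre-alignment calls of this kind are typically assembled. The file names are hypothetical, and the exact command lines used for this thesis are not documented here. The MAFFT options shown are the documented long form of L-INS-i, and the MUSCLE call uses the version 3 syntax with default settings.

```python
def mafft_linsi_command(fasta_in):
    # L-INS-i is MAFFT's --localpair mode with iterative refinement;
    # MAFFT writes the alignment to standard output.
    return ["mafft", "--localpair", "--maxiterate", "1000", fasta_in]


def muscle_command(fasta_in, fasta_out):
    # MUSCLE 3.x syntax, all other settings left at their defaults
    return ["muscle", "-in", fasta_in, "-out", fasta_out]


# Hypothetical input file with unaligned 18S sequences
commands = [
    mafft_linsi_command("18S.fasta"),
    muscle_command("18S.fasta", "18S_muscle.fasta"),
]
for cmd in commands:
    print(" ".join(cmd))
```

If the programs are installed, each list can be passed to `subprocess.run(cmd, check=True)`. For MAFFT the standard output has to be redirected to a file.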

2.3.3 Alignment optimization based on secondary structure information

The first step in alignment algorithms relies on the identification of similar sequence regions, which are subsequently arranged into sets of strings with maximized character identity in alignment positions, underlying homology hypotheses (see section 1.4).

The software RNAsalsa (STOCSITS ET AL. 2009) is a new approach to align structural rRNA sequences based on existing knowledge about structure patterns, using constraint-directed thermodynamic folding algorithms and comparative evidence methods. This makes alignment reconstruction more objective. For each molecule, a second trait, the structure, is considered in addition to sequence similarity. RNAsalsa automatically and simultaneously generates both individual secondary structure predictions within a set of homologous RNA genes and a consensus structure for the dataset. Sequence and structure information are then successively taken into account as part of the alignment's scoring function.

Thus, functional properties of the investigated molecule are incorporated to corroborate homology hypotheses for individual sequence positions. The program employs a progressive multiple alignment method, which includes dynamic programming and affine gap penalties.
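To illustrate what dynamic programming with affine gap penalties means, the following Python sketch computes the optimal global alignment score of two sequences with the standard three-matrix (Gotoh) recursion. It is a generic, sequence-only example: the scoring values are arbitrary assumptions, and it does not reproduce RNAsalsa's scoring function, which also includes structure information.

```python
NEG_INF = float("-inf")


def affine_gap_score(x, y, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Optimal global alignment score; a gap of length k costs
    gap_open + (k - 1) * gap_extend."""
    n, m = len(x), len(y)
    # M: x[i] aligned to y[j]; X: x[i] aligned to a gap; Y: y[j] aligned to a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a gap is more expensive than extending an existing one
            X[i][j] = max(M[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_gap_score("ACGU", "AGU"))
```

A progressive multiple alignment applies such pairwise steps repeatedly along a guide tree, aligning sequences or profiles in order of decreasing similarity.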

Inferred site covariation patterns are then used to guide the application of mixed nucleotide/doublet substitution models in subsequent phylogenetic analyses. RNAsalsa needs a prealignment as input. For a description of the exact algorithm and parameters of RNAsalsa, see STOCSITS ET AL. (2009); for the manual and software download, see the homepage at http://rnasalsa.zfmk.de.

Secondary structure constraints for analyses [A] are based on the 16S (L20934) and 18S (78065) sequences of Anopheles gambiae and Anopheles albimanus. For analysis [B] the 28S+5.8S (U53879) and 18S (V01335) sequences of Saccharomyces cerevisiae were used.

Corresponding secondary structures for the sequences were extracted from the European Ribosomal Database (DERIJK ET AL. 2000; VAN DE PEER ET AL. 2000; WUYTS ET AL. 2000; WUYTS ET AL. 2004; WUYTS ET AL. 2002). Structure strings were converted into dot-bracket format using Perl scripts. Folding interactions between 28S and 5.8S (GILLESPIE 2005; GILLESPIE ET AL. 2006; MICHOT ET AL. 1983) required the inclusion of the 5.8S gene in the constraint to avoid artificial stems. Alignment sections presumably involved in the formation of pseudoknots were locked from folding to avoid artifacts. Pseudoknots in Saccharomyces cerevisiae are known (WUYTS ET AL. 2000) for the 18S (stem 1 and stem 20, V4-region: stem E23\9, E23\10, E23\11 and E23\13), while they are lacking in the 28S secondary structure.
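The dot-bracket format mentioned above encodes each unpaired position as a dot and each base pair as a matching pair of brackets. The following Python sketch shows the idea. The input representation (a list of 0-based base-pair positions) is a hypothetical one chosen for illustration; it is not the database format, nor the Perl scripts used here.

```python
def pairs_to_dot_bracket(length, pairs):
    """Build a dot-bracket string from 0-based (i, j) base-pair positions."""
    structure = ["."] * length
    for i, j in pairs:
        if i > j:
            i, j = j, i
        if i == j or i < 0 or j >= length:
            raise ValueError(f"invalid base pair ({i}, {j})")
        if structure[i] != "." or structure[j] != ".":
            raise ValueError("a position cannot pair twice")
        structure[i], structure[j] = "(", ")"
    return "".join(structure)


def dot_bracket_to_pairs(structure):
    """Recover the base pairs from a dot-bracket string using a stack."""
    stack, pairs = [], []
    for pos, char in enumerate(structure):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            if not stack:
                raise ValueError("unbalanced closing bracket")
            pairs.append((stack.pop(), pos))
    if stack:
        raise ValueError("unbalanced opening bracket")
    return sorted(pairs)


hairpin = pairs_to_dot_bracket(9, [(0, 8), (1, 7), (2, 6)])
print(hairpin)
```

Plain dot-bracket notation can only express nested pairs. Crossing pairs, as in pseudoknots, do not survive a round trip through these two functions, which is one reason why such regions need special treatment.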

Prealignments and constraints served as input, and RNAsalsa was run with default settings.

2.3.4 Evaluating structure and signal by network reconstruction

Phylogenetic networks (HUSON & BRYANT 2006) were reconstructed to evaluate the general structure, potential conflicts and signal-like patterns in the alignments. Because they do not constrain the result of a phylogenetic analysis to the form of a bifurcating tree, these networks can be used to visualize the presence of conflicting signals in the data (HUSON ET AL. 2005). Conflicts are indicated by non-parallel edges that represent conflicting splits between taxa, while the length of the parallel edges supporting a certain split shows the relative support for that split in the data (as an indicator of the weight of the split, analogous to branch lengths in a tree). For a detailed description of phylogenetic networks see HUSON & BRYANT (2006) and WÄGELE & MAYER (2007). With the software SplitsTree 4.10 (HUSON 1998; HUSON & BRYANT 2006), the neighbor-joining algorithm was applied for network reconstruction in analyses [A] and [B]; in analysis [B] the LogDet transformation was additionally applied to analyze the alignment of the complete 18S and 28S rRNA genes. LogDet is a distance transformation that corrects for biases in base composition (PENNY ET AL. 1994; STEEL ET AL. 2000).
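As an illustration of the LogDet idea, the following Python sketch computes one common form of the distance for a pair of aligned sequences: d = -1/4 [ln det F - 1/2 ln(prod f_x prod f_y)], where F is the 4x4 matrix of joint base frequencies and f_x, f_y are its row and column sums. This is a textbook formulation under stated assumptions; the implementation in SplitsTree may differ in details such as the treatment of gaps or invariant sites.

```python
import math

BASES = "ACGT"


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def logdet_distance(seq_x, seq_y):
    """LogDet distance of two aligned sequences; non-ACGT columns are skipped."""
    counts = [[0.0] * 4 for _ in range(4)]
    total = 0
    for bx, by in zip(seq_x.upper(), seq_y.upper()):
        if bx in BASES and by in BASES:
            counts[BASES.index(bx)][BASES.index(by)] += 1
            total += 1
    if total == 0:
        raise ValueError("no comparable sites")
    F = [[c / total for c in row] for row in counts]
    fx = [sum(row) for row in F]                                  # composition of x
    fy = [sum(F[i][j] for i in range(4)) for j in range(4)]       # composition of y
    det_f = determinant(F)
    if det_f <= 0 or min(fx) == 0 or min(fy) == 0:
        raise ValueError("LogDet distance is undefined for this pair")
    log_marginals = sum(math.log(f) for f in fx + fy)
    return -0.25 * (math.log(det_f) - 0.5 * log_marginals)


d_same = logdet_distance("ACGTACGTACGT", "ACGTACGTACGT")
d_diff = logdet_distance("ACGTACGTACGT", "ACGTACGTACGA")
print(round(d_same, 6), round(d_diff, 4))
```

The normalization by the base frequencies makes the distance of a sequence to itself zero, even when the base composition is uneven.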

2.3.5 Alignment evaluation and processing

Alignments were assessed with the software ALISCORE (MISOF & MISOF 2009) to identify ambiguous or randomly similar aligned sections. For this purpose ALISCORE uses a parametric approach, relying on defined models of sequence evolution. This results in a [...]

ALISCORE generates profiles of randomness using a sliding window approach. Sequence positions within this window are assumed to have random-like nucleotide patterns when the observed score does not exceed 95% of the scores of random sequences of similar window size and character composition, generated by a Monte Carlo resampling process. ALISCORE generates a list file of all putative randomly similar sections. No distinction is made between random similarity caused by mutational saturation and alignment ambiguity. The default settings were used: the window size was w=6, gaps were treated as ambiguities (-N option), and the maximum number of possible random pairwise comparisons (-r option) was analyzed.
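The principle of the sliding window test can be sketched in Python for a single pair of sequences. This is a strongly simplified illustration and not ALISCORE's actual algorithm: the match/mismatch score, the pooling of both sequences to estimate character composition, and all names and defaults are assumptions of this sketch.

```python
import random


def window_score(a, b):
    # Simple similarity score: +1 per identical position, -1 otherwise
    return sum(1 if x == y else -1 for x, y in zip(a, b))


def random_like_windows(seq_a, seq_b, window=6, n_random=1000, quantile=0.95, seed=0):
    """Return start positions of windows whose score does not exceed the
    given quantile of scores from randomly generated window pairs."""
    rng = random.Random(seed)
    pool = list(seq_a + seq_b)  # character composition of the observed data
    null_scores = sorted(
        window_score(rng.choices(pool, k=window), rng.choices(pool, k=window))
        for _ in range(n_random)
    )
    threshold = null_scores[min(int(quantile * n_random), n_random - 1)]
    return [
        start
        for start in range(len(seq_a) - window + 1)
        if window_score(seq_a[start:start + window],
                        seq_b[start:start + window]) <= threshold
    ]


conserved = "ACGT" * 5
shifted = "CATG" * 5  # differs from `conserved` at every position
print(random_like_windows(conserved, conserved))
print(random_like_windows(conserved, shifted))
```

Windows of identical sequences score far above the random expectation and are not flagged, whereas windows without any matching position are flagged throughout.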

The alignment masking process was conducted with the program ALICUT (KÜCK, http://utilities.zfmk.de). This Perl script masks the alignment by excluding the positions that an ALISCORE analysis identified as randomly similar.

The consensus secondary structure of rRNA genes given by RNAsalsa was included in the alignment in analyses [A] and [B]. Consequently, both the aligned sequences and the consensus sequence are masked. In this way, the user can consider secondary structure information for phylogenetic analysis, for example by implementing mixed models for RNA molecules. By default, ALICUT also excludes stem positions if they are identified as “randomly similar aligned” and converts the corresponding stem nucleotide into a dot, ignoring covariation.

However, it is plausible that the evolution of stem positions is constrained by secondary structure and covariation patterns. Therefore, the -s option in ALICUT was used to keep all stem positions in the alignment.
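The two masking behaviours described above can be sketched in Python as follows. This is not the ALICUT source code: the function names, the 0-based column indices and the `keep_stems` flag (standing in for the -s option) are assumptions made for illustration.

```python
def pair_partners(structure):
    """Map each paired position of a dot-bracket string to its partner."""
    stack, partner = [], {}
    for pos, char in enumerate(structure):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            left = stack.pop()
            partner[left], partner[pos] = pos, left
    return partner


def mask_alignment(sequences, structure, flagged, keep_stems=True):
    """Remove flagged columns from the sequences and the consensus structure.

    keep_stems=True retains flagged columns that are part of a stem.
    keep_stems=False removes them and turns the pairing partner into a dot.
    """
    partner = pair_partners(structure)
    struct = list(structure)
    if keep_stems:
        removed = {pos for pos in flagged if pos not in partner}
    else:
        removed = set(flagged)
        for pos in removed:
            if pos in partner:
                struct[partner[pos]] = "."  # partner loses its pairing
    keep = [pos for pos in range(len(structure)) if pos not in removed]
    masked = {name: "".join(seq[p] for p in keep) for name, seq in sequences.items()}
    return masked, "".join(struct[p] for p in keep)


# Hypothetical alignment with one stem between columns 0 and 3
alignment = {"taxon1": "ACGUAC", "taxon2": "ACGUUC"}
consensus = "(..).."
with_stems = mask_alignment(alignment, consensus, [0, 1, 5], keep_stems=True)
without_stems = mask_alignment(alignment, consensus, [0, 1, 5], keep_stems=False)
print(with_stems)
print(without_stems)
```

With stems kept, the masked structure string stays balanced and can still be used to assign stem positions to a doublet model.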

2.4 Analyses [A]: Can 16S, 18S and COI marker genes