• Keine Ergebnisse gefunden

The absence of many Actinobacteria, which are of high interest for a detailed evaluation of the algorithm in cooperation with members of the Institute of Genome Research at Bielefeld University, together with the clustering in the COG database being too specific, led to the decision to develop an alternative approach that allows a fully automated and parameterized grouping of genes into families of homologs, only based on the evaluation of their sequence similarity. Therefore, we present a new definition of a family of homologous genes, where homology in this case is defined by a combination of parameterized tests for paralogy and orthology. Based on this definition of gene families, we describe an algorithm that establishes this family classification based on the results of an all-gainst-allTBLASTN comparison.

4.2.1 Notation

Given a collection of k genomes G = (G1, G2, . . . , Gk), |Gi| is the number of genes in Gi and Gi[j] denotes the j-th gene of genome Gi with |Gi[j]| the length of its amino-acid sequence. Applying an all-against-all TBLASTN comparison [3], we get for each pair of genes (Gi[j],Gi0[j0]), 1 i, i0 k, 1 j ≤ |Gi|, 1 j0 ≤ |Gi0|, a TBLASTN hit Hit(Gi[j], Gi0[j0]). TheTBLASTN hit contains three different values: Hitp(Gi[j], Gi0[j0]) is the rate of matching amino-acids,Hitl(Gi[j], Gi0[j0]) the length of the calculated alignment, and Hite(Gi[j], Gi0[j0]) denotes the E-value of the hit. Furthermore we call

Hitc(Gi[j], Gi0[j0]) := Hitl(Gi[j], Gi0[j0]) max(|Gi[j]|,|Gi0[j0]|) the coverage of Hit(Gi[j], Gi0[j0]).

4.2. A RELAXED FAMILY DEFINITION 55 Definition (paralog). Given two genes g = Gi[j] and g0 = Gi0[j0] of G, i = i0, j 6= j0, and the thresholdsp0, e0, and c0, g and g0 are called paralogous if and only if

Hitc(g, g0)> c0 and (Hitp(g, g0)> p0 or Hite(g, g0)< e0).

We call two genes of the same genome paralogous, if theirTBLASTN hit has a sufficiently large coverage and either the rate of matching amino-acids exceeds the chosen threshold or the probability that this hit occurred just by chance (E-value) is below the selected threshold.

Instead of relying only on the E-value as measurement of the significance of aTBLASTN hit, we also regard the rate of matching amino-acids to decide whether two genes are conserved paralogs or not. This becomes especially important for genes consisting only of a very short amino-acid sequence, since by the definition of the E-value genes of shorter amino-acid sequences are more likely to occur by chance, and therefore the E-value might not reflect the true significance of the hit. Another important aspect in the test-condition for paralogs is the term regarding the coverage of a TBLASTN hit. Many genes are composed of smaller pieces of amino-acid sequences called domains. Usually, domains encode a certain sub-function of a gene, and it is quite common that if a sub-function is required in many different processes of an organism, one will find several genes containing such a highly similar domain, while the general functions of the genes may differ. Therefore, choosing for example a threshold of c0 > 0.7 requires that the length of the computed alignment for two genes has to cover at least 70% (generally at least three of four domains) of the amino acid sequence of both genes.

Definition (ortholog). Given two genes g =Gi[j] and g0 =Gi0[j0] of G, i6=i0, and the thresholds p0, e0, and c0, g and g0 are called orthologous if and only if

Hitc(g, g0)> c0 and (Hitp(g, g0)> p0 or Hite(g, g0)< e0).

In general, the definition of paralogs and orthologs are identical, except that orthologs cannot be found inside the same genome and paralogs not between different genomes. But observe, that in the practical application it will be possible to set the thresholds for the parameters p0,e0, and c0 differently for paralogs and orthologs.

4.2.2 Creating the Families of Homologs

The general idea of our strategy to group genes into families of homologs is similar to the construction of the COG clusters described in Section 4.1. In contrast to the COG approach, here we have a more flexible description of paralogous and orthologous genes

and our grouping procedure does not rely on the identification of best hits between species with a constrained phylogenetical relation (best hits creating a triangle structure in phy-logenetically distant species).

After performing the all-against-all TBLASTN search, in a first step the TBLASTN hits between two genes g and g0 are made symmetric. This becomes necessary, since for g and g0 always at least two hits (Hit(g, g0) and Hit(g0, g)) exist. Since the TBLASTN program is not symmetric in the order of input parameters the two hits might slightly differ in their coverage, E-value and matching rate. These differences may cause inconsistencies in determining if the two genes are paralogs or orthologs, especially if the values are in a close proximity around the selected thresholds p0, e0, and c0. To enforce symmetry, the hits are modified by averaging as follows:

Hitgc(g, g0) = Hitc(g0, g) +Hitc(g, g0)

2 ,

Hitgp(g, g0) = Hitp(g0, g) +Hitp(g, g0)

2 ,

Hitge(g, g0) = 10(log10(Hite(g0,g))+log10(Hite(g,g0)))/2.

The creation of symmetric hits becomes more difficult if for two genes more than one pair of hits is reported by the TBLASTN program. In this case, only the pair which is least likely to occur by chance, i.e. the pair with the lowest E-value, is considered for determining orthology or paralogy.

In the next step, the families of paralogs for each of the given genomes are constructed using a single-linkage clustering (see Algorithm 5).

First, a gene g is chosen that was not assigned to any family before, and a new family f is created that initially consists only of the gene g. Then, all genes are added to the family f that are paralogs of g. For each gene newly added to a family f, it is tested if there exist other paralogs that are not present in f so far. If so, these genes are added to f as well. Finally, f is stored in the list Lthat holds the result of the clustering.

This procedure guarantees that if two genes are paralogs according to our definition, they are grouped into the same family. During the clustering, the restriction regarding the coverage of a TBLASTN hit is used to suppress the problem of chaining, which is a typical side-effect of single linkage clustering, especially when genes contain only a small but highly conserved matching region, e.g. a single conserved domain. Clearly, the selection of appropriate threshold values for p0, e0, and c0 is essential for retrieving clusters of an appropriate quality, and is discussed in Section 6.3.

In the second grouping step, for each family of paralogs, corresponding (orthologous) families of paralogs in the other genomes are determined. Therefore we introduce the parameter r0 referred to as overlap ratio. We call two families of paralogs f and f0 a pair

4.2. A RELAXED FAMILY DEFINITION 57