
5.2 Implementation

5.2.3 msaMuscle.R, msaClustalW.R, and msaClustalOmega.R

Each of these files provides exactly one function, whose name consists of msa followed by the algorithm name starting with an upper-case letter (e.g. msaMuscle). This naming convention is used in msa.R to call the algorithm specified in the parameter method.
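The dispatch implied by this naming convention can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the code in msa.R: the three stub functions merely report which algorithm would run, and the wrapper name dispatchMsa is hypothetical.

```r
# Hypothetical stand-ins for the real algorithm functions; each one only
# reports which algorithm would have been called.
msaMuscle <- function(inputSeqs, ...)
    paste("Muscle on", length(inputSeqs), "sequences")
msaClustalW <- function(inputSeqs, ...)
    paste("ClustalW on", length(inputSeqs), "sequences")
msaClustalOmega <- function(inputSeqs, ...)
    paste("ClustalOmega on", length(inputSeqs), "sequences")

# Hypothetical wrapper: builds the name "msa" + method, looks up the
# function of that name, and forwards all remaining arguments to it.
dispatchMsa <- function(inputSeqs,
                        method = c("ClustalW", "ClustalOmega", "Muscle"),
                        ...) {
    method <- match.arg(method)   # rejects unknown algorithm names
    algorithm <- get(paste0("msa", method), mode = "function")
    algorithm(inputSeqs, ...)
}

seqs <- c("ACGT", "ACGGT", "AGT")
dispatchMsa(seqs, method = "Muscle")
```

Because the function name is derived from the value of method, adding a further algorithm in such a scheme only requires a new function that follows the same naming pattern and a new entry in the list of allowed methods.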

As stated earlier in this section, the algorithm-specific functions have the same interface as msa, except for the parameter method.

Because the algorithms can also be called directly, all checks are done in these functions, whereas the function msa in msa.R merely acts as a wrapper. Another reason for doing all checks in these functions is that the algorithms expect different values for the same parameters. For example, in MUSCLE, the default value for the gap opening penalty depends on the substitution matrix in use. ClustalΩ, on the other hand, currently does not support custom gap opening values.

After all parameters are checked, including the algorithm-specific ones, R uses the package Rcpp to call the algorithms. The algorithms are originally written in different languages such as C and C++.

Listing 5.3: msa/R/msaMuscle.R

1109     result <- .Call("RMuscle", inputSeqs, cluster, -abs(gapOpening),
1110                     -abs(gapExtension), maxiters, substitutionMatrix, type,
1111                     verbose, params, PACKAGE="msa")
1112
1113     out <- convertAlnRows(result$msa, type)
1114
1115     if (length(inputSeqNames) > 0)
1116     {
1117         if (order == "aligned")
1118         {
1119             perm <- match(names(out@unmasked), names(inputSeqs))
1120             names(out@unmasked) <- inputSeqNames[perm]
1121         }
1122         else
1123         {
1124             perm <- match(names(inputSeqs), names(out@unmasked))
1125             out@unmasked <- out@unmasked[perm]
1126             names(out@unmasked) <- inputSeqNames
1127         }
1128     }
1129     else
1130         names(out@unmasked) <- NULL
1131
1132     standardParams <- list(gapOpening=gapOpening,
1133                            gapExtension=gapExtension,
1134                            maxiters=maxiters,
1135                            verbose=verbose)
1136
1137     out@params <- c(standardParams, params)
1138     out@call <- deparse(sys.call())
1139     out
1140 }

As Listing 5.3 shows, calling C or C++ code from R is straightforward. Section 5.2.4 covers this topic in depth. Line 1113 of the listing shows that the alignments are converted to Biostrings-compatible alignments by calling convertAlnRows.

Finally, the original sequence names are restored, the rows of the alignment are rearranged into the order of the input sequences if desired, and some metadata is added to record how the alignment was generated.
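The name and order handling in lines 1115 to 1130 of Listing 5.3 relies on two calls to match. The following base R sketch replays that logic on a plain named character vector instead of the unmasked slot of the alignment object. The helper name restoreNames, the placeholder names Seq1 to Seq3, and the example sequences are assumptions made for illustration only.

```r
# Names supplied by the user and the internal placeholder names under which
# the sequences are handed to the aligner (all values hypothetical).
inputSeqNames <- c("human", "mouse", "yeast")
inputSeqs <- c(Seq1 = "ACGT", Seq2 = "ACGGT", Seq3 = "AGT")

# Pretend the aligner returned the rows in a different order.
aligned <- c(Seq3 = "A-G-T", Seq1 = "ACG-T", Seq2 = "ACGGT")

# Hypothetical helper mirroring the branch structure of the listing.
restoreNames <- function(aligned, inputSeqs, inputSeqNames,
                         order = c("aligned", "input")) {
    order <- match.arg(order)
    if (length(inputSeqNames) == 0) {
        names(aligned) <- NULL
        return(aligned)
    }
    if (order == "aligned") {
        # Keep the row order chosen by the aligner; only map the names back.
        perm <- match(names(aligned), names(inputSeqs))
        names(aligned) <- inputSeqNames[perm]
    } else {
        # Rearrange the rows into the order of the input sequences.
        perm <- match(names(inputSeqs), names(aligned))
        aligned <- aligned[perm]
        names(aligned) <- inputSeqNames
    }
    aligned
}

names(restoreNames(aligned, inputSeqs, inputSeqNames, order = "aligned"))
# "yeast" "human" "mouse"
names(restoreNames(aligned, inputSeqs, inputSeqNames, order = "input"))
# "human" "mouse" "yeast"
```

In the first branch perm answers the question "which input does each aligned row belong to", in the second "where in the alignment is each input", which is why the arguments of match are swapped.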

The following subsections focus on the specific values of the input parameters of the implemented algorithms. These and all other parameters are checked before the original code of the algorithms is called.

Input Parameter Conventions for msaMuscle.R

In msaMuscle.R, the parameters are (Bodenhofer, Bonatesta, & Horejš-Kainrath, 2015):

• cluster: The clustering method which should be used. Possible values are "upgma", "upgmamax", "upgmamin", "upgmb", and "neighborjoining".


• gapOpening: Gap opening penalty; the default is 400 for DNA sequences and 420 for RNA sequences. The default for amino acid sequences depends on the profile score settings: for the setting le=TRUE, the default is 2.9, for sp=TRUE, the default is 1,439, and for sv=TRUE, the default is 300. Note that these defaults may not be suitable if custom substitution matrices are used. In such a case, gap penalties that fit the substitution matrix must be chosen.

• gapExtension: Gap extension penalty; the default is 0.

• maxiters: Maximum number of iterations; the default is 16. In the original MUSCLE implementation, it is also possible to set maxiters to 0, which leads to an (out of memory) error. Therefore, maxiters=0 is not allowed in msaMuscle.

• substitutionMatrix: Substitution matrix for scoring matches and mismatches; can be a real matrix or a file name. If the file interface is used, matrices have to be in NCBI format. The original MUSCLE implementation also accepts matrices in WU BLAST (AB BLAST) format, but, due to copyright restrictions, this format is not supported by msaMuscle.
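The defaults and restrictions listed above can be condensed into two small check functions. The sketch below is not taken from msaMuscle.R. The function names, the argument profileScore, the fallback to "le" when no profile score is given, and the use of the string "default" as a sentinel for maxiters are all assumptions made for illustration.

```r
# Hypothetical helper: default gap opening penalty for MUSCLE, using the
# values from the parameter list (1,439 is read as 1439).
muscleGapOpeningDefault <- function(type = c("dna", "rna", "protein"),
                                    profileScore = c("le", "sp", "sv")) {
    type <- match.arg(type)
    profileScore <- match.arg(profileScore)  # "le" fallback is an assumption
    if (type == "dna") return(400)
    if (type == "rna") return(420)
    # Amino acid sequences: the default depends on the profile score setting.
    switch(profileScore, le = 2.9, sp = 1439, sv = 300)
}

# Hypothetical helper: resolve and validate maxiters. Zero is rejected
# because it triggers an error in the original MUSCLE implementation.
checkMuscleMaxiters <- function(maxiters = "default") {
    if (identical(maxiters, "default")) return(16L)
    valid <- is.numeric(maxiters) && length(maxiters) == 1 &&
        !is.na(maxiters) && maxiters >= 1 && maxiters == round(maxiters)
    if (!valid)
        stop("maxiters must be a positive integer; maxiters=0 is not allowed")
    as.integer(maxiters)
}

muscleGapOpeningDefault("protein", "sp")   # 1439
checkMuscleMaxiters(8)                     # 8
```

Resolving defaults inside the algorithm-specific function keeps the wrapper free of any knowledge about MUSCLE, which matches the division of work described earlier in this section.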

Input Parameter Conventions for msaClustalW.R

In msaClustalW.R, the parameters are defined as follows (Bodenhofer, Bonatesta, & Horejš-Kainrath, 2015):

• cluster: The clustering method which should be used. Possible values are "nj" (default) and "upgma". In the original ClustalW implementation, this parameter is called clustering.

• gapOpening: Gap opening penalty; the default value for nucleotide sequences is 15.0, the default value for amino acid sequences is 10.0.

• gapExtension: Gap extension penalty; the default value for nucleotide sequences is 6.66, the default value for amino acid sequences is 0.2.

• maxiters: Maximum number of iterations; the default value is 16. In the original ClustalW implementation, this parameter is called numiters.

• substitutionMatrix: Substitution matrix for scoring matches and mismatches; can be a real matrix, a file name, or the name of a built-in substitution matrix. In the latter case, the choices "blosum", "pam", "gonnet", and "id" are supported for amino acid sequences. For aligning nucleotide sequences, the choices "iub" and "clustalw" are possible. The parameter dnamatrix can also be used instead for the sake of backwards compatibility. The valid choices for this parameter are "iub" and "clustalw". In the original ClustalW implementation, this parameter is called matrix.
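How the sequence type drives both the gap penalty defaults and the set of admissible built-in matrix names can be sketched as follows. The function names clustalwDefaults and checkClustalWMatrixName are hypothetical, and the sketch covers only the case in which substitutionMatrix is given as the name of a built-in matrix, not as a matrix object or a file name.

```r
# Hypothetical helper: defaults and built-in matrix names per sequence type,
# using the values from the parameter list above.
clustalwDefaults <- function(type = c("dna", "rna", "protein")) {
    type <- match.arg(type)
    if (type == "protein") {
        list(gapOpening = 10.0, gapExtension = 0.2,
             matrices = c("blosum", "pam", "gonnet", "id"))
    } else {
        # DNA and RNA are both treated as nucleotide sequences.
        list(gapOpening = 15.0, gapExtension = 6.66,
             matrices = c("iub", "clustalw"))
    }
}

# Hypothetical check: accept a built-in matrix name only if it is valid for
# the given sequence type.
checkClustalWMatrixName <- function(substitutionMatrix, type) {
    choices <- clustalwDefaults(type)$matrices
    if (!is.character(substitutionMatrix) ||
        length(substitutionMatrix) != 1 ||
        !(substitutionMatrix %in% choices))
        stop("substitutionMatrix must be one of: ",
             paste(choices, collapse = ", "))
    substitutionMatrix
}

clustalwDefaults("dna")$gapOpening            # 15
checkClustalWMatrixName("gonnet", "protein")  # "gonnet"
```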

Input Parameter Conventions for msaClustalOmega.R

In msaClustalOmega.R, the parameters are as listed below (Bodenhofer, Bonatesta, & Horejš-Kainrath, 2015):

• cluster: The cluster size which should be used. The default is 100. In the original ClustalOmega implementation, this parameter is called cluster-size.

• gapOpening, gapExtension: ClustalOmega currently does not allow gap penalties to be adjusted; these arguments exist only for future extensions and for consistency with the other algorithms and msa. Setting these parameters to values other than "default" results in a warning.

• maxiters: Maximum number of iterations; the default value is 0 (no limitation). In the original ClustalOmega implementation, this parameter is called iterations.

• substitutionMatrix: Name of substitution matrix for scoring matches and mismatches; can be one of the choices "BLOSUM30", "BLOSUM40", "BLOSUM50", "BLOSUM65", "BLOSUM80", and "Gonnet". This parameter is a new feature: the original ClustalOmega implementation does not allow for using a custom substitution matrix.
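The ClustalOmega conventions differ from the other two algorithms mainly in that gap penalties produce a warning instead of being applied. A minimal sketch of such a check is shown below. The function name checkClustalOmegaParams and the message texts are hypothetical, and accepting "default" for substitutionMatrix is an assumption carried over from the gap penalty arguments.

```r
# Hypothetical check for the ClustalOmega-specific conventions.
checkClustalOmegaParams <- function(gapOpening = "default",
                                    gapExtension = "default",
                                    substitutionMatrix = "default") {
    # Gap penalties cannot be adjusted; anything but "default" only warns.
    if (!identical(gapOpening, "default") ||
        !identical(gapExtension, "default"))
        warning("ClustalOmega does not support custom gap penalties; ",
                "gapOpening and gapExtension are ignored")

    # Only the built-in matrix names from the list above are accepted.
    matrices <- c("BLOSUM30", "BLOSUM40", "BLOSUM50",
                  "BLOSUM65", "BLOSUM80", "Gonnet")
    if (!identical(substitutionMatrix, "default") &&
        !(is.character(substitutionMatrix) &&
          length(substitutionMatrix) == 1 &&
          substitutionMatrix %in% matrices))
        stop("substitutionMatrix must be one of: ",
             paste(matrices, collapse = ", "))

    invisible(TRUE)
}

checkClustalOmegaParams(substitutionMatrix = "Gonnet")  # passes silently
```

A warning rather than an error keeps calls that pass the same arguments to all three algorithms working, which is consistent with the stated goal of a uniform interface.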