

Alignment phase As soon as a second qgram hit has been found for a given start index B, the alignment of f to g[B−e, B+m+e] is enqueued and the rest of the filter phase for this value of B can be skipped. To speed up this alignment procedure in practice, the verification by alignment is computed on graphics hardware. Due to the parallel computation of alignments on graphics hardware, a large number of possible alignment positions is collected before being submitted to the graphics card. The actual number depends on the amount of memory available on the graphics card (VRAM). The alignment of the read sequence f to the possibly matching part g[B−e, B+m+e] of the reference sequence identified in the filter step is computed by a Needleman-Wunsch algorithm. On both sides of this template sequence, e additional bases are extracted from the reference sequence to allow for indels in an end-gap free alignment, if needed. We use unit edit costs for the alignment matrix. As soon as the error threshold is exceeded in a complete column of the distance matrix, the calculation is stopped, the read is discarded, and backtracing is omitted. If the complete distance matrix is computed, an optimal alignment path is calculated in the backtracing step and the read is reported as a hit together with its optimal alignment. SARUMAN does not yet use banded alignments; this is planned for future releases.
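The early-termination rule can be illustrated with a small sketch (illustrative code with assumed names, not SARUMAN's actual kernel): the distance matrix of the end-gap free alignment is filled column by column over the read with unit edit costs, and as soon as every entry of a column exceeds the error threshold, the candidate is discarded without backtracing.

// Sketch (illustrative, not the actual SARUMAN kernel): end-gap free
// edit distance between a read f and a reference window g, computed
// column by column over the read. Leading and trailing characters of g
// may stay unaligned at no cost; the whole read must be aligned.
// Returns the best edit distance, or -1 once every entry of a column
// exceeds maxErrors (the candidate can then be discarded early).
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

int semiglobalDistance(const std::string& f, const std::string& g, int maxErrors) {
    const int n = static_cast<int>(g.size());
    std::vector<int> prev(n + 1, 0), cur(n + 1);        // prev[i] = 0: leading overhang of g is free
    for (std::size_t j = 1; j <= f.size(); ++j) {        // one column per read position
        cur[0] = static_cast<int>(j);                    // read prefix aligned before g starts
        int colMin = cur[0];
        for (int i = 1; i <= n; ++i) {
            int diag = prev[i - 1] + (f[j - 1] == g[i - 1] ? 0 : 1);
            cur[i] = std::min({diag, prev[i] + 1, cur[i - 1] + 1});
            colMin = std::min(colMin, cur[i]);
        }
        if (colMin > maxErrors) return -1;               // whole column above threshold: stop
        std::swap(prev, cur);
    }
    int best = *std::min_element(prev.begin(), prev.end());  // trailing overhang of g is free
    return best <= maxErrors ? best : -1;
}

If the loop completes, the best value in the last column is the edit distance of the read against the window, and backtracing would then recover the optimal alignment path.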

5.3.2. Implementation

SARUMAN implements the approximate filter algorithm described in detail in the previous section. The filter algorithm computes potential alignment positions for all reads in the reference genome, which are then verified by an optimal, CUDA-accelerated pairwise alignment as described in Section 5.3.1.1.

Hence, SARUMAN is divided into two consecutive phases, namely mapping and aligning. Phase one, the creation of the qgram index together with the subsequent mapping of reads via qgrams, is processed completely on the host computer.

In phase two, CUDA is used to compute the edit distance for candidate hits on the graphics card using a modified Needleman-Wunsch algorithm. Implementation details of these two phases are described in the next two sections.

5.3.2.1. Mapping phase

As mentioned above, the approximate filter algorithm is based on a qgram index of the reference sequence. Such a qgram index can be very memory consuming, especially if a complete index of all possible qgrams of a given length is created. Thus, SARUMAN hashes only qgrams that are actually observed within the reference sequence. The memory usage of the qgram index depends on the qgram length and the number of occurrences per qgram, but is mainly proportional to the size of the reference genome. Our tests show that a standard computer with 4 GB of RAM and a recent dual-core CPU is able to read and process most bacterial genomes. A benchmark with 75 bp reads and a qgram size of 9 showed the memory footprint to be below 4 GB for typical bacterial genome sizes of 2-5 Mb (see Table 5.2). The largest bacterial genome known so far, Sorangium cellulosum (13,033,779 bp), needed 5.3 GB of RAM, while a small plant chromosome (the first chromosome of Arabidopsis thaliana, 32 Mb) needed 9 GB of RAM. Thus, a desktop PC with 8 GB of RAM is sufficient to map all microbial genomes in a single run.

Table 5.2.: Influence of the size of the reference sequence on SARUMAN. The reference organisms were Neisseria meningitidis (2.2 Mb), Escherichia coli (5 Mb), Sorangium cellulosum (13 Mb), and the first chromosome of Arabidopsis thaliana (32 Mb). Calculated with a qgram size of 9 and a read length of 75 bp.

Genome size        2.2 Mb   5 Mb    13 Mb    32 Mb
Memory footprint   2.1 GB   3 GB    5.3 GB   9 GB
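As a hedged sketch of this design (assumed data layout, not SARUMAN's actual implementation), such an index can be realized as a hash map from each observed qgram to its list of start positions, so that memory grows with the reference length rather than with the 4^q possible qgrams:

// Sketch (assumed layout, not SARUMAN's actual data structure): a qgram
// index that stores positions only for qgrams actually observed in the
// reference sequence.
#include <string>
#include <unordered_map>
#include <vector>

using QgramIndex = std::unordered_map<std::string, std::vector<int>>;

QgramIndex buildQgramIndex(const std::string& reference, int q) {
    QgramIndex index;
    for (int pos = 0; pos + q <= static_cast<int>(reference.size()); ++pos) {
        std::string qgram = reference.substr(pos, q);
        // qgrams containing N could either be skipped or indexed as-is;
        // SARUMAN handles Ns explicitly (see Section 5.3.2.1).
        index[qgram].push_back(pos);   // only observed qgrams get an entry
    }
    return index;
}

Looking up a qgram then returns all of its start positions in the reference in constant expected time.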

To process larger reference genomes with limited resources, the reference sequence can be divided into several chunks that are processed iteratively. Using this technique, it is possible to run SARUMAN on computers with small amounts of RAM by dividing the qgram index into chunks fitting into the available memory; e.g., the A. thaliana chromosome would be split into two 16 Mb chunks. A drawback is that the compute time increases with the number of chunks, as all reads have to be mapped again for each chunk. Thus, even with this workaround, SARUMAN is not suitable for large eukaryotic genomes, as the number of required iterations would be too high.
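A minimal sketch of such a chunked processing scheme (illustrative only; the chunk size, overlap handling, and names are assumptions): consecutive chunks overlap by one read length so that hits spanning a chunk boundary are not lost, and all reads are mapped against every chunk.

// Sketch (illustrative only): splitting a reference of length L into chunks
// that fit into the available memory. Each chunk is indexed separately and
// all reads are mapped against it again, which is why runtime grows with
// the number of chunks.
#include <cstddef>
#include <utility>
#include <vector>

std::vector<std::pair<std::size_t, std::size_t>>   // (start, end) per chunk
chunkReference(std::size_t referenceLength, std::size_t chunkSize, std::size_t readLength) {
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    for (std::size_t start = 0; start < referenceLength; start += chunkSize) {
        std::size_t end = start + chunkSize + readLength;   // overlap by one read length
        if (end > referenceLength) end = referenceLength;
        chunks.emplace_back(start, end);
    }
    return chunks;
}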


Preceding the actual filtering step, all reads are preprocessed. During this phase, the start positions of all 2·c qgrams of a read in the reference genome are extracted from the qgram hash index. This list of positions is stored in an auxiliary data structure in order to minimize access to the hash index in later stages and thereby speed up the following filter step. Reads with perfect hits on the reference genome are still processed further, as imperfect hits may exist elsewhere on the reference, but for such reads the start positions of the qgrams representing the perfect hit are removed from the auxiliary data structure.

After this initial step, each read is mapped onto the reference genome using the algorithm described in Section 5.3.1.1. The start positions for the pairwise alignment depend on which two qgrams were successfully mapped during the filter phase: a combination of first and last qgram (e.g., s1 and kc) exactly determines the start and end position of the read, and the respective part of the reference sequence is extracted for alignment. If two "inner" qgrams (e.g., s2 and kc−1) are mapped, the start and end positions of the read on the reference sequence are unknown; thus, in order to find the correct start and stop position, e additional bases have to be taken from the reference sequence on both sides to allow for insertions and deletions in the qgrams at the edges of the read, i.e., an end-gap free alignment. Start positions for possible mappings are stored and transferred to the CUDA module in a later stage. To exploit not only the parallel architecture of graphics cards but also the availability of multi-core CPUs, the matching phase uses two threads. The first thread handles mapping on the sense strand, whereas the second thread processes mapping on the antisense strand. In contrast to many other approaches, which employ a 2-bit encoding for the four DNA letters, SARUMAN is able to handle Ns in reads as well as in the reference genome. Since many genomes, especially the rising number of unfinished draft genomes (see Section 2.3.2), contain a small number of bases of unknown identity, it is advantageous to treat these bases correctly as N. Workarounds used by other approaches, such as replacing these positions with random or fixed bases to maintain the 2-bit encoding, may lead to wrong results and, in the case of random replacements, even to irreproducible ones.
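The derivation of the reference window for the subsequent alignment could look roughly as follows (a hedged sketch with illustrative names; the exact bookkeeping in SARUMAN may differ): for an outer qgram pair the read boundaries are known exactly, while for an inner pair e extra bases are added on both sides of the projected window.

// Sketch (hypothetical helper): deriving the reference window for the
// pairwise alignment from a matched qgram pair.
#include <algorithm>
#include <cstddef>

struct Window { std::size_t start; std::size_t end; };   // [start, end) on the reference

Window candidateWindow(std::size_t qgramHitPos,   // hit position of the matched qgram
                       std::size_t qgramOffset,   // offset of that qgram within the read
                       std::size_t readLength,
                       std::size_t maxErrors,     // e
                       bool outerQgramPair,       // s1/kc matched: boundaries known exactly
                       std::size_t referenceLength) {
    // Projected read start on the reference, guarding against underflow.
    std::size_t start = qgramHitPos >= qgramOffset ? qgramHitPos - qgramOffset : 0;
    std::size_t end = start + readLength;
    if (!outerQgramPair) {                         // inner qgrams: boundaries uncertain
        start = start >= maxErrors ? start - maxErrors : 0;
        end += maxErrors;
    }
    return { start, std::min(end, referenceLength) };
}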

5.3.2.2. Alignment phase

The feasibility of sequence alignments on GPU hardware was demonstrated by tools such as SW-CUDA (Manavski and Valle, 2008) or CUDASW++ (Liu et al., 2009b). Compared to our solution, these existing implementations focus on the search for similar sequences in a huge set of other sequences, which corresponds to a BLAST-like use case. In contrast, SARUMAN searches for local alignments of millions of short sequences against one long reference sequence. Employing the filtering algorithm described in the previous section, all possible alignment positions in the reference genome can be identified, and thereby the problem is reduced to global alignments of a read sequence with a short substring of the reference genome. Thus, compared to SW-CUDA or CUDASW++, SARUMAN does not align sequences against a database of templates, but is designed as an alignment application that performs thousands of short pairwise global sequence alignments in parallel.

Figure 5.9.: CUDA hardware layout and memory organization. Each multiprocessor consists of 8 processors, each with its own set of registers. Shared, constant, and texture memory can be used by each of the 8 processors, while device memory is globally accessible between multiprocessors.

To use CUDA efficiently, it is of great advantage to understand the underlying hardware of CUDA-capable graphics adapters. A GPU consists of a variable number of multiprocessors, ranging from one in entry-level graphics adapters up to 192 multiprocessors in high-end video cards (Nvidia GeForce 600 Series, Kepler architecture released in March 2012). Each of these multiprocessors has access to registers of 8 or 16 kB size and is divided into 8 small processors. The available registers are divided equally among these processors. This small amount of writable memory should be used for data processed in the currently active thread, while texture memory and constant memory are read-only and can be used to prefetch data from the much slower device memory. An overview of the CUDA hardware and memory model is given in Fig. 5.9.
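These hardware parameters can be inspected at runtime through the CUDA runtime API; the following sketch (compiled with nvcc) simply prints the values relevant to the discussion above for every installed device.

// Sketch: querying the hardware parameters discussed above via the CUDA
// runtime API. Field names follow the cudaDeviceProp structure.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int d = 0; d < deviceCount; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        std::printf("device %d: %s\n", d, prop.name);
        std::printf("  multiprocessors:          %d\n", prop.multiProcessorCount);
        std::printf("  registers per block:      %d\n", prop.regsPerBlock);
        std::printf("  shared memory per block:  %zu bytes\n", prop.sharedMemPerBlock);
        std::printf("  global (device) memory:   %zu bytes\n", prop.totalGlobalMem);
    }
    return 0;
}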

Implementing code for execution on a GPU is very similar to standard application development. A few additional keywords extending the C programming language are used to define on which device a function should be executed. A function on the GPU is mapped to one thread on one processor of the graphics card, and one function can and should be executed on many different datasets in parallel. This execution scheme is called SIMT (Single Instruction, Multiple Threads) due to its relation to the similar SIMD (Single Instruction, Multiple Data) scheme used in classical parallel programming. Parallel programming with CUDA is transparent and (within NVIDIA products) device-independent for developers and users. The launch of a CUDA application is controlled using only a few parameters defining the total number of threads. These threads are organized hierarchically into grids, which themselves consist of threadblocks. Each threadblock is a collection of a given number of threads. A threadblock must be executable in any order and therefore must not have any dependencies on other blocks of the same grid.

As the pairwise alignments computed by SARUMAN are completely independent of each other, it is not difficult to adhere to this constraint.
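As a minimal illustration of this launch model (a toy kernel, not SARUMAN's alignment kernel), the following sketch launches one grid of independent thread blocks in which every thread processes one data element derived from its block and thread index:

// Sketch: the SIMT launch model described above, shown on a toy kernel.
#include <cuda_runtime.h>

__global__ void addOne(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) data[i] += 1;                         // each thread handles one element
}

int main() {
    const int n = 1 << 20;
    int* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));

    int threadsPerBlock = 256;                                  // threads per block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // blocks in the grid
    addOne<<<blocks, threadsPerBlock>>>(d_data, n);             // grid of independent blocks
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}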


The alignment process on the graphics adapter is organized in batches of a defined size. The filter algorithm collects all necessary data of the sequence pairs to be aligned until a sufficient number of candidate hits has been found. Read and genome sequences are stored together with auxiliary data structures. Subsequently, all required data for the alignment phase is copied to the GPU as one large unit in order to minimize I/O overhead. The maximal number of alignments fitting into the GPU memory depends heavily on the read length; the memory available on the graphics adapter is a second crucial limiting factor. SARUMAN automatically calculates an upper limit for the number of parallel alignments. As the number of alignments that can be computed in parallel strongly influences the overall performance, SARUMAN does not use the quality information of reads, as this would double the memory usage and thereby halve the number of parallel alignments. Due to the huge output of modern sequencing systems, for datasets with sufficient coverage and quality it is feasible to simply remove all reads containing low-quality bases.

For 36 bp reads, a value of 200,000 alignments (100,000 for each mapping thread and direction) can be achieved on a standard GPU with 1 GB of VRAM. Once all data of the candidate hits has been copied to the GPU, the edit distance is computed for each pair of genome and read sequence using the user-defined values for match and mismatch positions (substitutions, insertions, and deletions have uniform mismatch costs). By comparing the distance with the supplied maximal error rate, all candidates with values above this threshold are discarded. Complete alignments are computed in a second backtracing step only for candidates with an edit distance below the defined threshold e. Typically, the alignment phase for one batch takes only a few seconds to complete, including the whole process of copying raw data to and processed alignments from the GPU. Before any output is written, alignments are postprocessed by clipping gaps at the start and end that originate from the end-gap free alignment. For each possible start position of each read, the optimal alignment is reported, in contrast to other available tools, which deliver only a fixed number of n best positions or do not even guarantee to report an optimal alignment.
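The postprocessing of end gaps can be sketched as follows (a hypothetical helper on a padded alignment representation; SARUMAN's internal representation may differ): columns in which the read carries a gap at the very start or end of the alignment, caused by the reference overhang of the end-gap free alignment, are clipped before output.

// Sketch (illustrative helper): clipping leading and trailing gap columns
// from a padded pairwise alignment. Both strings have equal length and
// '-' marks a gap; only end columns where the read is gapped are removed.
#include <string>
#include <utility>

std::pair<std::string, std::string> clipEndGaps(std::string read, std::string ref) {
    std::size_t begin = 0, end = read.size();
    while (begin < end && read[begin] == '-') ++begin;        // leading read gaps
    while (end > begin && read[end - 1] == '-') --end;        // trailing read gaps
    return { read.substr(begin, end - begin), ref.substr(begin, end - begin) };
}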

A great advantage of the CUDA alignment phase is that it can be executed asynchronously to the filter algorithm on the host computer, i.e., while the graphics card is processing a batch of pairwise alignments, the host CPU simultaneously executes the filter algorithm to collect alignment positions for the next batch. By default, SARUMAN produces tab-separated output that includes the start and stop position of the mapping together with the edit distance and the complete alignment.
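A simplified sketch of this overlap (hypothetical names and batch layout, not SARUMAN's actual pipeline) uses a CUDA stream: the asynchronous copy and the kernel launch return immediately, so the host can already run the filter for the next batch before synchronizing with the device.

// Sketch (simplified): overlapping the host-side filter with GPU alignment.
#include <cstddef>
#include <cuda_runtime.h>
#include <utility>
#include <vector>

__global__ void alignBatchKernel(const char* batch, int nBytes) {
    // placeholder for the pairwise-alignment kernel working on one batch
}

// placeholder for the host-side filter: collects one batch of candidate pairs
std::vector<char> collectNextBatch() { return {}; }

void runPipeline() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    char* d_batch = nullptr;
    const std::size_t capacity = std::size_t(64) << 20;           // GPU buffer for one batch
    cudaMalloc(&d_batch, capacity);

    std::vector<char> current = collectNextBatch();
    while (!current.empty()) {
        cudaMemcpyAsync(d_batch, current.data(), current.size(),
                        cudaMemcpyHostToDevice, stream);          // returns immediately
        alignBatchKernel<<<1024, 128, 0, stream>>>(d_batch,
                                                   static_cast<int>(current.size()));
        std::vector<char> next = collectNextBatch();               // CPU filters the next batch
        cudaStreamSynchronize(stream);                             // GPU done with this batch
        current = std::move(next);
    }
    cudaStreamDestroy(stream);
    cudaFree(d_batch);
}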

The package includes an easy to use conversion tool to generate the widely used SAM alignment format. The SARUMAN software is available for download at http://www.cebitec.uni-bielefeld.de/brf/saruman/saruman.html.
