
In this chapter, I have presented a use case of analyzing large amounts of public genomic data on the Amazon cloud. In total, my framework processed 100 TB of genomic data on a 100-node Spark cluster in 21 hours. The use case provides thorough instructions on how to easily set up a cluster, download all datasets, and rapidly analyze the data with my framework. Furthermore, I processed the entire HMP sequencing data in 2 hours and presented a proof-of-concept association study between public ‘big data’ and private datasets.


Spark and Hadoop based bioinformatics tools are not widely used in genomic studies, as they require users to have a certain amount of knowledge of cloud computing. Therefore, I particularly focused on providing a simple and comprehensive cloud application to directly access and analyze public genomic datasets. I described a simple way to set up a Spark cluster on the Amazon cloud with a single command.

I also facilitated downloading and preprocessing large amounts of data by leveraging distributed Amazon S3 storage and optimizing parallel data decompression. In addition, Sparkhit enables users to parallelize their own tools or public bio-containers.
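To illustrate the underlying idea, the following is a minimal Scala sketch of loading compressed sequencing data from S3 into a Spark RDD. The bucket name and file layout are hypothetical, and it assumes the hadoop-aws connector is on the classpath; Sparkhit's actual downloader and decompression optimizations are more elaborate.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object S3LoadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("S3LoadSketch"))

    // Hypothetical S3 location; each gzip-compressed FASTQ file is downloaded
    // and decompressed by a worker, so many files are processed in parallel.
    val lines = sc.textFile("s3a://example-genomics-bucket/hmp/*.fastq.gz")

    println(s"Lines loaded from S3: ${lines.count()}")
    sc.stop()
  }
}
```

Because gzip files are not splittable, the parallelism in this sketch comes from the number of files rather than from splitting individual files, which is one reason optimized parallel decompression matters for large collections of compressed sequencing data.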

My large scale data analysis of 100 TB of data demonstrated how the entire cloud utilization cycle was completed in only 21 hours. Moreover, the fragment recruitment application on all WGS data of the HMP was completed in less than 2 hours on Amazon EC2.

The fragment recruitment application presented a study model in which users can easily query the entire HMP data against a personalized reference dataset on the cloud. In microbial studies, combining metagenomic data with microbial reference genomes has been common practice (Eloe-Fadrosh et al., 2016). In my use case, I recruited all WGS data of the HMP to two different strains of E. coli: a toxic strain and a non-toxic strain. The intention was to find genome sequence segments that are not present in the metagenomic samples. These sequence segments, which are unique to the toxic strain, might have functional impacts that differ from the non-toxic one. I applied a functional analysis to the sequence segments and reproduced two toxic genes that caused the 2011 German E. coli outbreak. The same method can be applied to other isolates or single cell genomes.
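The sketch below, which is not Sparkhit's actual pipeline, illustrates the core of this study model: given the half-open (start, end) coordinates of recruited reads on a reference genome, it reports the reference intervals that no read covers; such uncovered intervals are candidates for strain-specific sequence segments.

```scala
// A minimal sketch, assuming recruited reads are available as half-open
// (start, end) coordinate pairs on a single reference sequence.
object UncoveredRegionsSketch {
  def uncovered(genomeLength: Int, alignments: Seq[(Int, Int)]): Seq[(Int, Int)] = {
    // Mark every reference position covered by at least one recruited read.
    val covered = new Array[Boolean](genomeLength)
    for ((start, end) <- alignments; pos <- start until end if pos >= 0 && pos < genomeLength)
      covered(pos) = true

    // Collapse consecutive uncovered positions into (start, end) intervals.
    val gaps = scala.collection.mutable.ListBuffer.empty[(Int, Int)]
    var gapStart = -1
    for (pos <- 0 until genomeLength) {
      if (!covered(pos) && gapStart < 0) gapStart = pos
      if (covered(pos) && gapStart >= 0) { gaps += ((gapStart, pos)); gapStart = -1 }
    }
    if (gapStart >= 0) gaps += ((gapStart, genomeLength))
    gaps.toList
  }
}
```

For example, uncovered(10, Seq((0, 4), (6, 10))) yields List((4, 6)), i.e. the reference segment between positions 4 and 6 is not present in the recruited reads.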


6 Conclusion and outlook

6.1 Conclusion

In this thesis, I have developed a cloud based bioinformatics framework tackling two computational challenges introduced by large scale NGS data: (i) sequence mapping, a computationally intensive task, and (ii) de novo genome assembly, a memory intensive task. By leveraging the powerful distributed computing engine Apache Spark, I have implemented two native applications, Sparkhit and Reflexiv, to address the two challenges. Both tools perform better than existing tools. I have also integrated a series of analytical modules that enable users to carry out various data mining tasks on the cloud. Using the framework, I am able to rapidly analyze large amounts of genomic data on the Amazon EC2 cloud.

In the first part of my work, I have presented Sparkhit, a Spark based distributed computational framework for large scale genomic analytics. Sparkhit mainly focuses on addressing the computationally intensive challenge of sequence mapping. It incorporates a variety of tools and methods that are programmed in Spark's extended MapReduce model. In chapter 3, I have described (i) the algorithms and pipelines of a fragment recruitment tool and a short-read mapping tool, (ii) the implementation of a general tool wrapper to invoke and parallelize external tools and Docker containers, and (iii) the integration of Spark's machine learning library for downstream data mining. I also presented the architecture of Sparkhit and the utilities that I used for deploying Spark clusters and downloading public datasets.
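As an illustration of the general tool-wrapper concept, the sketch below uses Spark's built-in pipe operator to stream each data partition through an external command. The input path and the external-mapper command are placeholders; Sparkhit's own wrapper implementation is more sophisticated.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ToolWrapperSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ToolWrapperSketch"))

    // Hypothetical input: one sequencing read per line on HDFS.
    val reads = sc.textFile("hdfs:///data/reads.txt")

    // Stream each partition through an external command via its standard
    // input and output; every worker runs its own copy of the tool on its
    // share of the data. The command name and arguments are placeholders.
    val mapped = reads.pipe("external-mapper --reference ref.fa")

    mapped.saveAsTextFile("hdfs:///results/mapped")
    sc.stop()
  }
}
```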

In the results section, I have carried out a series of performance benchmarks on Sparkhit. My tool showed excellent run time performance on data preprocessing compared to Crossbow (18 to 32 times faster) and a significant run time improvement on fragment recruitment compared to MetaSpark (92 to 157 times faster). Although I recruited 10% to 12% fewer reads than MetaSpark, I can adjust to a smaller k-mer size that recruits slightly more reads than MetaSpark, while still running 47 to 124 times faster. In addition, my tool has reasonable accuracy and sensitivity on sequence mapping. Sparkhit-recruiter scaled linearly with the increasing amount of input data, as I used Spark RDDs to balance data distribution and optimized the computational parallelization. When scaling out to more worker nodes, Sparkhit still maintained good scaling performance, with a minor slowdown at more than 60 worker nodes.


In the second part of the thesis, I have presented Reflexiv, a distributed de novo genome assembler that is built on top of the Apache Spark platform. I have invented a new distributed data structure and implemented it in the Reflexiv assembler. The data structure, called the Reflexible Distributed K-mer (RDK), is a higher level abstraction of the Spark RDD. It uses the Spark RDD to distribute large amounts of reflexible k-mers across the Spark cluster and assembles the genome in parallel.
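A minimal sketch of how a reflexible k-mer might be represented on top of an RDD is shown below; the names and the plain-string encoding are illustrative only, as Reflexiv itself uses a compact bit-packed representation.

```scala
import org.apache.spark.rdd.RDD

// A reflexible k-mer stored as a (k-1)-mer key plus the remaining nucleotide.
// Reflecting the k-mer swaps which end of the k-mer serves as the key.
case class ReflexibleKmer(key: String, overhang: Char, forward: Boolean) {
  def reflect: ReflexibleKmer = {
    val full = if (forward) key + overhang else overhang + key
    if (forward) ReflexibleKmer(full.tail, full.head, forward = false)
    else         ReflexibleKmer(full.init, full.last, forward = true)
  }
}

object RDKSketch {
  // Distribute the k-mers of all reads across the cluster as an RDD.
  def toRDK(reads: RDD[String], k: Int): RDD[ReflexibleKmer] =
    reads.flatMap(_.sliding(k))
         .filter(_.length == k)
         .map(kmer => ReflexibleKmer(kmer.init, kmer.last, forward = true))
}
```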

I have introduced a random k-mer reflecting method to retrieve the adjacencies between overlapping k-mers and to assemble the genome in an iterative way. I have also described how to resolve repeats in the genome and pop bubbles during the assembly.
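The following sketch conveys the flavor of a single reflecting step in heavily simplified form; it is not Reflexiv's exact procedure. A random subset of k-mers is keyed by its (k-1)-mer suffix instead of its prefix, so that overlapping k-mers meet under the same key and can be concatenated into longer fragments.

```scala
import org.apache.spark.rdd.RDD
import scala.util.Random

object ReflectingSketch {
  // One heavily simplified reflecting step: a random half of the k-mers is
  // keyed by its (k-1)-mer suffix ("reflected"), the rest by its (k-1)-mer
  // prefix. K-mers that meet under the same key overlap by k-1 bases and are
  // concatenated into a longer fragment.
  def extendOnce(kmers: RDD[String]): RDD[String] = {
    val keyed = kmers.map { km =>
      if (Random.nextBoolean()) (km.tail, ("L", km.head))  // reflected: key = suffix
      else                      (km.init, ("R", km.last))  // unreflected: key = prefix
    }
    keyed.groupByKey().flatMap { case (key, overhangs) =>
      val left  = overhangs.collect { case ("L", base) => base }  // extend to the left
      val right = overhangs.collect { case ("R", base) => base }  // extend to the right
      for (l <- left; r <- right) yield s"$l$key$r"
    }
  }
}
```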

In the results section, I have carried out a series of benchmarks on Reflexiv. My tool achieved similar assembly quality to both Ray and AbySS. Although Velvet assembles longer contigs than the other tools, it produces more mis-assembled contigs.

Reflexiv has excellent run time performance on Ethernet-connected Spark clusters. Compared to existing tools, Reflexiv runs 8 to 17 times faster than the Ray assembler and 7 to 18 times faster than the AbySS assembler on the clusters deployed in the de.NBI cloud.

In the third part of the thesis, I presented a use case to rapidly analyze a 100 TB collection of genomic data. In the genomic field, Spark and Hadoop based bioinformatics tools are not widely used, as users have little knowledge of distributed computing. Therefore, the use case focuses on providing a simple and comprehensive cloud application to directly access and analyze public genomic datasets. I presented an easy way to set up a Spark cluster on the Amazon cloud with just a single command. I also presented parallel downloading and decompression methods to optimize the preprocessing of large amounts of genomic data on the cloud.

In the use case, I downloaded and analyzed 100 TB of data in only 21 hours. Moreover, the fragment recruitment application on all WGS data of the HMP was completed in less than 2 hours on Amazon EC2. The fragment recruitment application presented a study model in which users can easily query the entire HMP data against a personalized reference dataset on the cloud. In the use case, I recruited all WGS data of the HMP to two different strains of E. coli and applied a functional analysis using the result.

As a proof-of-concept application, I found two toxic genes that caused the 2011 German E. coli outbreak.

In summary, my work contributes to the interdisciplinary research of life science and distributed cloud computing by improving existing methods with a new data structure, algorithms, and distributed implementations. The entire study involves theoretical algorithmic development, distributed software engineering, and proof-of-concept biological applications. As a result, I have successfully accelerated the run time performance of two specific bioinformatics applications, sequence mapping and de novo genome assembly. I have also made it easier for genomic research communities to engage in large scale NGS data studies on the cloud.