The innovative dual-library sequencing approach enables the effective removal of false positives at an early stage of the pipeline, thus reducing unnecessary computations and the need for sophisticated removal of false positives by in silico measures. The bioinformatics pipeline was specifically tailored to the requirements of the dual-library sequencing strategy and thus profits from the library design. Although the pipeline requires a reference genome, the results showed that even draft genomes can deliver a wide range of information. For annotated genomes with gene predictions, a precise assignment of transcription start sites to genes is possible. However, the pipeline also reported hitherto unknown transcription start sites, either within genes or in previously uncharted genome regions. The pipeline's output provides promoter information at different scales, ranging from nucleotide-level to genome-wide observations for known regulatory elements.

Promoter element: BREu TATA BREd INR DCEI DCEII DCEIII MTE DPE

Reads Assigned annotation Chr. Class BREu TATA BREd INR DCEI DCEII DCEIII MTE DPE
921,977 L28-like protein 9/10* SP - X - - X X - - -
911,238 H4-like protein 1 SP - - X - - X - - -
849,029 H4-like protein 3 SP - - X X - - - - -
719,851 S24-like protein 1 SP X X - - X X X - -
635,045 H1,3-like protein 3 MU - - X - - X - - X
627,213 histone H2B type 3 SP - - - X X - - - X
490,529 S26-like protein 2 SP - - X - X - - - X
466,529 S18-like protein 1 SP - X X X - - - - -
376,649 S17-like protein 3 PB - X X - - - X
357,815 45S RNA X SP - - - X
315,382 L3-like protein 2 SP - X - - - X - - X
286,123 glucose-regulated precursor 6 SP - X X X - X - - X
285,072 L22e containing protein 2 SP X X - - X X - - -
278,889 L13 protein 3 PB - X - - - X - - X
277,277 clusterin-like protein 1 MU - X X - - X - X X
218,538 H4-like protein 8 SP - X X - X X - - -
205,055 S16-like protein 9/10* SP - - - X X X X
201,381 transketolase-like protein 1 SP - - - - X X - - -
188,862 S15a-like protein 9/10* SP - - - - X X - - X
181,819 RIKEN D130020L05 cDNA 1 SP - X X - - - X - X

Table 5.3: List of highly expressed genes in the dataset, sorted by number of mapped reads. Column 1 shows the number of reads mapped to the TSS, column 2 the annotation, and column 3 the chromosomal location. Type and detected regulatory elements (Figure 2.4) are outlined in columns 4 - 9. Four different peak types are distinguished: SP - sharp peak, PB - single dominant peak, MU - multimodal peaks, BR - broad peaks (see Figure 5.13 for corresponding peak shapes). Rows are colour coded: ribosomal-like genes are shown in red, histone-related genes in blue, and all other gene annotations have a green background. *: Due to size restrictions, chromosomes 9 and 10 were not separated prior to sequencing [Brinkrolf et al., 2013].

Discussion

This thesis presented two approaches focussing on different aspects of transcription start site analysis in eukaryotic genomes. SATYR (Seed Assisted Targeted assembly of Yield increasing Regions) is a targeted assembly approach that can be used to reconstruct the incomplete 5’ and 3’ regions of cDNAs, whereas the bioinformatics pipeline used to process dual-library RNA sequencing runs also employs external data sources such as BLAST to annotate verified transcription start sites. This last chapter reviews both approaches, summarises the results, and highlights their advantages as well as possible disadvantages.

6.1 TSS identification in the Chinese hamster by RNA sequencing

By employing a two-library RNA sequencing approach and a specifically tailored bioinformatics pipeline, Chapter 5 shed light on the previously unstudied promoter regions of the Chinese hamster genome on a global scale.

For this purpose, the EST- and CAGE-based approaches of previous studies were replaced with a state-of-the-art RNA sequencing technique. This cost-effective method of TSS exploration, combined with a specific dual-library setup, is ideally suited to the task because the sequence information is enriched directly at potential transcription start sites rather than distributed across complete transcripts.

The modular bioinformatics pipeline developed for this study automates sequence data preprocessing, TSS discovery, TSS annotation, and TSS visualization in one workflow. The software detected 6,547 TSSs and assigned 93.66% of these to known genes. Furthermore, it uncovered 2,227 transcription start sites of genes not yet annotated for the Chinese hamster genome. This finding underlines the current draft status of the Chinese hamster genome, especially when compared with the high-quality annotation of the mouse and human genomes.
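The annotation step can be pictured as a nearest-neighbour lookup between detected TSS positions and annotated gene starts. The following Python fragment is only a minimal sketch of that idea and not the pipeline's actual implementation: the `Gene` record, the function name `assign_tss`, the 500 bp window `max_dist`, and the omission of strand information are all assumptions made for illustration.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical annotation record; real annotations also carry a strand."""
    name: str
    chrom: str
    start: int  # annotated 5' end of the gene


def assign_tss(tss_pos, chrom, genes, max_dist=500):
    """Return the name of the closest annotated gene start on `chrom`
    within `max_dist` bases of `tss_pos`, or None if there is none.

    max_dist=500 is an illustrative value, not one taken from the thesis.
    """
    candidates = sorted((g.start, g.name) for g in genes if g.chrom == chrom)
    starts = [start for start, _ in candidates]
    i = bisect_left(starts, tss_pos)
    best = None
    # Only the neighbours directly left and right of the TSS can be closest.
    for j in (i - 1, i):
        if 0 <= j < len(candidates):
            dist = abs(starts[j] - tss_pos)
            if dist <= max_dist and (best is None or dist < best[0]):
                best = (dist, candidates[j][1])
    return best[1] if best else None
```

A TSS for which the lookup returns `None` would fall into the group of candidates for genes not yet annotated, which is where additional evidence such as BLAST hits becomes relevant.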

An advantage of the approach presented in this work is that a single experiment provides insights into several aspects. Although TSS identification is the primary goal, the CHO community can now be supplied with promoter structures for several thousand genes, including promoter types, regulatory elements, expression levels, and exact TSS locations.

Knowledge of both expression strength and regulatory elements, such as the TATA box or DPE, is also valuable when searching for potential high-yield promoter constructs. Further experiments have to be conducted for those constructs, eventually resulting in a list of endogenous CHO promoters able to replace classical SV40 and similar constructs and thereby avoid their unintended side effects. In addition to a gene-centric promoter view, this work also took genome-wide promoter architecture into account. It was possible to verify motif patterns for seven of the nine tested regulatory elements, including important motifs such as the TATA box and INR. Difficulties occurred for the three DCE subunits as well as for BREd, possibly caused by imprecise PSSMs or a lack of activity under the conditions used for this approach.
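To make the role of the PSSMs concrete, the sketch below shows how a position-specific scoring matrix can be built from base counts and slid along a promoter sequence to find the best-scoring window. It is an illustration under stated assumptions rather than the scoring scheme used in this work: the function names, the pseudocount, the uniform background of 0.25, and any count matrix passed in are hypothetical, and sequences are assumed to contain only A, C, G, and T.

```python
import math


def log_odds_pssm(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into a log2-odds scoring matrix.

    `counts` is a list with one dict per motif position, e.g. {"T": 18, "A": 1}.
    Pseudocount and uniform background are illustrative assumptions.
    """
    pssm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in "ACGT") + 4 * pseudocount
        pssm.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return pssm


def best_hit(sequence, pssm):
    """Return (score, offset) of the best-scoring window in `sequence`,
    or None if the sequence is shorter than the motif."""
    width = len(pssm)
    best = None
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        score = sum(col[base] for col, base in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, offset)
    return best
```

An imprecise matrix lowers the score of genuine sites towards the background level, which is one way the difficulties with the DCE subunits and BREd could arise.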

The work carried out within this project represents a first step in global promoter studies within the Chinese hamster, which may contribute to a more exact and verified annotation of transcription start sites. The combination of experimental and bioinformatics setup has proven to deliver data with high information density usable in several scenarios and is expected to provide even deeper insights when performed on larger input data sets.

Outlook

The initial RNA sequencing run used to detect possible transcription start sites was based on a pooled DNA sample and therefore represents a virtual state of the cell with various combined parameters. It would be of great interest to employ the developed pipeline in conjunction with RNA sequencing experiments based on several separate conditions and to perform a kind of differential promoter study. The results could be used to generate lists of promoters that are either active across a series of conditions, e.g. a series of pH values, or active only under certain parameters. Such a strategy would allow for the detection of possible inducible promoters, which are of great use in biotechnological production environments.
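Such a differential promoter study essentially reduces to set comparisons between the TSSs detected per condition. The following Python sketch illustrates this under simplifying assumptions: the function name, the condition labels, and the TSS identifiers are hypothetical, and a TSS is treated as simply present or absent rather than compared by expression level.

```python
def classify_promoters(active_by_condition):
    """Split TSS identifiers into those active in every condition and
    those active in exactly one condition.

    `active_by_condition` maps a condition label to the set of TSS
    identifiers detected under that condition (hypothetical input format).
    """
    conditions = list(active_by_condition)
    all_sets = [set(active_by_condition[c]) for c in conditions]
    constitutive = set.intersection(*all_sets) if all_sets else set()
    specific = {}
    for c in conditions:
        # Union of everything seen under any other condition.
        others = set().union(
            *(set(active_by_condition[o]) for o in conditions if o != c)
        )
        specific[c] = set(active_by_condition[c]) - others
    return constitutive, specific
```

The condition-specific sets would be the starting point when searching for inducible promoters, while the constitutive set contains candidates for stable expression.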

In direct contrast to the separated approach outlined above, the pooling strategy could be extended by combining as many conditions as possible into one RNA sequencing experiment in order to increase the number of detected transcription start sites. In this way, the annotation of the Chinese hamster reference genome in terms of transcription start sites could be improved further.

This, however, raises a question concerning the dynamic range of the sequencing experiment. As discussed, RNA sequencing offers a very broad range of detection, from a few reads up to several million mapped reads. While the naive method of increasing the number of detected TSSs would only require an increase in sequencing output, the question remains whether the additional sequencing coverage captures a portion of the genes expressed at very low levels or whether it accumulates within the highly expressed genes. In the latter case, no significant increase in the overall TSS count is expected, and the number of reads generated for this work can be used as an upper bound for further sequencing experiments.
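One way to approach this question with existing data is a saturation analysis: the mapped reads are subsampled at increasing fractions and the number of positions that still pass a detection threshold is counted. The sketch below is a hedged illustration of that idea, not a procedure described in this work. The function name, the threshold `min_reads=5`, and the representation of each read by a single 5' position are assumptions.

```python
import random
from collections import Counter


def saturation_curve(read_positions, fractions, min_reads=5, seed=0):
    """Subsample reads at each fraction and count positions supported by
    at least `min_reads` reads.

    `read_positions` is a list with one 5' end position per mapped read.
    min_reads=5 is an illustrative threshold, not a value from the thesis.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    curve = []
    for fraction in fractions:
        k = int(len(read_positions) * fraction)
        sample = rng.sample(read_positions, k)  # sampling without replacement
        counts = Counter(sample)
        detected = sum(1 for c in counts.values() if c >= min_reads)
        curve.append((fraction, detected))
    return curve
```

If the resulting curve flattens well before the full data set is reached, additional reads mainly accumulate in already detected, highly expressed genes. A curve that is still rising at the full fraction would instead suggest that deeper sequencing can recover further weakly expressed TSSs.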

Following the bioinformatics analyses within this work, the obtained results should be verified using biotechnological methods. These experiments involve cloning several of the “Top 20 promoters” (Table 5.3) into CHO cells and combining these promoters with reporter genes. This validation process has already started and is currently being performed by Anna Wippermann as part of her Ph.D. thesis. First results show that the cloned promoters are indeed active and produce significant amounts of transcripts. However, in comparison to the classical CMV promoter, which is used as the control promoter, the initial set of endogenous CHO promoters only reaches ≈ 10% of the CMV activity.