Methods of transcription start site identification

2.2 Promoter analysis

2.2.4 Methods of transcription start site identification

To gather knowledge of promoter regions and their general architecture, it is crucial to obtain precise location information, ideally at single-nucleotide resolution. On the one hand, in silico predictions based on sophisticated models and algorithms can be used to search DNA sequence at genome or local scale for promising positions of transcription start sites (TSSs). On the other hand, biological sequencing experiments and subsequent data analyses may either verify in silico predictions or add TSS positions not recognized by computational methods.

2.2.4.1 Computational methods

First approaches for eukaryotic promoter prediction started around 1995 with PromoterScan [Prestridge, 1995] and PromFind [Hutchinson, 1996] (Table 2.3).

During this phase different computational methods were evaluated, including structural features like hexamer frequency differences between coding regions and promoter areas [Hutchinson, 1996], Markov chains [Audic and Claverie, 1997], TATA box position weight matrices (PWMs) [Prestridge, 1995], or transcription factor binding site densities [Prestridge, 1995]. However, these tools rely largely on extrinsic data sources like transcription factor databases or verified TATA box consensus sequences to build reasonable models. Even though these first bioinformatics approaches paved the way for more sophisticated implementations, none of the first generation predictors achieved sensitivity values above 60 %, while most tools did not exceed 30 % [Fickett and Hatzigeorgiou, 1997].
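As an illustration of the PWM idea mentioned above, the following minimal sketch converts a count matrix into log-odds weights and slides it along a sequence. The count matrix, the uniform background and the pseudocount are hypothetical toy values chosen for this example; they are not taken from PromoterScan or from any motif database.

```python
import math

# Hypothetical toy counts for a 6-nt TATA-like motif (illustration only).
COUNTS = {
    "A": [2, 18, 1, 17, 12, 16],
    "C": [1, 0, 1, 0, 1, 1],
    "G": [1, 1, 0, 1, 1, 2],
    "T": [16, 1, 18, 2, 6, 1],
}
BACKGROUND = 0.25   # assumed uniform base composition
PSEUDOCOUNT = 1.0   # assumed pseudocount to avoid log(0)


def build_pwm(counts, pseudocount=PSEUDOCOUNT, background=BACKGROUND):
    """Convert a count matrix into log2-odds weights per motif position."""
    length = len(counts["A"])
    pwm = []
    for i in range(length):
        total = sum(counts[b][i] for b in "ACGT") + 4 * pseudocount
        pwm.append({
            b: math.log2(((counts[b][i] + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return pwm


def scan(sequence, pwm):
    """Score every window of the sequence; return (start, score) tuples."""
    sequence = sequence.upper()
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(b not in "ACGT" for b in window):
            continue  # skip windows containing ambiguous bases
        hits.append((start, sum(pwm[i][b] for i, b in enumerate(window))))
    return hits


def best_hit(sequence, pwm):
    """Return the (start, score) tuple with the highest score."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])
```

A call such as `best_hit("GCGCGCTATAAAGCGC", build_pwm(COUNTS))` returns the 0-based start of the best-scoring window together with its log-odds score. A real predictor would combine such a score with further evidence, as described above.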

Over the years, molecular biological knowledge of promoter structures, the transcription process and DNA sequence features increased and allowed for the development of novel in silico approaches. Due to advances in sequencing technology, the human genome set the new gold standard for promoter and TSS prediction, effectively rendering most previous software tools infeasible. Two novel approaches subsequently appeared, the first designed for genome scale application, the second also able to work at the single gene level. With the advent of large genomes, such as the human genome, a common approach is the screening and scoring of each nucleotide of the underlying genome, where the scoring is mostly realised through classification algorithms with cross-validation [Abeel et al., 2009]. Typical representatives of this class are ARTS [Sonnenburg et al., 2006], ProSOM [Abeel et al., 2008b], and EP3 [Abeel et al., 2008a] (Table 2.3). Another possibility to detect promoter regions or TSSs employs a much more local scope and does not accumulate scores throughout the whole genome. Consequently, only promising start/stop positions for promoter regions, potentially combined with confidence scores, are reported. This method is used for instance in PromoterExplorer [Xie et al., 2006] and in software proposed by Wu et al. [2007].
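The difference between the two output styles can be sketched in a few lines: a genome scale tool yields one score per nucleotide, whereas a region-based tool reports start/stop positions with a confidence value. The function below is a hypothetical illustration that collapses per-position scores into regions by simple thresholding. It is not the algorithm of any of the tools named above, and both the threshold and the use of the maximum score as confidence are assumptions.

```python
def call_regions(scores, threshold):
    """Collapse per-position scores into (start, stop, confidence) regions.

    start is inclusive, stop is exclusive (0-based); confidence is the
    maximum score inside the region (an assumption of this sketch).
    """
    regions = []
    start = None
    for pos, score in enumerate(scores):
        if score >= threshold and start is None:
            start = pos  # a new region opens here
        elif score < threshold and start is not None:
            regions.append((start, pos, max(scores[start:pos])))
            start = None  # the region is closed
    if start is not None:
        # the last region extends to the end of the scored sequence
        regions.append((start, len(scores), max(scores[start:])))
    return regions
```

For example, `call_regions([0.1, 0.2, 0.8, 0.9, 0.3, 0.7, 0.75], 0.7)` reports two regions, one covering positions 2 to 3 and one covering positions 5 to 6.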

In order to assess the performance and accuracy of these second generation promoter and transcription start site prediction tools, a first gold standard was proposed by Abeel et al. [2009]. The study confirmed a bias of most tools towards CpG containing promoters, commonly associated with housekeeping genes [Carninci et al., 2006], while promoters not exhibiting CpG islands seem to be under-represented. Further bias is caused by the over-representation of promoters of highly transcribed genes compared to promoters of relatively weakly expressed genes.
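Sensitivity, the measure quoted above for the first generation tools, is the fraction of known TSSs that a tool recovers. The following minimal sketch counts a known TSS as recovered if at least one prediction lies within a distance tolerance. The tolerance of 500 nt and the restriction to a single chromosome and strand are assumptions of this example and do not reproduce the exact protocol of Abeel et al. [2009].

```python
import bisect


def sensitivity(known_tss, predicted_tss, tolerance=500):
    """Fraction of known TSSs with a prediction within `tolerance` nt.

    Both inputs are lists of integer positions on the same chromosome
    and strand (a simplifying assumption of this sketch).
    """
    if not known_tss:
        raise ValueError("at least one known TSS is required")
    predicted = sorted(predicted_tss)
    recovered = 0
    for tss in known_tss:
        # index of the first prediction at or right of the window start
        i = bisect.bisect_left(predicted, tss - tolerance)
        if i < len(predicted) and predicted[i] <= tss + tolerance:
            recovered += 1
    return recovered / len(known_tss)
```

With known TSSs at 1000, 5000, 9000 and 20000 and predictions at 1200, 5600, 8600 and 30000, two of the four known sites are recovered within 500 nt, giving a sensitivity of 0.5.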

| Tool | Method description |
| --- | --- |
| PromFind | Based on differences between hexamer frequencies in promoter regions, coding, and non-coding regions [Hutchinson, 1996] |
| TSSG & TSSW | Linear discriminant function combines TATA box scores and triplet preferences around the TSS [Solovyev and Salamov, 1997] |
| PromoterScan | Uses TATA PWMs (position weight matrices) and densities of transcription factor binding sites [Prestridge, 1995] |
| Nameless tool | Promoter recognition algorithm based on Markov transition matrices [Audic and Claverie, 1997] |
| PromoSer | Promoter and transcription start site identification, web based, source genome data dates to 2003 [Halees, 2003] |
| CoreBoost HM | TSS prediction based on histone modification signals, web based, 100 kb maximal input [Wang et al., 2009] |
| NNPP2.2 | Neural network based, utilises difference between TSS and translation start site (TLS) [Burden et al., 2005] |
| MotifLab | Combines several data sources like chromatin accessibility and epigenetic state of the cell [Klepper and Drabløs, 2013, 2010] |
| McPromoter | Based on stochastic segment models (SSMs) and interpolated Markov chains [Ohler et al., 2000; Ohler, 2006] |
| EP3 | Uses large scale DNA structural features to predict promoters [Abeel et al., 2008a] |
| Eponine | Based on a hybrid machine-learning algorithm, developed for mammalian genomes [Down and Hubbard, 2002] |
| GPMiner | Meta tool, identifies TSSs and regulatory features, uses McPromoter, Eponine, and NNPP2.2 [Lee et al., 2012] |
| ProSOM | Facilitates unsupervised clustering by using self-organizing maps to recognise promoter regions [Abeel et al., 2008b] |
| ARTS | Employs Support Vector Machines (SVMs) with advanced sequence kernels [Sonnenburg et al., 2006] |

Table 2.3: Selection of TSS / promoter prediction tools. The first four tools have been chosen as representative examples of the first generation of prediction tools [Fickett and Hatzigeorgiou, 1997]. The second part of the table features web-based implementations, part three is dedicated to more recent works. An extensive review of in silico solutions for promoter and TSS discovery was conducted by Narlikar and Ovcharenko [2009].

Figure 2.5: Overview of transcription start site (TSS) detection methods. Left (green): TSSs can be detected by full length cDNA sequencing. After assembly of sequencing reads, full length cDNA sequences are mapped onto a suitable reference genome. The leftmost (5’) mapping position corresponds to the TSS (provided that the cDNA assembly yielded a full length sequence). Middle (blue): Cap analysis of gene expression (CAGE) is an approach more focused on 5’ mapping, since several 5’ end tags (≈ 21 nt) of different genes are fused and sequenced in one read, thereby increasing the overall throughput of detectable TSSs. After sequencing, the tags have to be mapped onto a suitable reference. Right (purple): Expressed sequence tags (ESTs) are randomly distributed tags much longer than CAGE tags (about 500-800 nt). Therefore not all tags can be used for TSS mapping, and additional sequencing effort is spent on tags that are not 5’ specific, thus lowering the overall TSS yield.

2.2.4.2 Biotechnological methods

As shown, in many cases results obtained by in silico methods can give first insights and a general idea about TSS positions and possible promoter regions. However, all these methods are biased in one way or another and as such will not be able to deliver a complete and correct picture of the TSS landscape of eukaryotic organisms. Although computational methods are typically more cost efficient because no expensive reagents are required, biological experiments can provide new results or findings which cannot be predicted by computational methods, since algorithms will usually only report results that fit the models they were built on.

Full length cDNA. Generally, transcriptome sequencing focuses on the reconstruction of complete complementary DNAs (cDNAs), in order to gain information about the protein structure and therefore possible functions. As such, ideally the cDNA is sequenced completely from the 5’ UTR up to the 3’ UTR and later assembled into its prior form. In order to precisely locate the transcription start site of a given transcript, it is mapped against a reference genome with BLAST [Altschul et al., 1990] or similar tools. Transcription start site identification in this case is a byproduct and not the intended use case for this strategy. Full length cDNA sequencing was introduced in times of Sanger sequencing, hence the throughput of this technology is very limited. Additionally, the vast majority of sequencing information is used for coding parts of the mRNA rather than to identify as many TSSs as possible. All processing steps of this method are summarised in Figure 2.5.

Expressed sequence tags - ESTs. As previously mentioned, full length cDNA methods have two significant drawbacks. First, the amount of required sequencing data is relatively high. Second, a subsequent assembly process is mandatory to obtain a correct transcript which can be mapped back to the reference genome.

Indeed, both disadvantages were addressed in a study by Adams et al. [1991], presenting an effort to get a broad overview of transcripts in a large number of samples by using the limited Sanger technology. In contrast to full length sequencing, which employs several reads per transcript to fully cover its sequence, so called expressed sequence tags (ESTs) consist of one read only, yielding a typical length of 500-800 nt. This single read starts either from the 5’ or the 3’ end of the transcript, leaving large portions of the transcript untouched (see Figure 2.5 for a graphical summary). However, reads longer than 150 nt are already sufficient for similarity searches and genome mapping on a human genome scale [Adams et al., 1991]. Although ESTs were intended as a tool for expression profiling and are widely used even today, due to their transcript end focused sequencing strategy they proved to be an excellent tool for TSS identification purposes.
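For both full length cDNAs and 5’ ESTs, the TSS candidate is the 5’-most mapped position of the sequence. On the forward strand this is the leftmost alignment coordinate, while on the reverse strand it is the rightmost one. The sketch below illustrates this strand-aware lookup; the `Alignment` record with 1-based inclusive coordinates is a hypothetical structure for this example and does not correspond to the output format of BLAST or any other mapper.

```python
from collections import namedtuple

# Hypothetical alignment record: 1-based, inclusive genomic coordinates.
Alignment = namedtuple("Alignment", "name chrom start end strand")


def tss_position(aln):
    """Return the genomic position of the 5' end of a mapped sequence."""
    if aln.strand == "+":
        return aln.start  # 5' end is the leftmost coordinate
    if aln.strand == "-":
        return aln.end    # 5' end is the rightmost coordinate
    raise ValueError(f"unknown strand: {aln.strand!r}")


def tss_candidates(alignments):
    """Map each sequence name to its (chrom, position, strand) candidate."""
    return {
        aln.name: (aln.chrom, tss_position(aln), aln.strand)
        for aln in alignments
    }
```

The result is only a TSS candidate: it is reliable only if the cDNA or EST actually reaches the 5’ end of the transcript.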

Cap analysis of gene expression - CAGE. The approach of non-complete sequencing was enhanced and optimised to further increase the possible yield of transcription start sites. The typical tag size of EST sequencing was reduced by using restriction enzymes and varies between 21 and 23 nt. These tags are concatenated into a single vector, so that several tags can be sequenced serially. Compared to previous approaches, cap analysis of gene expression (CAGE) [Kodzius et al., 2006; Shiraki et al., 2003] was able to reduce costs while at the same time increasing the overall yield of tags. This came at the price of negligible coverage of the original transcript. The approach therefore results in much higher throughput, since only concatenated tags are sequenced rather than full transcripts. Sequenced tags can be mapped onto a suitable reference genome, which in turn reveals transcription start sites due to the 5’ aligned location of the CAGE tags. However, recent studies showed that the CAGE protocol is prone to the addition of a non-specific G at the tag’s 5’ end, leading to flawed mapping positions within the reference genome [Zhao et al., 2011] and finally to bogus tag to gene mappings.
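The effect of a non-specific 5’ G on the inferred TSS position can be illustrated with a simple heuristic: if a tag starts with G but the reference base at its mapped start is not G, the first base is treated as added and the TSS is shifted one nucleotide downstream. This is a naive sketch for forward-strand tags with 0-based start positions; it is not the correction procedure of Zhao et al. [2011], and all names are hypothetical.

```python
from collections import Counter


def corrected_start(tag, start, reference):
    """Return the 0-based TSS position of a forward-strand tag.

    `start` is the position at which the full tag was mapped. A leading G
    that mismatches the reference is assumed to be non-specific.
    """
    if tag[0] == "G" and reference[start] != "G":
        return start + 1  # skip the presumed extra G
    return start


def tss_counts(mapped_tags, reference):
    """Count supporting tags per corrected TSS position."""
    counts = Counter()
    for tag, start in mapped_tags:
        counts[corrected_start(tag, start, reference)] += 1
    return counts
```

A leading G that matches the reference cannot be classified by this heuristic and is kept as is, which is one reason why real pipelines need a more careful treatment.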

Source: Bioinformatic methods for eukaryotic RNA-Seq-based promoter identification (pages 33-38)