
2.3 Genome-wide Sequencing Assays

2.3.1 Assays of Chromatin Structure

ChIP-Seq [29, 30] (Protein-DNA Binding, Fig. 2.2): This assay requires an antibody that is specific to a protein of interest, such as a transcription factor or a histone carrying a particular post-translational modification. The first step is to cross-link all DNA-bound proteins to the DNA, thereby fixing those interactions, akin to freezing them in time. This is usually accomplished using formaldehyde treatment. The next step is to randomly fragment the DNA-protein complexes via sonication or enzymatic digestion. The DNA molecules resulting from the fragmentation process are usually called “input” and are often used as a background or “null” assay. The fragments that contain the protein of interest are isolated using antibodies specific to that protein and purified. The cross-links between protein and DNA are then reversed and the remaining DNA fragments are sequenced. When the sequences of those DNA fragments are aligned back to a reference genome, the genomic locations to which the protein of interest was bound can be identified. Prominent recent advances in ChIP-Seq technology include higher-resolution variants (ChIP-exo [31] and X-ChIP [32]), single-cell ChIP-Seq [33], and Co-ChIP [34], a technique for assaying two proteins co-bound at the same locations.

22 CHAPTER 2. TECHNICAL BACKGROUND

Figure 2.2: An overview of ChIP-Seq Protocol. Figure adapted from Tam, W.-L. and Lim, B., Genome-wide transcription factor localization and function in stem cells (September 15, 2008), StemBook, ed. The Stem Cell Research Community, StemBook, doi/10.3824/stembook.1.19.1, http://www.stembook.org.
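The use of “input” as a background can be illustrated with a minimal sketch. It assumes that ChIP and input reads have already been counted in the same genomic bins; the function name `fold_enrichment`, the pseudocount, the depth-scaling scheme, and the toy counts are all illustrative assumptions rather than part of any specific published pipeline.

```python
import numpy as np

def fold_enrichment(chip_counts, input_counts, pseudocount=1.0):
    """Per-bin ChIP signal relative to input, after scaling to equal depth."""
    chip = np.asarray(chip_counts, dtype=float)
    inp = np.asarray(input_counts, dtype=float)
    # Scale the input so that both libraries have the same total read count.
    scale = chip.sum() / inp.sum()
    # The pseudocount (an assumption) avoids division by zero in empty bins.
    return (chip + pseudocount) / (inp * scale + pseudocount)

# Hypothetical read counts in five consecutive bins.
chip_bins = [5, 6, 40, 7, 4]
input_bins = [10, 12, 14, 13, 9]
fe = fold_enrichment(chip_bins, input_bins)
```

In this toy example only the third bin shows more ChIP signal than expected from the input. Real peak callers use more elaborate statistical models, so this should be read as a conceptual illustration only.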

Sources of uncertainty in ChIP-Seq data:

1. Genomic locations bound by large protein complexes are protected from sonication and enzymatic digestion, so the ends of ChIP-Seq “input” fragments might be enriched in locations not bound by such complexes [35]. Input fragments are therefore not guaranteed to be uniformly distributed across the genome but might favor open chromatin, thereby biasing ChIP-Seq against genomic domains that are “closed”. This is perhaps one contributing factor to the observation that locations with tightly packed arrays of nucleosomes yield a broad signal with a low signal-to-noise ratio when assayed with ChIP-Seq, rather than a sharp signal. Furthermore, sonication is affected by the 3D structure of the genome in ways that are not yet understood. Finally, sonication time and other detailed parameters affect the extent to which the DNA is fragmented. Although most researchers seem to isolate fragments within a certain length range after sonication to avoid such variability, differences in sonication strategies can still introduce variability and biases that are not understood.

2. After fragmentation, an antibody is used to select DNA fragments bound to the protein of interest. This introduces an important source of uncertainty, since specificity varies between antibodies, with some having more off-targets than others [36]. Furthermore, polyclonal antibodies (typically used in ChIP-Seq) vary between lots, since each lot is obtained from a different animal [36]. Monoclonal antibodies are obtained from a single purified cell line in vitro and can overcome the limitations of polyclonal antibodies to a great extent [37]. A database of commercially available antibodies and their specificities is provided in [38]. One should still note, however, that an antibody can recognize a target protein in one conformation but not in another. For example, an antibody can fail to recognize a certain histone acetylation if the same histone is also methylated, or fail to recognize a transcription factor when it is co-bound with another factor. All of these aspects complicate the interpretation of the data. Ideally, researchers would reproduce their results using different antibodies to ensure that a result is not an artefact of the antibody used.

3. The antibody-bound DNA molecules need to be purified. This purification step often involves magnetic beads. The primary antibody is incubated with magnetic beads so that the antibody-bound chromatin fragments attach to those beads as well and can then be isolated using what is essentially magnetic chromatography: the bound chromatin is held against the wall of the test tube using a magnet and unbound chromatin is eluted. This step is repeated two or three times. However, it is of course not perfect. DNA molecules that were highly abundant in the input chromatin might remain in the final isolated fraction even if they are not bound by the antibody, leading to “phantom” ChIP-Seq peaks [39], and DNA molecules that were antibody-bound might be erroneously eluted completely if they were present at low abundance in the input chromatin.

4. ChIP-Seq DNA fragments are typically around 150 to 300 basepairs in length. This dictates the resolution of ChIP-Seq. When data from multiple fragments are aggregated, the resolution can be improved to approximately 50 basepairs. But fragment lengths present another problem: since sequencing reads are typically short, only the starts of the isolated fragments are sequenced, giving rise to a pattern where the location of the bound factor is depleted of reads but the adjacent locations are enriched in plus-stranded reads on one side and minus-stranded reads on the other. In the analysis of ChIP-Seq data, an average fragment length is typically estimated and the reads are either extended to that length or shifted by half of that length. Of course, this average fragment length is an approximation, since the actual fragment lengths follow an unknown probability distribution. This problem can be resolved using paired-end sequencing, where the length of each fragment can be measured.
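The read-shifting step described in item 4 can be sketched as follows. The text does not say how the average fragment length is estimated; the sketch assumes a simple strand cross-correlation heuristic (one commonly used approach) and single-end reads represented only by their 5′ coordinates. The function names, the `max_shift` parameter, and the toy coordinates are hypothetical.

```python
import numpy as np

def estimate_fragment_length(plus_starts, minus_starts, max_shift=400):
    """Estimate fragment length as the best-scoring plus/minus strand offset."""
    size = max(max(plus_starts), max(minus_starts)) + 1
    plus = np.bincount(plus_starts, minlength=size)
    minus = np.bincount(minus_starts, minlength=size)
    # Score each candidate distance d by how often a plus-strand 5' end
    # is followed by a minus-strand 5' end exactly d bases downstream.
    scores = [np.dot(plus[:size - d], minus[d:]) for d in range(1, max_shift)]
    return int(np.argmax(scores)) + 1

def shift_reads(starts, strands, fragment_length):
    """Move each read's 5' position toward the presumed fragment centre."""
    starts = np.asarray(starts)
    # Plus-strand reads are shifted downstream, minus-strand reads upstream.
    sign = np.where(np.asarray(strands) == "+", 1, -1)
    return starts + sign * (fragment_length // 2)

# Hypothetical 5' ends of reads flanking a bound factor.
plus_starts = [900, 910]
minus_starts = [1100, 1110]
frag_len = estimate_fragment_length(plus_starts, minus_starts)
centres = shift_reads(plus_starts + minus_starts, ["+", "+", "-", "-"], frag_len)
```

After shifting, reads from both strands pile up at the presumed binding location instead of flanking it. With paired-end data, neither function would be needed, because each fragment's length is observed directly.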

DNase-Seq [40, 41] and ATAC-Seq [42] (Chromatin Structure): These are in fact another variant of protein-DNA binding assays, except that the focus here is on genomic locations where proteins are not bound. These assays usually capture genomic locations that are depleted of nucleosomes and other protein complexes of comparable size. This is often extremely useful because such locations are usually active regulatory elements that regulate gene expression, such as enhancers and promoters. In DNase-Seq, the DNA is digested using the DNaseI enzyme, which preferentially cuts the DNA at locations where it is “accessible” or “open” (that is, not bound by a protein or other molecules). The ends of the resulting fragments are sequenced and mapped back to a reference genome, thereby providing a local measure of genome “accessibility”. In ATAC-Seq, the same approach is followed except that a hyperactive version of the Tn5 transposase is used to fragment the DNA instead of DNaseI. The advantage of ATAC-Seq is that it requires far fewer cells than DNase-Seq and is far less tedious. Other assays for DNA accessibility include FAIRE-Seq [43] and Sono-Seq [35]. It should also be noted that although these assays are intended to measure the same property, at least conceptually, they in fact result in different profiles of “accessibility” due to the vast differences between the protocols.
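How mapped fragment ends translate into a local accessibility measure can be shown with a small sketch. It assumes fragments are given as hypothetical `(start, end)` coordinate pairs on a single sequence and that accessibility is summarized as the number of cut sites per fixed-size window; the function name, window size, and toy coordinates are illustrative choices, not part of either protocol.

```python
import numpy as np

def cut_site_profile(fragments, genome_length, window=100):
    """Count fragment ends (cut sites) per fixed-size window."""
    # Each fragment (start, end) contributes two cut sites, one at each end.
    cuts = np.array([pos for start, end in fragments for pos in (start, end)])
    n_windows = -(-genome_length // window)  # ceiling division
    return np.bincount(cuts // window, minlength=n_windows)

# Hypothetical fragments on a 1000 bp sequence; most cuts fall in 100-199.
fragments = [(120, 180), (130, 260), (150, 190), (700, 910)]
profile = cut_site_profile(fragments, genome_length=1000)
```

Windows with many cut sites would be interpreted as “open”. The window size is a free parameter here, which is one reason why different protocols and analysis choices can yield different accessibility profiles.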

Sources of uncertainty in DNase-/ATAC-Seq data include:

1. DNase-Seq and ATAC-Seq are essentially enzymatic digestions of chromatin and, like sonication in ChIP-Seq, are potentially affected in an unknown manner by the 3D structure of the genome.

2. Also like sonication, the concentration of the enzyme and the incubation time with chromatin affect the amount of digestion and the fragment length distribution obtained. In DNase-Seq, one selects a perceived optimal digestion pattern from a pulsed-field gel [44]. In ATAC-Seq, one manipulates the Tn5 concentration, the number of cells used, and the incubation time to obtain a desired fragment length distribution. However, this selection is often somewhat subjective.

3. DNaseI and Tn5 transposase are not completely random in their targeting of DNA but have specific sequence preferences [45, 46], meaning that they preferentially recognize and cut certain DNA sequences over others. This biases the overall distribution of fragments obtained from those assays toward open regions containing the preferred sequences [46]. It is important to note that such sequence preferences might also depend on cell type and experimental condition [47]. Recently, researchers have attempted various strategies to correct for DNase-Seq and ATAC-Seq sequence bias, especially in the context of transcription factor footprinting (see [48, 49] for examples).

4. It is important to note that protein-DNA contacts are not fixed. Transcription factors are continuously binding to and leaving their binding sites, with different factors having different residence times [50]. It is conceivable that factors with longer residence times, and locations where a factor has a longer residence time, contribute more to the ChIP-Seq signal than others. This is also relevant for histone modification ChIP and for DNase-Seq: a location with stable nucleosomes that do not change their positions can contribute a stronger, clearer ChIP/DNase signal than another location where nucleosomes are unstable or more frequently remodeled.
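The sequence preference described in item 3 can be made concrete with a minimal sketch that compares how often each k-mer occurs at cut sites with how often it occurs in the sequence overall. This is only one simple way to quantify such a bias and is not the specific method of the cited correction strategies; the function name `kmer_bias`, the choice of k, and the toy sequence and cut positions are assumptions for illustration.

```python
from collections import Counter

def kmer_bias(sequence, cut_sites, k=2):
    """Ratio of each k-mer's frequency at cut sites to its overall frequency."""
    # Background: every k-mer occurring anywhere in the sequence.
    background = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Observed: the k-mer starting at each cut position.
    observed = Counter(sequence[c:c + k] for c in cut_sites if c + k <= len(sequence))
    n_bg, n_obs = sum(background.values()), sum(observed.values())
    return {kmer: (observed[kmer] / n_obs) / (background[kmer] / n_bg)
            for kmer in background}

# Hypothetical sequence and cut positions; every cut falls on "AC".
sequence = "ACGTACGTTTTTACGT"
cut_sites = [0, 4, 12]
bias = kmer_bias(sequence, cut_sites)
```

A ratio above 1 indicates a k-mer that is cut more often than its abundance would suggest. One possible use, under these assumptions, is to down-weight each cut by the inverse of its k-mer's ratio before computing an accessibility profile.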