Engineering of DNA polymerases for higher discrimination between single nucleobase variations


Dissertation submitted for the academic degree of Doctor of Natural Sciences

(Dr. rer. nat.)

presented by Matthias Drum

at the Universität Konstanz

Faculty of Sciences (Mathematisch-Naturwissenschaftliche Sektion), Department of Chemistry

Date of the oral examination: 24.07.2015

1st referee: Prof. Dr. Andreas Marx; 2nd referee: Prof. Dr. Martin Scheffner


Publications

Parts of this work have been published in:

M. Drum, R. Kranaster, C. Ewald, R. Blasczyk, and A. Marx, “Variants of a Thermus aquaticus DNA Polymerase with Increased Selectivity for Applications in Allele- and Methylation-Specific Amplification,” PLoS ONE, vol. 9, no. 5, p. e96640, May 2014.

J. Aschenbrenner, M. Drum, H. Topal, M. Wieland, and A. Marx, “Direct Sensing of 5-Methylcytosine by Polymerase Chain Reaction,” Angewandte Chemie International Edition, Jun. 2014.

Patent applications:

A. Marx, M. Drum, K. Streichert, J. Mayer, R. Kranaster, M. Wieland, Means and Methods for the detection of DNA Methylation, Application number: WO2013EP662220130801

A. Marx, M. Drum, R. Kranaster, Mutated DNA Polymerases with high selectivity and activity, Application number: 92320 (in Luxembourg)

Further publications:

R. Kranaster, M. Drum, N. Engel, M. Weidmann, F. T. Hufert, and A. Marx, “One-step RNA pathogen detection with reverse transcriptase activity of a mutated thermostable Thermus aquaticus DNA polymerase,” Biotechnology Journal, vol. 5, no. 2, pp. 224–231, Feb. 2010.


Acknowledgements

At this point I would like to thank all those who, visibly or invisibly, contributed to the success of this work.

I thank Prof. Dr. Andreas Marx for the interesting doctoral topic, his support over the past years, his always open door and the excellent research conditions.

I thank Prof. Dr. Martin Scheffner for acting as second referee and Prof. Dr. Jörg Hartig for chairing the examination.

For countless entertaining hours, the outstanding working atmosphere, and also for many small and large scientific discussions and much help, I thank all current and former members of the Marx group (AG Marx)! Above all I thank my many lab colleagues from the "biologists'" labs (Tobi, Tatjana, Silvia, Hüsnü, Ramon, Vani, Nadine, Bac, Nina, Daniel and Daniel, Marina, Eugenia, ...) but also the chemists (Janina, Holger, Hacker, Anna, Sascha, Samra, Frank, Anna-Lena, ...) from L9 and the former "exiled Marxists" from M12 (Konrad and Karin)!

I thank Anna-Lena, Ramon and Konrad for proofreading this thesis.

I thank all my friends, whether permanently settled in Konstanz (and the surrounding area, including the Austrian Alpine foothills), on the run via Münster to Canada, or returned from France, for the distraction from science! Do not forget everything you were allowed to learn about DNA polymerases ;-).

I would like to thank my family and Anna-Lena for their continuing support and their patience in stressful times. Without you this would probably not have been possible.


1 Introduction

Apart from the annual awarding of the Nobel Prizes, fundamental science hardly ever makes it into the daily top news. That was different in June 2000, when Francis Collins, the leader of the international publicly funded Human Genome Project, and Craig Venter, the leader of the private for-profit company Celera Genomics, announced the completion of their first draft of the human genome sequence at a televised gala press conference attended by the then US President Bill Clinton and UK Prime Minister Tony Blair.^[1][2][3] The White House press statement articulated the hope, felt by many, that this landmark achievement would “lead to a new era of molecular medicine, an era that will bring new ways to prevent, diagnose, treat and cure disease”.^[3][4] Nearly fifteen years later some of this has come true, but geneticists have also discovered that such basic concepts as “gene” and “gene regulation” are far more complex than they ever imagined.^[4]

On the one hand, the huge progress in sequencing technology generates enormous amounts of genomic data. While one of the first published individual genome sequences, that of James D. Watson, cost around US$1 million in 2008, the new Illumina HiSeq X Ten (http://www.illumina.com/systems/hiseq-x-sequencing-system.html) enables human whole-genome sequencing for less than US$1000 within a day.^[5] By analysing the sequences of individuals, the International HapMap Project charted the points at which human genomes commonly differ.^[6][7] Today we know that two randomly selected individuals of European descent will differ at roughly 3 million points in their genome, or roughly 0.1% of their >3 billion bases of DNA, which opens the field for personalized medicine.^[8]

On the other hand, the total number of expected genes in the human genome dropped dramatically from over 100,000 to roughly 21,000 identified protein-coding genes.^[4] This caused a shift of interest from the identification of genes towards the understanding of genes and gene regulation. The identification of new mechanisms that can alter gene function and are heritable by daughter cells without changes in DNA sequence opened the completely new field of epigenetics.[9][10][11]


1.1 DNA and DNA base modifications

In all living organisms the polymeric macromolecule deoxyribonucleic acid (DNA) is used to store and replicate genetic information. The genetic information is encoded by four different building blocks arranged in a well-defined order: the deoxyribonucleotides, which consist of 2’-deoxyribose, phosphate and one of the four nucleobases adenine (A), guanine (G), cytosine (C) and thymine (T). Linked via phosphodiester bonds between the 2’-deoxyribose sugar moieties and the phosphates, the deoxyribonucleotides form DNA single strands. Two antiparallel strands are coiled together to form a characteristic, right-handed, double-helical structure in which the nucleobases form specific base pairs through hydrogen-bond interactions (Watson-Crick base pairing: adenine pairs with thymine and guanine with cytosine). A set of three neighbouring bases is called a triplet codon.
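As a minimal illustration of the pairing rules and the triplet reading described above, the following Python sketch computes the antiparallel complementary strand and splits a sequence into codons. The function names and the example sequence are hypothetical and serve only as an illustration.

```python
# Watson-Crick pairing rules: A pairs with T, G pairs with C.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the antiparallel complementary strand, written 5'->3'."""
    return "".join(PAIRING[base] for base in reversed(strand))


def codons(strand: str) -> list[str]:
    """Split a strand into consecutive triplets, dropping an incomplete tail."""
    usable = len(strand) - len(strand) % 3
    return [strand[i:i + 3] for i in range(0, usable, 3)]


example = "ATGGCCTTA"  # hypothetical sequence, not taken from the thesis
print(reverse_complement(example))  # TAAGGCCAT
print(codons(example))              # ['ATG', 'GCC', 'TTA']
```

Applying `reverse_complement` twice returns the original strand, which reflects the symmetry of the base-pairing rules.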

1.1.1 5-methylcytosine and 5-hydroxymethylcytosine

Methylation of cytosines at the C5 atom is the most abundant DNA modification in vertebrates and a major epigenetic mark.^[12] Methylated cytosines are found as symmetrical 5-methylcytosine (5mC) in the dinucleotide CpG within promoter regions, of which 75% are methylated throughout the mammalian genome.^[13] There are about 30,000 so-called CpG islands in the human genome.^[14] Besides its role in X-inactivation, genomic imprinting and the development of primordial germ cells, methylation of cytosine is also directly linked to diseases like cancer.[15][16][17][18][19] The pattern of DNA methylation is established and maintained by DNA methyltransferases. For further information on the epigenetic influence of 5mC see also chapter 1.2.

First discovered in the bacteriophages T2, T4 and T6 in 1952,^[20] 5-hydroxymethylcytosine (hmC) was first described in mammalian brain and liver tissue twenty years later.^[21] In 2009 hmC was detected simultaneously in cerebellar Purkinje neurons^[22] and in mouse embryonic stem cells and human embryonic kidney cells.^[23] The ten-eleven translocation 1 (TET1) protein was identified as a 2-oxoglutarate- and Fe(II)-dependent enzyme that catalyses the conversion of 5-methylcytosine to 5-hydroxymethylcytosine in vitro as well as in cultured cells.^[23] In the initial excitement hmC was believed to be an important epigenetic marker itself and was named the sixth base of the genome.^[24] Following the discovery of hmC, 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC) were identified in mouse embryonic stem cells (ESCs) and mouse tissues as products of a stepwise oxidation of 5mC and hmC by TET family dioxygenases.[25][26][27][28]

While 5mC is generally viewed as a “silencing” epigenetic mark^[29], hmC is regarded as an intermediate in an active demethylation pathway.[23][26][27][30][31][32][33][34] Passive demethylation occurs for example during DNA replication. The postulated active demethylation pathway of 5mC is shown in Figure 1.

Figure 1: Postulated active demethylation pathway of 5mC. The pattern of DNA methylation is established and maintained by DNA methyltransferases. For demethylation, TET family proteins can oxidize 5mC to hmC, hmC to 5fC, and then 5fC to 5caC. The oxidation products 5fC and 5caC can be removed by TDG to generate an abasic site, which can be repaired to a cytosine by the base excision repair (BER) pathway. Alternatively, hmC may be deaminated by AID or APOBEC to 5hmU, which can subsequently be removed by TDG or SMUG1 and then enter BER. 5caC may also be removed in a decarboxylation pathway. Solid arrows indicate biochemically validated pathways whereas dotted arrows are pathways yet to be confirmed biochemically. 5hmU has not been detected in the mammalian genome so far. Modified from literature^[34][35].

While 5fC and 5caC are thought to be strictly demethylation intermediates, hmC accumulates to relatively high abundance and may also have unique functions of its own that directly affect gene expression.^[35]

1.2 Epigenetics

While genetics describes the study of genes, heredity and genetic variation in living organisms based on the genetic code of the DNA, epigenetics deals with differential gene expression, which causes broad functional and morphological diversity of cells even though they all have the same genetic material. A BBC television science programme put it in other words: “At the heart of this new field is a simple but contentious idea – that genes have a ‘memory’. That the lives of your grand parents - the air they breathed, the food they ate, even the things they saw - can directly affect you, decades later, despite your never experiencing these things yourself.”^[36]

An initial scientific definition was given by Arthur Riggs and colleagues: epigenetics is “the study of mitotically and/or meiotically heritable changes in gene function that cannot be explained by changes in DNA sequence”.^[37] As epigenetics became more and more popular, the field broadened and the definition of epigenetics blurred, making an exact definition hard. In particular, the use of the term epigenetic to describe processes that are not heritable is controversial.[36][38][39]

One of the reasons why there is no general definition of epigenetics is the fact that epigenetic marks are represented by a variety of molecular mechanisms: posttranslational histone modifications, ATP-dependent chromatin remodelling, small and other noncoding RNAs (siRNA, miRNA), binding of histone variants and non-histone proteins and, last but not least, DNA base modifications like methylation.^[40]

Methylated cytosines are the most common DNA base modification in eukaryotes and are found as symmetrical 5-methylcytosine (5mC) in the dinucleotide CpG within promoter regions, of which 75% are methylated throughout the mammalian genome (see also chapter 1.1.1).^[13]

In recent years it has become evident that promoter methylation changes the interactions between proteins and DNA, which leads to alterations in chromatin structure and either a decrease or an increase in the rate of transcription.^[41] The post-synthetic addition of methyl groups to cytosines alters the appearance of the major groove of DNA, to which the DNA binding proteins bind, resulting in alternative effects on transcription.^[41] The position of the methylation change relative to the transcription start site is critical to the outcome: on the one hand, methylation of a promoter CpG island leads to the binding of methylated CpG binding proteins and transcription repressors and thereby to a block of transcription initiation. On the other hand, methylation of silencer or insulator elements blocks the binding of the cognate binding proteins, potentially abolishing their repressive activities on gene expression.^[41] Furthermore, DNA methylation is closely interconnected with chromatin remodelling and histone modifications. As transcription does not act on naked DNA but on chromatin, gene expression is modulated through chromatin structure by a system of multiple layers of epigenetic modifications.^[42]

Moreover, dynamic changes of methylation patterns are important for mammalian embryogenesis.^[43] In mammals there are at least two developmental periods, in germ cells and in preimplantation embryos, in which methylation patterns are reprogrammed genome wide, generating cells with a broad developmental potential.^[44]

Genomic imprinting and X-inactivation are also epigenetically regulated.^[15] Imprinted genes are expressed in a parent-of-origin-specific manner and are normally located in clusters where the alleles are differently labelled by DNA methylation, histone acetylation or deacetylation and histone methylation.^[45]

In female mammals one of the two X-chromosomes is normally silenced.^[46] Otherwise the difference in X-chromosome dosage would lead to an expression of X-linked genes in females twice as high as in males. The X-inactivation process converts one X-chromosome from active euchromatin into transcriptionally silent and highly condensed heterochromatin through a series of events that include the coating of the X-chromosome by Xist RNA, DNA methylation and histone modification.^[15]

Furthermore, alterations in DNA methylation can be an integral event in the onset of diseases like cancer.^[18][47] Cancer, in general, is caused by the dysfunction of genes that control the cell cycle, apoptosis and migration. During carcinogenesis oncogenes are activated and enhance cell division or prevent cell death. These processes are normally controlled by tumour suppressor genes, whose inactivation leads to cancer. At least three pathways are known for the inactivation of tumour suppressor genes: a mutation can disable the function of the gene, or the gene can be lost and is thus no longer available. Besides these two classical genetic mechanisms, epigenetic changes can also switch off genes through inappropriate cytosine methylation in CpG motifs within control regions of gene expression. Found in virtually every type of human neoplasm, the hypermethylation of these promoter regions is now the best-categorized epigenetic change to occur in tumours.[47][48][49] Surprisingly, such promoter hypermethylation is at least as common as the disruption of classic tumour suppressor genes in human cancer by mutation.^[47]

Besides the silencing of genes by hypermethylation in promoter regions, cytosine methylation in the coding region of genes can also increase mutation rates because of the spontaneous hydrolytic deamination of methylated cytosine, which causes C to T transition mutations at methylated CpG sites.^[47] Methylation also shifts the absorption wavelength of cytosine into the range of incident sunlight, resulting in CC to TT mutations, which commonly occur in skin cancers. Methylated CpGs are also preferred binding sites for benzo(a)pyrene diol epoxide and other carcinogens found in tobacco smoke, which cause DNA adducts and G to T transversion mutations that are often found in the aerodigestive tumours of smokers.^[47]

The discovery that particular hypo- or hypermethylation events are unique to human malignancy suggests 5mC as a promising biomarker for cancer diagnosis.^[50][51] Indeed, many DNA methylation based biomarkers have been evaluated in cancer research (see also chapter 1.6.3).^[52][53]

1.3 Biological role of DNA polymerases

What life is is hard to define. Yet one basic requirement in almost all definitions is the capability of self-replication and the ability to pass genetic information on to the next generations.^[54] In all known species from archaea to mammals this is achieved by the polymerisation of monomeric nucleotides into long polymer chains, catalysed by polymerases.

The bacterial Pol III catalyses polymerisation at up to 1000 bp/s with an error rate of only 1:10⁵ for the catalytic subunit.^[55] Proofreading factors can reduce this error rate significantly.^[55][56] In eukaryotes replicative DNA polymerases are slower by a factor of 20 but still reach speeds of approximately 50 bp/s.^[57] For a long time it was believed that the high accuracy is based on Watson-Crick hydrogen bonding. Meanwhile it is known that efficient and selective replication is also possible without hydrogen bonding, and that minor groove hydrogen bonding, base stacking, solvation, and steric effects were long underestimated.^[58] The incorporation of nucleotides into the new DNA strand is accompanied by a series of conformational changes in the DNA polymerase that are thought to act as checkpoint controls. Mismatched bases cause a steric problem when fitting into the active site, which prevents incorporation.^[58] Kinetic mechanisms also prevent the incorporation of mismatched nucleotides: mismatches are bound and processed more slowly, which either allows the internal proofreading domains (e.g. 3’-5’ exonuclease) to repair the damage or leads to the separation of the DNA polymerase from the substrate, giving other repair enzymes access to the DNA.^[59]
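The figures quoted above lend themselves to a quick back-of-the-envelope calculation. The sketch below uses only the numbers given in the text (1000 bp/s, an error rate of 1 in 10⁵, and a 20-fold slower eukaryotic polymerase). The template length of 1,000,000 bp and all names are arbitrary assumptions made for illustration.

```python
BACTERIAL_RATE_BP_PER_S = 1000  # Pol III speed quoted in the text
ERROR_RATE = 1 / 10**5          # errors per incorporated nucleotide (catalytic subunit)
EUKARYOTIC_SLOWDOWN = 20        # eukaryotic replicative polymerases are ~20x slower


def expected_errors(template_length_bp: int, error_rate: float = ERROR_RATE) -> float:
    """Expected number of misincorporations without proofreading."""
    return template_length_bp * error_rate


def copy_time_s(template_length_bp: int, rate_bp_per_s: float) -> float:
    """Time for a single polymerase to copy the template at a constant rate."""
    return template_length_bp / rate_bp_per_s


template = 1_000_000  # assumed template length, not a value from the text
eukaryotic_rate = BACTERIAL_RATE_BP_PER_S / EUKARYOTIC_SLOWDOWN  # 50 bp/s

print(expected_errors(template))                       # about 10 errors
print(copy_time_s(template, BACTERIAL_RATE_BP_PER_S))  # 1000.0 s
print(copy_time_s(template, eukaryotic_rate))          # 20000.0 s
```

The constant-rate, single-enzyme picture is a deliberate simplification. It ignores proofreading, multiple replication origins and enzyme dissociation.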

Based on their amino acid sequence, DNA-dependent polymerases are divided into six families: A, B, C, D, X and Y. Members of the first three families are involved in DNA replication.^[60] While family C has only been described in eubacteria, all replicative polymerases of archaea and eukaryotes are found in family B. Family D is only known in archaea and little is known about its function.^[60] The last two families, X and Y, include polymerases specialized in DNA damage repair and translesion bypass synthesis.^[61]

Meanwhile crystal structures have been described for all polymerase families except family D.^[62][63] These structures revealed that most polymerases share the same overall structure, often likened to a right hand with finger, thumb, and palm domains.^[64][65]

While the finger domain interacts with the incoming nucleotides and the single-stranded DNA (ssDNA) template, the thumb domain binds the double-stranded DNA (dsDNA) product. The active centre is located in the palm domain together with the magnesium ion binding domain needed for phosphoryl transfer. While the thumb and finger domains are unique within the families, two types of palm domain can be distinguished.^[64][65]

1.3.1 Reaction mechanism of DNA polymerases

In a simplified model the kinetic mechanism of most DNA polymerases can be described in five steps (see Figure 2):[66][67][68][69] First, the DNA polymerase binds the DNA primer/template complex. In the second step an incoming dNTP is weakly bound. This complex is called the open ternary complex. Step three is a large conformational change, leading to tight binding of the substrates and optimal alignment of the catalytic residues. This complex is referred to as the closed ternary complex. Step four is the nucleotidyl transfer and phosphodiester bond formation. Finally, another conformational change occurs and pyrophosphate is released, allowing either the dissociation of the polymerase from the elongated DNA primer/template complex or a further catalytic cycle. The number of nucleotides added by a polymerase between association with and dissociation from a single DNA substrate is defined as processivity. While replicative DNA polymerases tend to be very processive and add several hundred nucleotides upon binding, polymerases involved in DNA repair have low processivity and add only a single or a few nucleotides.^[70][71]

Figure 2: Reaction mechanism of DNA polymerases. In a simplified model of DNA polymerase catalysed nucleotide incorporation the first step is the binding of the DNA polymerase to the DNA primer/template complex (P/T). Secondly, an incoming 2’-deoxynucleoside-5’-triphosphate (dNTP) binds, followed by a conformational change of the enzyme. The chemical bond formation is step four. The last step is another conformational change including pyrophosphate (PPi) release. Afterwards either another cycle of catalysis is started or the enzyme dissociates from the primer/template complex. Modified from literature^[72].
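The five steps and the notion of processivity can be sketched as a small simulation. This is a toy model, not a kinetic description taken from the literature. It assumes that after each completed incorporation the polymerase dissociates with a fixed probability `p_dissociate` (a hypothetical parameter), so that the number of nucleotides added per binding event follows a geometric distribution with mean 1/p.

```python
import random
from enum import Enum


class Step(Enum):
    """The five steps of the simplified reaction cycle."""
    BIND_PRIMER_TEMPLATE = 1  # polymerase binds the primer/template complex
    BIND_DNTP = 2             # open ternary complex (weak dNTP binding)
    CLOSE = 3                 # conformational change to the closed ternary complex
    CHEMISTRY = 4             # nucleotidyl transfer, phosphodiester bond formation
    RELEASE_PPI = 5           # second conformational change, PPi release


def nucleotides_per_binding(p_dissociate: float, rng: random.Random) -> int:
    """Count nucleotides added before the polymerase dissociates (toy model)."""
    added = 0
    while True:
        # Steps 2-5 are lumped into one successful incorporation.
        added += 1
        # After PPi release the enzyme either dissociates or starts a new cycle.
        if rng.random() < p_dissociate:
            return added


def mean_processivity(p_dissociate: float, trials: int = 2000, seed: int = 0) -> float:
    """Average number of nucleotides added per binding event over many trials."""
    rng = random.Random(seed)
    total = sum(nucleotides_per_binding(p_dissociate, rng) for _ in range(trials))
    return total / trials


print(mean_processivity(0.002))  # processive enzyme: mean near 1/0.002 = 500
print(mean_processivity(0.5))    # distributive enzyme: mean near 1/0.5 = 2
```

In this picture a replicative polymerase corresponds to a small dissociation probability and a repair polymerase to a large one. The model deliberately says nothing about selectivity.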

While this model explains the overall mechanism of polymerase-catalysed nucleotide incorporation, it does not account for the selectivity of nucleotide incorporation. For a long time the conformational change in step 3 was believed to be the rate-limiting step of selective nucleotide incorporation. Recent studies, however, point out that a variety of steps along the reaction pathway could act as “kinetic checkpoints”.

Non-covalent transitions and conformational changes early in the pathway serve to test the incoming dNTP for complementarity and facilitate the rejection of incorrectly paired dNTPs.[73][74][75] In the case of a matched incoming dNTP the initial loosely bound state is followed by a fast conformational change leading to tight dNTP binding, followed by chemical bond formation and pyrophosphate release.[76][77][78][79] In the case of a mismatched nucleotide it is postulated that the enzyme does not close fully and that active misalignment of catalytic residues slows down the rate of catalysis and promotes dNTP release.^[80] Fluorescence data suggest that the mismatch recognition state is not an intermediate between the open and closed conformational states that occur upon correct nucleotide binding, but a discrete state of its own.^[80][81] The crystal structure of a high-fidelity DNA polymerase I bound to a DNA primer-template and caught in the act of binding a mismatched (dG:dTTP) nucleoside triphosphate shows that the polymerase adopts a conformation in between the open and closed states.^[82] In this so-called "ajar" conformation, the template base has moved into the insertion site but misaligns an incorrect nucleotide relative to the primer terminus. The displacement of a conserved active site tyrosine in the insertion site by the template base is accommodated by a distinctive kink in the polymerase O-helix, resulting in a partially open ternary complex.^[82] This ajar conformation allows the template to probe incoming nucleotides for complementarity before closure of the enzyme around the substrate. This indicates a three-state reaction pathway in which nucleotides either pass through this intermediate conformation to the closed conformation and catalysis or are misaligned within the intermediate, leading to destabilization of the closed conformation.^[82][83] Using NMR spectroscopy, unique recognition states of KlenTaq DNA polymerase upon encountering matched, mismatched, and abasic template sites could be shown under close-to-physiological conditions and in a virtually label-free manner.^[84] This is a further hint that differences in local dynamics or conformational heterogeneity caused by incorrect base pairing might contribute to the selectivity of DNA polymerases by reducing the efficiency of incorporation and promoting substrate release.

1.3.2 KlenTaq DNA polymerase

In analogy to E. coli DNA polymerase I, the large fragment of the DNA polymerase I from Thermus aquaticus (Taq DNA polymerase) is termed KlenTaq DNA polymerase (Klenow fragment of the Taq DNA polymerase) or KTQ. KlenTaq DNA polymerase is a thermostable, exonuclease-deficient, family A DNA polymerase composed of 540 amino acids. Compared to full-length Taq DNA polymerase, KlenTaq DNA polymerase lacks the 292 N-terminal amino acids that build up the 5’-3’ exonuclease function.[65][85][86][87][88]

KlenTaq DNA polymerase shows the typical right-hand structure with the three basic DNA polymerase domains: fingers, thumb, and palm (see Figure 3).^[88][89] In the open conformation KlenTaq DNA polymerase is bound to a DNA primer/template complex to allow dNTP binding (see Figure 3A). Upon correct nucleotide binding KlenTaq DNA polymerase undergoes significant conformational changes to close the active site. In this process the fingers domain in particular moves inward to allow active site formation and tight nucleotide binding. These conformational changes are reflected especially by the O and the N helix of the fingers domain (see Figure 3B).

Figure 3: KlenTaq DNA polymerase. A) Binary complex in presence of a DNA primer template complex (pdb file 4KTQ). B) Ternary complex in presence of DNA and an incoming nucleotide (pdb file 3KTQ). The KlenTaq DNA polymerase domains finger, palm, and thumb are colour coded in cyan, blue, and green, respectively. The bound ddCTP is shown in magenta. The O helix is depicted in orange, the N helix in red.

KlenTaq DNA polymerase is structurally and mechanistically well described and is frequently used as a model system, which makes it well suited for enzyme engineering.[76][88][89][90][91][92][93][94][95][96][97][98]

1.4 DNA polymerases in biotechnology

DNA polymerases are of fundamental importance not only in nature. They are the workhorses in multiple biotechnological applications like molecular cloning, DNA sequencing or nucleic acid diagnostics.^[99] Daily laboratory routines (e.g. PCR), standard diagnostics (e.g. virus titre detection or single nucleotide variation diagnostics), and also next-generation sequencing methods depend on the unique properties of DNA polymerases.[100][101][102]

1.4.1 Polymerase chain reaction

The most widely used key technology based on DNA polymerases in biotechnology is beyond doubt the polymerase chain reaction (PCR). After its development by Mullis and co-workers in 1987,^[103] the breakthrough of PCR was unstoppable. Only seven years after the publication of PCR, Mullis was honoured for its development with the Nobel Prize in Chemistry in 1993. PCR allows the exponential amplification of a specific DNA sequence from a single or a few copies of template DNA. As DNA polymerases are not capable of de novo synthesis, specific primers (short DNA fragments of typically around 20 bp) are needed, which guarantee the sequence specificity of the target amplicon. During repeated cycles of heating and cooling dsDNA is formed and itself used as template in the next amplification round. Under consumption of the primers and deoxynucleoside triphosphates (dNTPs), the selected DNA sequence flanked by the primers is amplified exponentially by the DNA polymerase. As high temperatures (usually ~95°C) are necessary during thermal cycling to physically separate the two strands of the DNA double helix, thermostable DNA polymerases are employed in almost every PCR application. Nowadays, high-fidelity DNA polymerases (e.g. Phusion®) have significantly shortened conventional PCR methods.
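The exponential character of the amplification can be illustrated with a short calculation. The sketch assumes the idealised relation N = N0 · (1 + E)^n, where E is the per-cycle efficiency (E = 1 means perfect doubling). The function name and the example values are assumptions made for illustration, and real reactions reach a plateau as primers and dNTPs are consumed.

```python
def amplicon_copies(initial_copies: float, cycles: int, efficiency: float = 1.0) -> float:
    """Idealised PCR yield: each cycle multiplies the copy number by (1 + efficiency)."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


print(amplicon_copies(1, 30))       # 1073741824.0 (2**30) for perfect doubling
print(amplicon_copies(1, 30, 0.9))  # lower yield at 90 % efficiency per cycle
```

Even a single template molecule yields on the order of a billion copies after 30 ideal cycles, which is why a single or a few copies of template DNA suffice.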

Additional fluorescent dyes (e.g. SYBR Green I) or fluorogenic probes (e.g. TaqMan)[104][105][106][107] in the PCR reaction mix report the amount of amplified DNA in real time. By addition of a reverse transcriptase or with engineered DNA polymerases, RNA detection is also possible.[108][109][110] Consequently, real-time PCR methods are the methods of choice for the detection and quantification of DNA and RNA targets such as retroviruses, other viral pathogens^[107], or mRNA expression levels^[111]. Today numerous biotechnological and diagnostic applications based on the principle of PCR are described: allele-specific amplification (ASA) for the detection of single nucleotide variations[112][113][114], multiplex PCR for the simultaneous amplification of multiple DNA fragments in one reaction^[115], nested PCR, which increases the specificity of the DNA amplification reaction^[116], isothermal loop-mediated amplification^[117], and many more.

1.5 Directed evolution of DNA polymerases

Directed evolution of proteins is a powerful and widely used method in protein engineering to obtain enzymes with properties not found in nature or to shape existing properties like activity, stability or selectivity.[97][118][119][120][121][122] As DNA polymerases are the key enzymes in biotechnological applications like PCR, PCR-based methods and DNA sequencing (see chapter 1.4), they are interesting target enzymes for directed evolution approaches.^[123] Many DNA polymerase mutants with increased fidelity^[124][125], altered thermostability^[126][127] or increased substrate spectra are known. Polymerases with increased substrate spectra cover a wide range, from the incorporation of ribonucleotides[128][129][130][131] and modified nucleotides[132][133][134][135] to nucleotides of unnatural DNA analogues^[136][137]. DNA polymerases with increased reverse transcriptase^[108][138] or lesion-bypass activity^[139] and with enhanced fidelity in mismatch extension^[140][141] have also been successfully evolved.

The process of directed protein evolution can be described as a series of three basic, iterative steps: First, mutations are introduced into the DNA sequence of the target enzyme. Second, the enzyme mutants are expressed. Finally, a screening or selection step identifies the protein and the gene of the most improved variant. Either this is the final protein, or the gene of the most improved variant is used as the template for further rounds of mutagenesis, expression and screening/selection until the desired level of improvement has been achieved (see also Figure 4).

Figure 4: Directed protein evolution. After an initial diversification step the proteins of the gene library are expressed and the improved protein variants are identified either by screening or selection. Either the desired level of improvement has been achieved or the process is started again with the gene of the best variant.
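The iterative cycle of diversification, expression and screening can be sketched as a simple loop. This is a toy model under loud assumptions: the "protein" is a short string, expression is implicit, and the screen is a hypothetical scoring function (similarity to an arbitrary target sequence). None of the names or parameters come from the thesis.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(parent: str, rate: float, rng: random.Random) -> str:
    """Diversification: redraw each residue at random (possibly unchanged) at the given rate."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else residue
        for residue in parent
    )


def evolve(parent, fitness, rounds=20, library_size=200, rate=0.05, seed=0):
    """Iterate mutagenesis and screening; carry the best variant into the next round."""
    rng = random.Random(seed)
    best = parent
    for _ in range(rounds):
        library = [mutate(best, rate, rng) for _ in range(library_size)]
        # Screening: score every variant; the parent is kept if nothing improves.
        best = max(library + [best], key=fitness)
    return best


# Hypothetical screen: number of positions matching an arbitrary target sequence.
TARGET = "MKTAYIAKQR"


def score(seq: str) -> int:
    return sum(a == b for a, b in zip(seq, TARGET))


start = "A" * len(TARGET)
improved = evolve(start, score)
print(score(start), "->", score(improved))
```

Because the best variant of each round is always retained, the score can never decrease from one round to the next. This mirrors the use of the most improved gene as the template for the following round.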

The mutagenesis step can address either the entire target gene (error-prone PCR and DNA shuffling) or selected amino acid positions (saturation mutagenesis).[142][143][144][145] An overview of these three widely used methods is shown in Figure 5. In saturation mutagenesis a single amino acid at a defined position is replaced by multiple other or all other amino acids (see Figure 5A). In PCR amplification with increased error rates (error-prone PCR) mutations are introduced randomly over the whole target gene (see Figure 5B). Once multiple mutants or a single mutant with multiple mutation sites have been identified, further improvement can be achieved by DNA shuffling: the genes of existing mutants, or of a mutant with multiple mutation sites and the wild-type gene, are fragmented and reassembled in a random fashion, resulting in single mutants as well as random combinations of all mutation sites (see Figure 5C). All three methods were also used in this thesis.

Figure 5: Mutagenesis strategies. A) Saturation mutagenesis: selected amino acid positions are mutated. A single amino acid at a defined position is replaced by multiple other or all other amino acids. B) Error-prone PCR: mutations are introduced randomly over the whole target gene by PCR amplification with increased error rates. C) DNA shuffling: the genes of existing mutants or of a mutant with multiple mutation sites and the wild-type gene are fragmented and reassembled in a random fashion. (Figure modified from ^[146])
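The three strategies can be contrasted in a few lines of Python. The sketch works on plain strings and simplifies heavily: saturation mutagenesis is applied at the protein level, error-prone PCR is reduced to random base substitutions, and DNA shuffling uses fixed fragment boundaries and equally long parents instead of random fragmentation and homology-based reassembly. All function names and parameters are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BASES = "ACGT"


def saturation_mutagenesis(protein: str, position: int) -> list[str]:
    """All 19 single-site variants at one defined (0-based) position."""
    original = protein[position]
    return [
        protein[:position] + aa + protein[position + 1:]
        for aa in AMINO_ACIDS
        if aa != original
    ]


def error_prone_pcr(gene: str, error_rate: float, rng: random.Random) -> str:
    """Substitute bases at random positions spread over the whole gene."""
    return "".join(
        rng.choice([b for b in BASES if b != base]) if rng.random() < error_rate else base
        for base in gene
    )


def dna_shuffling(parents: list[str], fragment_length: int, rng: random.Random) -> str:
    """Reassemble a gene fragment by fragment, picking each fragment's donor at random."""
    length = len(parents[0])  # assumes all parents have the same length
    return "".join(
        rng.choice(parents)[start:start + fragment_length]
        for start in range(0, length, fragment_length)
    )


rng = random.Random(1)
print(len(saturation_mutagenesis("MKTAY", 2)))  # 19 variants at one position
print(error_prone_pcr("ATGGCCTTA", 0.1, rng))
print(dna_shuffling(["AAAAAAAA", "CCCCCCCC"], 2, rng))
```

The contrast is in where diversity is introduced: at one chosen position, randomly across the gene, or by recombining mutations that already exist.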

The resulting diversified gene library is afterwards transferred into a host organism, e.g. E. coli, for protein expression. In this step the linking of the genotype and phenotype of the target enzyme is of fundamental importance. Commonly this is achieved by compartmentalization in multi-well plates^[147], by the generation of discrete compartments formed by a water-in-oil emulsion[148][149][150] or by linking the expressed protein to the surface of the host cells (e.g. phage or yeast display)[148][151][152][153][154][155][156].

Clearly the key step of each directed protein evolution is an adequate high-throughput screen or selection to identify improved protein variants. Selection methods like phage display and compartmentalized self-replication (CSR) are based on the concept that each polymerase mutant has to replicate its own encoding gene, resulting in the enrichment of active variants that can be selected further. The fact that the enzymes to be selected need to be directly involved either in processes required for the survival of the host organism or, in CSR, in the process of DNA or RNA replication limits the number of possible target enzymes.

For enzymes that are neither required for the survival of the host organism nor linked to DNA or RNA replication, screening approaches are required. An example is the directed evolution of green fluorescent protein.[157][158][159] For DNA polymerases, too, different screening approaches including nucleotide incorporation assays^[160], primer-extension reactions^[139] or PCR^[141] have been established.

1.6 Personalized medicine

The overall response of humans to different environmental impacts and stressors is quite complex and often not predictable for single individuals. The same holds true for the response of an organism to drugs.^[161] This is where pharmacogenomics comes into play. Pharmacogenomics focuses on the clinical translation of genomic data to predict and evaluate disease risk and progression, as well as the pharmacological response to drugs in individual patients or groups of patients.^[162] One of the most obvious genetic variables, distinguishing one half of the population from the other, is sex. Sex is not only a fundamental aspect of human physiology, but also greatly influences the genetic predisposition for diseases and patient outcomes.^[163] One well-known example is the red-green colour vision defect that occurs in about 8 % of males but only in 0.5 % of females of Northern European ancestry.^[164][165] Other widely known examples are the risk of myocardial infarction, which is higher for men at any given age^[166], or the risk of breast cancer, where about 99 % of the cases occur in females^[166].

Personalized medicine offers improved prognosis, diagnosis and therapy outcomes adapted to each patient’s genetic predisposition.[167][168][169][170][171] Both the prevention and the cure of disease are potentially achievable in personalized medicine by predicting the disease risk among healthy individuals and the therapeutic response among patients^[3]. In disease prevention, the key step is the identification of high-risk individuals who may develop major common diseases, such as cardiovascular disorders, diabetes and cancer, and then the selection of the most appropriate preventive intervention to protect them from these diseases.^[3] This strategy can substantially reduce disease incidence and is particularly important for hard-to-treat disorders, such as cancer.^[3]