• Keine Ergebnisse gefunden

Accurate detection of tumor copy number variations in high-throughput sequencing data / eingereicht von Ing. Patrick Praher BSc

N/A
N/A
Protected

Academic year: 2021

Aktie "Accurate detection of tumor copy number variations in high-throughput sequencing data / eingereicht von Ing. Patrick Praher BSc"

Copied!
90
0
0

Wird geladen.... (Jetzt Volltext ansehen)

Volltext

(1)

JOHANNES KEPLER UNIVERSITÄT LINZ Altenberger Straße 69 4040 Linz, Österreich www.jku.at DVR 0093696 Eingereicht von Ing. Patrick Praher BSc Angefertigt am

Institut für Bioinformatik Beurteiler / Beurteilerin Univ. Prof. Dr. Sepp Hochreiter

Mitbetreuung

Mag. Dr. Günter Klambauer Jänner 2015

Accurate detection of

tumor copy number

variations in

high-throughput

sequencing data

Masterarbeit

zur Erlangung des akademischen Grades

Master of Science

im Masterstudium

(2)

Eidestattliche Erklärung

Ich erkläre an Eides statt, dass ich die vorliegende Masterarbeit selbstständig und ohne fremde Hilfe verfasst, andere als die angegebenen Quellen und Hilfsmittel nicht benutzt bzw. die wörtlich oder sinngemäÿ entnommenen Stellen als solche kenntlich gemacht habe.

Die vorliegende Masterarbeit ist mit dem elektronisch übermittelten Textdoku-ment identisch.

Ort, Datum Unterschrift

(3)

Kurzfassung

Obwohl seit den 1970ern bekannt ist, dass sich Tumore evolutionär entwickeln, wird dieser Mechanismus wenig verstanden. Aufgrund der rapide sinkenden Kosten für DNA-Sequenzierung erönen sich neue Möglichkeiten um intratumorale Hetero-genität zu untersuchen und besser zu verstehen. Die Bioinformatik spielt eine Schlüsselrolle um diese groÿen Datenmengen nutzbar zu machen, indem sie die Algorithmen zur Analyse und Visualisierung zur Verfügung stellt. Aufgrund der hohen Anzahl an genetischen Veränderungen ist die Analyse von Tumorgenomen sehr anspruchsvoll und neue Methoden, die über bestehende Techniken zur Anal-yse von Kopienzahlvariationen von Keimbahn DNA hinausgehen, sind notwendig. Dabei stellt die Bestimmung der Reinheit der Tumorprobe und des Ploidiegrads die zentrale Herausforderung dar.

Zu diesem Zweck wurden die beiden Methoden absCN-seq und PyLOH an-hand eines Prostatakrebs-Datensatzes vom International Cancer Genome Consor-tium verglichen. PyLOH lieferte eindeutig bessere Ergebnisse und wurde daher für die weiteren Analysen herangezogen. Auÿerdem wurde PyLOH verbessert um Seg-mente ohne Information über den Verlust der Heterozygotie analysieren zu können. Mithilfe eines Leukämie-Datensatzes des Children's Cancer Research Institute aus Wien konnten die Verbesserungen an PyLOH gezeigt werden. Dabei wurde für 80% der Proben eine Sensitivität von mindestens 90% und für 70% der Proben eine Genauigkeit von ebenfalls mindestens 90% für die Erkennung der exakten Kopienzahl erreicht.

Die 2012 publizierte Methode cn.MOPS ermöglicht es kürzere Kopienzahlvari-ationen zu erkennen und damit feingranularere Ergebnisse zu erzielen. Damit cn.MOPS auf Tumordaten angewendet werden kann, wurden neue Methoden zur Normalisierung und Vorverarbeitung entwickelt. Für unsere Projektpartner vom Krankenhaus der Barmherzigen Schwestern in Linz wurde zusätzlich eine neue, in-tuitiv verständliche Form der Visualisierung von Tumor-Kopienzahlen entwickelt. Basierend auf einem Datensatz mit räumlich getrennten Proben des Primärtumors von mehreren Darmkrebs Patienten, konnten wir das Vorgehen bei der Analyse der evolutionären Tumorentwicklung anhand intratumoraler Heterogenität demonstri-eren. Unser Ansatz zeigt neue Möglichkeiten zur automatisierten Verarbeitung dieser Proben, um zukünftig DNA-Sequenzierung besser für klinische Anwendun-gen im Bereich der Krebsprognose und Krebsbehandlung nutzbar zu machen.

(4)

Abstract

Although it is known since the 1970s that tumors show an evolutionary develop-ment this mechanism is still little understood. Due to the rapidly declining cost for DNA sequencing new possibilities to investigate and understand the underlying mechanisms of intra-tumor heterogeneity arise. In order to utilize this new volume of data, bioinformatics plays a key role by providing the necessary algorithms for the data analysis and visualization. Because of the high number of genetic mu-tations the analysis of tumor genomes is very demanding and improved methods exceeding the available techniques for copy number analysis of germline DNA are necessary.

For this purpose the methods absCN-seq and PyLOH were compared by the means of a prostate cancer data set from the International Cancer Genome Con-sortium. PyLOH delivered clearly better results and therefore was chosen for all further analysis. Furthermore PyLOH was improved to allow the analysis of segments without loss of heterozygosity information. Based on a leukemia data set from the Children's Cancer Research Institute in Vienna the improvements of PyLOH could be shown. In this analysis the recall for 80% of the samples is over 90% and the precision for 70% of the samples is also over 90% for the detection of the exact copy number.

The method cn.MOPS is able to detect shorter copy number variations and therefore produces more ne grained results. To apply cn.MOPS on the tumor data it was necessary to develop new methods for normalization and preprocess-ing. For our project partners from the Krankenhaus der Barmherzigen Schwestern in Linz we additionally developed a new and intuitive way to visualize tumor copy numbers. Based on a data set of spatially separated samples from the primary tumor of multiple colon cancer patients the analysis of evolutionary tumor de-velopment based on intra-tumor heterogeneity was demonstrated. Our approach gives new possibilities for the automated processing of such samples to bring DNA sequencing closer to a clinical application in the eld of cancer prognosis and treatment.

(5)

Acknowledgments

First I would to thank my supervisor Sepp Hochreiter for training me in the elds of bioinformatics and machine learning.

Mainly I would like to thank Günter Klambauer who introduced me to the topic, who guided me through the whole processes of writing this thesis and who had convincing answerers to all my questions.

Many thanks go to my girlfriend Verena Haunschmid who always encouraged me with her motivation and helpfulness to get started. Without her this thesis would not be nished by now. I love you.

Special thanks go to my parents, Brigitte and Georg, who supported me through-out my whole live in every way, which gave me the time, endurance and determi-nation to write this thesis.

Last but not least I would like to thank my grandmother, siblings and friends for taking their time whenever I needed a moment to rest.

(6)

Abbreviations

BAF B allele frequency

BAM binary alignment mapping

CGH comparative genomic hybridisation CN copy number

CNV copy number variation CNA copy number alteration

ITH intratumor heterogeneity LOH loss of heterozygosity

MLE maximum likelihood estimator MSE mean squared error

NGS next generation sequencing SCNA somatic copy number alteration

SNV single nucleotide variant WES whole exome sequencing WGS whole genome sequencing

(7)

Contents

Eidestattliche Erklärung i Kurzfassung ii Abstract iii Acknowledgments iv Abbreviations v 1 Introduction 1 1.1 DNA sequencing . . . 1 1.1.1 Sanger sequencing . . . 1

1.1.2 Polymerase chain reaction . . . 2

1.1.3 Next generation sequencing . . . 3

1.1.4 Whole exome sequencing . . . 4

1.2 Characteristics of DNA data from tumor samples . . . 5

1.3 Evolutionary development of tumors . . . 6

1.4 Tumor heterogeneity . . . 7

2 Methods 9 2.1 PyLOH . . . 10

2.1.1 The model . . . 10

2.1.2 Modeling the read counts . . . 11

2.1.3 Modeling the LOH . . . 12

2.1.4 Combining the models . . . 12

2.1.5 The likelihood model . . . 12

2.1.6 EM algorithm . . . 13

2.1.7 Implementation . . . 14

2.2 absCN-seq . . . 15

2.2.1 Denitions . . . 15

2.2.2 Copy-number-based ratio estimator . . . 15

2.2.3 SNV-frequency-based estimator . . . 15

2.2.4 Least squares objective . . . 16

2.2.5 Parameter estimation . . . 16

2.3 cn.MOPS . . . 16 vi

(8)

2.3.1 Processing pipeline . . . 16

2.3.2 Preprocessing . . . 17

2.3.3 The model . . . 17

2.3.4 Segmentation and post-processing . . . 17

2.4 Comparison of cn.MOPS and PyLOH . . . 18

2.4.1 Poisson mixture model . . . 18

2.4.2 EM update rule . . . 18

2.5 Optimized PyLOH . . . 20

2.5.1 Improved objective . . . 21

2.5.2 Implementation changes . . . 21

2.6 Normalization of cancer data . . . 22

2.7 Calculating expected fold change . . . 22

2.8 Visualization . . . 23

2.8.1 Karyograms . . . 23

2.8.2 Construction of phylogenetic trees . . . 23

3 Results 24 3.1 Early onset prostate cancer . . . 24

3.1.1 Segmentation . . . 24

3.1.2 Comparison . . . 25

3.1.3 Conclusion . . . 29

3.2 Children's Cancer Research Institute . . . 30

3.2.1 Approach . . . 30

3.2.2 Purity estimates . . . 31

3.2.3 Copy number estimates . . . 31

3.2.4 Samples in detail . . . 37

3.2.5 Improved PyLOH . . . 38

3.3 Colon Cancer Linz . . . 47

3.3.1 Approach . . . 47

3.3.2 Purity estimates . . . 47

3.3.3 Copy number estimates . . . 49

3.3.4 cn.MOPS results . . . 65

4 Discussion 70 A Appendix 74 A.1 Methods . . . 74

A.1.1 Phylogenetic trees . . . 74

(9)

1

|

Introduction

High-throughput DNA sequencing is the foundation of all the analysis tasks per-formed in this thesis. Therefore this chapter will focus on DNA sequencing and its historic development. Furthermore the most important characteristics of tumor genomes and the challenges arising therefrom will be described.

1.1 DNA sequencing

State of the art biotechnologies allow us to analyze the human genome in an unprecedented speed and precision. The most important technology in this respect is DNA analysis. Since the 1970s two methods turned out to be revolutionary for this task: (i) in 1977 Sanger et al. presented the paper DNA sequencing with chain-terminating inhibitors where they described a new method for determining DNA sequences [26] and (ii) the polymerase chain reaction (PCR) by Mullis et al. which allows to amplify the available DNA material [17]. Sanger has been awarded with the Nobel Prize in chemistry in 1980 for his fundamental studies of the biochemistry of nucleic acids, with particular regard to recombinant-DNA [20] and Mullis in 1993 for PCR [19].

1.1.1 Sanger sequencing

Sanger sequencing dates back to the 1970s but is still commonly used and repre-sents the basis for state of the art DNA sequencing methods. The rst step of Sanger sequencing is to split up the double stranded DNA in order to get sin-gle strand DNA templates (e.g. denaturation). The available DNA templates are split up in four separate vessels. The following four deoxynucleotide triphos-phates (dNTPs) are all added to each vessel so they can be build the new DNA strand by the polymerase: deoxyadenosine triphosphate (dATP), deoxycytidine triphosphate (dCTP), deoxyguanosine triphosphate (dGTP) and deoxythymidine triphosphate (dTTP). Additionally the four equivalent dideoxynucleotide triphos-phates (ddNTPs) are necessary: ddATP, ddCTP, ddGTP and ddTTP. In contrast to the dNTPs, these are missing the 3'-OH group so the chain can not be extended any further by the polymerase. Each of the ddNTPs is added to one of the vessels only. With the use of DNA primers and DNA polymerase, DNA segments, which are terminated by one of the ddNTPs, are formed in each vessel. In order to sep-arate the DNA fragments with high-resolution gel electrophoresis it is necessary

(10)

Figure 1.1: Schematic representation of Sanger sequencing. The gure shows the result from two dierent forms of gel electrophoresis. (a) shows the result from manual Sanger sequencing. For each of the four bases, one vessel is used that contains one of the ddNTPs. (b) shows the result from automated Sanger sequencing. A dierent color label is used for each ddNTP. This allows to use only one vessel in the experiment.

that the ddNTPs are labeled with radioactive or uorescent markers. The DNA sequence can be read from the resulting gel electrophoresis pattern as shown in Figure 1.1(a) [26, 15].

In order to automate this method it was necessary to change the labeling. In contrast to the original experiment, each of the ddNTPs is marked with a dierent color as shown in Figure 1.1(b). This allows to use only one reaction vessel instead of four and thus allows to reduce the necessary time for the experiment. The improved technique is called dye-terminator sequencing [15]. Automated Sanger sequencing is also called rst generation sequencing [16].

1.1.2 Polymerase chain reaction

In order to amplify the available DNA a repeated cycle of heating and cooling is used:

i DNA is heated to 95◦C to initiate the denaturation. Thereby the

hydro-gen bonds between the two DNA strands release to get single stranded DNA molecules.

(11)

ii the temperature is lowered to about 55◦C so that the DNA primers in the

solution can bind to the single strands. It is important that the minimum temperature in the cycle is not to low, so the primers do not start to bind to unintended sites.

iii the mixture is heated up to about 70◦C so that the polymerase can start to

elongate the new DNA strand with deoxynucleotide triphosphates (dNTPs) beginning at the primers. Usually a Taq-polymerase is used because it can resit the temperature needed for the denaturation step.

Each step in the scheme takes between a few seconds and a few minutes depending on the experiment and the reagents. Commonly the whole process is repeated 20 to 30 times. This procedure can produce a very large number ob genetic material because it exponentially amplies the DNA available at the beginning of the exper-iment. For example 10 single strands at the beginning of the experiment amplied

in 20 PCR cycles theoretically lead to 10 ∗ 220 = 10485760 DNA strands under

ideal conditions [17, 24]. PCR is used for many dierent application including forensic analysis, paternity tests and preparations for sequencing experiments.

1.1.3 Next generation sequencing

Next generation sequencing (NGS) methods are the second generation of DNA sequencing succeeding automated Sanger sequencing. There are multiple dierent

NGS technologies on the market such as Illumina1, 454 by Roche2, Ion Torrent3 or

upcoming technologies like Nanopore.4 In this section, Illumina sequencing will be

described because it is currently the most widely used method for whole genome sequencing [9] and the colon cancer data set described in this thesis was sequenced with Illumina HiSeq 2000.

NGS experiments typically require three steps: (i) library preparation and cluster amplication, (ii) sequencing and (iii) data analysis. In the rst step (i) the primers are ligated to the DNA fragments. Thereby the fragments can be amplied using PCR (Figure 1.2: Library Preparation). The amplied segments are then immobilized by xing them separately to a cluster on a glass slide called ow cell. Via bridge amplication the DNA molecules are amplied in each cluster (Figure 1.2: Cluster Amplication). In the sequencing step (ii) a method called cyclic reversible termination is applied. In each cycle the DNA fragments are extended by one of four color labeled and reversible terminated dNTPs. After washing, the clusters are exposed to lasers in order to take the images. Depending on the color the base assembled in the cycle can be determined for each cluster. Finally the dye and terminating group is separated from the last dNTP and washed away so the cycle can start over again (Figure 1.2: Sequencing). In the nal step (iii) the sequence of bases can be determined for each cluster based on the sequence

1http://www.illumina.com/ 2http://www.454.com/

3https://www.thermofisher.com/at/en/home/brands/ion-torrent.html 4https://www.nanoporetech.com/

(12)

Figure 1.2: Next generation sequencing steps [10].

of images from each cycle. Usually this reads are mapped to a reference genome. When no reference genome is available for an organism the process is called de novo assembly.

1.1.4 Whole exome sequencing

In contrast to WGS, WES covers only the exons which comprise about 2% of the whole genome. The other 98% are called non coding regions and have mostly structural information. Nevertheless this 2% of the genome contains around 85% of the disease causing mutations [22]. The main advantage of WES is cost. For the same cost of sequencing one whole genome it is possible to sequence several whole exoms [4, 6]. Additionally WES is faster, because the runs of the sequencing machines are shorter and time and money is saved because of the lower demands for storage and processing power for data analysis.

(13)

Figure 1.3: This gure shows the dierent types of CNVs.5

On the other hand WGS could give new possibilities in data analysis by provid-ing additional information for example to discover structural variants. Additionally the sequencing and data processing costs are continuously declining. Therefore WGS may be the method of choice for research projects in the future. From a medicolegal standpoint WES may be preferred for clinical applications, because the generation of unnecessary data could be minimized [4]. This thesis analyses WGS and WES data sets as the methods can be applied to both kinds of data with the correct settings.

1.2 Characteristics of DNA data from tumor

sam-ples

In genetics, the term copy number variation (CNV) refers to a DNA sequence that is found at dierent copy numbers in the germline DNA of two individuals. [3]. CNVs can occur as deletions or duplications as Figure 1.3 shows. The term copy number alteration (CNA) is primarily used in oncogenomics studies and also refers to CNVs. CNAs can occur in the germline, as a predisposition to cancer, and in somatic cells where it is called a SCNA (somatic copy number alteration). These are present in tumor cells but not in the normal diploid cell from the same donor [30].

The analysis of a cancer genome is dicult because of the high number of segments with deletion and/or (multiple) duplication events. To detect the exact copy numbers for all segments based on NGS data, it is essential to determine the purity and the ploidy of the tumor sample rst. The purity is the amount of cancerous cells within a heterogeneous tumor sample and the ploidy is the

(14)

Figure 1.4: A karyogram of a colon cancer patient based on a CGH analysis is shown. The chromosomes are identied by the blue numbers below. Red lines beside the chromosomes indicate deletions and green lines indicate duplications.6

baseline copy number of genomic segments or entire chromosomes of the tumor sample [13]. An established technique to analyze large structural changes in tumor genomes without DNA sequencing is comparative genomic hybridisation (CGH) [32]. An example for the results of an CGH experiment is shown in Figure 1.4. The sample is taken from a colon cancer tumor and shows typical mutations like the deletion of the p-arm of chromosome 18.

1.3 Evolutionary development of tumors

A new tissue with abnormal growth is called neoplasm. If this new tissue forms a solid mass it is commonly called tumor [21]. Since the 1970s it is known that the development of cancer can be seen as an evolutionary process and that neoplasms have a heterogeneous structure [14, 7]. This means that neoplasms consist of many dierent branches with dierent DNA and that these branches show an evolution-ary development similar to well-known models from population genetics [14]. This knowledge is possibly the key to understand the complex development process of cancer. Moreover the rapid improvements in the eld of NGS give us new possi-bilities for analyzing this intricate processes.

The time span evolutionary forces are present is normally measured form

gen-6http://www.ncbi.nlm.nih.gov/projects/sky/skyquery.cgi?form_type= display_hdr_form&case_det=-1&karyo_det=-1&cgh_det=2363&CGH_Details= CGH

(15)

erations to thousands of years. In a cancer, mechanisms like natural selection and genetic drift can be observed within years. Natural selection means, that organ-isms with a better biologic tness will establish over other organorgan-isms. This leads to the question, how it is possible that multiple mutations within one tumor can coexist? Merlo et al. [14] suggest the following possibilities: Mutations can be evo-lutionary neutral, which means that they have no selective advantage against each other; some branches specialize on a specic micro environment; the environment changes very fast, so no clone can reach xation; they are physically separated and do not inuence each other or ght for the same resources.

Genetic drift means that the allele frequency changes by chance. In larger population this eect can be neglected because of the large genetic variety. But in small populations, where it is statistically more likely that the random mutation with no selective advantage reaches xation, genetic drift can be an important force. In neoplasms genetic drift could take eect in the beginning of cancer development or after therapy because the population size can be small in both scenarios [14].

Another important factor that should be considered in the analysis of cancer data is the dierence between driver and passenger mutations. A driver mutation brings a selective advantage for the cancer, for example by improved growth factors. A passenger mutation does not have a functional advantage for the cancer. The ability to distinguish between driver mutations, which can also be called cancer genes, and passenger mutations would be very helpful in order to classify dierent branches of tumors [29].

In this context the evolutionary dynamics lead to the heterogeneous nature of a tumor, therefore it is dicult to get distinct samples of the dierent tumor branches [14]. Although histologists try to take samples of dierent tissue branches due to dierent criteria (such as visual dierences), especially with a small number of samples it is dicult to at least take samples of the most dominant tumor branches.

1.4 Tumor heterogeneity

The evolutionary development of neoplasms and the resulting heterogeneous va-riety of tumors can occur between separate tumors (intertumor heterogeneity) or within tumors which is called intratumor heterogeneity (ITH). Dierent outcomes of cancer treatments for a similar cancer type were successfully explained by in-tertumor heterogeneity in some cases. Therefore a personalized cancer treatment based on the knowledge of specic mutations could be developed (e.g. lung can-cer). But the proling of the tumor can additionally be complicated by ITH. In this case dierent tumor branches within the same tumor react dierently to the treatment[8]. Gerlinger et al. [7] found out that more than 60% of all mutation are not present in all samples which means that a single biopsy was not representative of the mutational landscape of the entire tumor bulk. Therefore ITH makes it necessary to analyze multiple samples of a tumor to get a good overview of the

(16)

somatic mutations in the dierent branches.

Note that the larger the biopsy samples are, the higher the probability that dierent sub-clonal tumor populations are mixed and analyzed in the same se-quencing experiment. This additionally adds complexity to the analysis of NGS tumor data. An alternative approach to identify sub-clonal tumor populations is single-nucleus sequencing. The problem with this methods is a very low sequenc-ing depth which negatively inuences the accuracy of the subsequent copy number analysis [18].

(17)

2

|

Methods

The approach presented in this section is based on the analysis of NGS tumor data to determine purity, ploidy and the CNVs. The copy number information of spatially or temporally separated samples of a single tumor allows the further analysis of intratumor heterogeneity.

Currently there are two classes of methods to determine the purity and/or the ploidy from NGS data. Methods from the rst group use the b-allele frequency (BAF). At a heterozygous site the BAF is the number of reads matching the b-allele[1]. The change of the BAF at specic sites between a normal and a tumor sample can be used to determine the purity and ploidy. Methods from the second group compare the number of reads mapped to the segment between the normal and the tumor sample. Because methods that rely on the read count information are statistically more stable and because it is dicult to nd somatic mutations, methods from the second group tend to work better[13].

Identiability problem Another factor which complicates the determination of purity and ploidy is the so called identiability problem. Thereby a certain number of read counts in a tumor sample can be explained by multiple purity and ploidy constellations. As can be see in an articial sample in Figure 2.1 the read counts in the tumor sample for segment 2 (1400) can be explained either by a purity of 30% and a homozygous deletion or a purity of 60% and a heterozygous deletion. The two plots in the bottom left corner of Figure 2.1 show the BAF of heterozygous sites of the normal sample on the y-axis and of the tumor sample on the x-axis. One dot in the middle means that the BAF in the tumor sample has not changed and suggests an even copy number. You could also state that no loss of heterozygosity (LOH) occurred. Two or more separated dots indicate a change in the BAF, which suggests an uneven copy number and therefore cause an LOH. In this sample the BAF information helps to determine the correct solution and therefore to solve the identiability problem[13]. Based on this knowledge the idea is to develop a model which combines the read count and the BAF information determine most likely purity and ploidy combination and therefore solve the identiability problem.

(18)

Figure 2.1: Identiability problem. The heatmaps in the bottom left corner show the BAF of the normal (y-axis) vs. the tumor (x-axis) sample[13].

Overview The rst two sections of this chapter will summarize the PyLOH [13] and the absCN-seq [2] framework which both aim to determine the tumor purity and ploidy based on WGS data. AbsCN-seq is a statistical framework whereas PyLOH is a more complex probabilistic framework. Additionally cn.MOPS [11] was used in order to get more detailed results (e.g. shorter CNVs). Therefore it was necessary to develop a new preprocessing so cn.MOPS can work with the tumor data. Also a comparison between cn.MOPS and PyLOH is drawn and improvements for PyLOH are presented. Finally the methods for visualization are explained.

2.1 PyLOH

PyLOh is a probabilistic framework presented in the paper Deconvolving tumor purity and ploidy by integrating copy number alterations and loss of heterozygos-ity. by Li and Xie.

2.1.1 The model

To solve the identiability problem the the model considers two circumstances: (i) the read counts in each segment and (ii) the loss of heterozygosity (LOH) by investigating the BAF. Because the number of variables in PyLOH is high an overview is given in Box 2.1.

(19)

Box 2.1: Variables in PyLOH

Variable Description

φ purity

Cj copy number of segment j (ploidy)

g genotype

Dj total number of reads mapped to segment j

λj expected read count in segment j

N normal sample

T tumor sample

S set of segments without loss of heterozygosity

¯

Cj average absolute copy number of segment j

θ parameter for genomic length and mappability

DTs total number of reads mapped to segment s

dT

ij mapped read

bT

ij mapped reads matching B allele

¯

µg Weighted average of BAFs

ng copy number for g

µg BAF for g

ρjc probability of having Cj = c

Θ = (φ, ρ) model parameters

Qgc probability for a site having a certain genotype g

conditional on the underlying copy number c

2.1.2 Modeling the read counts

At rst the authors model the mapped reads in each segment. As shown in Equa-tion 2.1 a Poisson distribuEqua-tion is used to model the propability of observing a

certain number of reads mapped in a segment (DT

j ) conditional on the copy

num-ber of the segment (Cj) and the purity (φ). The reads from a tumor samples are

denoted by superscript T e.g. DT

s. The parameter λj of the Poisson distribution

describes the expected read counts in segment j. To compute λj at rst a set S

of segments without LOH is determined. This set is used as a baseline, because segments without LOH must have an even copy number (e.g. 2, 4 ...). Accord-ing to Equation 2.2 the expected read count of a segment is the average ratio of

the average absolute copy number of a segment ( ¯Cj) and all segments in S. θs is

a coecient to account for the genomic length and the mappability of segment, because this factors can vary across the genome[12].

P (DjT|Cj, φ) ∼ Poisson(λj) (2.1) λj = 1 |S| X s∈S ¯ Cjθj ¯ Csθs DsT (2.2)

(20)

2.1.3 Modeling the LOH

The probability for observing a certain BAF bT

ij is dened by a binomial

distri-bution (Equation 2.3) conditional on the genotype (g) and the purity (φ). The

parameters for the distribution are dT

ij (the mapped reads at site i in segment

j) and ¯µg. Equation 2.4 shows that ¯µg is calculated as the average of the BAFs

weighted by the purity. ng is the copy number and µg is the BAF for a certain

genotype in the tumor sample. In the normal sample the expected copy number is 2 and the BAF is 0.5.

P (bTij|Gij = g, φ) ∼ Binomial(dTij, ¯µg) (2.3) ¯ µg = φngµg+ (1 − φ)2 × 0.5 φng + (1 − φ)2 (2.4)

2.1.4 Combining the models

In order to combine the models the authors redene the probability for the BAF

conditional on the copy number (Equation 2.5) by introducing the matrix Qgc. Qgc

is precomputed and contains the probabilities for a site having a certain genotype

g conditional on the copy number c. The combined model (Equation 2.6) gives the

probabilities for the read counts (DT

j) and the BAF (bTj) in segment j conditional

on the copy number and the purity as product of Equation 2.1 and Equation 2.5.

The product over i is necessary because bT

j is a vector of all heterozygous sites i

in segment j. P (bTij|Cj = c, φ) = X g∈G QgcP (bTij|Gij = g, φ) (2.5) P (DTj, bTj|Cj = c, φ) = P (DjT|Cj = c, φ) × Ij Y i=1 X g∈G QgcP (bTij|Gij = g, φ) (2.6)

2.1.5 The likelihood model

To develop the likelihood model, the authors model the probability for having a certain copy number with a categorical distribution (Equation 2.7) by limiting

the possible copy numbers (C = {0, 1, 2, 3, 4}) and dening a vector ρj with the

probabilities for observing a certain copy number. Based on this they determine the model parameters: Θ = (φ, ρ). With the combined model and the model parameters the likelihood model (Equation 2.7) was deduced as the product of the probability for a certain copy number and the combined model (Equation 2.6).

(21)

Dj|ρj ∼ Categorical(C, ρj) (2.7) P (D, b|φ, ρ) = J Y j=1 X c∈C P (Cj = c|ρj)P (DjT, b T j|Cj = c, φ) = J Y j=1 X c∈C  ρjc PPoisson(DTj|λj) Ij Y i=1 Qgc PBinom(bTij|µ, dij)   (2.8)

2.1.6 EM algorithm

To estimate the model parameters the authors decide to use the expectation max-imization framework[5]. The EM algorithm alternatingly updates the model pa-rameters θ = (φ, ρ) and the hidden variables, in this case ξ and ψ. At the beginning the log likelihood Q, as shown in Equation 2.9, is determined based on the likelihood model from Equation 2.8. Q is used to determine the model parameters in the M-step.

Q = J X j=1 X c∈C

ψjc{log ρjc+ [DjT log λj − λj − log DTj!]

+ Ij X i=1 X g∈G

ξijgc[log Qgc+ log

DT ij bT ij  + bTijlog(¯µg+ aTijlog(1 − ¯µg)]} (2.9) E-Step

In general, the bound is tight if the distributions of the hidden variables become the posterior given the current estimate of the parameters. The posterior distributions for the hidden variables given the old parameter estimates are:

ξt+1ijgc= P (Gij = g|Cj = c, D, b, φ(t), ρ(t)) ψjct+1= P (Cj = c|D, b, φ(t), ρ(t)) ξijgc = QgcP (bTij|Gij = g, φ) P g∈GQgcP (bTij|Gij = g, φ) ψjc = P (Cj = c|ρj)P (DTj, bTj|Cj = c, φ) P c∈CP (Cj = c|ρj)P (DjT, bTj|Cj = c, φ)

(22)

M-Step

The model parameters are estimated based on the latest estimates for the hidden variables from the E-step. According to the authors the maximum likelihood estimator (MLE) for ρ is given by:

ρ(t+1)jc,MLE= ψjc(t+1)

In order to incorporate the knowledge that mutations that require a larger edit distance are less likely to occur, the authors introduce a Dirichlet prior for ρ:

ρ(t+1)jc,prior= P γc− 1

c∈C(γc− 1)

To calculate ρ, a weighted average of the MLE and the prior is calculated as shown in Equation 2.10. This rises the questions whether it is a probabilistically clean solution to calculate a weighted average of two estimators for the same

variable with the weight w that does not depend on γc from the prior. This

problem will be discussed later in the chapter.

ρ(t+1)jc = wρ(t+1)jc,MLE+ (1 − w)ρ(t+1)jc,prior (2.10)

Because a solution for dQ

dφ = 0 is intractable it is necessary to perform a

numer-ical search for φ. This is done by trying dierent values for φ and taking the one that maximizes Q.

2.1.7 Implementation

The implementation of PyLOH consists of three stages (i) preprocessing, (ii) model and (iii) postprocessing.

(i) The reads for each segment and the reads for all alleles of all heterozygous sites for the tumor and normal sample are counted. The heterozygous sites are lter with a predened quality criteria so only reliable information is taken into account when the model is applied. Additionally an LOH call is made for each segment. This call can be TRUE, FALSE, UNCERTAIN or NONE depending on the BAFs and the quality of the data. If segments have to few or no heterozygous site, the call will be NONE.

(ii) The information collected in step (i) is used to the apply the model. Therefor the variables in the E-Step and the M-Step of the EM algorithm are calcu-lated iteratively. This is done until the algorithm converges which means that the change of the calculated likelihood is below a predened value or the maximum number of iterations is reached. The copy numbers for the segments and the purity estimate is then chosen by selecting the result with the highest probability.

(23)

(iii) This step can be used to generate heat-maps which illustrate the BAF of each segment.

2.2 absCN-seq

Compared to PyLOH, absCN-seq is a statistical framework presented in the paper AbsCN-seq: a statistical method to estimate tumor purity, ploidy and absolute copy numbers from next-generation sequencing data [2].

2.2.1 Denitions

Conicting variable names between PyLOH and absCN-seq were adapted in Sec-tion 2.2 to simplify the terminology. In absCN-seq the purity is denoted as φ and

the ploidy as τ. Based on the segment copy number (Cj) the ploidy is dened in

Equation 2.11 where wj is the length of a segment in base pairs and ηj is a

param-eter that compensates for capture eciency. Next they dene ρ as the proportion of DNA in the sample from the cancer cells (Equation 2.12).

τ = PN j Cjwjηj PN j wjηj (2.11) ρ = φτ φτ + 2(1 − φ) (2.12)

Similar to PyLOH the authors dene one estimator based on the read counts and another based on the SNV-frequency.

2.2.2 Copy-number-based ratio estimator

Based on the denition of ρ the authors derive an estimator for the tumor to normal

reads ratio per segments (Equation 2.13). The term R−1 is used for normalization.

Here R = Tt

Tg where T∗ are the corresponding total number of reads for the tumor

and normal (germline) sample. The reads for tumor and normal samples in segment

j are denoted by DT j and DNj , respectively. R−1E[tDT,j] E[tDN,j] = R−1E[D T j ] E[DN j ] = φCj + 2(1 − φ) φτ + 2(1 − φ) (2.13)

2.2.3 SNV-frequency-based estimator

As shown in Equation 2.14 the SNV-frequency-based estimator is similar to the

(24)

somatic allele at locus i the equation gives the proportion of the variant allele out of the total aligned reads.

E[fi] ≈

φsi

φCj+ 2(1 − φ)

(2.14)

2.2.4 Least squares objective

Equation 2.15 shows the least squares objective based on the copy-number- and

SNV-frequency-based estimators. rj = R−1 D

T j

DN

j is the normalized ratio between

the read counts from the tumor and the normal sample and fi are the observed

variant allele frequencies. λ is used as a weight between the estimators. If λ = 1 only the read count information is taken into account, e.g. if no SNV information is present. λX j  rj− φCj+ 2(1 − φ) φτ + 2(1 − φ) 2 + (1 − λ)X j m(j) X i=1  fj− φsi φCj+ 2(1 − φ) 2 (2.15)

2.2.5 Parameter estimation

The authors chose to estimate the model parameters with an iterative scheme.

Besides the initialization it consists of two steps: (i) the parameters Cj, si and

τ are estimated (ii) a none linear least square method is used to nd an φ that

minimizes the least square objective function. To try a variety of values for φ and

τ, a grid search is performed. Thereby the best solution is chosen by ranking the

dierent solutions by their mean-squared error of the non linear models tted in step (ii).

2.3 cn.MOPS

Based on the purity and ploidy estimates it is possible to use further methods to determine more detailed CNV results. For this purpose the framework cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation se-quencing data with a low false discovery rate by Klambauer et al. [11] was used. Therefore it was necessary to develop a new normalization method, which takes the purity and ploidy information for the specic tumor samples into account.

2.3.1 Processing pipeline

The CNV processing pipeline of cn.MOPS consists of three main steps:

(i) Preprocessing: Mapping and counting the reads from RAW data and sample normalization.

(25)

(ii) CNV detection: Create the local models per segment.

(iii) Postprocessing: Segmentation, CNV calling and copy number estimation.

2.3.2 Preprocessing

With the standard parameters the reads are counted in constant length segments along the genome. Then the samples are normalized to make the data comparable. One of the main dierences between cn.MOPS and previous methods is, that cn.MOPS creates a model for each of this segments separately across all samples. This makes a GC correction unnecessary because the GC content in a segment is the same across all samples.

2.3.3 The model

Conicting variable names between PyLOH and cn.MOPS were adapted in Sec-tion 2.3 to simplify the terminology. The CNV detecSec-tion mechanism of cn.MOPS is based on a generative probabilistic model. Equation 2.16 shows that the read counts for each segment are modeled as a mixture of Poisson distributions. Here

Dj are the read counts for segment j, αc is the percentage of samples with copy

number c and λ are the mean read counts for copy number 2.

P (Dj) = X c∈C ρcP (Dj; c 2λ) (2.16)

In order to nd the optimal model parameters, the expectation maximization framework by Dempster et al. is applied. To incorporate the prior knowledge that most segments will have a normal copy number of 2, a Dirichlet prior, as shown in Equation 2.17, is used. P (ρ) = 1 B(γ) Y c∈C ργc−1 c (2.17)

On basis of the probabilistic model, cn.MOPS gives an informative/non-informative (I/NI) call. If more copy numbers dier from 2 the I/NI call gets higher.

2.3.4 Segmentation and post-processing

For segmentation, adjacent segments with a similar signed I/NI call are joined (the sign distinguishes between loss and gain). The level of the I/NI call compared with the expected fold change is used for CNV calling and integer copy number estimation.

(26)

2.4 Comparison of cn.MOPS and PyLOH

The following section will examine similar aspects of cn.MOPS and PyLOH as both methods use a Poisson model to explain the read counts and both methods use the EM algorithm to estimate the model parameters.

2.4.1 Poisson mixture model

Both methods try to explain the read counts per segment by a Poisson mixture

model. Equation 2.16 shows the model of cn.MOPS. The read counts Dj are

explained by a mixture of Poisson distributions over all copy numbers. In PyLOH the rst part of Equation 2.8 shows the Poisson distribution for the read counts

Dj. Similar to cn.MOPS λ contains the expected read counts for each segment

and is used as parameter of the distribution. ρ contains the probability for each copy number like α in cn.MOPS.

2.4.2 EM update rule

As already mentioned in Section 2.1.6 it is not obvious whether it is probabilisti-cally correct to take the weighted average of the MLE and the prior to calculate the model parameter ρ (Equation 2.18). Therefore we try to reconstruct the de-duction of the calculation for ρ in the M-step. In PyLOH the weight w is set to

0.9 and does not depend on the prior.

ρ(t+1)jc = wρ(t+1)jc,MLE+ (1 − w)ρ(t+1)jc,prior (2.18) M-Step: Optimization w.r.t to ρ without prior

Minimize Q w.r.t. ρjc such that Pc∈Cρjc = 1

∂ ∂ρjc  − J X j=1 X c∈C  ψjclog ρjcpPoisson(DjT|λj(φ, c)) + Ij X i=1 X g∈G . . .    +ν X c∈C ρjc− 1 ! = −ψjc 1 ρjc + ν = 0

Summing over c gives ν = 1. Therefore ρjc = ψjc.

M-step: Optimization w.r.t to ρ with prior

(27)

∂ ∂ρjc  − J X j=1   X c∈C  ψjclog ρjcpPoisson(DTj|λj(φ, c)) + Ij X i=1 X g∈G . . .    + log pDiri(ρj.|γ)   + ν X c∈C ρjc− 1 ! =

In the above formula γ are the hyperparameters of the Dirichlet prior and ρj.

is the vector of ρjc across copy numbers.

pDiri(ρj.|γ) = 1 B(γ) Y c∈C ργc−1 jc

log pDiri(ρj.|γ) = − log B(γ) +

X

c∈C

(γc− 1) log ρjc

The derivative is, therefore:

−ψjc 1 ρjc − γc− 1 ρjc + ν = 0 ρjc = ψjc+ (γc− 1) ν Summing over c: 1 =X c∈C ρjc = X c∈C ψjc+ (γc− 1) ν ⇒ ν = X c ψjc+ γc− 1 Therefore ρjc = ψjc+ (γc− 1) P c(ψjc+ γc− 1) .

Theorem 1. This update formula is a weighted average between the maximum likelihood estimate and the prior's mode.

Proof. In this section, we leave out the index j to keep the notation uncluttered -note that we consider a xed segment j. In the publication, the authors suggest

to update the parameters ρ = [ρc] as a weighted average between the maximum

likelihood estimate vector Ψ = [ψj]and the prior's mode kΓkΓ1 =

h

γc−1

P

c(γc−1)i, i.e.,

there exist a scalar w such that for all c:

ρjc = ψjc+ (γc− 1) P c(ψjc+ γc− 1) =? wψjc+ (1 − w) γc− 1 P c(γc− 1) ∀c In vector form: ρ = Ψ + Γ kΨ + Γk1 =?w Ψ kΨk1 + (1 − w) Γ kΓk1

(28)

The question is whether there is a w, that satises these conditions. We suggest: w = kΨk1 kΨ + Γk1 Ψ + Γ kΨ + Γk1 = w Ψ kΨk1 + (1 − w) Γ kΓk1 Ψ + Γ kΨ + Γk1 = kΨk1 kΨ + Γk1 Ψ kΨk1 + (1 − kΨk1 kΨ + Γk1 ) Γ kΓk1

Since Ψ is a probability vector, kΨk1 = 1 and Ψ and Γ have only positive

entries, therefore kΨ + Γk1 = kΨk1+ kΓk1. w Ψ kΨk1 + (1 − w) Γ kΓk1 = kΨk1 kΨ + Γk1 Ψ kΨk1 + (1 − kΨk1 kΨ + Γk1 ) Γ kΓk1 = = Ψ kΨ + Γk1 + kΨ + Γk1− kΨk1 kΨ + Γk1  Γ kΓk1 = = Ψ kΨ + Γk1 + kΨk1+ kΓk1− kΨk1 kΨ + Γk1  Γ kΓk1 = = Ψ kΨ + Γk1 +  kΓk 1 kΨ + Γk1  Γ kΓk1 = Ψ + Γ kΨ + Γk1 q.e.d. As we could proof a w exists which gives the same solution as the derivate of the negative log likelihood including the Dirichlet prior, the weighted average of the MLE and the prior is a valid solution to calculate ρ. Note that the weighting factor w can not be chosen arbitrarily as in PyLOH, but is determined by Ψ and

Γ.

2.5 Optimized PyLOH

Using PyLOH we discovered that the likelihood model of PyLOH (Equation 2.8) in combination with choices made for the implementation (Section 2.1.7) oered some room for improvements. Because PyLOH includes the BAF information in the model for every segment, short segments without or not enough heterozygous sites are excluded from the analysis. This occurs when the LOH status in the preprocessing stage is determined as NONE and leads to the condition that no copy number will be calculated for these segments. This is especially problematic when WES data is used, because the target regions are typically only a few 100 bases long. The authors suggest to work around this problem by modifying the segmentation, so that it only consists of longer segments, which contain enough

(29)

heterozygous sites.1 As this inevitably leads to a loss of detail in the results, we

suggest a dierent approach. In order to include all segments in the analysis, independent of the availability of heterozygous sites, we extended the likelihood model of PyLOH.

2.5.1 Improved objective

The new likelihood model is shown in Equation 2.19. The product of all segments normally used in the PyLOH implementation ranges from j = 1, ..., J. In compar-ison to Equation 2.8 we extended the model by a product of k = 1, ..., K for all segments without LOH information. This new term omits the model for the BAF and only uses the model for the read counts (Equation 2.1).

Based on this new model we can calculate the complete likelihood including all segments in the data, independent of the occurrence of heterozygous sites. It shall be noted that this change allows for a much higher utilization of the data, because the read count information from segments without BAF would otherwise be neglected. P (D, b|φ, ρ) = = J Y j=1 X c∈C P (Cj = c|ρj)P (DjT, b T j|Cj = c, φ) + K Y k=1 X c∈C P (Ck= c|ρk) = = J Y j=1   X c∈C ρjcPPoisson(DjT|λj) Ij Y i=1 X g∈G QgcPBinom(bTij|µ, dij) ! + K Y k=1 X c∈C ρkcPPoisson(DTk|λk) ! (2.19)

2.5.2 Implementation changes

The original GitHub repository containing the complete source code of the PyLOH

implementation was forked as PyLOH_Opt 2. Like PyLOH the forked repository

is publicly available.

The implementation was changed so that all segments get a copy number call. This was done by implementing functions to additionally calculate the likelihoods for segments without LOH information. Because of design decisions for the EM algorithm in the original implementation, it was necessary to introduce dummy sites with allele frequencies of 1/1 (allele A/B). This represents the prior knowledge for a segment without specic information (LOH status NONE) for a BAF of

0.5.

1https://github.com/uci-cbcl/PyLOH/issues/1 2https://github.com/patrick-praher/PyLOH_Opt

(30)

Figure 2.2: Normalization of cancer data. The individual samples are shown on the y-axis. The x-axis shows the copy numbers along the genome. The read counts in segments with losses or gains (red or blue) are inuenced by the cancer. Green marked segments, which have no CNVs on any sample, are used to calculate the normalization constants.

2.6 Normalization of cancer data

As already mentioned, it is necessary to perform a normalization of the read count data, to make read counts comparable across samples[11]. Because of the change events that happen in cancer genomes, standard normalization meth-ods of cn.MOPS could not be used. In this new approach only segments with copy number 2 are used as as basis for the normalization. Although it would be more ac-curate if the whole genome could be used for normalization, these segments should be informative enough to represent each sample. Figure 2.2 shows that only seg-ments without CNVs in any sample are used to calculate the normalization factors for each sample.

2.7 Calculating expected fold change

The default vector for the expected fold changes for copy number 0 to copy number

8 in cn.MOPS is Id = (0.025, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4). This means that for copy

number 3, 1.5 times the read counts of copy number 2 are expected and so on. For copy number 0 a value of 0.025 is used to account for measurement noise. For cancer data it is necessary to correct this values according to Formula 2.20 because of the mixture of cancerous and normal cells in the sample. φ is the purity between

0 and 1 and Icn2 is a vector with the same length as Id containing just 1. If the

purity is low, the fold change values have to be closer to 1 because normal cells are expected to have copy number 2.

ˆ

(31)

2.8 Visualization

We developed a visualization technique that is able to display the high number of CNVs for tumor samples in a clear and intuitive way. Additionally we illustrate the mutational dierences of the samples as phylogenetic trees based on our copy number analysis.

2.8.1 Karyograms

Because of the high number of changes in the DNA of a tumor, it is very dicult to interpret and explain the results of the data analysis. In our project it was necessary to communicate the results to medical doctors. Therefore we decided to plot the copy number variations along the chromosome like a karyogram known from comparative genomic hybridisation (CGH) [32]. The 22 chromosomes are plotted below each other along the x-axis. The copy numbers are shown in dierent colors. Shades of red indicate deletions and shades of blue show amplications. It is important to note that this karyograms do not necessarily show the actual condition of each chromosome. They are a visualization of our analysis which is based on the assembled sequence reads that are mapped along a reference genome. Due to the bad condition of the cancer genome it is not always certain if the reads are mapped to the original position. Therefore it is possible that the reads from parts with deletions are mapped to other positions or that parts from highly amplied segments belong to new chromosomes which are not know in the reference genome.

2.8.2 Construction of phylogenetic trees

In order to illustrate the dierences between the changes of the individual samples we construct trees, in the fashion of phylogenetic trees. In dierence to common genetic trees, which use dierences in the DNA sequences as distance measure, we use the copy numbers of the tumor genome to construct the distance matrix as shown in Tumour evolution inferred by single-cell sequencing [18] by Navin et al. Therefore we generate a segmentation which contains all break points from the segmentations from all samples. The distance between two samples is the sum of the dierences between the copy numbers of the segments as shown in Listing A.1. Based on the distance matrix, the neighbor joining algorithm [25] is used to construct the tree.

In addition to the phylogenetic tree, the following knowledge can be used for interpretation: Mutations of the same segment present in multiple samples indi-cate the same ancestors in terms of tumor branches. It is very unlikely, though not impossible, that mutations are undone as the tumor develops. Only double deletions (CN0) can not be undone. Therefore mutations unique to one sample can be used to identify dierent branches and/or later states in the development.

(32)

3

|

Results

The rst part of this chapter is a comparison of PyLOH and absCN-seq on a WGS data set. Afterwards two WES data sets are analyzed. The rst one is used as an evaluation data set for PyLOH and to evaluate improvements suggest in chap-ter 2.5. The last data set qualies for the analysis of the evolutionary development of cancer because spatially separated samples per patient are available.

3.1 Early onset prostate cancer

In order to compare PyLOH and absCN-seq, an early onset prostate cancer (EOPC)1

data set from the International Cancer Genome Consortium (ICGC) 2 was used.

The data set contains 11 WGS paired normal and tumor samples. Because of the large amount of data only three samples were taken into account (EOPC01, EOPC06 and EOPC010) for the comparison.

3.1.1 Segmentation

Because both methods depend on an existing segmentation for the data, this was the rst step in the comparison. For segmentation the DNAcopy package [27] was used.

The read counts were calculated using cn.MOPS [11] with a window length of 50000 bases. This is fairly long, but for the purpose of determining the tumor purity and ploidy it is a good compromise between performance and accuracy. The DNAcopy settings are shown in Box 3.1.

The resulting segmentations were used for all further tests with PyLOH and absCN-seq. Figure 3.1 shows the resulting segmentation of chromosome 16 for sample EOPC010.

1https://dcc.icgc.org/projects/EOPC-DE 2https://icgc.org/

(33)

−0.4 −0.2 0.0 0.2 0.4 EOPC010 − Chr16 log(relativ e CN) Genomic Position 50001 2650001 5250001 7850001 10450001 13050001 15650001 18250001 20850001 23450001 26050001 28650001 31250001 33850001 36450001 39050001 41650001 44250001 46850001 49450001 52050001 54650001 57250001 59850001 62450001 65050001 67650001 70250001 72850001 75450001 78050001 80650001 83250001 85850001 88450001

Figure 3.1: Segmenation of chromosome 16 for sample EOPC010. The dots are the log2 ratios of the read counts for each segment. The bars show the segments determined by DNAcopy in alternating colors (red and blue) for better visibility.

Box 3.1: Segmenation parameters

The read counts from cn.MOPS were used to calculate the log2 ratios as input for DNAcopy with the following parameters:

• data.type="logratio" • min.width=5 • alpha=0.001 • undo.splits="sdundo" • undo.SD=3

3.1.2 Comparison

Because PyLOH oers a fully integrated tool chain for processing binary align-ment mapping (BAM) les no further preprocessing was necessary. In the rst stage (called preprocessing) PyLOH determines the read counts and BAF for each segment and lters the data according to certain quality criteria. In the second stage the model is applied. As described in Section 2 a numerical search for the purity is performed. The parameters are shown in Box 3.2.

AbsCN-seq has no built-in functionality for working with BAM les. Therefore the read counts determined for the segmentation were used. Because absCN-seq does not contain the functionality for calling SNVs (single nucleotide variants) like PyLOH, and the impact of third party tools on the results should be as small as possible, absCN-seq has only been tested with the read count information. The

(34)

parameters are shown in Box 3.3.

Where PyLOH only presents the best solution determined, absCN-seq shows a list of possible solutions ordered by the mean squared error (MSE). The reason for this implementation seems to be a problem with ranking the possible solutions by their quality. In the tests performed, the rst, and therefore presumably best, solutions where pretty bad for all samples. The ploidy estimate was always very high and the purity very low. Therefore it seemed appropriate to select an alter-native best solution based on the overall ploidy and the MSE (ordered by the MSE the rst solution with a ploidy between 1.5 and 2.5 was selected).

The reference for the comparison of the copy number estimates of the segments was determined by an algorithm called DALLY [23] and published on ICGC. Al-though the DALLY estimates are also based on the same WGS data, it is justied to use this results for the comparison, because DALLY uses the paired-end and split-read information and not the read counts. Figure 3.2 shows a precision/recall plot of the results for PyLOH, the default solution for absCN-seq and the manu-ally chosen solution as described above. Because the segmentation is not always correct, and therefore some CNVs are split in multiple fractions, adjacent segments with the same copy number are joined for the comparison.

PyLOH performed really well on this data set with a recall of almost 1 on every sample and a precision between 0.9 and 1. For sample EOPC01 and EOPC06 the manually chosen results from absCN-seq were good as well, but the rest of the absCN-seq results were poor. Figure 3.3 shows a sample for which PyLOH estimated all the copy numbers correct and absCN-seq only estimated some CNs correct. Also for chromosome 18 of sample EOP010 absCN-seq did not manage to estimate the ploidy correctly (Figure 3.4).

Box 3.2: PyLOH parameters

Preprocessing parameters: • --min_depth 20 • --min_base_qual 10 • --min_map_qual 10 Model parameters: • --allelenumber_max 2 • --max_iters 100 • --stop_value 1e-7

Box 3.3: absCN-seq parameters

Model parameters: • alpha.min=0.2 • alpha.max=1.0 • tau.min=1.5 • tau.max=5.0 • min.sol.freq=0 • min.seg.len=0 • qmax=7

(35)

Table 3.1: Precision Recall for sample EOPC01

exact CN EOPC06 EOPC010

Precision Recall Precision Recall Precision Recall

PyLOH 0.968 0.997 0.998 0.978 0.908 0.993 absCN-seq default 0.001 0.009 0 0 0.042 0.803 absCN-seq best 0.962 0.996 0.997 0.995 0.014 0.038 ● ● ● 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 EOPC − PyLOH/absCN−seq Precision Recall ● PyLOH absCN−seq default absCN−seq best EOPC01 EOPC06 EOPC010

Figure 3.2: Precision/recall plot of the PyLOH, absCN-seq default and absCN-seq best results for all data sets.

(36)

−0.6 −0.4 −0.2 0.0 0.2 0.4 0.6 EOPC010 − Chr16 log(relativ e CN) Genomic Position 50001 2650001 5250001 7850001 10450001 13050001 15650001 18250001 20850001 23450001 26050001 28650001 31250001 33850001 36450001 39050001 41650001 44250001 46850001 49450001 52050001 54650001 57250001 59850001 62450001 65050001 67650001 70250001 72850001 75450001 78050001 80650001 83250001 85850001 88450001 CN0 CN1 CN2 CN3 CN4 CN0 CN1 CN2 CN3 CN4 CN0 CN1 CN2 CN3 CN4 Segments Reference PyLOH absCN−seq best

Figure 3.3: Plot of chromosome 16 for sample EOPC010. The black dots are the read counts for each window. Only the CNs dierent from 2 are plotted. The green line and the blue line overlap completely, which means that PyLOH estimated all CNs correctly. AbsCN-seq did not estimate the correct CN for many segments.

−0.6 −0.4 −0.2 0.0 0.2 0.4 0.6 EOPC010 − Chr18 log(relativ e CN) Genomic Position 1 2450001 4900001 7350001 9800001 12250001 14700001 17150001 19600001 22050001 24500001 26950001 29400001 31850001 34300001 36750001 39200001 41650001 44100001 46550001 49000001 51450001 53900001 56350001 58800001 61250001 63700001 66150001 68600001 71050001 73500001 75950001 CN0 CN1 CN2 CN3 CN4 CN0 CN1 CN2 CN3 CN4 CN0 CN1 CN2 CN3 CN4 Segments Reference PyLOH absCN−seq best

Figure 3.4: Plot of chromosome 18 of sample EOPC010. The whole chromosome has a ploidy of 1. PyLOH estimated the CN correct whereas absCN-seq estimated CN 0.

(37)

Table 3.2: Comparison of the the purity estimates.

PyLOH absCN-seq default absCN-seq best Reference

EOPC01 0.547 0.22 0.46 0.50

EOPC06 0.449 0.26 0.4 0.47

EOPC010 0.466 0.2 0.2 0.50

Mean average error 0.034 0.263 0.137

The purity estimates for the EOPC data set, which was used as reference, were published separately [31]. Table 3.2 shows that the estimates from PyLOH were quite accurate. Only in the cases where absCN-seq managed to nd a good purity estimate, the precision and sensitivity for the CNs were good too. This is a strong indicator that the identiability problem, discussed in Chapter 1, was not solved suciently in absCN-seq, at least without the use of SNVs.

3.1.3 Conclusion

In comparison to absCN-seq PyLOH performed much better on the early onset prostate cancer data set. The software implementation of PyLOH is of higher quality and it also delivers signicantly better results than absCN-seq. Therefore we decided to focus on PyLOH for all further analysis tasks.

(38)

3.2 Children's Cancer Research Institute

The Children's Cancer Research Institute (CCRI), which is associated with the St. Anna Children's Hospital in Vienna, is the largest center for research related to childhood cancer in Austria [28]. They provided a WES data set from 10 leukemia patients. For each patient three samples are available:

• Diagnose (dia): tumor sample at the point in time when leukemia was

diag-nosed

• Remission (rem): normal sample after remission of the initial leukemia

• Relapse (rel): tumor sample at the relapse point in time

Additionally to the 20 tumor samples, reference results, determined with SNP micro arrays, are available.

Goals We chose this data set to estimate the performance of PyLOH by compar-ing our results to the SNP micro array data. Furthermore we use this data set to evaluate our improved version of PyLOH. Additionally we show our visualization technique for the copy number analysis.

3.2.1 Approach

The following steps were taken in order to analyze the CCRI data set:

(i) The segmentation was determined by DNAcopy as described in Section 3.1.1 and later manually edited to get one single segmenation for all samples. (ii) PyLOH and Improved PyLOH: Based on the same segmentation we applied

PyLOH and our improved version of PyLOH on the data set with the pa-rameters shown in Box 3.4.

(iii) A critical factor in the analysis of data sets is visualization. If done right, it can be very helpful to make results comprehensible, especially for people in other scientic domains. Therefore we decided to plot the copy numbers as karyograms. The two red colors show losses and the dierent shades of blue show gains along the chromosomes.

(iv) In order to calculate a numerical score for the quality of the copy number estimates we calculated the overlap between the reference results and our analysis. Because this is an exome data set, the overlap was calculated only in the exome target regions. Our method additionally enables us to compare the exact copy number or only loss/gain information.

(39)

Box 3.4: PyLOH parameters Preprocessing parameters: • --min_depth 20 • --min_base_qual 10 • --min_map_qual 10 • --WES Model parameters: • --allelenumber_max 2} • --max_iters 100 • --stop_value 1e-7

3.2.2 Purity estimates

Table 3.3 contains the purity estimates for all samples with PyLOH and improved PyLOH in comparison to the reference results. The reference purity values were determined by a pathologist at the Children's Cancer Research Institute. Samples P02 and P04 showed the lowest estimates with about 50% and 60%. Also these two samples seemed to be dicult to analyze for CNVs. The purity estimates for all other samples were higher than 70% which is favorable for the further analysis. In general the purity estimates obtained with PyLOH were lower than the estimates of the pathologist which in our experience is usually the case.

3.2.3 Copy number estimates

For the copy number estimates the exact copy number and only the loss/gain information were compared separately. The comparison between both attempts is of interest because established methods like comparative genomic hybridisation (CGH)[32] only give a loss/gain call. The exact copy number is obviously still the preferred information.

Figure 3.7 and Figure 3.8 show precision and recall of our original and improved PyLOH results in comparison with the reference data on basis of the loss/gain information. The plots show that samples P01 diagnose, P01 relapse, P02 diagnose, P02 relapse and P04 diagnose require additional attention. These samples will be analyzed in detail in Section 3.2.4. The other samples performed well in both cases. The results shown in Figure 3.5 and Figure 3.6 are based on the comparison of the exact copy number. Interestingly only the results for the sample P01, P02 and P04, which also did not perform to well when comparing the loss/gain information for at least original PyLOH, are clearly worse than the other samples. The detailed results are shown in Table 3.4. For the original implementation the precision for 70% of the samples is over 90% and the recall for 75% of the samples is over 90% for the comparison of the exact copy number. We managed to enhance this results with our improved version of PyLOH, where the recall for 80% of the samples is over 90% without a noteworthy loss of precision.

(40)

Table 3.3: Purity estimates for CCRI samples. The P02 diagnose and relapse sample (red) showed a purity below 60% which is the lowest in the data set and also proved to be the most dicult to analyze.

Sample PyLOHPurity Purity Imp.PyLOH Purity Reference

P01 diagnose 0.787 0.714 0.97 P01 relapse 0.748 0.720 0.81 P02 diagnose 0.572 0.572 0.81 P02 relapse 0.525 0.525 0.72 P03 diagnose 0.726 0.727 0.85 P03 relapse 0.814 0.815 0.94 P04 diagnose 0.626 0.625 0.96 P04 relapse 0.676 0.676 0.76 P05 diagnose 0.899 0.897 0.97 P05 relapse 0.844 0.845 0.86 P06 diagnose 0.824 0.823 0.98 P06 relapse 0.751 0.752 0.94 P07 diagnose 0.835 0.836 0.92 P07 relapse 0.765 0.765 0.92 P08 diagnose 0.817 0.816 0.95 P08 relapse 0.922 0.920 0.89 P09 diagnose 0.724 0.727 0.94 P08 relapse 0.781 0.785 0.90 P10 diagnose 0.877 0.870 0.97 P10 relapse 0.927 0.927 0.98

(41)

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

CCRI Precision/Recall − exact CN − original PyLOH

Precision Recall P01_dia P01_rel P02_dia P02_rel P03_dia P03_rel P04_dia P04_rel P05_dia P05_rel P06_dia P06_rel P07_dia P07_rel P08_dia P08_rel P09_dia P09_rel P10_dia P10_rel

Figure 3.5: Precision/recall by comparing exact copy numbers with reference data for CCRI samples from original PyLOH

(42)

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

CCRI Precision/Recall − exact CN − improved PyLOH

Precision

Recall

Figure 3.6: Precision/recall by comparing exact copy numbers with reference data for CCRI samples from improved PyLOH

(43)

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

CCRI Precision/Recall − Loss/Gain − original PyLOH

Precision Recall P01_dia P01_rel P02_dia P02_rel P03_dia P03_rel P04_dia P04_rel P05_dia P05_rel P06_dia P06_rel P07_dia P07_rel P08_dia P08_rel P09_dia P09_rel P10_dia P10_rel

Figure 3.7: Precision/recall by comparing losses/gains with reference data for CCRI samples from original PyLOh

(44)

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

CCRI Precision/Recall − Loss/Gain − improved PyLOH

Precision Recall P01_dia P01_rel P02_dia P02_rel P03_dia P03_rel P04_dia P04_rel P05_dia P05_rel P06_dia P06_rel P07_dia P07_rel P08_dia P08_rel P09_dia P09_rel P10_dia P10_rel

Figure 3.8: Precision/recall by comparing losses/gains with reference data for CCRI samples from improved PyLOH

(45)

3.2.4 Samples in detail

Sample - P01 The CNVs in the reference data are much longer (often whole chromosomes) than the CNVs found with the original PyLOH. The CNVs on chro-mosome 4, 17, 21 and 22 from diagnose (Figure 3.9(a)) and relapse (Figure 3.10(a)) sample are similar in the reference data (Figure 3.9(c) and Figure 3.10(c)) and are also similar in our analysis. This is an indicator that a method produces consistent results. Like in the other sample, PyLOH did not show short deletions which is a result of the segmentation.

The most interesting discovery from sample P01 is, that our improved version of PyLOH delivered much better results regarding recall than the original implemen-tation. For sample P01 diagnose (Figure 3.9(b)) the recall could be improved from 0.593 to 0.954 for the loss/gain comparison and 0.588 to 0.919 for the exact copy number comparison. The improvements on sample P01 relapse(Figure 3.10(b)) are similarly well.

Sample - P02 Sample P02 was the most problematic for both the diagnose (Figure3.11) and relapse (Figure3.12) state. The comparison of the exact copy numbers gives a quite poor result. When only the losses/gains are compared the results seem a bit better. The method does not seem to work properly on this sample, because too many segments are falsely classied with a deletion. With about 55%(Table 3.3), which is the lowest value for the whole CCRI data set, also the purity indicates a dierence to the other samples. Also previous analysis of this sample with cn.MOPS proved dicult. In our impression this sample suers from bad quality which limits the potential for a proper analysis.

Sample - P04 The P04 diagnose sample (Figure 3.13) shows a similar problem as sample P02. Many segments are classied with a loss or gain higher than copy number three although there is none. Purity estimates also show a low value in comparison to the other samples (Table 3.3). The results for the P04 relapse sample (Figure 3.14) are good. The gains on chromosome 1, 6, 9, 10, 14, 18 and 21 are all found with mostly the exact copy number.

Sample - P03 and P05 - P10 All these sample show really good results with both the original and the improved version of PyLOH. As an example sample P03 is shown. The karyograms for diagnose (Figure 3.15) and relapse (Figure 3.16) both show a very good estimation of the actual copy numbers. Precision and recall of about 94% were reached when comparing the exact copy numbers for the P03 samples.

(46)

Table 3.4: Precision/recall for CCRI samples. The comparisons were performed with loss/gain information and the exact copy numbers for original and improved PyLOH. The green rows show sample P01 where the recall is clearly better for our improved version of PyLOH.

Loss/Gain Exact CN

orig. PyLOH imp. PyLOH orig. PyLOH imp. PyLOH

Prec. Recall Prec. Recall Prec. Recall Prec. Recall

P01_dia 0.943 0.593 0.945 0.954 0.934 0.588 0.91 0.919 P01_rel 0.947 0.579 0.941 0.981 0.845 0.516 0.772 0.805 P02_dia 0.447 0.866 0.445 0.868 0.255 0.494 0.254 0.496 P02_rel 0.462 0.88 0.46 0.882 0.265 0.504 0.264 0.506 P03_dia 0.969 0.961 0.967 0.973 0.945 0.937 0.939 0.945 P03_rel 0.968 0.971 0.967 0.982 0.937 0.941 0.937 0.951 P04_dia 0.299 0.961 0.299 0.978 0.189 0.607 0.19 0.62 P04_rel 0.994 0.969 0.991 0.994 0.981 0.956 0.978 0.981 P05_dia 0.981 0.979 0.975 0.998 0.981 0.979 0.971 0.995 P05_rel 0.969 0.976 0.966 0.997 0.967 0.974 0.963 0.995 P06_dia 0.965 0.95 0.96 0.965 0.931 0.916 0.927 0.931 P06_rel 0.969 0.975 0.963 0.989 0.968 0.975 0.962 0.989 P07_dia 0.862 0.94 0.841 0.959 0.862 0.94 0.841 0.959 P07_rel 0.772 0.966 0.771 0.984 0.763 0.954 0.762 0.971 P08_dia 0.988 0.977 0.986 0.993 0.968 0.957 0.964 0.971 P08_rel 0.997 0.975 0.995 0.998 0.979 0.958 0.973 0.975 P09_dia 0.994 0.973 0.993 0.991 0.988 0.967 0.983 0.981 P09_rel 0.995 0.972 0.994 0.99 0.995 0.972 0.99 0.987 P10_dia 0.994 0.974 0.99 0.996 0.994 0.974 0.99 0.996 P10_rel 0.994 0.972 0.994 0.994 0.994 0.972 0.994 0.994

3.2.5 Improved PyLOH

On the basis of this data set we could show the improvements of our version of PyLOH on real live data. The changes did not show any negative inuence where PyLOH already performed good, because there only a small fraction of segments is skipped. In this cases primarily the original object is applied. Our optimizations aect the model only for segments which would be skipped due to a low number of heterozygous site. The P01 samples suered from this problems where many segments did not receive an LOH call in the preprocessing and therefore were skipped by the model. In the P01_dia sample 163 segments and in the P01_rel 198 segments got an 'NONE' LOH call. In none of the other samples more than 85 segments got excluded although the same segmentation was used throughout the whole experiment. For this segments the extended objective ensured that copy number calls were made on basis of the read counts. As already mentioned this greatly improved the sensitivity by additionally using available information without a huge loss in precision.

Referenzen

ÄHNLICHE DOKUMENTE

The sunburst chart including features of venous invasion for the different final classifications shows that these features were most often scored as present in the nodules that were

[6] Thus the Asia-Pacific meeting set out three goals: (1) Universal civil registration of births, deaths and other vital events; (2) All individuals are provided with

While this doctrine is not an exception to United Nation’s Article 2(4), 17 it is also equally clear that contemporary international law, and the UN Charter prohibit states

basic, APL, fortran, and SPL, experience with digital electr. Need: mass memory such as floppy disk and a hard-copy device. Designing automated Measurement system

WITH THE POWER CONTROL MODULE ORIENTED AS SHOWN IN FIGURE 2, CAREFULLY ROUTE THE MODULE CONNECTORS THROUGH THE OPENING BETWEEN THE FAN HOUSING AND THE POWER SUPPLY BOARD.. THE

The red-green government of Chancellor Gerhard Schröder enforced promotion of electricity produced from renewable energy sources and the gradual restriction of

In this work, a versatile framework is pre- sented which provides data preprocessing and visualiza- tion approaches for the analysis of high-throughput screen- ing

If Iran blames the United States for supporting the Syrian rebels, the US’ Arab allies argue that Washington’s failure to supply moderate Syrian rebels with