
3.3 Colon Cancer Linz

3.3.4 cn.MOPS results

LM samples

Sample LM4 (Figure 3.49) was chosen because it showed the fewest mutations in the PyLOH analysis. In the cn.MOPS results it also shows fewer mutations than LM1 (Figure 3.48), which has additional amplifications on chromosomes 1, 4, 5 and 6. LM1 also has deletions on chromosomes 15 and 16 that do not occur in LM4. The general picture of many short segments with mutations is alike for both methods.

cn.MOPS overview

Overall, the results from PyLOH and cn.MOPS are very similar for the samples of all donors. Except for one sample (HH3), cn.MOPS confirms our expectation that the region at 50 Mb on chromosome 18, where the DCC gene is located, is deleted.

For the HH3 sample, cn.MOPS calls a deletion from 21 Mb to 47 Mb, which is not far from this region. In contrast to PyLOH, cn.MOPS delivers more detailed results, including CNV calls for shorter segments, which more accurately meet the expectations for the analysis of exome data.
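The check described above, whether a deletion call covers the 50 Mb position on chromosome 18 and, if not, how far away the nearest deletion lies, can be sketched as follows. This is only an illustration: the tuple layout, the function name and the copy number of 1 assigned to the HH3 deletion are assumptions, not output of cn.MOPS.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical call layout: (chromosome, start, end, copy number)
Call = Tuple[str, int, int, int]


def distance_to_nearest_deletion(calls: Iterable[Call], chrom: str,
                                 position: int, normal: int = 2) -> Optional[int]:
    """Return 0 if a deletion covers the position, else the gap in bases
    to the closest deletion on that chromosome, or None if there is none."""
    best: Optional[int] = None
    for call_chrom, start, end, copy_number in calls:
        if call_chrom != chrom or copy_number >= normal:
            continue  # other chromosome, or not a deletion
        if start <= position <= end:
            return 0
        gap = start - position if position < start else position - end
        if best is None or gap < best:
            best = gap
    return best


# HH3 as described in the text: deletion from 21 Mb to 47 Mb on chromosome 18.
# The copy number of 1 is an assumed value for illustration.
hh3_calls = [("18", 21_000_000, 47_000_000, 1)]
hh3_gap = distance_to_nearest_deletion(hh3_calls, "18", 50_000_000)
```

With these assumed coordinates the deletion ends 3 Mb before the expected position, which is the sense in which the HH3 call is "not far" from the DCC region.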

Figure 3.42: Karyogram BF1 cn.MOPS (chromosomes 1 to 22; position axis 0 to 250 Mb; copy number scale 0 to 8)

Figure 3.43: Karyogram BF3 cn.MOPS (chromosomes 1 to 22; position axis 0 to 250 Mb; copy number scale 0 to 8)

Figure 3.44: Karyogram HH1 cn.MOPS (chromosomes 1 to 22; position axis 0 to 250 Mb; copy number scale 0 to 8)

Figure 3.45: Karyogram HH2 cn.MOPS (chromosomes 1 to 22; position axis 0 to 250 Mb; copy number scale 0 to 6)

Figure 3.46: Karyogram KI3 cn.MOPS (chromosomes 1 to 22; position axis 0 to 250 Mb; copy number scale 0 to 7)

Figure 3.47: Karyogram KI5 cn.MOPS (chromosomes 1 to 22; position axis 0 to 250 Mb; copy number scale 0 to 6 and 8)

Figure 3.48: Karyogram LM1 cn.MOPS (chromosomes 1 to 22; position axis 0 to 250 Mb; copy number scale 0 to 7)

Figure 3.49: Karyogram LM4 cn.MOPS (chromosomes 1 to 22; position axis 0 to 250 Mb; copy number scale 0 to 8)

4 | Discussion

In order to analyze NGS tumor samples, it was necessary to determine which of the available methods provides the best performance for copy number detection in cancer. Therefore we evaluated absCN-seq and PyLOH on WGS data. Because both methods depend on an existing segmentation, we tried to automate the segmentation in order to streamline the whole process. During the evaluation of PyLOH we discovered potential for improvements. The improved version of PyLOH was evaluated on a WES data set and used to analyze a data set with spatially separated samples of a tumor. This data set was used to study the evolutionary development of colon cancer based on the samples of four patients. Different visualization methods were applied to illustrate the results. The ability to analyze a tumor genome based on NGS data bears high potential for improvements in the prognosis and the treatment of tumor patients.

Method evaluation The comparison of absCN-seq and PyLOH was done using a subset of the ICGC EOPC data set. The first challenge was the segmentation of the genome. The workflow for getting an accurate segmentation of a tumor genome normally includes the time-consuming task of tuning the parameters of the segmentation algorithm or manually editing the segment borders until a good result is achieved. This is a problem because it makes it much more difficult to streamline the automatic processing of tumor genomes.

The segmentations for the EOPC samples were not edited manually. Therefore, in some cases sections with the same copy number were split by the segmentation algorithm. Although this is not desired behavior, the consequences do not seem to be dramatic, because at least in the case of PyLOH a good result could be achieved by simply joining adjacent segments with the same estimated copy number afterwards.
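The joining step mentioned above can be illustrated with a minimal sketch. The `Segment` type and the function name are hypothetical and not part of PyLOH. The sketch also assumes that consecutive segments on a chromosome are contiguous, so it does not check for gaps between them.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    chrom: str
    start: int
    end: int
    copy_number: int


def merge_adjacent(segments: List[Segment]) -> List[Segment]:
    """Join neighboring segments on the same chromosome with equal copy number."""
    merged: List[Segment] = []
    for seg in sorted(segments, key=lambda s: (s.chrom, s.start)):
        if (merged
                and merged[-1].chrom == seg.chrom
                and merged[-1].copy_number == seg.copy_number):
            # Extend the previous segment instead of starting a new one.
            merged[-1] = merged[-1]._replace(end=max(merged[-1].end, seg.end))
        else:
            merged.append(seg)
    return merged


example = [
    Segment("chr1", 0, 100, 2),
    Segment("chr1", 100, 250, 2),  # same copy number as its neighbor
    Segment("chr1", 250, 400, 3),
    Segment("chr2", 0, 300, 3),    # same copy number, but another chromosome
]
merged_example = merge_adjacent(example)
```

In this toy input the first two segments collapse into one, while the last two stay separate because they lie on different chromosomes.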

Nevertheless segmentation is a very demanding task which requires further work in the future.

The software implementations of PyLOH and absCN-seq contrast sharply. PyLOH has a fully implemented toolchain for processing BAM files, so no additional algorithms are required. The code quality seems to be quite good and the program feels solid in use. Furthermore, PyLOH delivers intermediate results after preprocessing and can generate heatmap plots for the BAFs of each segment. For absCN-seq the normalized read count ratios have to be calculated manually. The same applies to SNV calling. The verbose mode is not very helpful either. Without modifications it was not possible to retrieve the copy numbers for the alternative purity and ploidy estimates, which in all cases contained the better results.

The first impressions based on the software quality are confirmed by the quality of the results. PyLOH performs well on all data sets without special optimization of the segmentation. In some cases PyLOH did not give results for short segments because they contained only a small number of heterozygous sites. Because the segmentation for the EOPC data set did not contain many short segments, the results were not influenced too badly. Problems related to the occurrence of heterozygous sites will be discussed later. As described in Section 3.1, the results of absCN-seq were mixed. It is very interesting that in some cases absCN-seq had a good alternative solution at hand but failed to rank the results correctly. It is very difficult to determine why this problem arises, but a prior based on the CN2 for selecting the best result might have helped. From the perspective of result quality, the additional effort of the PyLOH authors to solve the identifiability problem seems to be justified.

PyLOH improvements As already mentioned, one of the problems of PyLOH is its inability to deal with short segments that do not contain sufficient heterozygous sites for the analysis. In this case the segment is completely ignored and is not taken into account by the probabilistic model. The authors seem to be aware of this problem, because they subsequently added an option called WES which lowers the threshold for the number of necessary heterozygous sites per segment to 20, to at least allow the analysis of such data sets. Although this improves the functionality, it still completely excludes segments which do not meet the criteria from the analysis. This behavior seems odd, because PyLOH ignores the read count information for these segments even though it is possible to make a call based only on the reads.

At this point we decided to improve the main object of PyLOH by distinguishing between segments (i) with or (ii) without heterozygous sites. In the first case the segments are handled as before, but in the second case the call is based only on the read counts. This change enables the method to use all available information for determining the purity and to make a copy number call for every segment.
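The following sketch illustrates the idea of the two cases. It is not the PyLOH implementation: the function names, the `min_het_sites` parameter and the simple linear mixture relation between read ratio and copy number are assumptions made for illustration, whereas PyLOH itself uses a probabilistic model.

```python
def choose_call_mode(n_het_sites: int, min_het_sites: int = 20) -> str:
    """Decide which information a segment call is based on.

    min_het_sites is a hypothetical parameter; 20 is the threshold the text
    mentions for the WES option.
    """
    return "baf_and_reads" if n_het_sites >= min_het_sites else "reads_only"


def copy_number_from_reads(ratio: float, purity: float, ploidy: float) -> int:
    """Estimate an integer tumor copy number from a normalized tumor/normal
    read count ratio alone.

    Assumed relation (not taken from PyLOH):
        ratio = (purity * c + 2 * (1 - purity)) / (purity * ploidy + 2 * (1 - purity))
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    average = purity * ploidy + 2.0 * (1.0 - purity)  # mean copy number of the mixture
    raw = (ratio * average - 2.0 * (1.0 - purity)) / purity
    return max(0, round(raw))


# A segment without heterozygous sites still receives a call from its reads.
mode = choose_call_mode(n_het_sites=0)
loss_call = copy_number_from_reads(ratio=0.75, purity=0.5, ploidy=2.0)
```

Under the assumed relation, a read ratio of 0.75 at 50% purity and an average tumor ploidy of 2 corresponds to a single copy loss.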

Although in some cases this could negatively affect the precision, our analysis of the CCRI data set shows large improvements in recall for samples which were difficult to analyze with standard PyLOH. At the same time, our changes had no influence on the results of the samples which already worked well with standard PyLOH. This enhancement is very important for the whole pipeline because it reduces the dependency on the segmentation: the method is now also able to deal with shorter segments in a satisfying manner.

The CCRI data set was used to evaluate PyLOH against reference results from SNP arrays. We used one segmentation for the whole data set, which was manually edited to fit all samples. In this way we could reduce the impact of the segmentation on our evaluation. In general the purity estimates were lower than expected by the pathologist. This was not surprising, because the samples were taken from the blood of the leukemia patients, which is difficult to analyze. Most samples showed good results with the standard PyLOH version. As already mentioned, the recall for two samples could be enhanced by our improved version of PyLOH, whereas the precision decreased only slightly for these samples. A negative effect on the other samples could not be observed. Therefore our version of PyLOH became our method of choice for further analysis tasks.
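To make the trade-off between precision and recall concrete, the sketch below compares copy number calls against a reference on a common set of bins. The definition used here is an assumption for illustration (a true positive is an aberrant reference bin whose copy number is reproduced exactly, and `None` marks a bin without a call); the exact definition used in our evaluation may differ, and the numbers are toy data.

```python
from typing import Optional, Sequence, Tuple


def precision_recall(predicted: Sequence[Optional[int]],
                     reference: Sequence[int],
                     normal: int = 2) -> Tuple[float, float]:
    """Precision and recall of aberrant copy number calls, bin by bin."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    tp = fp = fn = 0
    for pred, ref in zip(predicted, reference):
        pred_aberrant = pred is not None and pred != normal
        ref_aberrant = ref != normal
        if pred_aberrant and ref_aberrant and pred == ref:
            tp += 1
        else:
            if pred_aberrant:
                fp += 1  # aberration called, but wrong or not in the reference
            if ref_aberrant:
                fn += 1  # reference aberration missed or called wrongly
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


reference = [2, 1, 1, 3, 2, 2]
standard = [2, None, 1, 3, 2, 2]  # one short segment left without a call
improved = [2, 1, 1, 3, 3, 2]     # every bin called, one call is wrong

standard_pr = precision_recall(standard, reference)
improved_pr = precision_recall(improved, reference)
```

In this toy example the uncalled bin costs the standard variant recall, while calling every bin recovers it at the price of one false positive, which mirrors the behavior described above.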

Evolutionary development of colon cancer To study the evolutionary development of a tumor we were given a data set containing five colon cancer patients. One sample was excluded right at the beginning of the analysis because the patient had a disease not suited for this task. The purity indicated by PyLOH was a lot lower than desired by the pathologists who took the samples from the tumor. The selection of the areas used for the DNA analysis is primarily based on visual aspects, which does not always deliver optimal results. The segmentation with DNAcopy worked well for three samples. The segmentation for the LM samples had many small segments in any configuration. As expected, standard PyLOH had problems with these samples. The difficulties became apparent in the 50 Mb region of chromosome 18, where we expected to see a loss in all samples. Due to the short segments in this region, PyLOH refused to call these segments. Our improved version found the loss in this region for all samples. To confirm our expectation we used cn.MOPS on the basis of the PyLOH purity estimates, which called a deletion in this region for all except one sample.

Comparison to cn.MOPS In general cn.MOPS calls only shorter segments. On the one hand this results from a different approach to segmentation (cn.MOPS performs the segmentation after the model has been applied); on the other hand it could result from the exome data. The construction of long, consecutive segments could sometimes be misleading, because only a small portion of the DNA was actually sequenced for this analysis. Overall the results of the comparison between PyLOH and cn.MOPS show that the analysis delivered solid results. The final results of cn.MOPS are presumably more realistic than those of PyLOH, because of the more subtle segmentation, which would cause problems for PyLOH.

As mentioned earlier, cn.MOPS and PyLOH are very similar in the way they model the read counts and in how they estimate the model parameters using the EM framework. To solve the identifiability problem, the authors of PyLOH constructed a more comprehensive model. This is necessary for a satisfying analysis of NGS cancer data, but the additional complexity of the model comes at the price of computational effort. PyLOH takes multiple hours on a multi-core CPU for preprocessing alone, because counting the heterozygous sites is an expensive task. The model additionally takes about an hour on a single core, depending on the number of segments. As expected, the improved version takes even longer because it has to process additional segments. In contrast, cn.MOPS takes about a minute for the same sample when the purity and ploidy estimates from PyLOH are used as preprocessing. Nevertheless, cn.MOPS could not deliver satisfying results for tumor data on its own.

Synergy effects Based on the advantages and disadvantages of PyLOH and cn.MOPS, a combination of both methods showed good results. The weakest points of PyLOH are the dependence on the segmentation and the high computational effort. Due to the former, the details are not as fine-grained as in the cn.MOPS results. To further optimize this combination, it should be evaluated whether it is possible to speed up the preprocessing of PyLOH. Additionally, a rough segmentation with fewer than 300 segments seems sensible to avoid the problem of very short segments without heterozygous sites and to shorten the run time of the PyLOH model. In order to retrieve more fine-grained copy number estimates we recommend applying cn.MOPS as shown in the analysis of the CCL data set.

Visualization The karyograms used to visualize the copy number estimates were very helpful for getting an overview of the different change events occurring for each patient. Only by comparing the different CNVs in the karyograms was it possible to construct a rough arrangement of the relations between the samples for each donor. Moreover, they proved to be very helpful in explaining our copy number analysis to the medical doctors at the hospital. Despite all these advantages, it is important to keep in mind the difference between typical karyograms from CGH and our interpretation. It is possible that parts of the chromosomes are mapped to the wrong position in the reference genome if they carry many mutations or if new chromosomes have even developed. This can hardly be analyzed with the given NGS data. A comparison to karyograms from a CGH analysis would be interesting.

As an additional visualization technique we constructed phylogenetic trees. The outcome of using the Euclidean distance in combination with neighbor joining is similar to the arrangement we would have made by inspecting the karyograms. There is potential for further discussion regarding the distance measure. Longer mutations have a larger impact on the distance in a base-by-base comparison of the copy numbers. This could create a misleading picture of the situation, as deletions or amplifications of single exons or even bases could have a similar evolutionary advantage for a tumor branch as mutations of a whole chromosome arm. Therefore it could be rewarding to evaluate an alternative distance measure based on mutational events, where an event is not determined by the number of bases changed but by its uniqueness compared to other or earlier branches of the tumor.
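The effect of segment length on the distance can be illustrated with a small sketch. It assumes that two samples are described on the same bins, each given as (length in bases, copy number). Both function names are hypothetical, and the event count is a deliberately simplified stand-in for an event-based measure: it ignores the uniqueness weighting suggested above and counts every differing bin as its own event.

```python
import math
from typing import List, Tuple

Bin = Tuple[int, int]  # (length in bases, copy number)


def basewise_euclidean(a: List[Bin], b: List[Bin]) -> float:
    """Euclidean distance over all bases; long bins dominate the result."""
    if len(a) != len(b):
        raise ValueError("profiles must use the same bins")
    total = 0
    for (len_a, cn_a), (len_b, cn_b) in zip(a, b):
        if len_a != len_b:
            raise ValueError("bin lengths differ")
        total += len_a * (cn_a - cn_b) ** 2
    return math.sqrt(total)


def event_distance(a: List[Bin], b: List[Bin]) -> int:
    """Number of bins with different copy numbers, regardless of bin length."""
    if len(a) != len(b):
        raise ValueError("profiles must use the same bins")
    return sum(1 for (_, cn_a), (_, cn_b) in zip(a, b) if cn_a != cn_b)


# Toy profiles: b differs from a in a 1 Mb bin, c differs in a 100-base bin.
sample_a = [(1_000_000, 2), (100, 2), (500_000, 3)]
sample_b = [(1_000_000, 1), (100, 2), (500_000, 3)]
sample_c = [(1_000_000, 2), (100, 0), (500_000, 3)]

long_event = basewise_euclidean(sample_a, sample_b)
short_event = basewise_euclidean(sample_a, sample_c)
```

In the base-wise measure the single long change outweighs the short one by a factor of 50, although each sample differs from `sample_a` by exactly one event.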

A problem we discovered in the construction of the phylogenetic trees is that the samples are not necessarily directly related. For some donors all samples shared some mutations, but every sample also had its own unique mutations. Therefore a common ancestor as the root of the tree, or a sample being the descendant of another sample, could not always be identified. The reason for this problem lies in the spots the samples are taken from. The pathologist chooses samples from different regions by visual aspects, which do not allow inferences about the evolutionary development. In this case only a larger number of samples would help to create a more complete picture of the different development stages and branches of the tumor.
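This situation can be expressed with simple set operations. In the sketch below each sample is a set of mutational events. The event tuples, sample names and function names are made up for illustration and do not correspond to actual calls from our data.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical event layout: (chromosome, start, end, copy number)
Event = Tuple[str, int, int, int]


def shared_and_private(samples: Dict[str, Set[Event]]
                       ) -> Tuple[Set[Event], Dict[str, Set[Event]]]:
    """Split events into those shared by all samples and those unique to one."""
    shared = set.intersection(*samples.values()) if samples else set()
    private: Dict[str, Set[Event]] = {}
    for name, events in samples.items():
        others = set().union(*(e for n, e in samples.items() if n != name))
        private[name] = events - others
    return shared, private


def possible_ancestors(samples: Dict[str, Set[Event]]) -> List[str]:
    """Samples whose events are contained in every other sample."""
    return [name for name, events in samples.items()
            if all(events <= other for other in samples.values())]


loss_18 = ("18", 45_000_000, 55_000_000, 1)  # event shared by all samples
donor = {
    "S1": {loss_18, ("1", 10_000_000, 12_000_000, 3)},
    "S2": {loss_18, ("4", 30_000_000, 31_000_000, 3)},
    "S3": {loss_18, ("15", 5_000_000, 6_000_000, 1)},
}
shared, private = shared_and_private(donor)
ancestors = possible_ancestors(donor)
```

For this toy donor the samples share one event, each carries one private event, and no sample qualifies as a possible ancestor of the others, which is the configuration in which no root can be identified among the samples themselves.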

A | Appendix

A.1 Methods