Applications - Finite Alphabet Blind Separation

FABS appears in many different areas, for instance in digital communications and multiuser detection (Proakis, 2007; Talwar et al., 1996; Verdu, 1998; Zhang and Kassam, 2001; Sampath et al., 2001). In wireless digital communication, several digital signals (e.g., binary signals with alphabet A = {0,1}) are modulated (e.g., with pulse amplitude modulation (PAM)), transmitted through several wireless channels (each having a different channel response), and received by (several) antennas. In signal processing this is known as MIMO (multiple input multiple output) and, ignoring time shifts (i.e., considering instantaneous mixtures), it can be described by FABS when the channel response is unknown, see (Talwar et al., 1996; Love et al., 2008). Here, the m sources correspond to m digital signals f¹, ..., f^m and the M mixing vectors ω_·1, ..., ω_·M correspond to the responses of M different channels. The M mixture signals g_·1, ..., g_·M correspond to the received signals at M different antennas (usually corrupted by noise).

Figure 1.1: Illustration of a FABS problem in cancer genetics.

The major motivation for this thesis, however, comes from a cooperation with the Wellcome Trust Centre for Human Genetics at the University of Oxford in the field of cancer genetics, namely, from assigning copy number aberrations (CNAs) in cell samples taken from a tumor to its clones (Yau et al., 2011; Carter et al., 2012; Liu et al., 2013; Ha et al., 2014). In Chapter 7, we decompose a tumor into its clones with the proposed method.

CNAs refer to stretches of DNA in the genome of cancer cells which are under copy number (CN) variation, that is, some parts of the genome are either deleted or multiplied (relative to the inherited germline state present in normal tissue). This is illustrated in Figure 1.1. The yellow cartoon represents normal tissue (healthy cells). Each region of its DNA appears exactly twice, as there are two copies of each chromosome. Hence, the green, red, and blue marked regions in its DNA all have CN 2. The orange cartoon represents tumor cells with a duplication of the red region. Hence, its red region has CN 3, while the blue and green regions have (normal) CN 2. The pink cartoon represents tumor cells with a deletion of the blue region in its DNA. Hence, its blue region has CN 1, while the green and red regions have CN 2. In total, the CN of a single clone’s genome (that is, the number of copies of a DNA stretch at a certain locus) is a step function mapping chromosomal loci to a value i ∈ {0, 1, ..., k} corresponding to i copies of DNA at a locus, where reasonable biological knowledge of k is available. For instance, in the data example which will be analyzed in Chapter 7 the maximal CN is k = 5. CNAs are known to be key drivers of tumor progression through the deletion of “tumor suppressing” genes and the duplication of genes involved in processes such as cell signaling and division. Understanding where, when, and how CNAs occur during tumorigenesis, and their consequences, is a highly active and important area of cancer research, see e.g., (Beroukhim et al., 2010).

CNAs can be measured with whole genome sequencing (WGS), where the DNA is fragmented into pieces, the single pieces are sequenced using short “reads”, and the reads are aligned to a reference genome by a computer. Thus, for example, in a region with CN 1 there are (on average) only half as many reads aligned as in a region with CN 2 (see Figure 1.1 for an illustration). Modern high-throughput technologies allow for routine WGS of cancer samples, and major international efforts are underway to characterize the genetic makeup of all cancers, for example The Cancer Genome Atlas, http://cancergenome.nih.gov/.
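The proportionality between read counts and CN can be illustrated with a small simulation. The following is a minimal sketch under loudly stated assumptions: the helper names `simulate_reads` and `baseline_correct`, the value of 14 reads per copy, the region length, and the Gaussian noise with variance equal to the mean are all chosen purely for illustration and are not the noise model or the preprocessing used in this thesis.

```python
import random


def simulate_reads(cn_profile, reads_per_copy, region_length, rng):
    """Simulate read counts for a piecewise-constant CN profile.

    cn_profile holds one CN per region (a toy input). The expected number
    of reads at a position is reads_per_copy * CN; the Gaussian noise with
    variance equal to the mean is an assumption made for illustration.
    """
    reads = []
    for cn in cn_profile:
        mean = reads_per_copy * cn
        sd = mean ** 0.5
        # Clip at zero because read counts cannot be negative.
        region = [max(0.0, rng.gauss(mean, sd)) for _ in range(region_length)]
        reads.append(region)
    return reads


def baseline_correct(reads, reads_per_copy):
    """Average the reads of each region and divide by the reads per copy."""
    return [sum(region) / len(region) / reads_per_copy for region in reads]


rng = random.Random(0)   # fixed seed, so the run is reproducible
true_cn = [2, 3, 1]      # normal, duplicated, and deleted region
reads = simulate_reads(true_cn, reads_per_copy=14, region_length=200, rng=rng)
corrected = baseline_correct(reads, reads_per_copy=14)
print([round(value, 2) for value in corrected])
```

Dividing the averaged reads by the number of reads per copy brings each region back to the CN scale, so the region with CN 1 ends up at roughly half the level of the region with CN 2.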

A key component of complexity in cancer genetics is the “clonal” structure of many tumors (heterogeneity), which relates to the fact that tumors usually contain distinct cell populations of genetic sub-types (clones), each with a distinct CNA profile, see e.g., (Greaves and Maley, 2012; Shah et al., 2012). This is illustrated in Figure 1.1, where the tumor sample originates from three different types of DNA: the normal tissue (represented by the yellow cartoon) and two different cancer clones, each with different CNAs (represented by the orange and pink cartoons). High-throughput sequencing technologies act by bulk measurement of large numbers of pooled cells in a single sample, extracted by a micro-dissection biopsy (or a blood sample for hematological cancers). Hence, for WGS data of a heterogeneous tumor, the number of reads at a certain locus is proportional to the sum of the CNs of the single clones at that locus, weighted by the relative proportion of each clone in the cell sample.

Summing up, with the notation of FABS in (1.1), in this example the number of sources m corresponds to the number of clones (plus normal tissue), the source functions fⁱ correspond to the CN profiles of the single clones (with CNs only taking values in the finite alphabet {0, 1, 2, ..., k}), the mixing weights ω_i correspond to the relative proportions of the clones in the tumor, and the mixture g corresponds to the overall CN of the tumor. If a cell sample of a tumor is taken at several locations or time points (each with a possibly different relative proportion of the single clones), this corresponds to FABS with several mixtures, where M is the number of different samples.
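The correspondence can be made concrete with the three regions of Figure 1.1. The sketch below computes the overall CN as the weighted sum of the clone profiles; the function name `mix`, the second weight vector, and the use of two samples are hypothetical choices for illustration, and the weights are assumed to sum to one because they are relative proportions.

```python
def mix(weights, sources):
    """Return the mixture g with g[j] = sum_i weights[i] * sources[i][j]."""
    if len(weights) != len(sources):
        raise ValueError("need exactly one weight per source")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixing weights are assumed to sum to one")
    n_loci = len(sources[0])
    return [
        sum(w * f[j] for w, f in zip(weights, sources))
        for j in range(n_loci)
    ]


# CN profiles over the (green, red, blue) regions of Figure 1.1.
normal = [2, 2, 2]
clone1 = [2, 3, 2]  # duplication of the red region
clone2 = [2, 2, 1]  # deletion of the blue region
sources = [normal, clone1, clone2]

# Two hypothetical samples with different clone proportions (M = 2).
weights_per_sample = [(0.2, 0.35, 0.45), (0.5, 0.3, 0.2)]
mixtures = [mix(weights, sources) for weights in weights_per_sample]
for g in mixtures:
    print([round(value, 2) for value in g])
```

With the weights (0.2, 0.35, 0.45), for instance, the red region has overall CN 0.2·2 + 0.35·3 + 0.45·2 = 2.35 and the blue region 0.2·2 + 0.35·2 + 0.45·1 = 1.55, while the green region stays at 2. The demixing problem runs in the opposite direction: only the mixtures are observed, and both the weights and the profiles are unknown.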

The estimation of the mixed function g, i.e., estimating the locations of varying overall CNs, has received considerable interest in the past, see (Olshen et al., 2004; Zhang and Siegmund, 2007; Tibshirani and Wang, 2008; Jeng et al., 2010; Chen et al., 2011; Yau et al., 2011; Niu and Zhang, 2012; Frick et al., 2014; Du et al., 2015). However, the corresponding demixing problem, that is, jointly estimating the number of clones, their proportions, and their CNAs, has been recognized as an important issue only more recently. It has hence received very little attention in a statistical context so far and is a major motivation for this thesis.

We illustrate the ability of the procedure proposed in this thesis (called SLAM) to recover the number of clones, their relative proportions, and their CNAs by applying it to real genetic sequencing data (see Chapter 7). In collaboration with the University of Oxford, we analyzed a data set from a colorectal cancer, which comes from two different clones and normal tissue. The data has the special feature that sequencing data of the single clones is available, something which is not the case for patient cancer samples. Figure 1.2 shows raw data of chromosomes 4, 5, 6, 18, and 20. The x-axis represents the position on the chromosome and the y-axis the number of reads at a certain position (recall the illustration in Figure 1.1).

The top row shows data which comes from normal tissue (germline) and the subsequent rows show two different clones. As sequencing produces artifacts, we preprocess the data with a smoothing filter and binning (see Chapter 7 for details). Dividing the data by the average number of reads per CN, which is 26 for normal tissue and 14 for the clones in this example, yields a baseline correction. The resulting data is displayed in Figure 1.3, where the first row shows a mixture with mixing weights ω^⊤ = (ω_Normal, ω_Clone1, ω_Clone2) = (0.2, 0.35, 0.45).

Figure 1.2: Raw WGS data from cell line LS411. Displayed are chromosomes 4, 5, 6, 18, and 20. The x-axis represents the position on the chromosome and the y-axis the number of reads at a certain position. Top row: germline data. Rows 2 and 3: two different clones.

Figure 1.3: Preprocessed WGS data from Figure 1.2. Top row: total CN of the mixture with ω^⊤ = (ω_Normal, ω_Clone1, ω_Clone2) = (0.2, 0.35, 0.45). Second row: germline data. Rows 3 and 4: two different clones. The red lines show SLAM’s estimates. Threshold parameters, as explained in the following, were q_n(α) = −0.15 (selected with the MVT-method from Section 3.5) and q_n(β) = 20.

Only the data in the first row of Figure 1.3 enters the estimation procedure; the data of the single clones in the subsequent rows serves as ground truth and is used for validation only. SLAM estimates the
