
Figure 6.5: Histogram of m̂_0.95, m̂_0.9, m̂_0.75 as in Definition 3.6.4 (from top to bottom), for ω as in Table 6.1 and f = (f_1, …, f_m) as in (6.3), with standard deviation σ = 0.05, n = 1,000, for m = 1, 2, 3, 4, 5 (from left to right). The red vertical line indicates the true number of source functions m.

σ                0.02                  0.05                     0.1
P(m̂_0.95 > m)   (0, 0, 0, 0, 0)       (0, 0, 0, 0, 0)          (0, 0, 0, 0, 0)
P(m̂_0.90 > m)   (0.001, 0, 0, 0, 0)   (0.001, 0, 0, 0, 0)      (0, 0, 0, 0, 0)
P(m̂_0.75 > m)   (0.005, 0, 0, 0, 0)   (0.002, 0.001, 0, 0, 0)  (0.006, 0, 0, 0, 0)

Table 6.10: Overestimation probability for m = (1, …, 5) of the SLAM selector m̂_{1−α} in (3.24) for ω as in Table 6.1 and f as in (6.3) with n = 1,000.

σ          0.01              0.02                    0.05                 0.1
P(m̂ > m)  (0, 0, 0, 0, 0)   (0.07, 0, 0, 0.004, 0)  (0.079, 0, 0, 0, 0)  (0.088, 0, 0, 0, 0)

Table 6.11: Overestimation probability for m = (1, …, 5) of the SLAM selector m̂ in Definition 3.6.8 for ω as in Table 6.1 and f as in (6.3) with n = 1,000.

6.3 LS approximation

In the following, Lloyd's algorithm from Figure 5.1 is explored in a simulation study. In particular, we want to compare its performance with the theoretical findings for the LSE from Chapter 4, which cannot be computed efficiently (see Section 5.2). Corollary 4.1.4 yields that the LSE achieves optimal rates for the maximal prediction error. The maximal prediction error cannot be simulated efficiently, as the maximum may be attained at any (ω, Π). Instead, we simulate the Bayes risk with uniform priors for ω and Π, respectively. As discussed in Remark 4.1.5, this ensures that ASB(ω) ≥ c_1/M and that Π is c_2/M-separable, for some constants c_1, c_2 > 0, asymptotically almost surely¹.


Figure 6.6: Left: Normalized MSE E‖θ̂ − ΠAω‖²/M for the estimator θ̂ = Π̂Aω̂ of Figure 5.1 for m = 3 sources, a binary alphabet A = {0, 1}, and n = 500 observations, for M ∈ {1, …, 1000} and σ = 0.2, 0.5, 1 (light gray, dark gray, and black line). Top right: MSE on a logarithmic scale. Bottom right: E‖θ̂ − ΠAω‖²/(Mσ²).

We simulate the estimator θ̂ = Π̂Aω̂ of Lloyd's algorithm from Figure 5.1 (see Section 5.2) for m = 2, 3, 4 sources, alphabets A = {0, 1}, {0, 1, 2, 3}, n = 500, 1000 observations, M ∈ {1, …, 1000}, and standard deviations σ = 0.2, 0.5, 1. All results are based on 100,000 simulation runs.
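To make the simulation setup concrete, the following is a minimal sketch of the kind of Monte Carlo experiment described here, assuming the model Y = ΠAω + ε with rows of the finite alphabet design drawn uniformly from A^m and ω drawn uniformly (columns from the simplex). The function names, the random initialization, and the stopping rule are ours and simplified relative to the algorithm in Figure 5.1.

```python
import numpy as np
from itertools import product

def lloyd_fabs(Y, alphabet, m, n_iter=25, seed=0):
    """Lloyd-type alternation: assign each row of Y to the nearest feasible
    mixture value e @ w (e in alphabet^m), then update w by least squares."""
    rng = np.random.default_rng(seed)
    n, M = Y.shape
    E = np.array(list(product(alphabet, repeat=m)), dtype=float)  # all k^m rows e
    w = rng.dirichlet(np.ones(m), size=M).T                       # random start, m x M
    for _ in range(n_iter):
        centers = E @ w                                           # k^m x M mixture values
        idx = ((Y[:, None, :] - centers[None]) ** 2).sum(axis=2).argmin(axis=1)
        F = E[idx]                                                # nearest-center step
        w_new = np.linalg.lstsq(F, Y, rcond=None)[0]              # least-squares step
        if np.allclose(w_new, w):                                 # converged
            break
        w = w_new
    return F @ w                                                  # fitted theta_hat

def mse_sim(m=3, alphabet=(0.0, 1.0), n=500, M=10, sigma=0.5, reps=50, seed=1):
    """Monte Carlo estimate of the per-entry MSE under uniform priors."""
    rng = np.random.default_rng(seed)
    err = 0.0
    for _ in range(reps):
        w = rng.dirichlet(np.ones(m), size=M).T             # uniform prior for omega
        F = rng.choice(np.asarray(alphabet), size=(n, m))   # uniform prior for the design
        theta = F @ w
        Y = theta + sigma * rng.standard_normal((n, M))
        theta_hat = lloyd_fabs(Y, alphabet, m, seed=int(rng.integers(2**31)))
        err += ((theta_hat - theta) ** 2).mean() / reps
    return err

print(mse_sim(M=3), mse_sim(M=50))  # peak near M = m, then decay, cf. Figure 6.6
```

Varying n, m, the alphabet, and σ in this sketch reproduces the qualitative behavior discussed in the following paragraphs.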

Dependence on σ. We simulated the mean squared error (MSE) of the estimator θ̂ = Π̂Aω̂ of Lloyd's algorithm from Figure 5.1 for m = 3 sources, a binary alphabet A = {0, 1}, n = 500 observations, M ∈ {1, …, 1000}, and σ = 0.2, 0.5, 1. The results are shown in Figure 6.6. For larger variances (σ = 1, 0.5) the top left plot of Figure 6.6 shows a peak at M = 3 and, for M > 3, an exponential decay to some limiting value. For the small variance σ = 0.2 the exponential decay starts already at M = 1. The peak can be explained as follows:

When the variance is large, i.e., the signal-to-noise ratio is small (recall that the alphabet A = {0, 1} is fixed), the observations Y in (4.1) are likely to lie outside of the parameter space N ⊂ [0, 1]^{n×M}. Thus, Lloyd's algorithm prefers parameters in N which are close to the boundary of [0, 1]^{n×M}. This corresponds to mixing matrices ω with columns equal to unit vectors, which in general are not close to the true underlying mixing matrix and hence lead to a larger MSE. However, if M < m, such ω have zero alphabet separation boundary, ASB(ω) = 0, and, in particular, a small number of possible mixture values.

¹A uniform prior for ω does not guarantee WSB(ω) ≥ c_1/M asymptotically almost surely. However, this is more of a technical issue, as one can always consider estimation up to permutation matrices P (recall that FPP⁻¹ω = Fω), that is, one considers the equivalence class (F, ω) ∼ (FP, P⁻¹ω) for any permutation P.


Figure 6.7: Left: Normalized MSE E‖θ̂ − ΠAω‖²/M for the estimator θ̂ = Π̂Aω̂ of Lloyd's algorithm from Figure 5.1 for m = 3 sources and a binary alphabet A = {0, 1}, for M ∈ {1, …, 1000} and n = 1000, 500 observations (solid and dashed line) with σ = 0.2, 0.5 (gray and black line). Top right: MSE on a logarithmic scale. Bottom right: E‖θ̂ − ΠAω‖²/(Mσ²).

More precisely, for M < m, when the columns of ω are unit vectors and ω̃ is a different mixing matrix with ASB(ω̃) > 0, then

#{eω : e ∈ A^m} ≤ k^M < k^m = #{eω̃ : e ∈ A^m}.

Consequently, when M < m, mixing matrices ω with columns being unit vectors are not (wrongly) selected by Lloyd's algorithm and hence the estimate fits the true underlying parameters better. On the other hand, as the (maximal) number of mixture values k^m does not grow with M, when M ≥ m choosing the columns of ω as unit vectors only explores certain small (relatively decreasing with M) areas of the boundary of the parameter space N and thus does not lead to a better fit of the data, in general. The peak in the MSE at M = m can be misleading: one might think that for M ≈ m estimating each column of ω separately from the corresponding column of the data matrix Y leads to a smaller MSE than estimating the whole matrix ω at once from the whole data Y. This is not true in general, as can be seen from the following example. When M = 1, estimation of ω ∈ R^{m×1} reduces to estimation of its entries ω_11, …, ω_m1, as their ordering is determined via the relation ω_11 < … < ω_m1. However, when M = 2, estimating its entries ω_ij does not determine ω uniquely from the relation ‖ω_{1·}‖ < … < ‖ω_{m·}‖. In particular, combining estimates of single columns of ω is not feasible.
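The counting inequality above can be checked numerically; a small sketch (the example matrices below are our own, with k = 2, m = 3, M = 2):

```python
import numpy as np
from itertools import product

A, m, M = (0, 1), 3, 2
E = np.array(list(product(A, repeat=m)), dtype=float)   # all k^m = 8 rows e in A^m

w_unit = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # columns = unit vectors
w_sep = np.array([[0.5, 0.2], [0.3, 0.3], [0.2, 0.5]])   # a generic mixing matrix

count = lambda w: len({tuple(r) for r in np.round(E @ w, 12)})  # #{e w : e in A^m}
print(count(w_unit), count(w_sep))   # prints 4 and 8: k^M = 4 < 8 = k^m
```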

The top right plot of Figure 6.6 shows the MSE on a logarithmic scale (where we subtracted the limiting value lim_{M→∞} E‖θ̂ − ΠAω‖²/M). Its linearity suggests an exponential decay, confirming the bounds in Corollary 4.1.4. As in Corollary 4.1.4, the slope in Figure 6.6 (top right) decreases with σ² and the intercept does not depend on σ².


Figure 6.8: Left: Normalized MSE E‖θ̂ − ΠAω‖²/M for the estimator θ̂ = Π̂Aω̂ of Lloyd's algorithm from Figure 5.1 for a binary alphabet A = {0, 1}, with n = 500 observations, for M ∈ {1, …, 1000} and m = 2, 3, 4 (dotted, dashed, and solid line) for σ = 0.2, 0.5 (gray and black line). Top right: MSE on a logarithmic scale. Bottom right: E‖θ̂ − ΠAω‖²/(Mσ²(m − 1)).

The bottom left plot of Figure 6.6 shows the limiting value of the MSE. In the bottom right plot of Figure 6.6 one observes that, just as in Corollary 4.1.4, it scales with σ².

Dependence on n. We simulated the MSE of the estimator θ̂ = Π̂Aω̂ of Lloyd's algorithm from Figure 5.1 for m = 3 sources and a binary alphabet A = {0, 1}, for M ∈ {1, …, 1000} and n = 500, 1000 (and σ = 0.2, 0.5). The results are shown in Figure 6.7. The top right plot of Figure 6.7 shows the MSE on a logarithmic scale (where we subtracted the limiting value). It clearly shows an exponential decay. As in Corollary 4.1.4, the slope in Figure 6.7 (top right) does not depend on n and the intercept increases with n. The bottom left plot of Figure 6.7 shows the limiting value of the MSE; just as in Corollary 4.1.4, it does not depend on n.

Dependence on m. We simulated the MSE of the estimator θ̂ = Π̂Aω̂ of Lloyd's algorithm from Figure 5.1 for a binary alphabet A = {0, 1}, with n = 500 observations, for M ∈ {1, …, 1000} and m = 2, 3, 4 (and σ = 0.2, 0.5). The results are shown in Figure 6.8. The top right plot of Figure 6.8 shows the MSE on a logarithmic scale (where we subtracted the limiting value). It clearly shows an exponential decay. As in the upper bound of Corollary 4.1.4, the slope in Figure 6.8 (top right) decreases with m and the intercept does not depend on m. The bottom left plot of Figure 6.8 shows the limiting value of the MSE; just as in Corollary 4.1.4, it increases with m (scaling with m − 1), in accordance with the limiting constant proportional to m suggested by Corollary 4.1.4.


Figure 6.9: Left: Normalized MSE E‖θ̂ − ΠAω‖²/M for the estimator θ̂ = Π̂Aω̂ of Lloyd's algorithm from Figure 5.1 for m = 3 sources with n = 500 observations, for M ∈ {1, …, 1000} and alphabets A = {0, 1}, {0, 1, 2, 3} (dashed and solid line) for σ = 0.2, 0.5 (gray and black line). Top right: MSE on a logarithmic scale. Bottom right: E‖θ̂ − ΠAω‖²/(Mσ²m).

Dependence on A. We simulated the MSE of the estimator θ̂ = Π̂Aω̂ of Lloyd's algorithm from Figure 5.1 for m = 3 sources with n = 500 observations, for M ∈ {1, …, 1000} and alphabets A = {0, 1}, {0, 1, 2, 3} (and σ = 0.2, 0.5). The results are shown in Figure 6.9. The top right plot of Figure 6.9 shows the MSE on a logarithmic scale (where we subtracted the limiting value). It clearly shows an exponential decay. The slope in Figure 6.9 (top right) does not depend on the alphabet. This suggests that the additional (1 + m a_k)² term in the upper bound of Corollary 4.1.4 is not necessary. The intercept in Figure 6.9 (top right) increases with a_k, as in Corollary 4.1.4. The bottom left plot of Figure 6.9 shows the limiting value of the MSE; just as in Corollary 4.1.4, it does not depend on the alphabet A.

Estimation error. Figure 6.10 shows the simulation results for the estimation error of the estimator θ̂ = Π̂Aω̂ of Lloyd's algorithm from Figure 5.1 for the setting of Figure 6.6, i.e., for m = 3 sources, a binary alphabet A = {0, 1}, and n = 500 observations, for M ∈ {1, …, 1000} and σ = 0.2, 0.5, 1. Again, an exponential decay to some limiting value is observed for E d((Π̂, ω̂), (Π, ω))²/M, P(Π̂ ≠ Π), E max_{i=1,…,m} ‖ω_i − ω̂_i‖²/M, and E‖ΠA − Π̂A‖₂² as M → ∞. Whereas P(Π̂ ≠ Π) → 0 as M → ∞, the limiting value of E max_{i=1,…,m} ‖ω_i − ω̂_i‖²/M, and thus also of E d((Π̂, ω̂), (Π, ω))²/M, scales with σ², which is in accordance with Theorem 4.2.3.


Figure 6.10: From top to bottom: E d((Π̂, ω̂), (Π, ω))²/M from (4.8); the recovery error probability P(Π̂ ≠ Π) (left) and E‖ΠA − Π̂A‖₂² (right); and E max_{i=1,…,m} ‖ω_i − ω̂_i‖²/M for the estimator θ̂ = Π̂Aω̂ of Lloyd's algorithm from Figure 5.1 for m = 3 sources, a binary alphabet A = {0, 1}, and n = 500 observations, for M ∈ {1, …, 1000} and σ = 0.2, 0.5, 1 (light gray, dark gray, and black line).

CHAPTER 7

Applications in cancer genetics

In the following we apply the SLAM procedure from Chapter 3 to real genetic sequencing data from a cancer tumor. Recall from Section 1.1 that a cancer tumor often consists of a few distinct subpopulations, so-called clones, of DNA with distinct CN profiles arising from duplication and deletion of genetic material. The CN profiles of the underlying clones in a sample measurement correspond to the functions f_1, …, f_m, the weights ω_1, …, ω_m correspond to their proportions in the tumor, and the measurements correspond to observations Y as in the SBSR model (1.3).

The most common method for tumor DNA profiling is WGS, which roughly involves the following steps:

1. Tumor cells are isolated, and the pooled DNA is extracted, amplified, and fragmented through shearing into single-strand pieces.

2. Sequencing of the single pieces takes place using short “reads” (at the time of writing, around 10² base pairs long).

3. Reads are aligned and mapped to a reference genome (or the patient's germline genome, if available) with the help of a computer.

Although the observed total reads are discrete (each observation corresponds to an integer number of reads at a certain locus), for sufficiently high sequencing coverage, as is the case in our example with on average around 55 stretches of DNA mapped to each locus, it is well established to approximate this binomial by a normal variate; see, e.g., Liu et al. (2013).
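As a quick numerical illustration of this approximation (with our own toy numbers, not the actual LS411 data): at an average coverage of around 55 reads per locus, binomial read counts and the corresponding normal tail probabilities are already close.

```python
import numpy as np
from scipy import stats

n_frag, p = 10_000, 55 / 10_000        # fragments and per-locus mapping probability
mu, sd = n_frag * p, np.sqrt(n_frag * p * (1 - p))   # mean 55, sd ~ 7.4
for q in (60, 70, 80):                 # compare binomial and normal tail probabilities
    print(q, stats.binom.sf(q, n_frag, p), stats.norm.sf(q, loc=mu, scale=sd))
```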

In the following, SLAM is applied to the cell line LS411, which comes from a colorectal cancer and a paired lymphoblastoid cell line. Sequencing was done through a collaboration of Complete Genomics with the Wellcome Trust Center for Human Genetics at the University of Oxford. These data have the special feature of being generated under a designed experiment using radiation of the cell line (“in vitro”), designed to produce CNAs that mimic real-world CN events. In this case, therefore, the mixing weights and the sequencing data for the individual clones are known, allowing for validation of SLAM's results, something that is not feasible for patient cancer samples.

The data come from a mixture of three different types of DNA, relating to normal (germline) DNA and two different clones. Tumor samples, even from micro-dissection, often contain a high proportion of normal cells, which for our purposes are a nuisance;


Figure 7.1: As Figure 1.3, but with q_n(β) = 2.2, corresponding to β = 0.01.

this is known as “stromal contamination” of germline genomes in the cancer literature. The true mixing weights in our sample are ω⊤ = (ω_Normal, ω_Clone1, ω_Clone2) = (0.2, 0.35, 0.45).

In the following, SLAM is applied only to the mixture data, without knowledge of ω and the sequenced individual clones and germline. The latter (which serve as ground truth) are then used only for validation of SLAM's reconstruction. We restricted attention to regions of chromosomes 4, 5, 6, 18, and 20, as detailed below. Figure 1.2 shows the raw data. Sequencing produces some spatial artifacts in the data, and waviness related to the sequencing chemistry and local GC-content, corresponding to the relative frequency of the DNA bases {C, G} relative to {A, T}. This violates the modeling assumptions. To alleviate this, we preprocess the data with a smoothing filter using local polynomial kernel regression on the normal data, baseline correction, and binning. We used the local polynomial kernel estimator from the R package KernSmooth, with bandwidth chosen by visual inspection. We selected the chromosomal regions above as those showing reasonable denoising, and take the average over bins of 10 data points to make the computation manageable, resulting in n = 7,480 data points spanning the genome. The resulting data are displayed in Figure 1.3, where we can see that the data are much cleaner in comparison with Figure 1.2, although clearly some artifacts and local drift of the signal remain.
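A sketch of this preprocessing pipeline is given below. The thesis used the local polynomial kernel estimator from the R package KernSmooth; here we substitute a LOWESS smoother from statsmodels, and the smoothing fraction, baseline convention, and bin size are our own illustrative choices.

```python
import numpy as np
import statsmodels.api as sm

def preprocess(y, frac=0.02, bin_size=10):
    """Estimate the slowly varying waviness, subtract it (baseline correction),
    and average the corrected data in non-overlapping bins of `bin_size`."""
    x = np.arange(len(y), dtype=float)
    trend = sm.nonparametric.lowess(y, x, frac=frac, return_sorted=False)
    y_corr = y - trend + np.median(y)    # recenter at the overall signal level
    n_bins = len(y_corr) // bin_size
    return y_corr[: n_bins * bin_size].reshape(n_bins, bin_size).mean(axis=1)
```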

For the SLAM procedure we incorporated prior knowledge of constant CN 2 for the normal cells and considered the following separable regions in Algorithm CRW in Figure 3.1: to infer ω_Normal we searched for regions where f_Normal = 2 and f_Clone1 = f_Clone2 = 3, and to infer ω_Clone1 we searched for regions where f_Clone1 = 3 and f_Clone2 = f_Normal = 2. ω_Clone2 was inferred indirectly via ω_Clone2 = 1 − ω_Clone1 − ω_Normal.
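To spell out why these regions identify the weights (a short derivation of ours; g_1 and g_2 denote the noise-free mixture values on the two region types, and we use that the weights sum to one):

```latex
% Noise-free mixture values on the two types of separable regions:
g_1 = 2\,\omega_{\text{Normal}} + 3\,\omega_{\text{Clone1}} + 3\,\omega_{\text{Clone2}}
    = 3 - \omega_{\text{Normal}},
\qquad
g_2 = 3\,\omega_{\text{Clone1}} + 2\,\omega_{\text{Clone2}} + 2\,\omega_{\text{Normal}}
    = 2 + \omega_{\text{Clone1}},
% using omega_Normal + omega_Clone1 + omega_Clone2 = 1.
```

so that ω_Normal = 3 − g_1 and ω_Clone1 = g_2 − 2 can be read off directly from the local mixture levels.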

With σ = 0.21 pre-estimated as in Davies and Kovac (2001), SLAM yields, for α = 0.1, the confidence region C_0.9 = [0.00, 0.23] × [0.30, 0.44] × [0.37, 0.70]. With q_n(α) = −0.15 selected with the MVT method from Section 3.5, we obtain ω̂ = (0.11, 0.36, 0.52). Figure 7.1 shows SLAM's estimates for q_n(β) = 2.2 (which corresponds to β = 0.01). The top row shows the estimate for the total CN, Σ_j ω̂_j f̂_j, and rows 2–4 show f̂_1, f̂_2, and f̂_3. We stress that the data for the single clones are only used for validation purposes and do not enter the estimation process. Inspection of Figure 7.1 shows that artifacts and local drifts of the signal result in an overestimation of the number of jumps.


Figure 7.2: Estimated number of source components m̂(q) (y-axis) as in Definition 3.6.2 for different values of q (x-axis), for the WGS data from Figure 1.3 with true number of source components m = 3 (red horizontal line). The blue dot shows the SLAM selector m̂ as in Definition 3.6.8.

However, the overall appearance of the estimated CNA profile remains quite accurate. This over-fitting effect caused by these artifacts can be avoided by increasing SLAM's tuning parameter q_n(β), at the (unavoidable) cost of losing detection power on small scales (see Figure 1.3, which shows SLAM's estimate for q_n(β) = 20). In summary, Figures 1.3 and 7.1 show that SLAM can yield a highly accurate estimate of the total CNA profile in this example, as well as reasonable CNA profiles and mixing proportions for the clones.

Estimating the number of clonal components. Recall from Section 1.1 that in cancer genetics the number of clones is usually unknown. Therefore, we apply the SLAM selector m̂(q) from Definition 3.6.2 to estimate the number of clonal components in this data example, where m = 3. Figure 7.2 displays m̂(q_{1−α}) in dependence on the threshold q_{1−α} and the probability α, which corresponds to the error of overestimating m, see Theorem 3.6.4. Larger q_{1−α}, and hence smaller α, provide a stronger guarantee, in accordance with Figure 7.2. Remarkably, the estimator m̂(q_{1−α}) = 3 is stable over the range α ∈ (0.001, 0.999). This corresponds to thresholds q ∈ (−0.2, 3.8) in Definition 3.6.2. Finally, the SLAM selector m̂ from Definition 3.6.8 yields the correct number of sources m̂ = m = 3 in this example (see the blue dot in Figure 7.2). The BIC and AIC criteria, however, overestimate the number of sources in this example, with m̂_BIC = m̂_AIC = 7, in accordance with our simulation results in Section 6.2.1. As Figure 7.3 shows, misspecifying the number of clones as m̂ = 2 leads to artificial jumps in the sources and the mixture, respectively (recall Example 1.3.1). However, the estimate still remains quite reasonable in the sense that it tries to combine the two different clones into a single one.


Figure 7.3: As Figure 1.3, but with m̂ = 2 in SLAM.

CHAPTER 8

Outlook and discussion

This thesis considered a unifying treatment of finite alphabet blind separation (FABS) problems, which, to the best of our knowledge, have not been analyzed in this general and comprehensive form so far.

In a first step, the identifiability issue was characterized. Separability was introduced and found to regularize FABS via the minimal ASB δ. In particular, it ensures identifiability for arbitrary alphabets A, numbers of mixtures M, and numbers of sources m, including the situation where m is unknown.

In a statistical setting, we first considered c.p. regression in the SBSR model (1.3). The multiscale procedure SLAM, which estimates the mixing weights ω and the sources f (including the number of source components m) at an (almost) optimal rate of convergence, was introduced.

Moreover, this procedure yields (asymptotically) honest confidence statements for all quantities (including lower confidence bounds for m). The theoretical optimality results were accompanied by a simulation study and a real data example from cancer genetics.

Second, the statistical setting of a multivariate linear model with unknown finite alphabet design was considered. Lower and upper bounds for both the minimax prediction rate and the minimax estimation rate (in terms of the metric d in (4.8)) were derived. Both are attained by the LSE. In particular, the results reveal that the unknown design does not influence the minimax rates when the number of mixtures M is at least of order ln(n), where n is the number of observations. This is in strict contrast to the computational issue: whereas for known design the computation of the LSE amounts to a convex optimization problem, for unknown finite alphabet design as in (1.6) it seems not feasible, as it amounts to minimization over a disjoint union of exponentially many (in n) convex sets. Therefore, we propose a simple Lloyd's algorithm for its approximation. Simulations indicate similar convergence properties as in the theoretical results for the LSE.

In the following we discuss further possible research directions in FABS.

Bayesian FABS. This thesis considers FABS in a frequentist setting, where the data Y in (1.3) and (1.6) have underlying true, fixed, and deterministic mixing weights ω and sources f. Alternatively, one may consider FABS in a Bayesian setting, where ω and f are themselves random variables. For any Bayesian procedure, a prior distribution of the underlying parameters is fundamental. A natural prior distribution for ω in this setting is the uniform distribution on Ω_m (for known m) and on Ω (for unknown m), which have been studied in Sections 2.3.2 and 2.3.3, respectively. For the sources f, a reasonable prior is a Markov process (including i.i.d. sequences), which has been studied in Section 2.3.1. These may be used as a starting point for Bayesian FABS.

SLAM for general error distributions. Although we observed a certain robustness of SLAM to misspecification of the error distribution in our simulation studies, it is natural to ask how the results of this work can be extended to error distributions other than the Gaussian. A natural extension of the Gaussian are sub-Gaussian distributions, where ε is sub-Gaussian with scale parameter σ > 0 if

E e^{tε} ≤ e^{σ²t²/2}, ∀t ∈ ℝ. (8.1)
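As a concrete instance of (8.1), a Rademacher (±1) error is sub-Gaussian with scale parameter σ = 1, since E e^{tε} = cosh(t) ≤ e^{t²/2}; a quick numerical sanity check (ours):

```python
import numpy as np

t = np.linspace(-5, 5, 1001)
mgf = np.cosh(t)                 # E exp(t * eps) for a Rademacher eps
bound = np.exp(t ** 2 / 2)       # right-hand side of (8.1) with sigma = 1
assert np.all(mgf <= bound)      # cosh(t) <= exp(t^2/2) on the whole grid
print("sub-Gaussian bound (8.1) holds for Rademacher noise")
```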

The consistency results for SLAM in Theorems 3.4.2 and 3.6.9 rely on a tail bound for the multiscale statistic T_n in (1.16). For the Gaussian case, (Sieling, 2013, Corollary 4) yields that for all n ∈ ℕ and q > C, for some universal constant C < ∞, P(T_n > q) ≤ exp(−q²/8). When the error terms ε_j in (1.3) are only sub-Gaussian with mean zero and variance σ², one can use strong Gaussian approximation results (Sakhanenko, 1985) to derive a similar bound.

Theorem 8.0.1. Let g ∈ M_{δ,λ}^m for some δ, λ > 0 and m ∈ ℕ, and consider observations Y_j = g(x_j) + ε_j, j = 1, …, n, from the SBSR model (1.3), but with ε_j i.i.d. sub-Gaussian as in (8.1) with mean 0 and variance σ². Let T_n(Y, g) be as in (1.16). Then, for some universal constants 0 < C, C_1, C_2 < ∞, it follows for all q > C that

P(T_n(Y, g) > q) ≤ exp(−q²/32) + (1 + C_2 n) exp(−C_1 q √(nλ)/4).

With Theorem 8.0.1 one can adapt the consistency results for SLAM in Theorem 3.4.2 to the sub-Gaussian case. Defining q_n(σ_n) := δ ln(n)/(17 m a_k σ) explicitly as in (A.43), one obtains that the assertions of Theorem 3.4.2 still hold true for sub-Gaussian noise, but with probability decreased to 1 − exp(−q_n(σ_n)²/32) − (1 + C_2 n) exp(−C_1 q_n(σ_n) √(nλ)/4) (compared to 1 − exp(−q_n(σ_n)²/8) in the Gaussian case). Note that this still converges superpolynomially fast to one. Analogously, the consistency rates for the SLAM selector in Theorem 3.6.9 can be adapted to the sub-Gaussian case, where now P(m̂ ≠ m) is of rate O(√n exp(−c(λ, δ)n)) (instead of O(exp(−c(λ, δ)n)) as in the Gaussian case).

A different extension of a Gaussian error distribution is to consider general one-dimensional exponential families. That is, the observations Y_j in (1.3) are assumed to be independently distributed with ν-density h_{g(x_j)}(z) := exp(g(x_j)z − Ψ(g(x_j))), for some σ-finite measure ν on the Borel sets on ℝ and a cumulant transform Ψ, with the natural parameter space Θ = {θ ∈ ℝ : ∫_ℝ exp(θz) dν(z) < ∞}. Analogously to T_i^j in (1.15), one can consider for the local hypothesis testing problems in (1.14) the log-likelihood ratio statistic

T_i^j(Y, g_ij) := log( sup_{θ∈Θ} ∏_{l=i}^{j} h_θ(Y_l) / ∏_{l=i}^{j} h_{g_ij}(Y_l) ). (8.2)
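For illustration, the following is a minimal sketch of the local statistic (8.2) for the Poisson family (natural parameter θ = log λ, cumulant transform Ψ(θ) = e^θ); the function name and interface are ours, and the constant terms log y! cancel in the likelihood ratio.

```python
import numpy as np

def loglik_ratio_poisson(y, lam0):
    """log of sup_theta prod_l h_theta(y_l) / prod_l h_theta0(y_l), theta0 = log(lam0)."""
    n, ybar = len(y), float(np.mean(y))
    lam_hat = max(ybar, 1e-12)          # MLE of the Poisson mean on the interval
    return n * (ybar * np.log(lam_hat / lam0) - (lam_hat - lam0))

rng = np.random.default_rng(0)
y = rng.poisson(3.0, size=25)           # one local interval of observations
print(loglik_ratio_poisson(y, lam0=3.0))  # small under the null lambda = 3
```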

Combining the local statistics in (8.2) in the same way as in (1.16) yields a corresponding multiscale statistic T_n(Y, g). Dümbgen and Spokoiny (2001), Dümbgen et al. (2006), and Frick et al. (2014) give several results about this multiscale statistic, its limit distribution, and its geometric interpretation, which leads to the definition of boxes B as in (1.18). Combining this with the results of this work yields an extension of SLAM to such distributions. Especially for Poisson and negative binomial distributed Y in (1.3), this might be of particular interest in the context of cancer genetics, as these distributions are often used to model the noise of sequencing data (see Liu et al. (2013) and references therein). Proving consistency results as in Theorems 3.4.2 (and 3.6.11) for general exponential families basically involves three major steps. First, a modulus of continuity for the multiscale statistic T_n(Y, g) (as a function of g) is needed, which boils down to a modulus of continuity of the corresponding cumulant transform
