Statistical models in the presence of censoring

4. Statistical models for the quantification and analysis of cellular SAC phenotypes

4.2. Problem formulation

4.3.1. Statistical models in the presence of censoring

In this section, we derive the statistical models needed to describe data which are subject to censoring. The statistical model of a dataset is the parametric probability density that describes the distribution from which the observations in this dataset are sampled. Censoring transforms this data generating density into an observed density. While we are interested in the data generating density, the experimental data are realizations from the observed density and first and foremost contain information on this density. However, probability theory allows for the derivation of the observed density as a function of the generating densities by taking into account which type of censoring the data are subject to. Therefore, the observed density as a function of the generating densities is our statistical model for the variable of interest. In the following, we derive the observed densities in case of interval censoring, right censoring and the combination of interval and right censoring in the $i$th experiment within a set of experiments.

4.3. Multi-experiment mixture modelling of censored single-cell data (MEMO)

We consider random censoring in a general setup where cells can undergo two mutually exclusive, randomly distributed events of interest, such as cell death and cell division. However, one of the events may as well be a censoring event due to the end of the observation time. Then we have only one event of interest, which is mutually exclusive with a censoring event. We denote Event I in the $i$th experiment by the random variable $X_i$ and Event II in the $i$th experiment by the random variable $C_i$. The $i$th experiment is described by the input variable $u_i$. $X_i$ and $C_i$ are independent random variables with probability densities $f_{X_i}(x_i|\theta,u_i)$ and $f_{C_i}(c_i|\theta,u_i)$, respectively. The cumulative distributions of $X_i$ and $C_i$ are denoted by $F_{X_i}(x_i|\theta,u_i)$ and $F_{C_i}(c_i|\theta,u_i)$, respectively. The densities $f_{X_i}(x_i|\theta,u_i)$ and $f_{C_i}(c_i|\theta,u_i)$ are in general assumed to be given by a mixture model as defined in Equation (4.1). Censoring transforms these densities into the observed densities. Therefore, we use additional random variables associated with the observed densities. They are introduced where needed in the following sections. The observed densities depend on the type of censoring, and the respective models and corresponding likelihood functions are derived in the following.

The statistical model in the absence of censoring

For completeness we start with the model for data in the absence of censoring. If only one event is possible or the supports of the event generating densities do not overlap, no censoring occurs, all observations are exact and the data are uncensored or complete. Under these circumstances the generating densities and observed densities are identical.

Consider the case in which $X_i$ is the only event to occur and our measurement process does not cause censoring. We denote the random variable representing the observations by $Y_i$. For i.i.d. observations, uncensored data $D_i = \{y_i^j\}_{j=1,\dots,n_{y,i}}$ of $n_{y,i}$ observations are direct samples from the data generating density and the probability density of $Y_i$ is

$$f_{Y_i}(y_i|\theta,u_i) = f_{X_i}(x_i|\theta,u_i).$$

Here the data provide information about the full data generating probability density, enabling reliable reconstruction for sufficiently large sample numbers $n_{y,i}$. This does not ensure that the parameters $\theta$ are identifiable. For mixture models, for instance, the problem of symmetry is well-known (Stephens, 2000).

In the absence of censoring, the likelihood function for data $D_i$ is given by

$$P(D_i|\theta) = \prod_{j=1}^{n_{y,i}} f_{Y_i}(y_i^j|\theta,u_i).$$
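The uncensored likelihood and the symmetry problem mentioned above can be illustrated with a minimal sketch. It assumes a two-component log-normal mixture as a stand-in for the mixture model of Equation (4.1), which is not reproduced in this section; the function names, parameter values and observations are hypothetical.

```python
import math

def lognormal_pdf(x, mu, sigma):
    """Density of a log-normal distribution; zero for non-positive x."""
    if x <= 0.0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weight, comp1, comp2):
    """Two-component mixture: weight * f_1(x) + (1 - weight) * f_2(x)."""
    return (weight * lognormal_pdf(x, *comp1)
            + (1.0 - weight) * lognormal_pdf(x, *comp2))

def uncensored_loglik(y_obs, weight, comp1, comp2):
    """Log of the product over j of f_Y(y_j), i.e. the sum of log densities."""
    return sum(math.log(mixture_pdf(y, weight, comp1, comp2)) for y in y_obs)

# Hypothetical uncensored event times (e.g. minutes).
y_obs = [18.0, 22.5, 24.0, 61.0, 75.5]

# theta = (weight, (mu_1, sigma_1), (mu_2, sigma_2)); values are assumptions.
theta = (0.6, (math.log(22.0), 0.2), (math.log(70.0), 0.25))
# Relabelling the two components gives a different parameter vector ...
theta_swapped = (0.4, theta[2], theta[1])

# ... but the same likelihood value, which is the symmetry problem.
print(uncensored_loglik(y_obs, *theta))
print(uncensored_loglik(y_obs, *theta_swapped))
```

Both calls evaluate the same density, so the data alone cannot distinguish the two labellings.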

The statistical model accounting for interval censoring

For interval censoring we denote the random variable representing the observed censored quantity in the $i$th experiment by $\bar{Y}_i$. An interval censored observation $\bar{y}_i$ provides the information that the corresponding exact value $x_i$ lies in the interval $(\bar{y}_i-\Delta x, \bar{y}_i]$. The interval length is denoted by $\Delta x$. Accordingly, for experimental condition $u_i$ the dataset consists of realizations from

$$f_{\bar{Y}_i}(\bar{y}_i|\theta,u_i) = \int_{\bar{y}_i-\Delta x}^{\bar{y}_i} f_{X_i}(x_i|\theta,u_i)\,dx_i = F_{X_i}(\bar{y}_i|\theta,u_i) - F_{X_i}(\bar{y}_i-\Delta x|\theta,u_i)$$

with cumulative distribution

$$F_{X_i}(x_i|\theta,u_i) := \int_{-\infty}^{x_i} f_{X_i}(x_i'|\theta,u_i)\,dx_i'.$$

Interval censored data $D_i = \{\bar{y}_i^l\}_{l=1,\dots,n_{\bar{y},i}}$ provide information about the probability mass between two observation points. The precise shape of the probability density between observation points cannot be reconstructed but is merely restricted by the chosen distribution type. In the presence of interval censoring, the likelihood function for data $D_i$ is

$$P(D_i|\theta) = \prod_{l=1}^{n_{\bar{y},i}} f_{\bar{Y}_i}(\bar{y}_i^l|\theta,u_i) = \prod_{l=1}^{n_{\bar{y},i}} \left( F_{X_i}(\bar{y}_i^l|\theta,u_i) - F_{X_i}(\bar{y}_i^l-\Delta x|\theta,u_i) \right).$$

Here we assume that the length of all intervals is identical. This can easily be generalized.
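As an illustration, the interval censored likelihood can be evaluated as a sum of log probability masses. The following is a minimal sketch, assuming a single log-normal generating density instead of a full mixture; the function names, the interval length and the observations are hypothetical.

```python
import math

def lognormal_cdf(x, mu, sigma):
    """Cumulative distribution of a log-normal; zero for non-positive x."""
    if x <= 0.0:
        return 0.0
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

def interval_censored_loglik(y_obs, dx, mu, sigma):
    """Sum over l of log(F_X(y_l) - F_X(y_l - dx)).

    Each y_l is the upper end of an interval (y_l - dx, y_l] of equal length dx.
    """
    total = 0.0
    for y in y_obs:
        mass = lognormal_cdf(y, mu, sigma) - lognormal_cdf(y - dx, mu, sigma)
        if mass <= 0.0:
            return -math.inf  # the model assigns no probability to this interval
        total += math.log(mass)
    return total

# Hypothetical observations: upper ends of 5-minute imaging intervals.
y_obs = [20.0, 25.0, 25.0, 30.0, 40.0]
print(interval_censored_loglik(y_obs, dx=5.0, mu=math.log(25.0), sigma=0.3))
```

Intervals of different lengths could be handled by passing one `dx` per observation instead of a shared value.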

The statistical model accounting for right censoring

For the derivation of the model, we consider two competing processes, one generating actual observations of the process of interest and the second generating censoring observations such as the end of recording. Mutual exclusiveness in the context of right censoring has the effect that only the event occurring first can be detected and recorded, as described in Section 4.1.1. In the presence of random right censoring due to a competing process, observations of the quantity of interest $\{y_i^j\}$ and observations of censoring $\{\underline{y}_i^k\}$ are recorded. These are realizations of the conditional random variables $Y_i := X_i \,|\, X_i \le C_i$ and $\underline{Y}_i := C_i \,|\, C_i \le X_i$, respectively. In the following we derive the densities of $Y_i$ and $\underline{Y}_i$ from the densities of $X_i$ and $C_i$.

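The densities of uncensored and right censored observations for experimental condition $u_i$ are the conditional probability densities

$$f_{Y_i}(y_i|\theta,u_i) = f_{X_i|X_i\le C_i}(x_i|\theta,u_i) = \frac{f_{X_i,X_i\le C_i}(x_i|\theta,u_i)}{P(X_i\le C_i|\theta,u_i)},$$

$$f_{\underline{Y}_i}(\underline{y}_i|\theta,u_i) = f_{C_i|C_i\le X_i}(c_i|\theta,u_i) = \frac{f_{C_i,C_i\le X_i}(c_i|\theta,u_i)}{P(C_i\le X_i|\theta,u_i)}.$$

Since $X_i$ and $C_i$ are independent, the joint densities in the numerators are $f_{X_i,X_i\le C_i}(x_i|\theta,u_i) = f_{X_i}(x_i|\theta,u_i)\,(1-F_{C_i}(x_i|\theta,u_i))$ and $f_{C_i,C_i\le X_i}(c_i|\theta,u_i) = f_{C_i}(c_i|\theta,u_i)\,(1-F_{X_i}(c_i|\theta,u_i))$. Integrating them yields the probabilities in the denominators (derivation provided in Appendix A)

$$P(X_i\le C_i|\theta,u_i) = \int_{-\infty}^{\infty} f_{X_i}(x_i|\theta,u_i)\,(1-F_{C_i}(x_i|\theta,u_i))\,dx_i,$$

$$P(C_i\le X_i|\theta,u_i) = \int_{-\infty}^{\infty} f_{C_i}(c_i|\theta,u_i)\,(1-F_{X_i}(c_i|\theta,u_i))\,dc_i.$$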

As analytical solutions of $P(X_i\le C_i|\theta,u_i)$ and $P(C_i\le X_i|\theta,u_i)$ are often not available, numerical integration might be necessary (Cook, 2008).
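A minimal sketch of such a numerical integration is given below. It assumes log-normal densities for both competing events and uses a plain trapezoidal rule on a truncated domain; the function names, the truncation bound `upper` and all parameter values are hypothetical, and this is not necessarily the scheme discussed by Cook (2008).

```python
import math

def lognormal_pdf(x, mu, sigma):
    """Density of a log-normal distribution; zero for non-positive x."""
    if x <= 0.0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

def lognormal_cdf(x, mu, sigma):
    """Cumulative distribution of a log-normal; zero for non-positive x."""
    if x <= 0.0:
        return 0.0
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

def prob_first(mu_a, sigma_a, mu_b, sigma_b, upper=1000.0, n=20000):
    """Trapezoidal approximation of the integral of f_A(t) (1 - F_B(t)) on [0, upper].

    This is P(A <= B) for independent A and B with positive support,
    provided that `upper` covers essentially all of the mass of A.
    """
    def integrand(t):
        return lognormal_pdf(t, mu_a, sigma_a) * (1.0 - lognormal_cdf(t, mu_b, sigma_b))

    h = upper / n
    total = 0.5 * (integrand(0.0) + integrand(upper))
    for k in range(1, n):
        total += integrand(k * h)
    return total * h

# Hypothetical parameters: event of interest X and competing event C.
mu_x, sigma_x = math.log(30.0), 0.4
mu_c, sigma_c = math.log(60.0), 0.3

p_x_first = prob_first(mu_x, sigma_x, mu_c, sigma_c)  # P(X <= C)
p_c_first = prob_first(mu_c, sigma_c, mu_x, sigma_x)  # P(C <= X)
print(p_x_first, p_c_first)
```

For two independent log-normal variables the probability happens to be available in closed form, because $\ln X_i - \ln C_i$ is normally distributed. This makes the log-normal case a convenient check of the quadrature before it is applied to mixtures.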

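The density of $C_i$ can have different shapes. In the case of random censoring, meaning that $f_{C_i}(c_i|\theta,u_i)$ is a smooth distribution, the likelihood function for data

$$D_i = \left\{ \{y_i^j\}_{j=1,\dots,n_{y,i}},\ \{\underline{y}_i^k\}_{k=1,\dots,n_{\underline{y},i}} \right\}$$

is proportional to

$$P(D_i|\theta) \propto \left( \prod_{j=1}^{n_{y,i}} f_{Y_i}(y_i^j|\theta,u_i) \right) \left( \prod_{k=1}^{n_{\underline{y},i}} f_{\underline{Y}_i}(\underline{y}_i^k|\theta,u_i) \right)$$

$$\propto \left( \prod_{j=1}^{n_{y,i}} f_{X_i}(y_i^j|\theta,u_i)\,(1-F_{C_i}(y_i^j|\theta,u_i)) \right) \left( \prod_{k=1}^{n_{\underline{y},i}} f_{C_i}(\underline{y}_i^k|\theta,u_i)\,(1-F_{X_i}(\underline{y}_i^k|\theta,u_i)) \right).$$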

In case of fixed Type I censoring at a single value $\tilde{c}_i$ such that $\underline{y}_i^k = \tilde{c}_i$ for all $k = 1,\dots,n_{\underline{y},i}$, which corresponds to a probability density which is a Dirac delta, $f_{\tilde{C}_i}(c_i|\theta,u_i) = \delta(c_i-\tilde{c}_i)$, the likelihood function simplifies to

$$P(D_i|\theta) \propto \left( \prod_{j=1}^{n_{y,i}} f_{X_i}(y_i^j|\theta,u_i) \right) \left( \prod_{k=1}^{n_{\underline{y},i}} (1-F_{X_i}(\underline{y}_i^k|\theta,u_i)) \right).$$

This formulation exploits the tail probabilities $1-F_{X_i}(\underline{y}_i^k|\theta,u_i)$ to capture the censoring.
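The role of the tail probabilities can be illustrated with a minimal sketch. It assumes an exponential generating density, chosen only because its maximum likelihood estimate under right censoring has a simple closed form; the function names, the censoring time and the observations are hypothetical.

```python
import math

def exp_logpdf(x, rate):
    """Log density of an exponential distribution, log f_X(x)."""
    return math.log(rate) - rate * x

def exp_logsf(x, rate):
    """Log tail probability of an exponential distribution, log(1 - F_X(x))."""
    return -rate * x

def type1_censored_loglik(y_exact, y_censored, rate):
    """Sum of log f_X over exact observations plus log(1 - F_X) over censored ones."""
    ll = sum(exp_logpdf(y, rate) for y in y_exact)
    ll += sum(exp_logsf(y, rate) for y in y_censored)
    return ll

# Hypothetical data: four observed events, three cells censored at c_tilde.
y_exact = [12.0, 30.5, 44.0, 51.2]
c_tilde = 60.0
y_censored = [c_tilde] * 3

# For the exponential model the maximiser is (number of events) / (total time at risk).
rate_hat = len(y_exact) / (sum(y_exact) + sum(y_censored))
print(rate_hat, type1_censored_loglik(y_exact, y_censored, rate_hat))
```

Dropping the censored cells instead would remove the time at risk they contribute and bias the estimated rate upwards.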

Note that this likelihood function can also be used to avoid explicit modelling of the censoring process as a probability density. While this still allows for inference, a visual comparison of model and data requires an estimate of the censoring density (Geissen et al., 2016), since $f_{Y_i}$ and $f_{\underline{Y}_i}$ have to be evaluated for this purpose. Furthermore, both $f_{X_i}(x_i|\theta,u_i)$ and $f_{C_i}(c_i|\theta,u_i)$ are needed to resample data for a goodness-of-fit analysis based on bootstrapping of the likelihood distribution of the objective function.

The statistical model accounting for interval and right censoring

In the presence of interval and right censoring, interval censored observations $\{\bar{y}_i^l\}$ and right censored observations $\{\underline{y}_i^k\}$ are recorded in experimental condition $i$. These observations are realizations of the random variables $\bar{Y}_i$ and $\underline{Y}_i$, respectively. To derive $\bar{Y}_i$ and $\underline{Y}_i$ and their respective densities from $X_i$ and $C_i$ we need to make an intermediate step and create the random variables $X_i^+$ and $C_i^+$ first. $X_i^+$ and $C_i^+$ are derived from $X_i$ and $C_i$ by discretisation.

Loosely speaking, realizations of $X_i$ and $C_i$ are binned according to the censoring interval $\Delta x$. Binning here equals rounding $x_i$ and $c_i$ up to the next multiple of $\Delta x$. This yields the smallest multiple of $\Delta x$, $x_i^+$, which is greater than or equal to $x_i$, and correspondingly $c_i^+$. Without loss of generality we assume that measured time points are multiples of $\Delta x$, such that for all $i$, $l$ and $k$ there exist integers $k'$ and $k''$ with $\bar{y}_i^l = k'\Delta x$ and $\underline{y}_i^k = k''\Delta x$. The densities of the conditional random variables $\bar{Y}_i := X_i^+ \,|\, X_i^+ \le C_i^+$ and $\underline{Y}_i := C_i^+ \,|\, C_i^+ \le X_i^+$ for experimental condition $u_i$ are then derived as

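$$f_{\bar{Y}_i}(\bar{y}_i|\theta,u_i) = f_{X_i^+|X_i^+\le C_i^+}(x_i^+|\theta,u_i) = \frac{f_{X_i^+,X_i^+\le C_i^+}(x_i^+|\theta,u_i)}{P(X_i^+\le C_i^+|\theta,u_i)},$$

$$f_{\underline{Y}_i}(\underline{y}_i|\theta,u_i) = f_{C_i^+|C_i^+\le X_i^+}(c_i^+|\theta,u_i) = \frac{f_{C_i^+,C_i^+\le X_i^+}(c_i^+|\theta,u_i)}{P(C_i^+\le X_i^+|\theta,u_i)},$$

with joint distributions

$$f_{X_i^+,X_i^+\le C_i^+}(x_i^+|\theta,u_i) = \left( F_{X_i}(x_i^+|\theta,u_i) - F_{X_i}(x_i^+-\Delta x|\theta,u_i) \right) (1-F_{C_i}(x_i^+|\theta,u_i)),$$

$$f_{C_i^+,C_i^+\le X_i^+}(c_i^+|\theta,u_i) = \left( F_{C_i}(c_i^+|\theta,u_i) - F_{C_i}(c_i^+-\Delta x|\theta,u_i) \right) (1-F_{X_i}(c_i^+|\theta,u_i)),$$

and marginal probabilities for observing uncensored or censored data,

$$P(X_i^+\le C_i^+|\theta,u_i) = \sum_{k'\in\mathbb{Z}} \left( F_{X_i}(k'\Delta x|\theta,u_i) - F_{X_i}((k'-1)\Delta x|\theta,u_i) \right) (1-F_{C_i}(k'\Delta x|\theta,u_i)),$$

$$P(C_i^+\le X_i^+|\theta,u_i) = \sum_{k''\in\mathbb{Z}} \left( F_{C_i}(k''\Delta x|\theta,u_i) - F_{C_i}((k''-1)\Delta x|\theta,u_i) \right) (1-F_{X_i}(k''\Delta x|\theta,u_i)).$$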

The cumulative distributions of $X_i$ and $C_i$ are denoted by $F_{X_i}(x_i|\theta,u_i)$ and $F_{C_i}(c_i|\theta,u_i)$, respectively.
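The discretisation step and the marginal probabilities can be sketched as follows. The example assumes log-normal generating distributions, so the sums start at $k = 1$ because all probability mass lies on positive values, and they are truncated at a finite `k_max`; the function names and parameter values are hypothetical.

```python
import math

def lognormal_cdf(x, mu, sigma):
    """Cumulative distribution of a log-normal; zero for non-positive x."""
    if x <= 0.0:
        return 0.0
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

def round_up(value, dx):
    """Smallest multiple of dx that is greater than or equal to value."""
    return math.ceil(value / dx) * dx

def prob_event_first(cdf_event, cdf_other, dx, k_max):
    """Truncated sum over k = 1..k_max of
    (F_event(k dx) - F_event((k - 1) dx)) * (1 - F_other(k dx))."""
    total = 0.0
    for k in range(1, k_max + 1):
        mass = cdf_event(k * dx) - cdf_event((k - 1) * dx)
        total += mass * (1.0 - cdf_other(k * dx))
    return total

# Hypothetical generating distributions (times in minutes).
def F_X(t):
    return lognormal_cdf(t, math.log(30.0), 0.4)

def F_C(t):
    return lognormal_cdf(t, math.log(60.0), 0.3)

dx = 5.0
print(round_up(12.3, dx))  # binned value x^+ for an exact time of 12.3
p_x = prob_event_first(F_X, F_C, dx, k_max=400)  # P(X^+ <= C^+)
p_c = prob_event_first(F_C, F_X, dx, k_max=400)  # P(C^+ <= X^+)
print(p_x, p_c)
```

In this sketch the two probabilities approach their continuous counterparts as `dx` shrinks, which offers a simple consistency check against the right censoring model without discretisation.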

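In the case of random censoring the likelihood function for data

$$D_i = \left\{ \{\bar{y}_i^l\}_{l=1,\dots,n_{\bar{y},i}},\ \{\underline{y}_i^k\}_{k=1,\dots,n_{\underline{y},i}} \right\}$$

is proportional to

$$P(D_i|\theta) \propto \left( \prod_{l=1}^{n_{\bar{y},i}} f_{\bar{Y}_i}(\bar{y}_i^l|\theta,u_i) \right) \left( \prod_{k=1}^{n_{\underline{y},i}} f_{\underline{Y}_i}(\underline{y}_i^k|\theta,u_i) \right)$$

$$\propto \left( \prod_{l=1}^{n_{\bar{y},i}} \left( F_{X_i}(\bar{y}_i^l|\theta,u_i) - F_{X_i}(\bar{y}_i^l-\Delta x|\theta,u_i) \right) (1-F_{C_i}(\bar{y}_i^l|\theta,u_i)) \right) \left( \prod_{k=1}^{n_{\underline{y},i}} \left( F_{C_i}(\underline{y}_i^k|\theta,u_i) - F_{C_i}(\underline{y}_i^k-\Delta x|\theta,u_i) \right) (1-F_{X_i}(\underline{y}_i^k|\theta,u_i)) \right).$$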


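As before, for fixed Type I censoring at a value $\tilde{c}_i$, the likelihood function simplifies. The factors for the interval censored observations reduce to the probability mass of the respective interval, and the factors for the right censored observations reduce to the tail probabilities:

$$P(D_i|\theta) \propto \left( \prod_{l=1}^{n_{\bar{y},i}} \left( F_{X_i}(\bar{y}_i^l|\theta,u_i) - F_{X_i}(\bar{y}_i^l-\Delta x|\theta,u_i) \right) \right) \left( \prod_{k=1}^{n_{\underline{y},i}} (1-F_{X_i}(\underline{y}_i^k|\theta,u_i)) \right).$$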

From the document: A statistical and mechanistic, model-based analysis of spindle assembly checkpoint signalling (pages 46-51)