Hobolth, Uyenoyama & Wiuf’s Scheme

4. Importance Sampling in the Infinitely-Many-Sites-Model 89

4.2. Importance Sampling Schemes

4.2.3. Hobolth, Uyenoyama & Wiuf’s Scheme

In [HUW08] the authors introduced a proposal distribution for the infinitely many sites model where the underlying genealogy is given by Kingman’s coalescent. The basic idea behind their proposal distribution is to treat each mutation as if it is the only mutation in the genetree. In the case of just one mutation, the optimal proposal weights can be calculated explicitly. The proposal weights from each mutation are then suitably combined to obtain a proposal distribution for samples with an arbitrary number of mutations. We present a possible generalisation of their proposal distribution to the Λ-coalescent scenario and give a short summary of the work in [HUW08] later.

Recall from Section 3.3.1 that in the infinitely many sites model the genetree of the sample configuration can also be represented by a matrixS = (si,m)1≤i≤d,1≤m≤s, where s_i,m= 1 if and only if typeibears a mutation at the segregating sitem. We denote by d_m =P_d

i=1s_i,mn_i the number of individuals in the sample bearing a mutation at site m.

For convenience denote by

Mⁿd :=

(0),(0,1)

∼,(n−d, d)

the sample configuration with one segregating site wheredindividuals bear the mutation at the segregating site and n−ddo not. Denote by

Mⁿ0 :=

(0)

∼,(n)

n − d

d

(a) The sample configuration [Mⁿ_d]≈.

n

(b) The sample configuration [Mⁿ₀]≈.

Figure 4.1.: The sample configurations [Mⁿd]_≈ and [Mⁿ0]_≈. The sample has one segrat-ing site respectively no segregatsegrat-ing site.

the configuration where all n individuals share the same type. These sample config-urations are illustrated in Figure 4.1. Note that in the sequel whenever we present illustrations of such samples we will use the corresponding equivalence class under ≈ out of convenience.

In case of one segregating site let p(n, d) for n≥2 be the probability that the most recent event affected an individual out of thedindividuals bearing the mutation, that is

p_θ(n, d) =

(Pd−1

l=1 Q^∗_θ(Mⁿ_d → M^n−l_d−l) ifd >1 Q^∗θ(Mⁿ1 → Mⁿ0) ifd= 1.

In the first case the most recent event was a merger of any size involving individuals bearing the mutation, whereas in the second case the last event was the origin of the mutation.

To utilise this probability for a general sample (t,n) fix a type i and a segregating site m. Say that the type i is involved in this most recent evolutionary event with probability

u^(m)_θ (i) :=

(p_θ(n, dm)_dⁿⁱ

m ifsi,m = 1,

(1−p_θ(n, d_m))_n−dⁿⁱ

m ifs_i,m = 0.

In the first case typei bears the mutation at sitem, thus the probability for the most recent event affecting individuals carrying the mutation has to be multiplied by ⁿ_dⁱ since this is the fraction of individuals with typeiamong the individuals bearing the mutation at sitem. In the second case a similar factor shows up due to the same reasoning among the individuals not carrying the mutation. In the remainder we refer to the genetree Mⁿ_d_m as thecompressed genetree.

For a general sample consider the contribution of all segregating sites. We expect the method to perform well, since types bearing more mutations will get a larger weight and thus are involved in the most recent event with a higher probability, which seems rather intuitive.

Definition 4.15(Proposal distributionQ^H_θ¹). Denote by (t,n) the current state of the

0 0

1 2

4 2 5 4

2 1 3 3

carrying:

0 4 =n−d1

1 8 =d1

0 8

5 4 i

not carrying:

0 11

2 1

i 0 9

3 3 i

0 10

4 2 i

Figure 4.2.: A sample configuration is depicted on the left and a typeiis marked. On the right all possible compressed genetrees are listed. The type corresponding to i is marked in the compressed genetrees. Either type i corresponds to the type carrying the mutation or not.

proposal Markov chainQ^H_θ¹. Propose type ito be involved in the most recent event with probability

Q^H_θ¹(i|t,n)∝ (P

mu^(m)_θ (i) if iis allowed to act

0 else, (4.20)

where type i is allowed to act if either n_i ≥ 2 or type i corresponds to a leaf node in the associated genetree. If the multiplicity of the chosen type is one remove the outmost mutation. If the multiplicity is larger than one, perform a merger inside this group. The size of the merger is chosen to be l+ 1 with probability proportional to Q^∗θ(Mⁿdo → M^n−l_d_o_−l), where d_o is the number of individuals in the sample bearing the outmost mutation of the chosen type and1≤l≤n_i−1. If the selected typeicarries no mutation the merging size is chosen proportional toQ^∗_θ(Mⁿ₀ → M^n−l₀ )whereQ^∗_θ(Mⁿ₀ → M¹0) = ^g(^Pⁱ_G(Mⁿⁱ^,n)q(n,1))n

0) is the probability of jumping to the terminal state.

Remark 4.16. The quantities p_θ(n, d) and the proposal of the merging size involve the optimal proposal distribution for samples with at most one segregating site so these quantities can be computed explicitly.

Figure 4.2 depicts a sample configuration and all corresponding compressed genetrees with one mutation. If the underlying genealogy is given by Kingman’s coalescent, then the transition probabilities can be computed explicitly.

Explicit Formulae for the Kingman case

The authors in [HUW08] developed their proposal distribution in the Kingman case (Λ = δ₀). In this case the transition probabilities of the Markov urn introduced in Definition 3.28 take the form

(t,n)→





(t,n+e_i

w.p. ⁿ_nⁱ_n−1+θⁿ⁻¹ a_i(t),n

w.p. _n¹_n−1+θ^θ ifn_i = 1 e_i(t),e(n−ei)

w.p. _d+1¹ ⁿ_nⁱ_n−1+θ^θ ifni >1,

(4.21)

whereU⁽ⁿ^f⁾(1) = [(0)]_∼,(2)

almost surely anda_i,e_i and eas in Definition 3.28. This Markov chain was introduced in [EG87, Section 5] and restated in [HUW08]. Note that the transition probabilities do not depend on the final sample size, soU⁽ⁿ^f⁾ =U. This fact was also remarked in the proof of Proposition 1 from [BB08]. Furthermore, either the number of lines or the number of mutations increase by one in each step and therefore the sample complexity comp(t,n) = s+n increases in each step by exactly one where, as before,n=Pd

i=1n_i andsis the number of segregating sites in (t,n). It follows that the Markov urn can visit a given state (t,n) only at the times+n−1 and immediately leaves it in the next step, thus the Green function simplifies to

G⁽ⁿ^f⁾(t,n) =E_µ X∞

k=0

1{U(k)=(t,n)}=P_θ

U(s+n−1) = (t,n) (4.22)

and the probability of observing the sample (t,n) at stationarity is given by p(t,n) =P_θ

U(s+n−1) = (t,n) n−1 n−1 +θ.

The authors in [HUW08] investigated the case of only one segregating site in the sample and showed that

P_θ{U(n) =Mⁿd}=

n−d+1X

k=2

θ k−1 +θ

n−2Y

l=1

l l+θ

n−d−1 k−2

n−1 k−1

−1

(4.23)

holds, see [HUW08, Appendix]. A main ingredient of the proof is a result from [S00b]

on the time at which the unique mutation occurred in the coalescent tree.

Recall that in Kingman’s coalescent only binary merging is possible. For d ≥ 2 Theorem 4.4 and the special form of the Green function (4.22) yield

p_θ(n, d) =P_θ

U(n−1) =Mⁿ⁻¹_d−1U(n) =Mⁿd

=P_θ

U(n) =Mⁿd

U(n−1) =Mⁿ⁻¹_d−1 P_θ

U(n) =Mⁿ_d

= d−1 n−1

n−2 n−2 +θ

P_n−d+1

k=2 θ

k−1+θ

Q_n−3

l=1 l l+θ

n−d−1 k−2

_n−2

k−1

−1

P_n−d+1

k0=2 θ k0−1+θ

Q_n−2

l=1 l l+θ

n−d−1 k0−2

_n−1

k0−1

−1

Pn−d+1

k=2 1

k−1+θd−1 n−k

n−d−1 k−2

_n−1

k−1

−1

P_n−d+1

k0=2 1 k0−1+θ

n−d−1 k0−2

_n−1

k0−1

−1

for the probability that individuals bearing the mutation were involved in the most recent event, which is a merger of size 2 among those individuals in this case. This coincides with equation (11) in [HUW08]. For d = 1 the probability that the most

recent event was the origin of the mutation is given by P

U(n−1) =Mⁿ0

U(n) =Mⁿ1

U(n−1) =Mⁿ0

U(n) =Mⁿ1

= 1 2

θ n−1 +θ

Q_n−1

l=1 l

Pn l+θ

k=2 θ

k−1+θ

Qn−1 l=1 l

l+θ n−2 k−2

_n−1

k−1

−1

= 1 2

1 n−1+θ

P_n

k=2 1

k−1+θk−1 n−1

which basically coincides with equation 14 in [HUW08]. The factor ¹₂ vanishes if Mⁿ_d is replaced by [Mⁿd]_≈.

Regarding Pairs of Mutations

In this section we extend the approach of [HUW08] to consider pairs of mutations. First note that there are two kinds of structurally distinct genetrees with two mutations.

Definition 4.17. Let

Mⁿd1,d2 :=

(0),(0,1),(0,2)

∼,(n−d₁−d₂, d₁, d₂) ,

be the genetree, where the two mutations are on different branches. The number of individuals carrying mutation 1 is d₁ and d₂ gives the number of individuals bearing mutation 2. Denote by

Ndⁿ1,d2 :=

(0),(0,1),(0,1,2)

∼,(n−d₁, d₁−d₂, d₂)

the sample configuration where the mutations are on the same branch. The number of individuals carrying only mutation 1 is d₁−d₂ and both mutations are carried by d₂ individuals.

The two possible types of genetrees are depicted in Figure 4.3.

Remark 4.18. (i) Note that [Mⁿ_d₁_,d₂]≈ = [Mⁿ_d₂_,d₁]≈ holds ford1+d2≤n. Furthermore, note that Mⁿ_d₁_,d₂ with d₁ +d₂ =n, N_dⁿ₂_,d₂ and N_n,dⁿ ₂ denote valid genetrees with two segregating sites, whereas by a slight abuse of notation Mⁿ_d₁_,0 = Mⁿ_0,d₁ = Mⁿ_d₁ and N_dⁿ₁_,0 =Mⁿ_d₁ denote genetrees with only one segregating site. As before we denote by Mⁿ0,0 =N0,0ⁿ =Mⁿ0 the sample of size nwith no segregating sites.

To consider pairs of mutations determining the probabilities of performing a step involving type i in a general sample configuration [t,n] it is necessary to know the relation of the outmost mutation of type i to two given mutations m1 and m2 in the genetree. In other words, the type in the compressed genetree corresponding to typei has to be determined. Based on this information the corresponding most recent event

0 n−d1−d2

1 d1 2 d2

(a) The genetree for the sam-ple configuration [Mⁿ_d

1,d₂]≈

0 n−d₁

1 d₁−d₂

2 d₂

(b) The genetree for the sam-ple configuration [N_dⁿ

1,d₂]≈

Figure 4.3.: The two different sample configurations of sizenwith two segregating sites (or mutations). The number of individuals carrying mutation 1 isd₁ andd₂ individuals carry mutation 2

in the compressed tree can be chosen. Figure 4.4 shows two examples of compressing the genetree for two given mutations. Due to symmetry reasons the relation can be described by one of five distinct cases.

Definition 4.19. The relation between the type iand the mutations m₁ and m₂ is Case I if s_i,m₁ = 1 and s_i,m₂ = 1,

Case II if si,m1 = 1, si,m2 = 0 and m1, m2 are on the same branch, Case III if s_i,m₁ = 1, s_i,m₂ = 0 and m₁, m₂ are on different branches, Case IV if s_i,m₁ = 0, s_i,m₂ = 0 and m₁, m₂ are on the same branch, or Case V if si,m1 = 0, si,m2 = 0 and m1, m2 are on different branches.

In Case I type ibears both mutations. In Case II and III typeibears one mutation and in Case IV and V the type bears none of the two mutations. In the first case the two mutations lie on the same branch of the genetree. In the remaining four cases this property may or may not hold. The five cases are depicted in Figure 4.6 and the pseudo-code in Figure 4.5 can be used to identify the relation of the outmost mutation of typeito the mutationsm₁and m₂ in the genetree and the case to be used in a given setting.

In each case the most recent event affecting one of the three types is either a mutation event (not for the root type) or a merging event. For convenience denote by

d₁ =d_m₁ = Xd

i=1

s_i,m₁d_i

the number of individuals carrying mutationm1 and by d₂ =d_m₂ =

i=1

s_i,m₂d_i

the number of individuals bearing mutationm2. Without loss of generality the mutation labels can be chosen such thatd₁ ≥d₂ ⇒m₁< m₂ holds ifm₁ andm₂ are on the same branch.

0 3

4 1 5

8 2

2 5 3

6 1 i

⇒

0 9

1 1

8 2 i

(a) The two mutations considered in this example are mutation 1 and 8. The com-pressed genetree is of the formN_dⁿ

1,d₂ and typeicorresponds to mutation 1.

0 3

4 1 5

8 2

2 5 3

6 1 i

⇒

0 9

5 2 3 1

(b) In this example mutation 3 and 5 are con-sidered. The compressed genetree is of the form Mⁿ_d

1,d₂ and typeicorreponds to the root type.

Figure 4.4.: Two examples of genetree compressions. The type iis identified with one of the three types in the compressed sample by this procedure.

In Case III and V the compressed genetree is of the form Mⁿd1,d2. In Case V the probability that the most recent event reduced the multiplicity of the type bearing no mutation by l is given by Q^∗_θ(Mⁿ_d₁_,d₂ → M^n−l_d₁_,d₂) if n−d1−d2 > l. In Case III the probability that the multiplicity of the type bearing mutation 1 is decreased bylis given by Q^∗θ(Mⁿd1,d2 → M^n−l_d₁_−l,d₂) if d₁ > 1. If d₁ = 1, the probability that the most recent event was the origin of mutation 1 is given byQ^∗_θ(Mⁿ_1,d₂ → Mⁿ_0,d₂). By symmetry Case III includes the case that typeicorresponds to m₂ in the compressed genetree.

In Case I, II and IV the compressed genetree is of the form N_dⁿ₁_,d₂. The most recent event affected the type bearing no mutation in Case IV. The probability that the mul-tiplicity of the type is decreased bylis given byQ^∗_θ(N_dⁿ₁_,d₂ → N_d^n−l₁_,d₂) ifn−d₁ >1. The multiplicity of the type bearing only mutation 1 gets decreased by l with probability Q^∗θ(Ndⁿ1,d2 → N_d^n−l₁_−l,d₂) ifd₁−d₂ >1 in Case II. Note that in Case II the probability for the most recent event being the origin of mutation 1 is zero since in general mutation 2 is a successor of mutation 1 in the genetree and must therefore be removed from the genetree before mutation 1 can be removed. Decreasing the multiplicity of the type bearing both mutations bylin Case I occurs with probabilityQ^∗_θ(N_dⁿ₁_,d₂ → N_d^n−l₁_−l,d₂_−l) if d2 > 1. In Case I mutation 2 can originate from the most recent event and the probability is given byQ^∗_θ(N_dⁿ₁_,1 → Mⁿ_d₁_,0) if the multiplicity of the type bearing both mutations is one.

The optimal transition probabilities for samples with two mutations can be calculated numerically. To determine the transitions of the proposal Markov chain until it hits the root configuration M¹0 the transition probabilities for the cases with one mutation or zero mutations also have to be provided. The above considerations can now be combined in several ways to obtain proposal distributions for general sample configurations.

We now present several proposal distributions combining the considerations from above in different ways.

1 i f same branch (m1,m2) : 2 i f ( same branch (i,m1) ) :

3 i f (i < m1) : + _⇒_{Case IV}

4 e l s e:

5 i f same branch (i,m2) :

6 i f (i < m2) :

+ ^⇒^{Case II}

7 e l s e:

⇒Case I

8 e l s e:

+ ^⇒^{Case II}

9 e l s e: +_⇒_{Case IV}

10 e l s e:

11 i f same branch (i,m1) :

12 i f (i < m1) : + ⇒Case V

13 e l s e:

+ ^⇒^{Case III}

14 e l s e:

15 i f same branch (i,m2) :

16 i f (i < m2) : + ⇒Case V

17 e l s e:

+^⇒^{Case III}

18 e l s e: x ⇒Case V

Figure 4.5.: Pseudo-code to check the constellation of typei(or the outmost mutation of this type) and the two mutationsm₁andm₂in the genetree. The predicate same branch(m₁, m₂) is true if and only if the two mutations are on the same branch. In the pictograms the hollow dot depicts the root. The two filled dots correspond to the two mutations and the cross, ‘+’ or ‘x’, gives the position of the outmost mutation of type i. The case corresponding to the given constellation is indicated to the right of the arrow. Furthermore, m₁ < m₂ holds if mutation m₂ is a descendant of mutation m₁ in the genetree.

0 n−d₁

m₁ d₁−d₂

m₂ d₂ (a) Case I

0 n−d₁

m₁ d₁−d₂

m₂ d₂ (b) Case II

0 n−d₁−d₂

m₁ d₁ m₂ d₂ (c) Case III

0 n−d₁

m₁ d₁−d₂

m₂ d₂ (d) Case IV

0 n−d₁−d₂

m₁ d₁ m₂ d₂ (e) Case V

Figure 4.6.: The five different cases of types being affected by the most recent event in the two genetrees corresponding to configurationsMⁿ_d₁_,d₂ (Case III and V) and N_dⁿ₁_,d₂ (Case I, II and IV).

Two-Step Procedure

One way to extend the proposal distributionQ^H¹ to consider pairs of mutation is again a two step procedure as in Definition 4.15, where in the first step a type that is involved in the most recent event is chosen and the size of the possible merger is determined in the second step. Fix a configuration (t,n). For each typei,1≤i≤dand all unordered pairs of mutations in the genetree {m1, m2} let

u^{m_θ ¹^,m²^}(i) :=











1{d2>1}

dX2−1

l=1

Q^∗θ(Ndⁿ1,d2,N_d^n−l₁_−l,d₂_−l) +¹_{d₂_=1}Q^∗θ(Ndⁿ1,1,Ndⁿ1,0)

in Case I

d1−d2¹{d1−d2>1}

d1−dX2−1

l=1

Q^∗θ(Ndⁿ1,d2,N_d^n−l₁_−l,d₂) in Case II

1{d1>1}

dX1−1

l=1

Q^∗θ(Mⁿd1,d2,M^n−l_d₁_−l,d₂) +¹_{d₁_=1}Q^∗θ(Mⁿ1,d2,Mⁿ0,d2)

in Case III

n−d1

1{n−d1>1}

n−dX1−1

l=1

Q^∗θ(Ndⁿ1,d2,N_d^n−l₁_,d₂) in Case IV

n−d1−d2¹{n−d1−d2>1}

n−dX1−d2−1

l=1

Q^∗θ(Mⁿd1,d2,M^n−l_d₁_,d₂) in Case V be the probability that if the genetree is compressed to the one consisting only of mutation m₁ and m₂, then the most recent event involved an individual bearing type i. Againd_j =d_m_j denotes the number of individuals bearing the respective mutation.

In Case I and III type i identifies with one of the types in the compressed tree that could have been involved in either a merging event or a mutation event, depending on the number of types bearing the corresponding mutation. In the other three cases the type could have been involved in a merging event if the number of individuals bearing the corresponding mutation is greater than or equal to 2. Note that we write Q^∗_θ(H, H^′) instead of Q^∗_θ(H → H^′) for simplicity and that in Case I, III and IV the

order of the mutations in the compressed tree N_dⁿ₁_,d₂ is determined by their order in the original genetree. As in equation (4.21) the prefactors stem from the fact that n_i individuals bearing typeihave to be chosen from a larger group of individuals bearing the respective mutation.

Based on this probabilities define the distributionQ^H_θ²^f¹ with probabilities Q^H_θ²^f¹(i|t,n)∝

{m1,m2}u^{m_θ ¹^,m²^}(i) ifiis allowed to act,

0 otherwise, (4.24)

for all typesi∈ {1, . . . , d}, where the sum extends over all unordered pairs of mutations present in the sample. This distribution can be used to propose a type to be involved in the most recent event. One way of choosing the size of the possible merger is analogous to the one pursued inQ^H¹ in that again the probabilities of the merger size in a specific sample, now with two mutations, are considered.

Definition 4.20 (Proposal distributionQ^H_θ²^f^1,1). Let (t,n) be the current sample con-figuration. Choose typeito be involved in the most recent event according toQ^H_θ²^f¹(i|t,n) defined in equation (4.24). If n_i= 1, remove the outmost mutation of typei. Ifn_i ≥2, letobe the outmost mutation of typei. Reduce the multiplicity of typeibylwith proba-bilityQ^∗_θ(N_dⁿ_o_,d_o_−n_i → N_d^n−l_o_−l,d_o_−n_i). If typeiis the root type, reduce the multiplicity byl with probabilityQ^∗θ(Mⁿn−ni → M^n−l_n−l−n_i). Denote this proposal distribution byQ^H_θ²^f^1,1.

Another way of proposing the merger size is to again consider all mutations present in the sample. Define the quantity

u^{m¹^,m²^}

θ (i, l) :=











d2Q^∗θ(Ndⁿ1,d2 → N_d^n−l₁_−l,d₂_−l) in Case I

d1−d2Q^∗_θ(N_dⁿ₁_,d₂ → N_d^n−l₁_−l,d₂) in Case II

d1Q^∗_θ(Mⁿ_d₁_,d₂ → M^n−l_d₁_−l,d₂) in Case III

n−d1Q^∗_θ(N_dⁿ₁_,d₂ → N_d^n−l₁_,d₂) in Case IV

n−d1−d2Q^∗θ(Mⁿd1,d2 → M^n−ld1,d2) in Case V

for a given unordered pair of mutations{m₁, m₂} and a given 1≤l≤n_i−1, providing the weight of anl+ 1-merger in the compressed genetree. Based on this quantities the following proposal distribution can be defined:

Definition 4.21 (Proposal distributionQ^H_θ²^f^1,2). Let (t,n) be the current sample con-figuration. Choose typeito be involved in the most recent event according toQ^H_θ²^f¹(i|t,n) defined in equation (4.24). If n_i= 1, remove the outmost mutation of typei. Ifn_i ≥2, choose to decreasen_i byl with probability proportional to

m1,m2

u^{m¹^,m²^}

θ (i, l),

where the sum extends over all pairs of mutations present in the current sample. Denote this proposal distribution by Q^H_θ²^f^1,2.

Whereas the proposal distributions in this paragraph perform two steps to choose the transition of the Markov chain, we introduce proposal distributions performing only one step in the next paragraph.

One-Step procedure

The ideas of [HUW08] can be extended in another direction by choosing the type in-volved in the most recent event and the size of the possible merger in one step. In this paragraph we present two proposal distributions that let pairs of mutations valuate all possible transitions and then the most recent step is chosen proportionally to the weights.

Note that a possible transition of the proposal Markov chain can be characterised by a pair (i, l) with 1 ≤ i ≤ d and 0 ≤ l ≤ n_i−1, where i denotes the type that is involved in the most recent event and l denotes the amount by which the multiplicity is decreased. Denote byl= 0 the case that the outmost mutation of type iis removed from the genetree. The probability for an event (i,0) is zero ifn_i ≥2 since a type has to be singleton in order to remove the outmost mutation. Forni ≥2 andl >0 let

v^{m_θ ¹^,m²^}(i, l) :=











d2Q^∗θ(Ndⁿ1,d2 → N_d^n−l₁_−l,d₂_−l) in Case I

d1−d2Q^∗_θ(N_dⁿ₁_,d₂ → N_d^n−l₁_−l,d₂) in Case II

d1Q^∗_θ(Mⁿ_d₁_,d₂ → M^n−l_d₁_−l,d₂) in Case III

n−d1Q^∗θ(Ndⁿ1,d2 → N_d^n−l₁_,d₂) in Case IV

n−d1−d2Q^∗_θ(Mⁿ_d₁_,d₂ → M^n−l_d₁_,d₂) in Case IV

be the probability that if the genetree would just consist of mutationm₁ and m₂, then the multiplicity of the type that corresponds toiin the compressed genetree is reduced by l. Again the optimal transition probability has to be multiplied withn_i divided by the multiplicity of the corresponding type.

If the multiplicity of type iis one, then

v^{m_θ ¹^,m²^}(i,0) :=











d2Q^∗θ(Ndⁿ1,1→ Ndⁿ1,0)¹_{d₂_=1} in Case I

0 in Case II

d1Q^∗_θ(Mⁿ_d₂_,1 → Mⁿ_d₂_,0)¹_{d₁_=1} in Case III

0 in Case IV

0 in Case V

(4.25)

yields the probability that in the compressed genetree consisting of mutationm1andm2

the most recent event removed the outmost mutation of the type that is identified with i. Note that this probability is zero if the respectived_m_jis greater than one, because in this case the outmost mutation cannot be removed in the compressed tree. Furthermore in Case II, IV and V the mutation cannot be removed since it is not a leaf node in the genetree. These quantities can be combined as follows:

Definition 4.22 (Proposal DistributionQ^H_θ²^f^2,1). Denote by(t,n) the current sample configuration. The proposal distribution Q^H_θ²^f^2,1 proposes the event (i, l) for 1≤i≤d and 0≤l≤ni−1 to be the most recent evolutionary event with probability

Q^f_θ^2,1(i, l|t,n)∝ (P

{m1,m2}v^{m_θ ¹^,m²^}(i, l) if iis allowed to act

0 otherwise,

where the sum extends over all pairs of mutations present in the current sample and again type iis allowed to act if n_i≥2 or it is a leaf node in the genetree.

Note that in a given sample (t,n), by equation (4.25) the contribution of the pair {m₁, m₂} to the event (i,0) of removing the outmost mutation of a leaf typei is zero if the corresponding dmj is greater than one. In a general genetree this case appears rather frequently and thus we argue that the proposal distribution Q^H_θ²^f^2,1 underrates mutation events. This effect is illustrated in Figure 4.8. To circumvent this effect, we modify the proposal distribution by summing only over those pairs of mutations where one of the mutations coincides with the outmost mutation of the current type. This should reduce the number of pairs that put too much emphasis on the merging events and establishes a more balanced proposal distribution.

Definition 4.23 (Proposal DistributionQ^H_θ²^f^2,2). Denote by(t,n) the current sample configuration. The proposal distribution Q^H_θ²^f^2,2 chooses the event (i, l) for 1 ≤ i ≤d an0≤l≤n_i−1 with probability

Q^H_θ²^f^2,2(i, l|t,n)∝ (P

m6=o(i)v_θ^{o(i),m}(i, l) if i is allowed to act

0 otherwise,

whereo(i)denotes the outmost mutation of typeiand the sum extends over all mutations except this outmost mutation.

Remark 4.24. (i) Another promising side effect of this method is that it reduces the complexity of proposing a step from quadratic to linear in the number of mutations.

(ii) The same authors who derived the probabilities (4.23) in the case of Kingman’s co-alescent and one segregating site present equations for the case of two segregating sites in Section 4 of [HW09]. They note in Section 7 that their results “could potentially be used to further improve the proposal distribution for inference in coalescent mod-els.” Indeed, their results can be applied to derive explicit formulas for the quantities um1,m2(l), u_m

1,m2(i, l), and vm1,m2(l) that govern the proposal distributions regarding pairs of mutations.

(iii) Note that the idea to let mutations valuate all possible transitions can also be applied for the case when just one mutation at a time is considered in the sense ofQ^H_θ¹. The next proposal distribution combines the single-mutation approach with the pair approach.

Sesqui-proposal

The complexity of the proposal distributions regarding pairs of mutations is quadratic in the number of mutations, whereas the proposal distributions regarding all mutations have linear complexity. We will see in Section 4.3 that the real time to compute steps for the distributions differ. However, we noted that the method determining the size of the merger in proposal distributionQ^H_θ²^f^1,1 from Definition 4.20 performs well. Thus a promising candidate concerning speed and performance is given by the combination of proposing a type in the first step considering all mutations separately and then choosing the merging size in the second step by the method from Q^H_θ²^f^1,1.

Definition 4.25 (Proposal Distribution Q^SQH_θ ). Choose type i to be involved in the most recent event considering all mutations according to the distribution Q^H_θ¹ defined in (4.20). If a singleton type is chosen, remove the outmost mutation, whereas in the case of a non-singleton type withn_i ≥2the multiplicity is decreased bylwith probability Q^∗_θ(N_dⁿ_o_,d_o_−n_i → N_d^n−l_o_−l,d_o_−n_i), where 1≤l≤n_i−1.

Im Dokument Multiple Merger Coalescents and Population Genetic Inference (Seite 109-121)

4. Importance Sampling in the Infinitely-Many-Sites-Model 89

4.2. Importance Sampling Schemes

4.2.3. Hobolth, Uyenoyama &amp; Wiuf’s Scheme

n − d

d

n

4.2.3. Hobolth, Uyenoyama & Wiuf’s Scheme