• Keine Ergebnisse gefunden

Boom, Boulding, Beckenbach’s Oyster Data

5. Inference of Evolutionary Parameters from Datasets 125

5.6. Real data

5.6.2. Boom, Boulding, Beckenbach’s Oyster Data

The dataset used in this section was was taken from [BBB94]. The authors obtained the data as the result of a restriction-enzyme digest of mitochondrial DNA taken from 155 Pacific oysters (Crassostrea gigas) from British Columbia. After transferring this data into an S-matrix introduced in Section 3.3.1. resembling sequence data with 45 segregating sites we had to remove one site to resolve violations of the infinitely many sites model. The sample configuration obtained by this procedure showed 40 haplotypes and is depicted in Figure 5.10. Of the resulting 44 segregating sites 30 where singletons.

This dataset was analysed in [EW06] and the authors already pointed out the underlying genealogy might not be adequately modelled by Kingman’s coalescent.

The resulting values for Tajima’sDand Fu & Li’sDare given in Table 5.1. The max-imum likelihood estimates based on the full information and on the frequency-spectrum under the Beta-coalescent and the ψ-coalescent are given in Table 5.2. Complete like-lihood surfaces were computed on a discrete grid with stepsizes ranging from 0.05 to

1001 12 11411 22 1 5 1 1 1 2 111 2 1 1 1 11412 1 1 1 1 1 1 1 111 1

Figure 5.10.: Genetree reconstructed from the RFLP data in [BBB94]. The picture was generated using Bob Griffiths’s treepic program (obtained from http://www.stats.ox.ac.uk/~griff/software.html).

Ref. n Tajima’s D CI Fu & Li’sD CI

[BBB94] 155 -2.65 [-1.76, 2.10] -5.8 [-2.38,1.63]

Table 5.1.: Tajima’s D and FU & Li’s D for Pacific oyster data. The statistical tests strongly reject the neutral Kingman hypothesis.

0.2 in the mutation rate. Theα stepsize for the frequency-spectrum analysis was 0.025 and for the method based on the complete information 0.05. For the parameterψ the stepsizes were 0.01 and 0.02 respectively. The maximum was taken on this discrete grid.

We obtained the likelihood values for the frequency-spectrum by simulating coalescent trees to approximate the expected value in equation (3.32). For the method based on the complete information we indicate by the † symbol that the corresponding surface was obtained by estimating the likelihood values at a certain gridpoint using different driving values. The different estimates were then suitably combined. Each gridpoint was covered by approximately 11 driving values and there were 4 times as many gridpoints as driving values. The ‡ symbol indicates that the likelihood values were estimated independently at each gridpoint. The estimated standard deviations are not shown, but the relative error was made smaller than 0.01 using sufficiently many independent runs.

Except for the‡analysis where we provide the likelihood surface and the corresponding standard deviations in Figure 5.11. Here the relative error is approximately one, so a more thorough analysis is required.

The values of Tajima’s D and Fu & Li’s D for the oyster dataset are negative and lie outside the confidence intervals, indicating a shallow genealogy. Furthermore, the

Ref. (ˆr,α) (fs)ˆ (˜r,α) (flh)˜ (ˆr,ψ) (fs)ˆ (˜r,ψ) (flh)˜ [BBB94] (1.15, 1.125) (1.2, 1.15) (1.95, 0.09) (3, 0.04)

Table 5.2.: Estimates for the mutation rate r and the parameter for the coalescent in the two different models (α/ψ) for the [BBB94] Pacific oyster dataset. The parameter are estimated by the method based on the frequency-spectrum (fs) and the method based on the full tree (flh).

−50 −40

−30

−30 −20

−10

r

ψ

1234

0.0 0.1 0.2 0.3

(a) Estimated log-likelihood surface for the Pacific oyster dataset.

−50

−40

−30

−30 −20 −10

r

ψ

1234

0.0 0.1 0.2 0.3

(b) Corresponding logarithm of estimated standard deviations.

Figure 5.11.: Log-likelihood surface and corresponding estimated standard deviation for the Oyster dataset analysed under the ψ-coalescent. The likelihood values were estimated independently for each discrete gridpoint using 108 independent runs of the proposal chain.

maximum likelihood approach based on the frequency-spectrum and the full likelihood estimatesα = 1.25 respectively 1.15 andψ = 0.09 respectively 0.04 for the tree shape parameter. Thus the underlying genealogy is more likely to be of non-Kingman type.

However, as was pointed out by the authors in [BBB94], the population under con-sideration has been introduced into the habitat approximately one century ago, hence providing evidence for a severe recent bottleneck. It is unclear whether the shallow ge-nealogy observed in the data is caused by this bottleneck or by a neutral sweepstake-like reproduction mechanism featured in the population models presented in Section 5.2.

5.6.3. ´Arnason’s Atlantic Cod Data

The third kind of data we are going to analyse is based on DNA sequence data taken from a 250 bp stretch of the mitochondrial cytochromebgene of the Atlantic cod (Gadus morhua). In the numbering of the sites from [JB96] this stretch ranges from site 14,459 to site 14,708 included. In [A04], ´Arnason took several cod datasets from different publications, combined them and provided an analysis of the whole dataset. However, since this combined dataset is too big to be treated by our method for computing likelihoods based on the full information, we analysed certain samples separately. As Arnason pointed out the samples stem from various localities throughout the Atlantic,´ ranging from Newfoundland [CM91], [PC93] & [CSHW95], Greenland [APKS00], the Faroe Islands [SA03], and Norway [AP96] to the Baltic Sea [APP98].

The DNA sequence data of the different types present in the different samples can be found in Figure 1 on page 1874 in [A04]. The composition of the sample from [AP96]

is {A: 35, E: 25, G: 14, D: 14, NI: 4, B: 2, C: 2, F: 1, GI: 1, DI: 1, BI: 1}, the sample from [APP98] is{E: 62, A: 19, G: 12, D: 6, DI: 3, H: 2, ES: 1, DK: 1, C: 1, EJ: 1, NI:

1}, the sample from [APKS00] is {A: 48, D: 6, E: 8, G: 7, C: 1, NI: 1, MI: 1, LI: 2, S:

1, PI: 1, GJ: 1, ED: 1}, the sample from [CM91] is {A: 36, B: 2, C: 2, D: 1, E: 4, F: 1, G: 4, H: 1, I: 1, J: 1, K: 1, L: 1}, the sample from [CSHW95] is{A: 201, B: 1, C: 2 , D:

7, E: 4, H: 3, J: 2, N: 4, O: 1, P: 1, S: 3, T: 1, X: 2, Y: 2, Z: 1, a: 1}, the sample from [PC93] is{A: 84, G: 6, E: 4, P: 1, X: 1, U: 2, N: 2, M: 1, C: 1, J: 1}, and the sample from [SA03] is {A: 26, E: 13, D: 11, G: 11, MI: 3, H: 2, NI: 1, C: 1, EY: 1, EX: 1, EL:

1, EK: 1, DO: 1, DL: 1}.

From [APKS00] we took the Greenland subsample and we restricted the sample from [APP98] to the Baltic and transition area. In [SA03] the authors provide the DNA sequence data of a larger 566 bp stretch from which we only took the information of the 250 bp fragment in question. Furthermore, in [CSHW95] the authors report the type ‘R’ which differs from the type ‘A’ only outside of the 250 bp segment we are considering. Thus we count this type as an ‘A’ type in the Case 2) analysis which will be explained below.

Some of the samples violated the assumptions of the infinitely many sites model and could not be transformed into a valid sample configuration. In the following we pursue two approaches to resolve these violations.

Case 1) For the first approach we considered every sample separately and alter-natingly removed types and segregating sites until the violations disappeared. This

Ref. Modifications

[AP96] Type DI and site 14,562 removed from analysis.

[APP98] Type DK and site 14,691 removed from analysis.

[APKS00] Type ED and site 14,691 removed from analysis.

[CM91] No violations.

[CSHW95] No violations.

[PC93] No violations.

[SA03] Type NI removed from analysis.

Table 5.3.: For each dataset the types and sites that had to be removed from the analysis in order to resolve the violations of the infinitely many sites model are shown.

procedure resulted in the modifications shown in Table 5.3. We chose the most fre-quent type to be the ancestral one. Thus the ancestral type was ‘A’ for every dataset, except for the data from [APP98], where we chose type ‘E’. In what follows we refer to the sample configurations obtained by this procedure as the Case 1) samples.

Case 2) In the second approach we took the complete sample ´Arnason analysed in [A04] and solved the violations of the infinitely-many-sites model by introducing a consistent pattern of parallel mutations. This approach leads to a dataset that resembles the phylogenetic maximum parsimony network from Figure 2 of [A04, p. 1875]. We then obtain the subsamples corresponding to the different publications by choosing the corresponding types in the corresponding quantities from this violation-free dataset.

We choose ‘A’ as the ancestral type for each dataset. The complete modified dataset with all parallel mutations is given in Figure 5.12. We mark the datasets obtained by this procedure with the number 2).

Some figures for the different datasets are given in Table 5.4 and the values of the test statistics are shown in Table 5.5. ´Arnason refers to these test statistics in [A04, p. 1876] and though he presents no actual values, he reports highly significant negative values for some datasets.

Again, we computed maximum likelihood estimates for the mutation rate and the parameter of the underlying coalescent tree based on equation (3.10) for each dataset.

For the Atlantic cod datasets (Case 1) and 2)) Table 5.6 shows the estimates where the underlying genealogy is the Beta-coalescent and the results in Table 5.7 are based on the ψ-coalescent. The distance between gridpoints was (0.05, 0.05) for the Case 1) analysis and (0.05, 0.05) for the Case 2) analysis. For the method based on the full information the probabilities could be evaluated using the exact recursion for all datasets except [CSHW95]. For this dataset and for all datasets in the summary statistics case we had to approximate the likelihood value with the abovementioned Monte-Carlo approximations. The symbols † and ‡ dennote the same methods as in Section 5.6.2.

In each case we used sufficiently many independent runs of the Markov chain so that the estimated relative error was below 0.01. Likelihood surfaces for selected datasets are shown in Figure 5.13 and Figure 5.14.

Segregating site

4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 Haplo- 6 6 6 6 6 7 8 8 8 8 9 9 9 0 0 0 1 1 2 2 3 4 4 4 4 6 6 6 6 8 8 8 9 0 3 3 3 3 3 4 4 4 5 5 5 5 6 7 7 7 8 8 9 9 9 9 9 9 9 0 type 3 6 7 7 8 5 1 7 7 8 0 6 6 2 8 8 4 7 2 3 5 2 4 7 7 2 2 2 5 6 6 9 5 1 1 1 1 1 7 3 9 9 8 8 8 8 4 3 6 6 5 5 1 1 1 1 5 5 7 6 N A . . . 696 E . . . p . . . 193 D . . . o . . . p . . . 124 G . . . p . . . . 112 C . . . p . . . . 29 NI . . . p . . . p . . . . 15 H . . . p . . . . 9 MI . . . p . . . . 7 N . . . o . . . . 6 J . . . o . . . o . . . p . . . . 5 EC . . . p . . . p . . . . 5 DI . . . o . . . p . p . . . . 5 S . . . p . . . . 4 AJ . o . . . . 4 XI . . . o . . . p . . . . 3 X . . . p . . . . 3 RI . . . p . . . . 3 LI . . . p . . . p . . . p . . . . 3 BI . . . p . . . p . . . . 3 B . . . p . . . . 3 Y . . . p . . . . 2 U . . p . . . . 2 P . . . o . . . . 2 O . . . o . . . p . . . . 2 LJ . . . p . . . p . . . . 2 GJ . . . p . . . p . . . . 2 AI . . . o . . . . 2 a . . . o . . . . 1 ZI . . . . o . . . . 1 Z . . . o . . . . 1 UI . . . p . . . p . . . . 1 TI . . . o . . . p . . . . 1 T . . . o . . . . 1 QI . . . o . . . . 1 PI . . . p . . . 1 OI . . . p . . . o . . . . 1 MC . . . p . p . . . p . . 1 M o . . . p . . . p . . . p . . 1 L . . . o . . . . 1 K . . . o . . . . 1 I . . . o . . . p . . . . 1 HJ . . . p . . . p . . . . 1 HI . . . p . . . p . . . . 1 GI . . . p . . . . 1 F . . . p . . . o . . . . 1 EY . . . p . . . p . . . . 1 EX . . . p . . p . . . . 1 ES . . . p . . . p . . . . 1 EM . . . p . . . p . . . . 1 EL . . . o . . . p . . . . 1 EK . . . p . . . o . 1 EJ . . . p . . . p . . . . . 1 EI . . . o . . . p . . . . 1 ED . . . p . . . o . . . p . . . . 1 DY . . . o . . . p . . . . p . . . . 1 DO . . . o o . . . p . . . . 1 DL . . . o . . . p . . . o 1 DK . . . o . . . p . . . p . p . . . . 1 DJ . . . o . . . p . . . p . . . . 1

Figure 5.12.: The dataset taken from [A04] where we introduced several parallel muta-tions in order to resolve the violamuta-tions of the infinitely many sites model.

Site numbers are as in [JB96] minus 14,000. The ‘.’ mark the cases where the respective type has the same nucleotide as the ancestral sequence (type ‘A’). The letter ‘o’ denotes a mutation at the respective site and

‘p’ denotes a mutation on a segregating site where we had to introduce parallel mutations to resolve violations of the infinitely many sites model.

Note that for a segregating site with parallel mutations there are multiple columns having the same number.

Case 1) Case 2)

Ref. Location n d s ηi ηe n d s ηi ηe

[AP96] Norway 99 8 7 5 2 100 11 10 6 4

[APP98] Baltic/ trans. area 108 7 6 4 2 109 11 10 6 4

[APKS00] Greenland subsample 77 9 8 5 3 78 12 11 6 5

[CM91] Norway/ Newfoundland 55 12 11 5 6 identical to a)

[CSHW95] Newfoundland 236 17 16 9 7 236 16 15 9 6

[PC93] Newfoundland 103 10 12 5 7 identical to a)

[SA03] Faroe Islands 73 14 13 6 7 74 15 14 6 8

Table 5.4.: The datasets used in the analysis together with their respective sample size (n), number of haplotypes (d), segregating sites (s), number of mutations on internal branches (ηi) and number of singleton mutations (ηe).

−18 −14

−14 −12

−12 −10

−10

r

1.2 1.4α 1.6 1.8

0.51.01.52.0

(a) Estimated log-likelihood values for the dataset from [CSHW95].

−20 −18 −16

−16 −14

−14 −12

r

1.2 1.4 α 1.6 1.8

0.51.01.52.0

(b) Corresponding logarithm of standard de-viation.

Figure 5.13.: Estimated likelihood surface and corresponding standard deviations in logarithmic scale for the dataset from [CSHW95] in the Case 2) analysis.

The estimates were performed with the driving value method described in the text.

2.53.0

5e−10

5e−10

1e−09

1.5e−09 2e−09

r

1.2 1.4 α 1.6 1.8

0.51.01.52.0

2.0

(a) Likelihood under the Beta-coalscent.

5e−10 5e−10

1e−09 1.5e−09

r

0.02 0.06 ψ 0.10 0.14

0.51.01.52.02.53.0

(b) Likelihood under theψ-coalescent.

Figure 5.14.: Exactly calculated likelihood surfaces for the dataset from [APP98] in the Case 1) analysis.

Case 1)

Ref. Size Tajima’sD CI Fu & Li’sD CI

[AP96] 99 -0.03 [-1.78, 2.07] -0.58 [-2.37, 1.61]

[APP98] 108 -0.5 [-1.78, 2.07] -0.85 [-2.36, 1.61]

[APKS00] 77 -1.39 [-1.79, 2.06] -1.1 [-2.38, 1.59]

[CM91] 55 -1.87 [-1.8, 2.05] -2.24 [-2.45, 1.57]

[CSHW95] 236 -2.15 [-1.75, 2.11] -2.67 [-2.25, 1.65]

[PC93] 103 -2.12 [-1.78, 2.07] -3.07 [-2.36, 1.61]

[SA03] 73 -1.18 [-1.79, 2.06] -2.55 [-2.38, 1.59]

Case 2)

Ref. Size Tajima’sD CI Fu & Li’sD CI

[AP96] 100 -0.66 [-1.78, 2.07] -1.50 [-2.36, 1.61]

[APP98] 109 -0.89 [-1.78, 2.07] -1.54 [-2.36, 1.61]

[APKS00] 78 -1.45 [-1.79, 2.06] -1.83 [-2.38, 1.59]

[CM91] 55 -1.87 [-1.8, 2.05] -2.24 [-2.45, 1.57]

[CSHW95] 236 -2.10 [-1.75, 2.11] -2.23 [-2.25, 1.65]

[PC93] 103 -2.12 [-1.78, 2.07] -3.07 [-2.36, 1.61]

[SA03] 74 -1.28 [-1.79, 2.06] -2.88 [-2.38, 1.59]

Table 5.5.: Rejection of Kingman hypothesis for some Atlantic cod datasets

Case 1) Case 2)

No. Ref. (ˆr,α) (fs)ˆ (˜r,α) (flh)˜ (ˆr,α) (fs)ˆ (˜r,α) (flh)˜ 1 [AP96] (2.4, 1.8) (0.75, 2.0) (1.15, 1.58) (0.7, 1.65) 2 [APP98] (1.3, 1.72) (0.35, 1.5) (1.15, 1.58) (0.6, 1.55) 3 [APKS00] (1.2, 1.62) (0.6, 1.5) (1.1, 1.5) (0.7, 1.5) 4 [CM91] (0.8, 1.33) (0.75, 1.4) (0.8, 1.33) (0.8, 1.4) 5 [CSHW95] (1.1, 1.53) (0.6, 1.45) (1.25, 1.58) (0.6, 1.5) 6 [PC93] (0.6, 1.3) (0.65, 1.4) (0.6, 1.3) (0.6, 1.4) 7 [SA03] (0.8, 1.32) (0.7, 1.35) (0.85, 1.3) (0.7, 1.3)

Table 5.6.: Inference results for Atlantic Cod datasets under the Beta-coalescent based on summary statistics (fs) and on the complete information (flh).

Case 1) Case 2)

No. Ref. (ˆr,ψ) (fs)ˆ (˜r,ψ) (flh)˜ (ˆr,ψ) (fs)ˆ (˜r,ψ) (flh)˜ 1 [AP96] (0.65, 0.015) (0.6, 0.025) (0.8, 0.036) (0.9, 0.03) 2 [APP98] (0.5, 0.024) (0.45, 0.045) (0.8, 0.036) (0.75, 0.045) 3 [APKS00] (0.7, 0.039) (0.75, 0.04) (0.85, 0.06) (1.05, 0.04) 4 [CM91] (0.85, 0.102) (1.05, 0.07) (0.85, 0.102) (1.05, 0.07) 5 [CSHW95] (1 ,0.024) (1.2, 0.02) (0.95, 0.021) (1.6, 0) 6 [PC93] (0.75, 0.078) (1.05, 0.035) (0.75, 0.078) (1.05, 0.035) 7 [SA03] (0.95, 0.084) (1.2, 0.06) (0.95, 0.096) (1.2, 0.065) Table 5.7.: Inference results for Atlantic Cod datasets under theψ-coalescent based on

summary statistics (fs) and on the complete information (flh).

In the next section we argue about the evidence of non-Kingman type genealogies in the Atlantic cod data.