
Sample selection and sequencing strategy

The costs of sequencing are constantly decreasing (NHGRI 2016), which was reflected in markedly different prices for our two projects, although sequencing was conducted only three years apart. A mammalian genome at a decent coverage of ~15X would cost about 1,500 € today, which is 10 to 15 times more expensive than array genotyping. This effectively limits the number of samples in a study. A cost-effective approach chosen in other studies (Zhu et al. 2012) is to sequence DNA pools instead of individual samples. In pool-seq, equimolar amounts of DNA from several individuals are pooled and sequenced as one sample. Without further methods such as barcoding, this makes it impossible to determine the origin of each read or to construct long haplotypes from short reads. It has been shown that allele frequencies estimated from read counts are sufficiently reliable (Anand et al. 2016; Zhu et al. 2012). Still, there are some disadvantages: pool-seq is inherently affected by a bias introduced by pooling and sequencing errors (Kofler et al. 2011), which is a problem mainly because these errors are difficult to distinguish from real rare variants. Rare variants are defined as variants with a minor allele frequency of less than 1 % (Anand et al. 2016) to less than 5 % (Kim et al. 2010) and are assumed to explain a major part of genetic variation (Anand et al. 2016). Furthermore, many current statistical approaches to identify rare variants were not originally designed for pool-seq data (Wang et al. 2010b). In addition, 𝐹𝑆𝑇 estimates from pooled and individual sequences seem to be differently distributed (Bersaglieri et al. 2004; Akey et al. 2002), a finding we also noticed in chapter 5. A possible explanation is a high number of false-positive differentiation signals when either coverage or sample size is too low. Lynch et al. (2014) estimated that a minimum of both 100 samples and 100X coverage is needed, implying that a significant proportion of the differentiated variants observed may be false positives. Another strategy, which we took in chapters 2 and 3, is to estimate allele frequencies from genotypes based on individual sequencing. Kim et al. (2011) outlined that this should be done with caution when individual samples were sequenced at low to medium coverage (< 15X), since, again, estimates for rare alleles are expected to be biased. If it is necessary nonetheless, genotypes should not be filtered on genotype confidence, which we actually did, albeit in a relatively lenient manner. Despite all these possible disadvantages, pool sequencing has frequently proven especially useful for elucidating the background of interesting traits, albeit often violating the aforementioned rules (Carneiro et al. 2014; Rubin et al. 2010; Lamichhaney et al. 2012).
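The difficulty of separating rare variants from sequencing errors in pooled data can be illustrated with a small calculation. The following is a minimal sketch and not the pipeline used in this thesis: the function names, the pool size of 50 animals, the 30X coverage and the 1 % error rate are assumptions chosen for illustration, and the sketch simplifies by treating every sequencing error as producing the same alternative allele.

```python
def pool_allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Estimate an allele frequency in a pool as the share of reads carrying the allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= alt_reads <= total_reads:
        raise ValueError("alt_reads must lie between 0 and total_reads")
    return alt_reads / total_reads


def expected_reads(coverage: float, rate: float) -> float:
    """Expected number of reads at a site showing an allele that occurs at the given rate."""
    return coverage * rate


# Hypothetical numbers: 50 diploid animals in one pool (100 chromosomes),
# 30X coverage, and an assumed per-base sequencing error rate of 1 %.
coverage = 30
singleton_freq = 1 / (2 * 50)   # one copy of the allele among 100 chromosomes
error_rate = 0.01               # assumption for illustration only

true_variant_reads = expected_reads(coverage, singleton_freq)
error_reads = expected_reads(coverage, error_rate)

print(f"expected reads from a real singleton: {true_variant_reads:.2f}")
print(f"expected reads from sequencing error: {error_reads:.2f}")
print(f"estimated frequency for 3 of 30 reads: {pool_allele_frequency(3, 30):.2f}")
```

Under these assumed values a real singleton and a sequencing error are expected to contribute the same number of reads, so read counts alone cannot tell them apart at this coverage.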

The major strategic advantage of pool-seq over individual sequencing is the larger number of samples that can be included at the same cost, which facilitates the incorporation of multiple breeds or multiple strains of the same breed. This in turn is expected to improve the power of differentiation studies by removing artefacts caused by stratification (Schlötterer 2002; Zhu et al. 2012). We tried to incorporate this by including a second miniature breed, the MiniLEWE, in chapters 2 and 3, and by sequencing two pools per stock as described in chapters 4 and 5. With a focus on body size, breeds to be included in future studies could be Bama pigs from China (Liu et al. 2008) or the pygmy hog Sus salvanius/Porcula salvania (Funk et al. 2007). It should be mentioned that all these analyses including multiple miniature breeds are based on the hypothesis that their miniaturization has a common genetic background, which is debatable with respect to the situation in humans (Merimee et al. 1989; Klingseisen and Jackson 2011; Mayer et al. 2001).

Figure 6.1 shows expected heterozygosity, estimated from the GMP pools from the second study (chapters 4 and 5) and the MiniLEWE and GMP samples from the first study (chapters 2 and 3). Chromosome 5 carries a region between 40 and 46 Mb that we identified as a major selective sweep in the first study, and which could serve as an example of different results arising from the different experimental settings mentioned before. When the same region is revisited in the pool data, the strong signature is not observed. Possible explanations are different variant calling algorithms, different filtering strategies, the aforementioned differences between pool and individual sequencing, or simply sampling error due to a low number of GMP samples. However, in both types of data this region behaves irregularly, with elevated heterozygosity in the pools and diminished heterozygosity in the samples from the first study, which also incorporated a MiniLEWE pool of ten sows. Both datasets are concordant in the remaining genome, but we feel that this issue cannot be resolved based on the current analyses and requires further research.

Figure 6.1: Expected heterozygosity, averaged in 250-SNP windows along chromosome 5. The respective sweep region is highlighted in green.
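The quantity plotted in Figure 6.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to produce the figure: it assumes biallelic SNPs, uses the standard expected heterozygosity 2p(1 - p) without a sample-size correction, and averages over consecutive, non-overlapping windows, since the text does not specify whether the windows overlap. The function names are hypothetical.

```python
from typing import List, Sequence


def expected_heterozygosity(p: float) -> float:
    """Expected heterozygosity of a biallelic SNP with allele frequency p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    return 2.0 * p * (1.0 - p)


def windowed_mean(values: Sequence[float], window: int = 250) -> List[float]:
    """Average values in consecutive, non-overlapping windows of `window` SNPs.

    A final, shorter window is averaged over the SNPs it actually contains.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    means = []
    for start in range(0, len(values), window):
        chunk = values[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


# Toy example: four SNPs ordered along a chromosome, windows of two SNPs.
frequencies = [0.0, 0.5, 0.5, 0.5]
het = [expected_heterozygosity(p) for p in frequencies]
print(windowed_mean(het, window=2))
```

A window with mostly fixed or nearly fixed SNPs yields a low average, which is the pattern expected in a selective sweep.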


Differentiation

Single nucleotide mutations can have a tremendous effect on the functionality of proteins and the resulting phenotypes (Amorim et al. 2017). Studies aiming to identify nonsense alleles underlying genetic diseases have been fairly successful in associating stop codons with, for example, dementia (Vidal et al. 1999), Legionnaires' disease (Hawn et al. 2003) or Stickler syndrome (Ahmad et al. 1991). Conte et al. (2017) found that deleterious mutations were effectively under selection in spruce and were therefore reduced in allele frequency compared to non-deleterious mutations. 𝐹𝑆𝑇 has been used successfully to map regions of divergent selection in the herring (Lamichhaney et al. 2012). Therefore, combining 𝐹𝑆𝑇 analysis and functional annotation using WGS data is a promising approach to identify loci undergoing divergent selection or having been fixed by it. The most representative example of such a locus might be the gait pattern in horses (Andersson et al. 2012), where a premature stop codon mutation in DMRT3 determines whether a horse has four or five instead of three gaits. Breeds capable of pacing or trotting, like the Icelandic horse, were found to carry the mutation at high allele frequencies or even to be fixed for it.

We used similar approaches in our studies, aiming to identify the background of growth and to evaluate the effects of differentiation in the GMP stocks. In the first study, we identified autosomal loci with high 𝐹𝑆𝑇 values, which include loci fixed for opposite alleles, and annotated them using the Ensembl Variant Effect Predictor (McLaren et al. 2016). As shown in Figure 6.2, only 1,331 SNPs with 𝐹𝑆𝑇 > 0.95 were detected, 804 of which had 𝐹𝑆𝑇 = 1. Few were annotated to multiple functional classes.
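Why oppositely fixed loci reach 𝐹𝑆𝑇 = 1 and how a threshold such as 𝐹𝑆𝑇 > 0.95 selects SNPs can be shown with a simple calculation. The sketch below is not necessarily the estimator used in our studies: it assumes two equally weighted populations, biallelic SNPs and the basic definition 𝐹𝑆𝑇 = (H_T - H_S) / H_T without sample-size corrections. The function names and SNP labels are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def fst_two_populations(p1: float, p2: float) -> float:
    """F_ST for one biallelic SNP from the allele frequencies in two populations.

    Uses F_ST = (H_T - H_S) / H_T with equal population weights.
    Returns NaN if the SNP is monomorphic across both populations.
    """
    p_mean = (p1 + p2) / 2.0
    h_total = 2.0 * p_mean * (1.0 - p_mean)
    if h_total == 0.0:
        return math.nan
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


def highly_differentiated(
    freqs: Dict[str, Tuple[float, float]], threshold: float = 0.95
) -> List[str]:
    """Return the SNP identifiers whose F_ST exceeds the threshold."""
    return [
        snp for snp, (p1, p2) in freqs.items()
        if fst_two_populations(p1, p2) > threshold
    ]


# Toy allele frequencies in two populations (hypothetical values).
example = {
    "snp_a": (1.0, 0.0),  # fixed for opposite alleles
    "snp_b": (0.9, 0.1),  # strongly but not completely differentiated
    "snp_c": (0.3, 0.3),  # no differentiation
}
for snp, (p1, p2) in example.items():
    print(snp, round(fst_two_populations(p1, p2), 2))
print(highly_differentiated(example))
```

In this toy example only the oppositely fixed SNP passes the threshold, which illustrates how strict a cut-off of 0.95 is: even frequencies of 0.9 and 0.1 give a value well below it.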

The majority of these loci are located in intergenic regions, where mechanisms of possible functional constraint remain poorly understood. No missense mutations were identified. The coding sequence variants were located in DNAJC28, involved in Golgi vesicle transport (Yates et al. 2016); ITGB2, which underlies leukocyte adhesion deficiency in cattle and dogs (Daetwyler et al. 2014; Kijas et al. 1999); CHD6, involved in gene regulation (Lathrop et al. 2010); and a novel gene. No obvious relations to GMP phenotypes were evident. Similarly, a study on the domestication of rabbits found that few loci go towards fixation and that none of them were located in coding regions (Carneiro et al. 2014). Rafati et al. (2016) identified only a few high-𝐹𝑆𝑇 SNPs in a case-control study on skeletal atavism in horses, and located them exclusively on unassigned scaffolds.

Figure 6.2: Functional annotation of loci with high 𝐹𝑆𝑇 values. Left: 𝐹𝑆𝑇 > 0.95; Right: 𝐹𝑆𝑇 = 1.

This might imply that single highly differentiated SNPs have limited relevance in the genetics of complex traits, such as growth or fertility, which we focused on. Conversely, using highly differentiated missense mutations as indicators for functional divergence of the GMP stocks, as we did in chapters 4 and 5, might fall short and critically underestimate real divergence. Christe et al. (2017) showed that, despite generally high differentiation, no fixed polymorphisms appeared in a certain region. Additionally, if deleterious mutations were beneficial for a desired phenotype, these could interact through genetic complementation without single loci necessarily being fixed (Conte et al. 2017). Sohail et al. (2017) also found that deleterious mutations seem to function synergistically rather than independently, which would imply that multiple deleterious mutations may have a stronger effect than a single locus. This leads to negative linkage disequilibrium between deleterious mutations and would support the hypothesis of complementation. In any case, it appears that, apart from qualitative traits and a few examples, highly differentiated deleterious mutations do not play a major role in complex traits in livestock.
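The term negative linkage disequilibrium can be made concrete with the standard coefficient D = p_AB - p_A * p_B, where p_AB is the frequency of haplotypes carrying both alleles. The sketch below uses hypothetical frequencies and a hypothetical function name, and it is only meant to show the sign convention, not to reproduce the analysis of Sohail et al. (2017).

```python
def linkage_disequilibrium(p_ab: float, p_a: float, p_b: float) -> float:
    """Linkage disequilibrium coefficient D = p_AB - p_A * p_B.

    p_ab: frequency of haplotypes carrying both alleles A and B
    p_a, p_b: frequencies of allele A and allele B
    """
    for value in (p_ab, p_a, p_b):
        if not 0.0 <= value <= 1.0:
            raise ValueError("frequencies must lie between 0 and 1")
    if p_ab > min(p_a, p_b):
        raise ValueError("p_ab cannot exceed the frequency of either allele")
    return p_ab - p_a * p_b


# Hypothetical example: two deleterious alleles, each at a frequency of 0.2.
# Under independence, 0.2 * 0.2 = 0.04 of the haplotypes would carry both.
# If only 0.01 carry both, the alleles co-occur less often than expected.
d = linkage_disequilibrium(p_ab=0.01, p_a=0.2, p_b=0.2)
print(f"D = {d:.3f}")
```

A negative D means that haplotypes carrying both deleterious alleles are rarer than expected by chance, which is the pattern expected if such combinations are removed more efficiently by selection.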
