Finding CpG Islands - Algebraic Statistics

The EM technique can be particularly used to provide maximum likelihood estimates for the hidden Markov model, because the hidden Markov model is the compositionf =ρ·F of a fully observed toric modelF and a marginalization mappingρ. A version of the EM algorithm for the hidden Markov model is given by Alg. 6.3. In the E-step, the Viterbi algorithm can be used to compute the entitiespσ,τ and in the M-step, Prop. 6.2 is used to compute the locally maximal estimatesθ^∗ andθ^′∗.

Algorithm 6.3EM algorithm for hidden Markov model

Require: Hidden Markov modelf :R^l×(l−1)×R^l×(l^′⁻¹⁾→R^l^′n with parameter spaceΘ1 and observed data u= (uτ)∈N^l^′n

Ensure: Maximum likelihood estimate (ˆθ,θˆ^′)∈Θ1

[Init] Thresholdǫ >0 and parameters (θ, θ^′)∈Θ1

[E-Step] Define matrixU= (uσ,τ)∈R^lⁿ^×l^′n with uσ,τ= uτ·pσ,τ(θ, θ^′)

pτ(θ, θ^′) , σ∈Σⁿ, τ∈Σ^′ⁿ,

[M-Step] Compute solution (θ^∗, θ^′∗)∈Θ1 of the maximation problem in fully observed Markov model [Comp] Ifℓ(θ^∗, θ^′∗)−ℓ(θ, θ^′)> ǫ, setθ:=θ^∗andθ^′:=θ^′∗and resume with E-step

Output ˆθ:=θ^∗, ˆθ^′:=θ^∗

6.6 Finding CpG Islands

TheCpGsites are regions of DNA in a linear DNA strand where a cytosine is next to a guanine linked by one phosphate group. The notation ”CpG” is used to distinguish this linear sequence from the CG base pairs where cytosine and guanine are on different DNA strands linked by hydrogen bonds.

Cytosines inCpGcan be methylated to form 5-methylcytosine. In mammals, 70 % to 80 % of theCpG cytosines are methylated. The methylated cytosines within a gene can change the gene’s expression.

Gene expression is a mechanism to transcribe and translate a gene into a protein.

CpG islands are regions with a high frequency of CpG sites. An objective definition of CpG island is lacking. The usual formal definition is that a CpGisland is a region with at least 200 base pairs in length, aCGpercentage greater than 50 %, and the observed-to-expectedCpGratio greater than 60 %, where the observed-to-expectedCpGratio is given by the ratio between an observed part (i.e., number ofCpGtimes length of sequence) and an expected part (i.e., number ofCtimes number ofG).

In mammals, many genes haveCpGislands at the start of a gene (promoter regions). In mammalian genomes,CpGislands are usually 300 to 3,000 base pairs in length and occur in about 40 % of promoter regions. In particular, the promoter regions in human genomes have aCpGcontent of about 70 %. Over time, methylated cytosines inCpGsites tend to turn into thymines because of spontaneous deamination.

The methylation ofCpGsites within promoter regions can lead to the silencing of the gene. Silencing is a phenomenon which can be found in a number of human cancers such as the silencing of tumor suppressor genes. Age has a strong impact on DNA methylation levels on tens of thousands ofCpGsites.

In computational biology, two questions aboutCpGislands arise. First, decide whether a short stretch of a genomic linear strand lies inside of aCpGisland. Second, find theCgGregions of a long stretch of a genomic linear strand.

We begin with the first question. From a set of human DNA sequences there were extracted a total of 48 putativeCpGislands. From the regions labelled as CpGislands the + model was derived and from the remaining regions the −model was established. The transition probabilities for each model were calculated using the equations

θ⁺_XY = c⁺_XY P

Zc⁺_XZ and θ⁻_XY = c⁻_XY P

Zc⁻_XZ,

wherec⁺_XY andc⁻_XY are the number of times the nucleotideY followed the nucleotideX in aCpGisland and non-island, respectively. In this way, the transition probabilities of the two Markov chain models are given in Tab. 6.5.

+ A C G T

A 0.180 0.274 0.426 0.120 C 0.171 0.368 0.274 0.188 G 0.161 0.339 0.375 0.125 T 0.079 0.355 0.384 0.182

− A C G T

A 0.300 0.205 0.285 0.210 C 0.322 0.298 0.078 0.302 G 0.248 0.246 0.298 0.208 T 0.177 0.239 0.292 0.292

Fig. 6.5.Transition probabilities for + model and−model.

Consider a DNA sequence w. We calculate in both Markov chain models the probabilitiesp⁺(w) andp⁻(w). For discrimination purposes, the log-odd ratio is used,

S(w) = logp⁺(w) p⁻(w) =X

logθ_w⁺_i_,w_i+1 θw⁻i,wi+1

If the value ofS(w) is positive, there is a high chance that the DNA sequence represents aCpGisland;

otherwise, it will not.

Finally, we study the second question. For this, we build a hidden Markov model for the entire DNA sequence that incorporates both of the above Markov chain models. To this end, the states are relabelled such thatA+,C+,G+, andT+ determineCpGisland areas, whileA−,C−,G−, andT−provide non-island areas. For simplicity, we assume that there is a uniform transition probability of switching between island and non-island. The transition probabilities are given in Tab. 6.6, wherep⁺andp⁻= 1−p⁺ are the probabilities for staying inside and outside of aCpGisland.

For instance, we have θT⁺A⁺ = 0.079p⁺ and θT⁺A⁻ = (1−p⁺)/4. The transition probabilities in this model can be set so that within each region they are close to the transition probabilities of the original component model, but there is also a chance of switching into the other region.

Finally, the emission probabilities are all 0 and 1. The stateX⁺orX⁻outputs the symbolX with certainty; that is,

θ^′_X+,Y =θ_X^′ −,Y =

1 ifX =Y, 0 ifX 6=Y.

There are three canonical problems associated with a Hidden Markov model. First, given the pa-rameters of the model, a DNA sequence (output), and a state sequence. Calculate the probability of

6.6 Finding CpG Islands 137

θ A⁺ C⁺ G⁺ T⁺ A⁻ C⁻ G⁻ T⁻

A⁺ 0.180p⁺ 0.274p⁺ 0.426p⁺ 0.120p⁺ ^1−p₄⁺ ^1−p₄⁺ ^1−p₄⁺ ^1−p₄⁺ C⁺ 0.171p⁺ 0.368p⁺ 0.274p⁺ 0.188p⁺ ^1−p₄⁺ ^1−p₄⁺ ^1−p₄⁺ ^1−p₄⁺ G⁺ 0.161p⁺ 0.339p⁺ 0.375p⁺ 0.125p⁺ ^1−p₄⁺ ^1−p₄⁺ ^1−p₄⁺ ^1−p₄⁺ T⁺ 0.079p⁺ 0.355p⁺ 0.384p⁺ 0.182p⁺ ^1−p₄⁺ ^1−p₄⁺ ^1−p₄⁺ ^1−p₄⁺ A⁻ ^1−p₄⁻ ^1−p₄⁻ ^1−p₄⁻ ^1−p₄⁻ 0.300p⁻0.205p⁻0.285p⁻0.210p⁻ C⁻ ^1−p₄⁻ ^1−p₄⁻ ^1−p₄⁻ ^1−p₄⁻ 0.322p⁻0.298p⁻0.078p⁻0.302p⁻ G⁻ ^1−p₄⁻ ^1−p₄⁻ ^1−p₄⁻ ^1−p₄⁻ 0.248p⁻0.246p⁻0.298p⁻0.208p⁻ T⁻ ^1−p₄⁻ ^1−p₄⁻ ^1−p₄⁻ ^1−p₄⁻ 0.177p⁻0.239p⁻0.292p⁻0.292p⁻

Fig. 6.6.Transition probablities for hidden Markov model.

the output sequence when the model runs through the states. For instance, the DNA sequence CGCG generated by the state sequenceC+G+C+G+ has the probability

pC+G+C+G+,CGCG=1

8θ_C^′₊_,CθC+,G+θ_G^′₊_,GθG+,C+θ_C^′₊_,CθC+,G+θ_G^′₊_,G.

Second, given the parameters of the model and a DNA sequence (output). Find the maximum a posteriori probability of generating the output sequence. This problem is tackled by the Viterbi algorithm that finds the most probable path. When this path goes through the + states, aCpGisland is predicted. For instance, consider an output sequence and a corresponding a postoriori state sequence,

A C C C G C C G A A T A T T C G G G C C G A A T A A⁻C⁺ C⁺C⁺G⁺C⁺ C⁺ G⁺A⁻ A⁻ T⁻A⁻ T⁻T⁻ C⁺ G⁺G⁺G⁺ C⁺ C⁺G⁺A⁻ A⁻T⁻ A⁻ The state sequence would indicate that the strand has twoCpGislands.

Third, given a set of DNA sequences (output), find the most likely set of transitions probabilities.

This amounts to the discovery of the parameters of the hidden Markov model given the data set. This problem can be tackled by the EM algorithm.

Im Dokument Algebraic Statistics (Seite 147-151)