Pattern-based motif discovery tool (PEnGmotif)

1.6 Aims and contents of this thesis

2.1.2 Pattern-based motif discovery tool (PEnGmotif)

We introduce PEnGmotif (Pattern-based discovery of enriched genomic or transcriptomic sequence motifs), a tool which learns motifs in PWMs to represent the enriched patterns.

The key idea of PEnGmotif is to find the enriched DNA patterns in the sequences over random expectations from a second-order background model. It first uses an enumerative approach to exhaustively count all the possible non-degenerateW-mers with a fixed length.

One big advantage of such an enumerative approach is that it covers the motif space to a large extend efficiently. To get the probability distribution ofW-mers observed by chance in the negative set, a background model is needed. The simplest model is built upon the mononucleotide frequencies, similar to E.q. (2.7) and (2.8) (note: here aW-mer from the background set is denoted asyto distinguish it from the positive set):

p_bg(y)≈

∏

i=1

p_i(y_i)≈

∏

i=1

f_i(y_i). (2.10)

However, this model does not account for sequence features such as dinucleotide CG repeats, or poly-A or poly-T sequences, which are commonly present in most non-coding regions of the genome.

Model background distribution with interpolated homogeneous Markov models To take into account the nucleotide dependencies in the background sequences where no binding events occur, there are a few approaches used in this field. One approach is to use the shuffled k-mers to construct negative sequence set. Another approach is to build the background model using higher-order Markov models. For SELEX-like experiments, there are library sequences which can be used as background sequences. Here I explain the higher-order background models are built upon akth-order homogeneous (that is, position-nonspecific) Markov model for the pattern with a length ofW:

p_bg(y)≈ p(y1...y_k+1)

W i=k+1

∏

p(yi|y_i₋_k...yi−1). (2.11)

To reduce the amount of parameters for training, we use an interpolated Markov model (first introduced by [54] ) as E.q. (2.9):

p(y_k+1|y₁...y_k)≈n(y₁...y_k) +α_k×p(y_k+1|y₂...y_k)

n(y₁...y_k) +α_k . (2.12) with the order-specific pseudo-factorα_k as follows:

α_k=

( 1, if k=0, β×γ^k⁻¹, if k>0.

withβ =20 andγ =3 as chosen by [35], which improve the performance of using solely thek-mer frequencies and keep robust with increasing motif orders.

We choose the model orderkto be 2 for the background models. Because if the order is lower, it does not capture the genomic features such as the CpG islands; and for the higher orders it is prone to over-fitting.

CalculateP-values forW-mers

To check whether aW-mer is enriched in the given sequences over the random expectation, one method is to compute itsP-value.

The number of occurrences ofW-meryshould follow a Poisson distribution with expec-tation valueµ =L_totp_bg(y), whereL_tot=∑^N_n=1(L_n−W+1)is the total number of positions in the input sequences, and p_bg(y)is the probability ofk-meryaccording to thekth-order background model as computed previously (E.q. (2.11) and (2.12)).

Forn(y)≫1 andn(y)>µ, which is always fulfilled anyway when a motif is significantly over-represented, theP-value can be approximated using Stirling’s approximation by:

P-value(y) =

∑

^∞

k=n

µ^k k!e⁻^µ

= µⁿ

n!e⁻^µ

∑

^∞

k=0

µ^k

(n+1)···(n+k)

≈< µⁿ

n!e⁻^µ

∑

^∞

k=0

µ^k (n+1)^k

≈ µ^n(y)

n(y)!e⁻^µ 1

1−µ/(n(y) +1) logP-value(y)≈n(y)log µ

n(y)+n(y)−µ−1

2log(2πn(y))−log

1− µ

n(y) +1

. (2.13)

CalculateZ-values forW-mers

We then compute squaredZ-scores for all non-degenerateW-mers. TheZ-scores are simply the deviation from expectation divided by the expected standard deviation. The standard deviation of a Poisson distribution is equal to the square root of its mean, therefore:

Z(y) = n(y)−L_totp_bg(y)

pL_totp_bg(y) . (2.14)

Z-scores are used later for comparingW-mers to find the optimal ones.

Find local optimalW-mers

To reduce the amount of non-degenerate patterns and select the representative ones around their neighbours (i.e., those that are at most one substitution away), we apply a recursive function which takes aW-meryand checks for all its neighbouringW-mers. If it finds a neighbouryneigh with a betterZ-score, the function is called recursively withyneigh as an argument. Otherwise, if no neighbour ofyhas betterZ-score thany,yis appended to the list of locally optimalW-mers. By doing this, the number of enrichedW-mers are reduced for further computation.

TransformW-mers to degenerate IUPAC patterns

From the local optimal non-degenerate W-mers, we allow some flexibility by replacing the bases∈ {A,C,G,T} with IUPAC letters∈ {A,C,G,T,S,W,R,Y,M,K,N}(see Table A.1). Similarly to the previous step, we apply an iterative greedy search for local optimal IUPAC patterns by comparing them with their neighbouring substitutions with regard to their P-values orZ-values.

Convert degenerate IUPAC patterns to PWMs

To derive a position weight matrix (PWM) from an IUPAC patterny, the simplest way is to count the occurrence of nucleotideaat position jwithin the motif in the matched sequences, for alla∈ {A,C,G,T}at all positions.

The probabilities of the PWM are then calculated as:

p_pwm(y) =

∏

j=1

p_ja=

∏

j=1

n_ja

∑^A,C,G,T_b n_jb. (2.15)

A drawback of this method is when the amino acidais excluded at a certain position j by an IUPAC lettery_j,n_jacan be zero and so does p_jaequal to zero, which is not allowed in the PWM.

To avoid this from happening, we could use pseudo-counts. But there is a smarter way. It relies on the insight that if we allow any of the four nucleotides at position j, the vast majority of motif matches will still be true positives due to the descriptive power of the otherW−1 IUPAC letters. Therefore, we count the four nucleotides at motif position jfor matches to the patterny_0:_j₋₁Nyj+1:W−1in which we replace the j’th IUPAC letter by an N:

p_ja= n(y_1:_jay_j+2:W)

n(y_1:_jNy_j+2:W), (2.16)

where we denote byn(y)the number of occurrences ofW-meryin the input set. Note that these PWM probabilities can be computed solely from theW-mer counts in a timeO(W×D) that is independent of the size of the input data setL_totand only depends on the degeneracy D={x∈ {A,C,G,T}^W :xmatchesy}of the motify, i.e., the number of differentW-mers it matches.

After efficiently getting PWMs, we refine the models using expectation maximization (EM) algorithm, which will be explain in section2.2.5. PWMs with overlaps are merged and extended, and can serve as seeds to be refined to higher-order Markov models using Bayes’

rules (explained in the following section2.2.1).

Im Dokument Discovery and prediction of protein binding sites in DNA and RNA sequences using Bayesian Markov models (Seite 28-31)