
FACULTY OF MATHEMATICS AND COMPUTER SCIENCE

Institute of Computer Science

Timo Petmanson

Mining Motifs in DNA Regulatory Area

Bachelor's Thesis (6 ECTS)

Supervisor: Sven Laur, D.Sc. (Tech.)

Author: ... ... May 2010

Supervisor: ... ... May 2010

Chairman: ... ... ... 2010

TARTU 2010


Introduction

1 Preliminaries

1.1 DNA

1.2 Gene expression

1.3 Data mining

2 Sequence Mining with Multiple Layers of Data

2.1 Sequences and Scores

2.2 Motifs and Matching

2.3 Support Metrics

2.4 Properties of Support Metrics

2.5 Statistically Relevant Motifs

3 Algorithms and Data Structures

3.1 Compact Encoding of Motifs

3.2 Hash-map of Support Metrics

3.2.1 Including Motifs with Wild Card Characters

3.3 Naive Search based on Apriori

3.4 Pruning Strategies

3.4.1 Maximal Support Estimation Pruning

3.4.2 Safe Over-Approximation Search

3.4.3 Infrequent Sub-Motifs Pruning Method

3.5 Mining Fixed Number of Best Motifs

3.6 Generalized FP-Tree

4 Experimental Results

4.1 Runtime Performance of Search Algorithms

4.1.2 Mining Fixed Number of Frequent Motifs

4.2 Mining Biologically Significant Motifs

4.2.1 Data Preparation

4.2.2 Results

Summary

Summary (in Estonian)

Bibliography

A Multi-constraint miner tool for gene expression analysis

Introduction

All living organisms on Earth are believed to contain genetic information coded in structured collections of genes and non-coding sequences that make up the DNA. The coded information is used to build organisms and maintain them, and it defines a wide range of genetic features that vary from individual to individual and from species to species. The non-coding parts have much of the responsibility for regulating the expression of particular genes. Genes with their non-coding regulatory areas form complex signaling networks that together coordinate the life cycle of an organism. Contemporary methods in genetics like ChIP and micro-array measurements make it possible to measure features of thousands of genes in one experiment, generating huge amounts of data. Therefore, the development of new algorithms and methods able to analyze this data is crucial.

Our contributions include the development of novel methods able to combine different sources of experimental data. In Chapter 2, we formalize the theory describing sequence mining with multiple input sequences and multiple data layers. We also describe how to determine statistically significant motifs using our theory. In Chapter 3, we develop the algorithms MaxSupSearch, SafeApproxSearch, InfreqSearch and GFPSearch, which utilize different pruning strategies. For GFPSearch, we define a generic frequent-pattern tree structure that is a generalization of the FP-tree [JJYR04]. We also develop NBest, which combines any of the previously mentioned algorithms with binary search to get a fixed number of best motifs. We develop SigMotifs, which goes even further by distilling out statistically significant motifs. A performance study of the mentioned algorithms along with experiments on real biological data is given in Chapter 4.


Preliminaries

1.1 DNA

Currently scientists have described about 1.5 million different species: about five thousand mammals, thirty thousand species of fish and over nine hundred thousand insects among others [WCU07]. Some estimates based on comparing samples from various parts of the world's seas suggest that in oceans there may be more than 100 million species of bacteria [MHJ06]. This vast diversity of known and unknown species in Earth's biosphere is believed to have one thing in common: the presence of DNA.

[Figure 1.1 image: double helix with the minor groove and the major groove marked]

Figure 1.1: DNA Double Helix. The distance between strands varies and forms major and minor grooves.

... backbone of a strand contains alternating phosphate and sugar residues linked with bases. These two strands form a structure known as the double helix, seen in Figure 1.1, whose stability is maintained by hydrogen bonds between the bases, see Figure 1.2 [RSM05]. There are four types of bases in DNA: adenine (abbreviated A), thymine (T), guanine (G) and cytosine (C), each of which combined with a sugar and one or more phosphate residues forms a nucleotide. The nucleotides are pairwise aligned, making the structure anti-parallel, where adenine bonds only to thymine and cytosine bonds only to guanine. The endpoints of the strands are called 5' and 3', where the first is defined by a terminal phosphate group and the second by a terminal hydroxyl group [Coh04].

DNA nucleotide sequences are usually written using only the bases from one strand, as the bases on the other strand are complementary. The sequence TATAAA is complementary to ATATTT, for example. The order in which the characters are written depends on the source of the data: sometimes the data is written in the direction from 3' to 5', while at other times it is vice versa.
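The complementarity rule can be illustrated with a few lines of Python. This is only a minimal sketch: the names `COMPLEMENT`, `complement` and `reverse_complement` are hypothetical helpers introduced here, not part of the thesis.

```python
# Base pairing as shown in Figure 1.2: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement, keeping the reading order."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the complement read in the opposite direction."""
    return complement(strand)[::-1]


print(complement("TATAAA"))          # ATATTT
print(reverse_complement("TATAAA"))  # TTTATA
```

The second helper reflects the remark about reading direction: the same complementary strand is spelled differently depending on whether it is read 3' to 5' or 5' to 3'.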

[Figure 1.2 image: chemical structures of the Adenine-Thymine and Cytosine-Guanine base pairs, each base attached to a sugar]

Figure 1.2: TA and GC complementary base pairs. Dotted lines represent hydrogen bonds between bases.

1.2 Gene expression

Gene expression means the rate and amount of RNA transcribed from a gene, which in turn is used to define other proteins necessary for the cell and the organism. The transcription process requires transcription factors, which are special proteins able to recognize and attach to particular fragments in gene promoter areas. The transcription factors are required to recruit RNA polymerase, which is responsible for carrying out the transcription process.

In more complex eukaryotic cells, the promoters are rather diverse and complicated, but the core elements are a transcription start site, which together with RNA polymerase and transcription factor binding sites are essential for initiating the transcription process. Other important binding sites are typically a little farther away in the upstream direction and mainly regulate gene expression by enhancing or restricting recruitment of the main transcription factors. Additionally, there may be even more distant promoter areas that have a weaker influence on the gene regulation.

1.3 Data mining

Data mining is a method in statistics for extracting interesting patterns or knowledge from large amounts of available data. This field is very diverse, as among general data mining solutions there are many specific procedures developed for business, games, social networks et cetera [DP07]. In this work, we concentrate on a specialized area of data mining called sequence mining, which deals with ordered sequences like nucleotide sequences.

The Apriori algorithm is the most general and simple way to find patterns with high support in given data. In standard sequence mining, the support is defined as the number of occurrences of a pattern in input data, which is used to decide whether the pattern is frequent or infrequent based on some defined threshold. The Apriori algorithm assumes that the support is downward closed, which means that for any infrequent pattern there do not exist any frequent sup-patterns. For example, a DNA motif AAATCCC cannot be present in data more times than sequences AAA and CCC, because whenever the supmotif occurs, the two submotifs must also occur. Let us clarify that in this work by a submotif or a subpattern we mean a subsequence with consecutive elements.

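Algorithm 1.3.1 Apriori

1: $F_1 \leftarrow \{\text{frequent one-element patterns}\}$
2: $\ell \leftarrow 2$
3: while $F_{\ell-1} \neq \emptyset$ do
4:     $C_\ell \leftarrow \textsc{GenerateCandidates}(F_{\ell-1})$
5:     $F_\ell \leftarrow \{c \in C_\ell \mid \mathrm{supp}(c) \geq \sigma\}$    ▷ $\sigma$ is the threshold
6:     $\ell \leftarrow \ell + 1$
7: end while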

The Apriori algorithm uses downward closedness as a main pruning feature. In Algorithm 1.3.1 on line 4, the GenerateCandidates procedure takes the set of frequent motifs of length $\ell - 1$ as input and generates possible candidates of length $\ell$. It does not need to consider any non-frequent motifs, as none of their supmotifs are frequent. The algorithm stops running when it has found all frequent motifs in the dataset.

Let us demonstrate Apriori by giving an example. Consider the following sequence:

GCTTATGGTCGCTATGCTTT.

Suppose we want to mine all motifs occurring at least three times in the sequence. This means that we run Apriori with threshold $\sigma = 3$. The set $F_1 = \{\mathtt{T}, \mathtt{G}, \mathtt{C}\}$, because all nucleotides except A are present in the sequence at least three times. Next, we generate candidate motifs of length two by using only frequent elements in $F_1$:

$C_2 = \{\mathtt{TT}, \mathtt{TG}, \mathtt{TC}, \mathtt{GT}, \mathtt{GG}, \mathtt{GC}, \mathtt{CT}, \mathtt{CG}, \mathtt{CC}\}$.

Frequent motifs in this case are

$F_2 = \{\mathtt{TT}, \mathtt{GC}, \mathtt{CT}\}$.

Note that TT matches TTT two times. The next candidate set is

$C_3 = \{\mathtt{TTT}, \mathtt{GTT}, \mathtt{TTG}, \mathtt{CTT}, \mathtt{TTC}, \mathtt{TGC}, \mathtt{GCT}, \mathtt{GGC}, \mathtt{GCG}, \mathtt{CGC}, \mathtt{GCC}, \mathtt{TCT}, \mathtt{CTG}, \mathtt{CCT}, \mathtt{CTC}\}$.

This time there is only one frequent motif:

$F_3 = \{\mathtt{GCT}\}$.

Candidate motifs of length 4:

$C_4 = \{\mathtt{TGCT}, \mathtt{GCTT}, \mathtt{GGCT}, \mathtt{GCTG}, \mathtt{CGCT}, \mathtt{GCTC}\}$.

But none of them is frequent, so $F_4 = \emptyset$ and all frequent motifs in our example are

$F = \{\mathtt{T}, \mathtt{G}, \mathtt{C}, \mathtt{TT}, \mathtt{GC}, \mathtt{CT}, \mathtt{GCT}\}$.
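The walk-through above can be replayed with a small Python sketch. It assumes that support is the number of (possibly overlapping) occurrences and that candidates are built by extending a frequent motif with one frequent letter on either side, which is how the candidate sets in the example appear to be formed. The function names are hypothetical.

```python
def count_occurrences(sequence, motif):
    """Count possibly overlapping occurrences of motif as a contiguous substring."""
    width = len(motif)
    return sum(1 for i in range(len(sequence) - width + 1)
               if sequence[i:i + width] == motif)


def generate_candidates(frequent, letters):
    """Extend every frequent motif by one frequent letter on the left or right."""
    left = {x + m for m in frequent for x in letters}
    right = {m + x for m in frequent for x in letters}
    return left | right


def apriori(sequence, sigma):
    """Return all contiguous motifs occurring at least sigma times."""
    letters = {c for c in set(sequence)
               if count_occurrences(sequence, c) >= sigma}
    frequent = set(letters)          # F_1
    result = set(frequent)
    while frequent:                  # loop until F_l is empty
        candidates = generate_candidates(frequent, letters)
        frequent = {c for c in candidates
                    if count_occurrences(sequence, c) >= sigma}
        result |= frequent
    return result


found = apriori("GCTTATGGTCGCTATGCTTT", 3)
print(sorted(found, key=lambda m: (len(m), m)))
# ['C', 'G', 'T', 'CT', 'GC', 'TT', 'GCT']
```

The loop terminates because a candidate longer than the sequence has zero occurrences, so the frequent set eventually becomes empty.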

There are also algorithms like WINEPI [MTV95], MINEPI [MT96] and SPEXS [Vil02] that are able to mine motifs using pattern matching. Still, while Apriori and other standard sequence mining algorithms are useful, they treat all parts of the sequence with equal weight. In our case, we need methods that are able to work with data that decorates sequences with scores, making some parts of them more relevant than the rest. In Chapter 2, we reformulate standard sequence mining techniques and later devise our own algorithms that handle such requirements.

Sequence Mining with Multiple Layers of Data

In this chapter, we formalize basic notions and concepts like sequences, motifs and support that are needed to develop our methods. We try to develop our mathematical approach such that it would be convenient to study gene regulation when we consider several promoter areas and different properties of these sequences described by layers of experimental data.

We also study different properties and relations between these building blocks that are later used in algorithms to cut down the running times and improve overall performance, although we do not cover algorithmic details and other aspects like data structures, as they are discussed in later chapters.

2.1 Sequences and Scores

The most basic constructs we will be dealing with onwards are DNA sequences and their fragments. In our case, it will be convenient to think of them as a set of nucleotide sequences. Let $S = \{a, b, c, \ldots\}$ denote a set of promoter sequences relevant to some gene. Single elements of a sequence are denoted with subscripts as usual. For example, $a_1$ means the first element and $a_2$ the second element of $a \in S$. There are four types of nucleotides in DNA: adenine, thymine, cytosine and guanine, which correspond to the letters A, T, C, G. We write $a_1 = \mathtt{A}$ if the first element in the nucleotide sequence is adenine and $a_2 = \mathtt{T}$ if the second element is thymine. Let us denote the length of sequence $a$ as $|a|$. It is worth noting that no promoter has length zero, nor are there promoters with infinite length in the real world. However, depending on the particular case, the lengths of the sequences are not usually very short or very long.

In mathematics, a fragment of a sequence is usually written as a list of elements. In this paper, we will be using a shorter notation:

\[ a_{i:j} \overset{\text{def}}{=} a_i, a_{i+1}, \ldots, a_j , \]

where $i$ is the beginning and $j$ is the end of the fragment.

We stated in the introduction of this thesis that we are going to deal with multiple layers of data about promoter sequences. For example, if we have data containing binding and conservation scores from DNA micro-array and sequencing experiments that are associated with the promoters we are interested in, we can portray them as data tracks over the nucleotide sequence, as illustrated in Figure 2.1.

[Figure 2.1 image: conservation and binding score tracks (value axis with ticks 0.5 and 1.0) drawn over the positions of the subsequence ... A T G C C C A T T G C T A G G C ...]

Figure 2.1: An example subsequence having conservation and binding data tracks attached. The scores are variable and may not directly depend on each other.

From a theoretical point of view, it is not important exactly what kind of data we have, as long as we can represent it as numeric values linked to positions in promoter sequences. However, it is important that these values express some property that makes some regions of the nucleotide sequence more relevant than other regions, thus defining important regions with respect to each data track. If we have $n$ data sets containing various scores and $m$ promoters, then we need $n \times m$ mappings that associate relevant scores from ...

... to normalize all data such that all scores fall into the range $[0, 1]$, as shown in Figure 2.1. It simplifies writing some formulas, because we know the maximum possible value of any type of score linked to any position of a nucleotide sequence. Let $\varphi : \mathbb{N} \to \mathbb{R}$ be a mapping that associates numeric scores to all positions of a nucleotide sequence. To make this notation more useful, let us agree that by writing $\varphi(a_i)$ we mean the score that $\varphi$ maps to position $i$ of sequence $a$, and by writing $\varphi(a_{i:j})$ we mean a sequence of scores $\varphi(a_{i:j}) \overset{\text{def}}{=} \varphi(a_i), \varphi(a_{i+1}), \ldots, \varphi(a_j)$. By writing $\overline{\varphi}(a_{i:j})$ we mean the average score

\[ \overline{\varphi}(a_{i:j}) \overset{\text{def}}{=} \frac{1}{j - i + 1} \cdot \sum_{k=i}^{j} \varphi(a_k) . \]
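As a small illustration of the average score of a fragment, the following Python sketch stores one data track as a list of scores in $[0, 1]$ and uses 1-based inclusive positions as in the text. The representation of $\varphi$ as a plain list and the name `average_score` are assumptions made for this example.

```python
def average_score(phi, i, j):
    """Average of the scores at positions i..j (1-based, inclusive)."""
    if not 1 <= i <= j <= len(phi):
        raise ValueError("need 1 <= i <= j <= len(phi)")
    fragment = phi[i - 1:j]          # scores of a_i, ..., a_j
    return sum(fragment) / (j - i + 1)


conservation = [0.25, 0.5, 0.75, 1.0]     # hypothetical track over 4 positions
print(average_score(conservation, 1, 2))  # 0.375
print(average_score(conservation, 2, 4))  # 0.75
```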

2.2 Motifs and Matching

In this section, we introduce motifs, which can be thought of as possible subsequences in a sequence set. Motifs do not directly associate to any data track, but there are several other metrics like support, frequency and significance of a motif in a particular set of promoter sequences. In addition to the nucleotide letters A, T, G, C, motifs may also contain special wild card characters that have special meaning and usage. In this work, we will be using only one such symbol, *, which represents any possible nucleotide in one position. Note that this is different from the standard usage of this symbol in batch-processing or regular-expression applications, where it usually stands for zero or more symbols. In our case, if we have a motif G**A, then by that we mean any motif of length four that starts with the letter G and ends with the letter A.

We will be dealing a lot with fixed-length motifs in later sections, so it is necessary to introduce notation that we can use to refer to all motifs with a fixed length $\ell$. Let $M_\ell$ represent the set of all motifs with length $\ell$, where $\ell \in \mathbb{N}$. We agreed before that all motifs consist of five different letters: the nucleotides and the wild card character. This means that the cardinality of the set $M_\ell$ is equal to $|M_\ell| = 5^\ell$, as there are five different possible elements per position in a motif.

Often it is necessary to refer to single elements of a motif the same way we do for sequences, so given any motif $m \in M_\ell$, let $m_1$ denote the first element of the motif, $m_2$ the second element of the motif et cetera. In addition to that, it is convenient to describe motifs as a concatenation of only a prefix and a suffix part. Let $\|$ be a concatenation operator. If $m^p \in M_p$ and $m^s \in M_s$, then motif $m = m^p \| m^s$, where $m \in M_\ell$ and $\ell = p + s$. Let us illustrate this with an example. If $m^p = \mathtt{AAAT}$ and $m^s = \mathtt{GCCGT}$, then the concatenation $m^p \| m^s$ is AAATGCCGT.

Another very useful notion is a wild card extension of some motif. Namely, if we have some fixed motif length $\ell$ and a motif $m \in M_k$ such that $k \leq \ell$, we may pad the motif with wild card characters until it is $\ell$ elements long. This enables us to easily express motifs we know to have a certain prefix. Let $m^* \in M_\ell$ denote a wild card character extension of motif $m \in M_k$, where $k \leq \ell$, such that the prefix $m^*_{1:k} = m$ and the suffix $m^*_{k+1:\ell} = \mathtt{*} \ldots \mathtt{*}$. For instance, if $m = \mathtt{AATA}$ and we have fixed motif length $\ell = 10$, then the wild card character extension $m^* = \mathtt{AATA******}$. This notion comes in handy when we describe the SafeApproxSearch algorithm in Chapter 3. Let us agree that any motif gained from another motif by replacing one or more nucleotides with wild cards is considered a submotif of the original motif.

In standard sequence mining, the support of some motif is usually measured by how many matches it has in data [DP07]. The number of matches of a motif containing no wild card characters is simply the number of times the motif can be viewed as a subsequence of the given data sequence. With wild card characters this works differently, as a wild card character matches any nucleotide. See Figure 2.2 for an illustration.

Figure 2.2: Three matches of motif ATA*A in a subsequence.

Definition 2.2.1 A motif $m \in M_\ell$ matches some fragment $a_{i:i+\ell-1}$ of sequence $a$, if $m_k = \mathtt{*} \lor m_k = a_{i+k-1}$ for all $k = 1, \ldots, \ell$.

\[ \mathrm{match}(a, m, i) = \begin{cases} 1, & \text{if } m \text{ matches } a_{i:i+\ell-1} \\ 0, & \text{otherwise.} \end{cases} \]

We can extend the number of matches in the sequence $a$ over a set of sequences $S$ by simply adding all the individual counts together:

\[ \mathrm{mcount}(a, m) \overset{\text{def}}{=} \sum_{i=1}^{|a|-\ell+1} \mathrm{match}(a, m, i) \]

\[ \mathrm{mcount}(S, m) \overset{\text{def}}{=} \sum_{s \in S} \mathrm{mcount}(s, m) . \]
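The matching rule and the two match counts can be sketched in Python as follows. Positions are 1-based as in Definition 2.2.1. The function names mirror the notation of the text, but the code itself is an illustrative assumption, and the example sequences are made up.

```python
WILDCARD = "*"


def match(a, m, i):
    """Return 1 if motif m matches the fragment of a starting at position i."""
    if i < 1 or i + len(m) - 1 > len(a):
        return 0
    fragment = a[i - 1:i - 1 + len(m)]
    ok = all(mk == WILDCARD or mk == ak for mk, ak in zip(m, fragment))
    return int(ok)


def mcount(a, m):
    """Number of matches of m in a single sequence a."""
    return sum(match(a, m, i) for i in range(1, len(a) - len(m) + 2))


def mcount_set(S, m):
    """Number of matches of m over all sequences in S."""
    return sum(mcount(a, m) for a in S)


print(mcount("ATATATACA", "ATA*A"))                # 3
print(mcount_set(["ATATATACA", "GATAGA"], "ATA*A"))  # 4
```

Overlapping matches are counted separately, which is why the first sequence contributes three matches.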

2.3 Support Metrics

Standard sequence mining treats all parts of the input sequence with equal importance [DP07]. In our case, we have possibly more than one data track containing variable scores. Therefore, we need to define support in a different way. We base our approach on a formulation given by Sven Laur [Lau09].

The first thing is to extend the notion of support of one single match. The standard way was summing up all matches of a motif in a sequence, such that each match had equal importance. But as we have actual scores linked to positions, we extend the original method by taking the average score of the matching positions of a single match.

Definition 2.3.1 The support of an individual motif $m \in M_\ell$ with respect to some fragment in sequence $a$ starting from position $i$:

\[ \mathrm{supp}(a, m, i) = \begin{cases} \overline{\varphi}(a_{i:i+\ell-1}) & \text{if } \mathrm{match}(a, m, i) = 1 \\ 0 & \text{otherwise.} \end{cases} \]

To extend the support of a motif over a sequence, we have several options. The first idea is to add up all the single supports of the motif. This is the simplest way to go and we refer to this method as additive support onwards. Let us consider another option: instead of adding up the scores, we can take only the maximal score. The plus side of this method is that it promotes motifs that actually have high scores. Additive support can be high even if all the scores of the single matches are low. So, we also consider this method and we will be referring to it as maximal support.

... in a sequence. We might consider average support, which works like additive support, but we divide the result by the number of matches of that motif in the sequence. We could also define supports like weighted additive or weighted average support, which consider some regions of the promoter to be more significant than others. The last two are actually not very reasonable, because we express significance of promoter areas through data tracks anyway.

The average support is actually more relevant, but as it seems to have mixed properties of additive and maximal support, we do not cover this type of support in this work and concentrate on studying only the two mentioned support types.

Definition 2.3.2 Additive support of a motif $m \in M_\ell$ in sequence $a$ is

\[ \mathrm{asupp}(a, m) = \sum_{i=1}^{|a|-\ell+1} \mathrm{supp}(a, m, i) . \]

Definition 2.3.3 Maximal support of a motif $m \in M_\ell$ in sequence $a$ is

\[ \mathrm{msupp}(a, m) = \max \{ \mathrm{supp}(a, m, i) \mid i = 1, \ldots, |a| - \ell + 1 \} . \]

By writing $\mathrm{supp}(a, m)$, we do not refer directly to either of the support types in cases where we are discussing properties that apply to both of them.
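The single-match support and the two ways of extending it over one sequence can be sketched in Python. The sketch assumes one data track stored as a list of scores aligned with the sequence. All names and the example data are hypothetical.

```python
WILDCARD = "*"


def matches(a, m, i):
    """True if motif m matches sequence a at 1-based position i."""
    fragment = a[i - 1:i - 1 + len(m)]
    return len(fragment) == len(m) and all(
        mk == WILDCARD or mk == ak for mk, ak in zip(m, fragment))


def supp_at(a, phi, m, i):
    """Support of one match: average score of the matched positions, else 0."""
    if not matches(a, m, i):
        return 0.0
    return sum(phi[i - 1:i - 1 + len(m)]) / len(m)


def asupp(a, phi, m):
    """Additive support: sum of the supports of all single matches."""
    return sum(supp_at(a, phi, m, i) for i in range(1, len(a) - len(m) + 2))


def msupp(a, phi, m):
    """Maximal support: the best single-match support (0 if no positions)."""
    return max((supp_at(a, phi, m, i)
                for i in range(1, len(a) - len(m) + 2)), default=0.0)


a = "GCTAGCT"
phi = [0.5, 0.5, 0.5, 0.0, 1.0, 1.0, 1.0]   # hypothetical scores in [0, 1]
print(asupp(a, phi, "GCT"))  # 1.5  (0.5 from position 1, 1.0 from position 5)
print(msupp(a, phi, "GCT"))  # 1.0
```

In this toy data the additive support exceeds 1 even though one of the two matches scores only 0.5, which illustrates why the two measures can rank motifs differently.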

Therefore, Definitions 2.3.2 and 2.3.3 are only two possible ways of expressing the support of a motif in one sequence. The biological importance of the two depends mostly on the actual data used. For example, if we consider conservation, then additive support can reveal motifs that coexist in several genetically close species having great structural importance, whereas maximal support takes into account only one occurrence of a motif in a promoter, thus ignoring larger scale structural effects. On the other hand, maximal support can bring up motifs that are most probably recognized by transcription factors, as these proteins require proper locations to enable mounting RNA polymerase and initiate transcription. Thus, the decision about what support type should be used with a particular data track depends on the biological properties of the data.

To extend the notion of support over a set of sequences, we have several options. The first approach is to consider all promoter sequences having ... an average of the support of a motif in all sequences.

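Definition 2.3.4 Additive and maximal support of motif $m \in M_\ell$ in a list of sequences $S$ are

\[ \mathrm{asupp}(S, m) = \frac{1}{|S|} \cdot \sum_{a \in S} \mathrm{asupp}(a, m) \tag{2.1} \]

\[ \mathrm{msupp}(S, m) = \frac{1}{|S|} \cdot \sum_{a \in S} \mathrm{msupp}(a, m) . \tag{2.2} \]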

The second approach is to consider promoters further away from the gene they regulate as having less impact than the ones closer to it. Therefore, we need to give promoter sequences meaningful weights when calculating support. We could propagate these weights directly into the datasets, enabling the direct use of Equations (2.1) and (2.2). Let us also agree that by writing $\mathrm{supp}(S, m)$, we refer to neither additive nor maximal support specifically if we are discussing properties that apply to both of them.

In Chapter 3, we discuss algorithms and data structures and usually need support with respect to all data tracks. Also, let us agree that we have fixed the support type for every data track to make semantics easier. In cases where we need to use both support types, we can view the original track as two duplicates with different support types.

Definition 2.3.5 Given mappings $\varphi_1, \ldots, \varphi_n$, the support of motif $m$ in sequences $S$ with respect to all $n$ data tracks is

\[ \overrightarrow{\mathrm{supp}}(S, m) = (s_1, \ldots, s_n) \]

where

\[ s_i = \begin{cases} \mathrm{asupp}(S, m, \varphi_i) & \text{for additive type of support} \\ \mathrm{msupp}(S, m, \varphi_i) & \text{for maximal type of support.} \end{cases} \]

Mappings $\varphi_1, \ldots, \varphi_n$ given as extra arguments to support operators will be used as the mapping $\varphi$ in Definition 2.3.1.
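A support vector over several promoters and several data tracks could be computed along the following lines. This is a sketch under stated assumptions: each promoter is a pair of a sequence and a list of score tracks, every track has a fixed support type given as the string `"additive"` or `"maximal"`, and the set-level support is the plain average over promoters as in Equations (2.1) and (2.2). The data and names are invented for illustration.

```python
WILDCARD = "*"


def single_supports(seq, scores, motif):
    """Average score of every match of motif in seq (0-based scan)."""
    width = len(motif)
    result = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if all(mk == WILDCARD or mk == ak for mk, ak in zip(motif, window)):
            result.append(sum(scores[start:start + width]) / width)
    return result


def support_vector(promoters, motif, track_types):
    """promoters: list of (sequence, [track_1_scores, ..., track_n_scores])."""
    vector = []
    for t, kind in enumerate(track_types):
        if kind not in ("additive", "maximal"):
            raise ValueError("unknown support type: " + kind)
        per_sequence = []
        for seq, tracks in promoters:
            values = single_supports(seq, tracks[t], motif)
            if kind == "additive":
                per_sequence.append(sum(values))
            else:
                per_sequence.append(max(values, default=0.0))
        vector.append(sum(per_sequence) / len(promoters))
    return vector


promoters = [
    ("GCTAGCT", [[0.5, 0.5, 0.5, 0.0, 1.0, 1.0, 1.0], [1.0] * 7]),
    ("AAGCT",   [[0.0, 0.0, 1.0, 1.0, 1.0],           [0.5] * 5]),
]
print(support_vector(promoters, "GCT", ["additive", "maximal"]))  # [1.25, 0.75]
```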

2.4 Properties of Support Metrics

In the previous section, we defined basic building blocks like sequences, motifs and mappings that gave each position in a promoter sequence one or more weights with regard to available data sets. In this section, we study various properties of the newly defined support measures and notions.

When we compare additive and maximal support, it is rather easy to see that additive support is always at least as big as maximal support, because additive support considers all occurrences of a motif in a sequence, whereas maximal support only considers the occurrence with maximal support.

Proposition 2.4.1 For any motif $m \in M_\ell$ and a set of sequences $S$

\[ \mathrm{msupp}(S, m) \leq \mathrm{asupp}(S, m) . \]

We have not mentioned that there is a problem with the way we defined our support of some motif. Namely, the definition breaks the standard sequence mining principle of being downward closed, as any non-frequent motif may have frequent supmotifs.

Claim 2.4.2 Let $\sigma \in \mathbb{R}$ be the threshold. For any motif $m \in M_\ell$ such that support $\mathrm{supp}(S, m) < \sigma$, there may exist a supmotif $m' \in M_{\ell+k}$ such that $\mathrm{supp}(S, m') \geq \sigma$, whether we consider additive or maximal support.

Proof. For simplicity, let us assume that motif $m'$ and its prefix $m$ each have only one single match, starting at position $i$ of sequence $a$, so that $m$ occupies positions $i$ to $i + \ell - 1$. Since the scores of all positions are in the range $[0, 1]$, $\mathrm{supp}(a, m, i) \in [0, 1]$. Now, let us consider a situation where the support of the prefix $\mathrm{supp}(a, m, i) = \overline{\varphi}(a_{i:i+\ell-1}) < 1$ and the average score of the suffix $\overline{\varphi}(a_{i+\ell:i+\ell+k-1}) = 1$. From here we can conclude that

\[ \mathrm{supp}(S, m') = \frac{1}{|S|} \cdot \frac{\varphi(a_i) + \ldots + \varphi(a_{i+\ell-1}) + k}{\ell + k} > \mathrm{supp}(S, m) . \tag{2.3} \]

If we take $\sigma = \mathrm{supp}(S, m')$, then $m$ is infrequent and $m'$ is frequent. □
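A tiny numeric instance makes the loss of downward closedness concrete. The sketch below uses one made-up sequence with one data track. With a single sequence and a single match per motif, additive and maximal support coincide, so one function suffices. The data and the name `match_supports` are assumptions for illustration only.

```python
def match_supports(seq, scores, motif):
    """Average score of every exact match of motif in seq."""
    width = len(motif)
    return [sum(scores[i:i + width]) / width
            for i in range(len(seq) - width + 1)
            if seq[i:i + width] == motif]


seq = "GCT"
scores = [0.25, 1.0, 1.0]   # low score under G, maximal scores under C and T
sigma = 0.5

supp_m = sum(match_supports(seq, scores, "G"))          # submotif, length 1
supp_m_prime = sum(match_supports(seq, scores, "GCT"))  # supmotif, length 3

print(supp_m, supp_m >= sigma)              # 0.25 False
print(supp_m_prime, supp_m_prime >= sigma)  # 0.75 True
```

Here the supmotif value agrees with Equation (2.3) for $\ell = 1$, $k = 2$ and $|S| = 1$: $(0.25 + 2) / 3 = 0.75$.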

The above proof raises another question: if we know the support of a motif, then what is the maximal possible support of any supmotif? We can approach the answer the same way we proved the above claim. Namely, if we consider the support of the suffix of a possible supmotif to have the maximal possible value, then we can calculate the maximal possible support of the supmotif.

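Proposition 2.4.3 For any motif $m' \in M_{\ell+k}$ and its submotif $m \in M_\ell$ in a set of sequences $S$

\[ \mathrm{msupp}(S, m') \leq \frac{\ell \cdot \mathrm{msupp}(S, m) + |S| \cdot k}{\ell + k} \]

\[ \mathrm{asupp}(S, m') \leq \frac{\ell \cdot \mathrm{asupp}(S, m) + \mathrm{mcount}(S, m) \cdot k}{\ell + k} . \]

Proof. Let us consider a set of fragments $\{a_{p:q}, b_{r:s}, \ldots, z_{t:u}\}$ that represent the positions on every promoter where motif $m$ has the highest support. In that case

\[ \mathrm{msupp}(S, m) = |S|^{-1} \left( \overline{\varphi}(a_{p:q}) + \overline{\varphi}(b_{r:s}) + \ldots + \overline{\varphi}(z_{t:u}) \right) . \]

If we now consider $m$ as a prefix of $m'$, then in a way analogous to Equation (2.3) we can estimate that the support of $m'$ cannot be larger than

\[ \mathrm{msupp}(S, m') \leq \frac{1}{|S|} \left( \frac{\ell \cdot \overline{\varphi}(a_{p:q}) + k}{\ell + k} + \frac{\ell \cdot \overline{\varphi}(b_{r:s}) + k}{\ell + k} + \ldots + \frac{\ell \cdot \overline{\varphi}(z_{t:u}) + k}{\ell + k} \right) = \frac{1}{|S|} \cdot \frac{\ell \cdot \left( \overline{\varphi}(a_{p:q}) + \overline{\varphi}(b_{r:s}) + \ldots + \overline{\varphi}(z_{t:u}) \right) + |S| \cdot k}{\ell + k} . \]

Combined with Equation (2.2), the right-hand side equals $(\ell \cdot \mathrm{msupp}(S, m) + k) / (\ell + k)$, which is at most

\[ \mathrm{msupp}(S, m') \leq \frac{\ell \cdot \mathrm{msupp}(S, m) + |S| \cdot k}{\ell + k} . \tag{2.4} \]

Additive support takes into account all occurrences of $m$, thus replacing the relevant parts in Equation (2.4), we get

\[ \mathrm{asupp}(S, m') \leq \frac{\ell \cdot \mathrm{asupp}(S, m) + \mathrm{mcount}(S, m) \cdot k}{\ell + k} . \]

□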
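The two bounds of Proposition 2.4.3 are plain arithmetic and can be evaluated without touching the data again. The sketch below simply transcribes the stated formulas. The function names, and the idea of comparing the bound with a threshold `sigma` to decide whether longer supmotifs are still worth exploring, are assumptions made here for illustration.

```python
def msupp_upper_bound(l, k, msupp_m, n_sequences):
    """Bound on msupp(S, m') for a supmotif that is k symbols longer than m."""
    return (l * msupp_m + n_sequences * k) / (l + k)


def asupp_upper_bound(l, k, asupp_m, mcount_m):
    """Bound on asupp(S, m') for a supmotif that is k symbols longer than m."""
    return (l * asupp_m + mcount_m * k) / (l + k)


def may_have_frequent_supmotif(bound, sigma):
    """A supmotif can only reach the threshold if the bound does."""
    return bound >= sigma


print(msupp_upper_bound(3, 1, 0.5, 2))   # 0.875
print(asupp_upper_bound(3, 1, 1.5, 4))   # 2.125
print(may_have_frequent_supmotif(msupp_upper_bound(3, 1, 0.5, 2), 0.9))  # False
```

Since all scores lie in $[0, 1]$, maximal support itself can never exceed 1, so a caller may additionally cap the first bound at 1.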

Claim 2.4.2 implies that frequent motifs may have infrequent submotifs. However, it is important to note that any frequent motif must also have at least one frequent submotif with either additive or maximal type of support.

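Lemma 2.4.4 For all positions $i < j \leq k$ in sequence $a$

\[ \overline{\varphi}(a_{i:k}) \leq \max \{ \overline{\varphi}(a_{i:j-1}), \overline{\varphi}(a_{j:k}) \} . \]

Proof. The first possibility is that $\overline{\varphi}(a_{i:j-1}) > \overline{\varphi}(a_{j:k})$ or $\overline{\varphi}(a_{i:j-1}) < \overline{\varphi}(a_{j:k})$. In that case $\overline{\varphi}(a_{i:k}) < \max \{ \overline{\varphi}(a_{i:j-1}), \overline{\varphi}(a_{j:k}) \}$, as the average score of the supsequence must be lower than that of the subsequence with the maximal average score. The second possibility is that $\overline{\varphi}(a_{i:j-1}) = \overline{\varphi}(a_{j:k})$, thus $\overline{\varphi}(a_{i:k}) = \overline{\varphi}(a_{i:j-1}) = \overline{\varphi}(a_{j:k})$. □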

Theorem 2.4.5 Any frequent motif $m \in M_{p+s}$ can be partitioned into two submotifs $m^p \in M_p$ and $m^s \in M_s$, such that $m = m^p \| m^s$ and either the prefix $m^p$ or the suffix $m^s$ is frequent.

Proof. If we have only a single match of the motif $m$ in sequence $a$, then according to Lemma 2.4.4

\[ \mathrm{supp}(a, m, i) \leq \max \{ \mathrm{supp}(a, m^p, i), \mathrm{supp}(a, m^s, i + p) \} . \]

Now, suppose we have more matches of $m$ in one single sequence $a$. As maximal support only considers one occurrence of $m$, the theorem holds according to Lemma 2.4.4. By Definition 2.3.2, the additive support is the sum of supports of all single matches. Let $\mathrm{asupp}(a, m) = \overline{\varphi}(a_{i_1:j_1}) + \ldots + \overline{\varphi}(a_{i_n:j_n})$, where $n = \mathrm{mcount}(a, m)$ and $i_r, j_r$ denote the start and end locations of the $r$-th occurrence. Let $\varphi(m_t) \overset{\text{def}}{=} \varphi(a_{i_1+t-1}) + \ldots + \varphi(a_{i_n+t-1})$ be the total score collected at offset $t$ of the motif over all occurrences, and let $\overline{\varphi}(m_{t:u}) \overset{\text{def}}{=} (\varphi(m_t) + \varphi(m_{t+1}) + \ldots + \varphi(m_u)) (u - t + 1)^{-1}$, so that $\mathrm{asupp}(a, m) = \overline{\varphi}(m_{1:p+s})$. Similarly to Lemma 2.4.4, we could show that

\[ \overline{\varphi}(m_{1:p+s}) \leq \max \{ \overline{\varphi}(m_{1:p}), \overline{\varphi}(m_{p+1:p+s}) \} \tag{2.5} \]

which means that $\mathrm{asupp}(a, m) \leq \max \{ \mathrm{asupp}(a, m^p), \mathrm{asupp}(a, m^s) \}$, because $\overline{\varphi}(m_{1:p})$ and $\overline{\varphi}(m_{p+1:p+s})$ are the sums of the supports of $m^p$ and $m^s$ over the occurrences of $m$. Note that any extra occurrences of prefix or suffix motifs in input sequences only increase the right-hand side and do not invalidate Equation (2.5).

For either additive or maximal support over a set of sequences $S$, everything works similarly to the above steps, but we have to take into account the constant $|S|^{-1}$. □

... pieces such that at least one of them is frequent. We know that partitioning works for two submotifs, thus we can iteratively continue and create as many partitions of the original motif as necessary, because always at least one partition has to be frequent.

Corollary 2.4.6 Given a frequent motif $m \in M_\ell$ and submotifs $m^1, m^2, \ldots, m^n$ that partition $m$ into $n$ pieces, at least one of the submotifs must be frequent.

So far we have described the properties of additive and maximal support without concentrating too much on the actual contents of the motifs. However, we used motif lengths in Proposition 2.4.3 to estimate the maximal possible support of a supmotif. While this is useful knowledge, most of the time these estimations do not work best, because they are based solely on the motif support, length and possible supmotif length. We can improve this situation by introducing wild card characters.

Proposition 2.4.7 Given motifs $m \in M_\ell$ and $m' \in M_\ell$, where $m'$ is constructed from motif $m$ in such a way that one nucleotide in $m$ is replaced by a wild card character *, the additive or maximal support

\[ \mathrm{supp}(S, m) \leq \mathrm{supp}(S, m') . \]

For example, consider a motif $m^p = \mathtt{GCT}$ as a prefix of a longer supmotif $m \in M_{10}$. If we want to know the maximal support of any such supmotif, we can calculate $\mathrm{supp}(S, \mathtt{GCT*******})$. It certainly does not give a higher estimation than the formerly described method. On the other hand, it requires a query on the database, which depending on the situation can be costly. We describe both approaches more thoroughly in Chapter 3.

2.5 Statistically Relevant Motifs

In previous sections, we discussed how to determine if a motif is frequent. In this section, we describe how to go even further by deciding which frequent motifs are statistically more significant. By this, we actually want to measure the amount of surprise for every frequent motif. In our case, we may measure surprise individually even for every data track and we have ...

... the promoter sequences. Then we can compare the support measures of the permuted dataset against the original one. If motifs in the original data have higher supports, then they are surprising in that sense. To be more specific, we may generate a large amount of datasets by randomly permuting the original sequences. If we consider only one data track, then we can sort the motifs in decreasing order by their support such that the motif with the highest support is the first in the resulting list. Then, for every motif at position $i$ in the list, we can calculate how many motifs at the $i$'th position in the generated datasets had support as high as the original motif. We can write it down as

\[ p = \Pr[\mathrm{supp}(S', m') \geq \mathrm{supp}(S, m)] \]

where $S$ is the original dataset, $S'$ is the permuted dataset, $m$ is the original motif at the $i$'th position and $m'$ is the motif in $S'$ at the same position. The value $p$ is called the p-value in statistics and, in our case, represents the probability of having the support in a random dataset at least as extreme as in the original one. Therefore, the smaller the $p$, the more surprising is the motif $m$.

We can calculate a p-value for every frequent motif and for every data track. Of course, we might want to calculate only a single p-value for every motif, but the problem is with sorting frequent motifs. This actually can be done, as discussed in Chapter 3, but having a p-value with respect to each data track may reveal interesting properties of the motifs. We will omit the exact algorithm for calculating p-values, but briefly discuss it later in Section 3.5. Let us refer to this algorithm as SigMotifs onwards.
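The rank-wise comparison described in this section can be sketched as an empirical p-value computation. This is not the SigMotifs algorithm itself, whose details are deferred to Section 3.5. It is only a hedged illustration for one data track. It assumes the caller has already mined the supports in the original data and in each permuted dataset, and it treats a missing rank in a permuted dataset as support 0, which is a convention chosen here.

```python
def rank_p_values(original_supports, permuted_supports):
    """Empirical p-value for every rank position of the original motif list.

    original_supports: supports of the motifs mined from the original data.
    permuted_supports: one list of supports per permuted dataset.
    """
    original = sorted(original_supports, reverse=True)
    permuted = [sorted(run, reverse=True) for run in permuted_supports]
    p_values = []
    for i, observed in enumerate(original):
        # Count permuted datasets whose i-th best support is at least as high.
        hits = sum(1 for run in permuted
                   if (run[i] if i < len(run) else 0.0) >= observed)
        p_values.append(hits / len(permuted))
    return p_values


original = [0.9, 0.6, 0.3]
permuted = [[0.5, 0.4, 0.35], [0.95, 0.2, 0.1], [0.7, 0.65, 0.3], [0.4, 0.3]]
print(rank_p_values(original, permuted))  # [0.25, 0.25, 0.5]
```

With only four permuted datasets the estimates are very coarse. In practice a large number of permutations would be needed for the fractions to be meaningful.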

Algorithms and Data Structures

In this chapter, we will devise algorithms based on the formalization and other ideas described in Chapter 2. We start off by describing a compact encoding of motifs and continue by developing algorithms with different pruning methods and capabilities.

3.1 Compact Encoding of Motifs

It turns out that there is a rather straightforward way to encode fixed-length motifs as unique integers. If we consider the nucleotides and the wild card character as a set $X = \{\mathtt{A}, \mathtt{T}, \mathtt{G}, \mathtt{C}, \mathtt{*}\}$ and have another set of the same size $Y = \{0, 1, 2, 3, 4\}$, then we can define a mapping $\pi : X \to Y$ such that $\pi(\mathtt{A}) = 0$, $\pi(\mathtt{T}) = 1$, $\pi(\mathtt{G}) = 2$, $\pi(\mathtt{C}) = 3$, $\pi(\mathtt{*}) = 4$, which enables us to represent a motif $m \in M_\ell$ as an integer

\[ 5^0 \pi(m_1) + 5^1 \pi(m_2) + \ldots + 5^{\ell-1} \pi(m_\ell) . \tag{3.1} \]

For our convenience, let us agree that by writing $\pi(m)$, where $m \in M_\ell$, we mean the integral representation of the motif given in Equation (3.1).

This representation makes it easy to hash any motif of length $\ell$ and store it in a hash-table, as for every fixed-length motif the integral representation is unique.
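The base-5 encoding of Equation (3.1) can be sketched in a few lines of Python. The names `PI`, `encode` and `decode` are hypothetical. The decoder is included only to show that the representation is invertible for a fixed motif length.

```python
PI = {"A": 0, "T": 1, "G": 2, "C": 3, "*": 4}
SYMBOLS = "ATGC*"   # SYMBOLS[PI[x]] == x


def encode(motif):
    """Integral representation: the first symbol is the lowest base-5 digit."""
    return sum(PI[symbol] * 5 ** k for k, symbol in enumerate(motif))


def decode(code, length):
    """Recover the motif of the given length from its integral representation."""
    symbols = []
    for _ in range(length):
        code, digit = divmod(code, 5)
        symbols.append(SYMBOLS[digit])
    return "".join(symbols)


print(encode("ATCC"))   # 455
print(encode("TCCG"))   # 341
print(decode(455, 4))   # ATCC
```

The length has to be known when decoding, because trailing A symbols contribute zero digits and would otherwise be lost.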

If the motif length $\ell$ is small enough, then we could use a hash-map of size $5^\ell$. This way we could directly use the value $\pi(m)$ as a key to store the motif's support metrics, and this guarantees constant time $O(1)$ access as there would be no collisions.

... motifs. For example, there are $5^8 = 390625$ possible motifs of length 8 including wild card characters. For the yeast S. cerevisiae, the promoter lengths are not usually longer than a few thousand base pairs. Therefore, if we have one promoter with a length of 3000 base pairs, we can actually have a maximum of $3000 - 8 + 1 = 2993$ different motifs of length eight without wild card characters.

3.2 Hash-map of Support Metrics

The integral representation of motifs allows us to effectively build a hash-map containing support metrics of all motifs found in promoter sequences. Consider a sequence $a = \mathrm{ATCCGTCCG}$. If we are interested in motifs of length 4, then motif $m^1 = \mathrm{ATCC}$ matches the first position of $a$ and motif $m^2 = \mathrm{TCCG}$ matches the second position of $a$. The integral representations are the following:

$$\pi(m^1) = 1 \cdot 0 + 5 \cdot 1 + 25 \cdot 3 + 125 \cdot 3 = 455$$

$$\pi(m^2) = 1 \cdot 1 + 5 \cdot 3 + 25 \cdot 3 + 125 \cdot 2 = 341.$$

It turns out that we can update the integral representation of $m^1$ to $m^2$ in constant time. By Equation (3.1), the integral representation of motif $m^1$ is $\pi(m^1) = 5^0 \cdot \pi(a_1) + 5^1 \cdot \pi(a_2) + 5^2 \cdot \pi(a_3) + 5^3 \cdot \pi(a_4)$. By subtracting the first element $5^0 \cdot \pi(a_1)$, dividing the result by five and adding $5^3 \cdot \pi(a_5)$, we get

$$\frac{\pi(m^1) - 5^0 \cdot \pi(a_1)}{5} + 5^3 \cdot \pi(a_5) = 5^0 \cdot \pi(a_2) + 5^1 \cdot \pi(a_3) + 5^2 \cdot \pi(a_4) + 5^3 \cdot \pi(a_5),$$

which is equal to $\pi(m^2)$. So in our example, where $\pi(m^1) = 455$, we can calculate

$$\pi(m^2) = \frac{\pi(m^1) - 5^0 \cdot \pi(a_1)}{5} + 5^3 \cdot \pi(a_5) = \frac{455 - 0}{5} + 125 \cdot 2 = 341.$$
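The constant-time update can be sketched as a small rolling computation. This is only an illustration under the mapping $\pi$ of Section 3.1; the generator name `rolling_codes` is hypothetical.

```python
PI = {"A": 0, "T": 1, "G": 2, "C": 3}


def rolling_codes(seq: str, length: int):
    """Yield (start, code) for every window of `length` symbols in `seq`."""
    if len(seq) < length:
        return
    top = 5 ** (length - 1)  # weight of the last position of a window
    # The first window is encoded directly, as in Equation (3.1).
    code = sum(PI[ch] * 5 ** j for j, ch in enumerate(seq[:length]))
    yield 0, code
    for start in range(1, len(seq) - length + 1):
        leaving = PI[seq[start - 1]]             # digit that had weight 5**0
        entering = PI[seq[start + length - 1]]   # digit that gets the top weight
        # Subtract the first element, divide by five, add the new last element.
        code = (code - leaving) // 5 + top * entering
        yield start, code


print(list(rolling_codes("ATCCGTCCG", 4))[:2])   # [(0, 455), (1, 341)]
```

The integer division is exact here, because after removing the lowest digit every remaining term is a multiple of five.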

Analogously, we can do this with the support of single matches for all tracks. This is important because we can calculate all support metrics of all motifs present in the data in one pass. The negative side effect of this approach with scores is possibly greater floating-point rounding errors. But we can reduce them effectively by recalculating the scores from the data tracks after every 100 or 1000 steps. This, of course, is not an issue with the integral representation.
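A possible way to combine the incremental score update with periodic recalculation is sketched below. It assumes that the single support of a match on one track is the mean of the track scores under the window, which is the reading suggested by the worked example in this section; the function name `window_means` and the parameter `refresh` are hypothetical.

```python
def window_means(scores, length, refresh=1000):
    """Mean score of every window of `length` consecutive positions.

    The running total is updated incrementally and recomputed from the raw
    track every `refresh` steps to limit accumulated rounding error.
    """
    if len(scores) < length:
        return []
    total = sum(scores[:length])
    means = [total / length]
    for start in range(1, len(scores) - length + 1):
        if start % refresh == 0:
            # Periodic recalculation from the data track resets the error.
            total = sum(scores[start:start + length])
        else:
            # Constant-time update: add the entering score, drop the leaving one.
            total += scores[start + length - 1] - scores[start - 1]
        means.append(total / length)
    return means
```

A smaller `refresh` trades some speed for tighter control over rounding drift; with `refresh=1` the function degenerates to recomputing every window from scratch.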


Let us give an in-depth example. Consider two sequences $a = \mathrm{ATCCGTCCG}$, $b = \mathrm{TTCCG}$ and two mappings $\phi_1, \phi_2$ representing two data tracks such that

$$\phi_1(a_{1:9}) = 1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5$$

$$\phi_2(a_{1:9}) = 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0$$

$$\phi_1(b_{1:5}) = 1.0, 1.0, 1.0, 1.0, 1.0$$

$$\phi_2(b_{1:5}) = 0.5, 0.5, 0.5, 0.5, 0.5.$$

We can traverse the promoters step-by-step, such that after every cycle the hash-map contains up-to-date support metrics based on the seen occurrences of motifs. All unseen occurrences are regarded as having single supports equal to zero. In our example, the details of traversing $a$ and $b$ are given in the table below.

| Step | $m$ | $\phi_1(m)$ | $\phi_2(m)$ | Comment |
|------|------|-------|-------|---------|
| 1 | ATCC | 1.0 | 0.5 | Add ATCC to hash-map. |
| 2 | TCCG | 1.0 | 0.5 | Do the same with TCCG. |
| 3 | CCGT | 0.875 | 0.625 | Keep adding unseen motifs into the hash-map with their support metrics (steps 3 to 5). |
| 4 | CGTC | 0.75 | 0.75 | |
| 5 | GTCC | 0.625 | 0.875 | |
| 6 | TCCG | 0.5 | 1.0 | Update support metrics of TCCG. |
| 7 | TTCC | 1.0 | 0.5 | We are processing $b$ now. |
| 8 | TCCG | 1.0 | 0.5 | Update support metrics of TCCG. |

For example, consider motif TCCG. For additive support over all sequences we sum $1.0/2 + 0.5/2 + 1.0/2$ for $\phi_1$ and $0.5/2 + 1.0/2 + 0.5/2$ for $\phi_2$. We divide the scores by two, due to Definition 2.3.4. After every update, the additive supports are up-to-date based on the data seen so far. For maximal support, we need to do more book-keeping, because when we find an occurrence with a bigger maximal score in a sequence, we have to cancel the effect of the previous occurrence. For example, the maximal support after step two is $0.5/2$ for $\phi_2$. At step 6, we discover that it should be $1.0/2$ instead, therefore we subtract $0.5/2$ from the variable containing the support and add $1.0/2$.
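The one-pass construction can be sketched in Python as follows. This is an illustration rather than the thesis implementation, and it rests on assumptions read off the example: the single support of a match is the mean track score under the window, additive support sums single supports divided by the number of sequences, and maximal support sums the per-sequence maxima divided by the number of sequences. For readability the sketch uses the motif string as the key instead of $\pi(m)$ and recomputes window means directly; `build_support_map` is a hypothetical name.

```python
from collections import defaultdict


def build_support_map(sequences, tracks, length):
    """One pass over all sequences.

    tracks[t][i] is the score list of track t on sequence i. Returns
    {motif: {"count", "asupp", "msupp"}} with one value per track.
    """
    n_seq, n_tracks = len(sequences), len(tracks)
    table = defaultdict(lambda: {"count": 0,
                                 "asupp": [0.0] * n_tracks,
                                 "msupp": [0.0] * n_tracks})
    for i, seq in enumerate(sequences):
        best = {}  # per-sequence maxima seen so far: motif -> list per track
        for start in range(len(seq) - length + 1):
            motif = seq[start:start + length]
            entry = table[motif]
            entry["count"] += 1
            seen = best.setdefault(motif, [0.0] * n_tracks)
            for t in range(n_tracks):
                single = sum(tracks[t][i][start:start + length]) / length
                entry["asupp"][t] += single / n_seq
                if single > seen[t]:
                    # Cancel the previous maximum of this sequence, add the new one.
                    entry["msupp"][t] += (single - seen[t]) / n_seq
                    seen[t] = single
    return dict(table)


seqs = ["ATCCGTCCG", "TTCCG"]
phi1 = [[1.0] * 5 + [0.5] * 4, [1.0] * 5]
phi2 = [[0.5] * 5 + [1.0] * 4, [0.5] * 5]
print(build_support_map(seqs, [phi1, phi2], 4)["TCCG"])
```

For TCCG the additive supports come out as 1.25 and 1.0, matching the sums written out above, and the maximal support on the second track ends at 0.75 after the correction at step 6 and the contribution of $b$.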

With this kind of hash-map construction we calculate all the metrics on the fly. Therefore, we avoid any post-processing, because calculating the support measures over all sequences would otherwise require intermediate lists containing scores of single supports. With motifs without wild card characters, this would not be a very big memory overhead, but otherwise it


$$O\left(n \cdot c \cdot \sum_{s \in S} |s|\right)$$

where $n$ is the number of data tracks and $c$ is the complexity of updating the support of a motif in the hash-map.

3.2.1 Including Motifs with Wild Card Characters

We will discuss SafeApproxSearch in Section 3.4.2, where hash-maps are required to also contain supports of all wild character extensions. This requires us to modify the method described earlier. The integral representation allows us to precompute the suffix parts of all extensions. Let $w^i$ be the suffix part of some motif $m$ of length $\ell$, such that $1 \leq i \leq \ell$ and $m_{i:\ell} = * \ldots *$. Then $\pi(w^i) = 5^{i-1} \pi(*) + \ldots + 5^{\ell-1} \pi(*)$. If we now have the integral representation of a prefix $m^p$, then $\pi(m^p) + \pi(w^i)$ will yield the integral representation of the wild card character extension. In the hash-map construction phase, it requires $\ell$ steps instead of one to include the support metrics of all wild card character extensions, therefore the complexity is

$$O\left(n \cdot c \cdot \ell \cdot \sum_{s \in S} |s|\right).$$
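The precomputation of suffix parts can be sketched as follows. The helper names `suffix_codes` and `extension_codes` are hypothetical; the sketch only assumes the base-5 encoding of Equation (3.1) with $\pi(*) = 4$.

```python
PI = {"A": 0, "T": 1, "G": 2, "C": 3, "*": 4}


def encode_motif(motif):
    """Base-5 integer of Equation (3.1)."""
    return sum(PI[ch] * 5 ** j for j, ch in enumerate(motif))


def suffix_codes(length):
    """suffix[i] = pi(w^i): wild cards at 1-based positions i..length."""
    suffix = [0] * (length + 2)  # suffix[length + 1] = 0 is the empty suffix
    for i in range(length, 0, -1):
        suffix[i] = suffix[i + 1] + 5 ** (i - 1) * PI["*"]
    return suffix


def extension_codes(window, suffix):
    """Codes of the wild card extensions of every prefix of `window`."""
    codes, prefix_code = [], 0
    for p, ch in enumerate(window, start=1):  # p is the prefix length
        prefix_code += PI[ch] * 5 ** (p - 1)
        codes.append(prefix_code + suffix[p + 1])
    return codes


suffix = suffix_codes(4)
print(extension_codes("ATCC", suffix))  # codes of A***, AT**, ATC*, ATCC
```

Each window therefore produces $\ell$ keys instead of one, which is the source of the extra factor $\ell$ in the complexity.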

3.3 Naive Search based on Apriori

The simplest search method is based on the Apriori principle described in Chapter 1. Namely, we can mine all motifs present in the input sequences by setting the threshold $\sigma = 1$ with Apriori and then check if they are frequent in our terms. This is actually a composition of Apriori and a filtering function. In our case, it is better to implement this as a depth-first search algorithm, because the breadth-first nature of Apriori causes too much memory overhead when mining longer motifs. Algorithm 3.3.1 incorporates the composition of Apriori and the filtering function. On lines 10–12, we see the candidate generation part of the algorithm. Note that we always use motifs A, T, G, C for extension. This is due to the fact that there are rarely cases where a nucleotide in promoter sequences is missing. The Apriori pruning principle is in action on lines 2–3 and the filtering function is given on lines 4–9.

Algorithm 3.3.1

 1: procedure NaiveSearch($S, \vec{\sigma}, m, \ell$)
 2:     if $\mathrm{mcount}(S, m) = 0$ then
 3:         return
 4:     else if $|m| = \ell$ then
 5:         if IsFrequent($\vec{\sigma}$, $\overrightarrow{\mathrm{supp}}(S, m)$) then
 6:             SaveMotif($m$)
 7:         end if
 8:         return
 9:     end if
10:     for $e \in \{\mathrm{A}, \mathrm{T}, \mathrm{G}, \mathrm{C}\}$ do
11:         NaiveSearch($S, \vec{\sigma}, m \,\|\, e, \ell$)
12:     end for
13: end procedure

Function IsFrequent checks whether $s_i \geq \sigma_i$ holds for every threshold, where $\vec{s} = \overrightarrow{\mathrm{supp}}(S, m)$. Recall that the $\overrightarrow{\mathrm{supp}}$ operator returns a vector of values, where each element determines the support per one data track according to the additive or maximal support type. Also, if $\overrightarrow{\mathrm{supp}}$ and $\mathrm{mcount}$ are implemented using data structures like the hash-map discussed in the previous section, then these need to be constructed before running this algorithm.

As an example, by calling NaiveSearch($S, \vec{\sigma}, \theta, 8$), where $\theta$ is the empty zero-length motif, $S$ is the set of sequences and $\vec{\sigma}$ is the vector of thresholds, we find all frequent motifs of length 8. The complexity of NaiveSearch is $O(4^\ell)$, where $\ell$ is the fixed motif length.
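A depth-first version of the naive search can be sketched in Python. The sketch makes several assumptions that are not fixed by the text: supports are recomputed by brute force instead of being looked up in a prebuilt hash-map, only the additive support type is used (mean window score, divided by the number of sequences), and a motif is treated as frequent when its support reaches the threshold on every track, which is the reading consistent with the pruning inequalities of Section 3.4. All function names are hypothetical.

```python
def occurrences(sequences, motif):
    """Yield (sequence index, start) of every exact match of `motif`."""
    for i, seq in enumerate(sequences):
        for start in range(len(seq) - len(motif) + 1):
            if seq.startswith(motif, start):
                yield i, start


def mcount(sequences, motif):
    return sum(1 for _ in occurrences(sequences, motif))


def additive_support(sequences, tracks, motif):
    """One value per track: sum of mean window scores divided by |S|."""
    supp = [0.0] * len(tracks)
    for i, start in occurrences(sequences, motif):
        for t, track in enumerate(tracks):
            window = track[i][start:start + len(motif)]
            supp[t] += sum(window) / len(motif) / len(sequences)
    return supp


def is_frequent(thresholds, supports):
    return all(s >= sigma for sigma, s in zip(thresholds, supports))


def naive_search(sequences, tracks, thresholds, length, motif="", found=None):
    found = [] if found is None else found
    if mcount(sequences, motif) == 0:
        return found                       # Apriori pruning: motif never occurs
    if len(motif) == length:
        if is_frequent(thresholds, additive_support(sequences, tracks, motif)):
            found.append(motif)            # filtering step
        return found
    for e in "ATGC":                       # candidate generation
        naive_search(sequences, tracks, thresholds, length, motif + e, found)
    return found
```

Called with the two example sequences and tracks of Section 3.2, motif length 4 and thresholds of 1.0 on both tracks, the sketch keeps only TCCG, whose additive supports are 1.25 and 1.0.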

3.4 Pruning Strategies

In this section, we describe different pruning strategies, which can be used to make more efficient algorithms compared to NaiveSearch. All these methods are based on properties studied in Chapter 2.

3.4.1 Maximal Support Estimation Pruning

The simplest method is based on Proposition 2.4.3. Namely, if we are mining motifs with length $\ell + k$ and we have some motif $m \in M_\ell$, then the support measures of any of its super motifs with length $\ell + k$ cannot be greater than those of a motif having $m$ as a prefix and a hypothetical suffix with score $1.0$. Therefore, a motif $m$ and its supmotifs can be pruned if on any of the data tracks

$$\frac{\ell \cdot \mathrm{msupp}(S, m) + |S| \cdot k}{\ell + k} < \sigma$$

if we are mining using maximal support, or

$$\frac{\ell \cdot \mathrm{asupp}(S, m) + \mathrm{mcount}(S, m) \cdot k}{\ell + k} < \sigma$$

if we are mining using additive support. Of course, the maximal motif length $\ell + k$ must be fixed to make these formulas usable. As an example, let us analyze Figure 3.1.

Figure 3.1: Support of motifs AT and AT** in a sample subsequence.

We see that $\mathrm{msupp}(S, \mathrm{AT}) = \max\{0.1; 0.5; 0.25; 0.5\} = 0.5$ and $\mathrm{asupp}(S, \mathrm{AT}) = 0.1 + 0.5 + 0.25 + 0.5 = 1.35$. If we were mining using the maximal type of support on this track, then we could prune the motif with its supmotifs if

$$(2 \cdot \mathrm{msupp}(S, \mathrm{AT}) + 2)/4 = (2 \cdot 0.5 + 2)/4 = 0.75 < \sigma,$$

where $\sigma$ is the threshold. For the additive type of support this would be

$$(2 \cdot \mathrm{asupp}(S, \mathrm{AT}) + 2 \cdot \mathrm{mcount}(S, \mathrm{AT}))/4 = (2 \cdot 1.35 + 2 \cdot 4)/4 = 2.675 < \sigma.$$
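The two upper bounds can be written down directly. The following sketch is only an illustration of the formulas of this section; the function names are hypothetical, and the example values are the four single supports of AT listed above, with one sequence, prefix length 2 and two remaining positions.

```python
def max_support_bound(msupp, n_sequences, ell, k):
    """Upper bound on the maximal support of any supmotif of length ell + k."""
    return (ell * msupp + n_sequences * k) / (ell + k)


def additive_support_bound(asupp, mcount, ell, k):
    """Upper bound on the additive support of any supmotif of length ell + k."""
    return (ell * asupp + mcount * k) / (ell + k)


def can_prune(bound, sigma):
    """The motif and its supmotifs can be pruned if the bound is below sigma."""
    return bound < sigma


scores = [0.1, 0.5, 0.25, 0.5]  # single supports of AT in the example
print(max_support_bound(max(scores), n_sequences=1, ell=2, k=2))   # 0.75
print(additive_support_bound(sum(scores), mcount=len(scores), ell=2, k=2))
```

Both bounds assume that every still-unknown position could score the maximal value 1.0, so they never underestimate the support of a supmotif.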

Incorporating this pruning method requires only small changes to NaiveSearch on line 2 of Algorithm 3.3.1. The result is given in Algorithm 3.4.1, where CanPrune uses the method described above to determine if the motif and its supmotifs can be pruned.

Algorithm 3.4.1

 1: procedure MaxSupSearch($S, \vec{\sigma}, m, \ell$)
 2:     if $\mathrm{mcount}(S, m) = 0 \lor$ CanPrune($\vec{\sigma}$, $\overrightarrow{\mathrm{supp}}(S, m)$) then
 3:         return
 4:     else if $|m| = \ell$ then
 5:         if IsFrequent($\vec{\sigma}$, $\overrightarrow{\mathrm{supp}}(S, m)$) then
 6:             SaveMotif($m$)
 7:         end if
 8:         return
 9:     end if
10:     for $e \in \{\mathrm{A}, \mathrm{T}, \mathrm{G}, \mathrm{C}\}$ do
11:         MaxSupSearch($S, \vec{\sigma}, m \,\|\, e, \ell$)
12:     end for
13: end procedure

3.4.2 Safe Over-Approximation Search

Another improvement to NaiveSearch uses a slightly different approach. It is based on Proposition 2.4.7, which states that the support of any motif $m'$ obtained from motif $m$ by replacing one or more nucleotides with wild card characters is greater than or equal to the support of the original motif. This holds for both the maximal and the additive type of support. This allows us to define a support operator that is guaranteed to be downward closed, which was an issue with NaiveSearch and MaxSupSearch [Lau09]. We will refer to it as the safe over-approximation type of support onwards.

Definition 3.4.1 Let $\mathrm{supp}^*(S, m)$ of motif $m \in M_\ell$ denote the support of its wild character extension $m^* \in M_k$, where $\ell \leq k$.

Recall that a wild card character extension of $m$ was a fixed length motif that contained $m$ as a prefix and the rest of the elements (wild card characters) as the suffix. As an example, if we are interested in mining sequences of length $\ell = 3$, we first start by checking the support of the wild card character extensions of motifs in $M_1$, namely A**, T**, G**, C** (note that we do not include motif * in this list, as it is anyway the most frequent motif and we are not interested in it). If any of these motifs is infrequent, for example T, then we prune all its supmotifs TAA, TAT, TAG, TAC, TTA et cetera. But


$\mathrm{supp}^*$ operator. We only have to keep in mind that it is downward closed only when mining motifs with fixed length, so that Proposition 2.4.7 would hold.

Algorithm 3.4.2 Safe Over-Approximations Search

 1: procedure SafeApproxSearch($S, \vec{\sigma}, m, \ell$)
 2:     if IsFrequent($\vec{\sigma}$, $\overrightarrow{\mathrm{supp}}^*(S, m)$) then
 3:         if $|m| = \ell$ then
 4:             SaveMotif($m$)
 5:             return
 6:         end if
 7:     else
 8:         return
 9:     end if
10:     for $e \in \{\mathrm{A}, \mathrm{T}, \mathrm{G}, \mathrm{C}\}$ do
11:         SafeApproxSearch($S, \vec{\sigma}, m \,\|\, e, \ell$)
12:     end for
13: end procedure

Algorithm 3.4.2 defines SafeApproxSearch. Note that we use the $\overrightarrow{\mathrm{supp}}^*$ operator instead of $\overrightarrow{\mathrm{supp}}$ and use IsFrequent to determine whether we can prune the motif with its supmotifs. This is possible due to the downward-closedness of the $\overrightarrow{\mathrm{supp}}^*$ operator.

Both MaxSupSearch and SafeApproxSearch have similar theoretical runtime complexity $O(f \cdot 4^\ell)$, where the pruning factor $f \in (0, 1]$ is maximal if no pruning occurs and minimal if all motifs are pruned.
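The control flow of SafeApproxSearch can be sketched with the support operator passed in as a function. The operator used in the demonstration is a deliberately simple stand-in and an assumption of this sketch: it has a single track and counts the windows of the target length whose first symbols equal the prefix, divided by the number of sequences, which corresponds to all track scores being equal to 1. It is downward closed for a fixed length, which is the only property the search relies on. All names are hypothetical.

```python
def safe_approx_search(supp_star, thresholds, length, motif="", found=None):
    """Depth-first search that prunes on the wild card extension support."""
    found = [] if found is None else found
    supports = supp_star(motif)
    if not all(s >= sigma for sigma, s in zip(thresholds, supports)):
        return found              # extension infrequent: prune all supmotifs
    if len(motif) == length:
        found.append(motif)       # a full-length motif equals its own extension
        return found
    for e in "ATGC":
        safe_approx_search(supp_star, thresholds, length, motif + e, found)
    return found


def make_count_support(sequences, length):
    """Toy supp* with one track: matches of prefix + wild cards, per sequence."""
    def supp_star(prefix):
        hits = sum(seq.startswith(prefix, start)
                   for seq in sequences
                   for start in range(len(seq) - length + 1))
        return [hits / len(sequences)]
    return supp_star


supp_star = make_count_support(["ATCCGTCCG", "TTCCG"], 4)
print(safe_approx_search(supp_star, [1.0], 4))   # ['TCCG']
```

In this run the branches starting with A and G are cut after a single support query, while the branch starting with C survives one level before both of its surviving children are pruned.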

3.4.3 Infrequent Sub-Motifs Pruning Method

This alternative search method is directly based on Theorem 2.4.5. Namely, if we are interested in motifs with length $\ell$, then for any partitioning of a frequent motif $m \in M_\ell$ into two pieces $m^1, m^2$, at least one of the pieces must be frequent. The idea is to generate two sets $F$ and $I$, where $F$ contains the frequent motifs and $I$ the infrequent ones of length $\ell/2$. Thus, we combine motifs from $F$ and $I$ to enumerate the final candidates. Note that we need $I$, because any frequent motif of length $\ell$ may have an infrequent prefix or suffix. We do not need to consider combinations of infrequent submotifs, as due to Theorem 2.4.5 we know that the resulting motif is also infrequent. Also, there are many ways to partition the motifs, but making the pieces the same length enables us to enumerate them faster. Algorithm 3.4.3 describes this process.

Algorithm 3.4.3 Infrequent Sub-Motifs Search

 1: procedure InfreqSearch($S, \vec{\sigma}, m, \ell$)    ⊲ $\ell$ must be even
 2:     $(F, I) \leftarrow$ EnumerateMotifs($S, \vec{\sigma}, \ell/2$)
 3:     $C \leftarrow \{(a, b) \mid a \in F, b \in F \cup I\}$
 4:     for $c \in C$ do
 5:         if CanPrune($\vec{\sigma}$, $\overrightarrow{\mathrm{supp}}(S, c)$) then
 6:             continue
 7:         else if IsFrequent($\vec{\sigma}$, $\overrightarrow{\mathrm{supp}}^*(S, c_1 \,\|\, c_2)$) then
 8:             SaveMotif($c_1 \,\|\, c_2$)
 9:         else if $c_1 \neq c_2$ then
10:             if IsFrequent($\vec{\sigma}$, $\overrightarrow{\mathrm{supp}}^*(S, c_2 \,\|\, c_1)$) then
11:                 SaveMotif($c_2 \,\|\, c_1$)
12:             end if
13:         end if
14:     end for
15: end procedure

On line 3, we enumerate all the candidate motifs of length $\ell$. On line 5, we first try to eliminate candidates by using the information we know about their prefix $m_1$ and suffix $m_2$. We try this because querying the database, depending on the data structures used, can be more costly. The CanPrune method checks on every track if

$$\mathrm{msupp}(S, m_1 \,\|\, m_2) \leq \frac{\mathrm{msupp}(S, m_1) + \mathrm{msupp}(S, m_2)}{2} < \sigma$$

for the maximal support type and

$$\mathrm{asupp}(S, m_1 \,\|\, m_2) \leq \frac{\mathrm{asupp}(S, m_1) + \mathrm{asupp}(S, m_2)}{2} < \sigma$$


Proposition 2.4.3. If we can prune $m_1 \,\|\, m_2$ using the above equations, then we can also prune $m_2 \,\|\, m_1$, as it makes no difference in what order we consider the prefix and suffix part.
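The candidate enumeration and the averaging check can be sketched in a few lines. This is an illustration under assumptions: supports of the two halves are given as per-track lists, and a candidate is pruned as soon as the averaged bound falls below the threshold on any track, as in Section 3.4.1. The names `half_candidates` and `can_prune_halves` are hypothetical.

```python
from itertools import product


def half_candidates(frequent, infrequent):
    """C = {(a, b) | a in F, b in F or I}: at least one half is frequent."""
    return list(product(sorted(frequent), sorted(frequent | infrequent)))


def can_prune_halves(supp_first, supp_second, thresholds):
    """True if on some track the mean support of the halves is below sigma."""
    return any((s1 + s2) / 2 < sigma
               for s1, s2, sigma in zip(supp_first, supp_second, thresholds))


print(half_candidates({"AT"}, {"GC"}))   # [('AT', 'AT'), ('AT', 'GC')]
```

The check is symmetric in its two support arguments, so one call decides both concatenation orders of a pair.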

3.5 Mining Fixed Number of Best Motifs

The search algorithms discussed in earlier sections concentrate on finding all frequent motifs with respect to some threshold vector. But suppose we want to mine the 100 best motifs. Doing this by hand using any previously mentioned search algorithm would require the following process. First, we determine some reasonable thresholds and support types for the data tracks. Second, we mine frequent motifs using these thresholds and decide whether the number of motifs was too small or too large. Third, we modify the thresholds by increasing or decreasing them and mine again until we have the desired number of frequent motifs.

The process we just described is actually similar to binary search known in computer science. Algorithm 3.5.1 implements it to automate this process. On line 3, we determine two scalars $\alpha$ and $\beta$, such that mining with $\alpha \cdot \vec{\sigma}$ returns all motifs present in the data and mining with $(\beta + \varepsilon) \cdot \vec{\sigma}$ returns none of the motifs, where $\varepsilon > 0$. It is trivial that $\alpha = 0$, because in that case all motifs will be frequent. Determining $\beta$ is more complicated, because we do not have any prior knowledge about the maximal supports in the data. The first option is to make a guess, but a better alternative is to find out the supports by calculating $\vec{s} = \overrightarrow{\mathrm{supp}}^*(S, *)$ and set

$$\beta = \max\{s_i / \sigma_i \mid i = 1, \ldots, n\} \tag{3.2}$$

where $n$ is the number of data tracks and $\vec{\sigma}$ contains the user-defined thresholds. This way mining with $\beta \cdot \vec{\sigma}$ may return only the minimal possible number of frequent motifs. Having these boundaries fixed, we can easily combine any previously defined search method with binary search. In other words, we keep scaling the original vector of thresholds $\vec{\sigma}$ until we get the desired number of frequent motifs. The linearity of this approach may not always be the best choice, because the relations between the reasonable thresholds depend on the nature of the data. We do not study further possibilities in this work, but it could be a possible research area in the future.

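Algorithm 3.5.1

 1: procedure NBest($S, \vec{\sigma}, N, \ell$)
 2:     $\vec{s} \leftarrow \overrightarrow{\mathrm{supp}}^*(S, *)$
 3:     $\alpha \leftarrow 0$, $\beta \leftarrow \max\{s_i / \sigma_i \mid i = 1, \ldots, n\}$    ⊲ $n$ is the number of tracks
 4:     $C \leftarrow \infty$    ⊲ The closest number of best motifs
 5:     $\delta \leftarrow 0$    ⊲ Scalar to be used to mine the closest number of best motifs
 6:     while $\beta - \alpha > \varepsilon$ do    ⊲ $\varepsilon > 0$ limits the recursion depth
 7:         $\gamma \leftarrow (\alpha + \beta)/2$
 8:         $k \leftarrow$ NumFreqMotifs($S, \gamma \cdot \vec{\sigma}, \theta, \ell$)    ⊲ $\theta$ is the zero-length motif
 9:         if $|k - N| < |C - N|$ then    ⊲ $k$ is closer to $N$ than the best count so far
10:             $C \leftarrow k$, $\delta \leftarrow \gamma$
11:         end if
12:         if $k > N$ then
13:             $\alpha \leftarrow \gamma$
14:         else if $k < N$ then
15:             $\beta \leftarrow \gamma$
16:         else if $k = N$ then
17:             break
18:         end if
19:     end while
20:     return MineMotifs($S, \delta \cdot \vec{\sigma}, \theta, \ell$)
21: end procedure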
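The binary search over the scaling factor can be sketched in Python. The mining step is abstracted into a callback `count_frequent(gamma)`, a hypothetical stand-in for NumFreqMotifs called with thresholds $\gamma \cdot \vec{\sigma}$; it is assumed to be non-increasing in $\gamma$. The sketch compares distances to the desired number of motifs when it tracks the closest count, and the function name `n_best_scale` is an assumption of this illustration.

```python
def n_best_scale(count_frequent, n_wanted, beta, eps=1e-6):
    """Return (delta, closest): the scalar whose count is closest to n_wanted."""
    alpha = 0.0
    closest, delta = None, 0.0
    while beta - alpha > eps:          # eps bounds the number of iterations
        gamma = (alpha + beta) / 2
        k = count_frequent(gamma)
        if closest is None or abs(k - n_wanted) < abs(closest - n_wanted):
            closest, delta = k, gamma  # remember the best scalar seen so far
        if k > n_wanted:
            alpha = gamma              # too many motifs: raise the thresholds
        elif k < n_wanted:
            beta = gamma               # too few motifs: lower the thresholds
        else:
            break
    return delta, closest


# Toy single-track data: supports of five motifs, threshold sigma = 1.
supports = [0.9, 0.8, 0.5, 0.5, 0.2]
count = lambda gamma: sum(s >= gamma for s in supports)
print(n_best_scale(count, 2, beta=max(supports)))   # (0.675, 2)
```

With the same toy data and three desired motifs the loop cannot hit the target, because two motifs share the support 0.5; it then stops once the interval is narrower than `eps` and reports the closest count it has seen.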

Function NumFreqMotifs can be used as a wrapper around the search methods described in earlier sections. There are still a few things to consider. First, a fixed number of best motifs does not always exist, because two motifs may have exactly the same support measures. In that case, binary search goes into an infinite loop. The same happens when the number of desired motifs is greater than the number of motifs present in the input data. In both situations, we need to limit the maximal depth of the recursion. But we can still return a number of motifs that is very close to the desired number of motifs. On line 4, we define $C$, which will remember what was the closest number of frequent motifs to the desired fixed number of motifs. Scalar $\delta$ can be used to scale $\vec{\sigma}$ to get $C$ frequent motifs. On line 6, we use $\varepsilon > 0$ to limit the recursion depth. On lines 12–18, we see binary search in action. The while loop