
FACULTY OF MATHEMATICS AND COMPUTER SCIENCE

Institute of Computer Science

Timo Petmanson

Mining Motifs in DNA Regulatory Area

Bachelor's Thesis (6 ECTS)

Supervisor: Sven Laur, D.Sc. (Tech.)

Author: ... ... May 2010

Supervisor: ... ... May 2010

Chairman: ... ... ... 2010

TARTU 2010


Introduction

1 Preliminaries
1.1 DNA
1.2 Gene expression
1.3 Data mining

2 Sequence Mining with Multiple Layers of Data
2.1 Sequences and Scores
2.2 Motifs and Matching
2.3 Support Metrics
2.4 Properties of Support Metrics
2.5 Statistically Relevant Motifs

3 Algorithms and Data Structures
3.1 Compact Encoding of Motifs
3.2 Hash-map of Support Metrics
3.2.1 Including Motifs with Wild Card Characters
3.3 Naive Search based on Apriori
3.4 Pruning Strategies
3.4.1 Maximal Support Estimation Pruning
3.4.2 Safe Over-Approximation Search
3.4.3 Infrequent Sub-Motifs Pruning Method
3.5 Mining Fixed Number of Best Motifs
3.6 Generalized FP-Tree

4 Experimental Results
4.1 Runtime Performance of Search Algorithms
4.1.2 Mining Fixed Number of Frequent Motifs
4.2 Mining Biologically Significant Motifs
4.2.1 Data Preparation
4.2.2 Results

Summary

Resümee (in Estonian)

Bibliography

A Multi-constraint miner tool for gene expression analysis

Introduction

All living organisms on Earth are believed to contain genetic information coded in structured collections of genes and non-coding sequences that make up the DNA. The coded information is used to build organisms and maintain them, and it defines a wide range of genetic features that vary from individual to individual and from species to species. The non-coding parts have much of the responsibility for regulating the expression of particular genes. Genes with their non-coding regulatory areas form complex signaling networks that together coordinate the life cycle of an organism. Contemporary methods in genetics like ChIP and micro-array measurements make it possible to measure features of thousands of genes in one experiment, generating huge amounts of data. Therefore, the development of new algorithms and methods able to analyze this data is crucial.

Our contributions include the development of novel methods able to combine different sources of experimental data. In Chapter 2, we formalize the theory describing sequence mining with multiple input sequences and multiple data layers. We also describe how to determine statistically significant motifs using our theory. In Chapter 3, we develop the algorithms MaxSupSearch, SafeApproxSearch, InfreqSearch and GFPSearch, which utilize different pruning strategies. For GFPSearch, we define a generic frequent-pattern tree structure that is a generalization of the FP-tree [JJYR04]. We also develop NBest, which combines any of the previously mentioned algorithms with binary search to get a fixed number of best motifs. We develop SigMotifs, which goes even further by distilling out statistically significant motifs. A performance study of the mentioned algorithms along with experiments on real biological data is given in Chapter 4.

1 Preliminaries

1.1 DNA

Currently scientists have described about 1.5 million different species: about five thousand mammals, thirty thousand species of fish and over nine hundred thousand insects among others [WCU07]. Some estimates based on comparing samples from various parts of the world's seas suggest that in oceans there may be more than 100 million species of bacteria [MHJ06]. This vast diversity of known and unknown species in Earth's biosphere is believed to have one thing in common: the presence of DNA.

Figure 1.1: DNA Double Helix. The distance between strands varies and forms major and minor grooves.

[...] backbone of a strand contains alternating phosphate and sugar residues linked with bases. These two strands form a structure known as the double helix seen in Figure 1.1, whose stability is maintained by hydrogen bonds between the bases, see Figure 1.2 [RSM05]. There are four types of bases in DNA: adenine (abbreviated A), thymine (T), guanine (G) and cytosine (C), that combined with a sugar and one or more phosphate residues form a nucleotide. The nucleotides are pairwise aligned, making the structure anti-parallel, where adenine bonds only to thymine and cytosine bonds only to guanine. The endpoints of the strands are called 3' and 5', where the 5' end is defined by a terminal phosphate group and the 3' end by a terminal hydroxyl group [Coh04].

DNA nucleotide sequences are usually written using only the bases from one strand, as the bases on the other strand are complementary. Sequence TATAAA is complementary to ATATTT, for example. The order in which the characters are written depends on the source of the data: sometimes the data is written in the direction from 3' to 5', while in other cases it is vice versa.

Figure 1.2: TA and GC complementary base pairs (adenine with thymine, cytosine with guanine). Dotted lines represent hydrogen bonds between bases.

1.2 Gene expression

Gene expression means the rate and amount of RNA transcribed from a gene, which in turn is used to define other proteins necessary for the cell and the organism. The transcription process requires transcription factors, which are special proteins able to recognize and attach to particular fragments in gene promoter areas. The transcription factors are required to recruit RNA polymerase, which is responsible for carrying out the transcription process.

In more complex eukaryotic cells, the promoters are rather diverse and complicated, but the core elements are a transcription start site, which together with RNA polymerase and transcription factor binding sites is essential for initiating the transcription process. Other important binding sites are typically a little further away in the upstream direction and mainly regulate gene expression by enhancing or restricting recruitment of the main transcription factors. Additionally, there may be even more distant promoter areas that have a weaker influence on the gene regulation.

1.3 Data mining

Data mining is a method in statistics for extracting interesting patterns or knowledge from large amounts of available data. This field is very diverse, as among general data mining solutions there are many specific procedures developed for business, games, social networks et cetera [DP07]. In this work, we concentrate on a specialized area of data mining called sequence mining, which deals with ordered sequences like nucleotide sequences.

The Apriori algorithm is the most general and simple way to find patterns with high support in given data. In standard sequence mining, the support is defined as the number of occurrences of a pattern in input data, which is used to decide whether the pattern is frequent or infrequent based on some defined threshold. The Apriori algorithm assumes that the support is downward closed, which means that for any infrequent pattern there do not exist any frequent sup-patterns. For example, a DNA motif AAATCCC cannot be present in data more times than sequences AAA and CCC, because whenever the supmotif occurs, the two submotifs must also occur. Let us clarify that in this work by a submotif or a subpattern we mean a subsequence with consecutive elements.

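Algorithm 1.3.1 Apriori
1: $F_1 \leftarrow \{\text{frequent one-element patterns}\}$
2: $\ell \leftarrow 2$
3: while $F_{\ell-1} \neq \emptyset$ do
4:     $C_\ell \leftarrow$ GenerateCandidates($F_{\ell-1}$)
5:     $F_\ell \leftarrow \{c \in C_\ell \mid supp(c) \geq \sigma\}$    ▷ $\sigma$ is the threshold
6:     $\ell \leftarrow \ell + 1$
7: end while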

The Apriori algorithm uses downward closedness as its main pruning feature. In Algorithm 1.3.1 on line 4, the GenerateCandidates procedure takes the set of frequent motifs of length $\ell - 1$ as input and generates possible candidates of length $\ell$. It does not need to consider any non-frequent motifs, as none of their supmotifs are frequent. The algorithm stops running when it has found all frequent motifs in the dataset.

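Let us demonstrate Apriori by giving an example. Consider the following sequence:

GCTTATGGTCGCTATGCTTT.

Suppose we want to mine all motifs occurring at least three times in the sequence. This means that we run Apriori with threshold $\sigma = 3$. The set $F_1 = \{\mathtt{T}, \mathtt{G}, \mathtt{C}\}$, because all nucleotides except A are present in the sequence at least three times. Next, we generate candidate motifs of length two by using only the frequent elements in $F_1$:

\[ C_2 = \{\mathtt{TT}, \mathtt{TG}, \mathtt{TC}, \mathtt{GT}, \mathtt{GG}, \mathtt{GC}, \mathtt{CT}, \mathtt{CG}, \mathtt{CC}\}. \]

Frequent motifs in this case are

\[ F_2 = \{\mathtt{TT}, \mathtt{GC}, \mathtt{CT}\}. \]

Note that TT matches TTT two times. The next candidate set is obtained by extending each frequent motif by one frequent letter on either side:

\[ C_3 = \{\mathtt{TTT}, \mathtt{GTT}, \mathtt{TTG}, \mathtt{CTT}, \mathtt{TTC}, \mathtt{TGC}, \mathtt{GCT}, \mathtt{GGC}, \mathtt{GCG}, \mathtt{CGC}, \mathtt{GCC}, \mathtt{TCT}, \mathtt{CTG}, \mathtt{CCT}, \mathtt{CTC}\}. \]

This time there is only one frequent motif:

\[ F_3 = \{\mathtt{GCT}\}. \]

Candidate motifs of length 4:

\[ C_4 = \{\mathtt{TGCT}, \mathtt{GCTT}, \mathtt{GGCT}, \mathtt{GCTG}, \mathtt{CGCT}, \mathtt{GCTC}\}. \]

But none of them is frequent, so $F_4 = \emptyset$ and all frequent motifs in our example are

\[ F = \{\mathtt{T}, \mathtt{G}, \mathtt{C}, \mathtt{TT}, \mathtt{GC}, \mathtt{CT}, \mathtt{GCT}\}. \]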
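The worked example can be reproduced with a short script. The following is a minimal sketch, assuming that support is the number of (possibly overlapping) contiguous occurrences and that candidates are built by extending a frequent motif by one frequent letter on either side; the function names are illustrative and do not come from the thesis.

```python
def count_occurrences(sequence, motif):
    """Number of (possibly overlapping) contiguous occurrences of motif."""
    n = len(motif)
    return sum(1 for i in range(len(sequence) - n + 1)
               if sequence[i:i + n] == motif)


def generate_candidates(frequent, letters):
    """Extend every frequent motif by one frequent letter on either side."""
    candidates = set()
    for motif in frequent:
        for letter in letters:
            candidates.add(motif + letter)
            candidates.add(letter + motif)
    return candidates


def apriori(sequence, sigma):
    """Return all motifs occurring at least sigma times in sequence."""
    f1 = {c for c in set(sequence)
          if count_occurrences(sequence, c) >= sigma}
    all_frequent = set(f1)
    level = f1
    while level:  # stop when no frequent motif of the current length exists
        candidates = generate_candidates(level, f1)
        level = {c for c in candidates
                 if count_occurrences(sequence, c) >= sigma}
        all_frequent |= level
    return all_frequent


if __name__ == "__main__":
    print(sorted(apriori("GCTTATGGTCGCTATGCTTT", 3), key=lambda m: (len(m), m)))
```

Because infrequent motifs are never extended, the candidate sets stay small, which is the pruning effect the text describes.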

There are also algorithms like WINEPI [MTV95], MINEPI [MT96] and SPEXS [Vil02] that are able to mine motifs using pattern matching. Still, while Apriori and other standard sequence mining algorithms are useful, they treat all parts of the sequence with equal weight. In our case, we need methods that are able to work with data that decorates sequences with scores, making some parts of them more relevant than the rest. In Chapter 2, we reformulate standard sequence mining techniques and later devise our own algorithms that handle such requirements.

2 Sequence Mining with Multiple Layers of Data

In this chapter, we formalize basic notions and concepts like sequences, motifs and support that are needed to develop our methods. We try to develop our mathematical approach such that it would be convenient to study gene regulation when we consider several promoter areas and different properties of these sequences described by layers of experimental data.

We also study different properties and relations between these building blocks that are later used in algorithms to cut down the running times and improve overall performance, although we do not cover algorithmic details and other aspects like data structures, as they are discussed in later chapters.

2.1 Sequences and Scores

The most basic constructs we will be dealing with onwards are DNA sequences and their fragments. In our case, it will be convenient to think of them as a set of nucleotide sequences. Let $S = \{a, b, c, \ldots\}$ denote a set of promoter sequences relevant to some gene. Single elements of a sequence are denoted with subscripts as usual. For example, $a_1$ means the first element and $a_2$ the second element of $a \in S$. There are four types of nucleotides in DNA (adenine, thymine, cytosine, guanine) that correspond to letters A, T, C, G. We write $a_1 = \mathtt{A}$ if the first element in the nucleotide sequence is adenine and $a_2 = \mathtt{T}$ if the second element is thymine. Let us denote the length of sequence $a$ as $|a|$. It is worth noting that no promoter has length zero, nor are there promoters with infinite length in the real world. However, depending on the particular case, the lengths of the sequences are not usually very short or very long.

In mathematics, a fragment of a sequence is usually written as a list of elements. In this paper, we will be using a shorter notation:

\[ a_{i:j} \stackrel{\text{def}}{=} a_i, a_{i+1}, \ldots, a_j \]

where $i$ is the beginning and $j$ is the end of the fragment.

We stated in the introduction of this thesis that we are going to deal with multiple layers of data about promoter sequences. For example, if we have data containing binding and conservation scores from DNA micro-array and sequencing experiments that are associated with the promoters we are interested in, we can portray them as data tracks over the nucleotide sequence, as illustrated in Figure 2.1.

Figure 2.1: An example subsequence (... A T G C C C A T T G C T A G G C ...) having conservation and binding data tracks attached; the horizontal axis is the position and the vertical axis is the score value. The scores are variable and may not directly depend on each other.

From a theoretical point of view, it is not important exactly what kind of data we have, as long as we can represent it as numeric values linked to positions in promoter sequences. However, it is important that these values express some property that makes some regions of the nucleotide sequence more relevant than other regions, thus defining important regions with respect to each data track. If we have $n$ data sets containing various scores and $m$ promoters, then we need $n \times m$ mappings that associate relevant scores from [...] to normalize all data such that all scores fall into the range $[0, 1]$, as shown in Figure 2.1. This simplifies writing some formulas, because we know the maximum possible value of any type of score linked to any position of a nucleotide sequence. Let $\varphi : \mathbb{N} \to \mathbb{R}$ be a mapping that associates numeric scores to all positions of a nucleotide sequence. To make this notation more useful, let us agree that by writing $\varphi(a_i)$ we mean the score that $\varphi$ maps to position $i$ of sequence $a$, and by writing $\varphi(a_{i:j})$ we mean the sequence of scores

\[ \varphi(a_{i:j}) \stackrel{\text{def}}{=} \varphi(a_i), \varphi(a_{i+1}), \ldots, \varphi(a_j). \]

By writing $\bar\varphi(a_{i:j})$ we mean the average score

\[ \bar\varphi(a_{i:j}) \stackrel{\text{def}}{=} \frac{1}{j - i + 1} \cdot \sum_{k=i}^{j} \varphi(a_k). \]

2.2 Motifs and Matching

In this section, we introduce motifs, which can be thought of as possible subsequences in a sequence set. Motifs do not directly associate to any data track, but there are several other metrics like support, frequency and significance of a motif in a particular set of promoter sequences. In addition to the nucleotide letters A, T, G, C, motifs may also contain special wild card characters that have a special meaning and usage. In this work, we will be using only one such symbol, *, which represents any possible nucleotide in one position. Note that this is different from the standard usage of this symbol in batch-processing or regular-expression applications, where it usually stands for zero or more symbols. In our case, if we have a motif G**A, then by that we mean any motif of length four that starts with letter G and ends with letter A.

We will be dealing a lot with fixed-length motifs in later sections, so it is necessary to introduce notation that we can use to refer to all motifs with a fixed length $\ell$. Let $M_\ell$ represent the set of all motifs with length $\ell$, where $\ell \in \mathbb{N}$. We agreed before that all motifs consist of five different letters: the nucleotides and the wild card character. This means that the cardinality of the set $M_\ell$ is $|M_\ell| = 5^\ell$, as there are five different possible elements per position in a motif.

Often it is necessary to refer to single elements of a motif the same way we do for sequences, so given any motif $m \in M_\ell$, let $m_1$ denote the first element of the motif, $m_2$ the second element of the motif, et cetera. In addition to that, it is convenient to describe motifs as a concatenation of [...] only a prefix and a suffix part. Let $||$ be a concatenation operator. If $m_p \in M_p$ and $m_s \in M_s$, then motif $m = m_p \,||\, m_s$, where $m \in M_\ell$ and $\ell = p + s$. Let us illustrate this with an example. If $m_p = \mathtt{AAAT}$ and $m_s = \mathtt{GCCGT}$, then the concatenation $m_p \,||\, m_s$ is AAATGCCGT.

Another very useful notion is a wild card extension of some motif. Namely, if we have some fixed motif length $\ell$ and a motif $m \in M_k$ such that $k \leq \ell$, we may pad the motif with wild card characters until it is $\ell$ elements long. This makes it easy to express motifs we know to have a certain prefix. Let $\hat{m} \in M_\ell$ denote the wild card extension of motif $m \in M_k$, where $k \leq \ell$, such that the prefix $\hat{m}_{1:k} = m$ and the suffix $\hat{m}_{k+1:\ell} = \mathtt{*}\ldots\mathtt{*}$. For instance, if $m = \mathtt{AATA}$ and we have fixed motif length $\ell = 10$, then the wild card extension is $\hat{m} = \mathtt{AATA******}$. This notion comes in handy when we describe the SafeApproxSearch algorithm in Chapter 3. Let us agree that any motif gained from another motif by replacing one or more nucleotides with wild cards is considered a submotif of the original motif.

In standard sequence mining, the support of some motif is usually measured by how many matches it has in data [DP07]. The number of matches of a motif containing no wild card characters is simply the number of times the motif can be viewed as a subsequence of the given data sequence. With wild card characters this works differently, as a wild card character matches any nucleotide. See Figure 2.2 for an illustration.

Figure 2.2: Three matches of motif ATA*A in a subsequence.

Definition 2.2.1 A motif $m \in M_\ell$ matches some fragment $a_{i:i+\ell-1}$ of sequence $a$, if $m_k = \mathtt{*} \lor m_k = a_{i+k-1}$ for all $k = 1, \ldots, \ell$.

\[ \mathrm{match}(a, m, i) = \begin{cases} 1, & \text{if } m \text{ matches } a_{i:i+\ell-1} \\ 0, & \text{otherwise.} \end{cases} \]

We can extend the number of matches in the sequence $a$ over a set of sequences $S$ by simply adding all the individual counts together:

\[ \mathrm{mcount}(a, m) \stackrel{\text{def}}{=} \sum_{i=1}^{|a|-\ell+1} \mathrm{match}(a, m, i), \qquad \mathrm{mcount}(S, m) \stackrel{\text{def}}{=} \sum_{s \in S} \mathrm{mcount}(s, m). \]
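Definition 2.2.1 and the two match counts translate almost directly into code. The sketch below is only an illustration: sequences and motifs are assumed to be plain Python strings, positions are 1-based as in the text, and the names `match`, `mcount_seq` and `mcount_set` are chosen here for readability.

```python
WILDCARD = "*"


def match(a, m, i):
    """Return 1 if motif m matches fragment a_{i:i+l-1} (1-based i), else 0."""
    l = len(m)
    if i < 1 or i + l - 1 > len(a):
        return 0
    fragment = a[i - 1:i - 1 + l]
    ok = all(mk == WILDCARD or mk == ak for mk, ak in zip(m, fragment))
    return int(ok)


def mcount_seq(a, m):
    """Number of matches of m in one sequence: sum over i = 1..|a|-l+1."""
    return sum(match(a, m, i) for i in range(1, len(a) - len(m) + 2))


def mcount_set(S, m):
    """Number of matches of m in a collection of sequences."""
    return sum(mcount_seq(a, m) for a in S)


if __name__ == "__main__":
    print(mcount_seq("GTTAGCCA", "G**A"))               # wild cards match any letter
    print(mcount_set(["ATCCGTCCG", "TTCCG"], "TCCG"))
```

Note that the wild card stands for exactly one position, so a motif never matches a fragment of a different length.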

2.3 Support Metrics

Standard sequence mining treats all parts of the input sequence with equal importance [DP07]. In our case, we have possibly more than one data track containing variable scores. Therefore, we need to define support in a different way. We base our approach on a formulation given by Sven Laur [Lau09].

The first thing is to extend the notion of support of one single match. The standard way was summing up all matches of a motif in a sequence, such that each match had equal importance. But as we have actual scores linked to positions, we extend the original method by taking the average score of the matching positions of a single match.

Definition 2.3.1 The support of an individual motif $m \in M_\ell$ with respect to some fragment in sequence $a$ starting from position $i$:

\[ \mathrm{supp}(a, m, i) = \begin{cases} \bar\varphi(a_{i:i+\ell-1}) & \text{if } \mathrm{match}(a, m, i) = 1 \\ 0 & \text{otherwise.} \end{cases} \]

To extend the support of a motif over a sequence, we have several options. The first idea is to add up all the single supports of the motif. This is the simplest way to go and we refer to this method as additive support onwards. Let us consider another option: instead of adding up the scores, we can take only the maximal score. The plus side of this method is that it promotes motifs that actually have high scores. Additive support can be high even if all the scores of the single matches are low. So, we also consider this method and we will be referring to it as maximal support.

[...] in a sequence. We might consider average support, which works like additive support, but where we divide the result by the number of matches of that motif in the sequence. We could also define supports like weighted additive or weighted average support, which consider some regions of the promoter to be more significant than others. The last two are actually not very reasonable, because we express the significance of promoter areas through data tracks anyway.

The average support is actually more relevant, but as it seems to have mixed properties of additive and maximal support, we do not cover this type of support in this work and concentrate on studying only the two mentioned support types.

Definition 2.3.2 Additive support of a motif $m \in M_\ell$ in sequence $a$ is

\[ \mathrm{asupp}(a, m) = \sum_{i=1}^{|a|-\ell+1} \mathrm{supp}(a, m, i). \]

Definition 2.3.3 Maximal support of a motif $m \in M_\ell$ in sequence $a$ is

\[ \mathrm{msupp}(a, m) = \max\{\, \mathrm{supp}(a, m, i) \mid i = 1, \ldots, |a| - \ell + 1 \,\}. \]

By writing $\mathrm{supp}(a, m)$, we do not refer directly to either of the support types, in cases where we are discussing properties that apply to both of them.
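Definitions 2.3.1 to 2.3.3 can be illustrated for a single sequence and a single data track. This is a minimal sketch, assuming the track is a Python list of scores in $[0, 1]$ aligned with the sequence; positions are 1-based as in the text, and the sample sequence and scores are taken from the example used later in Section 3.2.

```python
def matches(a, m, i):
    """True if motif m (with '*' wild cards) matches a at 1-based position i."""
    fragment = a[i - 1:i - 1 + len(m)]
    return len(fragment) == len(m) and all(
        mk == "*" or mk == ak for mk, ak in zip(m, fragment))


def supp(a, phi, m, i):
    """Definition 2.3.1: average score of the matched fragment, else 0."""
    if not matches(a, m, i):
        return 0.0
    return sum(phi[i - 1:i - 1 + len(m)]) / len(m)


def asupp(a, phi, m):
    """Definition 2.3.2: sum of the single-match supports."""
    return sum(supp(a, phi, m, i) for i in range(1, len(a) - len(m) + 2))


def msupp(a, phi, m):
    """Definition 2.3.3: largest single-match support (0 if no position)."""
    return max((supp(a, phi, m, i)
                for i in range(1, len(a) - len(m) + 2)), default=0.0)


if __name__ == "__main__":
    a = "ATCCGTCCG"
    phi = [1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5]
    print(asupp(a, phi, "TCCG"), msupp(a, phi, "TCCG"))
```

Here TCCG matches at positions 2 and 6 with single supports 1.0 and 0.5, so the additive support adds them while the maximal support keeps only the larger one.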

Therefore, Definitions 2.3.2 and 2.3.3 are only two possible ways of expressing the support of a motif in one sequence. The biological importance of the two depends mostly on the actual data used. For example, if we consider conservation, then additive support can reveal motifs that coexist in several genetically close species and have great structural importance, whereas maximal support takes into account only one occurrence of a motif in a promoter, thus ignoring larger scale structural effects. On the other hand, maximal support can bring up motifs that are most probably recognized by transcription factors, as these enzymes require proper locations to enable mounting RNA polymerase and initiate transcription. Thus, the decision about which support type should be used with a particular data track depends on the biological properties of the data.

To extend the notion of support over a set of sequences, we have several options. The first approach is to consider all promoter sequences having [...] an average of the support of a motif in all sequences.

Definition 2.3.4 Additive and maximal support of motif $m \in M_\ell$ in a list of sequences $S$ are

\[ \mathrm{asupp}(S, m) = \frac{1}{|S|} \cdot \sum_{a \in S} \mathrm{asupp}(a, m) \tag{2.1} \]

\[ \mathrm{msupp}(S, m) = \frac{1}{|S|} \cdot \sum_{a \in S} \mathrm{msupp}(a, m). \tag{2.2} \]

The second approach is to consider promoters further away from the gene they regulate as having less impact than the ones closer to it. Therefore, we need to give promoter sequences meaningful weights when calculating support. We could propagate these weights directly into the datasets, enabling the direct use of Equations (2.1) and (2.2). Let us also agree that by writing $\mathrm{supp}(S, m)$, we do not refer directly to additive or maximal support if we are discussing properties that apply to both of them.

In Chapter 3, we discuss algorithms and data structures and usually need support with respect to all data tracks. Also, let us agree that we have fixed the support type for every data track to make semantics easier. In cases where we need to use both support types, we can view the original track as two duplicates with different support types.

Definition 2.3.5 Given mappings $\varphi_1, \ldots, \varphi_n$, the support of motif $m$ in sequences $S$ with respect to all $n$ data tracks is

\[ \overrightarrow{\mathrm{supp}}(S, m) = (s_1, \ldots, s_n) \]

where

\[ s_i = \begin{cases} \mathrm{asupp}(S, m, \varphi_i) & \text{for additive type of support} \\ \mathrm{msupp}(S, m, \varphi_i) & \text{for maximal type of support.} \end{cases} \]

Mappings $\varphi_1, \ldots, \varphi_n$ given as extra arguments to support operators will be used as the mapping $\varphi$ in Definition 2.3.1.
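Definitions 2.3.4 and 2.3.5 can be sketched as follows. The representation is an assumption made for illustration only: a data track is a list holding one score list per sequence, and the support type fixed for each track is passed as the string "additive" or "maximal". The sample data is the two-sequence, two-track example from Section 3.2.

```python
def single_supports(a, phi, m):
    """Average scores of all fragments of a that motif m matches."""
    l = len(m)
    result = []
    for i in range(len(a) - l + 1):
        if all(mk == "*" or mk == ak for mk, ak in zip(m, a[i:i + l])):
            result.append(sum(phi[i:i + l]) / l)
    return result


def asupp_set(S, track, m):
    """Equation (2.1): mean over sequences of the additive support."""
    return sum(sum(single_supports(a, phi, m))
               for a, phi in zip(S, track)) / len(S)


def msupp_set(S, track, m):
    """Equation (2.2): mean over sequences of the maximal support."""
    return sum(max(single_supports(a, phi, m), default=0.0)
               for a, phi in zip(S, track)) / len(S)


def support_vector(S, tracks, kinds, m):
    """Definition 2.3.5: one support value per data track."""
    ops = {"additive": asupp_set, "maximal": msupp_set}
    return tuple(ops[kind](S, track, m) for track, kind in zip(tracks, kinds))


if __name__ == "__main__":
    S = ["ATCCGTCCG", "TTCCG"]
    track1 = [[1.0] * 5 + [0.5] * 4, [1.0] * 5]
    track2 = [[0.5] * 5 + [1.0] * 4, [0.5] * 5]
    print(support_vector(S, [track1, track2], ["additive", "maximal"], "TCCG"))
```

Fixing the support type per track, as the text proposes, keeps the result a plain vector with one component per track.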

2.4 Properties of Support Metrics

In the previous section, we defined basic building blocks like sequences, motifs and mappings that give each position in a promoter sequence one or more weights with regard to the available data sets. In this section, we study various properties of the newly defined support measures and notions.

When we compare additive and maximal support, it is rather easy to see that additive support is always at least as big as maximal support, because additive support considers all occurrences of a motif in a sequence, whereas maximal support only considers the occurrence with maximal support.

Proposition 2.4.1 For any motif $m \in M_\ell$ and a set of sequences $S$

\[ \mathrm{msupp}(S, m) \leq \mathrm{asupp}(S, m). \]

We have not yet mentioned that there is a problem with the way we defined the support of a motif. Namely, the definition breaks the standard sequence mining principle of being downward closed, as a non-frequent motif may have frequent supmotifs.

Claim 2.4.2 Let $\sigma \in \mathbb{R}$ be the threshold. For a motif $m \in M_\ell$ such that $\mathrm{supp}(S, m) < \sigma$, there may exist a supmotif $m' \in M_{\ell+k}$ such that $\mathrm{supp}(S, m') \geq \sigma$, whether we consider additive or maximal support.

Proof. For simplicity, let us assume that there is only one single match of motif $m$, in positions $i$ to $i + \ell - 1$ of sequence $a$, and that $m$ is a prefix of $m'$. Since the scores of all positions are in range $[0, 1]$, we have $\mathrm{supp}(a, m, i) \in [0, 1]$. Now, let us consider a situation where the support of the prefix $\mathrm{supp}(a, m, i) = \bar\varphi(a_{i:i+\ell-1}) < 1$ and the average score of the suffix $\bar\varphi(a_{i+\ell:i+\ell+k-1}) = 1$. From here we can conclude that

\[ \mathrm{supp}(S, m') = \frac{1}{|S|} \cdot \frac{\varphi(a_i) + \ldots + \varphi(a_{i+\ell-1}) + k}{\ell + k} > \mathrm{supp}(S, m). \tag{2.3} \]

If we take $\sigma = \mathrm{supp}(S, m')$, then $m$ is infrequent and $m'$ is frequent. □
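A small numeric instance makes the failure of downward closedness concrete. The sequence, scores and threshold below are made up for illustration and do not come from the thesis; the motif GC scores low on its own, while its extension GCTAA covers three additional positions with score 1.

```python
def single_support(a, phi, m):
    """Support of the first match of m in a (0.0 if m does not occur)."""
    l = len(m)
    for i in range(len(a) - l + 1):
        if a[i:i + l] == m:
            return sum(phi[i:i + l]) / l
    return 0.0


# Hypothetical data: one promoter, one track, single occurrence of each motif.
a = "GCTAA"
phi = [0.25, 0.25, 1.0, 1.0, 1.0]
sigma = 0.5

supp_m = single_support(a, phi, "GC")            # (0.25 + 0.25) / 2
supp_m_prime = single_support(a, phi, "GCTAA")   # (0.5 + 3) / 5

if __name__ == "__main__":
    print(supp_m, supp_m_prime)
    print("m frequent:", supp_m >= sigma, "m' frequent:", supp_m_prime >= sigma)
```

With a single sequence and a single match, additive and maximal support coincide, so the same data is a counterexample for both support types.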

The above proof raises another question: if we know the support of a motif, then what is the maximal possible support of any supmotif? We can approach the answer the same way we proved the above claim. Namely, if we consider the support of the suffix of a possible supmotif to have the maximal possible value, then we can calculate the maximal possible support of the supmotif.

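Proposition 2.4.3 For any motif $m' \in M_{\ell+k}$ and its submotif $m \in M_\ell$ in a set of sequences $S$

\[ \mathrm{msupp}(S, m') \leq \frac{\ell \cdot \mathrm{msupp}(S, m) + |S| \cdot k}{\ell + k} \]

\[ \mathrm{asupp}(S, m') \leq \frac{\ell \cdot \mathrm{asupp}(S, m) + \mathrm{mcount}(S, m) \cdot k}{\ell + k}. \]

Proof. Let us consider a set of fragments $\{a_{p:q}, b_{r:s}, \ldots, z_{t:u}\}$ that represent the positions on every promoter where motif $m$ has the highest support. In that case

\[ \mathrm{msupp}(S, m) = \frac{1}{|S|} \left( \bar\varphi(a_{p:q}) + \bar\varphi(b_{r:s}) + \ldots + \bar\varphi(z_{t:u}) \right). \]

If we now consider $m$ as a prefix of $m'$, then every occurrence of $m'$ extends an occurrence of $m$ by $k$ positions whose scores are at most 1. Analogously to Equation (2.3), we can estimate that the support of $m'$ cannot be larger than

\[ \mathrm{msupp}(S, m') \leq \frac{1}{|S|} \left( \frac{\ell \cdot \bar\varphi(a_{p:q}) + k}{\ell + k} + \frac{\ell \cdot \bar\varphi(b_{r:s}) + k}{\ell + k} + \ldots + \frac{\ell \cdot \bar\varphi(z_{t:u}) + k}{\ell + k} \right) = \frac{1}{|S|} \cdot \frac{\ell \cdot \left( \bar\varphi(a_{p:q}) + \bar\varphi(b_{r:s}) + \ldots + \bar\varphi(z_{t:u}) \right) + |S| \cdot k}{\ell + k} \]

which combined with Equation (2.2) gives $\mathrm{msupp}(S, m') \leq (\ell \cdot \mathrm{msupp}(S, m) + k) / (\ell + k)$. Since $|S| \geq 1$, this implies

\[ \mathrm{msupp}(S, m') \leq \frac{\ell \cdot \mathrm{msupp}(S, m) + |S| \cdot k}{\ell + k}. \tag{2.4} \]

Additive support takes into account all occurrences of $m$, thus replacing the relevant parts in Equation (2.4), we get

\[ \mathrm{asupp}(S, m') \leq \frac{\ell \cdot \mathrm{asupp}(S, m) + \mathrm{mcount}(S, m) \cdot k}{\ell + k}. \]

□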
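The two bounds of Proposition 2.4.3 are cheap to evaluate because they need only numbers that are already known for the shorter motif. The sketch below simply evaluates the stated right-hand sides; the function names and the threshold check at the end are illustrative assumptions, not the thesis's pruning algorithm.

```python
def msupp_bound(msupp_m, l, k, num_sequences):
    """Upper bound on msupp(S, m') for a supmotif m' that is k letters longer."""
    return (l * msupp_m + num_sequences * k) / (l + k)


def asupp_bound(asupp_m, l, k, match_count):
    """Upper bound on asupp(S, m') for a supmotif m' that is k letters longer."""
    return (l * asupp_m + match_count * k) / (l + k)


def may_have_frequent_supmotif(bound, sigma):
    """If the bound is below the threshold, no such supmotif can be frequent."""
    return bound >= sigma


if __name__ == "__main__":
    # Hypothetical numbers: one promoter, motif of length 2 with support 0.25,
    # asking about supmotifs of length 5.
    b = msupp_bound(0.25, l=2, k=3, num_sequences=1)
    print(b, may_have_frequent_supmotif(b, sigma=0.5))
```

For a single promoter in which the three added positions all score 1, the bound 0.7 is attained exactly, so it cannot be improved without looking at the data.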

Claim 2.4.2 implies that frequent motifs may have infrequent submotifs. However, it is important to note that any frequent motif also must have at least one frequent submotif with either additive or maximal type of support.

Lemma 2.4.4 For all positions $i < j \leq k$ in sequence $a$

\[ \bar\varphi(a_{i:k}) \leq \max\{\, \bar\varphi(a_{i:j-1}), \bar\varphi(a_{j:k}) \,\}. \]

Proof. The first possibility is that $\bar\varphi(a_{i:j-1}) > \bar\varphi(a_{j:k})$ or $\bar\varphi(a_{i:j-1}) < \bar\varphi(a_{j:k})$. In that case $\bar\varphi(a_{i:k}) < \max\{\bar\varphi(a_{i:j-1}), \bar\varphi(a_{j:k})\}$, as the average score of the supsequence must be lower than that of the subsequence with the maximal average score. The second possibility is that $\bar\varphi(a_{i:j-1}) = \bar\varphi(a_{j:k})$, thus $\bar\varphi(a_{i:k}) = \bar\varphi(a_{i:j-1}) = \bar\varphi(a_{j:k})$. □

Theorem 2.4.5 Any frequent motif $m \in M_{p+s}$ can be partitioned into two submotifs $m_p \in M_p$ and $m_s \in M_s$, such that $m = m_p \,||\, m_s$ and either the prefix $m_p$ or the suffix $m_s$ is frequent.

Proof. If we have only a single match of the motif $m$ in sequence $a$, at position $i$, then according to Lemma 2.4.4

\[ \mathrm{supp}(a, m, i) \leq \max\{\mathrm{supp}(a, m_p, i), \mathrm{supp}(a, m_s, i + p)\}. \]

Now, suppose we have more matches of $m$ in one single sequence $a$. As maximal support only considers one occurrence of $m$, the theorem holds according to Lemma 2.4.4. By Definition 2.3.2, the additive support is the sum of the supports of all single matches. Let $\mathrm{asupp}(a, m) = \bar\varphi(a_{i_1:j_1}) + \ldots + \bar\varphi(a_{i_n:j_n})$, where $n = \mathrm{mcount}(a, m)$ and $i_t, j_t$ denote the start and end locations of the $t$-th occurrence. Let

\[ \varphi(m_i) \stackrel{\text{def}}{=} \varphi(a_{i_1+i-1}) + \ldots + \varphi(a_{i_n+i-1}) \]

be the total score of the $i$-th motif position over all occurrences and

\[ \bar\varphi(m_{i:j}) \stackrel{\text{def}}{=} \frac{1}{j - i + 1} \left( \varphi(m_i) + \varphi(m_{i+1}) + \ldots + \varphi(m_j) \right), \]

so that $\mathrm{asupp}(a, m) = \bar\varphi(m_{1:p+s})$. Similarly to Lemma 2.4.4, we could show that

\[ \bar\varphi(m_{1:p+s}) \leq \max\{\, \bar\varphi(m_{1:p}), \bar\varphi(m_{p+1:p+s}) \,\} \tag{2.5} \]

which means that $\mathrm{asupp}(a, m) \leq \max\{\mathrm{asupp}(a, m_p), \mathrm{asupp}(a, m_s)\}$. Note that any extra occurrences of the prefix or suffix motifs in input sequences do not invalidate Equation (2.5), as they can only increase the right-hand side.

For either additive or maximal support over a set of sequences $S$, everything works similarly to the above steps, but we have to take into account the constant $\frac{1}{|S|}$. □

[...] pieces such that at least one of them is frequent. We know that partitioning works for two submotifs, thus we can iteratively continue and create as many partitions of the original motif as necessary, because at least one partition always has to be frequent.

Corollary 2.4.6 Given a frequent motif $m \in M_\ell$ and submotifs $m_1, m_2, \ldots, m_n$ that partition $m$ into $n$ pieces, at least one of the submotifs must be frequent.

So far we have described the properties of additive and maximal support without concentrating too much on the actual contents of the motifs. However, we used motif lengths in Proposition 2.4.3 to estimate the maximal possible support of a supmotif. While this is useful knowledge, most of the time these estimations are not the best possible, because they are based solely on the motif support, its length and the possible supmotif length. We can improve this situation by introducing wild card characters.

Proposition 2.4.7 Given motifs $m \in M_\ell$ and $m' \in M_\ell$, where $m'$ is constructed from motif $m$ in such a way that one nucleotide in $m$ is replaced by the wild card character *, the additive or maximal support satisfies

\[ \mathrm{supp}(S, m) \leq \mathrm{supp}(S, m'). \]

For example, consider a motif $m_p = \mathtt{GCT}$ as a prefix of a longer supmotif $m \in M_{10}$. If we want to know the maximal support of any such supmotif, we can calculate $\mathrm{supp}(S, \mathtt{GCT*******})$. It certainly does not give a higher estimate than the previously described method. On the other hand, it requires a query on the database, which depending on the situation can be costly. We describe both approaches more thoroughly in Chapter 3.
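The wild card extension and the monotonicity stated in Proposition 2.4.7 can be checked on a toy input. The sketch assumes string motifs and a single score list per sequence; the sequence and scores are invented for the illustration, and `wildcard_extension` is a name chosen here.

```python
def wildcard_extension(m, length):
    """Pad motif m with '*' on the right until it has the given length."""
    if len(m) > length:
        raise ValueError("motif is longer than the requested length")
    return m + "*" * (length - len(m))


def single_supports(a, phi, m):
    """Average scores of all fragments of a that motif m matches."""
    l = len(m)
    return [sum(phi[i:i + l]) / l
            for i in range(len(a) - l + 1)
            if all(mk == "*" or mk == ak for mk, ak in zip(m, a[i:i + l]))]


def asupp(a, phi, m):
    return sum(single_supports(a, phi, m))


def msupp(a, phi, m):
    return max(single_supports(a, phi, m), default=0.0)


if __name__ == "__main__":
    a = "ATCCGTTCG"                                   # hypothetical promoter
    phi = [1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5]
    print(wildcard_extension("AATA", 10))
    print(asupp(a, phi, "TCCG"), asupp(a, phi, "T*CG"))
```

Replacing a nucleotide by a wild card can only add matches, never remove them, which is why both support types can only grow.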

2.5 Statistically Relevant Motifs

In previous sections, we discussed how to determine if a motif is frequent. In this section, we describe how to go even further by deciding which frequent motifs are statistically more significant. By this, we actually want to measure the amount of surprise for every frequent motif. In our case, we may measure surprise individually even for every data track and we have [...] the promoter sequences. Then we can compare the support measures of the permuted dataset against the original one. If motifs in the original data have higher supports, then they are surprising in that sense. To be more specific, we may generate a large number of datasets by randomly permuting the original sequences. If we consider only one data track, then we can sort the motifs in decreasing order of their support, such that the motif with the highest support is the first in the resulting list. Then, for every motif at position $i$ in the list, we can calculate how many motifs at the $i$-th position in the generated datasets had support as high as the original motif. We can write it down as

\[ p = \Pr[\mathrm{supp}(S', m') \geq \mathrm{supp}(S, m)] \]

where $S$ is the original dataset, $S'$ is the permuted dataset, $m$ is the original motif at the $i$-th position and $m'$ is the motif in $S'$ at the same position. The value $p$ is called the p-value in statistics and in our case represents the probability of having the support in a random dataset at least as extreme as in the original one. Therefore, the smaller the $p$, the more surprising is the motif $m$.

We can calculate a p-value for every frequent motif and for every data track. Of course, we might want to calculate only a single p-value for every motif, but the problem is with sorting frequent motifs. This actually can be done, as discussed in Chapter 3, but having a p-value with respect to each data track may reveal interesting properties of the motifs. We will omit the exact algorithm for calculating p-values, but briefly discuss it later in Section 3.5. Let us refer to this algorithm as SigMotifs onwards.
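Since the exact SigMotifs algorithm is omitted here, the following is only a rough sketch of the rank-based permutation idea for a single data track. Several choices are assumptions made for the illustration: motifs have a fixed length and no wild cards, support is additive, the nucleotides of each sequence are shuffled while the score track stays in place, and the p-value is estimated as the fraction of permuted datasets whose $i$-th best support is at least the original $i$-th best support.

```python
import random
from collections import defaultdict


def additive_supports(seqs, tracks, length):
    """Additive support (Equation 2.1) of every length-`length` motif present."""
    totals = defaultdict(float)
    for a, phi in zip(seqs, tracks):
        for i in range(len(a) - length + 1):
            totals[a[i:i + length]] += sum(phi[i:i + length]) / length
    return {m: s / len(seqs) for m, s in totals.items()}


def rank_p_values(seqs, tracks, length, rounds=200, seed=0):
    """Estimate a p-value for each motif by comparing equal ranks."""
    rng = random.Random(seed)
    original = sorted(additive_supports(seqs, tracks, length).items(),
                      key=lambda kv: -kv[1])
    hits = [0] * len(original)
    for _ in range(rounds):
        shuffled = ["".join(rng.sample(a, len(a))) for a in seqs]
        permuted = sorted(additive_supports(shuffled, tracks, length).values(),
                          reverse=True)
        for rank, (_, support) in enumerate(original):
            if rank < len(permuted) and permuted[rank] >= support:
                hits[rank] += 1
    return [(m, s, hits[r] / rounds) for r, (m, s) in enumerate(original)]


if __name__ == "__main__":
    seqs = ["ATCCGTCCG", "TTCCG"]
    tracks = [[1.0] * 5 + [0.5] * 4, [1.0] * 5]
    for motif, support, p in rank_p_values(seqs, tracks, 4):
        print(motif, support, p)
```

With such short toy sequences the estimated p-values are not meaningful; the point is only the shape of the computation, which repeats the support calculation once per permuted dataset.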

3 Algorithms and Data Structures

In this chapter, we devise algorithms based on the formalization and other ideas described in Chapter 2. We start off by describing a compact encoding of motifs and continue by developing algorithms with different pruning methods and capabilities.

3.1 Compact Encoding of Motifs

It turns out that there is a rather straightforward way to encode fixed-length motifs as unique integers. If we consider the nucleotides and the wild card character as a set $X = \{\mathtt{A}, \mathtt{T}, \mathtt{G}, \mathtt{C}, \mathtt{*}\}$ and have another set of the same size $Y = \{0, 1, 2, 3, 4\}$, then we can define a mapping $\pi : X \to Y$ such that $\pi(\mathtt{A}) = 0$, $\pi(\mathtt{T}) = 1$, $\pi(\mathtt{G}) = 2$, $\pi(\mathtt{C}) = 3$, $\pi(\mathtt{*}) = 4$, which enables us to represent a motif $m \in M_\ell$ as an integer

\[ 5^0 \pi(m_1) + 5^1 \pi(m_2) + \ldots + 5^{\ell-1} \pi(m_\ell). \tag{3.1} \]

For our convenience, let us agree that by writing $\pi(m)$, where $m \in M_\ell$, we mean the integral representation of the motif given in Equation (3.1).

This representation makes it easy to hash any motif of length $\ell$ and store it in a hash-table, as for every fixed-length motif the integral representation is unique.

If the motif length $\ell$ is small enough, then we could use a hash-map of size $5^\ell$. This way we could directly use the value $\pi(m)$ as a key to store the motif's support metrics, and this guarantees constant time $O(1)$ access as there would be no collisions.
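Equation (3.1) is a base-5 positional encoding with the first motif letter as the least significant digit. A minimal sketch, using the mapping $\pi$ given in the text; the function names `encode` and `decode` are illustrative.

```python
CODE = {"A": 0, "T": 1, "G": 2, "C": 3, "*": 4}   # the mapping pi
LETTERS = "ATGC*"                                  # inverse of pi


def encode(m):
    """Integral representation of motif m as in Equation (3.1)."""
    return sum(CODE[ch] * 5 ** k for k, ch in enumerate(m))


def decode(value, length):
    """Recover the motif of the given fixed length from its integer code."""
    letters = []
    for _ in range(length):
        value, digit = divmod(value, 5)
        letters.append(LETTERS[digit])
    return "".join(letters)


if __name__ == "__main__":
    print(encode("ATCC"), encode("TCCG"))
    print(decode(455, 4))
```

The length has to be supplied when decoding because, for example, A and AA both encode to 0; the representation is unique only among motifs of one fixed length, as the text states.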

[...] motifs. For example, there are $5^8 = 390625$ possible motifs of length 8 including wild card characters. For the yeast S. cerevisiae, the promoter lengths are not usually longer than a few thousand base pairs. Therefore, if we have one promoter with a length of 3000 base pairs, we can actually have a maximum of $3000 - 8 + 1 = 2993$ different motifs of length eight without wild card characters.

3.2 Hash-map of Support Metris

The integral representation of motifs allows us to effectively build a hash-map containing the support metrics of all motifs found in promoter sequences. Consider a sequence $a = \mathtt{ATCCGTCCG}$. If we are interested in motifs of length 4, then motif $m_1 = \mathtt{ATCC}$ matches the first position of $a$ and motif $m_2 = \mathtt{TCCG}$ matches the second position of $a$. The integral representations are the following:

$$\begin{aligned}
\pi(m_1) &= 1 \cdot 0 + 5 \cdot 1 + 25 \cdot 3 + 125 \cdot 3 = 455\\
\pi(m_2) &= 1 \cdot 1 + 5 \cdot 3 + 25 \cdot 3 + 125 \cdot 2 = 341.
\end{aligned}$$

It turns out that we can update the integral representation of $m_1$ to that of $m_2$ in constant time. By Equation (3.1), the integral representation of motif $m_1$ is $\pi(m_1) = 5^0 \cdot \pi(a_1) + 5^1 \cdot \pi(a_2) + 5^2 \cdot \pi(a_3) + 5^3 \cdot \pi(a_4)$. By subtracting the first element $5^0 \cdot \pi(a_1)$, dividing the result by five and adding $5^3 \cdot \pi(a_5)$, we get

$$\frac{\pi(m_1) - 5^0 \cdot \pi(a_1)}{5} + 5^3 \cdot \pi(a_5) = 5^0 \cdot \pi(a_2) + 5^1 \cdot \pi(a_3) + 5^2 \cdot \pi(a_4) + 5^3 \cdot \pi(a_5),$$

which is equal to $\pi(m_2)$. So in our example, where $\pi(m_1) = 455$, we can calculate

$$\pi(m_2) = \frac{\pi(m_1) - 5^0 \cdot \pi(a_1)}{5} + 5^3 \cdot \pi(a_5) = \frac{455 - 0}{5} + 125 \cdot 2 = 341.$$
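The constant-time update can be sketched as a sliding window over a sequence. This is a minimal illustration under the digit assignment of Section 3.1; the generator name `rolling_codes` is hypothetical.

```python
from typing import Iterator

# Digit assignment from Section 3.1 (sequences contain no wild cards).
PI = {"A": 0, "T": 1, "G": 2, "C": 3}


def rolling_codes(seq: str, length: int) -> Iterator[int]:
    """Yield the integral representation of every window of `length` symbols."""
    if len(seq) < length:
        return
    top = 5 ** (length - 1)  # weight of the last motif position
    code = sum(PI[ch] * 5 ** i for i, ch in enumerate(seq[:length]))
    yield code
    for pos in range(length, len(seq)):
        leaving = PI[seq[pos - length]]  # digit of the symbol leaving the window
        # Subtract the lowest digit, shift down by one base-5 position,
        # then add the entering symbol as the highest digit.
        code = (code - leaving) // 5 + top * PI[seq[pos]]
        yield code


print(list(rolling_codes("ATCCGTCCG", 4)))
# [455, 341, 193, 413, 457, 341]
```

The first two values reproduce $\pi(m_1) = 455$ and $\pi(m_2) = 341$ from the example, and the last window is TCCG again, so it maps to 341 once more.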

Analogously, we can do this with the supports of single matches for all tracks. This is important because it lets us calculate all support metrics of all motifs present in the data in one pass. The negative side effect of this approach with scores is a possibly greater floating-point rounding error, but we can reduce it effectively by recalculating the scores from the data tracks after every 100 or 1000 steps. This is, of course, not an issue with the integral representation.


Let us give an in-depth example. Consider two sequences $a = \mathtt{ATCCGTCCG}$, $b = \mathtt{TTCCG}$ and two mappings $\varphi_1, \varphi_2$ representing two data tracks such that

$$\begin{aligned}
\varphi_1(a_{1:9}) &= 1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5\\
\varphi_2(a_{1:9}) &= 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0\\
\varphi_1(b_{1:5}) &= 1.0, 1.0, 1.0, 1.0, 1.0\\
\varphi_2(b_{1:5}) &= 0.5, 0.5, 0.5, 0.5, 0.5.
\end{aligned}$$

We can traverse the promoters step by step, such that after every cycle the hash-map contains up-to-date support metrics based on the occurrences of motifs seen so far. All unseen occurrences are regarded as having single supports equal to zero. In our example, the details of traversing $a$ and $b$ are given in the table below.

Step  m     $\varphi_1(m)$  $\varphi_2(m)$  Comment
1     ATCC  1.0             0.5             Add ATCC to the hash-map.
2     TCCG  1.0             0.5             Do the same with TCCG.
3     CCGT  0.875           0.625           Keep adding unseen motifs into the hash-map with their support metrics.
4     CGTC  0.75            0.75            As in step 3.
5     GTCC  0.625           0.875           As in step 3.
6     TCCG  0.5             1.0             Update the support metrics of TCCG.
7     TTCC  1.0             0.5             We are processing $b$ now.
8     TCCG  1.0             0.5             Update the support metrics of TCCG.

For example, consider motif TCCG. For additive support over all sequences we sum $1.0/2 + 0.5/2 + 1.0/2$ for $\varphi_1$ and $0.5/2 + 1.0/2 + 0.5/2$ for $\varphi_2$. We divide the scores by two due to Definition 2.3.4. After every update, the additive supports are up to date based on the data seen so far. For maximal support, we need to do more book-keeping, because when we find an occurrence with a bigger maximal score in a sequence, we have to cancel the effect of the previous occurrence. For example, the maximal support after step two is $0.5/2$ for $\varphi_2$. At step 6, we discover that it should be $1.0/2$ instead; therefore we subtract $0.5/2$ from the variable containing the support and add $1.0/2$.
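The book-keeping for both support types can be sketched as follows. The sketch rests on two assumptions read off the example rather than taken from Definition 2.3.4 itself: the single support of an occurrence is the mean of the track scores over the matched window, and both supports are normalised by the number of sequences. Scores are assumed to be non-negative, and motifs are keyed by their string form instead of $\pi(m)$ to keep the code readable. The name `build_support_map` is hypothetical.

```python
from collections import defaultdict


def build_support_map(sequences, tracks, length):
    """One pass over all windows.

    sequences: list of strings.
    tracks: tracks[t][j] is the score list of track t on sequence j.
    Returns (asupp, msupp): dicts mapping motif -> list of supports per track.
    """
    n_seq, n_tracks = len(sequences), len(tracks)
    asupp = defaultdict(lambda: [0.0] * n_tracks)
    msupp = defaultdict(lambda: [0.0] * n_tracks)
    for j, seq in enumerate(sequences):
        best = {}  # best single support per motif inside the current sequence
        for pos in range(len(seq) - length + 1):
            motif = seq[pos:pos + length]
            prev = best.setdefault(motif, [0.0] * n_tracks)
            for t in range(n_tracks):
                single = sum(tracks[t][j][pos:pos + length]) / length
                asupp[motif][t] += single / n_seq
                if single > prev[t]:
                    # Cancel the previous maximum of this sequence, add the new one.
                    msupp[motif][t] += (single - prev[t]) / n_seq
                    prev[t] = single
    return dict(asupp), dict(msupp)


seqs = ["ATCCGTCCG", "TTCCG"]
phi1 = [[1.0] * 5 + [0.5] * 4, [1.0] * 5]
phi2 = [[0.5] * 5 + [1.0] * 4, [0.5] * 5]
asupp, msupp = build_support_map(seqs, [phi1, phi2], 4)
print(asupp["TCCG"])  # [1.25, 1.0]
print(msupp["TCCG"])  # [1.0, 0.75]
```

For TCCG the additive values equal the sums $1.0/2 + 0.5/2 + 1.0/2$ and $0.5/2 + 1.0/2 + 0.5/2$ from the text, and the maximal support on $\varphi_2$ reflects the correction made at step 6.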

With this kind of hash-map construction we calculate all the metrics on the fly. Therefore, we avoid any post-processing, because calculating the support measures over all sequences would otherwise require intermediate lists containing scores of single supports. With motifs without wild card characters, this would not be a very big memory overhead, but otherwise it [...]


$$O\Bigl(n \cdot c \cdot \sum_{s \in S} |s|\Bigr)$$

where $n$ is the number of data tracks and $c$ is the complexity of updating the support of a motif in the hash-map.

3.2.1 Including Motifs with Wild Card Characters

We will discuss SafeApproxSearch in Section 3.4.2, where hash-maps are required to also contain the supports of all wild card character extensions. This requires us to modify the method described earlier. The integral representation allows us to precompute the suffix parts of all extensions. Let $w_i$ be the suffix part of some motif $m$ of length $\ell$, such that $1 \le i \le \ell$ and $m_{i:\ell} = \mathtt{*} \ldots \mathtt{*}$. Then $\pi(w_i) = 5^{i-1} \pi(\mathtt{*}) + \ldots + 5^{\ell-1} \pi(\mathtt{*})$. If we now have the integral representation of a prefix $m_p$, then $\pi(m_p) + \pi(w_i)$ yields the integral representation of the wild card character extension. In the hash-map construction phase, it takes $\ell$ steps instead of one to include the support metrics of all wild card character extensions; therefore the complexity is

$$O\Bigl(n \cdot c \cdot \ell \cdot \sum_{s \in S} |s|\Bigr).$$
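The precomputation of suffix parts can be sketched as below. The helper names are hypothetical, and the suffix table is indexed by prefix length $p = i - 1$ rather than by $i$, which is a convention of this sketch only.

```python
from typing import List

# Digit assignment from Section 3.1.
PI = {"A": 0, "T": 1, "G": 2, "C": 3, "*": 4}


def wildcard_suffix_codes(length: int) -> List[int]:
    """codes[p] is the code of wild cards filling positions p+1..length (1-based)."""
    codes = [0] * (length + 1)  # codes[length] = 0: nothing left to fill
    for p in range(length - 1, -1, -1):
        codes[p] = codes[p + 1] + PI["*"] * 5 ** p
    return codes


def extension_codes(window: str) -> List[int]:
    """Codes of the wild card extensions of every non-empty prefix of `window`."""
    suffix = wildcard_suffix_codes(len(window))  # would be computed once in practice
    codes, prefix_code = [], 0
    for p, ch in enumerate(window, start=1):
        prefix_code += PI[ch] * 5 ** (p - 1)   # pi(m_p) for the prefix of length p
        codes.append(prefix_code + suffix[p])  # pi(m_p) + pi(w_{p+1})
    return codes


print(extension_codes("ATCC"))  # [620, 605, 580, 455]
```

The four values are the codes of A***, AT**, ATC* and ATCC, which shows the $\ell$ hash-map updates needed per window.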

3.3 Naive Search based on Apriori

The simplest search method is based on the Apriori principle described in Chapter 1. Namely, we can mine all motifs present in the input sequences by setting the threshold $\sigma = 1$ with Apriori and then check whether they are frequent in our terms. This is actually a composition of Apriori and a filtering function. In our case, it is better to implement this as a depth-first search algorithm, because the breadth-first nature of Apriori causes too much memory overhead when mining longer motifs. Algorithm 3.3.1 incorporates the composition of Apriori and the filtering function. On lines 10-12, we see the candidate generation part of the algorithm. Note that we always use the motifs A, T, G, C for extension. This is due to the fact that there are rarely cases where a nucleotide in promoter sequences is missing. The Apriori pruning principle is in action on lines 2-3 and the filtering function is given on lines 4-9. Function IsFrequent checks whether every support meets its threshold, that is, whether $s_i \ge \sigma_i$ for all $i$, where $\vec{s} = \overrightarrow{\mathrm{supp}}(S, m)$. Recall that the $\overrightarrow{\mathrm{supp}}$ operator returns a vector of values, where each element determines the support per one data track according to the additive or maximal support type. Also, if $\overrightarrow{\mathrm{supp}}$ and $\mathrm{mcount}$ are implemented using data structures like the hash-map discussed in the previous section, then these need to be constructed before running this algorithm.

Algorithm 3.3.1 NaiveSearch
 1: procedure NaiveSearch($S, \vec{\sigma}, m, \ell$)
 2:     if $\mathrm{mcount}(S, m) = 0$ then
 3:         return
 4:     else if $|m| = \ell$ then
 5:         if IsFrequent($\vec{\sigma}, \overrightarrow{\mathrm{supp}}(S, m)$) then
 6:             SaveMotif($m$)
 7:         end if
 8:         return
 9:     end if
10:     for $e \in \{\mathtt{A}, \mathtt{T}, \mathtt{G}, \mathtt{C}\}$ do
11:         NaiveSearch($S, \vec{\sigma}, m \,\|\, e, \ell$)
12:     end for
13: end procedure

As an example, by calling NaiveSearch($S, \vec{\sigma}, \theta, 8$), where $\theta$ is the empty zero-length motif, $S$ is the set of sequences and $\vec{\sigma}$ is the vector of thresholds, we find all frequent motifs of length 8. The complexity of NaiveSearch is $O(4^\ell)$, where $\ell$ is the fixed motif length.
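The depth-first structure of the naive search can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of the thesis tool: `count_occurrences` plays the role of $\mathrm{mcount}$, the caller supplies `supp` as a function from a motif to its support vector, the empty motif is treated as always present, and a motif counts as frequent when every support reaches its threshold.

```python
from collections import Counter


def count_occurrences(sequences, max_len):
    """Stand-in for mcount: occurrences of every substring up to max_len."""
    counts = Counter()
    for seq in sequences:
        for n in range(1, max_len + 1):
            for pos in range(len(seq) - n + 1):
                counts[seq[pos:pos + n]] += 1
    return counts


def naive_search(counts, supp, thresholds, length, motif="", found=None):
    """Depth-first enumeration of motifs present in the data, then filtering."""
    if found is None:
        found = []
    if motif and counts[motif] == 0:  # Apriori pruning: motif absent from data
        return found
    if len(motif) == length:          # filtering step on full-length motifs
        if all(s >= t for s, t in zip(supp(motif), thresholds)):
            found.append(motif)
        return found
    for e in "ATGC":                  # candidate generation
        naive_search(counts, supp, thresholds, length, motif + e, found)
    return found


seqs = ["ATCCGTCCG", "TTCCG"]
counts = count_occurrences(seqs, 4)
# Toy single-track support used only for this demonstration:
# occurrences per sequence, not one of the thesis support types.
toy_supp = lambda m: [counts[m] / len(seqs)]
print(naive_search(counts, toy_supp, [1.0], 4))  # ['TCCG']
```

With the toy support, only TCCG occurs often enough (three times in two sequences) to reach the threshold 1.0.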

3.4 Pruning Strategies

In this section, we describe different pruning strategies, which can be used to make more efficient algorithms compared to NaiveSearch. All these methods are based on properties studied in Chapter 2.

3.4.1 Maximal Support Estimation Pruning

The simplest method is based on Proposition 2.4.3. Namely, if we are mining motifs of length $\ell + k$ and we have some motif $m \in M_\ell$, then the support measures of any of its supermotifs of length $\ell + k$ cannot be greater than those of a motif having $m$ as a prefix and a hypothetical suffix with score 1.0. Therefore, a motif $m$ and its supermotifs can be pruned, if on any of the data tracks

$$\frac{\ell \cdot \mathrm{msupp}(S, m) + |S| \cdot k}{\ell + k} < \sigma$$

if we are mining using maximal support or

$$\frac{\ell \cdot \mathrm{asupp}(S, m) + \mathrm{mcount}(S, m) \cdot k}{\ell + k} < \sigma$$

if we are mining using additive support. Of course, the maximal motif length $\ell + k$ must be fixed to make these formulas usable. As an example, let us analyze Figure 3.1.

Figure 3.1: Support of motifs AT and AT** in a sample subsequence.

We see that $\mathrm{msupp}(S, \mathtt{AT}) = \max\{0.1; 0.5; 0.25; 0.5\} = 0.5$ and $\mathrm{asupp}(S, \mathtt{AT}) = 0.1 + 0.5 + 0.25 + 0.5 = 1.35$. If we were mining using the maximal type of support on this track, then we could prune the motif with its supermotifs if

$$(2 \cdot \mathrm{msupp}(S, \mathtt{AT}) + 2)/4 = (2 \cdot 0.5 + 2)/4 = 0.75 < \sigma,$$

where $\sigma$ is the threshold. For the additive type of support this would be

$$(2 \cdot \mathrm{asupp}(S, \mathtt{AT}) + 2 \cdot \mathrm{mcount}(S, \mathtt{AT}))/4 = (2 \cdot 1.35 + 2 \cdot 4)/4 = 2.675 < \sigma.$$

Incorporating this pruning method requires only small changes to NaiveSearch on line 2 of Algorithm 3.3.1. The result is given in Algorithm 3.4.1, where CanPrune uses the method described above to determine if the motif and its supermotifs can be pruned.


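Algorithm 3.4.1 MaxSupSearch
 1: procedure MaxSupSearch($S, \vec{\sigma}, m, \ell$)
 2:     if $\mathrm{mcount}(S, m) = 0 \lor \mathrm{CanPrune}(\vec{\sigma}, \overrightarrow{\mathrm{supp}}(S, m))$ then
 3:         return
 4:     else if $|m| = \ell$ then
 5:         if IsFrequent($\vec{\sigma}, \overrightarrow{\mathrm{supp}}(S, m)$) then
 6:             SaveMotif($m$)
 7:         end if
 8:         return
 9:     end if
10:     for $e \in \{\mathtt{A}, \mathtt{T}, \mathtt{G}, \mathtt{C}\}$ do
11:         MaxSupSearch($S, \vec{\sigma}, m \,\|\, e, \ell$)
12:     end for
13: end procedure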
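The two upper bounds behind CanPrune can be written down directly. The function names and argument order below are our own, and the formulas are transcribed as they are stated for the maximal and additive support types; whether further normalisation applies depends on the definitions in Chapter 2.

```python
def max_support_bound(msupp, length, extra, n_sequences):
    """Upper bound for supermotifs of length `length + extra`, maximal support."""
    return (length * msupp + n_sequences * extra) / (length + extra)


def additive_support_bound(asupp, mcount, length, extra):
    """Upper bound for supermotifs of length `length + extra`, additive support."""
    return (length * asupp + mcount * extra) / (length + extra)


def can_prune(bounds, thresholds):
    """Prune when the bound falls below the threshold on any data track."""
    return any(b < t for b, t in zip(bounds, thresholds))


# Motif of length 2 extended to length 4 in a single sequence,
# with maximal support 0.5 as in the AT example.
bound = max_support_bound(0.5, length=2, extra=2, n_sequences=1)
print(bound)                      # 0.75
print(can_prune([bound], [0.8]))  # True: no extension can reach 0.8
print(can_prune([bound], [0.7]))  # False: an extension may still be frequent
```

The bound is cheap to evaluate because it only needs values that are already stored for the prefix.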

3.4.2 Safe Over-Approximation Search

Another improvement to NaiveSearch uses a slightly different approach. It is based on Proposition 2.4.7, which states that the support of any motif obtained from a motif $m$ by replacing one or more nucleotides with wild card characters is greater than or equal to the support of the original motif. This holds with both the maximal and the additive type of support. This allows us to define a support operator that is guaranteed to be downward closed, which was an issue with NaiveSearch and MaxSupSearch [Lau09]. We will refer to it as the safe over-approximation type of support onwards.

Definition 3.4.1 Let $\widehat{\mathrm{supp}}(S, m)$ of motif $m \in M_\ell$ denote the support of its wild card character extension $\hat{m} \in M_k$, where $\ell \le k$.

Recall that a wild card character extension of $m$ is a fixed-length motif that contains $m$ as the prefix and the rest of the elements (wild card characters) as the suffix. As an example, if we are interested in mining motifs of length $\ell = 3$, we first start by checking the support of the wild card character extensions of the motifs in $M_1$, namely A**, T**, G**, C** (note that we do not include motif * in this list, as it is anyway the most frequent motif and we are not interested in it). If any of these motifs is infrequent, for example T, then we prune all its supermotifs TAA, TAT, TAG, TAC, TTA et cetera. But [...] $\widehat{\mathrm{supp}}$ operator. We only have to keep in mind that it is downward closed only when mining motifs with fixed length, so that Proposition 2.4.7 holds.

Algorithm 3.4.2 Safe Over-Approximation Search
 1: procedure SafeApproxSearch($S, \vec{\sigma}, m, \ell$)
 2:     if IsFrequent($\vec{\sigma}, \overrightarrow{\widehat{\mathrm{supp}}}(S, m)$) then
 3:         if $|m| = \ell$ then
 4:             SaveMotif($m$)
 5:             return
 6:         end if
 7:     else
 8:         return
 9:     end if
10:     for $e \in \{\mathtt{A}, \mathtt{T}, \mathtt{G}, \mathtt{C}\}$ do
11:         SafeApproxSearch($S, \vec{\sigma}, m \,\|\, e, \ell$)
12:     end for
13: end procedure

Algorithm 3.4.2 defines SafeApproxSearch. Note that we use the safe over-approximation operator $\overrightarrow{\widehat{\mathrm{supp}}}$ instead of $\overrightarrow{\mathrm{supp}}$ and use IsFrequent to determine whether we can prune the motif with its supermotifs. This is possible due to the downward closedness of the $\overrightarrow{\widehat{\mathrm{supp}}}$ operator.

Both MaxSupSearch and SafeApproxSearch have a similar theoretical runtime complexity $O(f \cdot 4^\ell)$, where the pruning factor $f \in (0, 1]$ is maximal if no pruning occurs and minimal if all motifs are pruned.
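The search over wild card extensions can be sketched as follows. The sketch makes several assumptions of its own: `ext_supp` is supplied by the caller and returns the support vector of the wild card extension of a prefix, the empty prefix is never tested, and the demonstration uses a count-based toy support (windows starting with the prefix, per sequence) as a stand-in for the score-based supports of Chapter 2. The toy support is downward closed by construction, which is the property the search relies on.

```python
from collections import Counter


def extension_counts(sequences, length):
    """For every prefix of every window of `length` symbols, count the windows
    that start with it; this equals the match count of its wild card extension."""
    counts = Counter()
    for seq in sequences:
        for pos in range(len(seq) - length + 1):
            window = seq[pos:pos + length]
            for p in range(1, length + 1):
                counts[window[:p]] += 1
    return counts


def safe_approx_search(ext_supp, thresholds, length, motif="", found=None):
    """Depth-first search that prunes a prefix as soon as its extension is infrequent."""
    if found is None:
        found = []
    if motif:  # the empty prefix (all wild cards) is not tested
        if not all(s >= t for s, t in zip(ext_supp(motif), thresholds)):
            return found  # prune the prefix together with all its supermotifs
        if len(motif) == length:
            found.append(motif)
            return found
    for e in "ATGC":
        safe_approx_search(ext_supp, thresholds, length, motif + e, found)
    return found


seqs = ["ATCCGTCCG", "TTCCG"]
counts = extension_counts(seqs, 4)
toy_ext_supp = lambda m: [counts[m] / len(seqs)]  # toy single-track support
print(safe_approx_search(toy_ext_supp, [1.0], 4))  # ['TCCG']
```

In this run the prefixes A and G are rejected at depth one, so none of their extensions is ever visited.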

3.4.3 Infrequent Sub-Motifs Pruning Method

This alternative search method is directly based on Theorem 2.4.5. Namely, if we are interested in motifs of length $\ell$, then for any partitioning of a frequent motif $m \in M_\ell$ into two pieces $m_1, m_2$, at least one of the pieces must be frequent. The idea is to generate two sets $F$ and $I$, where $F$ contains the frequent motifs and $I$ the infrequent ones of length $\ell/2$. We then combine motifs from $F$ and $I$ to enumerate the final candidates. Note that we need $I$, because any frequent motif of length $\ell$ may have an infrequent prefix or suffix. We do not need to consider combinations of two infrequent submotifs, as due to Theorem 2.4.5 we know that the resulting motif is also infrequent. Also, there are many ways to partition the motifs, but making the pieces the same length enables us to enumerate them faster. Algorithm 3.4.3 describes this process.

Algorithm 3.4.3 Infrequent Sub-Motifs Search
 1: procedure InfreqSearch($S, \vec{\sigma}, m, \ell$)    ▷ $\ell$ must be even
 2:     $(F, I) \leftarrow \mathrm{EnumerateMotifs}(S, \vec{\sigma}, \ell/2)$
 3:     $C \leftarrow \{(a, b) \mid a \in F, b \in F \cup I\}$
 4:     for $c \in C$ do
 5:         if CanPrune($\vec{\sigma}, \overrightarrow{\mathrm{supp}}(S, c)$) then
 6:             continue
 7:         else if IsFrequent($\vec{\sigma}, \overrightarrow{\mathrm{supp}}(S, c_1 \,\|\, c_2)$) then
 8:             SaveMotif($c_1 \,\|\, c_2$)
 9:         else if $c_1 \ne c_2$ then
10:             if IsFrequent($\vec{\sigma}, \overrightarrow{\mathrm{supp}}(S, c_2 \,\|\, c_1)$) then
11:                 SaveMotif($c_2 \,\|\, c_1$)
12:             end if
13:         end if
14:     end for
15: end procedure

On line 3, we enumerate all the candidate motifs of length $\ell$. On line 5, we first try to eliminate candidates by using the information we know about their prefix $m_1$ and suffix $m_2$. We try this first, because querying the database can, depending on the data structures used, be more costly. The CanPrune method checks on every track if

$$\mathrm{msupp}(S, m_1 \,\|\, m_2) \le \frac{\mathrm{msupp}(S, m_1) + \mathrm{msupp}(S, m_2)}{2} < \sigma$$

for the maximal support type and

$$\mathrm{asupp}(S, m_1 \,\|\, m_2) \le \frac{\mathrm{asupp}(S, m_1) + \mathrm{asupp}(S, m_2)}{2} < \sigma$$

for the additive support type [...] Proposition 2.4.3. If we can prune $m_1 \,\|\, m_2$ using the above equations, then we can also prune $m_2 \,\|\, m_1$, as it makes no difference in what order we consider the prefix and suffix part.
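Candidate generation and the cheap pruning test can be sketched as follows. The function names are hypothetical, and `half_supp` and `full_supp` stand for lookups of support vectors for half-length and full-length motifs. As a deliberate choice of this sketch, both concatenation orders of a surviving pair are tested independently of each other.

```python
from itertools import product


def half_bound(supp_a, supp_b):
    """Per-track upper bound for the concatenation of two half-length motifs."""
    return [(x + y) / 2 for x, y in zip(supp_a, supp_b)]


def infreq_search(frequent, infrequent, half_supp, full_supp, thresholds):
    """Combine half-length motifs; at least one half of each pair is frequent."""
    found = set()
    for a, b in product(frequent, frequent | infrequent):
        bound = half_bound(half_supp(a), half_supp(b))
        if any(x < t for x, t in zip(bound, thresholds)):
            continue  # pruned without looking up the full-length motif
        for motif in {a + b, b + a}:  # the bound is symmetric in a and b
            if all(s >= t for s, t in zip(full_supp(motif), thresholds)):
                found.add(motif)
    return found


# Toy single-track supports, invented for this demonstration only.
half = {"AT": [1.0], "CC": [0.8], "GG": [0.2]}
full = {"ATCC": [0.9], "CCAT": [0.5], "ATAT": [0.75], "ATGG": [0.1]}
result = infreq_search(
    frequent={"AT", "CC"},
    infrequent={"GG"},
    half_supp=half.__getitem__,
    full_supp=lambda m: full.get(m, [0.0]),
    thresholds=[0.7],
)
print(sorted(result))  # ['ATAT', 'ATCC']
```

The pairs (AT, GG) and (CC, GG) are discarded by the bound alone, so their full-length supports are never queried.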

3.5 Mining Fixed Number of Best Motifs

The search algorithms discussed in earlier sections concentrate on finding all frequent motifs with respect to some threshold vector. But suppose we want to mine the 100 best motifs. Doing this by hand using any previously mentioned search algorithm would require the following process. First, we determine some reasonable thresholds and support types for the data tracks. Second, we mine frequent motifs using these thresholds and decide whether the number of motifs was too small or too large. Third, we modify the thresholds by increasing or decreasing them and mine again until we have the desired number of frequent motifs.

The process we just described is actually similar to binary search known in computer science. Algorithm 3.5.1 implements it to automate this process. On line 3, we determine two scalars $\alpha$ and $\beta$, such that mining with $\alpha \cdot \vec{\sigma}$ returns all motifs present in the data and mining with $(\beta + \varepsilon) \cdot \vec{\sigma}$ returns none of the motifs, where $\varepsilon > 0$. It is trivial that $\alpha = 0$, because in that case all motifs will be frequent. Determining $\beta$ is more complicated, because we do not have any prior knowledge about the maximal supports in the data. The first option is to make a guess, but a better alternative is to find out the supports by calculating $\vec{s} = \overrightarrow{\mathrm{supp}}(S, \mathtt{*})$ and set

$$\beta = \max\{s_i / \sigma_i \mid i = 1, \ldots, n\} \tag{3.2}$$

where $n$ is the number of data tracks and $\vec{\sigma}$ contains the user-defined thresholds. This way, mining with $\beta \cdot \vec{\sigma}$ may return only the minimal possible number of frequent motifs. Having these boundaries fixed, we can easily combine any previously defined search method with binary search. In other words, we keep scaling the original vector of thresholds $\vec{\sigma}$ until we get the desired number of frequent motifs. The linearity of this approach may not always be the best choice, because the relations between the reasonable thresholds depend on the nature of the data. We do not study further possibilities in this work, but it could be a possible research area in the future.


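Algorithm 3.5.1 NBest
 1: procedure NBest($S, \vec{\sigma}, N, \ell$)
 2:     $\vec{s} \leftarrow \overrightarrow{\mathrm{supp}}(S, \mathtt{*})$
 3:     $\alpha \leftarrow 0$, $\beta \leftarrow \max\{s_i / \sigma_i \mid i = 1, \ldots, n\}$    ▷ $n$ is the number of tracks
 4:     $C \leftarrow \infty$    ▷ The closest number of best motifs
 5:     $\delta \leftarrow 0$    ▷ Scalar to be used to mine the closest number of best motifs
 6:     while $\beta - \alpha > \varepsilon$ do    ▷ $\varepsilon > 0$ limits the recursion depth
 7:         $\gamma \leftarrow (\alpha + \beta)/2$
 8:         $k \leftarrow \mathrm{NumFreqMotifs}(S, \gamma \cdot \vec{\sigma}, \theta, \ell)$    ▷ $\theta$ is the zero-length motif
 9:         if $\mathrm{abs}(k - N) < \mathrm{abs}(C - N)$ then
10:             $C \leftarrow k$, $\delta \leftarrow \gamma$
11:         end if
12:         if $k > N$ then
13:             $\alpha \leftarrow \gamma$
14:         else if $k < N$ then
15:             $\beta \leftarrow \gamma$
16:         else if $k = N$ then
17:             break
18:         end if
19:     end while
20:     return MineMotifs($S, \delta \cdot \vec{\sigma}, \theta, \ell$)
21: end procedure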
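The binary search over the scaling factor can be sketched independently of the underlying miner. In this sketch, `count_frequent` is a hypothetical stand-in for NumFreqMotifs that maps a scale $\gamma$ to the number of motifs frequent under $\gamma \cdot \vec{\sigma}$. The closeness test compares the distances $|k - N|$ and $|C - N|$, which is this sketch's reading of line 9.

```python
import math


def n_best_scale(count_frequent, beta, target, eps=1e-6):
    """Return (scale, count) whose count of frequent motifs is closest to target."""
    alpha = 0.0
    closest, delta = math.inf, 0.0
    while beta - alpha > eps:  # eps bounds the number of iterations
        gamma = (alpha + beta) / 2
        k = count_frequent(gamma)
        if abs(k - target) < abs(closest - target):
            closest, delta = k, gamma
        if k > target:
            alpha = gamma      # too many motifs: raise the thresholds
        elif k < target:
            beta = gamma       # too few motifs: lower the thresholds
        else:
            break
    return delta, closest


# Toy data: single-track supports of five motifs, threshold sigma = 1.0.
supports = [0.9, 0.8, 0.7, 0.4, 0.2]
count = lambda scale: sum(s >= scale * 1.0 for s in supports)
print(n_best_scale(count, beta=max(supports), target=3))  # (0.45, 3)
```

When the target cannot be met, for example because it exceeds the number of motifs in the data, the loop stops once the interval is narrower than `eps` and the closest count found so far is returned.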

Function NumFreqMotifs can be used as a wrapper around the search methods described in earlier sections. There are still a few things to consider. First, there does not always exist some fixed number of best motifs, because two motifs may have exactly the same support measures. In that case, binary search goes into an infinite loop. The same happens when the number of desired motifs is greater than the number of motifs present in the input data. In both situations, we need to limit the maximal depth of the recursion. But we can still return a number of motifs that is very close to the desired number of motifs. On line 4, we define $C$, which remembers the number of frequent motifs closest to the desired fixed number of motifs. The scalar $\delta$ can be used to scale $\vec{\sigma}$ to get $C$ frequent motifs. On line 6, we use $\varepsilon > 0$ to limit the recursion depth. On lines 12-18, we see binary search in action. The while loop
