FACULTY OF MATHEMATICS AND COMPUTER SCIENCE
Institute of Computer Science
Timo Petmanson
Mining Motifs in DNA
Regulatory Area
Bachelor's Thesis (6 ECTS)
Supervisor: Sven Laur, D.Sc. (Tech.)
Author: ... ... May 2010
Supervisor: ... ... May 2010
Chairman: ... ... ... 2010
TARTU 2010
Introduction
1 Preliminaries
1.1 DNA
1.2 Gene expression
1.3 Data mining
2 Sequence Mining with Multiple Layers of Data
2.1 Sequences and Scores
2.2 Motifs and Matching
2.3 Support Metrics
2.4 Properties of Support Metrics
2.5 Statistically Relevant Motifs
3 Algorithms and Data Structures
3.1 Compact Encoding of Motifs
3.2 Hash-map of Support Metrics
3.2.1 Including Motifs with Wild Card Characters
3.3 Naive Search based on Apriori
3.4 Pruning Strategies
3.4.1 Maximal Support Estimation Pruning
3.4.2 Safe Over-Approximation Search
3.4.3 Infrequent Sub-Motifs Pruning Method
3.5 Mining Fixed Number of Best Motifs
3.6 Generalized FP-Tree
4 Experimental Results
4.1 Runtime Performance of Search Algorithms
4.1.2 Mining Fixed Number of Frequent Motifs
4.2 Mining Biologically Significant Motifs
4.2.1 Data Preparation
4.2.2 Results
Summary
Resümee (in Estonian)
Bibliography
A Multi-constraint miner tool for gene expression analysis
Introduction
All living organisms on Earth are believed to contain genetic information coded in structured collections of genes and non-coding sequences that make up the DNA. The coded information is used to build and maintain organisms, and it defines a wide range of genetic features that vary from individual to individual and from species to species. The non-coding parts bear much of the responsibility for regulating the expression of particular genes. Genes with their non-coding regulatory areas form complex signaling networks that together coordinate the life cycle of an organism. Contemporary methods in genetics like ChIP and micro-array measurements make it possible to measure features of thousands of genes in one experiment, generating huge amounts of data. Therefore, the development of new algorithms and methods able to analyze this data is crucial.
Our contributions include the development of novel methods able to combine different sources of experimental data. In Chapter 2, we formalize the theory describing sequence mining with multiple input sequences and multiple data layers. We also describe how to determine statistically significant motifs using our theory. In Chapter 3, we develop the algorithms MaxSupSearch, SafeApproxSearch, InfreqSearch and GFPSearch, which utilize different pruning strategies. For GFPSearch, we define a generic frequent-pattern tree structure that is a generalization of the FP-tree [JJYR04]. We also develop NBest, which combines any of the previously mentioned algorithms with binary search to get a fixed number of best motifs. We develop SigMotifs, which goes even further by distilling out statistically significant motifs. A performance study of the mentioned algorithms along with experiments on real biological data is given in Chapter 4.
Preliminaries
1.1 DNA
Currently scientists have described about 1.5 million different species: about five thousand mammals, thirty thousand species of fish and over nine hundred thousand insects among others [WCU07]. Some estimates based on comparing samples from various parts of the world's seas suggest that in oceans there may be more than 100 million species of bacteria [MHJ06]. This vast diversity of known and unknown species in Earth's biosphere is believed to have one thing in common: the presence of DNA.
Figure 1.1: DNA double helix. The distance between strands varies and forms major and minor grooves.
The backbone of each DNA strand contains alternating phosphate and sugar residues linked with bases. The two strands form a structure known as the double helix, seen in Figure 1.1, whose stability is maintained by hydrogen bonds between the bases, see Figure 1.2 [RSM05]. There are four types of bases in DNA: adenine (abbreviated A), thymine (T), guanine (G) and cytosine (C), each of which combined with a sugar and one or more phosphate residues forms a nucleotide. The nucleotides are pairwise aligned, making the structure anti-parallel, where adenine bonds only to thymine and cytosine bonds only to guanine. The endpoints of the strands are called 5' and 3', where the first is defined by a terminal phosphate group and the second by a terminal hydroxyl group [Coh04].
DNA nucleotide sequences are usually written using only the bases from one strand, as the bases on the other strand are complementary. For example, the sequence TATAAA is complementary to ATATTT. The order in which the characters are written depends on the source of the data: sometimes the data is written in the direction from 3' to 5', while other times it is vice versa.
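The complementarity rule can be illustrated with a few lines of Python. This is only a sketch: the helper names `complement` and `reverse_complement` are hypothetical and not part of the thesis.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the complement written in the opposite reading direction."""
    return complement(strand)[::-1]


print(complement("TATAAA"))          # ATATTT
print(reverse_complement("TATAAA"))  # TTTATA
```

The second helper reflects the remark about reading direction: the same complementary strand looks different depending on whether it is written from 3' to 5' or from 5' to 3'.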
Figure 1.2: TA (adenine-thymine) and GC (cytosine-guanine) complementary base pairs. Dotted lines represent hydrogen bonds between bases.
1.2 Gene expression
The expression of a gene means the rate and amount of RNA transcribed from it, which in turn is used to define proteins necessary for the cell and the organism. The transcription process requires transcription factors, which are special proteins able to recognize and attach to particular fragments in gene promoter areas. The transcription factors are required to recruit RNA polymerase, which is responsible for carrying out the transcription process.
In more complex eukaryotic cells, the promoters are rather diverse and complicated, but the core element is a transcription start site, which together with RNA polymerase and transcription factor binding sites is essential for initiating the transcription process. Other important binding sites are typically located a little farther away in the upstream direction and mainly regulate gene expression by enhancing or restricting recruitment of the main transcription factors. Additionally, there may be even more distant promoter areas that have a weaker influence on the gene regulation.
1.3 Data mining
Data mining is a method in statistics for extracting interesting patterns or knowledge from large amounts of available data. This field is very diverse, as among general data mining solutions there are many specific procedures developed for business, games, social networks et cetera [DP07]. In this work, we concentrate on a specialized area of data mining called sequence mining that deals with ordered sequences like nucleotide sequences.
The Apriori algorithm is the most general and simple way to find patterns with high support in given data. In standard sequence mining, the support is defined as the number of occurrences of a pattern in input data, which is used to decide whether the pattern is frequent or infrequent based on some defined threshold. The Apriori algorithm assumes that the support is downward closed, which means that for any infrequent pattern there do not exist any frequent sup-patterns. For example, a DNA motif AAATCCC cannot be present in data more times than sequences AAA and CCC, because whenever the supmotif occurs, the two submotifs must also occur. Let us clarify that in this work by a submotif or a subpattern we mean a subsequence with consecutive elements.
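Algorithm 1.3.1 Apriori
1: $F_1 \leftarrow \{\text{frequent one-element patterns}\}$
2: $\ell \leftarrow 2$
3: while $F_{\ell-1} \neq \emptyset$ do
4:     $C_\ell \leftarrow \text{GenerateCandidates}(F_{\ell-1})$
5:     $F_\ell \leftarrow \{c \in C_\ell \mid \mathrm{supp}(c) \geq \sigma\}$    ⊲ $\sigma$ is the threshold
6:     $\ell \leftarrow \ell + 1$
7: end while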
The Apriori algorithm uses downward closedness as its main pruning feature. In Algorithm 1.3.1 on line 4, the GenerateCandidates procedure takes the set of frequent motifs of length $\ell - 1$ as input and generates possible candidates of length $\ell$. It does not need to consider any non-frequent motifs, as none of their supmotifs are frequent. The algorithm stops running when it has found all frequent motifs in the dataset.
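Let us demonstrate Apriori by giving an example. Consider the following sequence:
GCTTATGGTCGCTATGCTTT.
Suppose we want to mine all motifs occurring at least three times in the sequence. This means that we run Apriori with threshold $\sigma = 3$. The set $F_1 = \{\mathtt{T}, \mathtt{G}, \mathtt{C}\}$, because all nucleotides except A are present in the sequence at least three times. Next, we generate candidate motifs of length two by using only frequent elements in $F_1$:
\[ C_2 = \{\mathtt{TT}, \mathtt{TG}, \mathtt{TC}, \mathtt{GT}, \mathtt{GG}, \mathtt{GC}, \mathtt{CT}, \mathtt{CG}, \mathtt{CC}\} . \]
Frequent motifs in this case are
\[ F_2 = \{\mathtt{TT}, \mathtt{GC}, \mathtt{CT}\} . \]
Note that TT matches TTT two times. The next candidate set is
\[ C_3 = \{\mathtt{TTT}, \mathtt{GTT}, \mathtt{TTG}, \mathtt{CTT}, \mathtt{TTC}, \mathtt{TGC}, \mathtt{GCT}, \mathtt{GGC}, \mathtt{GCG}, \mathtt{CGC}, \mathtt{GCC}, \mathtt{TCT}, \mathtt{CTG}, \mathtt{CCT}, \mathtt{CTC}\} . \]
This time there is only one frequent motif:
\[ F_3 = \{\mathtt{GCT}\} . \]
Candidate motifs of length 4:
\[ C_4 = \{\mathtt{TGCT}, \mathtt{GCTT}, \mathtt{GGCT}, \mathtt{GCTG}, \mathtt{CGCT}, \mathtt{GCTC}\} . \]
But none of them is frequent, so $F_4 = \emptyset$ and all frequent motifs in our example are
\[ F = \{\mathtt{T}, \mathtt{G}, \mathtt{C}, \mathtt{TT}, \mathtt{GC}, \mathtt{CT}, \mathtt{GCT}\} . \]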
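The worked example can be reproduced with a minimal Python sketch of Apriori for motifs with consecutive elements. The function names are hypothetical, and the candidate generation shown here (extending every frequent motif by one frequent letter on either side) is an assumption that matches the candidate sets listed in the example. It is not necessarily the exact GenerateCandidates procedure of the thesis.

```python
def count_occurrences(seq: str, motif: str) -> int:
    """Count (possibly overlapping) occurrences of motif in seq."""
    n = len(motif)
    return sum(1 for i in range(len(seq) - n + 1) if seq[i:i + n] == motif)


def generate_candidates(frequent: set, letters: list) -> set:
    """Extend each frequent motif by one frequent letter as prefix or suffix."""
    candidates = set()
    for motif in frequent:
        for x in letters:
            candidates.add(x + motif)
            candidates.add(motif + x)
    return candidates


def apriori(seq: str, sigma: int) -> list:
    """Return all motifs occurring at least sigma times, shortest first."""
    level = {x for x in set(seq) if count_occurrences(seq, x) >= sigma}
    letters = sorted(level)
    result = []
    while level:  # corresponds to the loop condition on line 3 of Algorithm 1.3.1
        result.extend(sorted(level))
        candidates = generate_candidates(level, letters)
        level = {c for c in candidates if count_occurrences(seq, c) >= sigma}
    return result


print(apriori("GCTTATGGTCGCTATGCTTT", 3))
# ['C', 'G', 'T', 'CT', 'GC', 'TT', 'GCT']
```

Counting is done by rescanning the sequence for every candidate, which keeps the sketch short at the cost of efficiency.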
There are also algorithms like WINEPI [MTV95], MINEPI [MT96] and SPEXS [Vil02] that are able to mine motifs using pattern matching. Still, while Apriori and other standard sequence mining algorithms are useful, they treat all parts of the sequence with equal weight. In our case, we need methods that are able to work with data that decorates sequences with scores, making some parts of them more relevant than the rest. In Chapter 2, we reformulate standard sequence mining techniques and later devise our own algorithms that handle such requirements.
Sequence Mining with Multiple Layers of Data
In this chapter, we formalize basic notions and concepts like sequences, motifs and support that are needed to develop our methods. We try to develop our mathematical approach such that it would be convenient to study gene regulation when we consider several promoter areas and different properties of these sequences described by layers of experimental data.
We also study different properties and relations between these building blocks that are later used in algorithms to cut down the running times and improve overall performance, although we do not cover algorithmic details and other aspects like data structures, as they are discussed in later chapters.
2.1 Sequences and Scores
The most basic constructs we will be dealing with from here on are DNA sequences and their fragments. In our case, it will be convenient to think of them as a set of nucleotide sequences. Let $S = \{a, b, c, \ldots\}$ denote a set of promoter sequences relevant to some gene. Single elements of a sequence are denoted with subscripts as usual. For example, $a_1$ means the first element and $a_2$ the second element of $a \in S$. There are four types of nucleotides in DNA, namely adenine, thymine, cytosine and guanine, which correspond to the letters A, T, C, G. We write $a_1 = \mathtt{A}$ if the first element in the nucleotide sequence is adenine and $a_2 = \mathtt{T}$ if the second element is thymine. Let us denote the length of sequence $a$ as $|a|$. It is worth noting that no promoter has length zero, nor are there promoters with infinite length in the real world. In practice, depending on the particular case, the sequences are usually neither very short nor very long.
In mathematics, a fragment of a sequence is usually written as a list of elements. In this paper, we will be using a shorter notation:
\[ a_{i:j} \overset{\text{def}}{=} a_i, a_{i+1}, \ldots, a_j , \]
where $i$ is the beginning and $j$ is the end of the fragment.
We stated in the introduction of this thesis that we are going to deal with multiple layers of data about promoter sequences. For example, if we have data containing binding and conservation scores from DNA micro-array and sequencing experiments that are associated with the promoters we are interested in, we can portray them as data tracks over the nucleotide sequence, as illustrated in Figure 2.1.
Figure 2.1: An example subsequence (... A T G C C C A T T G C T A G G C ...) having conservation and binding data tracks attached (horizontal axis: position, vertical axis: score value). The scores are variable and may not directly depend on each other.
From a theoretical point of view, it is not important exactly what kind of data we have, as long as we can represent it as numeric values linked to positions in promoter sequences. However, it is important that these values express some property that makes some regions of the nucleotide sequence more relevant than other regions, thus defining important regions with respect to each data track. If we have $n$ data sets containing various scores and $m$ promoters, then we need $n \times m$ mappings that associate the relevant scores from the data sets with the positions of the promoters. We assume that all data is normalized such that all scores fall into the range $[0, 1]$, as shown in Figure 2.1. This simplifies writing some formulas, because we know the maximum possible value of any type of score linked to any position of a nucleotide sequence. Let $\varphi : \mathbb{N} \to \mathbb{R}$ be a mapping that associates numeric scores to all positions of a nucleotide sequence. To make this notation more useful, let us agree that by writing $\varphi(a_i)$ we mean the score that $\varphi$ maps to position $i$ of sequence $a$, and by writing $\varphi(a_{i:j})$ we mean the sequence of scores
\[ \varphi(a_{i:j}) \overset{\text{def}}{=} \varphi(a_i), \varphi(a_{i+1}), \ldots, \varphi(a_j) . \]
By writing $\overline{\varphi}(a_{i:j})$ we mean the average score
\[ \overline{\varphi}(a_{i:j}) \overset{\text{def}}{=} \frac{1}{j - i + 1} \cdot \sum_{k=i}^{j} \varphi(a_k) . \]
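As a small illustration of the average score of a fragment, a data track can be stored as a plain list of normalized scores. The sketch below assumes 1-based positions as in the text. The name `avg_score` and the sample track are hypothetical.

```python
def avg_score(track: list, i: int, j: int) -> float:
    """Average of the scores at 1-based positions i..j (both inclusive)."""
    if not 1 <= i <= j <= len(track):
        raise ValueError("fragment must lie inside the sequence")
    window = track[i - 1:j]
    return sum(window) / (j - i + 1)


# Illustrative track with scores already normalized into [0, 1].
conservation = [1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5]
print(avg_score(conservation, 3, 6))  # 0.875
```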
2.2 Motifs and Matching
In this section, we introduce motifs, which can be thought of as possible subsequences in a sequence set. Motifs are not directly associated with any data track, but they have several other metrics like the support, frequency and significance of a motif in a particular set of promoter sequences. In addition to the nucleotide letters A, T, G, C, motifs may also contain special wild card characters that have special meaning and usage. In this work, we will be using only one such symbol, *, that represents any possible nucleotide in one position. Note that this is different from the standard usage of this symbol in batch-processing or regular-expression applications, where it usually stands for zero or more symbols. In our case, if we have a motif G**A, then by that we mean any motif of length four that starts with the letter G and ends with the letter A.
We will be dealing a lot with fixed-length motifs in later sections, so it is necessary to introduce notation that we can use to refer to all motifs with a fixed length $\ell$. Let $M_\ell$ represent the set of all motifs with length $\ell$, where $\ell \in \mathbb{N}$. We agreed before that all motifs consist of five different letters: the nucleotides and the wild card character. This means that the cardinality of the set $M_\ell$ is equal to $|M_\ell| = 5^\ell$, as there are five different possible elements per position in a motif.
Often it is necessary to refer to single elements of a motif the same way we do for sequences, so given any motif $m \in M_\ell$, let $m_1$ denote the first element of the motif, $m_2$ the second element of the motif et cetera. In addition to that, it is convenient to describe motifs as a concatenation of only a prefix and a suffix part. Let $||$ be a concatenation operator. If $m_p \in M_p$ and $m_s \in M_s$, then motif $m = m_p \,||\, m_s$, where $m \in M_\ell$ and $\ell = p + s$. Let us illustrate this with an example. If $m_p = \mathtt{AAAT}$ and $m_s = \mathtt{GCCGT}$, then the concatenation $m_p \,||\, m_s$ is AAATGCCGT.
Another very useful notion is a wild card extension of some motif. Namely, if we have some fixed motif length $\ell$ and a motif $m \in M_k$ such that $k \leq \ell$, we may pad the motif with wild card characters until it is $\ell$ elements long. This makes it easy to express motifs we know to have a certain prefix. Let $m^* \in M_\ell$ denote the wild card character extension of motif $m \in M_k$, where $k \leq \ell$, such that the prefix $m^*_{1:k} = m$ and the suffix $m^*_{k+1:\ell} = \mathtt{*}\ldots\mathtt{*}$. For instance, if $m = \mathtt{AATA}$ and we have fixed motif length $\ell = 10$, then the wild card character extension $m^* = \mathtt{AATA******}$. This notion comes in handy when we describe the SafeApproxSearch algorithm in Chapter 3. Let us agree that any motif gained from another motif by replacing one or more nucleotides with wild cards is considered a submotif of the original motif.
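The wild card extension amounts to right-padding a motif with the wild card symbol. A minimal sketch, assuming motifs are stored as Python strings (the helper name `wildcard_extension` is hypothetical):

```python
WILDCARD = "*"


def wildcard_extension(motif: str, length: int) -> str:
    """Pad motif with wild cards on the right until it has the given length."""
    if len(motif) > length:
        raise ValueError("motif is longer than the target length")
    return motif + WILDCARD * (length - len(motif))


print(wildcard_extension("AATA", 10))  # AATA******
```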
In standard sequence mining, the support of some motif is usually measured by how many matches it has in data [DP07]. The number of matches of a motif containing no wild card characters is simply the number of times the motif can be viewed as a subsequence of the given data sequence. With wild card characters this works differently, as a wild card character matches any nucleotide. See Figure 2.2 for an illustration.
Figure 2.2: Three matches of motif ATA*A in a subsequence.
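Definition 2.2.1 A motif $m \in M_\ell$ matches some fragment $a_{i:i+\ell-1}$ of sequence $a$ if $m_k = \mathtt{*} \lor m_k = a_{i+k-1}$ for all $k = 1, \ldots, \ell$.
\[ \mathrm{match}(a, m, i) = \begin{cases} 1, & \text{if } m \text{ matches } a_{i:i+\ell-1} \\ 0, & \text{otherwise.} \end{cases} \]
We can extend the number of matches in the sequence $a$ over a set of sequences $S$ by simply adding all the individual counts together:
\[ \mathrm{mcount}(a, m) \overset{\text{def}}{=} \sum_{i=1}^{|a|-\ell+1} \mathrm{match}(a, m, i) , \qquad \mathrm{mcount}(S, m) \overset{\text{def}}{=} \sum_{s \in S} \mathrm{mcount}(s, m) . \]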
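Definition 2.2.1 and the two match counts translate almost directly into code. The following Python sketch uses 1-based start positions to stay close to the notation. The name `mcount_set` for the count over a set of sequences is a hypothetical choice.

```python
WILDCARD = "*"


def match(a: str, m: str, i: int) -> int:
    """Return 1 if motif m matches sequence a at 1-based start position i, else 0."""
    fragment = a[i - 1:i - 1 + len(m)]
    if len(fragment) < len(m):
        return 0  # the motif would run past the end of the sequence
    return int(all(mk == WILDCARD or mk == ak for mk, ak in zip(m, fragment)))


def mcount(a: str, m: str) -> int:
    """Number of matches of m in a single sequence a."""
    return sum(match(a, m, i) for i in range(1, len(a) - len(m) + 2))


def mcount_set(S: list, m: str) -> int:
    """Number of matches of m summed over all sequences in S."""
    return sum(mcount(a, m) for a in S)


print(mcount("ATCCGTCCG", "TCCG"))                # 2
print(mcount("ATCCGTCCG", "T*C"))                 # 2
print(mcount_set(["ATCCGTCCG", "TTCCG"], "TCCG"))  # 3
```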
2.3 Support Metrics
Standard sequence mining treats all parts of the input sequence as equally important [DP07]. In our case, we may have more than one data track containing variable scores. Therefore, we need to define support in a different way. We base our approach on a formulation given by Sven Laur [Lau09].
The first step is to extend the notion of support of one single match. The standard way was to sum up all matches of a motif in a sequence, such that each match had equal importance. But as we have actual scores linked to positions, we extend the original method by taking the average score of the matching positions of a single match.
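Definition 2.3.1 The support of an individual motif $m \in M_\ell$ with respect to some fragment in sequence $a$ starting from position $i$ is
\[ \mathrm{supp}(a, m, i) = \begin{cases} \overline{\varphi}(a_{i:i+\ell-1}) & \text{if } \mathrm{match}(a, m, i) = 1 \\ 0 & \text{otherwise,} \end{cases} \]
where $\overline{\varphi}(a_{i:i+\ell-1})$ is the average score of the fragment.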
To extend the support of a motif over a sequence, we have several options. The first idea is to add up all the single supports of the motif. This is the simplest way to go and we refer to this method as additive support onwards. Let us consider another option: instead of adding up the scores, we can take only the maximal score. The advantage of this method is that it promotes motifs that actually have high scores, whereas additive support can be high even if all the scores of the single matches are low. So, we also consider this method and we will be referring to it as maximal support.
There are also other ways to define the support of a motif in a sequence. We might consider average support that works like additive support, but we divide the result by the number of matches of that motif in the sequence. We could also define supports like weighted additive or weighted average support, which consider some regions of the promoter to be more significant than others. The last two are actually not very reasonable, because we express the significance of promoter areas through data tracks anyway.
The average support is actually more relevant, but as it seems to have a mix of the properties of additive and maximal support, we do not cover this type of support in this work and concentrate on studying only the two support types mentioned above.
Definition 2.3.2 Additive support of a motif $m \in M_\ell$ in sequence $a$ is
\[ \mathrm{asupp}(a, m) = \sum_{i=1}^{|a|-\ell+1} \mathrm{supp}(a, m, i) . \]
Definition 2.3.3 Maximal support of a motif $m \in M_\ell$ in sequence $a$ is
\[ \mathrm{msupp}(a, m) = \max \{ \mathrm{supp}(a, m, i) \mid i = 1, \ldots, |a| - \ell + 1 \} . \]
By writing $\mathrm{supp}(a, m)$, we do not refer directly to either of the support types in cases where we are discussing properties that apply to both of them.
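The single-match, additive and maximal supports of one sequence can be sketched as follows. This is an illustration under assumptions: a data track is a list of scores aligned with the sequence, positions are 1-based, and the function signatures (which take the track as an explicit argument) are hypothetical.

```python
def supp(a: str, track: list, m: str, i: int) -> float:
    """Support of a single match at 1-based position i (Definition 2.3.1)."""
    start, end = i - 1, i - 1 + len(m)
    fragment = a[start:end]
    matches = len(fragment) == len(m) and all(
        mk == "*" or mk == ak for mk, ak in zip(m, fragment)
    )
    return sum(track[start:end]) / len(m) if matches else 0.0


def asupp(a: str, track: list, m: str) -> float:
    """Additive support: sum of the supports of all single matches."""
    return sum(supp(a, track, m, i) for i in range(1, len(a) - len(m) + 2))


def msupp(a: str, track: list, m: str) -> float:
    """Maximal support: the best single-match support (0 if there is no match)."""
    return max(
        (supp(a, track, m, i) for i in range(1, len(a) - len(m) + 2)),
        default=0.0,
    )


a = "ATCCGTCCG"
phi = [1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5]  # illustrative track
print(asupp(a, phi, "TCCG"))  # 1.5  (matches at positions 2 and 6: 1.0 + 0.5)
print(msupp(a, phi, "TCCG"))  # 1.0
```

The example shows the difference discussed above: the additive support accumulates both occurrences, while the maximal support keeps only the best one.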
Therefore, Definitions 2.3.2 and 2.3.3 are only two possible ways of expressing the support of a motif in one sequence. The biological importance of the two depends mostly on the actual data used. For example, if we consider conservation, then additive support can reveal motifs that coexist in several genetically close species and have great structural importance, whereas maximal support takes into account only one occurrence of a motif in a promoter, thus ignoring larger-scale structural effects. On the other hand, maximal support can bring up motifs that are most probably recognized by transcription factors, as these proteins require proper locations to enable mounting RNA polymerase and initiate transcription. Thus, the decision about which support type should be used with a particular data track depends on the biological properties of the data.
To extend the notion of support over a set of sequences, we have several options. The first approach is to treat all promoter sequences equally and take the average of the supports of a motif over all sequences.
Definition 2.3.4 Additive and maximal support of motif $m \in M_\ell$ in a list of sequences $S$ are
\[ \mathrm{asupp}(S, m) = \frac{1}{|S|} \cdot \sum_{a \in S} \mathrm{asupp}(a, m) \tag{2.1} \]
\[ \mathrm{msupp}(S, m) = \frac{1}{|S|} \cdot \sum_{a \in S} \mathrm{msupp}(a, m) . \tag{2.2} \]
The second approach is to consider promoters further away from the gene they regulate as having less impact than the ones closer to it. Therefore, we need to give promoter sequences meaningful weights when calculating support. We could propagate these weights directly into the datasets, enabling the direct use of Equations (2.1) and (2.2). Let us also agree that by writing $\mathrm{supp}(S, m)$, we do not refer directly to additive or maximal support if we are discussing properties that apply to both of them.
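Equations (2.1) and (2.2) can be sketched in Python as averages over the per-sequence supports. The optional `weights` argument is an assumption of this sketch: it scales the contribution of each promoter, which has the same effect as propagating a per-promoter weight into that promoter's data track. All function names are hypothetical.

```python
def single_supports(a: str, track: list, m: str) -> list:
    """Average scores of all fragments of a that are matched by motif m."""
    n = len(m)
    scores = []
    for i in range(len(a) - n + 1):
        if all(mk == "*" or mk == ak for mk, ak in zip(m, a[i:i + n])):
            scores.append(sum(track[i:i + n]) / n)
    return scores


def asupp_set(S, tracks, m, weights=None):
    """Additive support over a list of sequences, Equation (2.1)."""
    weights = weights or [1.0] * len(S)
    total = sum(w * sum(single_supports(a, t, m))
                for a, t, w in zip(S, tracks, weights))
    return total / len(S)


def msupp_set(S, tracks, m, weights=None):
    """Maximal support over a list of sequences, Equation (2.2)."""
    weights = weights or [1.0] * len(S)
    total = sum(w * max(single_supports(a, t, m), default=0.0)
                for a, t, w in zip(S, tracks, weights))
    return total / len(S)


S = ["ATCCGTCCG", "TTCCG"]
phi = [[1.0] * 5 + [0.5] * 4, [1.0] * 5]  # one illustrative track per sequence
print(asupp_set(S, phi, "TCCG"))  # 1.25
print(msupp_set(S, phi, "TCCG"))  # 1.0
```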
In Chapter 3, we discuss algorithms and data structures and usually need the support with respect to all data tracks. Also, let us agree that we have fixed the support type for every data track to make the semantics easier. In cases where we need to use both support types, we can view the original track as two duplicates with different support types.
Definition 2.3.5 Given mappings $\varphi_1, \ldots, \varphi_n$, the support of motif $m$ in sequences $S$ with respect to all $n$ data tracks is
\[ \overrightarrow{\mathrm{supp}}(S, m) = (s_1, \ldots, s_n) \]
where
\[ s_i = \begin{cases} \mathrm{asupp}(S, m, \varphi_i) & \text{for additive type of support} \\ \mathrm{msupp}(S, m, \varphi_i) & \text{for maximal type of support.} \end{cases} \]
Mappings $\varphi_1, \ldots, \varphi_n$ given as extra arguments to the support operators will be used as the mapping $\varphi$ in Definition 2.3.1.

2.4 Properties of Support Metrics
In the previous section, we defined basic building blocks like sequences, motifs and mappings that gave each position in a promoter sequence one or more weights with regard to the available data sets. In this section, we study various properties of the newly defined support measures and notions.
When we compare additive and maximal support, it is rather easy to see that additive support is always at least as big as maximal support, because additive support considers all occurrences of a motif in a sequence, whereas maximal support only considers the occurrence with maximal support.
Proposition 2.4.1 For any motif $m \in M_\ell$ and a set of sequences $S$
\[ \mathrm{msupp}(S, m) \leq \mathrm{asupp}(S, m) . \]
We have not yet mentioned that there is a problem with the way we defined the support of a motif. Namely, the definition breaks the standard sequence mining principle of being downward closed, as a non-frequent motif may have frequent supmotifs.
Claim 2.4.2 Let $\sigma \in \mathbb{R}$ be the threshold. For any motif $m \in M_\ell$ such that $\mathrm{supp}(S, m) < \sigma$, there may exist a supmotif $m' \in M_{\ell+k}$ such that $\mathrm{supp}(S, m') \geq \sigma$, whether we consider additive or maximal support.
Proof. For simplicity, let us assume that there is only one single match of motif $m'$ in sequence $a$, starting at position $i$, so that its prefix $m$ occupies positions $i$ to $i + \ell - 1$. Since the scores of all positions are in the range $[0, 1]$, the support $\mathrm{supp}(a, m, i) \in [0, 1]$. Now, let us consider a situation where the support of the prefix $\mathrm{supp}(a, m, i) = \overline{\varphi}(a_{i:i+\ell-1}) < 1$ and the support of the suffix $\overline{\varphi}(a_{i+\ell:i+\ell+k-1}) = 1$. From here we can conclude that
\[ \mathrm{supp}(S, m') = \frac{1}{|S|} \cdot \frac{\varphi(a_i) + \ldots + \varphi(a_{i+\ell-1}) + k}{\ell + k} > \mathrm{supp}(S, m) . \tag{2.3} \]
If we take $\sigma = \mathrm{supp}(S, m')$, then $m$ is infrequent and $m'$ is frequent. □
The above proof raises another question: if we know the support of a motif, then what is the maximal possible support of any supmotif? We can approach the answer the same way we proved the above claim. Namely, if we consider the support of the suffix of a possible supmotif to have the maximal possible value, then we can calculate the maximal possible support of the supmotif.
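Proposition 2.4.3 For any motif $m' \in M_{\ell+k}$ and its submotif $m \in M_\ell$ in a set of sequences $S$
\[ \mathrm{msupp}(S, m') \leq \frac{\ell \cdot \mathrm{msupp}(S, m) + |S| \cdot k}{\ell + k} \]
\[ \mathrm{asupp}(S, m') \leq \frac{\ell \cdot \mathrm{asupp}(S, m) + \mathrm{mcount}(S, m) \cdot k}{\ell + k} . \]
Proof. Let us consider a set of fragments $\{a_{p:q}, b_{r:s}, \ldots, z_{t:u}\}$ that represent the positions on every promoter where motif $m$ has the highest support. In that case
\[ \mathrm{msupp}(S, m) = |S|^{-1} \left( \overline{\varphi}(a_{p:q}) + \overline{\varphi}(b_{r:s}) + \ldots + \overline{\varphi}(z_{t:u}) \right) . \]
If we now consider $m$ as a prefix of $m'$, then analogously to Equation (2.3) every fragment of length $\ell$ is extended by $k$ positions whose scores are at most 1, so the support of $m'$ cannot be larger than
\[ \mathrm{msupp}(S, m') \leq \frac{1}{|S|} \left( \frac{\ell \cdot \overline{\varphi}(a_{p:q}) + k}{\ell + k} + \frac{\ell \cdot \overline{\varphi}(b_{r:s}) + k}{\ell + k} + \ldots + \frac{\ell \cdot \overline{\varphi}(z_{t:u}) + k}{\ell + k} \right) = \frac{1}{|S|} \cdot \frac{\ell \left( \overline{\varphi}(a_{p:q}) + \overline{\varphi}(b_{r:s}) + \ldots + \overline{\varphi}(z_{t:u}) \right) + |S| \cdot k}{\ell + k} , \]
which combined with Equation (2.2) equals $(\ell \cdot \mathrm{msupp}(S, m) + k) / (\ell + k)$ and, since $|S| \geq 1$, gives in particular
\[ \mathrm{msupp}(S, m') \leq \frac{\ell \cdot \mathrm{msupp}(S, m) + |S| \cdot k}{\ell + k} . \tag{2.4} \]
Additive support takes into account all occurrences of $m$, thus replacing the relevant parts in Equation (2.4), we get
\[ \mathrm{asupp}(S, m') \leq \frac{\ell \cdot \mathrm{asupp}(S, m) + \mathrm{mcount}(S, m) \cdot k}{\ell + k} . \]
□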
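As a rough numeric illustration of Proposition 2.4.3, the sketch below evaluates the two bounds in Python. The function names are hypothetical. The sample values are worked out by hand from the two-sequence example of Section 3.2 (data track $\varphi_2$, $m = \mathtt{TCC}$, $m' = \mathtt{TCCG}$), so they are assumptions of this illustration and not results reported in the thesis.

```python
def msupp_upper_bound(msupp_m: float, l: int, k: int, n_sequences: int) -> float:
    """Upper bound on msupp(S, m') for a supmotif m' that is k letters longer than m."""
    return (l * msupp_m + n_sequences * k) / (l + k)


def asupp_upper_bound(asupp_m: float, l: int, k: int, n_matches: int) -> float:
    """Upper bound on asupp(S, m'), where n_matches stands for mcount(S, m)."""
    return (l * asupp_m + n_matches * k) / (l + k)


# Hand-computed inputs for m = TCC on track phi_2 of the Section 3.2 example:
# msupp(S, TCC) = 0.75, asupp(S, TCC) = 1.0, mcount(S, TCC) = 3, |S| = 2.
print(msupp_upper_bound(0.75, l=3, k=1, n_sequences=2))  # 1.0625
print(asupp_upper_bound(1.0, l=3, k=1, n_matches=3))     # 1.5
```

For comparison, the supports of the actual supmotif TCCG on the same track are msupp = 0.75 and asupp = 1.0, so both bounds hold with some slack.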
Claim 2.4.2 implies that frequent motifs may have infrequent submotifs. However, it is important to note that any frequent motif must also have at least one frequent submotif with either additive or maximal type of support.
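Lemma 2.4.4 For all positions $i < j \leq k$ in sequence $a$
\[ \overline{\varphi}(a_{i:k}) \leq \max\{ \overline{\varphi}(a_{i:j-1}), \overline{\varphi}(a_{j:k}) \} , \]
where $\overline{\varphi}$ denotes the average score of a fragment.
Proof. The first possibility is that $\overline{\varphi}(a_{i:j-1}) > \overline{\varphi}(a_{j:k})$ or $\overline{\varphi}(a_{i:j-1}) < \overline{\varphi}(a_{j:k})$. In that case $\overline{\varphi}(a_{i:k}) < \max\{ \overline{\varphi}(a_{i:j-1}), \overline{\varphi}(a_{j:k}) \}$, as the average score of the supsequence is a weighted mean of the two parts and must therefore be lower than the average score of the subsequence with maximal average score. The second possibility is that $\overline{\varphi}(a_{i:j-1}) = \overline{\varphi}(a_{j:k})$, thus $\overline{\varphi}(a_{i:k}) = \overline{\varphi}(a_{i:j-1}) = \overline{\varphi}(a_{j:k})$. □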
Theorem 2.4.5 Any frequent motif $m \in M_{p+s}$ can be partitioned into two submotifs $m_p \in M_p$ and $m_s \in M_s$, such that $m = m_p \,||\, m_s$ and either the prefix $m_p$ or the suffix $m_s$ is frequent.
Proof. If we have only a single match of the motif $m$ in sequence $a$, then according to Lemma 2.4.4
\[ \mathrm{supp}(a, m, i) \leq \max\{ \mathrm{supp}(a, m_p, i), \mathrm{supp}(a, m_s, i + p) \} . \]
Now, suppose we have more matches of $m$ in one single sequence $a$. As maximal support only considers one occurrence of $m$, the theorem holds according to Lemma 2.4.4. By Definition 2.3.2, the additive support is the sum of the supports of all single matches. Let $\mathrm{asupp}(a, m) = \overline{\varphi}(a_{i_1:j_1}) + \ldots + \overline{\varphi}(a_{i_n:j_n})$, where $n = \mathrm{mcount}(a, m)$ and $i_n, j_n$ denote the start and end locations of the $n$-th occurrence. Let
\[ \varphi(m_i) \overset{\text{def}}{=} \left( \varphi(a_{i_1+i-1}) + \ldots + \varphi(a_{i_n+i-1}) \right) (p + s)^{-1} \]
and
\[ \varphi(m_{i:j}) \overset{\text{def}}{=} \left( \varphi(m_i) + \varphi(m_{i+1}) + \ldots + \varphi(m_j) \right) (p + s)^{-1} . \]
Similarly to Lemma 2.4.4, we could show that
\[ \varphi(m_{1:p+s}) \leq \max\{ \varphi(m_{1:p}), \varphi(m_{p+1:p+s}) \} \tag{2.5} \]
which means that $\mathrm{asupp}(a, m) \leq \max\{ \mathrm{asupp}(a, m_p), \mathrm{asupp}(a, m_s) \}$. Note that any extra occurrences of the prefix or suffix motifs in the input sequences do not invalidate Equation (2.5).
For either additive or maximal support over a set of sequences $S$, everything works similarly to the above steps, but we have to take into account the constant $|S|^{-1}$. □
It follows that a frequent motif can be partitioned into more than two pieces such that at least one of them is frequent. We know that partitioning works for two submotifs, thus we can iteratively continue and create as many partitions of the original motif as necessary, because always at least one partition has to be frequent.
Corollary 2.4.6 Given a frequent motif $m \in M_\ell$ and submotifs $m_1, m_2, \ldots, m_n$ that partition $m$ into $n$ pieces, at least one of the submotifs must be frequent.
So far we have described the properties of additive and maximal support without concentrating too much on the actual contents of the motifs. However, we used motif lengths in Proposition 2.4.3 to estimate the maximal possible support of a supmotif. While this is useful knowledge, most of the time these estimates are not very tight, because they are based solely on the motif support, length and possible supmotif length. We can improve this situation by introducing wild card characters.
Proposition 2.4.7 Given motifs $m \in M_\ell$ and $m' \in M_\ell$, where $m'$ is constructed from $m$ by replacing one nucleotide in $m$ with a wild card character *, the additive or maximal support satisfies
\[ \mathrm{supp}(S, m) \leq \mathrm{supp}(S, m') . \]
For example, consider a motif $m_p = \mathtt{GCT}$ as a prefix of a longer supmotif $m \in M_{10}$. If we want to know the maximal support of any such supmotif, we can calculate $\mathrm{supp}(S, \mathtt{GCT*******})$. It certainly does not give a higher estimate than the formerly described method. On the other hand, it requires a query on the database, which depending on the situation can be costly. We describe both approaches more thoroughly in Chapter 3.
2.5 Statistically Relevant Motifs
In previous sections, we discussed how to determine if a motif is frequent. In this section, we describe how to go even further by deciding which frequent motifs are statistically more significant. By this, we actually want to measure the amount of surprise for every frequent motif. In our case, we may measure surprise individually for every data track. One way to do this is to randomly permute the promoter sequences. Then we can compare the support measures of the permuted dataset against the original one. If motifs in the original data have higher supports, then they are surprising in that sense. To be more specific, we may generate a large number of datasets by randomly permuting the original sequences. If we consider only one data track, then we can sort the motifs in decreasing order of their support, such that the motif with the highest support is the first in the resulting list. Then, for every motif at position $i$ in the list, we can calculate how many motifs at the $i$-th position in the generated datasets had a support at least as high as the original motif. We can write it down as
\[ p = \Pr[\mathrm{supp}(S', m') \geq \mathrm{supp}(S, m)] \]
where $S$ is the original dataset, $S'$ is the permuted dataset, $m$ is the original motif at the $i$-th position and $m'$ is the motif in $S'$ at the same position. The value $p$ is called the p-value in statistics and, in our case, represents the probability of having a support in a random dataset at least as extreme as in the original one. Therefore, the smaller the $p$, the more surprising is the motif $m$.
We can calculate a p-value for every frequent motif and for every data track. Of course, we might want to calculate only a single p-value for every motif, but the problem is with sorting frequent motifs. This actually can be done, as discussed in Chapter 3, but having a p-value with respect to each data track may reveal interesting properties of the motifs. We will omit the exact algorithm for calculating p-values, but briefly discuss it later in Section 3.5. Let us refer to this algorithm as SigMotifs onwards.
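Since the exact SigMotifs algorithm is omitted here, the following Python sketch only illustrates the rank-based permutation idea described above and should not be read as the thesis implementation. It makes several assumptions: a single data track with additive support, motifs without wild cards, permutation by shuffling the letters of each promoter while the score track stays in place, and hypothetical function names.

```python
import random
from collections import defaultdict


def additive_supports(S, tracks, length):
    """Additive support (Definition 2.3.4) of every wild-card-free motif of one length."""
    totals = defaultdict(float)
    for a, track in zip(S, tracks):
        for i in range(len(a) - length + 1):
            totals[a[i:i + length]] += sum(track[i:i + length]) / length
    return {m: total / len(S) for m, total in totals.items()}


def rank_p_values(S, tracks, length, rounds=200, seed=0):
    """Estimate, per rank i, how often a permuted dataset reaches the original support."""
    rng = random.Random(seed)
    original = sorted(additive_supports(S, tracks, length).items(),
                      key=lambda item: -item[1])
    hits = [0] * len(original)
    for _ in range(rounds):
        permuted = []
        for a in S:
            letters = list(a)
            rng.shuffle(letters)  # assumed permutation scheme: shuffle nucleotides
            permuted.append("".join(letters))
        ranked = sorted(additive_supports(permuted, tracks, length).values(),
                        reverse=True)
        for i, (_, support) in enumerate(original):
            if i < len(ranked) and ranked[i] >= support:
                hits[i] += 1
    return [(m, s, h / rounds) for (m, s), h in zip(original, hits)]


S = ["ATCCGTCCG", "TTCCG"]
tracks = [[1.0] * 5 + [0.5] * 4, [1.0] * 5]
for motif, support, p in rank_p_values(S, tracks, length=4, rounds=100):
    print(motif, support, p)
```

The estimated p-values depend on the number of rounds and on the random seed, so the toy data above is only meant to show the mechanics.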
Algorithms and Data Structures
In this chapter, we will devise algorithms based on the formalization and other ideas described in Chapter 2. We start off by describing a compact encoding of motifs and continue by developing algorithms with different pruning methods and capabilities.
3.1 Compact Encoding of Motifs
It turns out that there is a rather straightforward way to encode fixed-length motifs as unique integers. If we consider the nucleotides and the wild card character as a set $X = \{\mathtt{A}, \mathtt{T}, \mathtt{G}, \mathtt{C}, \mathtt{*}\}$ and have another set of the same size $Y = \{0, 1, 2, 3, 4\}$, then we can define a mapping $\pi : X \to Y$ such that $\pi(\mathtt{A}) = 0$, $\pi(\mathtt{T}) = 1$, $\pi(\mathtt{G}) = 2$, $\pi(\mathtt{C}) = 3$, $\pi(\mathtt{*}) = 4$, which enables us to represent a motif $m \in M_\ell$ as an integer
\[ 5^0 \pi(m_1) + 5^1 \pi(m_2) + \ldots + 5^{\ell-1} \pi(m_\ell) . \tag{3.1} \]
For our convenience, let us agree that by writing $\pi(m)$, where $m \in M_\ell$, we mean the integral representation of the motif given in Equation (3.1).
This representation makes it easy to hash any motif of length $\ell$ and store it in a hash-table, as for every fixed-length motif the integral representation is unique.
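Equation (3.1) is a base-5 positional encoding with the first motif element as the least significant digit. A minimal Python sketch follows. The `decode` helper is an addition for illustration and both function names are hypothetical.

```python
PI = {"A": 0, "T": 1, "G": 2, "C": 3, "*": 4}
LETTERS = "ATGC*"  # LETTERS[PI[x]] == x


def encode(m: str) -> int:
    """Integral representation of a motif according to Equation (3.1)."""
    return sum(PI[ch] * 5 ** k for k, ch in enumerate(m))


def decode(code: int, length: int) -> str:
    """Inverse of encode for a motif of known length."""
    letters = []
    for _ in range(length):
        code, digit = divmod(code, 5)
        letters.append(LETTERS[digit])
    return "".join(letters)


print(encode("GCT"))    # 42  = 1*2 + 5*3 + 25*1
print(encode("G**A"))   # 122 = 1*2 + 5*4 + 25*4 + 125*0
print(decode(42, 3))    # GCT
```

The length has to be supplied to `decode` because trailing A letters contribute zero, which is why the representation is unique only among motifs of one fixed length.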
If the motif length $\ell$ is small enough, then we could use a hash-map of size $5^\ell$. This way we could directly use the value $\pi(m)$ as a key to store the motif's support metrics, and this guarantees constant time $O(1)$ access as there would be no collisions.
In practice, however, the data contains only a small fraction of all possible motifs. For example, there are $5^8 = 390625$ possible motifs of length 8 including wild card characters. For the yeast S. cerevisiae, the promoter lengths are usually not longer than a few thousand base pairs. Therefore, if we have one promoter with a length of 3000 base pairs, we can have at most $3000 - 8 + 1 = 2993$ different motifs of length eight without wild card characters.

3.2 Hash-map of Support Metrics
The integral representation of motifs allows us to effectively build a hash-map containing the support metrics of all motifs found in promoter sequences. Consider a sequence $a = \mathtt{ATCCGTCCG}$. If we are interested in motifs of length 4, then motif $m_1 = \mathtt{ATCC}$ matches the first position of $a$ and motif $m_2 = \mathtt{TCCG}$ matches the second position of $a$. The integral representations are the following:
\[ \pi(m_1) = 1 \cdot 0 + 5 \cdot 1 + 25 \cdot 3 + 125 \cdot 3 = 455 \]
\[ \pi(m_2) = 1 \cdot 1 + 5 \cdot 3 + 25 \cdot 3 + 125 \cdot 2 = 341 . \]
It turns out that we can update the integral representation of $m_1$ to $m_2$ in constant time. By Equation (3.1), the integral representation of motif $m_1$ is $\pi(m_1) = 5^0 \cdot \pi(a_1) + 5^1 \cdot \pi(a_2) + 5^2 \cdot \pi(a_3) + 5^3 \cdot \pi(a_4)$. By subtracting the first element $5^0 \cdot \pi(a_1)$, dividing the result by five and adding $5^3 \cdot \pi(a_5)$, we get
\[ \frac{\pi(m_1) - 5^0 \cdot \pi(a_1)}{5} + 5^3 \cdot \pi(a_5) = 5^0 \cdot \pi(a_2) + 5^1 \cdot \pi(a_3) + 5^2 \cdot \pi(a_4) + 5^3 \cdot \pi(a_5) \]
which is equal to $\pi(m_2)$. So in our example, where $\pi(m_1) = 455$, we can calculate
\[ \pi(m_2) = \frac{\pi(m_1) - 5^0 \cdot \pi(a_1)}{5} + 5^3 \cdot \pi(a_5) = \frac{455 - 0}{5} + 125 \cdot 2 = 341 . \]
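The constant-time update can be sketched as a rolling computation over all windows of a sequence. In this Python illustration the generator name `rolling_codes` is hypothetical, and only wild-card-free windows are considered because the input is a plain nucleotide sequence.

```python
PI = {"A": 0, "T": 1, "G": 2, "C": 3}


def encode(m: str) -> int:
    """Integral representation of a motif, Equation (3.1)."""
    return sum(PI[ch] * 5 ** k for k, ch in enumerate(m))


def rolling_codes(a: str, length: int):
    """Yield (1-based start, code) for every window of a, using O(1) updates."""
    if len(a) < length:
        return
    code = encode(a[:length])
    yield 1, code
    top = 5 ** (length - 1)  # weight of the last motif position
    for i in range(length, len(a)):
        # Drop the first letter of the old window, shift, append the new letter.
        code = (code - PI[a[i - length]]) // 5 + top * PI[a[i]]
        yield i - length + 2, code


for start, code in rolling_codes("ATCCGTCCG", 4):
    print(start, code)
# The first two windows give 455 (ATCC) and 341 (TCCG), as in the text.
```

The integer division is exact, because after removing the first digit every remaining term is a multiple of five.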
Analogously, we can do this with the support of single matches for all tracks. This is important because it lets us calculate all support metrics of all motifs present in the data in one pass. The negative side effect of this approach with scores is possibly greater floating-point rounding errors. But we can reduce them effectively by recalculating the scores from the data tracks after every 100 or 1000 steps. This of course is not an issue with the integral representation.
Let us give an in-depth example. Consider two sequences $a =$ ATCCGTCCG, $b =$ TTCCG and two mappings $\varphi_1, \varphi_2$ representing two data tracks such that
$$\varphi_1(a_{1:9}) = 1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5$$
$$\varphi_2(a_{1:9}) = 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0$$
$$\varphi_1(b_{1:5}) = 1.0, 1.0, 1.0, 1.0, 1.0$$
$$\varphi_2(b_{1:5}) = 0.5, 0.5, 0.5, 0.5, 0.5.$$
We can traverse the promoters step by step, such that after every cycle the hash-map contains up-to-date support metrics based on the occurrences of motifs seen so far. All unseen occurrences are regarded as having single supports equal to zero. In our example, the details of traversing $a$ and $b$ are given in the table below.

Step  m     $\varphi_1(m)$  $\varphi_2(m)$  Comment
1     ATCC  1.0             0.5             Add ATCC to the hash-map.
2     TCCG  1.0             0.5             Do the same with TCCG.
3     CCGT  0.875           0.625           Keep adding unseen motifs into the
4     CGTC  0.75            0.75            hash-map with their support
5     GTCC  0.625           0.875           metrics.
6     TCCG  0.5             1.0             Update the support metrics of TCCG.
7     TTCC  1.0             0.5             We are processing $b$ now.
8     TCCG  1.0             0.5             Update the support metrics of TCCG.

For example, consider the motif TCCG. For the additive support over all sequences we sum $1.0/2 + 0.5/2 + 1.0/2$ for $\varphi_1$ and $0.5/2 + 1.0/2 + 0.5/2$ for $\varphi_2$. We divide the scores by two due to Definition 2.3.4. After every update, the additive supports are up to date with respect to the data seen so far. For the maximal support we need to do more book-keeping, because when we find an occurrence with a bigger score in a sequence, we have to cancel the effect of the previous occurrence. For example, the maximal support after step two is $0.5/2$ for $\varphi_2$. At step 6 we discover that it should be $1.0/2$ instead; therefore we subtract $0.5/2$ from the variable containing the support and add $1.0/2$.

With this kind of hash-map construction we calculate all the metrics on the fly. Therefore, we avoid any post-processing, because calculating the support measures over all sequences would otherwise require intermediate lists containing the scores of single supports. For motifs without wild card characters this would not be a very big memory overhead, but otherwise it [...]
$$O\left(n \cdot c \cdot \sum_{s \in S} |s|\right)$$
where $n$ is the number of data tracks and $c$ is the complexity of updating the support of a motif in the hash-map.
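The one-pass construction can be sketched in Python as follows. This is only an illustration under several assumptions: the motif string itself is used as the dictionary key instead of the integral representation, the single support of a match is taken to be the mean of the track scores under the match (as in the table above), window means are recomputed directly rather than updated incrementally, and the function name `build_support_map` is hypothetical.

```python
from collections import defaultdict


def build_support_map(sequences, tracks, length):
    """One pass over all sequences, keeping supports up to date on the fly.

    sequences: list of strings.
    tracks[t][k]: per-position scores of track t for sequence k.
    Returns {motif: (additive, maximal)}, each a list with one value per track.
    """
    n_seq = len(sequences)
    n_tracks = len(tracks)
    additive = defaultdict(lambda: [0.0] * n_tracks)
    maximal = defaultdict(lambda: [0.0] * n_tracks)
    for k, seq in enumerate(sequences):
        # Best single support seen so far in this sequence, per motif and track.
        best_here = defaultdict(lambda: [0.0] * n_tracks)
        for i in range(len(seq) - length + 1):
            motif = seq[i:i + length]
            for t in range(n_tracks):
                single = sum(tracks[t][k][i:i + length]) / length
                additive[motif][t] += single / n_seq
                if single > best_here[motif][t]:
                    # Cancel the previous maximum of this sequence, add the new one.
                    maximal[motif][t] += (single - best_here[motif][t]) / n_seq
                    best_here[motif][t] = single
    return {m: (additive[m], maximal[m]) for m in additive}


# The two sequences and two tracks of the example above.
phi1 = [[1.0] * 5 + [0.5] * 4, [1.0] * 5]
phi2 = [[0.5] * 5 + [1.0] * 4, [0.5] * 5]
table = build_support_map(["ATCCGTCCG", "TTCCG"], [phi1, phi2], 4)
add_tccg, max_tccg = table["TCCG"]  # additive [1.25, 1.0], maximal [1.0, 0.75]
```

The additive values agree with the sums $1.0/2 + 0.5/2 + 1.0/2$ and $0.5/2 + 1.0/2 + 0.5/2$ given in the text; the maximal value for $\varphi_2$ passes through $0.5/2$ and $1.0/2$ while traversing $a$, as described.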
3.2.1 Including Motifs with Wild Card Characters
We will discuss SafeApproxSearch in Section 3.4.2, where hash-maps are required to also contain the supports of all wild card character extensions. This requires us to modify the method described earlier. The integral representation allows us to precompute the suffix parts of all extensions. Let $w_i$ be the suffix part of some motif $m$ of length $\ell$, such that $1 \leq i \leq \ell$ and $m_{i:\ell} = \texttt{*} \ldots \texttt{*}$. Then $\pi(w_i) = 5^{i-1} \pi(\texttt{*}) + \ldots + 5^{\ell-1} \pi(\texttt{*})$. If we now have the integral representation of a prefix $m_p$, then $\pi(m_p) + \pi(w_i)$ yields the integral representation of the wild card character extension. In the hash-map construction phase, it takes $\ell$ steps instead of one to include the support metrics of all wild card character extensions; therefore the complexity is
$$O\left(n \cdot c \cdot \ell \cdot \sum_{s \in S} |s|\right).$$
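A minimal Python sketch of the precomputed suffix parts is given below. The digit values A=0, T=1, G=2, C=3 follow the worked example of Section 3.2; the value 4 for the wild card character, the 0-based indexing and the function names are assumptions of this illustration.

```python
# Digit values from the worked example; "*" = 4 is an assumption.
CODE = {"A": 0, "T": 1, "G": 2, "C": 3, "*": 4}


def encode(motif):
    """Integral representation: sum over positions i of 5**i * code(motif[i])."""
    return sum(5 ** i * CODE[ch] for i, ch in enumerate(motif))


def suffix_codes(length):
    """suffix[p] = contribution of wild cards at 0-based positions p..length-1."""
    suffix = [0] * (length + 1)
    for p in range(length - 1, -1, -1):
        suffix[p] = suffix[p + 1] + 5 ** p * CODE["*"]
    return suffix


def extension_codes(motif, suffix=None):
    """Codes of the wild card extensions of every prefix of a full-length motif.

    Takes len(motif) steps: the prefix code grows by one digit per step and the
    precomputed suffix part is added to it.
    """
    length = len(motif)
    if suffix is None:
        suffix = suffix_codes(length)
    prefix = 0
    codes = []
    for p in range(length):
        prefix += 5 ** p * CODE[motif[p]]
        codes.append(prefix + suffix[p + 1])
    return codes


ext = extension_codes("ATCC")  # codes of A***, AT**, ATC*, ATCC
```

In a real construction the suffix table would be computed once per motif length and passed in, so that each window costs $\ell$ additions, in line with the complexity stated above.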
3.3 Naive Search Based on Apriori
The simplest search method is based on the Apriori principle described in Chapter 1. Namely, we can mine all motifs present in the input sequences by running Apriori with the threshold $\sigma = 1$ and then check whether they are frequent in our terms. This is actually a composition of Apriori and a filtering function. In our case it is better to implement this as a depth-first search algorithm, because the breadth-first nature of Apriori causes too much memory overhead when mining longer motifs. Algorithm 3.3.1 incorporates the composition of Apriori and the filtering function. On lines 10-12, we see the candidate generation part of the algorithm. Note that we always use the motifs A, T, G, C for extension. This is because cases where a nucleotide is missing from the promoter sequences are rare. The Apriori pruning principle is in action on lines 2-3 and the filtering function is given on lines 4-9.

Algorithm 3.3.1
 1: procedure NaiveSearch($S, \vec{\sigma}, m, \ell$)
 2:     if $\mathrm{mcount}(S, m) = 0$ then
 3:         return
 4:     else if $|m| = \ell$ then
 5:         if IsFrequent($\vec{\sigma}, \overrightarrow{\mathrm{supp}}(S, m)$) then
 6:             SaveMotif($m$)
 7:         end if
 8:         return
 9:     end if
10:     for $e \in \{$A, T, G, C$\}$ do
11:         NaiveSearch($S, \vec{\sigma}, m \,\|\, e, \ell$)
12:     end for
13: end procedure

Function IsFrequent checks whether $\sigma_i \leq s_i$ holds for all thresholds, where $\vec{s} = \overrightarrow{\mathrm{supp}}(S, m)$. Recall that the $\overrightarrow{\mathrm{supp}}$ operator returns a vector of values, where each element is the support on one data track according to the additive or maximal support type. Also, if $\overrightarrow{\mathrm{supp}}$ and $\mathrm{mcount}$ are implemented using data structures like the hash-map discussed in the previous section, then these need to be constructed before running this algorithm.
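The depth-first structure of Algorithm 3.3.1 can be sketched in Python as follows. The sketch is not the implementation used in this work: `mcount` and `supp` are passed in as callables, and the toy versions defined below (a single data track whose scores are all 1.0, additive support normalised by the number of sequences) are assumptions chosen only to make the example self-contained.

```python
def naive_search(mcount, supp, thresholds, length, motif="", found=None):
    """Depth-first enumeration with Apriori pruning and a frequency filter."""
    if found is None:
        found = []
    if mcount(motif) == 0:          # motif does not occur: prune the branch
        return found
    if len(motif) == length:        # full-length motif: apply the filter
        if all(s >= t for s, t in zip(supp(motif), thresholds)):
            found.append(motif)
        return found
    for e in "ATGC":                # candidate generation
        naive_search(mcount, supp, thresholds, length, motif + e, found)
    return found


# Toy helpers (assumptions, see the paragraph above).
SEQS = ["ATCCGTCCG", "TTCCG"]


def toy_mcount(motif):
    """Number of (possibly overlapping) occurrences over all sequences."""
    return sum(1 for s in SEQS
               for i in range(len(s) - len(motif) + 1)
               if s.startswith(motif, i))


def toy_supp(motif):
    """Single-track additive support with all scores equal to 1.0."""
    return [toy_mcount(motif) / len(SEQS)]


frequent = naive_search(toy_mcount, toy_supp, [1.0], 4)
```

With the threshold 1.0 only TCCG, which occurs three times, is reported; lowering the threshold to 0.5 returns all six motifs of length 4 present in the two sequences.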
As an example, by calling NaiveSearch($S, \vec{\sigma}, \theta, 8$), where $\theta$ is the empty zero-length motif, $S$ is the set of sequences and $\vec{\sigma}$ is the vector of thresholds, we find all frequent motifs of length 8. The complexity of NaiveSearch is $O(4^\ell)$, where $\ell$ is the fixed motif length.

3.4 Pruning Strategies
In this section, we describe different pruning strategies, which can be used to make more efficient algorithms compared to NaiveSearch. All these methods are based on the properties studied in Chapter 2.

3.4.1 Maximal Support Estimation Pruning
The simplest method is based on Proposition 2.4.3. Namely, if we are mining motifs of length $\ell + k$ and we have some motif $m \in M_\ell$, then the support measures of any of its super-motifs of length $\ell + k$ cannot be greater than those of a motif having $m$ as a prefix and a hypothetical suffix with score 1.0. Therefore, a motif $m$ and its super-motifs can be pruned if on any of the data tracks
$$\frac{\ell \cdot \mathrm{msupp}(S, m) + |S| \cdot k}{\ell + k} < \sigma$$
if we are mining using maximal support, or
$$\frac{\ell \cdot \mathrm{asupp}(S, m) + \mathrm{mcount}(S, m) \cdot k}{\ell + k} < \sigma$$
if we are mining using additive support. Of course, the maximal motif length $\ell + k$ must be fixed to make these formulas usable. As an example, let us analyze Figure 3.1.
Figure 3.1: Support of motifs AT and AT** in a sample subsequence.
We see that $\mathrm{msupp}(S, \text{AT}) = \max\{0.1; 0.5; 0.25; 0.5\} = 0.5$ and $\mathrm{asupp}(S, \text{AT}) = 0.1 + 0.5 + 0.25 + 0.5 = 1.35$. If we were mining using the maximal type of support on this track, then we could prune the motif with its super-motifs if
$$(2 \cdot \mathrm{msupp}(S, \text{AT}) + 2)/4 = (2 \cdot 0.5 + 2)/4 = 0.75 < \sigma$$
where $\sigma$ is the threshold. For the additive type of support this would be
$$(2 \cdot \mathrm{asupp}(S, \text{AT}) + 2 \cdot \mathrm{mcount}(S, \text{AT}))/4 = (2 \cdot 1.35 + 2 \cdot 4)/4 = 2.675 < \sigma.$$
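The two upper bounds can be written as small Python helpers. The function names and argument order are assumptions of this sketch; the formulas are the ones given above, and the inputs of the usage example are the four single supports 0.1, 0.5, 0.25, 0.5 of the motif AT, a single sequence and a target length of four.

```python
def max_support_bound(msupp, n_sequences, cur_len, target_len):
    """Upper bound on the maximal support of any super-motif of target length."""
    k = target_len - cur_len
    return (cur_len * msupp + n_sequences * k) / target_len


def additive_support_bound(asupp, mcount, cur_len, target_len):
    """Upper bound on the additive support of any super-motif of target length."""
    k = target_len - cur_len
    return (cur_len * asupp + mcount * k) / target_len


def can_prune(bounds, thresholds):
    """Prune if the bound falls below the threshold on any data track."""
    return any(b < t for b, t in zip(bounds, thresholds))


singles = [0.1, 0.5, 0.25, 0.5]      # single supports of AT in the example
max_bound = max_support_bound(max(singles), 1, 2, 4)
add_bound = additive_support_bound(sum(singles), len(singles), 2, 4)
```

For the maximal support the bound evaluates to 0.75, so AT and all its super-motifs of length four can be discarded whenever the threshold on this track exceeds 0.75.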
Incorporating this pruning method requires only small changes to NaiveSearch on line 2 of Algorithm 3.3.1. The result is given in Algorithm 3.4.1, where CanPrune uses the method described above to determine whether the motif and its super-motifs can be pruned.

Algorithm 3.4.1
 1: procedure MaxSupSearch($S, \vec{\sigma}, m, \ell$)
 2:     if $\mathrm{mcount}(S, m) = 0$ $\lor$ CanPrune($\vec{\sigma}, \overrightarrow{\mathrm{supp}}(S, m)$) then
 3:         return
 4:     else if $|m| = \ell$ then
 5:         if IsFrequent($\vec{\sigma}, \overrightarrow{\mathrm{supp}}(S, m)$) then
 6:             SaveMotif($m$)
 7:         end if
 8:         return
 9:     end if
10:     for $e \in \{$A, T, G, C$\}$ do
11:         MaxSupSearch($S, \vec{\sigma}, m \,\|\, e, \ell$)
12:     end for
13: end procedure
3.4.2 Safe Over-Approximation Search
Another improvement to NaiveSearch uses a slightly different approach. It is based on Proposition 2.4.7, which states that the support of any motif $m'$ obtained from a motif $m$ by replacing one or more nucleotides with wild card characters is greater than or equal to the support of the original motif. This holds for both the maximal and the additive type of support. It allows us to define a support operator that is guaranteed to be downward closed, which was an issue with NaiveSearch and MaxSupSearch [Lau09]. From now on we will refer to it as the safe over-approximation type of support.

Definition 3.4.1 Let $\mathrm{supp}^*(S, m)$ of a motif $m \in M_\ell$ denote the support of its wild card character extension $m^* \in M_k$, where $\ell \leq k$.

Recall that a wild card character extension of $m$ is a fixed-length motif that contains $m$ as a prefix and wild card characters as the remaining suffix. As an example, if we are interested in mining sequences of length $\ell = 3$, we first check the support of the wild card character extensions of the motifs in $M_1$, namely A**, T**, G**, C** (note that we do not include the motif * in this list, as it is the most frequent motif anyway and we are not interested in it). If any of these motifs is infrequent, for example T**, then we prune all its super-motifs TAA, TAT, TAG, TAC, TTA, et cetera. But [...] $\mathrm{supp}^*$ operator. We only have to keep in mind that it is downward closed only when mining motifs of fixed length, so that Proposition 2.4.7 holds.
Algorithm 3.4.2 Safe Over-Approximation Search
 1: procedure SafeApproxSearch($S, \vec{\sigma}, m, \ell$)
 2:     if IsFrequent($\vec{\sigma}, \overrightarrow{\mathrm{supp}}^*(S, m)$) then
 3:         if $|m| = \ell$ then
 4:             SaveMotif($m$)
 5:             return
 6:         end if
 7:     else
 8:         return
 9:     end if
10:     for $e \in \{$A, T, G, C$\}$ do
11:         SafeApproxSearch($S, \vec{\sigma}, m \,\|\, e, \ell$)
12:     end for
13: end procedure
Algorithm 3.4.2 defines SafeApproxSearch. Note that we use the $\overrightarrow{\mathrm{supp}}^*$ operator instead of $\overrightarrow{\mathrm{supp}}$ and use IsFrequent to determine whether we can prune the motif with its super-motifs. This is possible due to the downward closedness of the $\overrightarrow{\mathrm{supp}}^*$ operator.
Both MaxSupSearch and SafeApproxSearch have a similar theoretical runtime complexity $O(f \cdot 4^\ell)$, where the pruning factor $f \in (0, 1]$ is maximal if no pruning occurs and minimal if all motifs are pruned.
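A compact Python sketch of SafeApproxSearch is shown below. It is an illustration under simplifying assumptions: there is a single data track whose scores are all 1.0, the support is additive and normalised by the number of sequences, and the helper `supp_star` scans the sequences directly instead of querying a prebuilt hash-map.

```python
def supp_star(sequences, motif, length):
    """Toy single-track additive support of `motif` padded with wild cards.

    A window of `length` characters matches the extension exactly when it
    starts with `motif`; every match contributes 1.0 / len(sequences).
    """
    hits = sum(1 for s in sequences
               for i in range(len(s) - length + 1)
               if s.startswith(motif, i))
    return [hits / len(sequences)]


def safe_approx_search(sequences, thresholds, length, motif="", found=None):
    """Depth-first search that prunes on the support of wild card extensions."""
    if found is None:
        found = []
    support = supp_star(sequences, motif, length)
    if not all(s >= t for s, t in zip(support, thresholds)):
        return found                # extension infrequent: prune the branch
    if len(motif) == length:
        found.append(motif)
        return found
    for e in "ATGC":
        safe_approx_search(sequences, thresholds, length, motif + e, found)
    return found


result = safe_approx_search(["ATCCGTCCG", "TTCCG"], [1.0], 4)
```

On these two sequences the branches starting with A and G are cut at length one, because fewer than two windows of length four start with these letters, and only TCCG is reported.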
3.4.3 Infrequent Sub-Motifs Pruning Method
This alternative search method is directly based on Theorem 2.4.5. Namely, if we are interested in motifs of length $\ell$, then for any partitioning of a frequent motif $m \in M_\ell$ into two pieces $m_1, m_2$, at least one of the pieces must be frequent. The idea is to generate two sets $F$ and $I$, where $F$ contains the frequent motifs and $I$ the infrequent ones of length $\ell/2$. We then combine motifs from $F$ and $I$ to enumerate the final candidates. Note that we need $I$ because any frequent motif of length $\ell$ may have an infrequent prefix or suffix. We do not need to consider combinations of two infrequent submotifs, because by Theorem 2.4.5 we know that the resulting motif is also infrequent. There are many ways to partition the motifs, but making both pieces the same length enables us to enumerate them faster. Algorithm 3.4.3 describes this process.
Algorithm 3.4.3 Infrequent Sub-Motifs Search
 1: procedure InfreqSearch($S, \vec{\sigma}, m, \ell$)    ⊲ $\ell$ must be even
 2:     $(F, I) \leftarrow$ EnumerateMotifs($S, \vec{\sigma}, \ell/2$)
 3:     $C \leftarrow \{(a, b) \mid a \in F, b \in F \cup I\}$
 4:     for $c \in C$ do
 5:         if CanPrune($\vec{\sigma}, \overrightarrow{\mathrm{supp}}(S, c)$) then
 6:             continue
 7:         else if IsFrequent($\vec{\sigma}, \overrightarrow{\mathrm{supp}}^*(S, c_1 \,\|\, c_2)$) then
 8:             SaveMotif($c_1 \,\|\, c_2$)
 9:         else if $c_1 \neq c_2$ then
10:             if IsFrequent($\vec{\sigma}, \overrightarrow{\mathrm{supp}}^*(S, c_2 \,\|\, c_1)$) then
11:                 SaveMotif($c_2 \,\|\, c_1$)
12:             end if
13:         end if
14:     end for
15: end procedure
On line 3, we enumerate all the candidate motifs of length $\ell$. On line 5, we first try to eliminate candidates by using the information we know about their prefix $m_1$ and suffix $m_2$. We try this first because querying the database can be more costly, depending on the data structures used. The CanPrune method checks on every track whether
$$\mathrm{msupp}(S, m_1 \,\|\, m_2) \leq \frac{\mathrm{msupp}(S, m_1) + \mathrm{msupp}(S, m_2)}{2} < \sigma$$
for the maximal support type and
$$\mathrm{asupp}(S, m_1 \,\|\, m_2) \leq \frac{\mathrm{asupp}(S, m_1) + \mathrm{asupp}(S, m_2)}{2} < \sigma$$
for the additive support type. [...] Proposition 2.4.3. If we can prune $m_1 \,\|\, m_2$ using the above inequalities, then we can also prune $m_2 \,\|\, m_1$, as it makes no difference in which order we consider the prefix and the suffix part.
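The candidate enumeration of line 3 and the cheap pre-check of line 5 can be sketched in Python as follows. The function names and the representation of supports as plain lists with one value per data track are assumptions of this illustration; the bound is the average of the prefix and suffix supports, as in the inequalities above.

```python
from itertools import product


def candidate_pairs(frequent, infrequent):
    """All pairs (a, b) with a frequent and b either frequent or infrequent."""
    return list(product(frequent, list(frequent) + list(infrequent)))


def can_prune_pair(supp_prefix, supp_suffix, thresholds):
    """True if, on any track, the average of the two half supports is below
    the threshold, so that neither concatenation order can be frequent."""
    return any((p + s) / 2 < t
               for p, s, t in zip(supp_prefix, supp_suffix, thresholds))


pairs = candidate_pairs(["AT"], ["CG"])
prune_low = can_prune_pair([0.4], [0.2], [0.5])    # (0.4 + 0.2) / 2 = 0.3
prune_high = can_prune_pair([0.9], [0.3], [0.5])   # (0.9 + 0.3) / 2 = 0.6
```

Because the bound is symmetric in the two halves, a single call decides both $m_1 \,\|\, m_2$ and $m_2 \,\|\, m_1$, and only the candidates that survive it need a query against the hash-map.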
3.5 Mining Fixed Number of Best Motifs
The search algorithms discussed in earlier sections concentrate on finding all frequent motifs with respect to some threshold vector. But suppose we want to mine the 100 best motifs. Doing this by hand using any previously mentioned search algorithm would require the following process. First, we determine some reasonable thresholds and support types for the data tracks. Second, we mine frequent motifs using these thresholds and decide whether the number of motifs was too small or too large. Third, we modify the thresholds by increasing or decreasing them and mine again until we have the desired number of frequent motifs.
The process we just described is actually similar to binary search known in computer science. Algorithm 3.5.1 implements it to automate this process. On line 3, we determine two scalars $\alpha$ and $\beta$ such that mining with $\alpha \cdot \vec{\sigma}$ returns all motifs present in the data and mining with $(\beta + \varepsilon) \cdot \vec{\sigma}$ returns none of the motifs, where $\varepsilon > 0$. Trivially $\alpha = 0$, because in that case all motifs will be frequent. Determining $\beta$ is more complicated, because we do not have any prior knowledge about the maximal supports in the data. The first option is to make a guess, but a better alternative is to find out the supports by calculating $\vec{s} = \overrightarrow{\mathrm{supp}}^*(S, \texttt{*})$ and setting
$$\beta = \max\{s_i/\sigma_i \mid i = 1, \ldots, n\} \qquad (3.2)$$
where $n$ is the number of data tracks and $\vec{\sigma}$ contains the user-defined thresholds. This way mining with $\beta \cdot \vec{\sigma}$ may return only the minimal possible number of frequent motifs. Having these boundaries fixed, we can easily combine any previously defined search method with binary search. In other words, we keep scaling the original vector of thresholds $\vec{\sigma}$ until we get the desired number of frequent motifs. The linearity of this approach may not always be the best choice, because the relations between reasonable thresholds depend on the nature of the data. We do not study further possibilities in this work, but it could be a possible research area in the future.
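Algorithm 3.5.1
 1: procedure NBest($S, \vec{\sigma}, N, \ell$)
 2:     $\vec{s} \leftarrow \overrightarrow{\mathrm{supp}}^*(S, \texttt{*})$
 3:     $\alpha \leftarrow 0$, $\beta \leftarrow \max\{s_i/\sigma_i \mid i = 1, \ldots, n\}$    ⊲ $n$ is the number of tracks
 4:     $C \leftarrow \infty$    ⊲ The closest number of best motifs
 5:     $\delta \leftarrow 0$    ⊲ Scalar to be used to mine the closest number of best motifs
 6:     while $\beta - \alpha > \varepsilon$ do    ⊲ $\varepsilon > 0$ limits the recursion depth
 7:         $\gamma \leftarrow (\alpha + \beta)/2$
 8:         $k \leftarrow$ NumFreqMotifs($S, \gamma \cdot \vec{\sigma}, \theta, \ell$)    ⊲ $\theta$ is the zero-length motif
 9:         if $|k - N| < |C - N|$ then
10:             $C \leftarrow k$, $\delta \leftarrow \gamma$
11:         end if
12:         if $k > N$ then
13:             $\alpha \leftarrow \gamma$
14:         else if $k < N$ then
15:             $\beta \leftarrow \gamma$
16:         else if $k = N$ then
17:             break
18:         end if
19:     end while
20:     return MineMotifs($S, \delta \cdot \vec{\sigma}, \theta, \ell$)
21: end procedure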
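The threshold-scaling loop can be sketched in Python as follows. The function `count_frequent` stands in for NumFreqMotifs and is assumed to be non-increasing in the scale; the toy counting function in the usage example, which treats five fixed support values with a single threshold of 1.0, is a made-up illustration. The sketch keeps the count closest to the wanted number by comparing distances $|k - N|$, which is an interpretation of the intent of line 9.

```python
def n_best_scale(count_frequent, beta, n_wanted, eps=1e-6):
    """Binary search for a scale whose motif count is closest to n_wanted.

    count_frequent(gamma): number of frequent motifs when all thresholds are
    multiplied by gamma. Returns (best_gamma, best_count).
    """
    alpha = 0.0
    best_gamma, best_count = 0.0, None
    while beta - alpha > eps:            # eps bounds the number of iterations
        gamma = (alpha + beta) / 2
        k = count_frequent(gamma)
        if best_count is None or abs(k - n_wanted) < abs(best_count - n_wanted):
            best_gamma, best_count = gamma, k
        if k > n_wanted:
            alpha = gamma                # too many motifs: raise the thresholds
        elif k < n_wanted:
            beta = gamma                 # too few motifs: lower the thresholds
        else:
            break
    return best_gamma, best_count


# Toy stand-in for NumFreqMotifs: five motifs with fixed supports, threshold 1.0.
SUPPORTS = [0.9, 0.8, 0.5, 0.5, 0.2]


def toy_count(gamma):
    return sum(1 for s in SUPPORTS if s >= gamma)


gamma_two, count_two = n_best_scale(toy_count, max(SUPPORTS), 2)
gamma_three, count_three = n_best_scale(toy_count, max(SUPPORTS), 3)
```

Asking for two motifs terminates early with an exact hit. Asking for three cannot succeed, because two motifs share the support 0.5; the loop then stops once the interval is narrower than `eps` and returns a scale that yields four motifs.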
Function NumFreqMotifs can be used as a wrapper around the search methods described in earlier sections. There are still a few things to consider. First, a fixed number of best motifs does not always exist, because two motifs may have exactly the same support measures. In that case, binary search goes into an infinite loop. The same happens when the number of desired motifs is greater than the number of motifs present in the input data. In both situations, we need to limit the maximal depth of the recursion. But we can still return a number of motifs that is very close to the desired number of motifs. On line 4, we define $C$, which remembers the number of frequent motifs closest to the desired fixed number of motifs. The scalar $\delta$ can be used to scale $\vec{\sigma}$ to get $C$ frequent motifs. On line 6, we use $\varepsilon > 0$ to limit the recursion depth. On lines 12-18, we see binary search in action. The while loop