Direct construction of viable suffix trees

4.4. Direct construction of viable suffix trees

The following imperative algorithm constructs a viable suffix tree efficiently. It is based on Ukkonen’s algorithm [104] for online, linear-time suffix tree construction.

Differences and extensions in the algorithm compared to Ukkonen’s description in his article are marked in red.

To construct the viable suffix tree of a string $S$ of length $n$, we represent it as a pair $T = (V, E)$, where $V$ denotes the set of nodes and $E$ the set of edges. An edge is represented as a quadruple $(o, d, i, j)$, where the first component $o$ denotes the origin of the edge, the second component $d$ denotes the destination of the edge, and the last two components define the edge label $S^i_j$.
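The edge quadruple can be sketched as a small record type. This is only an illustration: the class name `Edge`, the field names, and the reading of $S^i_j$ as the substring from index $i$ to index $j$ inclusive (zero-based) are assumptions made here, not definitions taken from the text. The pair $(-1, -1)$ for an empty label mirrors the convention used later in Algorithm 4.2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical record for the quadruple (o, d, i, j)."""
    origin: int  # o: identifier of the node where the edge starts
    dest: int    # d: identifier of the node where the edge ends
    i: int       # first index of the label in S (assumed inclusive)
    j: int       # last index of the label in S (assumed inclusive)

    def label(self, s: str) -> str:
        # Assumption: negative indices such as (-1, -1) encode an empty label.
        if self.i < 0:
            return ""
        return s[self.i:self.j + 1]


S = "((..))"
print(Edge(0, 1, 2, 4).label(S))          # label taken from S by index
print(repr(Edge(1, 2, -1, -1).label(S)))  # empty label
```

Storing indices instead of copied substrings keeps every edge at constant size, which is what a linear-space representation requires.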

In general, the algorithm is divided into $|S|$ phases while the string is processed symbol by symbol from left to right. In each phase $i+1$, the (viable) suffix tree $T^i$ from the previous phase is used to generate the tree of the current phase $T^{i+1}$. For that, the phase $i+1$ can be further divided into $i+1$ extension operations. Extension operation $j$ starts by identifying the path with the label $S^j_i$. Then, we can simply use these three rules according to Gusfield [43]:

1. We find a leaf $v$ with $label(v) = S^j_i$ in $T^i$. Then, the edge leading to the leaf can be extended by character $S^{i+1}_{i+1}$.

2. The path with the label $S^j_i$ does not continue with the character at string position $i+1$. Instead, it continues with one or multiple different characters. Then, we introduce a new leaf as well as a new edge leading from the end of the path $S^j_i$ to the new leaf. The new edge is labeled with $S^{i+1}_{i+1}$.

3. There is already a path $S^j_{i+1}$ that includes $S^{i+1}_{i+1}$. In this case, the suffix is implicit, but it will be explicit, at the latest with the processing of the sentinel.
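The three rules above can be tried out on an uncompressed suffix trie, which is much simpler than a compressed suffix tree and far from linear time. The following sketch is an illustration under that simplification. The function name `build_trie_by_phases` and the zero-based indexing are assumptions made here, and the classification only mirrors the rules for the classical (not the viable) suffix tree.

```python
def build_trie_by_phases(s):
    """Naive phase/extension construction that records which rule fired."""
    root = {}
    log = []  # entries (phase, extension, rule)
    for i in range(len(s)):        # one phase per character s[i]
        c = s[i]
        for j in range(i + 1):     # extension j handles the path s[j:i]
            node = root
            for ch in s[j:i]:      # walk down the existing path
                node = node[ch]
            if c in node:
                rule = 3           # path already continues with c
            elif not node and node is not root:
                rule = 1           # path ends in a leaf: extend it
                node[c] = {}
            else:
                rule = 2           # branch off with a new leaf
                node[c] = {}
            log.append((i, j, rule))
    return root, log


_, log = build_trie_by_phases("abab")
for phase in range(4):
    print(phase, [rule for (p, _, rule) in log if p == phase])
```

Within each phase the recorded rules appear in the order 1, then 2, then 3, which is the observation that the shortcuts of Ukkonen’s algorithm rely on.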

The first rule does not have to be performed with Ukkonen’s algorithm, because we know that once a leaf is created, it will always be updated with rule one in any successive phase. Therefore, whenever a leaf node is created, we simply set the end of the edge that leads to the leaf to the last index of the input string. Thus, no leaf ever has to be extended again. Next, whenever rule three applies, it will also apply in all following extension steps of the same phase. Therefore, we can stop the current phase at this point, because all following suffixes are already represented implicitly. This describes the general strategy of Ukkonen’s algorithm, but there are several tricks to make it work in linear time for alphabets of constant size.

Additionally, we introduce the notion of suffix links.

Definition 4.6. Let $S = cR$ be a string that is built from the concatenation of character $c$ and string $R$. Furthermore, let $v$ be a node such that $label(v) = S$. If there exists another node $u$ such that $label(u) = R$, then the pointer from $v$ to $u$ is called suffix link.

It is important to note that if $v$ is a branching node, then the endpoint of its suffix link is a branching node, too. (Gusfield [43] presents an excellent summary of the suffix tree construction according to Ukkonen, including the proof of the previous note.) However, this does not necessarily hold for viable suffix trees. The viable suffix tree in Example 4.1 shows that there is an internal node representing string “(.)”, but “.)” does not exist, because it is not well-formed. Therefore, for viable suffix trees, we use the lvp for the suffix link if the actual one does not exist. In this case, the lvp of “.)” is “.”; therefore, the suffix link of “(.)” would point to the internal node representing “.”.
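The fallback to the lvp can be made concrete with a small sketch. It assumes, as a working reading of the examples in this section, that the lvp of a dot-bracket string is its longest prefix that contains no closing bracket without a matching opening bracket. The names `lvp_end` and `viable_suffix_link_label` are hypothetical; `lvp_end` is only meant to resemble the function end used in the algorithms of this section, not to reproduce it.

```python
def lvp_end(s: str, start: int) -> int:
    """Inclusive end index of the assumed lvp of the suffix s[start:].

    Returns start - 1 if the suffix begins with an unmatched ')',
    which means that the lvp is empty.
    """
    open_pairs = 0
    for pos in range(start, len(s)):
        if s[pos] == "(":
            open_pairs += 1
        elif s[pos] == ")":
            if open_pairs == 0:   # unmatched closing character
                return pos - 1
            open_pairs -= 1
    return len(s) - 1


def viable_suffix_link_label(label: str) -> str:
    """Label that the suffix link of a node with this label points to."""
    rest = label[1:]                      # classical suffix link target
    return rest[:lvp_end(rest, 0) + 1]    # shortened to its lvp


print(viable_suffix_link_label("(.)"))     # the example from the text
print(viable_suffix_link_label("((..))"))
```

For well-formed labels the target is at least one character shorter than the source, and it is more than one character shorter exactly when closing characters lose their partner.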

In order to define the current state of the suffix tree construction, Ukkonen uses the notion of reference pairs.

Definition 4.7. Let $T$ be a rooted, directed, edge-labeled tree and $S$ a string. The pair $(v, R)$ is called reference pair of $S$ with respect to $T$ if $v$ is a branching node from $T$ and $S = label(v) \mathbin{+\!+} R$. If $label(v)$ is the longest prefix of $S$ represented by a branching node of $T$, then the pair is called canonical reference pair.

If a node $v$ directly represents string $S$, then the canonical reference pair of $S$ is $(v, \varepsilon)$. Furthermore, Ukkonen introduces the active point, which is the point at which the traversal of any extension process starts. It is used so that the traversal for every extension does not have to restart at the root of the tree. Usually, the previous extension makes sure that the active point is set properly for the next extension. This can be achieved by using the end point of the previous extension phase. The end point of a phase is the pair at which the processing for this phase stops. Often, this happens when rule three is applied.
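The definition of the canonical reference pair can be sketched without building a tree, by identifying each branching node with its label. This is a simplification made for illustration only; the function name `canonical_reference_pair` and the convention that the root is the branching node with the empty label are assumptions.

```python
def canonical_reference_pair(branching_labels, s):
    """Return (label(v), R) with s == label(v) + R and label(v) as long as possible.

    branching_labels: labels of all branching nodes; the root's empty
    label "" is assumed to be included, so a candidate always exists.
    """
    candidates = [lab for lab in branching_labels if s.startswith(lab)]
    best = max(candidates, key=len)
    return best, s[len(best):]


labels = {"", "a", "ab", "b"}
print(canonical_reference_pair(labels, "abc"))  # remainder is non-empty
print(canonical_reference_pair(labels, "ab"))   # node represents the string directly
```

In the second call the remainder is empty, which corresponds to the case in which a node directly represents the string.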

The construction of a new (valid) suffix tree starts with a tree that only consists of the root node. The table Link stores all suffix links of the tree. Algorithm 4.1 processes the string from left to right and calls the update function at the start of every new phase. For the first step, the reference pair is set to $(root, S^0_0)$. In general, the active point of phase $i$ is described here as $(v, S^k_i)$. The call of function update returns the end point of phase $i$ and function canonize returns the active point, as canonical reference pair, for the next phase.

Algorithm 4.1 Direct construction of the viable suffix tree

    function constructViableSuffixTree(S)
        add node root
        Link[root] ← NULL
        v ← root
        k ← 0
        i ← 0
        while i < |S| do
            (v, k) ← update(v, k, i)    ▷ (v, S^k_i) is the active point of phase i
            (v, k) ← canonize(v, k, i + 1, end(S, k − depth(v)))
            i ← i + 1

Algorithm 4.2 Update the tree for phase i

    function update(v, k, i)
        oldR ← root
        i' ← end(S, k − depth(v))    ▷ end position of the lvp of suffix S^{k−depth(v)}
        (endPoint, r, edgeLen) = testAndSplit(v, k, i, i', S^i_i)
        while not endPoint do
            pos ← i − depth(r)
            if edgeLen > 0 then    ▷ check whether the new edge has empty label or not
                add leaf with label pos and edge (r, pos, i, end(S, pos))
            else if edgeLen = 0 and r ≠ root then
                add leaf with label pos and edge (r, pos, −1, −1) with empty label
            if oldR ∉ {root, NULL} then
                Link[oldR] ← r
            oldR ← r
            if v ∉ {root, NULL} then
                k ← k − (depth(v) − depth(Link[v]) − 1)    ▷ suffix link distance may be > 1
                i' ← end(S, k − depth(Link[v]))    ▷ end position of the lvp of suffix S^{k−depth(Link[v])}
            (v, k) ← canonize(Link[v], k, i, i')
            i' ← end(S, k − depth(v))    ▷ end position of the lvp of suffix S^{k−depth(v)}
            (endPoint, r, edgeLen) = testAndSplit(v, k, i, i', S^i_i)
        if oldR ∉ {root, NULL} then
            Link[oldR] ← r
        return (v, k)

Algorithm 4.2 transforms $T^{i-1}$ into $T^i$. For every suffix starting at position $k - depth(v)$, position $i'$ denotes the end position of the lvp. This new variable is used to call the function testAndSplit, which returns three values. The first one is a boolean value that tells us whether the current pair $(v, S^k_i)$ is the end point of the current phase $i$. If not, then the new variable edgeLen tells us whether the new edge has an empty label or not. We add the new leaf and edge according to the value of the variable. Obviously, we do not add edges with empty labels to the root, because then we would represent empty suffixes, i.e. suffixes without an lvp.

Next, the algorithm follows the suffix link of $v$ to the next shorter suffix. For the classical suffix tree, it is guaranteed that $v$ has a suffix link to a node $u$ that has a label that is only one character shorter than the one of $v$. As explained before, this does not hold for viable suffix trees. Instead, we follow the suffix link, but will decrease the value of $k$ such that it reflects the number of characters by which the suffix link of $v$ is shorter than $v$ itself. This is necessary, because otherwise the reference pair for the next extension phase would be wrong.
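The adjustment of $k$ follows the update rule in Algorithm 4.2, $k \leftarrow k - (depth(v) - depth(Link[v]) - 1)$. A minimal sketch of this arithmetic, with the hypothetical function name `k_after_suffix_link` and node depths passed in directly as numbers:

```python
def k_after_suffix_link(k: int, depth_v: int, depth_link: int) -> int:
    """Value of k after following the suffix link of v.

    depth_v and depth_link are the label lengths of v and Link[v].
    In the classical case the distance is 1 and k stays unchanged.
    """
    distance = depth_v - depth_link
    return k - (distance - 1)


# Classical suffix link: the target is one character shorter.
print(k_after_suffix_link(7, 3, 2))
# Viable suffix link from "((..))" to "(..)": two characters shorter.
print(k_after_suffix_link(7, len("((..))"), len("(..)")))
```

The value 7 for $k$ is an arbitrary illustration; only the difference between the two depths matters for the correction.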

The question is now: how does this influence the runtime of the viable suffix tree construction?

First of all, whenever the length difference between $v$ and its suffix link is greater than one, it means that there are closing base pair characters that do not have a corresponding opening base pair character in the suffix that we are currently creating. This means also that the insertion of the suffixes starting with closing base pair characters will be skipped. Now, if we have node “((..))”, its classical suffix link would point to a node “(..))”, which is not valid. Instead now, it points to node “(..)”. Furthermore, the new suffix link of “(..)” now points to “..” instead of “..)”. Compared to the construction of the classical suffix tree, we have decreased the value of $k$ already twice as often as usual. But, on the other hand, in the next steps, we can skip the insertion of the two suffixes starting with the closing base pair characters. Also, during these steps, the algorithm will not walk down the tree any further. Instead, function canonize will make sure that we walk up the necessary edges before inserting the next suffix after processing the two closing base pair characters. This means that the total number of down and up walking for constructing the viable suffix tree is still bounded by $O(n)$. This holds for the language of the grammar that generates well-formed RNA structures, because we have exactly one character, the closing base pair character, that might cause the suffix to get invalid at some position. Additionally, it is the character that causes the suffix to be invalid right from the start. All other suffixes that start with “(” or “.” are guaranteed to have an lvp of at least length one, because the input string of the tree construction is well-formed. For more complex grammars, this is not guaranteed to work.

The additional check whether the variable for the node oldR equals NULL is just an artifact of our implementation of the construction algorithm. In the original approach by Ukkonen, the parent of the root node is handled implicitly by having an edge of length one to the root via any character from the alphabet. We have chosen to handle this case explicitly; therefore, the check against NULL, which is the artificial value for the parent node of the root, needs to be considered here.

Algorithm 4.3 Check whether the tree for this step is complete

    function testAndSplit(v, k, p, p', t)
        if p = k then
            if v = NULL then
                return (True, v, _)
            else if there is an edge (v, _, _, _) starting with t then
                return (True, v, _)
            else
                return (False, v, p' + 1 − k)
        else if p' < k then    ▷ invalid suffix; empty leaf possible
            return (False, v, p' + 1 − k)
        else
            minP ← min(p, p')
            search for an edge (v, v', i, j) starting with S^k_k
            if t = S^{i+p−k}_{i+p−k} then
                return (True, v, _)
            if j − i > minP − k then    ▷ regular case for new branching node
                split edge (v, v', i, j) at position i + minP − k using branching node mid
                return (False, mid, end(S, minP − depth(mid)) + 1 − minP)
            return (False, v', end(S, minP − depth(v')) + 1 − minP)

Algorithm 4.3 tests whether the current point is the end point of phase $p$. Function testAndSplit returns true if it is the end point. The answer can be found by only testing whether there is a path that continues with character $t$. If $p = k$, then we have to check whether there is an outgoing edge from $v$ that starts with $t$. On the other hand, if $k < p$, then the algorithm has to check whether the edge that describes the continuing path also contains character $t$ at the right position. If one of the two cases is true, then rule three applies and the current phase can be ended.

The new condition $p' < k$ checks whether the suffix is invalid. In this case, the leaf that has to be inserted at node $v$ has to have an empty label, because $p' + 1 = k$. If neither $p$ nor $p'$ is smaller than $k$, we have to correctly calculate the end point by taking the minimum from both.

Further, node mid is introduced to cover two different cases: first, the already known case from Ukkonen’s algorithm, $j - i > minP - k$, which requires us to split the edge at position $i + minP - k$ and introduces a new branching node. Second, the case $j - i = minP - k$, which means that an lvp exists multiple times in the lvp-repertoire of string $S$. For this, we do not need to introduce a new branching node, because the needed branching node already exists and is $v$.

There is no change in Algorithm 4.3 that might influence the runtime of the overall algorithm.
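For comparison, the classical algorithm that Algorithms 4.1 to 4.3 extend can be sketched in Python. The code below is a transcription of Ukkonen’s original online construction for ordinary suffix trees; it does not contain the lvp handling, the edgeLen return value, or the explicit NULL checks described above. The names `Node`, `build_suffix_tree`, and `contains`, the auxiliary node `bottom` (the parent of the root, which Ukkonen handles implicitly), and the zero-based inclusive indices are choices made for this sketch.

```python
class Node:
    def __init__(self):
        self.edges = {}   # first character -> (start, end, child), end inclusive
        self.link = None  # suffix link


def build_suffix_tree(text):
    last = len(text) - 1            # leaf edges end at the last index right away
    root, bottom = Node(), Node()   # bottom: auxiliary parent of the root
    root.link = bottom

    def transition(s, k):
        # From bottom, every character leads to the root via an edge of length one.
        return (k, k, root) if s is bottom else s.edges[text[k]]

    def canonize(s, k, p):
        if p < k:
            return s, k
        k2, p2, s2 = transition(s, k)
        while p2 - k2 <= p - k:     # walk down complete edges
            k += p2 - k2 + 1
            s = s2
            if k <= p:
                k2, p2, s2 = transition(s, k)
        return s, k

    def test_and_split(s, k, p, t):
        if k <= p:                  # reference pair ends inside an edge
            k2, p2, s2 = s.edges[text[k]]
            if t == text[k2 + p - k + 1]:
                return True, s
            r = Node()              # split the edge with a new branching node
            s.edges[text[k2]] = (k2, k2 + p - k, r)
            r.edges[text[k2 + p - k + 1]] = (k2 + p - k + 1, p2, s2)
            return False, r
        if s is bottom:
            return True, s
        return (t in s.edges), s

    def update(s, k, i):
        oldr = root
        end_point, r = test_and_split(s, k, i - 1, text[i])
        while not end_point:
            r.edges[text[i]] = (i, last, Node())   # new leaf
            if oldr is not root:
                oldr.link = r
            oldr = r
            s, k = canonize(s.link, k, i - 1)
            end_point, r = test_and_split(s, k, i - 1, text[i])
        if oldr is not root:
            oldr.link = s
        return s, k

    s, k = root, 0
    for i in range(len(text)):
        s, k = update(s, k, i)
        s, k = canonize(s, k, i)
    return root


def contains(root, text, pattern):
    """Check whether pattern labels a path starting at the root."""
    node, pos = root, 0
    while pos < len(pattern):
        if pattern[pos] not in node.edges:
            return False
        start, end, child = node.edges[pattern[pos]]
        label = text[start:end + 1]
        chunk = pattern[pos:pos + len(label)]
        if not label.startswith(chunk):
            return False
        pos += len(chunk)
        node = child
    return True


text = "((..))$"   # "$" is a sentinel chosen for this sketch
tree = build_suffix_tree(text)
print(contains(tree, text, "(..)"), contains(tree, text, ".("))
```

Because this baseline is the classical tree, it still contains ill-formed paths such as “..))”; restricting each suffix to its lvp is what the changes in Algorithms 4.2 and 4.3 add.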

Algorithm 4.4 is given a reference pair for some node $u$ and finds the canonical reference pair $(v', S^{k'}_p)$ for $u$. This is done by walking down as many edges as necessary to arrive at $v'$ such

From the document: Methods for the identification of common RNA motifs (pages 62-67)