
Computation of lvp lengths for RNAs

4. The viable suffix tree

4.5. Computation of lvp lengths for RNAs

over the dot-bracket alphabet.

In order to store the precomputed values, a table named End is used. The values in this table are equivalent to the ones computed by the End function in Definition 4.3. As a by-product,⁸ the algorithm computes a second table named Depth. This table holds the branching depth at each position of the RNA structure, defined as the number of base pair open characters that have not yet been matched by their corresponding base pair closing characters. It is required to judge in constant time whether any substring of the RNA structure constitutes a well-formed RNA secondary structure, i.e. whether the substring can be generated using G_S.
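As a small illustration of the Depth table only, the following Python sketch computes the branching depth for a dot-bracket string. The function name `depth_table` and the example string are assumptions made for illustration; the convention that an opening character already counts towards the depth at its own position follows the recurrence Depth[i] ← Depth[i−1] + 1 used in Algorithm 4.5.

```python
from typing import List


def depth_table(structure: str) -> List[int]:
    """Return the branching depth at every position of a dot-bracket string."""
    depth = []
    current = 0  # plays the role of Depth[-1] = 0
    for ch in structure:
        if ch == "(":
            current += 1  # one more base pair is open from here on
        elif ch == ")":
            current -= 1  # the innermost open base pair is closed
        # unpaired bases and any other characters leave the depth unchanged
        depth.append(current)
    return depth


print(depth_table("(.(.).)."))  # [1, 1, 2, 2, 1, 1, 0, 0]
```

For a well-formed structure the last entry is 0, because every opening character has been matched.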

The algorithm iterates over all characters of the input string S. It distinguishes four cases: the current character stands for either base pair open, base pair close, unpaired base, or any other character that is not part of the alphabet, such as the sentinel character $.

In the first case, the linked list LL, which contains all elements inside the current branch, meaning all elements in between the opening and closing base pair characters, is extended by the current position i. Subsequently, a new branch level for all characters in between position i and the corresponding closing base at position j is created by pushing the list for the current level on top of the stack ST and emptying the list LL. All new characters in between i and j are added to LL until a new opening or closing base pair character appears in the string. Finally, the current branching depth is increased by one.

In the second case, when a closing base pair character is observed, the end positions for all elements of the current list LL, including the current position, are set to i−1, which is the last position before the closing base pair character. Note that the corresponding opening base pair character is not part of the current list, because there might be unpaired bases following the current closing base pair character. In such a case, the lvp of the opening character does not have to end at position i.

After this, the top list of the stack ST, which holds all elements of the branch level that continues after this position, becomes the current active list LL again. Subsequently, the current branching depth is reduced by one.

In the third case, all unpaired base characters are added to the list LL that represents all positions of the current branch level. Since there is no change in the branch level, the branching depth remains unchanged, too. Last, for all characters that are not part of the alphabet, the procedure of case two can be used: the entry in the End table for all elements of the current list, including the current position, is set to i−1. There are no lists left on the stack ST if the input string S is a well-formed RNA structure.

Lemma 4.2. Let S be a well-formed RNA sequence string of length n. The End table correctly holds all end positions for the lvps of all suffixes and the Depth table holds all branch levels.

⁸ For the construction of viable suffix trees and in other parts of this thesis, we assume that the values of the tables Depth and End are accessed by using the corresponding functions depth and end.


Algorithm 4.5 Computation of the End and Depth table for a well-formed RNA structure

    function computeEnd(S, Σ)
        initialize LinkedList LL and Stack ST     ▷ ST is used as stack of lists of type LL
        i ← 0
        Depth[−1] ← 0
        while i < |S| do                          ▷ iterate over every character in the string
            if S_i = ( then                       ▷ case 1 for base pair open
                LL.append(i)
                ST.push(LL)
                LL.clear()                        ▷ remove all elements from LL; ST won't be affected by this
                Depth[i] ← Depth[i−1] + 1
            else if S_i = ) then                  ▷ case 2 for base pair close
                LL.append(i)
                for x ∈ LL do
                    End[x] ← i−1
                LL ← ST.pop()
                Depth[i] ← Depth[i−1] − 1
            else if S_i = . then                  ▷ case 3 for unpaired base
                LL.append(i)
                Depth[i] ← Depth[i−1]
            else if S_i ∉ Σ then                  ▷ case 4 for other characters, like the terminal symbol
                LL.append(i)
                for x ∈ LL do
                    End[x] ← i−1
                LL.clear()
                Depth[i] ← Depth[i−1]
            i ← i + 1
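The pseudocode above can be transcribed into a short runnable sketch. The following Python version is one possible reading, not the thesis implementation: the name `compute_end`, the use of plain Python lists for LL and ST, and the example string are assumptions. It omits the alphabet parameter Σ and treats every character other than `(`, `)` and `.` as case 4. Instead of copying LL onto the stack and clearing it, the sketch pushes the list object itself and starts a fresh list, which has the same effect.

```python
from typing import List, Optional, Tuple


def compute_end(s: str) -> Tuple[List[Optional[int]], List[int]]:
    """Sketch of Algorithm 4.5 for a well-formed dot-bracket string s."""
    end: List[Optional[int]] = [None] * len(s)
    depth: List[int] = [0] * len(s)
    current: List[int] = []      # LL: positions of the current branch level
    stack: List[List[int]] = []  # ST: lists of the enclosing branch levels
    prev_depth = 0               # Depth[-1] = 0

    for i, ch in enumerate(s):
        current.append(i)
        if ch == "(":                      # case 1: base pair open
            stack.append(current)          # park the enclosing level
            current = []                   # start a new, empty level
            depth[i] = prev_depth + 1
        elif ch == ")":                    # case 2: base pair close
            for x in current:
                end[x] = i - 1             # lvp ends before the closing character
            current = stack.pop()          # resume the enclosing level
            depth[i] = prev_depth - 1
        elif ch == ".":                    # case 3: unpaired base
            depth[i] = prev_depth
        else:                              # case 4: e.g. the sentinel $
            for x in current:
                end[x] = i - 1
            current = []
            depth[i] = prev_depth
        prev_depth = depth[i]
    return end, depth


end, depth = compute_end("(.(.).).$")
print(end)    # [7, 5, 5, 3, 3, 5, 5, 7, 7]
print(depth)  # [1, 1, 2, 2, 1, 1, 0, 0, 0]
```

In the example, the suffix starting at position 1 has End value 5, so its lvp is `.(.).`, while both closing characters (positions 4 and 6) receive an End value smaller than their own position. The sketch assumes well-formed input; an unmatched closing character would raise an IndexError at `stack.pop()`.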

Initialization: S is the input string. ST and LL are empty lists. Depth[−1] ← 0.

Transitions:

    S        ST      LL     ⇒   S        ST         LL     End & Depth tables
    (_i R    T       M      ⇒   R        (i:M):T    []     Depth[i] ← Depth[i−1] + 1
    )_i R    T       k:M    ⇒   )_i R    T          M      End[k] ← i−1
    )_i R    M:T     []     ⇒   R        T          M      End[i] ← i−1, Depth[i] ← Depth[i−1] − 1
    ._i R    T       M      ⇒   R        T          i:M    Depth[i] ← Depth[i−1]
    $_i R    []      k:M    ⇒   $_i R    []         M      End[k] ← i−1
    $_i R    []      []     ⇒   R        []         []     End[i] ← i−1, Depth[i] ← Depth[i−1]

Figure 4.1.: Function computeEnd as transition system. The subscript shows the current position in the text, [] denotes the empty list, : denotes the prepend operator, and $ represents all characters that are not part of the alphabet.

Proof. According to the definition, a closing base pair character has no lvp. This is covered in case two: the position i of the closing base pair character is added to list LL and its End entry is subsequently set to i−1. Since i−1 < i, the closing base pair character does not have an lvp. The branch level of this position is reduced by one in table Depth.

Next, in the case of an unpaired base character (case 3), we assume that the current position has branch level x and remember that it is guaranteed that the input RNA structure is well-formed. Based on this, we can conclude that the lvp only comprises the next characters that have a branch level ≥ x and ends either at the position before the closing base pair character that has branch level x−1 or one position before the sentinel. Each branch level is kept separate; all characters of the current level are present in list LL and the other ones are ordered descending on the stack ST. Once the level ends at a closing base pair character, the End entries of all positions that are stored in LL are set to one position before the closing base pair character (case 2). In case of the sentinel, the same holds (case 4). The branch level of this position does not change and is copied from the previous one in table Depth.

Finally, the opening base pair character does not belong to the same branch level as the following characters that are enclosed by this and the corresponding closing base pair character; it belongs to the previous branch level. The reason for this is that the lvp starting at the opening base pair character position does not necessarily end at the corresponding closing base pair character position, because there may be unpaired base characters that follow the corresponding closing base pair character; those would still extend the lvp. Therefore, the current position is added to list LL before this is pushed onto the stack ST (case 1). In case of the Depth table, the branch level increases by one already for this position, even though the new branch level only starts with the following character.

Lemma 4.3. Let S be a well-formed RNA sequence string of length n. The End and Depth tables can be computed in O(n) using the computeEnd function.

Proof. The algorithm iterates over all characters of the string of length n. Every character is added to the linked list LL exactly once when the while loop iterates over its position in the string. In case one, the current list is pushed on top of the stack ST; afterwards, the current list is cleared. This way, every element can only be in one list, either in the current one, namely LL, or in a list on the stack ST.

In cases two and four, there are additional iterations over all items of LL. For case two, the current list is emptied and replaced by the one that is on top of the stack ST. This replacement of LL is not necessary for case four, because it is a terminal symbol and ST is empty at this point. So, in addition to the while loop, every element's End value is set exactly once, either in case two or in case four. Hence, the runtime stays linear and is O(n).
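The counting argument can be checked empirically with a small instrumented sketch. The Python function below is hypothetical (the name `count_operations` and the test strings are assumptions); it mirrors the list handling of Algorithm 4.5 and only counts list appends and End assignments instead of storing any table. It pushes the list object itself onto the stack, so a push takes constant time in this sketch; an implementation that copies the list on every push would have to account for that cost separately.

```python
from typing import List, Tuple


def count_operations(s: str) -> Tuple[int, int]:
    """Count list appends and End assignments for a well-formed input s."""
    appends = 0
    end_writes = 0
    current: List[int] = []      # LL
    stack: List[List[int]] = []  # ST
    for i, ch in enumerate(s):
        current.append(i)
        appends += 1
        if ch == "(":                    # case 1: park the enclosing level
            stack.append(current)
            current = []
        elif ch == ")":                  # case 2: flush the current level
            end_writes += len(current)
            current = stack.pop()
        elif ch != ".":                  # case 4: flush at the sentinel
            end_writes += len(current)
            current = []
    return appends, end_writes


print(count_operations("(.(.).).$"))                        # (9, 9)
print(count_operations("." * 1000 + "()" * 1000 + "$"))     # (3001, 3001)
```

For both inputs, each of the n positions is appended once and receives its End value once, which matches the bound stated in the proof.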

In order to decide whether a substring S_ij of S is viable, we use the function isValid. The decision can be made in constant time if the tables End and Depth are already computed.
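The definition of isValid is not reproduced in this section. The following Python sketch therefore only illustrates how a constant-time test could look under the assumption that the question is whether S[i..j] (both ends inclusive) is a well-formed structure: no closing character may appear before position End[i] without its partner, and the branching depth at j must equal the depth just before i. The function name `is_well_formed`, the example string, and the hard-coded tables (obtained by applying Algorithm 4.5 by hand to that string) are all assumptions.

```python
# Example structure with sentinel and its tables, worked out by hand
# from Algorithm 4.5 (illustrative values, not taken from the thesis).
S = "(.(.).).$"
END = [7, 5, 5, 3, 3, 5, 5, 7, 7]
DEPTH = [1, 1, 2, 2, 1, 1, 0, 0, 0]


def is_well_formed(i: int, j: int) -> bool:
    """Hypothetical O(1) test whether S[i..j] is a balanced structure."""
    depth_before = DEPTH[i - 1] if i > 0 else 0  # Depth[-1] = 0
    inside_lvp = i <= j <= END[i]                # no unmatched closing character
    balanced = DEPTH[j] == depth_before          # every opened pair is closed
    return inside_lvp and balanced


print(is_well_formed(2, 4))  # "(.)"  -> True
print(is_well_formed(2, 3))  # "(."   -> False, the pair is still open
print(is_well_formed(3, 4))  # ".)"   -> False, closing character without partner
```

Each call performs a fixed number of table lookups and comparisons, independent of the substring length.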