Uncertainty regions and breakpoint intervals


3.1.3 Uncertainty regions and breakpoint intervals

For each query sequence position and each subtype in the given alignment, the posterior probability is calculated. Based on these probabilities, first uncertainty regions (UR) in the predicted recombination for the query sequence and second interval estimates of breakpoints, called breakpoint intervals (BPI), are defined (Workflow 3.1).

3.1.3.1 Uncertainty region

If, at a certain position i of a query sequence S, the posterior probability of the subtype predicted by jpHMM for this position is lower than a certain threshold t_UR, 0 < t_UR < 1, the prediction for this position is marked as uncertain (Figure 3.1 a). This classification accounts for the fact that there is a significant (≥ 1 − t_UR) probability that the predicted subtype is wrong according to the probabilistic model.
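The thresholding rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the jpHMM implementation: the function name `uncertainty_regions`, the representation of posterior probabilities as one dictionary per position, the 0-based inclusive coordinates and the example values are all hypothetical.

```python
from typing import Dict, List, Tuple


def uncertainty_regions(
    posteriors: List[Dict[str, float]],  # hypothetical layout: subtype -> posterior, per position
    predicted: List[str],                # predicted subtype per position
    t_ur: float,                         # threshold t_UR
) -> List[Tuple[int, int]]:
    """Return maximal runs (start, end), 0-based and inclusive, of positions
    whose predicted subtype has a posterior probability below t_ur."""
    regions = []
    start = None
    for i, (post, subtype) in enumerate(zip(posteriors, predicted)):
        uncertain = post[subtype] < t_ur
        if uncertain and start is None:
            start = i                       # a new uncertain run begins
        elif not uncertain and start is not None:
            regions.append((start, i - 1))  # the run ended at the previous position
            start = None
    if start is not None:                   # run reaches the sequence end
        regions.append((start, len(predicted) - 1))
    return regions


# Invented example values, for illustration only.
posteriors = [
    {"A": 0.99, "B": 0.01},
    {"A": 0.60, "B": 0.40},
    {"A": 0.55, "B": 0.45},
    {"A": 0.98, "B": 0.02},
]
predicted = ["A", "A", "A", "A"]
print(uncertainty_regions(posteriors, predicted, t_ur=0.9))  # [(1, 2)]
```

Grouping adjacent uncertain positions into runs mirrors the hatched regions of Figure 3.1; whether jpHMM stores regions in this form is not stated in the text.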


Figure 3.1: Two examples for uncertainty regions in predicted recombinations. In both figures, for each query sequence position (qp), the posterior probabilities (post. p.) of three subtypes are plotted. t_UR marks the posterior probability threshold for the definition of uncertainty regions. Vertical dashed lines define the extent of uncertainty regions. The first bar (1) below the plot of the posterior probabilities shows the original recombination prediction with precise breakpoint positions. The second bar (2) shows the predicted recombination including uncertainty regions (hatched regions). In a) an uncertainty region is defined because the posterior probability of the predicted subtype (red) is below t_UR. In b) the region around the predicted breakpoint is not defined as a breakpoint interval since the posterior probability of a third subtype (blue) is higher than the posterior probabilities of the subtypes predicted to the left (green) and to the right (red) of the breakpoint.

For uncertainty regions, no parental strain can be determined with confidence. However, the new jpHMM output includes both a text file with the posterior probabilities for all query sequence positions and all subtypes and a graph of these posterior probabilities (see chapter 4, Implementation). The output therefore still shows which subtypes are most closely related to the query sequence in these regions.

Chapter 3. Improvements, extensions and modifications of jpHMM

3.1.3.2 Breakpoint interval

For each predicted breakpoint position (defined by the Viterbi path), the corresponding breakpoint interval is defined as the interval around the predicted breakpoint position in which the posterior probabilities of the two subtypes predicted to the left and to the right of the breakpoint are lower than a certain threshold t_BPI, 0 < t_BPI < 1, but higher than the posterior probabilities of all other subtypes (Figure 3.2). The maximum extent of such a breakpoint interval is limited by the positions of the preceding and the succeeding predicted breakpoint (if one of these breakpoints does not exist, the maximum extent is restricted by the corresponding sequence end). Therefore, if the posterior probability of the subtype predicted to the left (to the right, resp.) of the breakpoint does not reach the threshold t_BPI at any position between the preceding and the current breakpoint (between the current and the succeeding breakpoint, resp.), the whole interval is defined as an uncertainty region.

This also happens if the posterior probability of a third subtype is higher than the posterior probability of one of the two predicted subtypes in this region (Figure 3.1 b), to indicate the possibility of an undetected recombination segment. If the predicted breakpoint is located outside of the breakpoint interval defined by the posterior probabilities (Figure 3.2 b), the breakpoint interval is extended to include the predicted breakpoint. The length of a predicted breakpoint interval indicates how precisely the breakpoint can be located. A large interval, for example, reflects the uncertainty of the model about the exact breakpoint position between two subtypes.

Figure 3.2: Two examples for breakpoint intervals in predicted recombinations. In both figures, for each query sequence position (qp), the posterior probabilities (post. p.) of two subtypes are plotted. t_BPI marks the posterior probability threshold for the definition of breakpoint intervals. Vertical dashed lines define the extent of breakpoint intervals. The first bar (1) below the plot of the posterior probabilities shows the original recombination prediction with precise breakpoint positions. The second bar (2) shows the predicted recombination including breakpoint intervals (two-color region). In a) a breakpoint interval around the predicted breakpoint is defined by the region where the posterior probability of the predicted subtypes (green and red) is below t_BPI. In b) the original left end (dotted line) of the breakpoint interval defined by the posterior probabilities of the predicted subtypes is moved to the left (dashed line) to include the predicted breakpoint position.

Regions that are first defined as uncertainty regions (which often happens close to predicted breakpoint positions) and subsequently defined as breakpoint intervals are regarded as breakpoint intervals and not as uncertainty regions. Because uncertainty regions are defined before breakpoint intervals (Workflow 3.1), it is appropriate to choose t_BPI ≤ t_UR. The chosen thresholds are given in chapter 5, Results, section 5.3.
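The precedence of breakpoint intervals over uncertainty regions can be illustrated with a small Python sketch. It assumes, hypothetically, that both kinds of regions are available as lists of 0-based inclusive (start, end) pairs; the function name `label_positions` and the labels are invented for this example and are not part of jpHMM.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # 0-based, inclusive (start, end); an assumed representation


def label_positions(
    length: int,
    uncertainty_regions: List[Region],
    breakpoint_intervals: List[Region],
) -> List[str]:
    """Label every query position; breakpoint intervals override uncertainty regions."""
    labels = ["confident"] * length
    for start, end in uncertainty_regions:
        for i in range(start, end + 1):
            labels[i] = "UR"
    # Applied second, so a position covered by both ends up as "BPI".
    for start, end in breakpoint_intervals:
        for i in range(start, end + 1):
            labels[i] = "BPI"
    return labels


labels = label_positions(8, [(2, 5)], [(4, 6)])
print(labels)
# ['confident', 'confident', 'UR', 'UR', 'BPI', 'BPI', 'BPI', 'confident']
```

Applying the two passes in this order reproduces the rule stated above: positions 4 and 5 lie in both regions and are reported as part of the breakpoint interval.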

Workflow 3.1 (Definition of uncertainty regions and breakpoint intervals)

Let S = s_1, . . . , s_l be a query sequence. Let 𝒮 be the set of subtypes and st = st[1, l], st_i ∈ 𝒮, i ∈ [1, l], the predicted sequence of subtypes for S. A recombination breakpoint is usually located between two successive query sequence positions, e.g. i and i + 1, which is notated as i/i + 1. Here, a breakpoint b_j describes the breakpoint b_j/b_j + 1, i.e. b_j is the position to the left of the breakpoint. Let B = {b_1, . . . , b_k} be the set of all predicted recombination breakpoints in S.

1. Definition of uncertainty regions

2. Definition of breakpoint intervals:

for each breakpoint b_j/b_j + 1 define the surrounding breakpoint interval:

(a) definition of the left boundary of the breakpoint interval:

let st[b_j] be the subtype predicted to the left of the breakpoint.

Define the position i_left, b_{j−1} < i_left ≤ b_j, in the query sequence where the posterior probability of st[b_j] reaches the threshold t_BPI,

i.e. P_post,i_left(st[b_j]) ≥ t_BPI, decreasing i_left and starting with i_left = b_j.

(b) definition of the right boundary of the breakpoint interval:

let st[b_j + 1] be the subtype predicted to the right of the breakpoint.

Define the position i_right, b_j < i_right ≤ b_{j+1}, in the query sequence where the posterior probability of st[b_j + 1] reaches the threshold t_BPI,

i.e. P_post,i_right(st[b_j + 1]) ≥ t_BPI, increasing i_right and starting with i_right = b_j + 1.

IF (one of these positions i_left or i_right cannot be found)

the region remains defined as uncertainty region.

ELSE

check the posterior probabilities of all other subtypes within [i_left, i_right]:

IF (a subtype S, S ≠ st[b_j] and S ≠ st[b_j + 1], exists that has a higher posterior probability than the two predicted subtypes, i.e. P_post,i(S) ≥ P_post,i(st[b_j]) and P_post,i(S) ≥ P_post,i(st[b_j + 1]))

the region is also defined as uncertainty region.
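Step 2 of the workflow, as far as it is given above, can be sketched in Python as follows. This is a minimal illustration and not the jpHMM implementation. The assumptions are: positions are 0-based (the workflow uses 1-based positions), posterior probabilities are stored as one dictionary per position, the third-subtype condition is taken to hold as soon as it is met at any single position i in [i_left, i_right], and all names (`breakpoints`, `breakpoint_interval`, the example values) are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Posterior = List[Dict[str, float]]  # assumed layout: subtype -> posterior, per position


def breakpoints(predicted: List[str]) -> List[int]:
    """0-based positions b such that the predicted subtype changes between b and b + 1."""
    return [i for i in range(len(predicted) - 1) if predicted[i] != predicted[i + 1]]


def breakpoint_interval(
    post: Posterior, predicted: List[str], bps: List[int], j: int, t_bpi: float
) -> Optional[Tuple[int, int]]:
    """Return (i_left, i_right) for breakpoint bps[j], or None if the region
    remains an uncertainty region."""
    b = bps[j]
    left_sub, right_sub = predicted[b], predicted[b + 1]
    # Search limits: b_{j-1} < i_left and i_right <= b_{j+1}; the sequence
    # ends are used when the neighbouring breakpoint does not exist.
    lo = bps[j - 1] + 1 if j > 0 else 0
    hi = bps[j + 1] if j + 1 < len(bps) else len(predicted) - 1

    # (a) scan leftwards from b until the left subtype reaches t_bpi
    i_left = next((i for i in range(b, lo - 1, -1) if post[i][left_sub] >= t_bpi), None)
    # (b) scan rightwards from b + 1 until the right subtype reaches t_bpi
    i_right = next((i for i in range(b + 1, hi + 1) if post[i][right_sub] >= t_bpi), None)
    if i_left is None or i_right is None:
        return None

    # A third subtype that is at least as probable as both predicted subtypes
    # at some position in the interval keeps the region uncertain.
    for i in range(i_left, i_right + 1):
        for sub, p in post[i].items():
            if sub in (left_sub, right_sub):
                continue
            if p >= post[i][left_sub] and p >= post[i][right_sub]:
                return None
    return (i_left, i_right)


# Invented example: subtype A on the left, B on the right, C never supported.
predicted = list("AAAABBBB")
p_a = [0.99, 0.97, 0.80, 0.60, 0.40, 0.20, 0.03, 0.01]
post = [{"A": a, "B": 1.0 - a, "C": 0.0} for a in p_a]
bps = breakpoints(predicted)
print(bps)                                                       # [3]
print(breakpoint_interval(post, predicted, bps, 0, t_bpi=0.95))  # (1, 6)
```

Because the scans start at b_j and b_j + 1, the resulting interval always encloses the predicted breakpoint, which corresponds to the extension shown in Figure 3.2 b.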

From the document: Improvement of the jpHMM approach to recombination detection in viral genomes and its application to HIV and HBV (pages 47-51)