
values is possible yet computationally expensive. It is possible to limit this overhead by bounding $t_N$, e.g., by the deadline.

6.5 Evaluations of theoretical model

This section illustrates the relative importance of the various parameters for the responsiveness of a service, using the results developed in Section 6.4 (further examples can be found in [138, 139]). The parameter space is seven-dimensional: $t_S$, $t_C$, $t_R$, $\lambda$, $n$, $d$, and $p_{cov}$. With this many parameters, it is difficult to give a complete overview of the behavior. Therefore, only a few selected figures are presented.

An important motivation for using responsiveness as the optimization metric in the analysis is dealing with real-time services that have a deadline. In particular, optimizing the checkpointing interval so as to minimize the mean execution time is a poor choice for real-time services. For the runtime distribution shown in Figure 6.3 (with parameters $t_S = 10$, $t_C = 2$, $t_R = 1$, $\lambda = 0.1$, $p_{cov} = 1$), $n = 3$ minimizes the mean execution time (to 30.8). While $n = 5$ has a mean of 33.6, at a deadline of, e.g., $d = 70$, $P(X_3 \le 70) = 0.984290$ and $P(X_5 \le 70) = 0.995138$, which is over one percentage point better. To obtain a given responsiveness, the deadline must be much longer with $n = 3$ than with $n = 5$: ensuring a responsiveness of $0.99999$ requires a deadline of $142.67$ for $n = 3$, while with $n = 5$ this responsiveness is already achieved at $d = 115$.
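The comparison above can be explored numerically. The following is a minimal sketch, not the derivation of Section 6.4 (which is not reproduced here). It assumes that the service is split into $n$ equal blocks of length $t_S/n + t_C$, that a fault is only noticed by the acceptance test at the end of a block, that a failed block costs a recovery step of length $t_R$ followed by a full re-execution, that faults arrive with rate $\lambda$ and can also hit the recovery step, and that $p_{cov} = 1$. All function names are hypothetical. Under these assumptions the sketch appears to reproduce the means and probabilities quoted above to the printed precision.

```python
import math


def block_retry_pmf(block_len, t_R, lam, k_max):
    """Probability that one block needs j = 0..k_max retries (assumed model)."""
    p_first = math.exp(-lam * block_len)           # first attempt is fault-free
    p_retry = math.exp(-lam * (block_len + t_R))   # recovery plus re-execution is fault-free
    pmf = [p_first]
    for j in range(1, k_max + 1):
        pmf.append((1.0 - p_first) * (1.0 - p_retry) ** (j - 1) * p_retry)
    return pmf


def responsiveness(n, d, t_S, t_C, t_R, lam):
    """Pr(X_n <= d) under the assumptions stated in the lead-in (p_cov = 1)."""
    block_len = t_S / n + t_C
    slack = d - n * block_len
    if slack < 0:
        return 0.0  # even a fault-free run misses the deadline
    # Largest total number of retries that still fits before the deadline.
    k_max = int(math.floor(slack / (block_len + t_R) + 1e-9))
    block = block_retry_pmf(block_len, t_R, lam, k_max)
    # Convolve the per-block retry distribution n times, truncated at k_max.
    total = [1.0] + [0.0] * k_max
    for _ in range(n):
        nxt = [0.0] * (k_max + 1)
        for i, p_i in enumerate(total):
            for j in range(k_max + 1 - i):
                nxt[i + j] += p_i * block[j]
        total = nxt
    return sum(total)


def mean_time(n, t_S, t_C, t_R, lam):
    """Mean completion time under the same assumed model."""
    block_len = t_S / n + t_C
    p_first = math.exp(-lam * block_len)
    p_retry = math.exp(-lam * (block_len + t_R))
    return n * (block_len + (1.0 - p_first) * (block_len + t_R) / p_retry)


if __name__ == "__main__":
    for n in (3, 5):
        print(n, mean_time(n, 10, 2, 1, 0.1), responsiveness(n, 70, 10, 2, 1, 0.1))
```

Because the completion time in this model only takes values on a grid (the fault-free time plus a whole number of retries), the responsiveness is a step function of $d$, which is one way to read the staircase-like curves in Figure 6.3.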

[Plot: probability (0.99 to 1) over deadline $d$ (80 to 120); one curve each for $n = 2$, $n = 3$, $n = 4$, $n = 5$.]

Figure 6.3: Completion time distributions $\Pr(X_n \le d)$ shown over deadline $d$ for various numbers of checkpoints $n$. Other parameters: $t_S = 10$, $t_C = 2$, $t_R = 1$, $\lambda = 0.1$, $p_{cov} = 1$.

Figure 6.4 shows the number of checkpoints $n$ that maximizes responsiveness for increasing deadlines and three different service times. Note the characteristic shape of these curves: As the deadline increases, the best possible $n$ increases in a roughly sawtooth-like pattern. An increase occurs when it is possible to execute an additional recovery step (before the deadline would expire). This is possible first for larger $n$, but if two different $n$ permit the same number of recovery steps for a given deadline, the smaller $n$ is preferable since it incurs smaller overhead and therefore a smaller chance of being hit by a fault, resulting in this characteristic sawtooth shape for small $n$. For large $n$, this effect is less pronounced since the difference in block length is sufficiently small to smooth out these bumps (when $n$ starts growing rapidly, the responsiveness is already very close to 1). A simple rule of thumb is therefore to take the smallest $n$ that allows the maximum number of recovery steps to be taken before the deadline would expire.

[Plot: optimal $n$ (0 to 60) over deadline $d$ (0 to 300); one curve each for $t_S = 10$, $t_S = 50$, $t_S = 100$.]

Figure 6.4: Number of checkpoints $n$ maximizing responsiveness shown over deadline $d$ for $t_S = 10$, $t_S = 50$, $t_S = 100$. Other parameters: $t_C = 2$, $t_R = 1$, $\lambda = 0.01$, $p_{cov} = 1$.
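The rule of thumb can be written down in a few lines. This is a sketch under an assumed cost model, not a statement of the analysis in Section 6.4: a fault-free run is taken to cost $n\,(t_S/n + t_C)$, and each recovery step is taken to cost $t_R$ plus one re-executed block, $t_S/n + t_C$. The function names and the search bound `n_max` are hypothetical.

```python
import math


def recovery_steps(n, d, t_S, t_C, t_R):
    """Number of recovery steps that fit before deadline d (assumed cost model)."""
    block_len = t_S / n + t_C
    slack = d - n * block_len
    if slack < 0:
        return -1  # not even a fault-free run fits before the deadline
    return int(math.floor(slack / (block_len + t_R) + 1e-9))


def rule_of_thumb_n(d, t_S, t_C, t_R, n_max=1000):
    """Smallest n that permits the maximum number of recovery steps."""
    best_n, best_steps = None, -1
    for n in range(1, n_max + 1):
        steps = recovery_steps(n, d, t_S, t_C, t_R)
        if steps > best_steps:  # strict ">" keeps the smallest n among ties
            best_n, best_steps = n, steps
    return best_n, best_steps


if __name__ == "__main__":
    # Parameters of Figure 6.3 at deadline d = 70.
    print(rule_of_thumb_n(70, 10, 2, 1))
```

With the parameters of Figure 6.3 and $d = 70$, this rule selects $n = 5$, which agrees with the observation that $n = 5$ is more responsive than $n = 3$ at that deadline.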

The impact of the coverage probability is shown in Figure 6.5. Here, $p_{cov}$ is set to $0.6$. A low coverage probability implies a large risk of not detecting a fault during the acceptance test and terminating with an invalid result. Therefore, there are two antagonistic tendencies: a larger number of checkpoints is preferable to avoid losing much work when recovery becomes necessary, whereas a smaller number is better to reduce the chances of not detecting an error that is actually present (owing to the imperfect fault detection).

This tradeoff is further complicated by the fact that an acceptance test also influences the checkpointing time. To improve an acceptance test's coverage probability, it has to execute more tests, which makes it more complicated and makes it take longer. Hence there is yet another tradeoff for the length of the checkpointing interval.
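One way to experiment with these opposing tendencies is sketched below. It is an illustration under loudly stated assumptions, not the model of Section 6.4: blocks have length $t_S/n + t_C$, faults arrive with rate $\lambda$, each faulty block is caught by the acceptance test with probability $p_{cov}$ and then repeated after a recovery step of length $t_R$, and a faulty block that slips through makes the whole run count as not responsive. The function name `responsiveness_cov` is hypothetical.

```python
import math


def responsiveness_cov(n, d, t_S, t_C, t_R, lam, p_cov):
    """Probability of a correct result by deadline d (assumed model)."""
    block_len = t_S / n + t_C
    slack = d - n * block_len
    if slack < 0:
        return 0.0
    k_max = int(math.floor(slack / (block_len + t_R) + 1e-9))
    p_first = math.exp(-lam * block_len)           # first attempt fault-free
    p_retry = math.exp(-lam * (block_len + t_R))   # recovery + re-execution fault-free
    # Sub-stochastic pmf: j detected failures, then a fault-free attempt.
    # The missing mass corresponds to undetected faults (invalid result).
    block = [p_first]
    for j in range(1, k_max + 1):
        detected_first = (1.0 - p_first) * p_cov
        detected_again = ((1.0 - p_retry) * p_cov) ** (j - 1)
        block.append(detected_first * detected_again * p_retry)
    # Convolve over the n blocks, truncated at the retry budget k_max.
    total = [1.0] + [0.0] * k_max
    for _ in range(n):
        nxt = [0.0] * (k_max + 1)
        for i, p_i in enumerate(total):
            for j in range(k_max + 1 - i):
                nxt[i + j] += p_i * block[j]
        total = nxt
    return sum(total)


if __name__ == "__main__":
    # Parameters of Figure 6.5; compare a few values of n at one deadline.
    for n in (1, 2, 4, 8):
        print(n, responsiveness_cov(n, 150, 50, 2, 1, 0.01, 0.6))
```

In this sketch the responsiveness no longer approaches 1 as the deadline grows, because some probability mass is lost to undetected faults regardless of how much slack is available.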

Figure 6.6 shows the optimal number of checkpoints for a few different combinations of checkpointing time and coverage probability (an acceptance test can often take up the majority of the time of writing a checkpoint).

The examples so far have used carefully selected, but perhaps slightly unrealistic parameter values to highlight some salient points of the checkpointing analysis. Varying the mean lifetime $1/\lambda$ and the service execution time $t_S$, Tables 6.1 and 6.2 on page 86 show the optimal $n$ and the corresponding responsiveness for a somewhat more realistic example: Consider a large, long-running program that has a considerable amount of state, so that checkpointing takes $t_C = 30$ s and a recovery step takes $t_R = 10$ s. The acceptance test in the imaginary program is non-trivial, so that $p_{cov} = 0.999$. To be able to recover from faults, the deadline has to allow some slack, and consequently the deadline in this example is assumed to be fifty percent longer than the service execution time: $d = 1.5\,t_S$. Service execution times of one hour, ten hours, and one hundred hours are considered, combined with mean lifetimes of ten, one hundred, one thousand, and ten thousand hours. The results are as expected: with decreasing fault rate (increasing mean lifetime), the responsiveness becomes better and the optimal choice for $n$ decreases.

The behavior with probabilistically described services as discussed in Section 6.4.2 is in principle similar, yet more complicated in detail. As an example, checkpointing with the same parameters as in Figure 6.4 was considered; however, the service execution time is here one of $10, 11, \ldots, 19$ time units, each occurring with $10\%$ probability. Figure 6.7 shows the responsiveness for this service with three different deadlines ($d = 40$, $d = 50$, $d = 60$) when the checkpointing interval $t_N$ is varied.
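For a service time that takes one of ten equally likely values, the responsiveness can be read as a mixture over those values by the law of total probability. The notation below is an assumption made for illustration: $X$ denotes the completion time for a given checkpointing interval $t_N$, and the conditional term stands for the fixed-service-time distribution of Section 6.4.2, which is not reproduced here.

```latex
\Pr(X \le d)
  \;=\; \sum_{s=10}^{19} \Pr(t_S = s)\,\Pr(X \le d \mid t_S = s)
  \;=\; \frac{1}{10} \sum_{s=10}^{19} \Pr(X \le d \mid t_S = s)
```

Each curve in Figure 6.7 would then correspond to evaluating this sum for one deadline $d$ while $t_N$ is varied.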

The optimal checkpointing interval when the deadline is varied is shown in Figure 6.8 (for the same


[Plot: probability (0.5 to 0.8) over deadline $d$ (50 to 250); one curve each for $n = 1$, $n = 2$, $n = 4$, $n = 8$.]

Figure 6.5: Completion time distribution $\Pr(X_n \le d)$ shown over deadline $d$ for various numbers of checkpoints $n$ with coverage probability $p_{cov} = 0.6$. Other parameters: $t_S = 50$, $t_C = 2$, $t_R = 1$, $\lambda = 0.01$.

[Plot: optimal $n$ (0 to 9) over deadline $d$ (50 to 250); one curve each for ($p_{cov} = 0.9$, $t_C = 2$), ($p_{cov} = 0.95$, $t_C = 4$), ($p_{cov} = 0.99$, $t_C = 8$).]

Figure 6.6: Number of checkpoints $n$ maximizing responsiveness shown over deadline $d$ for different $(p_{cov}, t_C)$ combinations. Other parameters: $t_S = 50$, $t_R = 1$, $\lambda = 0.01$.

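| $t_S$ in s | $1/\lambda$ = 36,000 s (10 hr) | $1/\lambda$ = 360,000 s (100 hr) | $1/\lambda$ = 3,600,000 s (1,000 hr) | $1/\lambda$ = 36,000,000 s (10,000 hr) |
|---|---|---|---|---|
| 3,600 (1 hr) | 11 | 8 | 5 | 5 |
| 36,000 (10 hr) | 25 | 11 | 7 | 5 |
| 360,000 (100 hr) | 248 | 78 | 25 | 9 |

Table 6.1: Optimal number of checkpoints for varying mean lifetime $1/\lambda$ and service time $t_S$; other parameters: $t_C = 30$ s, $t_R = 10$ s, $d = 1.5\,t_S$, $p_{cov} = 0.999$.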

| $t_S$ in s | $1/\lambda$ = 36,000 s (10 hr) | $1/\lambda$ = 360,000 s (100 hr) | $1/\lambda$ = 3,600,000 s (1,000 hr) | $1/\lambda$ = 36,000,000 s (10,000 hr) |
|---|---|---|---|---|
| 3,600 (1 hr) | 0.999890 | 0.999989 | 0.999999 | >0.999999 |
| 36,000 (10 hr) | 0.998958 | 0.999899 | 0.999990 | 0.999999 |
| 360,000 (100 hr) | 0.989632 | 0.998987 | 0.999900 | 0.999990 |

Table 6.2: Responsiveness (corresponding to optimal number of checkpoints shown in Table 6.1) for varying mean lifetime $1/\lambda$ and service time $t_S$; other parameters: $t_C = 30$ s, $t_R = 10$ s, $d = 1.5\,t_S$, $p_{cov} = 0.999$.

[Plot: probability (0.95 to 1) over checkpointing interval (2 to 14); one curve each for $d = 40$, $d = 50$, $d = 60$.]

Figure 6.7: Responsiveness shown over checkpointing interval $t_N$ for three different deadlines $d$. Other parameters: $t_S$ is one of $10, 11, \ldots, 19$ with equal probability, $t_C = 2$, $t_R = 1$, $\lambda = 0.01$, $p_{cov} = 1$.