Evaluations of theoretical model - Responsive Execution of Parallel Programs in Distributed Com

values is possible yet computationally expensive. It is possible to limit this overhead by bounding $t_N$, e.g., by the deadline.

6.5 Evaluations of theoretical model

This section illustrates the relative importance of the various parameters for the responsiveness of a service, using the results developed in Section 6.4 (further examples can be found in [138, 139]). The parameter space is seven-dimensional: service execution time $t_S$, checkpointing time $t_C$, recovery time $t_R$, fault rate $\lambda$, number of checkpoints $n$, deadline $d$, and coverage probability $p_{\mathrm{cov}}$. With this many parameters, a complete overview of the behavior is difficult to give. Therefore, only a few selected figures are presented.

An important motivation for using responsiveness as the optimization metric in the analysis is dealing with real-time services that have a deadline. In particular, optimizing the checkpointing interval so as to minimize the mean execution time is a poor choice for real-time services. For the runtime distribution shown in Figure 6.3 (with parameters $t_S = 10$, $t_C = 2$, $t_R = 1$, fault rate $\lambda = 0.1$, $p_{\mathrm{cov}} = 1$), $n = 3$ minimizes the mean execution time (to 30.8). While $n = 5$ has a mean of 33.6, at a deadline of, e.g., $d = 70$, $P(X_3 \le 70) = 0.984290$ and $P(X_5 \le 70) = 0.995138$, which is more than one percentage point better. To obtain a given responsiveness, the deadline must be much longer with $n = 3$ than with $n = 5$: ensuring a responsiveness of 0.99999 requires a deadline of 142.67 for $n = 3$, while with $n = 5$ this responsiveness is already achieved at $d = 115$.
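The comparison above can be checked numerically. The following is a minimal sketch, not the derivation of Section 6.4 (which is not reproduced here). It assumes a simple block model with perfect coverage ($p_{\mathrm{cov}} = 1$): the service is split into $n$ blocks of length $t_S/n + t_C$, a fault is noticed at the end of the block in which it occurs, and each retry costs $t_R$ plus a repeated block, during which faults can strike again. All function names are illustrative.

```python
import math

def block_params(t_s, t_c, t_r, lam, n):
    """Lengths and success probabilities under the assumed block model."""
    a = t_s / n + t_c          # first attempt: one block of work plus its checkpoint
    b = a + t_r                # retry: recovery step plus the repeated block
    p = math.exp(-lam * a)     # P(no fault during a first attempt)
    q = math.exp(-lam * b)     # P(no fault during a retry)
    return a, b, p, q

def mean_time(t_s, t_c, t_r, lam, n):
    """Expected completion time: a block needs (1 - p) / q retries on average."""
    a, b, p, q = block_params(t_s, t_c, t_r, lam, n)
    return n * (a + (1 - p) / q * b)

def retry_cdf(t_s, t_c, t_r, lam, n, k_max):
    """cdf[k] = P(at most k retries in total over all n blocks), k = 0..k_max."""
    _, _, p, q = block_params(t_s, t_c, t_r, lam, n)
    # Retries of one block: 0 with probability p, otherwise geometric in q.
    block = [p] + [(1 - p) * q * (1 - q) ** (k - 1) for k in range(1, k_max + 1)]
    total = [1.0] + [0.0] * k_max              # pmf of the sum over zero blocks
    for _ in range(n):                         # convolve in one block at a time
        total = [sum(total[i] * block[k - i] for i in range(k + 1))
                 for k in range(k_max + 1)]
    cdf, acc = [], 0.0
    for prob in total:
        acc += prob
        cdf.append(acc)
    return cdf

def responsiveness(t_s, t_c, t_r, lam, n, d):
    """P(X_n <= d): probability that all n blocks finish by the deadline."""
    a, b, _, _ = block_params(t_s, t_c, t_r, lam, n)
    if d < n * a - 1e-9:
        return 0.0                             # not even a fault-free run fits
    k = int((d - n * a) / b + 1e-9)            # number of retries that fit before d
    return retry_cdf(t_s, t_c, t_r, lam, n, k)[k]

def min_deadline(t_s, t_c, t_r, lam, n, target, k_limit=200):
    """Smallest deadline whose responsiveness reaches `target`."""
    a, b, _, _ = block_params(t_s, t_c, t_r, lam, n)
    for k, value in enumerate(retry_cdf(t_s, t_c, t_r, lam, n, k_limit)):
        if value >= target:
            return n * a + k * b
    raise ValueError("target not reached within k_limit retries")
```

With the parameters of Figure 6.3, `mean_time(10, 2, 1, 0.1, 3)` and `mean_time(10, 2, 1, 0.1, 5)` evaluate to about 30.8 and 33.6, and `responsiveness(10, 2, 1, 0.1, n, 70)` gives about 0.98429 for `n=3` and 0.99514 for `n=5`, in line with the values quoted above. Under this model the completion time only takes the discrete values $n(t_S/n + t_C) + k(t_S/n + t_C + t_R)$, which is why the required deadlines (142.67 and 115) fall exactly on such steps.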

Figure 6.3: Completion time distributions $\Pr(X_n \le d)$ shown over deadline $d$ for various numbers of checkpoints $n$ (curves for $n = 2, 3, 4, 5$; x-axis: deadline $d$, y-axis: probability). Other parameters: $t_S = 10$, $t_C = 2$, $t_R = 1$, $\lambda = 0.1$, $p_{\mathrm{cov}} = 1$.

Figure 6.4 shows the number of checkpoints $n$ that maximizes responsiveness for increasing deadlines and three different service times. Note the characteristic shape of these curves: as the deadline increases, the best possible $n$ increases in a roughly sawtooth-like pattern. An increase occurs when it becomes possible to execute an additional recovery step before the deadline expires. This is possible first for larger $n$, but if two different $n$ permit the same number of recovery steps for a given deadline, the smaller $n$ is preferable since it incurs less overhead and therefore a smaller chance of being hit by a fault, which results in the characteristic sawtooth shape for small $n$. For large $n$, this effect is less pronounced since the difference in block length is sufficiently small to smooth out these bumps (when $n$ starts growing rapidly, the responsiveness is already very close to 1). A simple rule of thumb is therefore to take the smallest $n$ that allows the maximum number of recovery steps to be taken before the deadline expires.

Figure 6.4: Number of checkpoints $n$ maximizing responsiveness shown over deadline $d$ for $t_S = 10$, $t_S = 50$, and $t_S = 100$ (x-axis: deadline $d$, y-axis: optimal $n$). Other parameters: $t_C = 2$, $t_R = 1$, $\lambda = 0.01$, $p_{\mathrm{cov}} = 1$.
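The rule of thumb can be written down in a few lines. This is a sketch under an assumed timing model, not taken from Section 6.4: a fault-free run takes $t_S + n\,t_C$, and each recovery step costs $t_R$ plus one repeated block of length $t_S/n + t_C$. The function names are hypothetical. The rule does not use the fault rate at all; it only counts how many recovery steps fit before the deadline.

```python
def max_recovery_steps(t_s, t_c, t_r, n, d):
    """Recovery steps that still fit before deadline d (-1 if no run fits)."""
    fault_free = t_s + n * t_c           # n blocks, each followed by a checkpoint
    retry = t_s / n + t_c + t_r          # recovery step plus one repeated block
    if fault_free > d + 1e-9:
        return -1                        # even a fault-free run misses the deadline
    return int((d - fault_free) / retry + 1e-9)

def rule_of_thumb_n(t_s, t_c, t_r, d, n_max=100):
    """Smallest n that permits the maximum number of recovery steps."""
    steps = {n: max_recovery_steps(t_s, t_c, t_r, n, d)
             for n in range(1, n_max + 1)}
    best = max(steps.values())
    if best < 0:
        return None                      # the deadline cannot be met for any n
    return min(n for n, k in steps.items() if k == best)
```

For example, with $t_S = 10$, $t_C = 2$, $t_R = 1$, and $d = 30$, the values $n = 2, \dots, 5$ all leave room for two recovery steps, so the rule picks $n = 2$. Whether this choice coincides with the exact optimum for a given fault rate would have to be checked against the full responsiveness computation.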

The impact of the coverage probability is shown in Figure 6.5. Here, $p_{\mathrm{cov}}$ is set to 0.6. A low coverage probability implies a large risk of not detecting a fault during the acceptance test and terminating with an invalid result. Therefore, there are two antagonistic tendencies: a larger number of checkpoints is preferable to avoid losing much work when recovery becomes necessary, whereas a smaller number is better to reduce the chances of not detecting an error that is actually present (owing to the imperfect fault detection).

This tradeoff is further complicated by the fact that an acceptance test also influences the checkpointing time. Namely, to improve an acceptance test’s coverage probability, it has to execute more tests and becomes more complicated and longer. Hence there is yet another tradeoff for the length of the checkpointing interval.

Figure 6.6 shows the optimal number of checkpoints for a few different combinations of checkpointing time and coverage probability (an acceptance test can often take up the majority of the time of writing a checkpoint).

The examples so far have used carefully selected, but perhaps slightly unrealistic, parameter values to highlight some salient points of the checkpointing analysis. For a somewhat more realistic example, Tables 6.1 and 6.2 show the optimal $n$ and the corresponding responsiveness for varying mean lifetime $1/\lambda$ and service execution time $t_S$. Consider a large, long-running program that has a considerable amount of state, so that checkpointing takes $t_C = 30$ s and a recovery step takes $t_R = 10$ s. The acceptance test in this imaginary program is non-trivial, so that $p_{\mathrm{cov}} = 0.999$. To be able to recover from faults, the deadline has to allow some slack; the deadline in this example is therefore assumed to be fifty percent longer than the service execution time: $d = 1.5\,t_S$. Service execution times of one hour, ten hours, and one hundred hours are considered, combined with mean lifetimes of ten, one hundred, one thousand, and ten thousand hours. The results are as expected: with decreasing fault rate (increasing mean lifetime), the responsiveness improves and the optimal choice for $n$ decreases.

The behavior with probabilistically described services as discussed in Section 6.4.2 is in principle similar, yet more complicated in detail. As an example, checkpointing with the same parameters as in Figure 6.4 was considered; here, however, the service execution time is one of $10, 11, \ldots, 19$ time units, each occurring with 10% probability. Figure 6.7 shows the responsiveness for this service with three different deadlines ($d = 40$, $d = 50$, $d = 60$) when the checkpointing interval $t_N$ is varied.
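Because the ten service times are equally likely, the responsiveness of this service is a plain average over the deterministic cases (law of total probability). As a sketch, with $X(s, t_N)$ denoting the completion time of a service of length $s$ that is checkpointed every $t_N$ time units (this notation is introduced here for illustration and is not taken from Section 6.4.2):

```latex
\[
  \Pr(X \le d) \;=\; \sum_{s=10}^{19} \frac{1}{10}\,\Pr\bigl(X(s, t_N) \le d\bigr)
\]
```

Each term is a deterministic-service responsiveness of the kind discussed earlier in this section, so one interval $t_N$ has to serve all ten service lengths at once.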

The optimal checkpointing interval when the deadline is varied is shown in Figure 6.8 (for the same


Figure 6.5: Completion time distribution $\Pr(X_n \le d)$ shown over deadline $d$ for various numbers of checkpoints $n$ with coverage probability $p_{\mathrm{cov}} = 0.6$ (curves for $n = 1, 2, 4, 8$; x-axis: deadline $d$, y-axis: probability). Other parameters: $t_S = 50$, $t_C = 2$, $t_R = 1$, $\lambda = 0.01$.

Figure 6.6: Number of checkpoints $n$ maximizing responsiveness shown over deadline $d$ for different $(p_{\mathrm{cov}}, t_C)$ combinations: $(0.9, 2)$, $(0.95, 4)$, and $(0.99, 8)$ (x-axis: deadline $d$, y-axis: optimal $n$). Other parameters: $t_S = 50$, $t_R = 1$, $\lambda = 0.01$.

| $t_S$ in s \ mean lifetime $1/\lambda$ in s | 36,000 (10 hr) | 360,000 (100 hr) | 3,600,000 (1,000 hr) | 36,000,000 (10,000 hr) |
|---|---|---|---|---|
| 3,600 (1 hr) | 11 | 8 | 5 | 5 |
| 36,000 (10 hr) | 25 | 11 | 7 | 5 |
| 360,000 (100 hr) | 248 | 78 | 25 | 9 |

Table 6.1: Optimal number of checkpoints for varying mean lifetime $1/\lambda$ and service time $t_S$; other parameters: $t_C = 30$ s, $t_R = 10$ s, $d = 1.5\,t_S$, $p_{\mathrm{cov}} = 0.999$.

| $t_S$ in s \ mean lifetime $1/\lambda$ in s | 36,000 (10 hr) | 360,000 (100 hr) | 3,600,000 (1,000 hr) | 36,000,000 (10,000 hr) |
|---|---|---|---|---|
| 3,600 (1 hr) | 0.999890 | 0.999989 | 0.999999 | > 0.999999 |
| 36,000 (10 hr) | 0.998958 | 0.999899 | 0.999990 | 0.999999 |
| 360,000 (100 hr) | 0.989632 | 0.998987 | 0.999900 | 0.999990 |

Table 6.2: Responsiveness (corresponding to the optimal number of checkpoints shown in Table 6.1) for varying mean lifetime $1/\lambda$ and service time $t_S$; other parameters: $t_C = 30$ s, $t_R = 10$ s, $d = 1.5\,t_S$, $p_{\mathrm{cov}} = 0.999$.

Figure 6.7: Responsiveness shown over checkpointing interval $t_N$ for three different deadlines $d$ (curves for $d = 40$, $d = 50$, $d = 60$; x-axis: checkpointing interval, y-axis: probability). Other parameters: $t_S$ is one of $10, 11, \ldots, 19$ with equal probability, $t_C = 2$, $t_R = 1$, $\lambda = 0.01$, $p_{\mathrm{cov}} = 1$.

From the document: Responsive Execution of Parallel Programs in Distributed Computing Environments (pages 103-107)