A simple special case - Responsive Execution of Parallel Programs in Distributed Computing Envi

5.4 Analysis

5.4.1 A simple special case

Consider the case of $n = 3$ routines, executing on $m = 2$ worker machines. We will attempt to compute $\Pr(Z \le t)$, where $Z$ is a random variable denoting the time of successful completion of a parallel step under eager scheduling.

The eager scheduling algorithm starts by placing Routine 1, which has runtime $x_1$ with probability $f_{I_1}(x_1)$, on Machine 1. This routine will finish, assuming Machine 1 does not fail first, at some unknown time $x_1$. The scheduler will also assign Routine 2 (having runtime $x_2$ with probability $f_{I_2}(x_2)$) to Machine 2, where it will finish at time $x_2/c_2$ (again assuming that no fault occurs on Machine 2).

Now two cases must be distinguished, depending on whether $x_1$ is smaller or larger than $x_2/c_2$ (an overview of the following case distinction is shown in Figure 5.1):²

1. $x_1 < x_2/c_2$
2. $x_1 > x_2/c_2$


Figure 5.1: Overview of possible cases for eager scheduling of three routines on two machines (P1, P2). Arrows indicate scheduling steps, grayed boxes eagerly scheduled routines, and crossed-out cases do not appear for $c_2 > c_1$.

After the first routine has finished, the scheduler will assign Routine 3 to the first machine asking for work. Routine 3 has runtime $x_3$ with probability $f_{I_3}(x_3)$. And again, the machine that has been assigned Routine 3 will become idle either before or after the other machine. Hence there are four cases so far:

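1. $x_1 < x_2/c_2$ and $x_1 + x_3 < x_2/c_2$
2. $x_1 < x_2/c_2$ and $x_1 + x_3 > x_2/c_2$
3. $x_1 > x_2/c_2$ and $x_1 > x_2/c_2 + x_3/c_2$
4. $x_1 > x_2/c_2$ and $x_1 < x_2/c_2 + x_3/c_2$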

Note how adding one routine generates an additional inequality that bounds the runtime for the new routine by a linear combination of the runtimes of the previously scheduled routines. The bound is either from above or below, and the corresponding lower or upper bound is either 0 or $\max\{c_1, c_2\}\,t = c_2 t$.

Now the “eager” part of eager scheduling comes into play. The first idle machine will be assigned a non-completed routine, which can be any of the three, but which is uniquely determined by the relative lengths of the three routines—as long as there are no faults. This extends the previous cases as follows:

1. $x_1 < x_2/c_2$ and $x_1 + x_3 < x_2/c_2$, Routine 2 is eagerly scheduled on Machine 1,
2. $x_1 < x_2/c_2$ and $x_1 + x_3 > x_2/c_2$, Routine 3 is eagerly scheduled on Machine 2,
3. $x_1 > x_2/c_2$ and $x_1 > x_2/c_2 + x_3/c_2$, Routine 1 is eagerly scheduled on Machine 2,
4. $x_1 > x_2/c_2$ and $x_1 < x_2/c_2 + x_3/c_2$, Routine 3 is eagerly scheduled on Machine 1.
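The case distinction up to the first eager scheduling step can be sketched as a small function. This is only an illustration under stated assumptions: the name `first_eager_step` is hypothetical, Machine 1 is assumed to run at speed 1 and Machine 2 at speed $c_2$, no faults occur, and ties are ignored.

```python
def first_eager_step(x1, x2, x3, c2):
    """Return (case, routine, machine) for the first eager scheduling step.

    Assumptions (not from the original text): Machine 1 has speed 1,
    Machine 2 has speed c2, no machine fails, and no two finish times tie.
    """
    finish_r2_on_m2 = x2 / c2  # Routine 2 starts at time 0 on Machine 2
    if x1 < finish_r2_on_m2:
        # Machine 1 becomes idle first and takes Routine 3, done at x1 + x3.
        if x1 + x3 < finish_r2_on_m2:
            # Machine 1 is idle again; only Routine 2 is still unfinished.
            return 1, 2, 1
        # Machine 2 becomes idle while Routine 3 still runs on Machine 1.
        return 2, 3, 2
    # Machine 2 becomes idle first and takes Routine 3, done at (x2 + x3) / c2.
    if x1 > finish_r2_on_m2 + x3 / c2:
        # Machine 2 is idle again; only Routine 1 is still unfinished.
        return 3, 1, 2
    # Machine 1 becomes idle while Routine 3 still runs on Machine 2.
    return 4, 3, 1


if __name__ == "__main__":
    print(first_eager_step(3.0, 2.0, 6.0, 2.0))
```

Each return value names the case number from the list above, the routine that is eagerly scheduled, and the machine that receives it.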

It depends on the actual length of the routines and on $c_2$ whether the eagerly scheduled routine terminates before or after its first instance; both cases are possible. Hence we now have eight cases:

1. ^x1

2, Routine 2 is eagerly scheduled on Machine 1, and^x1 +x

2, Routine 1 is eagerly scheduled on Machine 2, and ^x1

2, Routine 3 is eagerly scheduled on Machine 1, and^x1 +x

2(also impossible for^c2

2, Routine 3 is eagerly scheduled on Machine 1, and^x1 +x

² The case of equality ($x_1 = x_2/c_2$) has probability 0 for any distributions with continuous densities, which we have assumed above. In a discrete density case, ties can be broken arbitrarily.


In case a fault occurs, the other machine has to complete all unfinished routines. So conceptually, the above schedules can be extended by appending all routines that have not been scheduled on a given machine to this machine. The schedule finishes when all routines have been completed. Since a processor can (potentially) be assigned all three routines, it can fail during the execution of the first, second, or third routine, or only fail after all three routines have been completed. The number of routines that a given machine $j$ survives is denoted by $s_j$. Of course, at least one processor must survive until the schedule is completed. Hence, for each of the eight cases shown above, there are a number of subcases that enumerate the possible fault combinations and determine their respective termination times. Also note that after the first eager scheduling step has occurred, the relative execution times of tasks on the two machines are of no consequence, since these redundant assignments are only executed if one machine has failed.
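The effect of a fault scenario on the termination time can be sketched as follows. This is a simplified model under assumptions that are not taken from the original: the function name `termination_time` and its parameters are hypothetical, each machine works through its extended schedule back to back, and a machine that survives $s_j$ routines completes exactly the first $s_j$ entries of its schedule.

```python
import math


def termination_time(orders, speeds, runtimes, survives):
    """Time at which every routine has been completed by some machine.

    orders[j]   -- routine ids in the order machine j works through them
    speeds[j]   -- speed factor of machine j (assumed: runtime / speed)
    runtimes[r] -- runtime of routine r on a machine of speed 1
    survives[j] -- number of routines machine j completes before failing
    Returns math.inf if some routine is never completed.
    """
    first_done = {}
    for order, speed, s in zip(orders, speeds, survives):
        clock = 0.0
        for routine in order[:s]:
            clock += runtimes[routine] / speed
            # Keep the earliest completion of this routine on any machine.
            first_done[routine] = min(first_done.get(routine, math.inf), clock)
    if len(first_done) < len(runtimes):
        return math.inf
    return max(first_done.values())


if __name__ == "__main__":
    # Extended schedules: Machine 1 runs 1, 3, 2 and Machine 2 runs 2, 3, 1.
    orders = [(1, 3, 2), (2, 3, 1)]
    speeds = (1.0, 2.0)
    runtimes = {1: 3.0, 2: 2.0, 3: 6.0}
    for scenario in [(3, 3), (3, 0), (0, 3), (1, 1)]:
        print(scenario, termination_time(orders, speeds, runtimes, scenario))
```

A feasibility test in the spirit of the analysis then reduces to comparing the returned value with $t$; an infinite value marks a fault scenario in which the parallel step does not complete.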

As an example, consider the last of the eight cases shown above. Conceptually, Routine 2 is additionally scheduled on Machine 1, and Routine 1 is added to Machine 2, to compensate for a potential failure of the other machine as early as during its very first routine. Table 5.2 gives an overview of the termination times for this case with all pertaining combinations of faults. For the other cases, the termination times in this table would be different.

Table 5.2: Successful termination times of the various subcases of Case 7. Columns indicate the number $s_1$ of routines that Machine 1 survives, rows indicate $s_2$.

Let us now begin to compute $\Pr(Z \le t)$. We are going to look at all possible combinations of $x_1$, $x_2$, and $x_3$. For such a combination, we compute the probability of its occurrence. Given such a combination, we consider the combinations of fault scenarios $(s_1, s_2)$ that can occur when the routines are executed. Hence, the law of total probability lets us start with a formulation over $\vec{x} = (x_1, x_2, x_3)$ and $\vec{s} = (s_1, s_2)$ in which $S$ is the set of fault combinations, $\Pr(\vec{x}, \vec{s})$ is the probability that a certain fault scenario occurs for a given combination of $x_i$, and $h$ is a function that tests whether the execution of the three routines with the given runtimes succeeds before time $t$ under a given fault scenario $\vec{s}$. Of course, $c_2$ is an implicit parameter of $h$.
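Written out, a formulation consistent with these definitions might look as follows. This is a reconstruction under assumptions rather than a quotation: the three runtimes are taken to be independent with densities $f_{I_1}$, $f_{I_2}$, $f_{I_3}$, and the argument order of $h$ is hypothetical.

```latex
\Pr(Z \le t) =
  \int_0^\infty \int_0^\infty \int_0^\infty
    f_{I_1}(x_1)\, f_{I_2}(x_2)\, f_{I_3}(x_3)
    \sum_{\vec{s} \in S} \Pr(\vec{x}, \vec{s})\; h(\vec{x}, \vec{s}, t)
  \; dx_3 \, dx_2 \, dx_1
```

Here $h(\vec{x}, \vec{s}, t)$ is 1 if the parallel step completes by time $t$ and 0 otherwise, so the inner sum collects the probability of all fault scenarios that still allow timely completion for the given runtimes.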

Since the function $h$ basically requires the implementation of an eager scheduling algorithm, we want to break this down into simpler functions. To do so, we take advantage of the case distinctions introduced above.

The important point to note here is that each of the eight cases defined above determines a subset of routine combinations such that all combinations in this set have the same behavior; namely, their scheduling order on the two given machines is the same. Moreover, as we have noted above, each additional routine introduces one additional inequality that can be used to bound its value, and can be directly used as a limit for the corresponding integral. The other bound is either 0 or infinity, where we will refine infinity as an upper bound later.

The situation is complicated by the inequality introduced by the first eager scheduling step. This inequality cannot be directly mapped onto an integral. However, we can express this inequality by means of the Heaviside function: $a < b \Leftrightarrow H(b - a) = 1$, where $H(x) = 1 \Leftrightarrow x > 0$ and $H(x) = 0$ otherwise.
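How a Heaviside factor turns an inequality into part of an integrand can be illustrated numerically. The sketch below estimates $\Pr(x_1 < x_2/c_2)$ by averaging $H(x_2/c_2 - x_1)$ over random samples. The exponential runtime distribution with rate 1, the function names, and the sample count are assumptions chosen only for illustration; they do not come from the original analysis.

```python
import random


def heaviside(x):
    """H(x) = 1 for x > 0 and 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def estimate_case_probability(c2, samples=100_000, seed=0):
    """Monte Carlo estimate of Pr(x1 < x2 / c2).

    Assumption for illustration only: x1 and x2 are independent and
    exponentially distributed with rate 1.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x1 = rng.expovariate(1.0)
        x2 = rng.expovariate(1.0)
        # The inequality x1 < x2 / c2 enters as a 0/1 factor in the average.
        total += heaviside(x2 / c2 - x1)
    return total / samples


if __name__ == "__main__":
    print(estimate_case_probability(2.0))
```

For this particular choice of distributions the exact value is $1/(1 + c_2)$, which gives a simple check of the estimate.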

This allows us to refine the above expression for $\Pr(Z \le t)$ as follows (for Case 8 as an example):

$\Pr(Z \le t) = \dots$

and similarly for the other seven subcases. As can be seen from Table 5.2, the set of fault scenarios $S$ is just the set $\{(0,3), (1,2), (1,3), (2,1), (2,2), (3,0), (3,1), (3,2), (3,3)\}$; all other fault scenarios do not result in a successful completion of the parallel step.

Instead of using $\infty$ as upper limit, $c_2 t$ is also sufficient, since no routine longer than this has any chance of being completed before $t$, even if it runs alone on the faster of the two workers. The feasibility test function $h$ amounts to comparing the termination time for this case and fault scenario to the actual time $t$. At this point, it is straightforward to write down the complete expression for the runtime distribution for this example. For the complete, rather lengthy, expression, please refer to [134].

From the document Responsive Execution of Parallel Programs in Distributed Computing Environments (pages 74-78)