Managing Idle Workers - Importance of Polling for Coarse-grained Parallelism


4.1.1 Managing Idle Workers

In this section, we introduce a termination detection barrier that neither depends on shared state nor burdens workers with the separate control messages required by distributed memory algorithms. The former is the result of using channel communication; the latter is achieved by collecting information about steal requests. There is no need to inject more messages into the system when steal requests are already passed between workers. Additional data, if needed, can be piggybacked on steal requests.

In addition, forwarding makes sure that workers enter termination detection only when it is likely that no work remains [185].

We take the following approach: one worker, which we call the manager, is assigned the task of determining whether termination has occurred by keeping track of steal requests and counting the number of idle workers. Simply counting every steal request towards the number of idle workers, however, may lead to premature termination detection.

There are two complications that must be dealt with. First, workers may send speculative steal requests, expecting to run out of work but not knowing for sure. Receiving a steal request does not necessarily mean that the thief is idle; it may be that the thief is trying to prefetch some work. Second, work stealing happens between workers, without the manager's knowledge. If the manager were involved in every steal, it could quickly become a bottleneck for scalability, especially with frequent load balancing.

The first problem to address is that of counting only idle workers. To distinguish idle from busy workers, we extend steal requests with a field, status, that indicates whether a worker is working, idle, or registeredIdle by the manager. Initially, a new steal request identifies a worker as working, unless, of course, the worker is already idle when it starts to steal. A worker is idle if it has nothing to do besides handling messages and waiting for tasks.
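The status field described above could be modeled roughly as follows. This is a minimal sketch, not the thesis implementation: the names `Status`, `StealRequest`, and `new_steal_request`, as well as the `has_tasks` parameter, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Status(Enum):
    WORKING = auto()          # thief still has, or may still have, tasks
    IDLE = auto()             # thief has confirmed it has nothing left to do
    REGISTERED_IDLE = auto()  # manager has counted the thief as idle


@dataclass
class StealRequest:
    thief: int                       # ID of the worker that issued the request
    status: Status = Status.WORKING  # default for speculative requests


def new_steal_request(thief: int, has_tasks: bool) -> StealRequest:
    """Create a steal request; mark it idle only if the thief is already out of work."""
    status = Status.WORKING if has_tasks else Status.IDLE
    return StealRequest(thief=thief, status=status)
```

Only the manager would move a request to `REGISTERED_IDLE`; a thief only ever chooses between `WORKING` and `IDLE`.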

Generally speaking, there are two possible outcomes of a steal request: it either succeeds or fails. If it fails repeatedly, say n times, it is returned to allow the thief to change the steal request to idle (if true). This is a point worth emphasizing: if the steal request were not returned, the transition from working to idle could not happen; only the thief itself is in a position to confirm that it has no tasks left.

Figure 4.1 shows the changes to the handling of steal requests. In addition to status, a steal request is extended with another field, failed, that counts how many times a steal request has been forwarded. This count is used to decide when a steal request is returned to the thief (lines 29–31). A returned steal request is either discarded if the worker has tasks (lines 1–5), forwarded again if the worker is still busy but likely to run out of work soon (lines 21–23), or sent to the manager if the worker is idle (lines 18–20).

The manager handles steal requests just like other workers, except for the additional requirement to examine status in order to keep track of idle workers. When the manager runs out of tasks and receives a steal request that points to an idle worker, it changes the steal request to registeredIdle, updates the set of idle workers, and checks for termination, that is, whether every worker is registered as idle (lines 9–12). Afterwards, it selects a new victim for the steal request. Note that, according to Figure 4.1, the manager will eventually send a message to itself, passing up the opportunity to directly register itself as idle. An implementation of HandleStealRequest should consider this optimization. It is also possible to use a dedicated manager that forwards every steal request. Because a dedicated manager has no need for a deque, nor for sending its own steal requests, it makes sense to write two versions of HandleStealRequest so that manager and workers avoid unnecessary runtime checks.
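A dedicated manager's version of the handler might look like the sketch below. It rests on several assumptions that the text does not state: the manager has its own ID outside the range 0..num_workers-1, termination means that all num_workers workers are registered as idle, and a visit to the manager counts as a failed attempt because the manager never has tasks to give away. The function name `dedicated_manager_step` and its return convention are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class StealRequest:
    thief: int
    status: str = "working"  # "working", "idle", or "registeredIdle"
    failed: int = 0


def dedicated_manager_step(S, num_workers, n, idle_workers, rng=random):
    """Register idle thieves, then forward or return S.

    Returns (terminated, destination worker ID).
    """
    terminated = False
    if S.status == "idle":
        S.status = "registeredIdle"
        idle_workers.add(S.thief)
        terminated = len(idle_workers) == num_workers
    # Assumption: the manager holds no tasks, so every visit is a failed attempt.
    S.failed += 1
    if S.failed < n:
        # Forward to a random worker other than the thief.
        dest = rng.choice([j for j in range(num_workers) if j != S.thief])
    else:
        # Return the request so the thief can confirm its status.
        dest = S.thief
    return terminated, dest
```

Because this version never checks a deque or compares its own ID against the manager's, it avoids the runtime checks that a combined handler would need. It does not address workers that leave the idle set after being registered.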

The second problem is related to work stealing: idle workers, including those identified as registeredIdle, may receive tasks and start working again. Clearly, we do not want to put the manager in a position to acknowledge every single steal, but we still have to eliminate any possibility of detecting termination on the basis of outdated information

HandleStealRequest() // Second version

Let Q_i be the private deque of tasks of worker i,
    C_i be the channel for sending steal requests to worker i,
    C_m be the channel for sending steal requests to manager m,
    S be the steal request to handle,
    n be the number of steals to attempt

 1  if Q_i is not empty
 2      if i == S.thief
 3          // Own steal request is no longer needed
 4          Discard S
 5          return
 6      Pop task t from the top of Q_i
 7      Send task t to channel S.chan
 8  else
 9      if i == m ∧ S.status == idle
10          S.status = registeredIdle
11          Add S.thief to the set of idle workers
12          Check for termination
13      if S.failed == n
14          // Steal request must have been returned
15          assert i == S.thief
16          // Start new round of stealing (alternatively, back off if S.status == registeredIdle)
17          S.failed = 0
18          if worker i is idle
19              S.status = idle
20              Send S to channel C_m
21          else
22              Select a worker j, j ≠ i, at random
23              Send S to channel C_j
24      else
25          S.failed = S.failed + 1
26          if S.failed < n
27              Select a worker j, j ≠ i ∧ j ≠ S.thief, at random
28              Send S to channel C_j
29          else
30              // Return steal request to S.thief
31              Send S to channel C_{S.thief}

Figure 4.1: Because steal requests can be sent ahead of time, a worker must confirm that it is idle before it can be counted as such by the manager.
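For readers who want to experiment with Figure 4.1, the pseudocode can be transcribed into a single-threaded Python sketch. It is an illustration under stated assumptions, not the runtime system's code: channels are modeled as plain `collections.deque` mailboxes, the right end of a deque is taken to be its "top", termination is taken to mean that the idle set contains every worker, and at least three workers are assumed so that a forwarding target always exists. The names `Worker`, `StealRequest`, `handle_steal_request`, and `idle_set` are hypothetical.

```python
import random
from collections import deque
from dataclasses import dataclass, field

WORKING, IDLE, REGISTERED_IDLE = "working", "idle", "registeredIdle"


@dataclass
class StealRequest:
    thief: int
    chan: deque            # S.chan: where a victim sends the stolen task
    status: str = WORKING
    failed: int = 0


@dataclass
class Worker:
    ident: int
    tasks: deque = field(default_factory=deque)     # Q_i
    requests: deque = field(default_factory=deque)  # C_i
    is_idle: bool = False


def handle_steal_request(workers, i, m, n, S, idle_set, rng=random):
    """One invocation of Figure 4.1 for worker i; returns True if termination is detected."""
    me = workers[i]
    if me.tasks:
        if i == S.thief:
            return False                   # own request no longer needed: discard
        S.chan.append(me.tasks.pop())      # pop from the top and send to the thief
        return False
    terminated = False
    if i == m and S.status == IDLE:
        S.status = REGISTERED_IDLE
        idle_set.add(S.thief)
        terminated = len(idle_set) == len(workers)  # assumed termination check
    if S.failed == n:
        assert i == S.thief                # request must have been returned
        S.failed = 0                       # start a new round of stealing
        if me.is_idle:
            S.status = IDLE
            workers[m].requests.append(S)
        else:
            j = rng.choice([k for k in range(len(workers)) if k != i])
            workers[j].requests.append(S)
    else:
        S.failed += 1
        if S.failed < n:
            j = rng.choice([k for k in range(len(workers)) if k not in (i, S.thief)])
            workers[j].requests.append(S)
        else:
            workers[S.thief].requests.append(S)  # return request to the thief
    return terminated
```

With three workers, manager 0, and n = 2, a request issued by an idle worker 1 travels from worker 2 to the manager, back to worker 1, and then to the manager again, where worker 1 is finally registered as idle. The request is counted only after the thief has confirmed its own status, which is the point the figure caption makes.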

From the document Embracing Explicit Communication in Work-Stealing Runtime Systems (pages 94-97)