
Importance of Steal-Half for Fine-grained Parallelism

Different victim selection strategies may affect the overhead associated with stealing a task, but may not prevent frequent stealing in unbalanced computations. The cycle of stealing a task, executing it, and running out of work again becomes increasingly inefficient with decreasing task lengths.

Figure 3.9: Being able to reduce the work-stealing overhead is essential for scheduling fine-grained parallelism. Informed by the results of Section 3.4, we combined steal-one with random victim selection in (a) and last-victim selection in (b). (GCC 4.9.1, -O3, AMD Opteron multiprocessor)

Suppose m workers are trying to balance n tasks. Let t be the time it takes to run a task (task length) and t_ws be the time it takes to steal a task. (Recall Equations (3.2) and (3.4).) If t ≫ t_ws, that is, if t + t_ws ≈ t, the speedup of the computation will approach the number of workers m. Using a concrete example, ten workers are able to execute 100 tasks, each taking one second, in ten seconds. If t ≈ t_ws such that t + t_ws ≈ 2t, the speedup will not exceed m/2. Generally speaking, the speedup is bounded by

(n·t) / (n·(t + t_ws)/m), which simplifies to t·m/(t + t_ws). We expect no speedup if t is approximately equal to t_ws/(m−1). If t falls below t_ws/(m−1), the parallel computation will end up being slower than the sequential one; the speedup will turn into a slowdown. Theoretically, if t were so small that t + t_ws ≈ t_ws, the speedup would tend towards zero.

3.5.1 Stealing Single Tasks

Figure 3.9 uses the SPC benchmark as an example to show the practical limitations of stealing single tasks, a strategy we call steal-one. Judging from the speedup curves, it takes more than 10µs to receive a task in Figure 3.9 (a) and less than 100µs in Figure 3.9 (b). We can estimate t_ws using the speedup formula above, but must be aware of the underlying assumption that n tasks are distributed evenly among m workers, which may not hold true in practice. Solving S = t·m/(t + t_ws) for t_ws gives t_ws = t·m/S − t. With 48 threads, one of them managing termination detection, we measure speedups of 4.5 and 32.5 for t = 10µs and t = 100µs, respectively. Substituting the values of S, t, and m,

we get t_ws = t_ws^R = 94.4µs for t = 10µs and t_ws = t_ws^LV = 44.6µs for t = 100µs. (We use different victim selection strategies, see Figure 3.9.) Though difficult to determine precisely, we measure a median latency of 121µs for t = 10µs and 44µs for t = 100µs.

(The measured values show a high variability; the 50% central ranges are 142µs for t = 10µs and 49µs for t = 100µs.) One value is very close to the estimate, the other value is off by 28%. In fact, t = 10µs yields an uneven distribution of work, unlike t = 100µs, as hinted at in Figure 3.8. Worker 0, which creates all tasks, also runs many more tasks than other workers. The actual overhead associated with work stealing is thus higher than the estimate, which does not account for load imbalance.

We should be convinced by now that steal-one is inadequate for scheduling fine-grained parallelism, unless a few steals suffice to maintain load balance. Reducing the cost of work stealing also means reducing the frequency with which workers need to steal. The idea is simple: workers may try to steal multiple tasks at a time to amortize the cost of stealing over n, n > 1, tasks that can be executed in sequence. Stealing multiple tasks helps spread the work, especially when dealing with large numbers of tasks. With more workers having tasks to spare, subsequent steals are more likely to succeed, even if victims are selected randomly.

3.5.2 Stealing Multiple Tasks

There are static and dynamic approaches to stealing multiple tasks. A possible strategy is to steal a fixed amount of work, say n tasks, and fall back to steal-one when a victim has fewer than n tasks left. Such a strategy, though easy to implement, becomes difficult to use when the workload is unknown or changes at runtime. Finding a value of n that is neither too small nor too large requires experimentation.

Dynamic strategies solve this problem by choosing n based on the number of tasks m in a victim’s deque such that n = f(m), where f is a monotonically non-decreasing function [44]. The more tasks there are for thieves to steal, the larger the value of n will tend to be. In this way, thieves are able to adapt to the workload. Among the different possibilities, stealing half of a victim’s tasks, that is, f(m) = ⌊m/2⌋, has emerged as a robust strategy [111, 44]. The idea behind steal-half is to transfer half of a victim’s remaining work with a single steal, assuming a correlation between the number of tasks and the amount of work to do.

The results of using steal-half are included in Figure 3.9. Efficiency is up from 10% in (a) and 69% in (b) to 88% in (a) and 96% in (b). To understand where the large difference in performance comes from, Figure 3.10 shows execution profiles. Execution time is broken down into the time spent running tasks, enqueuing/dequeuing tasks,

[Figure 3.10 plots execution time (0–100%) over worker thread IDs, split into Run task, Send/Recv req, Send/Recv task, Enq/Deq task, and Idle, for (a) steal-one and (b) steal-half.]

Figure 3.10: Execution time profile of SPC with n = 10^6 and t = 10µs under (a) steal-one and (b) steal-half work stealing. Worker 1 was dedicated to termination detection and is excluded from the list of worker IDs. The different sections of code were timed by inserting RDTSC instructions to read the processors’ time-stamp counters (see [125], Chapter 17). This slowed down execution by 10% and 2% compared to the uninstrumented versions of steal-one and steal-half. (GCC 4.9.1, -O3, AMD Opteron multiprocessor, 24 worker threads)

sending/receiving tasks, sending/receiving steal requests, and idling. The time spent on work stealing, directly or on behalf of other workers by means of forwarding steal requests, is derived from handling steal requests and tasks. A worker is idle while waiting to receive tasks.

Under steal-one, worker 0 does more work than necessary, executing over twice as many tasks as other workers, which spend less than 20% of their time on useful work. By contrast, under steal-half, workers get to run a lot more tasks as the cost of scheduling and work stealing shrinks to 5–8%. Apart from worker 0, which is busy creating and distributing tasks, workers spend more than 90% of their time running tasks. This vastly better utilization of workers explains the significant speedup of steal-half over steal-one.

3.5.3 Implementing Steal-Half with Private Deques

Private deques have the benefit that stealing multiple tasks is a relatively straightforward extension. The same cannot be said for concurrent deques, which must resort to coarse-grained locking or implement intricate synchronization protocols [111].

In our implementation, every worker has a doubly-linked list that serves as a private deque, with push and pop operating on one end, and steal operating on the other end.

Steal-half requires a worker to traverse the list, find the middle element, split the list in half, and send the second half to the thief using the channel reference in the steal request. Sending tasks sequentially, one by one, incurs overhead proportional to the length of the list. In addition, it is no longer possible to guarantee that channel_send never blocks as victims may try to send more tasks than a channel can buffer.

Shared memory permits an efficient solution with constant overhead: a single pointer-sized message can move a list of tasks between workers without copying. Stealing proceeds as follows: The victim splits its deque in half, intending to give away one half to the thief. Splitting a deque with head hv creates a new deque with head ht whose ownership is transferred by sending ht (a pointer) to the thief. This requires that channel_send and channel_receive copy pointers rather than tasks by using the technique outlined in Section 3.1.3. Upon completing the steal, the victim’s local pointer goes out of scope, permitting only the thief to access tasks through ht. The thief receives ht, prepends the list of stolen tasks to its own deque, and continues working. While copying tasks is inevitable in the absence of shared memory, similar optimizations may be used to limit the number of messages in a distributed environment, provided that tasks are stored in a way that facilitates one-sided access [185].

A doubly-linked list is not the most efficient data structure when it comes to splitting [29], but in practice, workers rarely pile up huge numbers of tasks while other workers remain idle. Tasks need not be stored in lists; they can also be stored in trees, which permit asymptotically efficient implementations of steal-half at the expense of adding overhead to task insertion and removal [122]. Using forests of binomial trees, for example, would enable thieves to steal the largest trees in the forests. Because task queues are private, no synchronization is needed, no matter how complex the underlying data structures are.

An alternative to arranging tasks in trees is building up nested lists. Cong et al. insert tasks into lists before making them available to thieves [67]. This task creation strategy saves deque operations and supports efficient stealing, but does not expose parallelism while a batch of tasks is being created. A batch is unsplittable and must be executed sequentially. The challenge is to schedule batches that are neither too small nor too large for the problem at hand. Cong et al. use a bounded exponential growth function to generate small batches when tasks are needed for load balancing and large batches when enough tasks are available to keep workers busy.