
3.4 Victim Selection

3.4.2 Remembering the Last Victim

Figure 3.6 shows the inefficiency of random victim selection using the example of the matrix multiplication benchmark, in which a single worker creates all tasks. Instead of choosing victims independently at random, we avoid choosing a victim more than once, as described above. This can be done in two ways:

The first is to store the set of previously selected victims in the steal request and update it in case of failure. If there are many potential victims, it may be more practical to remember only the m < n most recently selected victims, rather than trying to fit n worker IDs into a steal request⁶.
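As a minimal sketch of the first approach, consider a steal request that carries a bitset of previously selected victims, as hinted at in the footnote. The struct and function names below are illustrative, not those of an actual runtime, and we assume at most 64 workers so that the set fits into a single machine word:

#include <stdint.h>
#include <stdlib.h>

struct steal_request {
    int thief;      /* ID of the requesting worker */
    uint64_t tried; /* bitset of victims selected so far */
};

/* Picks a random victim that is neither the thief nor has been tried
   before, and marks it as tried. Returns -1 once every potential
   victim has been selected. Assumes num_workers <= 64. */
static int next_victim(struct steal_request *req, int num_workers)
{
    uint64_t all = (num_workers == 64) ? ~0ULL
                 : (((uint64_t)1 << num_workers) - 1);
    uint64_t excluded = req->tried | ((uint64_t)1 << req->thief);
    if ((excluded & all) == all)
        return -1; /* no victims left to try */
    for (;;) {
        int v = rand() % num_workers;
        if (!(excluded & ((uint64_t)1 << v))) {
            req->tried |= (uint64_t)1 << v;
            return v;
        }
    }
}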

⁶ A more compact representation such as a bitset could help.

Figure 3.6: Random victim selection may not be the best strategy when a single worker creates all tasks, as in this example of multiplying two 2048×2048 matrices using blocks of size 32×32. The figure on the right shows the numbers of failed attempts before a steal request succeeded (medians of averages from ten program runs). (GCC 4.9.1, -O3, AMD Opteron multiprocessor)

The second is to take advantage of shared memory. Every worker i keeps a list of potential victims, which it starts to shuffle to pick the first victim [80]. If the steal request fails, the first victim, which now assumes the role of thief, continues shuffling worker i's list to pick the second victim, and so on, until a single victim remains. A pointer to this (partially shuffled) list of victims can be passed along with each steal request, sharing state by communicating [97].
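One way to implement the second approach is an incremental Fisher-Yates shuffle, sketched below with illustrative names: each call swaps a randomly chosen remaining entry of worker i's victim list into the next position and returns it, so a forwarded steal request simply continues the shuffle where the previous thief left off:

#include <stdlib.h>

/* Worker i's list of potential victims, shuffled incrementally. A
   pointer to this struct travels with the steal request, sharing
   state by communicating. */
struct victim_list {
    int *ids; /* worker IDs; candidates start at index 'next' */
    int n;    /* total number of potential victims */
    int next; /* how many victims have been picked so far */
};

/* One step of an incremental Fisher-Yates shuffle: swap a random
   remaining victim into position 'next' and return it, or -1 if the
   list is exhausted. */
static int pick_next_victim(struct victim_list *vl)
{
    if (vl->next >= vl->n)
        return -1;
    int j = vl->next + rand() % (vl->n - vl->next);
    int tmp = vl->ids[vl->next];
    vl->ids[vl->next] = vl->ids[j];
    vl->ids[j] = tmp;
    return vl->ids[vl->next++];
}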

Figure 3.6 (b) confirms that, on average, a steal request succeeds after $\frac{n-1}{2}$ failures.

It seems pointless to ask the same workers over and over again if only one of them can possibly send tasks. We can devise a simple strategy without necessarily knowing which worker we are looking for: remember the victim of the last successful steal, and target this victim first when running out of tasks next time. If this last-victim check fails to have the desired effect because the victim in question has run out of tasks in the meantime, victim selection proceeds in the same way as previously. The result is up to 35% better performance compared to completely random selection, as shown in Figure 3.6 (a).
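A sketch of this strategy, reusing the steal-request sketch from above (again, the names are illustrative, and in a real runtime each worker would keep its own last_victim):

/* Victim of the last successful steal, or -1 if unknown.
   Thread-local because every worker remembers its own last victim. */
static __thread int last_victim = -1;

static int select_victim(struct steal_request *req, int num_workers)
{
    if (last_victim != -1 && last_victim != req->thief) {
        int v = last_victim;
        last_victim = -1; /* if this check fails, fall back below */
        return v;
    }
    /* Victim selection proceeds as previously: random, no repeats. */
    return next_victim(req, num_workers);
}

/* A thief that receives tasks records their sender, so that the next
   steal request is sent there first:
       last_victim = victim_id;                                      */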

Is it always preferable to send steal requests to the last victim if we know that other workers will decline? The answer is, perhaps surprisingly, negative. The following analysis assumes that out of $n$ potential victims ($n+1$ workers), only one has tasks that workers are trying to steal, and that, under random victim selection, a steal request is expected to succeed after $\frac{n-1}{2}$ failures.

Let $t_{sendR}$ be the time it takes to send a steal request and $t_{sendT}$ be the time it takes to send a task. In addition, let $t_{recvR}$ be the time it takes to receive a steal request. We further define $t_{lat}$, $t_{sel}$, and $t_{steal}$ to be the message handling latency, the time it takes to select a victim, and the time it takes to steal (dequeue) a task, respectively. The time for a failed steal that results in a forwarded request can be broken down as $t_{fail} = t_{lat} + t_{recvR} + t_{sel} + t_{sendR}$, while a successful steal amounts to $t_{succ} = t_{lat} + t_{recvR} + t_{steal} + t_{sendT}$. Randomly selecting victims, the time it takes to receive a task becomes

$$t_{ws}^{R} = t_{sel} + t_{sendR} + \frac{n-1}{2} \cdot t_{fail} + t_{succ} = \frac{n+1}{2} \cdot (t_{sendR} + t_{lat} + t_{recvR} + t_{sel}) + t_{steal} + t_{sendT}. \qquad (3.1)$$

By timing the different operations, we observed that $t_{sel} \approx t_{steal}$, whereas communication is an order of magnitude more expensive. (Recall that neither $t_{sel}$ nor $t_{steal}$ involves synchronization.) To simplify, we drop $t_{sel}$ and $t_{steal}$ and say that

$$t_{ws}^{R} \approx \frac{n+1}{2} \cdot (t_{sendR} + t_{lat} + t_{recvR}) + t_{sendT}. \qquad (3.2)$$

In other words, work stealing is dominated by the cost of communication, including message handling latencies.

Now consider that thieves may prevent further failure (assuming enough tasks are available) by sending steal requests to the victim from which they received their last tasks. When n steal requests are lined up (worst case), and all orderings are equally likely, the expected time it takes to receive a task is

$$t_{ws}^{LV} = t_{sel} + c \cdot t_{sendR} + t_{lat} + \frac{n+1}{2} \cdot (t_{recvR} + t_{steal} + t_{sendT}), \qquad (3.3)$$

with $c \geq 1$ accounting for the possibility that sending steal requests to a single victim may increase contention among thieves, and $t_{lat}$ being amortized over the $\frac{n+1}{2}$ steal requests, which can be handled in succession. We omit $t_{sel}$ and $t_{steal}$ just like we did above and conclude that

$$t_{ws}^{LV} \approx c \cdot t_{sendR} + t_{lat} + \frac{n+1}{2} \cdot (t_{recvR} + t_{sendT}). \qquad (3.4)$$

Last-victim selection may not be able to reduce the number of messages, but it increases the efficiency with which steal requests are handled. Random victim selection is sensitive to variations in $t_{lat}$: workers that are busy running tasks cannot respond to steal requests, so latency increases, and this latency cost grows linearly with the number of workers because the expected number of failed attempts does. Workers that are idle, on the other hand, have nothing to do besides handling messages, in which case latency becomes negligible.
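The sensitivity to $t_{lat}$ can be seen by plugging numbers into the simplified models (3.2) and (3.4). The following sketch uses made-up timing values (arbitrary units, not measurements): $t_{lat}$ is multiplied by $\frac{n+1}{2}$ under random victim selection but counted only once under last-victim selection.

#include <stdio.h>

/* Cost of random victim selection, Equation (3.2). */
static double t_random(int n, double t_sendR, double t_lat,
                       double t_recvR, double t_sendT)
{
    return (n + 1) / 2.0 * (t_sendR + t_lat + t_recvR) + t_sendT;
}

/* Cost of last-victim selection, Equation (3.4). */
static double t_last_victim(int n, double c, double t_sendR,
                            double t_lat, double t_recvR, double t_sendT)
{
    return c * t_sendR + t_lat + (n + 1) / 2.0 * (t_recvR + t_sendT);
}

int main(void)
{
    int n = 47;     /* 47 thieves, one victim: 48 workers */
    double c = 2.0; /* pessimistic contention factor */
    for (double t_lat = 0.0; t_lat <= 4.0; t_lat += 1.0)
        printf("t_lat = %.0f: random = %6.1f, last victim = %6.1f\n",
               t_lat,
               t_random(n, 1.05, t_lat, 1.0, 1.0),
               t_last_victim(n, c, 1.05, t_lat, 1.0, 1.0));
    return 0;
}

Even with a pessimistic contention factor, the cost of random victim selection grows quickly with $t_{lat}$, whereas last-victim selection pays the latency only once.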


[Figure 3.7: plot of the contention factor $c$ (1 to 2.2) against the number of thieves $n$ (8 to 48).]

Figure 3.7: When steal requests cause contention among thieves, the overhead of last-victim selection may exceed that of random victim selection if the time required to send a steal request increases by more than a factor of $c$. The graph shows how $c$ increases with the number of concurrent thieves, assuming that steal requests are 5% more expensive to send than tasks. (Steal requests use MPSC channels, whereas tasks use faster SPSC channels.)

If tasks are so short that $t_{lat}$ is of no significance, random victim selection incurs a communication overhead of $\frac{n+1}{2} \cdot (t_{sendR} + t_{recvR}) + t_{sendT}$, compared to $c \cdot t_{sendR} + \frac{n+1}{2} \cdot (t_{recvR} + t_{sendT})$ for last-victim selection. Assuming ideal channels, that is, assuming $c = 1$ and $t_{sendR} = t_{sendT}$, both strategies have the same communication cost. More realistically, however, we expect $c > 1$ and possibly $t_{sendR} > t_{sendT}$ because steal requests are sent over MPSC channels, whereas tasks are sent over faster SPSC channels. Last-victim selection then has higher communication cost than random victim selection whenever $c \cdot t_{sendR} + \frac{n+1}{2} \cdot t_{sendT} > \frac{n+1}{2} \cdot t_{sendR} + t_{sendT}$. Under the assumption of observable contention ($c > 1$), either $t_{sendT} \geq t_{sendR}$, or $t_{sendR} > t_{sendT}$ together with

$$c > \frac{\frac{n+1}{2} \cdot (t_{sendR} - t_{sendT}) + t_{sendT}}{t_{sendR}}$$

suffices for last-victim selection to be outperformed by random victim selection.

As an example of the latter case, suppose that sending a steal request takes 5% longer than sending a task. Figure 3.7 plots the values of $c$ above which last-victim selection would incur more overhead in terms of channel operations. Judging from these numbers, it is entirely possible that, for sufficiently short tasks, random victim selection provides better load balancing, despite the number of failed attempts caused by sending steal requests to random workers.
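To make this concrete, the following sketch evaluates the threshold from the inequality above for $t_{sendR} = 1.05 \cdot t_{sendT}$, the assumption behind Figure 3.7 (the timing values are illustrative units):

#include <stdio.h>

int main(void)
{
    const double t_sendT = 1.0;  /* time to send a task */
    const double t_sendR = 1.05; /* steal requests cost 5% more */
    for (int n = 8; n <= 48; n += 8) {
        /* Last-victim selection loses to random victim selection
           whenever the contention factor c exceeds this value. */
        double c = ((n + 1) / 2.0 * (t_sendR - t_sendT) + t_sendT)
                 / t_sendR;
        printf("n = %2d thieves: c > %.2f\n", n, c);
    }
    return 0;
}

For $n$ between 8 and 48 thieves, the computed threshold grows from about 1.17 to 2.12, which is consistent with the range shown in Figure 3.7.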

The results of testing our hypothesis on the SPC benchmark are shown in Figure 3.8. Up to a task length of roughly 25 microseconds, random victim selection is preferable to last-victim selection because it achieves a better distribution of work, as measured by the number of tasks assigned to each consumer. For longer tasks and thus longer message handling latencies, the opposite is true, with last-victim selection providing better load balancing and performance. Interestingly, last-victim selection is fast for


Figure 3.8: In a single-producer, multiple-consumers setting, as in this example of running SPC with $n = 10^6$ and $t$ between 0 and 100 microseconds, last-victim selection leads to a poor distribution of work when scheduling fine-grained tasks of up to roughly 25 microseconds. Above that task length, however, it achieves a better distribution of work than random victim selection. The bottom figures show the numbers of tasks executed per consumer (medians of 460 data points, along with 10th and 90th percentiles). A horizontal line labeled "Ideal" indicates a perfectly even distribution of work. (GCC 4.9.1, -O3, AMD Opteron multiprocessor, 48 worker threads)

3.4.3 Limitations