Adapting the Choice at Runtime - Efficient Fork/Join Parallelism


5.1.2 Adapting the Choice at Runtime

Choosing a strategy for every instance of a problem is tedious, let alone finding the strategy that works best. Moreover, it is possible for an application to create different task graphs over the course of several parallel phases. A single strategy, be it steal-one or steal-half, may not be able to schedule all available parallelism efficiently, giving rise to suboptimal performance.

If we assume that neither steal-one nor steal-half is strictly better than the other, we might be able to get the best of both worlds by letting the runtime system decide which strategy to use under what circumstances. This implies that workers switch between steal-one and steal-half, depending on which strategy is deemed more effective.

Implementation-wise, the chosen strategy is stored in a binary flag that is added to the steal request structure definition. A victim reads the value of the flag to perform the desired steal on behalf of the thief. The main question that remains is, how should workers decide which strategy to use?
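As a rough illustration of the flag, the following Python sketch models a steal request that carries the chosen strategy and a victim that honors it. The names `StealRequest`, `thief_id`, `steal_half`, and `handle_steal_request` are hypothetical, and the rounding used for steal-half (half of the queued tasks, rounded up) is an assumption rather than something the text specifies.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class StealRequest:
    thief_id: int     # hypothetical field: the worker asking for work
    steal_half: bool  # the binary flag: False = steal-one, True = steal-half


def handle_steal_request(queue: deque, request: StealRequest) -> list:
    """Victim side: remove tasks from `queue` according to the thief's flag.

    Tasks are taken from the left end, assumed here to hold the oldest tasks.
    """
    if not queue:
        return []  # nothing to give away
    if request.steal_half:
        count = (len(queue) + 1) // 2  # assumption: half, rounded up
    else:
        count = 1
    return [queue.popleft() for _ in range(count)]
```

The victim needs no knowledge of how the thief arrived at its choice; it only reads the flag, which keeps the decision logic entirely on the thief's side.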

We follow a simple heuristic: Prefer steal-one to steal-half when parallelism is limited or created recursively; otherwise choose steal-half. Put the other way round, prefer steal-half to steal-one when steals are frequent despite abundant potential parallelism². Figure 5.3 illustrates the process of choosing a stealing strategy. A worker is in one of two states corresponding to steal-one and steal-half. Depending on the outcome of the previous N steals, a worker may pursue a different strategy or keep using the current strategy for the next N steals.

² Cilk and its descendants are built on the assumption that, given sufficient parallelism, steals are rare. It is thus understandable that Cilk-style work stealing lacks support for steal-half.

Frequent stealing is an indication that a better strategy is needed. How do we define “frequent”? Our main concern is fine-grained parallelism: tasks that are small enough that efficient scheduling matters. Coarse-grained parallelism tends to dwarf the time spent scheduling, to the point that work stealing may have little effect on overall performance. Fine-grained parallelism, in contrast, is highly sensitive to the choices made at runtime. In general, the fewer the steals, the better the performance, provided that the work remains balanced throughout the computation.

That is to say, performance benefits from a high ratio of executed tasks to steals (not stolen tasks!). The smaller the ratio, however, the stronger the case for a change of strategy to try to reduce the work-stealing overhead.

Returning to Figure 5.3, we see that every transition is determined by the ratio of executed tasks to steals. For a worker to calculate this ratio, the interval is limited to the last N steals, during which a worker executes M tasks. A transition from steal-one to steal-half as a result of M/N = 1 usually means that all M tasks had to be stolen, or that other workers were quick to steal every task that was created recursively. Similarly, a transition from steal-half to steal-one as a result of M/N < 2 means that steal-half has failed to reduce the work-stealing overhead by averaging less than two tasks per steal. A worker takes M/N ≥ 2 as a sign that enough parallelism exists to be able to benefit from steal-half. It may well be the case that some workers continue to use steal-half, while others switch to steal-one, or vice versa. This helps avoid situations like the motivating example above, where there is not enough parallelism for every worker to steal more tasks than needed.
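The transitions described above can be sketched as a small per-worker state machine. This is an illustration under stated assumptions, not the runtime's actual code: the names `Strategy`, `AdaptiveWorker`, `next_steal`, and `on_task_executed` are hypothetical, only successful steals are assumed to be counted (so M ≥ N), and the reevaluation is assumed to happen just before the worker sends its next steal request.

```python
from enum import Enum


class Strategy(Enum):
    STEAL_ONE = "steal-one"
    STEAL_HALF = "steal-half"


class AdaptiveWorker:
    """Counts executed tasks (M) over the last N steals and picks a strategy."""

    def __init__(self, n: int, initial: Strategy = Strategy.STEAL_ONE):
        self.n = n                # interval length in steals (N)
        self.strategy = initial
        self.tasks_executed = 0   # M, counted within the current interval
        self.steals = 0           # steals counted within the current interval

    def on_task_executed(self) -> None:
        self.tasks_executed += 1

    def next_steal(self) -> Strategy:
        """Call when about to steal; returns the strategy for this steal."""
        if self.steals == self.n:
            self._reevaluate()
            self.tasks_executed = 0
            self.steals = 0
        self.steals += 1
        return self.strategy

    def _reevaluate(self) -> None:
        m, n = self.tasks_executed, self.n
        if self.strategy is Strategy.STEAL_ONE and m == n:
            # M/N = 1: every executed task had to be stolen.
            self.strategy = Strategy.STEAL_HALF
        elif self.strategy is Strategy.STEAL_HALF and m < 2 * n:
            # M/N < 2: fewer than two tasks per steal on average.
            self.strategy = Strategy.STEAL_ONE
        # Otherwise keep the current strategy for the next N steals.
```

Comparing M against N and 2N with integers avoids a floating-point division. Because each worker keeps its own counters, different workers can settle on different strategies at the same time.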

The value of N determines the number of steals and, as such, the length of the interval after which a worker reevaluates its stealing strategy. The smaller the value of N, the more transitions may occur. If steals are frequent, intervals tend to be short, allowing workers to adapt their strategies continually. If steals are rare, intervals tend to be longer, as workers are busy running tasks until the next load imbalance arises.

5.1.3 Performance

Figure 5.4 shows how adaptive work stealing compares to the best-performing strategies from Figure 5.2. Before we discuss the results, it is time to fill in the last blank in our algorithm, namely the choice of initial strategy. Referring back to Figure 5.3, workers have the choice of starting with steal-one or steal-half. In fact, different workers might start with different strategies, should some workloads warrant it. In our tests, at least, it does not matter whether workers start with one or the other; performance is identical. For presentation, we pick the results of starting with steal-one and omit the other results to avoid duplication.


Figure 5.4: Adaptive work stealing versus the best-performing strategy for each of the two workloads from Figure 5.2. The value of N is the number of steals after which a worker reevaluates its stealing strategy (see Figure 5.3). Whether workers start out with steal-one or steal-half does not make a measurable difference, so we omit the latter. (GCC 4.9.1, -O3, AMD Opteron multiprocessor)

Most importantly, adaptive work stealing can match steal-one in Figure 5.4 (a) and steal-half in Figure 5.4 (b). The latter results suggest that any reasonably small value for N serves to approximate steal-half; larger values (>100) slowly start to affect performance (not included in the figure). With 48 threads, there is practically no difference between N = 3 and N = 50. Workers rely on steal-half 93.2% of the time, ±0.4% for N = 3 and ±0.7% for N = 50 (standard error of the mean).
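The ± figures are standard errors of the mean. A minimal sketch of how such a value could be computed from per-run measurements follows; the function name `mean_and_sem` is hypothetical, and the sample numbers are made up for illustration and are not data from the experiments.

```python
import math
import statistics


def mean_and_sem(samples):
    """Return (mean, standard error of the mean) for a list of measurements."""
    mean = statistics.mean(samples)
    # Sample standard deviation divided by the square root of the sample size.
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, sem


# Hypothetical per-run percentages of steals that used steal-half.
runs = [40.0, 42.0, 41.0, 43.0, 39.0]
mean, sem = mean_and_sem(runs)
print(f"{mean:.1f}% ± {sem:.2f}%")
```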

Figure 5.4 (a) highlights the importance of choosing a good value for N. Among the plotted values, only N = 25 and N = 50 lead to performance on par with steal-one. Smaller values increase the chance of switching to steal-half, given the workload’s 9:1 ratio of consumer to producer tasks. For N = 3, workers fall back to steal-half 37.1%±0.02% of the time (48 threads). As we increase N, it becomes clear that steal-one is winning over steal-half. For N = 5 and N = 10, the percentage of steal-half drops to 28.2%±0.02% and further to 14.6%±0.03%. A value of 25 is large enough that workers dismiss steal-half and opt for steal-one 98.4%±0.02% of the time.

To summarize Figure 5.4, the number of steals per interval should be neither too large nor too small. Workers need to base their decisions on more than a few steals, but at the same time be quick to adapt when a strategy turns out to be inefficient.

Figure 5.5: Adaptive work stealing combines steal-one and steal-half to select the better-performing strategy at runtime. Matrix multiplication and LU factorization were configured to use “last-victim-first” instead of random victim selection. (GCC 4.9.1, -O3, AMD Opteron multiprocessor) Recoverable panels: (a) Multiplication of two 2048×2048 matrices partitioned into blocks of 32×32 elements; (b) LU factorization of a 4096×4096 matrix partitioned into blocks of 64×64 elements; (c) Cilksort of 100 million integers.

Figure 5.6: Creating a large number of very fine-grained tasks poses a problem to either stealing strategy. (GCC 4.9.1, -O3, AMD Opteron multiprocessor) Recoverable panel: (b) SPC with n = 10⁶ and t = 1 µs. Axis labels: ratio of tasks to steal requests; number of workers. Series: steal-one, steal-half, steal-adaptive.

Figure 5.5 shows four more benchmarks where the right strategy makes some, albeit small, difference in performance. Matrix multiplication and UTS benefit from steal-half; LU decomposition and Cilksort benefit from steal-one. The results make us confident: steal-adaptive manages to combine the best of both strategies, relatively independent of the value of N, save for UTS, where a larger N means fewer chances for steal-half as workers err on the side of steal-one. With N = 25, workers are led to choose steal-half 20.8%±0.98% of the time, in stark contrast to less than one percent with N = 50 (48 threads). The problem is not so much that steal-one is inefficient (UTS includes recursive task creation), but rather that steal-half is sometimes preferable, hence the need for shorter intervals to take advantage of steal-half, if only temporarily.

We will use N = 25 for all remaining experiments.

From the document Embracing Explicit Communication in Work-Stealing Runtime Systems (pages 126-130)