
Section 3.4 mentioned some alternatives to random victim selection. In sampling victim selection [88], or the closely related group-based victim selection [45], a worker samples n potential victims and picks the one with the most tasks. Thanks to asynchronous steal requests, sampling victim selection can be implemented by sending and then forwarding a steal request n − 1 times while recording which worker has the most tasks. After the sampling is done, the steal request is forwarded once more, this time to the designated victim, which handles it like a normal steal request. Because random victim selection can be thought of as sampling a single victim, it should be possible to devise an adaptive strategy that varies n depending on the number of failed attempts that a steal request has accumulated, with the goal of improving the effectiveness of steal-half and, by extension, adaptive stealing. We are not aware of previous work that has investigated the combination of sampling victims and steal-half.
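To make the protocol concrete, consider the following minimal, self-contained sketch in C. The steal_request fields, the simulated task counts, and the random forwarding targets are hypothetical stand-ins for the runtime's actual data structures and channel operations:

/* Sampling victim selection: a steal request records the best victim seen
 * while being forwarded, then is delivered to that victim. */
#include <stdio.h>
#include <stdlib.h>

#define NUM_WORKERS 8

static int tasks[NUM_WORKERS]; /* simulated per-worker queue lengths */

typedef struct {
    int thief;        /* requesting worker */
    int samples_left; /* victims still to sample */
    int best_victim;  /* victim with the most tasks so far */
    int best_tasks;
} steal_request;

/* Each hop compares the current worker's queue length and forwards.
 * For simplicity, forwarding targets are drawn with replacement. */
static void forward(steal_request *req, int worker)
{
    if (tasks[worker] > req->best_tasks) {
        req->best_tasks  = tasks[worker];
        req->best_victim = worker;
    }
    if (--req->samples_left > 0)
        forward(req, rand() % NUM_WORKERS);
}

int main(void)
{
    srand(42);
    for (int i = 0; i < NUM_WORKERS; i++)
        tasks[i] = rand() % 100;

    steal_request req = { .thief = 0, .samples_left = 4,
                          .best_victim = -1, .best_tasks = -1 };
    forward(&req, rand() % NUM_WORKERS);

    /* In the runtime, the request would now be forwarded once more,
     * to best_victim, which handles it like a normal steal request. */
    printf("worker %d steals from worker %d (%d tasks)\n",
           req.thief, req.best_victim, req.best_tasks);
    return 0;
}

An adaptive variant would simply initialize samples_left based on the number of failed attempts recorded in the steal request.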

When parallelism is limited, workers can reduce contention by backing off from stealing. Saraswat et al. have proposed an interesting strategy in which workers back off by sending steal requests to other workers, establishing “lifelines” that determine how new work will be distributed [218]. The basic idea of remembering steal requests is easy to implement in our runtime system. The more interesting question is whether the benefits of work stealing and work sharing can be combined without having to precompute suitable lifelines for each worker.
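A minimal sketch of what remembering steal requests could look like, assuming a small per-worker buffer of pending requests; all names are hypothetical, and the channels over which tasks would actually travel are elided:

/* Remembering steal requests in the spirit of lifelines [218]: a worker
 * without tasks records incoming requests and answers them as soon as it
 * creates new work. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_PENDING 4

typedef struct { int thief; } steal_request;

static steal_request pending[MAX_PENDING]; /* remembered steal requests */
static int num_pending;
static int my_tasks; /* simulated queue length */

/* Called when a steal request arrives and no tasks are available. */
static bool remember(steal_request req)
{
    if (num_pending == MAX_PENDING)
        return false; /* buffer full: forward or decline as usual */
    pending[num_pending++] = req;
    return true;
}

/* Called when this worker creates new tasks: remembered requests are
 * answered first, without any further stealing. */
static void share_new_work(int new_tasks)
{
    my_tasks += new_tasks;
    while (num_pending > 0 && my_tasks > 1) {
        steal_request req = pending[--num_pending];
        my_tasks--; /* one task would be sent over the thief's channel */
        printf("sending one task to worker %d\n", req.thief);
    }
}

int main(void)
{
    remember((steal_request){ .thief = 3 });
    remember((steal_request){ .thief = 5 });
    share_new_work(4); /* answers both remembered requests */
    return 0;
}

New work is then distributed to remembered thieves before any of them has to issue another steal request, which is the work-sharing half of the combination.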

Channel-based work stealing can be used on any system that is capable of supporting channels through shared memory, message passing, or a combination of both. This includes future multi- and manycore processors as well as manycore clusters. To target the latter, we may run an instance of channel-based work stealing on each node and have managers relay messages between nodes. This makes it easy to specialize channels for intra-node and inter-node communication to reduce latency where possible.

Since managers are responsible for termination detection, they can pass steal requests that could not be handled within their nodes on to other managers, initiating global load balancing when local load balancing has failed. If we assume that only managers are able to communicate with other nodes, managers act as proxies for inter-node steals. Neither thief nor victim needs to take special action; steal requests are flexible enough to be “hijacked”, meaning a worker can change the channel reference contained in a steal request and intercept tasks. By doing so, managers are able to forward tasks from their nodes to other nodes and from other nodes to workers within their nodes.
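The following sketch illustrates the hijacking mechanism under the assumption that a steal request carries a reference to the channel on which tasks are to be sent; the Channel type and all names are hypothetical stand-ins for the runtime's actual channel implementation:

/* Hijacking a steal request: swap the channel reference so that tasks are
 * intercepted by the manager, which can then relay them across nodes. */
#include <stdio.h>

typedef struct { int id; } Channel; /* placeholder for a real task channel */

typedef struct {
    int      thief; /* requesting worker */
    Channel *chan;  /* where the victim should send tasks */
} steal_request;

/* A manager intercepts a request before forwarding it to another node.
 * The original channel is remembered so the tasks can be passed on. */
static Channel *hijack(steal_request *req, Channel *manager_chan)
{
    Channel *original = req->chan;
    req->chan = manager_chan;
    return original;
}

int main(void)
{
    Channel worker_chan = { .id = 1 }, manager_chan = { .id = 99 };
    steal_request req = { .thief = 1, .chan = &worker_chan };
    Channel *orig = hijack(&req, &manager_chan);
    printf("tasks arrive on channel %d and are relayed to channel %d\n",
           req.chan->id, orig->id);
    return 0;
}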

Hierarchical work stealing can help exploit locality in the presence of increasingly complex memory hierarchies, including those of manycore clusters [170, 264]. Workers running on cores in close proximity can be grouped together into places [102], with managers being in charge of local termination detection and inter-place communication. If workers can communicate directly with other places, for instance, within a single node, managers need not participate in inter-place steals.

Work stealing may suffer from long message latencies. If parallelism is not the limiting factor, workers can try to prefetch tasks by sending steal requests further ahead of time. While prefetching did not improve performance in our tests, its potential for hiding latency is worth exploring on more systems. It has been shown in the past that prefetching benefits load balancing in high-latency networks [248].

Another way to prefetch tasks is to continue stealing even after succeeding. Workers can forward steal requests until the desired number of tasks has been prefetched. Since there is still only one steal request per worker, it is easy to ensure that tasks are never sent concurrently so that workers can keep using SPSC channels. This would not be possible if workers were allowed to send multiple steal requests. An alternative would be to allocate a second SPSC channel to support two concurrent steal requests per worker. For example, a worker could initiate a local and a remote steal request, the latter for the purpose of prefetching, similar to the wide-area work-stealing strategy of van Nieuwpoort et al. [248].
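As a sketch of the victim-side logic under this scheme, assuming the steal request is extended with a count of tasks still wanted (all names hypothetical):

/* Prefetching by continued stealing: a victim sends tasks and, if the
 * thief wants more, keeps the request circulating instead of returning it.
 * Because the request remains unique per worker, at most one victim sends
 * tasks to the thief at any time, so its SPSC channel suffices. */
#include <stdio.h>

typedef struct {
    int thief;        /* requesting worker */
    int tasks_wanted; /* tasks still to be prefetched */
} steal_request;

static void handle_steal(steal_request *req, int *victim_tasks, int victim_id)
{
    while (*victim_tasks > 1 && req->tasks_wanted > 0) {
        (*victim_tasks)--; /* one task goes over the thief's channel */
        req->tasks_wanted--;
        printf("worker %d sends a task to worker %d\n", victim_id, req->thief);
    }
    if (req->tasks_wanted > 0)
        printf("forwarding request from worker %d (wants %d more)\n",
               req->thief, req->tasks_wanted);
}

int main(void)
{
    steal_request req = { .thief = 0, .tasks_wanted = 3 };
    int tasks_a = 2, tasks_b = 5;
    handle_steal(&req, &tasks_a, 1); /* sends one task, then forwards */
    handle_steal(&req, &tasks_b, 2); /* sends the remaining two tasks */
    return 0;
}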

We mentioned that steal requests may be hijacked in order to intercept tasks. Suppose worker i is idle. While waiting for tasks, it forwards steal requests from other workers. Some steal requests, especially those meant for prefetching, can be considered less urgent than others. Worker i could hijack such a steal request, hoping to reduce its own time spent waiting. This would allow starving workers to get back to work faster, without changing the upper bound for the number of steal requests, at the cost of requiring MPSC channels for sending tasks between workers.
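A sketch of this idea, assuming steal requests are extended with a flag that marks prefetch requests as less urgent; as above, the names are hypothetical:

/* An idle worker redirecting a less urgent (prefetch) steal request to its
 * own channel. If the worker also has its own request in flight, two
 * victims may send tasks to the same channel, hence the need for MPSC
 * channels. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { int id; } Channel; /* placeholder for a real task channel */

typedef struct {
    int      thief;    /* requesting worker */
    Channel *chan;     /* where tasks should be sent */
    bool     prefetch; /* true: issued ahead of time, less urgent */
} steal_request;

/* Forwarding hook of an idle worker: urgent requests pass through
 * unchanged; a prefetch request is redirected to the worker's channel. */
static void maybe_hijack(steal_request *req, Channel *my_chan)
{
    if (req->prefetch) {
        req->chan = my_chan;   /* tasks now arrive at the idle worker */
        req->prefetch = false; /* treat the request as urgent from now on */
    }
}

int main(void)
{
    Channel thief_chan = { .id = 7 }, my_chan = { .id = 2 };
    steal_request req = { .thief = 7, .chan = &thief_chan, .prefetch = true };
    maybe_hijack(&req, &my_chan);
    printf("tasks will be sent to channel %d\n", req.chan->id);
    return 0;
}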

This much is certain: channel-based work stealing opens up many interesting avenues for future work, which we look forward to exploring.

A | CPU Architectures

The following tables summarize the CPU architectures on which we have run our tests.

Most of the information is taken from the lscpu command and from /proc/cpuinfo. Minimum and maximum clock speeds are determined by reading

/sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_{min,max}_freq.
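For instance, these limits could be read programmatically as in the following small C sketch (the cpufreq files report frequencies in kHz); this is a convenience example, not the exact procedure we used:

/* Reading the clock-speed limits of CPU 0 from sysfs. */
#include <stdio.h>

int main(void)
{
    const char *paths[] = {
        "/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq",
        "/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq",
    };
    for (int i = 0; i < 2; i++) {
        FILE *f = fopen(paths[i], "r");
        if (!f) { perror(paths[i]); continue; }
        long khz;
        if (fscanf(f, "%ld", &khz) == 1)
            printf("%s: %.2f MHz\n", paths[i], khz / 1000.0);
        fclose(f);
    }
    return 0;
}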

The Intel Core i7 is included for completeness; it is used only in Figure 2.4. The processor topologies in Figures A.1 and A.2 are gathered from lstopo with

lstopo --no-legend --no-io.

For lack of space, we omit similar topology information for the 240-thread Intel Xeon Phi and point the interested reader to the Portable Hardware Locality (hwloc) project’s web page at https://www.open-mpi.org/projects/hwloc, which contains a number of examples, including the graphical output of running lstopo on a Xeon Phi coprocessor.
