
In general, tasks may have arbitrary dependencies that must be respected. As in [166], Section 2.1, page 39, we will not distinguish between control and data dependencies and simply say that task B depends on task A if A must precede B, either because of A's side effects or because A produces data that B consumes. Formally, A precedes B is written A ≺ B, which tells us that A and B are ordered and forbidden to execute in parallel.

4.3.1 Channel-based Futures

Consider a simple Fibonacci-like tree recursion (without the base case):

int x = spawn f(n-1); // Create task for f(n-1)
int y = f(n-2);       // Proceed recursively with f(n-2)
sync;                 // Wait until result of f(n-1) is available
return x + y;

Here, a task depends on the result of its child task, which in turn depends on the result of its child task, and so on. Cilk and OpenMP provide constructs to suspend a task until its children have finished execution. The same can be achieved with futures:

future fx = FUTURE(f, n-1); // Create a future for f(n-1)
int y = f(n-2);             // Proceed recursively with f(n-2)
int x = AWAIT(fx, int);     // Wait for future's result

return x + y;

The important insight is that futures can be viewed as channels: a future opens a channel over which the result will be delivered. Setting the value of a future is equivalent to sending the value to the channel. Forcing a future is equivalent to receiving the value from the channel. When the value is needed, it is simply received from the channel, blocking the receiver until the value is determined.

Creating a future for f(n-1) involves allocating a channel, creating a task, and storing a reference to the channel in the task descriptor. The latter is taken care of by ASYNC (cf. lines 28–36 in Listing 2.6):

Channel *ch = channel_alloc(sizeof(int), 1, SPSC);

ASYNC(f, n-1, ch);

The channel should be buffered (capacity > 0) to avoid the possibility of a blocking send when a worker uses the channel reference after evaluating f(n-1). This means that we cannot use ASYNC_DECL to generate the task function for f because it would insert code to dereference ch (cf. lines 12–20 in Listing 2.6). We need a modified version, FUTURE_DECL, that inserts a call to channel_send instead:

// At the end of the task function
int tmp = f(n-1);

channel_send(ch, &tmp, sizeof(int));

Before the future’s result can be used, it must be received from the channel. Until the value is available, the task is suspended:

while (!channel_receive(ch, &x, sizeof(int)))
    suspend();

channel_free(ch);

This is known as data flow synchronization: waiting for data to become available, rather than waiting for a task to finish execution. While a thread is blocked on a future, it can try to schedule other work by calling back into the runtime system:

rts_force_future(ch, &x, sizeof(int));

channel_free(ch);

In this case, the runtime system takes care of receiving a value from channel ch. Finally, by hiding channels behind a future type, we arrive at the macros that we introduced back in Section 2.4.1:

future fx = FUTURE(f, n-1);
...
int x = AWAIT(fx, int);

FUTURE_DECL(int, sum, int a; int b, a, b);

Listing 4.2: A minimal task-parallel program with future-based synchronization.

Listing 4.2 repeats the toy example from Section 2.4, replacing the task barrier with future-based synchronization. Note the use of FUTURE_DECL in the declaration of sum. With help from the compiler, synchronization could be made implicit by figuring out when a future's result is needed and forcing it upon first touch.

Listing 4.3, which shows Listing 4.2 after macro expansion, reveals the underlying channel operations. After creation (line 39), the future is stored in the task descriptor (line 43) and later retrieved to send the result (line 27). Forcing a future translates into a call to rts_force_future followed by freeing the associated channel. Note the use of wrapper functions that act as getters and setters for channels. A future is a handle to a channel and may contain different data, depending on how channels are passed between workers. On the SCC, for example, we used a pair of integers³ to identify a channel [201], hence the need for converting a "portable reference" to a regular Channel * and vice versa. Shared-memory futures are raw pointers to channels and can be cast as such.

³ (ID of channel owner, byte offset into owner's message passing buffer)

 1 int sum(int a, int b)
 2 {
 3     return a + b;
 4 }
 5
 6 // FUTURE_DECL expands to a data structure to hold the task's arguments,
 7 struct sum_task_data {
 8     int a; int b; future f;
 9 };
10
11 // a function to allocate a future/channel,
12 static inline future make_sum_future(void)

19 // and a task function that wraps the call to sum
20 void sum_task_func(struct sum_task_data *d)

36 future f = ({ // FUTURE creates a task, enqueues it, and returns a future
37     Task *__task = task_alloc();
38     struct sum_task_data *__d;
39     future __f = make_sum_future();
40     __task->parent = get_current_task();
41     __task->fn = (void (*)(void *))sum_task_func;
42     __d = (struct sum_task_data *)__task->data;
43     *(__d) = (typeof(*(__d))){ a, b, __f };
44     rts_push(__task);
45     __f;
46 });
47
48 s = ({ // AWAIT forces the future and returns its result
49     int __tmp;

Listing 4.3: Program 4.2 after preprocessor macro expansion (abbreviated to make it more readable). Both the FUTURE and AWAIT macros use statement expressions ({...}), a GNU extension supported by GCC, Clang, and the Intel compilers [2].

4.3.2 Futures for Nested Parallelism

What is left to explain is the implementation of rts_force_future. Although futures can express more than nested parallelism [228], we experiment with a specialized implementation as sketched in Listing 4.4. This implementation operates under the assumption that if a task creates a future, the future's result will be needed later on.

Herlihy et al. have shown that "well-structured futures" incur fewer deviations from sequential execution than general, unstructured futures, resulting in better cache locality, as measured by the number of cache misses [115]. Critical to this is using futures in a disciplined way: making sure that every future is touched only once, either by the task that created it or by a descendant of the task that created it. As a result, a well-structured future is always created prior to being touched. Creating a future and passing it around is certainly possible, but not the use case that rts_force_future is trying to address, namely that of structured local-touch computations [115].

When forcing a future, we first check if the future's result is already computed, and if so, just return (lines 5–6). If not, we try to resolve the future by running all child tasks of the current task, until the future's result is available or no child tasks are left (lines 8–12). Finally, we have to assume that the corresponding task has been stolen and switch to work stealing (lines 14–22). We can safely return from rts_force_future in line 19 because send_steal_request preserves the invariant of one pending steal request per worker and will not generate a new request until the thief has successfully received a task and cleared the channel.

To summarize, there are three possibilities: (1) The future has been evaluated in parallel, and its result can be received. (2) The future has not been started yet, in which case it is evaluated sequentially. (3) The future is being evaluated by another worker, in which case other work is picked up until the result can be received.

Forcing a future may have the side effect of evaluating other futures. For example, imagine a worker creates three futures f1, f2, and f3, in this order, pushing each task onto the bottom of its deque, and then forces f2, which we assume has not been stolen.

Because f3's task sits on top of f2's task in the worker's deque, rts_force_future evaluates f3 before it evaluates f2, with the result that a subsequent touch of f3 will immediately return its value. Again, this is only reasonable if every future will be touched, and there is no priority involved. It would be undesirable to evaluate f3 if its result were not needed, or if f2 had a higher priority. A more flexible implementation would have to defer f3's task when touching f2.

Unrestricted work stealing in rts_force_future cannot guarantee that a task will return from the function as soon as the value it is waiting for is available, leading

1 void rts_force_future(Channel *chan, void *data, unsigned int size)
2 {
3     Task *task;