
Task-parallel programming raises the level of abstraction from managing threads to specifying tasks, which may or may not run in parallel, depending on decisions made at runtime. These decisions are the responsibility of the task scheduler, which determines the mapping from tasks to worker threads. Consequently, much of the efficiency of using tasks depends on the scheduler and its ability to handle workloads, which differ in the number, granularity, and order of tasks, including potential dependencies. The granularity of a task is defined by the amount of computation in relation to communication/synchronization. Tasks that compute very little are fine grained. Dividing a workload into many fine-grained tasks is good for load balancing and scalability, as long as the cost of scheduling does not outweigh the benefit of parallel execution. The finer the granularity, the more attention must be paid to low overheads.
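
To make the granularity trade-off concrete, the following sketch (a minimal example in C with OpenMP tasks, not taken from this chapter) uses a sequential cutoff so that each task carries enough computation to amortize the cost of creating and scheduling it; the function fib and the cutoff value of 20 are illustrative choices.

    #include <omp.h>

    #define CUTOFF 20   /* illustrative threshold; normally tuned per workload */

    /* Recursive computation split into tasks.  Below the cutoff, the remaining
     * work is done sequentially, coarsening the granularity and keeping the
     * scheduling overhead small relative to the computation per task.
     * fib() is assumed to be called from within a parallel region, e.g. under
     * #pragma omp parallel followed by #pragma omp single. */
    long fib(int n)
    {
        long x, y;
        if (n < 2)
            return n;
        if (n < CUTOFF)                    /* too fine grained: stop creating tasks */
            return fib(n - 1) + fib(n - 2);
        #pragma omp task shared(x)
        x = fib(n - 1);
        #pragma omp task shared(y)
        y = fib(n - 2);
        #pragma omp taskwait               /* wait for both child tasks */
        return x + y;
    }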

Depending on workload parameters, we roughly distinguish between static and dynamic workloads. Static workloads are predictable, not subject to change at runtime, and lend themselves to static scheduling. Dynamic workloads depend on input or other program parameters not known until runtime and tend to vary as the computation unfolds, making static scheduling impractical in most cases.

2.5.1 Static Scheduling

The scheduling algorithm determines how tasks are mapped to threads for execution.

In static scheduling, the assignment of tasks to threads is fixed, either as part of the compilation process or at runtime. As an example, consider the static scheduling (schedule(static)) of loop iterations in OpenMP: a parallel loop of N iterations is compiled such that every thread executes N/T iterations, assuming N is a multiple of T, the number of threads. Thread 0 will then execute the first N/T iterations, thread 1 will execute the next N/T iterations, and so on. This division incurs (almost) no runtime overhead, but is only efficient as long as each thread's N/T iterations amount to roughly 1/T of the total work. If that is not the case and load imbalance arises, static scheduling has no means to counter it.
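
As an illustration, a statically scheduled OpenMP loop in C might look as follows; the function name, array arguments, and loop body are placeholders rather than code from this chapter.

    /* schedule(static) with no chunk size divides the iteration space into
     * contiguous blocks of roughly n/T iterations, one block per thread,
     * with (almost) no runtime overhead.  This balances well only if every
     * iteration performs about the same amount of work. */
    void scale_static(double *a, const double *b, double s, long n)
    {
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++)
            a[i] = s * b[i];
    }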

2.5.2 Dynamic Scheduling

Dynamic scheduling, on the other hand, assigns tasks to threads as needed, thereby maintaining load balance. The price to pay for dynamic scheduling is increased runtime overhead. Returning to the example of scheduling parallel loops in OpenMP, it is a good idea to use dynamic scheduling (schedule(dynamic)) whenever the amount of work per iteration is unknown or varies widely. The dynamic scheduler keeps track of iterations in a shared location, so that idle workers can claim new iterations and come back for more once they have finished their current batch of work. The higher runtime overhead compared to static scheduling is the result of fetching iterations, including contending for iterations with other threads.
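
A corresponding dynamically scheduled loop could look like the following sketch; work() stands in for an iteration of unknown or highly variable cost, and the chunk size of 16 is an assumed tuning parameter.

    /* Hypothetical per-iteration workload of unknown or varying cost. */
    extern void work(long i);

    /* schedule(dynamic, 16) lets idle threads fetch chunks of 16 iterations
     * from a shared counter at runtime, which balances the load at the cost
     * of extra overhead for fetching and contending for iterations. */
    void process_dynamic(long n)
    {
        #pragma omp parallel for schedule(dynamic, 16)
        for (long i = 0; i < n; i++)
            work(i);
    }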

The shared location can be generalized to a task pool, a data structure that holds all tasks that are ready to run, not necessarily in any particular order. Tasks are added to the task pool and retrieved for execution [203].
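
To illustrate the idea, a deliberately simple task pool can be sketched as a mutex-protected array of ready tasks; the types and names below (task_t, task_pool_t, pool_put, pool_take) are hypothetical, and a real runtime would use a far more scalable data structure.

    #include <pthread.h>
    #include <stdbool.h>

    typedef struct {
        void (*fn)(void *);   /* task body */
        void *arg;            /* task argument */
    } task_t;

    #define POOL_CAP 1024

    typedef struct {
        task_t tasks[POOL_CAP];   /* ready tasks, in no particular order */
        int count;
        pthread_mutex_t lock;
    } task_pool_t;

    void pool_init(task_pool_t *p)
    {
        p->count = 0;
        pthread_mutex_init(&p->lock, NULL);
    }

    /* Add a ready task; returns false if the pool is full. */
    bool pool_put(task_pool_t *p, task_t t)
    {
        pthread_mutex_lock(&p->lock);
        bool ok = p->count < POOL_CAP;
        if (ok)
            p->tasks[p->count++] = t;
        pthread_mutex_unlock(&p->lock);
        return ok;
    }

    /* Retrieve any ready task; returns false if the pool is empty. */
    bool pool_take(task_pool_t *p, task_t *out)
    {
        pthread_mutex_lock(&p->lock);
        bool ok = p->count > 0;
        if (ok)
            *out = p->tasks[--p->count];
        pthread_mutex_unlock(&p->lock);
        return ok;
    }

Workers repeatedly call pool_take and execute the retrieved task; since the pool holds only ready tasks, the order of removal is irrelevant for correctness.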

2.5.3 Task Graphs

Tasks and their dependencies form a task graph: a directed acyclic graph (DAG), where every node represents a task, and a directed edge from node A to node B indicates that B depends on results computed by A [34]. Figure 2.1 shows an example task graph that can be scheduled with the help of a task pool. We use notation similar to that of Blelloch et al. [46]: each vertex v represents a task of duration t(v). The sum of all t(v) is the work of the graph, or the time it takes to execute all tasks sequentially. A dependency (u, v) between two tasks u and v means that u must complete before v can start execution. Hence, v is not ready until all the tasks it depends on have been completed.

Figure 2.1: Example task graph showing tasks (vertices) with their dependencies (edges). The graph on the right-hand side indicates that explicit synchronization may be used to schedule tasks in an order that respects their dependencies, though possibly at the cost of losing parallelism, as in this example. The resulting schedule is non-greedy [48], unlike the schedule on the left-hand side.

For simplicity, we assume a perfect task pool with zero overhead for task insertion and removal, so that tasks are scheduled as soon as they enter the task pool.
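
One common way for a runtime to decide when a task becomes ready is a per-task counter of outstanding dependencies; the sketch below reuses the hypothetical pool_put and task_t from the pool sketch above and is not meant to describe any particular runtime system.

    /* Each task records how many predecessors are still unfinished and which
     * successors depend on it.  When a task completes, it decrements the
     * counters of its successors; a successor whose counter reaches zero has
     * all its dependencies satisfied and enters the task pool. */
    typedef struct dep_task {
        task_t task;                 /* the runnable part (see pool sketch) */
        int    num_deps;             /* unfinished predecessors */
        struct dep_task **succs;     /* tasks that depend on this one */
        int    num_succs;
    } dep_task_t;

    /* Called by the worker after t's body has finished executing. */
    void task_completed(task_pool_t *pool, dep_task_t *t)
    {
        for (int i = 0; i < t->num_succs; i++) {
            dep_task_t *s = t->succs[i];
            /* GCC/Clang atomic builtin; returns the decremented value. */
            if (__atomic_sub_fetch(&s->num_deps, 1, __ATOMIC_ACQ_REL) == 0)
                pool_put(pool, s->task);     /* now ready: schedule it */
        }
    }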

Figure 2.1 includes two P-processor schedules, P ≥ 3. A schedule is a sequence (V1, V2, ..., VT), where Vi is the set of tasks that are running during time step i. For a complete formal definition, we refer the reader to [46]. The greedy schedule on the left-hand side assumes that tasks are created and scheduled as soon as their dependencies are satisfied [48]. Task H, for example, depends on both E and F and is guaranteed to be scheduled during the time step following the completion of either E or F, whichever finishes last. More generally, if E is scheduled during time step i and F is scheduled during time step j, H will be scheduled during time step max(i + t(E), j + t(F)).

This does not hold for the schedule on the right-hand side, which relies on explicit synchronization to ensure that tasks are scheduled in an order that respects their dependencies. A task may be ready, but explicit synchronization can defer its creation until other, possibly unrelated tasks have finished execution. The potential loss of parallelism is often accepted as a trade-off for a simpler programming model and a runtime system that avoids task dependencies and the associated overhead of determining when tasks are ready to be scheduled.
