The main text is structured as follows:

Chapter 2 describes the notion of tasks in parallel programming, and explains why tasks are an effective abstraction on top of “threads and locks”. With a task abstraction comes the need for a runtime system that hides lower-level details, including thread management, task scheduling, and load balancing. We look at task pools, typical task pool implementations, and the scheduling technique of work stealing.

Chapter 3 describes a work-stealing scheduler that employs private task queues and shared channels for communication between worker threads. Channels provide a message passing abstraction that allows the scheduler to operate on any system that is capable of supporting message queues.

Chapter 4 deals with constructs for termination detection in task-parallel computations: a task barrier to wait for the completion of all tasks, and futures to support tree-structured computations, including strict fork/join parallelism in the style of Cilk. Both constructs are based on channels.

Chapter 5 focuses on fine-grained parallelism. We introduce a heuristic for switching stealing strategies at runtime and propose extensions to the lazy scheduling of splittable tasks that achieve performance comparable to that of dedicated loop schedulers.

Chapter 6 compares the performance of channel-based work stealing with three work-stealing schedulers that use concurrent deques, both lock-based and lock-free, on a set of task-parallel benchmarks and workloads, demonstrating that channel communication does not prevent efficient scheduling of fine-grained parallelism.

Chapter 7 concludes by summarizing our findings and proposing ideas for future work.

2 | Technical Background

This chapter provides the necessary background on task parallelism, task-parallel programming, and runtime systems based on work-stealing scheduling.

Recent years have witnessed the growing importance of parallel computing. Section 2.1 draws an important distinction: that between concurrency and parallelism. Tasks make it easier to express parallelism, without giving concurrency guarantees. Sections 2.2–2.4 deal with threads and tasks, the benefits of programming with tasks compared to programming with threads, and the task model we are going to use, which offers portable abstractions for writing task-parallel programs.

The supporting runtime system is responsible for mapping tasks to threads. Section 2.5 contrasts static with dynamic scheduling. Section 2.6 describes task pools, the data structures behind dynamic schedulers; task pool implementations can be centralized or distributed. Central task pools limit the scalability of dynamic schedulers. Distributed task pools solve this scalability problem, but add complexity in the form of load balancing. Section 2.7 elaborates on load balancing techniques, primarily on work stealing, and summarizes the pioneering results of Cilk that continue to influence the design and implementation of task schedulers [21]. Section 2.8 concludes with a list of task-parallel benchmarks and a few words about performance.

2.1 Concurrency and Parallelism

Due to the proliferation of microprocessors with increasing numbers of cores, concurrency and parallelism are becoming more and more important, as is the search for better programming abstractions than “threads and locks” [233]. While threads have long been used as building blocks for concurrent and parallel systems, higher-level abstractions tend to be designed with either concurrency or parallelism in mind [133].

Concurrency and parallelism are related but distinct concepts (see, for example, the introductory chapters in [156], [244], and [60], or refer to [213] for a thorough discussion of concurrency as used in different programming paradigms). In practice, however, the distinction is often obscured by a tendency to view both concurrency and parallelism as a means to improve performance, despite the fact that concurrency is a way to structure programs and not necessarily a recipe for parallel speedup [155, 109, 197].

Concurrency refers to multiple activities or threads of execution that overlap in duration [212]. Consider two threads T1 and T2. If one of the two threads, say T1, completes before the other thread, T2, starts running, T1 and T2 execute in sequence without interleaving. If T2 starts running before T1 completes, T1 and T2 happen logically at the same time; both threads have started and neither has completed [221]. We say T1 and T2 happen concurrently. It is left to the implementation whether T1 and T2 happen physically at the same time, that is, in parallel.
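
As a concrete illustration (a minimal sketch, not taken from the thesis text; the function and variable names are ours), T1 and T2 can be created with POSIX threads. Whether their steps interleave, and whether they run in parallel, is left to the operating system and the hardware:

    /* Minimal sketch: two concurrent threads whose executions may overlap.
     * Concurrency only means both can be in progress at the same time;
     * parallel execution is up to the OS scheduler and the hardware. */
    #include <pthread.h>
    #include <stdio.h>

    static void *work(void *arg)
    {
        const char *name = arg;
        for (int i = 0; i < 3; i++)
            printf("%s: step %d\n", name, i); /* steps of T1 and T2 may interleave */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        /* After these two calls, both threads have started ... */
        pthread_create(&t1, NULL, work, "T1");
        pthread_create(&t2, NULL, work, "T2");
        /* ... and neither is guaranteed to have completed until the joins return. */
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }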

Parallelism results from simultaneous execution of two or more independent computations. By contrast, concurrency describes the structure of systems, programs, and algorithms in terms of threads and their interactions through memory. In that sense, concurrency facilitates parallelism: a concurrent program is easily turned into a parallel program by executing two or more threads simultaneously, for example by binding threads to different cores of a multicore processor. When forced to run on a single core, a program can be concurrent without being parallel.
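
To sketch this last point (Linux-specific, using the glibc extension pthread_setaffinity_np; the helper pin_to_core is a hypothetical name of ours), each thread can bind itself to a core before running its body. Binding the threads to different cores permits parallel execution; binding both to the same core forces the program to remain concurrent but not parallel:

    #define _GNU_SOURCE            /* for pthread_setaffinity_np and CPU_* macros */
    #include <pthread.h>
    #include <sched.h>

    /* Bind the calling thread to the given core (hypothetical helper). */
    static void pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *work(void *arg)
    {
        pin_to_core(*(int *)arg); /* distinct cores: parallel; same core: concurrent only */
        /* ... thread body as in the previous sketch ... */
        return NULL;
    }

    int main(void)
    {
        int core1 = 0, core2 = 1; /* set core2 = 0 to force single-core execution */
        pthread_t t1, t2;
        pthread_create(&t1, NULL, work, &core1);
        pthread_create(&t2, NULL, work, &core2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }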

Multiple threads are often a prerequisite for parallel execution, but parallelism is not tied to threads. At the machine level, independent instructions may execute in parallel (instruction-level parallelism), and SIMD instructions operate on multiple data elements packed into vectors (data parallelism). Because concurrency can be seen as dealing with more than one thing at the same time, we might think of parallelism as an instance of concurrency [56, 224]; programs must exhibit concurrency at some level of abstraction to make use of parallelism. For this reason, concurrency is usually considered to be a more general concept than parallelism.
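
To make the SIMD case concrete, a small sketch using x86 SSE intrinsics (the function vec_add is our own example, and the length n is assumed to be a multiple of 4 for brevity); a single _mm_add_ps instruction adds four packed single-precision floats:

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Element-wise vector addition: c[i] = a[i] + b[i].
     * Each iteration processes four floats with one SIMD addition. */
    void vec_add(float *c, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);          /* load a[i..i+3] */
            __m128 vb = _mm_loadu_ps(&b[i]);          /* load b[i..i+3] */
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb)); /* store the four sums */
        }
    }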

Modern systems based on multicore processors benefit from both data and task parallelism. Data parallelism can be considered a subset of task parallelism [36]. It is possible to express a data-parallel computation as a task-parallel computation in which tasks are set up to perform the same operations on different elements of the data. Task and data parallelism are not mutually exclusive. Consider, for example, a blocked matrix multiplication that creates a task per matrix block and uses vector operations to speed up block-wise multiplications.
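
The following sketch of this example uses OpenMP tasks rather than the channel-based runtime developed later in this thesis; the matrix size N, the block size BS, and the function names are illustrative assumptions. Each output block becomes a task (task parallelism), while the innermost loop of the block kernel is data parallel and can be mapped to SIMD instructions by a vectorizing compiler:

    #define N  1024   /* matrix dimension (assumed) */
    #define BS 64     /* block size; assumed to divide N evenly */

    /* Multiply one BS x BS block: C[bi.., bj..] += A[bi.., bk..] * B[bk.., bj..].
     * The innermost loop over j is data parallel and vectorizable. */
    static void block_multiply(const float *A, const float *B, float *C,
                               int bi, int bj, int bk)
    {
        for (int i = bi; i < bi + BS; i++)
            for (int k = bk; k < bk + BS; k++)
                for (int j = bj; j < bj + BS; j++)
                    C[i * N + j] += A[i * N + k] * B[k * N + j];
    }

    void matmul_blocked(const float *A, const float *B, float *C)
    {
        #pragma omp parallel
        #pragma omp single
        for (int bi = 0; bi < N; bi += BS)
            for (int bj = 0; bj < N; bj += BS)
                /* one task per output block (task parallelism) */
                #pragma omp task firstprivate(bi, bj)
                for (int bk = 0; bk < N; bk += BS)
                    block_multiply(A, B, C, bi, bj, bk);
        /* all tasks complete at the implicit barrier ending the parallel region */
    }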
