
This section introduces the set of benchmarks and microbenchmarks on which we evaluate our runtime system throughout the remaining chapters. We do not claim that these programs (written in C) exhibit the best performance among all possible implementations. In particular, matrix multiplication and LU decomposition cannot compete with optimized routines from linear algebra libraries such as Intel's Math Kernel Library [8].

We are primarily interested in programs with challenging task structures and their parallel speedups, which we use to compare the performance of different runtime systems.

The variables that appear in the benchmark descriptions correspond to parameters that can be changed by the user.

SPC A Simple Producer-Consumer benchmark. A single worker produces n tasks, each running for t microseconds. This benchmark allows us to test how many concurrent consumers a single producer can sustain.
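The following is a minimal sketch of SPC's structure. It uses OpenMP tasks rather than our runtime's API, and the busy-waiting helper consume is a hypothetical stand-in for the per-task work; it illustrates the task structure, not the actual implementation.

/* Minimal SPC sketch using OpenMP tasks (not our runtime's API). */
#include <omp.h>

/* Hypothetical helper: busy-wait for roughly t_us microseconds. */
static void consume(double t_us)
{
    double start = omp_get_wtime();
    while ((omp_get_wtime() - start) * 1e6 < t_us)
        ;  /* spin */
}

void spc(int n, double t_us)
{
    #pragma omp parallel
    #pragma omp single            /* a single worker acts as the producer */
    {
        for (int i = 0; i < n; i++) {
            #pragma omp task      /* each task is picked up by an idle consumer */
            consume(t_us);
        }
        #pragma omp taskwait      /* wait until all n tasks have completed */
    }
}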

BPC A Bouncing Producer-Consumer benchmark, which is a producer-consumer benchmark with two kinds of tasks, producer and consumer tasks [74]. Each producer task creates another producer task followed by n consumer tasks, until a certain depth d is reached. Consumer tasks run for t microseconds. The smaller the values of n and t, the harder it becomes to exploit the available parallelism. This benchmark stresses the ability of the scheduler to find and load-balance work.
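A sketch of the producer side of BPC, again with OpenMP tasks standing in for our runtime's API; consume is the same hypothetical busy-wait helper as in the SPC sketch.

/* BPC sketch: each producer spawns one follow-up producer and n consumers. */
void consume(double t_us);            /* hypothetical busy-wait helper, see SPC sketch */

void bpc_produce(int depth, int n, double t_us)
{
    if (depth == 0)
        return;

    #pragma omp task                  /* spawn the next producer task */
    bpc_produce(depth - 1, n, t_us);

    for (int i = 0; i < n; i++) {
        #pragma omp task              /* n consumer tasks per producer */
        consume(t_us);
    }
    #pragma omp taskwait
}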

Treerec A simple tree-recursive computation, similar in structure to Fibonacci, which is often used to estimate task scheduling overheads [78]. Each task n ≥ 2 creates two child tasks n−1 and n−2 and waits for their completion. Leaf tasks n < 2 perform some computation for t microseconds before returning. This simulates a cut-off, as if tasks were inlined after reaching a certain recursion depth.
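A sketch of Treerec, with OpenMP tasks standing in for our runtime's API and the hypothetical consume helper from the SPC sketch simulating leaf work.

/* Treerec sketch: binary task recursion with work only at the leaves. */
void consume(double t_us);          /* hypothetical busy-wait helper, see SPC sketch */

void treerec(int n, double t_us)
{
    if (n < 2) {                    /* leaf task: simulate work below the cut-off */
        consume(t_us);
        return;
    }
    #pragma omp task
    treerec(n - 1, t_us);
    #pragma omp task
    treerec(n - 2, t_us);
    #pragma omp taskwait            /* wait for both child tasks */
}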

Matmul A blocked matrix multiplication of two N×N matrices of doubles, each partitioned into (N/B)² B×B blocks [139]. The block size B determines the task granularity and must be a divisor of N.
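A sketch of the blocked multiplication, with one task per block multiply-accumulate and one barrier-terminated phase per value of k. OpenMP tasks and the block-contiguous storage layout are assumptions of this sketch, not our runtime's API or data layout.

/* Blocked Matmul sketch (OpenMP tasks; storage layout is an assumption). */
enum { N = 2048, B = 32, NB = N / B };

static double A[NB][NB][B][B], Bmat[NB][NB][B][B], C[NB][NB][B][B];

static void block_mac(double a[B][B], double b[B][B], double c[B][B])
{
    for (int i = 0; i < B; i++)
        for (int k = 0; k < B; k++)
            for (int j = 0; j < B; j++)
                c[i][j] += a[i][k] * b[k][j];
}

void matmul(void)
{
    for (int k = 0; k < NB; k++) {            /* NB phases */
        #pragma omp parallel
        #pragma omp single
        {
            for (int i = 0; i < NB; i++)
                for (int j = 0; j < NB; j++) {
                    #pragma omp task          /* one task per output block and phase */
                    block_mac(A[i][k], Bmat[k][j], C[i][j]);
                }
        }                                     /* implicit barrier ends the phase */
    }
}

With N = 2048 and B = 32, this structure yields (N/B)³ = 262 144 tasks in 64 phases, matching Table 2.1.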

LU A blocked LU decomposition of a sparse N×N matrix of doubles, partitioned into (N/B)² B×B blocks. The block size B determines the task granularity and must be a divisor of N. The sparsity of the matrix, that is, the fraction of blocks that contain only zeros, increases with the number of blocks in each dimension. Blocks that contain only zeros are not allocated. The code is based on the OpenMP version from the Barcelona OpenMP Tasks Suite (BOTS) [78, 20].
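The per-step task structure can be sketched as follows. OpenMP tasks stand in for our runtime's API, and lu0, fwd, bdiv, and bmod are hypothetical block kernels (diagonal factorization, row and column updates, and trailing-matrix update); only the structure, not the numerics, is shown.

/* Sparse blocked LU sketch: two barrier-terminated phases per step k. */
#include <stdlib.h>

void lu0(double *diag);                              /* hypothetical block kernels */
void fwd(double *diag, double *row);
void bdiv(double *diag, double *col);
void bmod(double *row, double *col, double *inner);

void sparse_lu(int nb, int bs, double *blocks[nb][nb])
{
    for (int k = 0; k < nb; k++) {
        lu0(blocks[k][k]);                           /* factor the diagonal block */

        #pragma omp parallel
        #pragma omp single
        {
            for (int j = k + 1; j < nb; j++)
                if (blocks[k][j]) {
                    #pragma omp task                 /* update row k */
                    fwd(blocks[k][k], blocks[k][j]);
                }
            for (int i = k + 1; i < nb; i++)
                if (blocks[i][k]) {
                    #pragma omp task                 /* update column k */
                    bdiv(blocks[k][k], blocks[i][k]);
                }
        }                                            /* barrier: first phase of step k */

        #pragma omp parallel
        #pragma omp single
        {
            for (int i = k + 1; i < nb; i++)
                for (int j = k + 1; j < nb; j++)
                    if (blocks[i][k] && blocks[k][j]) {
                        if (!blocks[i][j])           /* allocate a block on first update */
                            blocks[i][j] = calloc((size_t)bs * bs, sizeof(double));
                        #pragma omp task             /* trailing-matrix update */
                        bmod(blocks[i][k], blocks[k][j], blocks[i][j]);
                    }
        }                                            /* barrier: second phase of step k */
    }
}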

Quicksort A recursive algorithm that performs an in-place sort of an array of n integers by partitioning it into two sub-arrays according to some pivot element and recursively sorting the sub-arrays. The pivot is chosen as the median of the first, middle, and last array elements. For sub-arrays of ≤ 100 elements, the algorithm falls back to using insertion sort, which is usually faster on small inputs.
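A sketch of the parallel quicksort, with OpenMP tasks standing in for our runtime's API; the Hoare-style partition shown here is one common way to implement the in-place step, and the initial call is assumed to be made from within a parallel/single region.

/* Quicksort sketch: median-of-three pivot, insertion-sort cut-off, one task per sub-array. */
enum { CUTOFF = 100 };

static void insertion_sort(int *a, int n)
{
    for (int i = 1; i < n; i++) {
        int x = a[i], j = i - 1;
        while (j >= 0 && a[j] > x) { a[j + 1] = a[j]; j--; }
        a[j + 1] = x;
    }
}

static int median3(int a, int b, int c)       /* median of three values */
{
    if (a < b)
        return b < c ? b : (a < c ? c : a);
    return a < c ? a : (b < c ? c : b);
}

void quicksort(int *a, int n)
{
    if (n <= CUTOFF) { insertion_sort(a, n); return; }

    int pivot = median3(a[0], a[n / 2], a[n - 1]);
    int i = 0, j = n - 1;
    while (i <= j) {                          /* in-place partition around the pivot */
        while (a[i] < pivot) i++;
        while (a[j] > pivot) j--;
        if (i <= j) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; j--; }
    }

    #pragma omp task                          /* sort both sub-arrays in parallel */
    quicksort(a, j + 1);
    #pragma omp task
    quicksort(a + i, n - i);
    #pragma omp taskwait
}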

Cilksort A recursive algorithm inspired by [35] that sorts an array of n integers by dividing it into four sub-arrays, recursively sorting the sub-arrays, and merging the sorted results back together in a divide-and-conquer fashion to expose additional parallelism. For sub-arrays of ≤ 1024 elements, the algorithm performs a sequential quicksort with median-of-three pivot selection and partitioning, and for sub-arrays of ≤ 20 elements, it falls back to using insertion sort. The code is based on the version distributed with MIT Cilk [21].
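The divide step can be sketched as follows. The sketch uses OpenMP tasks instead of our runtime's API and shows the merges as calls to a hypothetical sequential seqmerge for brevity; the actual code also parallelizes the merge step recursively (the cilkmerge tasks in Table 2.1). seqquick is a hypothetical sequential quicksort with the cut-offs described above.

/* Cilksort sketch: sort four quarters in parallel, then merge pairwise. */
void seqquick(int *a, int n);                               /* hypothetical */
void seqmerge(int *a, int na, int *b, int nb, int *out);    /* hypothetical */

void cilksort(int *a, int *tmp, int n)
{
    if (n <= 1024) { seqquick(a, n); return; }   /* sequential quicksort cut-off */

    int q = n / 4;                               /* split into four quarters */
    #pragma omp task
    cilksort(a,         tmp,         q);
    #pragma omp task
    cilksort(a + q,     tmp + q,     q);
    #pragma omp task
    cilksort(a + 2*q,   tmp + 2*q,   q);
    #pragma omp task
    cilksort(a + 3*q,   tmp + 3*q,   n - 3*q);
    #pragma omp taskwait

    #pragma omp task                             /* merge pairs of sorted quarters */
    seqmerge(a,       q, a + q,   q,       tmp);
    #pragma omp task
    seqmerge(a + 2*q, q, a + 3*q, n - 3*q, tmp + 2*q);
    #pragma omp taskwait

    seqmerge(tmp, 2*q, tmp + 2*q, n - 2*q, a);   /* final merge back into a */
}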

NQueens A recursive backtracking algorithm that finds all possible solutions to the N-Queens problem of placing N queens on an N×N chessboard so that no queen can attack any other queen. The code is based on the OpenMP version from the BOTS project [78, 20], which in turn is derived from the version distributed with MIT Cilk [21].
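A sketch of the backtracking search, with OpenMP tasks standing in for our runtime's API; unlike the BOTS code, which applies a cut-off, this sketch spawns a task for every candidate placement. A hypothetical driver would initialize count to zero and call nqueens(n, 0, pos, &count) from within a parallel/single region.

/* NQueens sketch: pos[i] holds the column of the queen placed in row i. */
#include <string.h>

static int ok(int row, const char *pos)      /* does the queen in 'row' conflict? */
{
    for (int i = 0; i < row; i++) {
        int d = pos[row] - pos[i];
        if (d == 0 || d == row - i || d == -(row - i))
            return 0;                        /* same column or same diagonal */
    }
    return 1;
}

void nqueens(int n, int row, char *pos, int *count)
{
    if (row == n) {                          /* all n queens placed: one solution */
        #pragma omp atomic
        (*count)++;
        return;
    }
    for (int col = 0; col < n; col++) {
        #pragma omp task firstprivate(col)   /* explore each column in a new task */
        {
            char branch[32];                 /* private board copy; assumes n <= 32 */
            memcpy(branch, pos, (size_t)row);
            branch[row] = (char)col;
            if (ok(row, branch))
                nqueens(n, row + 1, branch, count);
        }
    }
    #pragma omp taskwait                     /* keep pos alive until children finish */
}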

UTS The Unbalanced Tree Search: an algorithm that counts all nodes in a highly unbalanced tree [184]. The tree is generated implicitly; each child node is constructed from the SHA-1 hash of the parent node and child index. The code is based on the Pthreads version from the official UTS repository [25].
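The node construction can be sketched conceptually as follows, using OpenSSL's SHA1; in the actual code the hash also seeds the random draw that determines each node's number of children according to the tree's branching distribution.

/* UTS node sketch: a child's state is the SHA-1 hash of its parent's state
 * plus the child index, so the tree is generated deterministically on the fly. */
#include <openssl/sha.h>
#include <string.h>

typedef struct { unsigned char hash[SHA_DIGEST_LENGTH]; } node_t;

static node_t spawn_child(const node_t *parent, int index)
{
    unsigned char buf[SHA_DIGEST_LENGTH + sizeof index];
    node_t child;

    memcpy(buf, parent->hash, SHA_DIGEST_LENGTH);
    memcpy(buf + SHA_DIGEST_LENGTH, &index, sizeof index);
    SHA1(buf, sizeof buf, child.hash);   /* child depends only on parent and index */
    return child;
}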

All benchmarks have the form init(); compute(); fini();, with application-specific initialization and finalization in init/fini, as well as the required calls to start and stop the runtime system. When we talk about a benchmark's execution time, we mean the execution time of compute.
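A skeleton of this structure is shown below; the empty function bodies stand in for application- and runtime-specific code, and omp_get_wtime is used here simply as a convenient wall-clock timer.

/* Benchmark skeleton: only the execution time of compute is reported. */
#include <stdio.h>
#include <omp.h>

static void init(void)    { /* allocate and initialize inputs, start the runtime */ }
static void compute(void) { /* the parallel computation being measured */ }
static void fini(void)    { /* verify results, stop the runtime, free memory */ }

int main(void)
{
    init();

    double start = omp_get_wtime();
    compute();                               /* only this part is timed */
    double elapsed = omp_get_wtime() - start;

    fini();
    printf("compute took %.3f s\n", elapsed);
    return 0;
}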


Benchmark                     Tasks         Median task length, IQR   Phases   Category
Matmul (N = 2048, B = 32)         262 144   77 µs, 1 µs                   64   flat
LU (N = 4096, B = 64)              23 904   556 µs, 15 µs                128   flat
    fwd                             1 024   363 µs, 3 µs
    bdiv                            1 024   332 µs, 4 µs
    bmod                           21 856   557 µs, 17 µs
Quicksort (n = 10^8)            1 697 314   5 µs, 11 µs                    1   recursive
Cilksort (n = 10^8)             1 070 421   34 µs, 57 µs                   1   recursive
    cilksort                       87 380   82 µs, 294 µs
    cilkmerge                     983 041   19 µs, 23 µs
NQueens (N = 14)               27 358 552   7 µs, 12 µs                    1   recursive
UTS T1L                       102 181 081   <1 µs, 1 µs                    1   recursive
UTS T2L                        96 793 509   1 µs, 1 µs                     1   recursive
UTS T3L                       111 345 630   <1 µs, 1 µs                    1   recursive

Table 2.1: Workload characteristics of selected benchmarks. LU and Cilksort comprise different types of tasks, as itemized above. Task lengths were measured on one core of the AMD Opteron multiprocessor, using GCC 4.9.1 with all optimizations enabled (-O3). Parallel phases end with barrier synchronization to ensure that all tasks have run to completion.

2.8.1 Speedup and Efficiency

Sequential execution can mean two things: running a sequential version of compute or running a parallelized version of compute, but using a single thread to do so. To avoid confusion, we follow the convention of Cilk and denote sequential execution times with T_S and T_1. Unlike T_S, T_1 includes the overhead of parallelization, so that T_1 ≥ T_S.

We define the speedup over sequential execution as

    S_P = T_S / T_P,                                    (2.1)

where T_P is the parallel execution time of compute with P processors or threads [82].

An alternative definition of speedup over sequential execution would be

    S_P = T_1 / T_P.                                    (2.2)

It is usually safe to assume that T_S/T_P gives a lower bound for T_1/T_P.

Intuitively, a program is considered scalable if additional processors speed up the program's execution. If we increase the number of processors by a factor of N, and the program runs roughly N times faster as a result, we speak of linear scaling. Efficiency is defined as the speedup divided by the number of processors used to obtain that speedup [82]:

    E_P = S_P / P.                                      (2.3)

Expressed as a value between zero and one, efficiency indicates how well parallel processors are utilized, compared to how much effort is spent on communication and synchronization. High efficiency means good utilization with little overhead. Programs that scale linearly have an efficiency close to 1.
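For example, with illustrative numbers T_S = 40 s and T_P = 2.5 s on P = 20 processors, equation (2.1) gives S_P = 16 and equation (2.3) gives E_P = 16/20 = 0.8, that is, the processors are utilized at 80% efficiency.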

A note on the graphs in this thesis: when we report execution times, speedups, or efficiencies, we plot the median of ten program runs, except where noted, along with the 10th and 90th percentiles, where visible. The difference between the 10th and 90th percentiles, also known as the interdecile range, covers the central 80% of a data set and is a useful measure of spread around the median. As such, it gives a sense of the variability that we observe between program runs, after excluding minimum and maximum execution times.
