
3.1.5 Group Z Projects

The general administration tasks of the Invasive Computing project are performed by the Z project groups. Contacts with important individuals, research sites and companies are managed by these groups. These groups also provide the necessary coordination for consolidated results, such as the demonstration platform produced by the Z2 project.

Z2: Validation and Demonstrator

The main goal of this project is to provide an FPGA-based hardware platform for validations and demonstrations. Contributions from multiple other projects are integrated into this platform.


4 Related Work

A broad discussion of related work is presented in this chapter. First, an incomplete overview of programming languages, interfaces and resource managers that do not support elastic execution is presented. In spite of their lack of support for elastic execution, which means that these works are not alternatives to what is proposed in this document, their overview is provided for completeness. Afterwards, closely related works that support resource-elasticity are discussed in detail.

4.1 Programming Languages and Interfaces without Elastic Execution Support

The work presented in this document is related to programming models and resource managers that target applications with high synchronization requirements. In this section, an overview is provided of related works that are not direct replacements for what is proposed in the rest of this document, but that could potentially be extended to support resource-elasticity in the future. First, programming languages and interfaces that target only parallel shared memory systems are discussed. Afterwards, programming languages and interfaces for distributed memory that only support resource-static execution are covered. Finally, solutions for cloud and grid computing systems that support resource-elasticity are described and compared to HPC solutions.

4.1.1 Parallel Shared Memory Systems

Shared memory is a term used to indicate the presence of a single address space across all processing elements. There are several parallel programming languages and interfaces that target shared memory systems exclusively. These have increased in number and importance due to the growth of parallelism in shared memory systems in recent years.

These languages and interfaces are related to this work, since each node that is managed in a distributed memory HPC system is itself a parallel shared memory system. Brief descriptions of the most widely recognized languages and interfaces that target parallel shared memory systems are presented here. Additionally, a brief discussion on hybrid programming models with MPI is provided.

Open Multi-Processing (OpenMP)

Open Multi-Processing (OpenMP) [9, 76, 191] is an Application Programming Interface (API) that can be used by programmers to parallelize serial applications. The API is standardized by the OpenMP Architecture Review Board and is currently at version 4.5. The API is provided as pragma directives for Fortran, C and C++. Additionally, environment variables and compiler extensions are also defined in the specification. OpenMP has enjoyed support from several compilers over the years, both free and commercial, such as: GCC, IBM XL, Intel, PGI, Cray, Clang [4, 17], and others.

The pragmas allow for the annotation of regions that can run in parallel in the source code of programs. Additionally, developers can provide instructions that specify how these parallel regions should be executed. For example, a region can be executed following a fork-join threading model or, since version 3.0 of the specification, following a task-based model [35]. OpenMP is known to allow for relatively simple conversions of serial Fortran and C programs into threaded programs. This makes it an attractive choice for software projects where significant resources have already been dedicated to the development and validation of preexisting source code.
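
As an illustration of these two execution styles, the following minimal C++ sketch shows a fork-join parallel loop and a task-based tree traversal. The data structures and function names are placeholders chosen for this example; the sketch is not code from the OpenMP specification.

#include <omp.h>

// Fork-join: the loop iterations are divided among a team of threads.
void fork_join_example(double *data, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        data[i] *= 2.0;
}

struct Node { Node *left; Node *right; double value; };

// Task-based model (available since OpenMP 3.0): each subtree becomes a
// task that the runtime schedules onto the available threads.
void task_example(Node *node) {
    if (node == nullptr) return;
    #pragma omp task
    task_example(node->left);
    #pragma omp task
    task_example(node->right);
    node->value *= 2.0;
    #pragma omp taskwait   // wait for the child tasks of this task
}

void run_tasks(Node *root) {
    #pragma omp parallel
    #pragma omp single     // a single thread creates the initial task
    task_example(root);
}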

Within the Invasive Computing research project, extensions to OpenMP and a runtime system have been developed to support resource-aware computing [100, 118]. The goal is to adjust the computing resources, such as CPU cores, allocated to OpenMP applications running simultaneously in shared memory systems, based on their scalability. This previous research targeted parallel shared memory systems, while the research presented in this document targets distributed memory systems.

POSIX Threads (PThreads)

The POSIX [1, 172] standard defines a thread API and model that is usually referred to as POSIX Threads or PThreads [169] for short. The API is supported by several Unix and Unix-like operating systems, such as: FreeBSD, NetBSD, OpenBSD, Solaris, Linux, Mac OS, etc. There are also implementations available for Microsoft Windows and other operating systems.

The API provides operations for thread management, such as: creation, termination, synchronization, scheduling, etc. For synchronization, there are mutexes, joins and condition variables. Mutexes are used to ensure exclusive access to resources, such as memory or external devices. Joins are operations used by one thread to wait for other threads to complete. A thread may create multiple other threads and then wait on them with a join operation, following a fork-join pattern. Finally, condition variables can be used to make threads wait until certain conditions are met.
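
The following minimal C++ sketch illustrates the fork-join pattern with the calls mentioned above: a main thread creates several worker threads that update a shared counter under a mutex and then joins them. The thread count and variable names are illustrative only.

#include <pthread.h>
#include <cstdio>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

// Each worker increments the shared counter under the mutex.
void *worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&counter_lock);
    counter++;
    pthread_mutex_unlock(&counter_lock);
    return nullptr;
}

int main() {
    const int num_threads = 4;
    pthread_t threads[num_threads];

    for (int i = 0; i < num_threads; i++)      // fork
        pthread_create(&threads[i], nullptr, worker, nullptr);

    for (int i = 0; i < num_threads; i++)      // join
        pthread_join(threads[i], nullptr);

    std::printf("counter = %ld\n", counter);
    return 0;
}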

Cilk Plus

Cilk Plus [147, 210, 148] is a parallel programming language based on the C language with extensions for the definition of parallel loops and fork-join patterns. The language and its runtime have been improved over time, and Cilk Plus has become a commercial product.

The Cilk Plus language is designed to expose parallelism in the source code. Once the parallelism of a program is defined, a runtime system can schedule its work automatically on parallel shared memory systems. The spawn keyword is used for the creation of tasks, while the sync keyword is used to wait for them. These can be used for the creation of fork-join patterns. Cilk Plus also extends C with keywords for the creation of parallel loops, the specification of reduction operations, the creation of arrays and simplifications to array accesses, among other things.
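
A small sketch of these constructs is shown below. In Cilk Plus the keywords are spelled cilk_spawn, cilk_sync and cilk_for, and compiling the example requires a compiler with Cilk Plus support (for example, older GCC or Intel compilers). The functions fib and scale are illustrative.

#include <cilk/cilk.h>

// Fork-join recursion: the first recursive call is spawned as a task,
// the second runs in the caller, and cilk_sync waits for the spawn.
long fib(long n) {
    if (n < 2) return n;
    long a = cilk_spawn fib(n - 1);
    long b = fib(n - 2);
    cilk_sync;
    return a + b;
}

// Parallel loop: iterations may be executed by different workers.
void scale(double *data, long n, double factor) {
    cilk_for (long i = 0; i < n; i++)
        data[i] *= factor;
}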

The schedulers found on Cilk Plus runtime systems implement work-stealing [22] techniques that can effectively balance the load across executing units. With work-stealing, threads that are idle due to finishing their own tasks early can take and execute entries from the work queues of other threads, effectively stealing them.
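
The following simplified C++ sketch only illustrates the idea of work-stealing; it is not the Cilk Plus scheduler, and all types and function names are hypothetical. Each worker owns a double-ended queue of tasks, pops work from one end itself, and steals from the other end of a victim's queue when idle.

#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <vector>

struct Worker {
    std::deque<std::function<void()>> tasks;
    std::mutex lock;
};

// The owner takes tasks from the front of its own deque.
std::optional<std::function<void()>> pop_own(Worker &w) {
    std::lock_guard<std::mutex> g(w.lock);
    if (w.tasks.empty()) return std::nullopt;
    auto t = std::move(w.tasks.front());
    w.tasks.pop_front();
    return t;
}

// An idle worker steals a task from the back of a victim's deque.
std::optional<std::function<void()>> steal(Worker &victim) {
    std::lock_guard<std::mutex> g(victim.lock);
    if (victim.tasks.empty()) return std::nullopt;
    auto t = std::move(victim.tasks.back());
    victim.tasks.pop_back();
    return t;
}

// Simplified worker loop: run own tasks first, then try to steal.
void worker_loop(std::vector<Worker> &workers, std::size_t self) {
    while (true) {
        auto task = pop_own(workers[self]);
        if (!task) {
            for (std::size_t v = 0; v < workers.size() && !task; ++v)
                if (v != self) task = steal(workers[v]);
        }
        if (!task) break;   // nothing left anywhere: stop (real runtimes retry)
        (*task)();          // execute the stolen or local task
    }
}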


Threading Building Blocks (TBB)

Threading Building Blocks (TBB) [184, 73] is a C++ template library that provides high-level constructs that ease the creation of parallel programs, such as: parallel loops, reductions, pipelines, queues, vectors, maps, memory allocations, mutexes, atomic operations, etc. Similarly to Cilk Plus, it attempts to separate the definition of tasks from their actual execution in a parallel system. Its schedulers also implement work-stealing techniques.
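
The following minimal sketch shows two of these constructs, a parallel loop and a reduction, using the classic tbb headers. The functions scale and sum are illustrative only.

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/parallel_reduce.h>
#include <functional>
#include <vector>

// Parallel loop: the range is split into chunks that the work-stealing
// scheduler distributes among worker threads.
void scale(std::vector<double> &data, double factor) {
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, data.size()),
        [&](const tbb::blocked_range<std::size_t> &r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                data[i] *= factor;
        });
}

// Reduction: partial sums are computed per chunk and then combined.
double sum(const std::vector<double> &data) {
    return tbb::parallel_reduce(
        tbb::blocked_range<std::size_t>(0, data.size()), 0.0,
        [&](const tbb::blocked_range<std::size_t> &r, double acc) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                acc += data[i];
            return acc;
        },
        std::plus<double>());
}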

Hybrid Programming Models

Shared memory programming models are numerous and are generally orthogonal to message passing. The passing of messages is unnecessary when data can be accessed directly in the same address space, without the need for copies. The passing of messages is needed across nodes in distributed memory systems. There have been research efforts to bring the benefits of shared memory to MPI applications with minimal modifications to application source code and libraries [94, 168].

The combinations of message passing and shared memory programming models are usually referred to as hybrid programming models. Message passing with MPI can be combined with shared memory APIs and libraries, such as MPI and OpenMP [179, 56, 222, 141], MPI and POSIX Threads [173], etc. The idea is that a shared memory programming model can better abstract the parts of a program that map to the hardware inside of a node, while MPI can better abstract the operations that take place across nodes.
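
The following minimal sketch illustrates the hybrid MPI and OpenMP combination: MPI is initialized with thread support, an OpenMP loop computes a per-process partial sum using the cores of a node, and MPI combines the partial sums across nodes. The buffer sizes and variable names are illustrative.

#include <mpi.h>
#include <omp.h>
#include <vector>

int main(int argc, char **argv) {
    // MPI_THREAD_FUNNELED is requested because only the master thread
    // of each process performs MPI calls in this example.
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Intra-node parallelism: each MPI process computes a local sum
    // using the threads of its node.
    std::vector<double> local(1000, rank + 1.0);
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+ : local_sum)
    for (long i = 0; i < (long)local.size(); i++)
        local_sum += local[i];

    // Inter-node communication: combine the per-process results.
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}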

Shared memory programming models are not alternatives to the work presented in this document and are instead complementary. In this work, resource adaptations are done at the node level, meaning that resources are added or removed in units of whole nodes. Because of this, resource adaptations have no impact on any shared memory language or interface used for intra-node parallel programming in hybrid scenarios.