

3.1.2 Parallel runtime environment

The basic layer implementing the primitive operations builds on the GHC runtime environment and manages communication channels and thread termination.

The GHC runtime environment (RTE) has been extended such that it can execute in parallel on clusters. Furthermore, small changes have been made to the compilation process, so that the compiled program is accompanied by a run script to make it execute in parallel with suitable parameters. We briefly summarise and systemise the extensions made to the RTE.

Communication infrastructure inside the runtime system is concentrated in one single “Message Passing System” interface (file MPSystem.h). The module provides only very basic functionality, assumed to be available in virtually any middleware solution or easily implemented by hand, which enables different implementations on different hardware platforms. Fig. 3.2 shows the functions to be provided. Naturally, the parallel runtime system has to start up in several instances on a whole group of connected machines (PEs). The primitive operations, and also the entire runtime system code, address the n participating PEs simply by numbers from 1 to n. Mapping these logical addresses to the real, middleware-dependent addressing scheme is one of the tasks an implementation has to carry out. Two implementations of the MPSystem interface have been provided, using MPI [MPI97] or PVM [PVM] as the middleware.

Startup and shutdown infrastructure ensures that, upon program start, the runtime system instances on all PEs are synchronised before the main evaluation can start, and that the distributed system performs a controlled shutdown both upon success and upon failure.

CHAPTER 3. A LAYERED EDEN IMPLEMENTATION

/*******************************
 * Startup and Shutdown routines (used inside ParInit.c only) */

/* - start up the PE, possibly also spawn remote PEs */
rtsBool MP_start(char** argv);

/* - synchronise participating PEs
 *   (called by every node, returns when all synchronised) */
rtsBool MP_sync(void);

/* - disconnect current PE from MP-System */
rtsBool MP_quit(int isError);

/*******************************
 * Communication between PEs */

/* - a send operation for p2p communication */
void MP_send(int node, OpCode tag, long *data, int length);

/* - a blocking receive operation. Data stored in *destination */
int MP_recv(int maxlength, long *destination, // IN
            OpCode *code, nat *sender);       // OUT

/* - a non-blocking probe operation */
rtsBool MP_probe(void);

Figure 3.2: RTE message-passing module (interface MPSystem.h)

The protocol for the startup procedure is deliberately simple and depends on the underlying middleware system. For middleware with the ability to spawn programs on remote nodes (such as PVM [PVM]), a “main” PE starts up first and spawns RTE instances on all other participating PEs. PEs are synchronised by the main PE broadcasting the array of all PE addresses, which the other PEs acknowledge in a reply message (PP Ready). Only when the main PE has received all acknowledgements does it start the main computation.

When the middleware manages the startup of programs on multiple PEs by itself (this is the case for MPI implementations, where the MPI report [MPI97] mandates that MPI processes are synchronised by the mpirun utility upon startup), no additional synchronisation for the runtime system needs to be implemented.

In order to implement the controlled system shutdown, basic message-passing methods had to be implemented, and the scheduling loop of GHC has to check regularly for arriving messages before executing the next runnable thread.

Shutdown is realised by a system message PP Finish. Either this message is broadcast by the main PE (with address 1), or it is sent from a remote PE to the main PE when the remote PE fails. In the failure case, the parallel computation cannot be recovered, since needed data might have been lost. Remote PEs receiving PP Finish simply stop execution, while the main PE, in the failure case, broadcasts the message to all other remote PEs, thereby initiating a global shutdown.

Basic (Runtime) Computation Units, managed by the runtime system, are addressed by globally unique addresses as follows.

A running parallel Eden program splits up, in the first instance, into a set of PEs 1 to n (also called machines in the following); machine number 0 is invalid. Furthermore, the sequential GHC runtime system already supports thread concurrency internally, with threads addressed by (locally) unique thread identifiers (IDs). Multiple threads can thus run inside one machine at a time. Threads are uniquely identified by their machine number and ID.

A useful mid-level abstraction of a thread group in a machine is introduced by the Eden language definition: a process. Each thread in Eden belongs to a process, a conceptual unit of the language and the runtime system. A process consists of an initial thread and can add threads by forking a subcomputation (as in Concurrent Haskell). All threads in one process share a common heap, whereas processes are not assumed to share any data; they need to communicate explicitly. Grouping threads inside one machine into processes like this is useful in general, and also relates to the extensions made to garbage collection with respect to heap data transfer.

Support for data transfer between PEs is an obvious requirement of any parallel system implementation. In the context of extending GHC specifically, any data is represented as a graph in the heap. Data transfer between PEs thus means serialising the subgraph reachable from one designated start node (or: heap closure), and reconstructing it on the receiver side. In our implementation, heap data structures are transferred as copies, which potentially duplicates work, but avoids implementing a virtual global address space in the runtime system (we will come back to this in Section 4.7). An important property of the data serialisation routine is that, on the one hand, it does not evaluate any data (but sends it as-is, in its current evaluation state). On the other hand, serialisation is instantly aborted when a placeholder for data under evaluation is found in the subgraph. Thus, in terms of concurrent heap access, data serialisation behaves like evaluation, even though it does not evaluate anything.

Data is always sent via channels previously created on the receiver side, where the placeholder nodes which synchronise concurrent threads in the sequential system may now stand for remote data as well. The RTE keeps a list of open channels and manages the replacement of placeholders by data which has been received through the channel.

Several data message types are implemented: the normal Data mode and the Stream mode. Data sent in Data mode just completely replaces the placeholder when it is received. When data is sent in Stream mode, the receiver inserts it into the heap as the first element of a list and leaves the channel open for further list elements (until the closing nil, [], is eventually sent in Data mode). Another communication mode, Connect, serves to establish a producer-consumer link between two PEs early, before results of a potentially expensive evaluation are transmitted. Finally, because computations, like data, are first-class citizens in Haskell, and therefore nothing but a heap graph structure, the creation of a remote computation could be implemented as yet another data communication mode, Instantiate, where the transmitted data is actually the unevaluated computation to be executed.

Please note that our extensions for data transfer change the meaning of placeholder nodes in the heap, which has consequences for the GHC garbage collection mechanisms. In the sequential system, a thread may only find a placeholder in the heap if there is another thread that evaluates the data behind it. Garbage collection in the sequential system evacuates data needed by runnable threads in the first instance. If none of the runnable threads will ever update a certain placeholder any more, threads blocked on this placeholder are effectively garbage and will be removed. This is no longer the case in our system, where placeholders may also stand for remote data. However, the implementation of the Eden language constructs (described later) ensures that a remote sender for the data exists. Thus, the modified garbage collection keeps threads alive whenever they are registered as members of a process (i.e. not created for internal reasons).

Changes to the compilation process have been made only for the linking phase, and for convenience reasons. Compiling an Eden program with the extended GHC remains largely the same as compiling a sequential program with GHC. The differences are that libraries for the message passing system have to be linked to the application, and that the compiled and linked program needs custom mechanisms to be started in parallel. The latter issue is solved by generating a separate startup script, depending on the middleware in use. It must be mentioned that the start script is a minimalistic solution and might cause problems for unfamiliar users or in custom-configured clusters. However, the Eden system in its current state was not developed as a commercial off-the-shelf solution, but as a research software system.