

As we will see later, we can give an upper bound on the number of messages in the scheduler, so that channels can be sized appropriately at allocation time to guard against the possibility of blocking sends or message loss.

void channel_free(Channel *ch);

Frees the memory associated with channel ch.

bool channel_send(Channel *ch, void *data, size_t sz);

Sends an element of sz bytes at address data to channel ch. Returns false if the channel is full, true otherwise.

bool channel_receive(Channel *ch, void *data, size_t sz);

Receives an element of sz bytes from channel ch. The element is stored at address

data. Returns false if the channel is empty, true otherwise.

unsigned int channel_peek(Channel *ch);

Returns the number of buffered items in channel ch. This function checks for available messages without receiving them.

There is no distinction between the two endpoints of a channel, which means that every thread that holds a reference to a channel may use that reference to send and receive messages. In situations where the communication behavior can be analyzed or is known beforehand, specialized channels (MPSC, SPSC) may be used in place of more general ones (MPMC) to reduce overhead and improve performance [206].
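To illustrate, a producer and a consumer thread might use this API roughly as follows. This is a minimal sketch: the channel_alloc call follows the signature used in the examples later in this section (element size, capacity, channel type), and error handling is omitted.

Channel *chan = channel_alloc(sizeof(int), 8, SPSC);
int x = 42, y;

/* Producer: enqueue one element; returns false if the channel is full */
channel_send(chan, &x, sizeof(int));

/* Consumer: check for buffered elements before receiving */
if (channel_peek(chan) > 0)
    channel_receive(chan, &y, sizeof(int));

channel_free(chan);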

3.1.3 Channel Implementation

Channels have a practical advantage over work-stealing deques: channels—and FIFO queues in general—are easier to implement than deques with concurrent push, pop, and steal operations. Deques require expensive synchronization operations that cannot be eliminated without relaxing the semantics of the work-stealing algorithm [169, 39] or assuming bounded store/load reordering [176]. Channels, on the other hand, can be implemented efficiently, especially under the assumption of limited concurrency [71, 39, 144]. As it turns out, single-consumer queues (MPSC, SPSC) suffice to construct efficient schedulers.

Shared Memory. Listing 3.1 sketches a simple implementation of SPSC channels on a typical shared-memory multiprocessor [142, 116]. Channels are implemented as ring buffers with head and tail pointers (actually buffer indices). Elements are added to the tail and removed from the head, so that head lags behind tail. We prefer ring buffers to linked lists because channels need not be resizable; they always contain a bounded number of elements. The memory barriers in lines 21 and 38 ensure that writes to head and tail occur in program order. If left unconstrained, non-sequentially-consistent2 architectures are free to reorder loads and stores in accordance with their memory models [177, 31], potentially violating the correctness of the algorithm [142].

Shared-memory channels are suitable for passing values as well as references. Consider the following example to send a large data structure over a channel:

Channel *chan = channel_alloc(sizeof(Large_data_structure), 1, SPSC);
Large_data_structure *d = ...
channel_send(chan, d, sizeof(Large_data_structure));

Instead of copying sizeof(Large_data_structure) bytes into the channel when sending and out of the channel again when receiving, data can be moved between threads without copying. Assuming d is allocated from heap memory, a send can be used to transfer ownership of the referenced data:

Channel *chan = channel_alloc(sizeof(Large_data_structure *), 1, SPSC);
Large_data_structure *d = ...
channel_send(chan, (void *)&d, sizeof(Large_data_structure *));

Notice how the channel has changed from storing values to storing pointers. A thread must not access data for which it has relinquished ownership, in much the same way that data must not be touched after it has been freed.
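On the receiving side, the pointer is copied out of the channel, and ownership of the referenced data passes to the receiver. A minimal sketch, mirroring the send above:

Large_data_structure *r;
channel_receive(chan, (void *)&r, sizeof(Large_data_structure *));
/* The receiver now owns the data behind r and is responsible for freeing it */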

Distributed Memory. Channels can be used to exchange data among processes in a distributed environment. A possible implementation of SPSC channels is shown in Listing 3.2, where we take advantage of MPI’s nonblocking send and receive operations instead of maintaining channel buffers ourselves. Point-to-point messages as in MPI enforce the property that only one process receives from a channel. Additionally, MPI’s semantics guarantee that messages are received in the order they were sent (see [15], Section 3.5, pp. 42–45), provided the channel is used as intended3. Handing over a channel between two processes requires copying the channel descriptor and changing the value of receiverID to point to the new receiver (see the sketch following Listing 3.2). Each channel must tag its messages with a unique identifier so that receivers are able to distinguish messages belonging to different channels.

2 Sequential consistency [141] forbids observable reordering of memory operations.

3 Using the same implementation for MPSC messaging would violate the FIFO property of channels!


 1 typedef struct channel {
 2     unsigned int cap;
 3     size_t itemsize;
 4     unsigned int head;
 5     // Appropriate padding to avoid false sharing ...
 6     unsigned int tail;
 7     char *buffer;
 8 } Channel;
 9
10 bool channel_send(Channel *chan, void *data, size_t size)
11 {
       ...
18     memcpy(chan->buffer + chan->tail * chan->itemsize, data, size);
19
       ...
27 bool channel_receive(Channel *chan, void *data, size_t size)
28 {
       ...
35     memcpy(data, chan->buffer + chan->head * chan->itemsize, size);
36
       ...

Listing 3.1: Implementation sketch of SPSC channels on a typical shared-memory multiprocessor. Elements are added to the tail and removed from the head of a circular array of bounded size. For performance reasons, it is important that head and tail occupy separate cache lines; otherwise, the sender and the receiver end up constantly invalidating each other’s cached values of head and tail (false sharing). Memory barriers (affecting both compiler and hardware) prevent reordering of prior loads and stores with updates of head and tail.
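The elided bodies of channel_send and channel_receive might look roughly as follows. This is a hedged sketch, not the exact code of Listing 3.1: it assumes the buffer is allocated with cap + 1 slots, so that a channel of capacity cap can buffer cap items while one slot stays unused to distinguish a full channel from an empty one, and it uses C11 atomic fences to stand in for the memory barriers mentioned in the caption. Plain loads and stores of head and tail are assumed to be atomic on the target architecture.

#include <assert.h>
#include <stdatomic.h>   /* atomic_thread_fence */
#include <stdbool.h>
#include <string.h>      /* memcpy */

bool channel_send(Channel *chan, void *data, size_t size)
{
    assert(size <= chan->itemsize);

    unsigned int next_tail = (chan->tail + 1) % (chan->cap + 1);
    if (next_tail == chan->head)        /* channel is full */
        return false;

    memcpy(chan->buffer + chan->tail * chan->itemsize, data, size);

    /* Publish the copied element before updating tail (memory barrier) */
    atomic_thread_fence(memory_order_release);
    chan->tail = next_tail;

    return true;
}

bool channel_receive(Channel *chan, void *data, size_t size)
{
    assert(size <= chan->itemsize);

    if (chan->head == chan->tail)       /* channel is empty */
        return false;

    /* Read the element only after observing the sender's update of tail */
    atomic_thread_fence(memory_order_acquire);
    memcpy(data, chan->buffer + chan->head * chan->itemsize, size);

    /* Free the slot only after the element has been copied out */
    atomic_thread_fence(memory_order_release);
    chan->head = (chan->head + 1) % (chan->cap + 1);

    return true;
}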

 1 typedef struct channel {
       ...
 7 bool channel_send(Channel *chan, void *data, size_t size)
 8 {
 9     MPI_Request req;
10
11     MPI_Isend(data, size, MPI_BYTE, chan->receiverID, chan->tag, MPI_COMM_WORLD, &req);
12     MPI_Wait(&req, MPI_STATUS_IGNORE);
13
14     return true;
15 }
16
17 bool channel_receive(Channel *chan, void *data, size_t size)
18 {
19     MPI_Request req;
20     int flag;
21
22     MPI_Iprobe(MPI_ANY_SOURCE, chan->tag, MPI_COMM_WORLD, &flag, MPI_STATUS_IGNORE);
23     if (flag) {
24         MPI_Irecv(data, size, MPI_BYTE, MPI_ANY_SOURCE, chan->tag, MPI_COMM_WORLD, &req);
25         MPI_Wait(&req, MPI_STATUS_IGNORE);
26     }
27
28     return (bool)flag;
29 }

Listing 3.2: Channels as thin wrappers around two-sided communication operations using the example of nonblocking send and receive in MPI. The MPI_Wait in line 12 waits until the data has been copied out of the send buffer; it does not necessarily mean that the message has been received. Similarly, the MPI_Wait in line 25 waits until the message has arrived in the receive buffer. The receiver first probes whether a message matching the channel’s tag is available before it initiates the receive operation.

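As noted earlier, handing over a channel between two processes amounts to copying the channel descriptor and retargeting receiverID. The following hedged sketch illustrates the idea using the fields of Listing 3.2; new_receiver_rank and chan_to_new_receiver are hypothetical names introduced only for this example, the latter being a control channel already shared with the new receiver.

/* Old receiver: copy the channel descriptor, retarget it, and ship it */
Channel handoff = *chan;
handoff.receiverID = new_receiver_rank;
channel_send(chan_to_new_receiver, (void *)&handoff, sizeof(Channel));

/* New receiver: obtain the descriptor and start receiving on the channel */
Channel chan2;
channel_receive(chan_to_new_receiver, (void *)&chan2, sizeof(Channel));
/* From now on, only the new receiver calls channel_receive on this channel */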

Channels may be thin wrappers around message passing primitives, as we have seen in Listing 3.2. Lower-level implementations may utilize remote memory access (RMA) to send and receive messages. As part of our work in [201], we have implemented channels based on one-sided put and get operations between the local message passing buffers on Intel’s SCC processor. Figure 3.1 shows the communication latencies for bouncing a small, cache-line-sized message between a pair of cores. This “ping-pong” benchmark measured the round-trip latency between core number 0, at the bottom left corner of the chip, and a second core that varied from being 0 to 8 hops away. General MPMC channels added 23–41% overhead on top of the native communication library’s send and receive operations (RCCE). MPSC and lock-free SPSC channels, however, allowed faster communication than RCCE, showing the benefit of specializing an implementation to the communication pattern [206].


Figure 3.1: Round-trip latencies in microseconds on the Intel SCC for passing a 32-byte message back and forth between core 0 and a second core that varies from being 0 to 8 hops away. A distance of 0 corresponds to core 0 communicating with core 1 of the same tile, a distance of 8 corresponds to core 0 communicating with core 47. The results show the fastest of ten trials, each trial being the average of 1000 round trips. The chip was operated in its default Tile533/Mesh800/DDR800 configuration (all values in MHz).

Distributed-memory implementations of channels have copy semantics, regardless of whether a machine supports shared memory. Send and receive operations are required to copy messages between private memory and channel buffers. Friedley et al. propose an API for ownership passing that enables move semantics4 in MPI programs [94]. Combined with other recent developments, such as the extended RMA model of MPI-3 [119] and improved producer-consumer communication [43], MPI may permit increasingly efficient channel implementations.