
8.4.3 Calypso programs and scheduling servers

In this section, Calypso programs are considered with the Calypso master process not subject to a scheduling server: the master is always eligible to run, but (typically) shares a machine with a worker process.10 If the Calypso master is also eligible to run only periodically, the same line of argument as in Section 8.4.2 applies and synchronizing these executions is necessary.

While Calypso uses a BSP programming model, its actual implementation follows a master/worker style, enhanced with load-balancing techniques. There is no explicit synchronization of all processes (whether worker or master) in a Calypso program; hence, the main motivation for synchronizing the processes with each other is missing. Moreover, if the worker processes are synchronized with each other, they all attempt to access the master at (potentially) the same time. This can cause the master and the network to become a bottleneck, in particular for programs with high traffic, and suggests that Calypso programs will behave differently from BSP programs when run under scheduling servers.

The Calypso program from Section 4.3 is used for the experiments. As that section has indicated, the imbalance parameter v is of lesser importance, whereas the previous discussion suggested a potential impact of the traffic parameter a.11 Therefore, the experiments used the granularity g and the traffic a as parameters; the scheduling servers were again set to provide time slices of 20 ms every 100 ms, spin-blocking with 200 µs was used, and the numbers shown here are averaged over at least 50 runs.

Figure 8.10 shows the results with unsynchronized scheduling servers, and Figure 8.11 with synchronized scheduling servers; for easier comparison, Figure 8.12 provides the ratio of times with synchronized servers divided by times with unsynchronized servers. The scenario here is much more complicated than in the case of BSP programs. In most cases, synchronization actually harms performance, but with increasing granularity it becomes competitive and eventually outperforms unsynchronized servers. This is commensurate with the initial discussion: with larger granularity, fewer requests are made to the master process, which therefore becomes less of a bottleneck. However, the exact point where synchronization becomes beneficial depends strongly on the actual program, and the differences are small in any case. Synchronizing scheduling servers has the additional advantage that in almost all cases the variation coefficient of the runtimes is smaller.

Only numbers for a balanced load are shown here. For other settings of the imbalance parameter, the behavior is similar; however, the effects of load balancing are reduced by the slotted CPU availability under scheduling server control. Therefore, programs with unbalanced load perform comparatively worse under scheduling servers than they do without.

8.5 Conclusions

In this chapter, the problem of controlling the CPU share given to a parallel program running on a cluster of workstations and the effects of such a control on the distributed execution of a parallel program have been considered. A prototypical implementation of a scheduling server for the Linux operating system has been presented. Among several design choices, a signal-based implementation has been used due to its stability and portability.

While such a straightforward implementation works well for stand-alone programs, the performance impact on symmetric parallel programs (as represented by the BSP programming model) proved to be disastrous.

A synchronization mechanism has been suggested and implemented that copes with the limited clock resolution of PC-based commodity systems and still achieves reasonable performance for symmetric parallel programs, even in the presence of background load. These performance benefits stem from the fact that the synchronization mechanism achieves coscheduling. For asymmetric parallel programs (like Calypso programs), synchronization is superfluous in many cases and can, in particular for very traffic-intensive programs, actually harm performance.

10 Since the worker process is subject to scheduling server control, the master process runs at a fixed priority higher than the worker, but lower than the scheduling server—otherwise the master would not service any worker requests while the local worker is running.

11 Following the description of the test program in Section 4.3, the traffic parameter a indicates the number of pages of size 4 KBytes that are read and written by every routine.

Figure 8.10: Average runtime of a Calypso program with unsynchronized scheduling servers, shown for various granularities g and traffic parameters a. (Plot: time in ms versus granularity g in ms; curves for a = 0, 1, 2.)

Figure 8.11: Average runtime of a Calypso program with synchronized scheduling servers, shown for various granularities g and traffic parameters a. (Plot: time in ms versus granularity g in ms; curves for a = 0, 1, 2.)

Figure 8.12: Ratio of runtimes of a Calypso program, comparing synchronized and unsynchronized scheduling servers (larger values indicate that unsynchronized scheduling servers perform better), shown for various granularities g and traffic parameters a. (Plot: ratio versus granularity g in ms; curves for a = 0, 1, 2.)

Using extensive measurements, the influence of various parameters such as granularity, load imbalance, and communication pattern was investigated. There are several observations for symmetric programs; the most relevant are: for fine-grained, moderately communicating programs, synchronized scheduling servers provide a reasonable means of achieving coscheduling; for heavily communicating programs, spin-blocking has to be added; and synchronized scheduling servers reduce the variation coefficient of program execution time.

For asymmetric parallel programs, coscheduling can actually be harmful since the master process can become a bottleneck.

8.6 Possible extensions

There are a number of possibilities to extend this work. For the experiments described in Section 8.4, only four PCs were available. This small number does not allow questions of scalability to be addressed. To do so, experiments with a larger cluster, along with a reimplementation of the synchronization to use multicasting (or a tree-based dissemination of the synchronization messages if no multicasting is available, similar to Score-D), are desirable. Further experiments with multiple distributed programs running under scheduling server control at the same time, or with additional, uncontrolled background load, could also be performed.

A more generalized version of a scheduling server should control not only CPU resources, but other resources as well. Memory is an obvious candidate; Linux implements the necessary memory-locking primitives to make this relatively straightforward. Networking bandwidth is more complicated and depends on the kind of network in use. In ATM networks, for example, a priori reservation of bandwidth is a possibility, but it should be integrated with Quality-of-Service end-system architectures.

As has been discussed in Section 8.3.2, the coarse resolution of operating system timers is a major obstacle to an efficient, time-driven synchronization of distributed servers. A promising opportunity is represented by the UTIME extensions [21] to the Linux kernel, which promise programmable timers with microsecond accuracy.


However, it remains to be seen how this timer accuracy can be provided to application processes as well (and not only kernel modules).

To make such a distributed scheduling control practical, it has to be integrated into a general resource management scheme. BARATLOO et al. [24] propose a resource management system that fits particularly well with the prototype described here. The distributed scheduling server can be used to enact Quality-of-Service-related decisions made by the resource broker. Programs that fork off remote processes could at the same time inform the resource broker that they require gang scheduling for these processes. Some additional effort for integrating these two systems, as well as some studies regarding Quality-of-Service specifications and policies, are needed here; the Globus resource description language [89] could serve as a starting point. Additionally, such an integrated resource management/scheduling system can react flexibly to the requirements of tunable applications [49].

Reaching out to Wide Area Networks

Satisfying the unlimited resource needs of parallel applications leads to the prospect of metacomputing in geographically widely distributed environments. In such wide area environments, communication can become a serious bottleneck. To reduce communication overhead, an annotation scheme for communication patterns of a program is proposed here. This scheme is investigated with an implementation in the Charlotte system. It is shown that these annotations considerably increase Charlotte’s efficiency and can also serve as a stepping stone for responsive metacomputing. Additionally, an infrastructure for resource allocation in wide area networks is briefly discussed.