[Figure 9.9 appears here: bar chart of average communication times ("Time (s)") by location of workers (NYU, HU) for unchecked annotations, unchecked annotations + caching, and unchecked annotations + caching + colocation.]

Figure 9.9: Average communication times for matrix multiplication shown for workers at NYU or HU; Dintplus annotations, Dintplus annotation and caching, and Dintplus annotation and caching and colocation (averaged over 1000 runs).

9.3 An infrastructure for resource allocation in the WWW

Charlotte makes it easy for a volunteer to contribute his idle CPU time to a parallel application, but it does not answer the question of how a volunteer can find such an application. This problem is solved by KnittingFactory, described in much more detail in [25, 26].

One main component of KnittingFactory is a directory service. The requirements for such a directory service differ slightly from those for a typical name server. One requirement is to allow lookups not only from programs, but also from within a standard web browser. Another is to accommodate highly dynamic registration and deregistration of processes and to take the topological structure of processes into account, favoring applications that are close to volunteers.

The requirement to use browsers as lookup tools implies that such a directory server should be integrated into the Web infrastructure. In KnittingFactory, applications looking for workers can register with a directory service by sending standard Hypertext Transfer Protocol (HTTP) requests to well-known KnittingFactory servers. These servers store the requests along with information about peer KnittingFactory servers.

A volunteer looking for work directs his browser to such a KnittingFactory server and retrieves a Hypertext Markup Language (HTML) page from this server that includes a list of known applications and peer KnittingFactory servers and also contains a small Javascript [79] program. This program inspects the page to check if any applications looking for work are known to this server. If so, the browser is redirected to the Uniform Resource Locator (URL) of the application (corresponding to, e.g., a Charlotte program) and, transparently for the user, downloads the corresponding applet and starts to execute it. If no application is found, the Javascript program constructs a new URL from the current URL, starting with a peer server, appends the name of the current server to it, and redirects the browser to this new URL. This passes state information between several pages and allows the Javascript program to implement different search strategies.

One strategy would be breadth-first search, favoring topologically near servers (as defined by the topology of the server graph). Such topological considerations are difficult when using other directory services. Additionally, this client-based search implementation moves the actual search from the servers to the clients and can therefore be regarded as an implementation technique of the Smart Client concept advocated by YOSHIKAWA et al. [309].
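The mechanism above can be sketched in a few lines. The real search runs as a Javascript program inside the browser; the following Java sketch (with hypothetical names, not KnittingFactory's actual API) models the two essential ingredients: carrying the visited-server state in the redirect URL, and a breadth-first traversal of the peer-server graph.

```java
import java.util.*;

// Sketch of the client-side directory search (illustrative names only).
public class DirectorySearch {

    // Build the redirect URL: the peer server comes first, and the
    // servers already visited are appended as state, so the page
    // served by the peer can continue the search where it left off.
    public static String nextUrl(String peer, List<String> visited) {
        return peer + "/search?visited=" + String.join(",", visited);
    }

    // Breadth-first order over known peer servers, skipping servers
    // already seen; this favors topologically near servers first.
    public static List<String> bfsOrder(Map<String, List<String>> peers, String start) {
        List<String> order = new ArrayList<>();
        Set<String> seen = new HashSet<>(List.of(start));
        Deque<String> queue = new ArrayDeque<>(List.of(start));
        while (!queue.isEmpty()) {
            String server = queue.poll();
            order.add(server);
            for (String p : peers.getOrDefault(server, List.of())) {
                if (seen.add(p)) queue.add(p);
            }
        }
        return order;
    }
}
```

Because the visited list travels in the URL itself, no server-side session state is needed; each page load is self-describing, which is what makes the search strategy a purely client-side choice.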

KnittingFactory also removes another limitation of typical Java applications. For security reasons, a Java applet is by default only allowed to open network connections to the machine from which it was downloaded.

Since downloading requires an HTTP server, a machine running such a server can easily become a bottleneck if multiple (e.g., Charlotte) applications are running on it. On the other hand, it is often impractical to install a complete HTTP server on all available machines. To overcome this limitation, KnittingFactory provides core HTTP server functionality that can easily be integrated into any Java application. Using this integrated server, a Java application can be started on any machine, can optionally register itself with one or more KnittingFactory servers, and can await requests for applets. This easy access to core HTTP functionality increases the flexibility of Java applications.
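A minimal sketch of such an embedded server is shown below, using the JDK's built-in com.sun.net.httpserver package (KnittingFactory predates this API and shipped its own implementation; the class and method names here are illustrative). The point is that serving the applet from within the application itself makes that application's host the download origin, so the applet sandbox permits connections back to it.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// Sketch: embed core HTTP functionality directly in a Java application.
public class AppletServer {

    // Start an HTTP server on the given port (0 picks a free port)
    // that serves the applet code under /applet.jar.
    public static HttpServer start(int port, byte[] appletJar) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/applet.jar", exchange -> {
            // Serving the applet from this host lets downloading
            // browsers open connections back to this host, satisfying
            // the applet's default network restriction.
            exchange.getResponseHeaders().set("Content-Type", "application/java-archive");
            exchange.sendResponseHeaders(200, appletJar.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(appletJar);
            }
        });
        server.start();
        return server;
    }
}
```

An application would start such a server, register its URL with one or more KnittingFactory directory servers, and then simply wait for volunteer browsers to fetch the applet.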

9.4 Conclusions

Problems arising from applying Calypso’s techniques to wide area network environments, in particular in the context of reducing communication overhead, have been considered in this chapter. In Section 9.2, possibilities for improving Charlotte’s efficiency by means of annotating parallel routines with their communication dependencies have been proposed. These annotations reduce the communication overhead of a Charlotte program, enable the use of simpler memory management techniques, and can additionally be interpreted as bridging the gap between the DSM semantics of Charlotte and simpler, yet more efficient message passing systems.

These annotations can have the character of hints, allowing the runtime system to improve communication efficiency while still guaranteeing the correctness of a program. They can also be used as a precise description of read and write sets, which allows the sharing of primitive types like int across multiple machines. The stepwise nature of this concept enables a programmer to gradually incorporate knowledge about a program’s behavior into the code and to freely mix pure DSM objects with annotation-based objects or shared primitive types.
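The distinction between hint-style and precise annotations can be illustrated with a small sketch. The names below are hypothetical, not Charlotte's actual API: an annotation at the hint level still leaves the runtime responsible for tracking accesses, while a precise read/write set lets the runtime ship exactly the declared data and skip validation.

```java
import java.util.*;

// Illustrative sketch of annotation levels (hypothetical names).
public class RoutineAnnotation {
    public enum Level { NONE, HINT, PRECISE }

    public final Level level;
    public final Set<String> readSet;   // objects the routine reads
    public final Set<String> writeSet;  // objects the routine writes

    public RoutineAnnotation(Level level, Set<String> reads, Set<String> writes) {
        this.level = level;
        this.readSet = reads;
        this.writeSet = writes;
    }

    // Data that must move for this routine. For a pure DSM object
    // (Level.NONE) the set is unknown: the runtime must fall back to
    // tracking accesses at runtime, signalled here by returning null.
    public Set<String> dataToShip() {
        if (level == Level.NONE) return null;
        Set<String> union = new HashSet<>(readSet);
        union.addAll(writeSet);
        return union;
    }
}
```

The stepwise character described above corresponds to moving an object from Level.NONE through HINT to PRECISE as the programmer's knowledge of the access pattern grows.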

Sharing primitive types results in data access efficiency usually found only in message passing systems or hardware-supported DSM systems. By building on top of Charlotte, this efficiency is now available for Java-based Web computing without putting the burden of low-level communication primitives on the programmer, while properties that are crucial for Web computing, e.g., fault tolerance, are maintained. In this sense, advantages from both DSM systems and message passing are incorporated in this concept.

The practicability and ease of use of this approach has been shown with matrix multiplication as an example. A number of measurements substantiate the claim of vastly improved performance: runtime improvements of up to a factor of nine over standard Charlotte were observed, with performance competitive with a pure message passing implementation. These results show that with modest overhead for programmer and runtime system, even problems of only moderate granularity can be solved efficiently in a Java-based DSM programming environment.

To support Charlotte’s need to find volunteers contributing to the computation, and in general to make Web-based applications more feasible, KnittingFactory provides an easy-to-use, Web-based directory service and a mechanism to execute Java applications on any host without the need for an external HTTP server.

9.5 Possible extensions

There are a number of possible extensions to this work. With regard to responsiveness, these annotations provide hints to the runtime system about the communication overhead of the parallel execution. Adding further information about the execution times of routines along with a fault model is simple, allowing the runtime system to make on-line estimations of the responsiveness, based on the techniques developed in Chapter 5. Combining this with emerging technologies for real-time Java adds to the predictability of the program execution. Additionally, the knowledge about communication requirements can be used in concert with information about the network status (as obtained from systems like the Network Weather Service [307] or the Network Status Predictor [146]) to further improve the responsiveness of a program. If the runtime system decides that it is unlikely to meet a requested deadline, it can request additional resources from a system like KnittingFactory (see Section 9.3). Conversely, resources can be released if the probability of meeting the deadline is sufficiently high.
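The request/release policy suggested above amounts to a simple control loop. The following sketch makes that explicit; the thresholds and the method name are illustrative assumptions, not part of the systems described here, and the probability estimate itself would come from the on-line responsiveness estimation mentioned above.

```java
// Sketch of a deadline-driven worker-adjustment policy
// (hypothetical thresholds, for illustration only).
public class DeadlineController {

    // Returns a worker-count adjustment: +1 requests an additional
    // worker, -1 releases one, 0 keeps the current allocation.
    public static int adjustWorkers(double pMeetDeadline, int current, int min) {
        if (pMeetDeadline < 0.9) {
            return +1;  // unlikely to meet the deadline: ask for resources
        }
        if (pMeetDeadline > 0.99 && current > min) {
            return -1;  // safely ahead of the deadline: release resources
        }
        return 0;       // within the comfort band: do nothing
    }
}
```

The gap between the two thresholds provides hysteresis, avoiding oscillation between requesting and releasing workers when the estimate hovers near a single cutoff.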

Another extension is studying the impact of problem size and communication/computation ratio on the relative performance of the various annotation levels. Generating the annotations by a compiler-based data-flow analysis would also be most interesting. Overlapping computation with communication is an orthogonal issue: Coordinated execution of multiple workers within one browser is an obvious approach to this problem.

Taking a long-term perspective, it must be noted that JIT compilers for Java still do not deliver the performance that was expected from them when they first became popular. This shortcoming makes Java somewhat less attractive for implementing metacomputing systems. Nevertheless, Java is still attractive as a coordination language for such metasystems. In such a scenario, Java programs coordinate the execution of lower-level programs at different sites, probably larger facilities based on supercomputers. For such installations, the deployment of, e.g., ATM-based virtual circuits becomes viable, providing guaranteed communication Quality of Service. Such a scenario would make it possible to reconsider responsive computing in wide area networks from a new perspective.

“Begin at the beginning”, the King said gravely, “and go on till you come to the end: then stop.”

– Lewis Carroll

Chapter 10

Conclusions and Future Work

10.1 Conclusions

Clusters of standard, off-the-shelf workstations have become more and more popular for parallel computing and are currently a viable alternative to custom-built high-performance systems (e.g., parallel supercomputers) in many application areas. This dissertation has concentrated on additional challenges of parallel computing beyond mere performance: timely and dependable execution of parallel programs. Given the prevalent focus on high performance in cluster computing research, the questions of dependability and timeliness, and in particular their combination, have received comparatively scant attention. To address these issues, questions of scheduling analysis, fault tolerance, resource management and communication should be answered and expressed in a concise metric; solutions should be compatible with the commodity nature of cluster-based systems.

Such a concise metric has been found in the existing notion of responsiveness, which has been refined to fit the needs of this work. Responsiveness is the probability of correctly completing a service before or at a given deadline, even in the presence of faults. If a deadline is not given, then the distribution of the service’s response time is an appropriate metric. Based on the large amount of work in cluster computing, the systems of the Milan project, namely Calypso and Charlotte, have been selected to serve as a case study from which concrete responsiveness needs of parallel computing could be extracted. Four such needs have been identified: a response time analysis, dealing with single points of failure, providing guaranteed access to resources, namely CPU time, and reducing communication overhead.
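The definition in the preceding paragraph can be stated compactly; the following is merely a restatement of the text, with $T$ denoting the (random) completion time of the service and $t_d$ the deadline:

```latex
% Responsiveness: the probability of a correct completion no later
% than the deadline t_d, evaluated under the assumed fault model.
R(t_d) = \Pr\bigl[\, T \le t_d \ \text{and the result is correct} \,\bigr]
```

When no deadline is given, the whole distribution function $\Pr[T \le t]$ of the response time serves as the metric, as stated above.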

The response time distribution of the eager scheduling mechanism employed by the Milan systems has been analyzed under some general assumptions about the behavior of machines and programs. The general analysis considers arbitrary probabilistic distributions of routine execution times and machine lifetimes and derives the response time distribution from these assumptions. The general solution is only of limited practical value since its numerical complexity is large, and simulations are often preferable over analytical approaches; hence, carefully restricted assumptions about the program behavior (which imply guidelines for program design and implementation) are necessary. If the assumptions are restricted to fixed routine execution times, the analytical solution is competitive with simulation and practically feasible. Moreover, under both sets of assumptions, analytical and simulation results show a close correspondence.

To remove a single point of failure, two popular mechanisms for fault tolerance, checkpointing and replication, have been investigated with regard to responsiveness. For checkpointing, a simple yet general theoretical solution of the problem of maximizing responsiveness by an appropriate choice of the checkpointing interval has been given. This analytical solution has been validated by experiments with a Calypso version extended by checkpointing functionality: the analytically predicted optimal checkpointing interval matches the one found in experiments (as closely as stochastic claims can be made). These experiments have shown that checkpointing is a viable means to ensure that parallel programs meet their deadlines with high probability. Checkpointing has the additional valuable property that the responsiveness is actually fairly robust with respect to the employed checkpointing interval, as long as it is in the vicinity of the optimal interval (as could also be expected from the analysis).
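The robustness near the optimum can be illustrated with a classical back-of-the-envelope model. The sketch below uses Young's well-known first-order approximation for the checkpointing interval minimizing expected overhead, not the responsiveness-based analysis of this dissertation: with checkpoint cost delta and failure rate lambda, the optimal interval is sqrt(2*delta/lambda), and the overhead curve is flat around it.

```java
// Young's first-order checkpointing approximation (illustrative
// model; the dissertation's own analysis maximizes responsiveness).
public class CheckpointInterval {

    // Interval minimizing the approximate overhead below:
    // tau* = sqrt(2 * delta / lambda).
    public static double optimalInterval(double delta, double lambda) {
        return Math.sqrt(2.0 * delta / lambda);
    }

    // Approximate overhead per unit of useful work for interval tau:
    // one checkpoint of cost delta per interval, plus on average half
    // an interval of lost work per failure (rate lambda).
    public static double overhead(double tau, double delta, double lambda) {
        return delta / tau + lambda * tau / 2.0;
    }
}
```

Evaluating the overhead at half and at twice the optimal interval shows only a modest increase, mirroring the experimentally observed robustness of responsiveness with respect to the chosen interval.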

Based on replication, a general-purpose system, Fault-Tolerant Distributed I/O (FT-DIO), has been introduced that increases the fault tolerance of existing legacy software by observing its input/output behavior. FT-DIO is also characterized by a flexible configuration and adaptation of the fault-tolerance level to the needs of the application, even at runtime. Experiments with FT-DIO have shown that not only the fault model but also the replication mechanism has a large impact on performance; in particular, removing a single point of failure incurs high overhead. To assess the suitability of FT-DIO for responsiveness, the Totem protocol, which is a key component of FT-DIO, has been investigated experimentally; the theoretical results for Totem’s predictability have been confirmed for simple fault models. However, for more complicated fault models or in the presence of additional background load on some machines, Totem’s predictability suffers considerably. On the basis of FT-DIO, a replicated version of Calypso has been designed. Experiments have shown that replication increases the responsiveness of a Calypso program under heavy fault injection; compared with checkpointing, however, it does not perform favorably in the concrete experiments that have been considered. These experiments indicate that for practical environments a combination of checkpointing with a modest degree of replication (i.e., a duplex system) promises a high degree of responsiveness.

The need of a parallel program for guaranteed amounts of resources to complete execution in time has been addressed by a resource management scheme that both conforms to the standards used in clusters and is appropriate for use with parallel programs. Compared with existing resource management systems, the one presented here combines guaranteed CPU shares with temporally coordinated execution of distributed processes (coscheduling) without modifying the underlying operating system or hardware. Experiments have shown that different parallel programming models have different synchronization requirements; in particular, that for BSP-style programs coscheduling is both necessary (runtime improvements of over one order of magnitude are observable) and feasible, but also that coscheduling can be harmful to the performance of master/worker style programs (as represented, e.g., by Calypso).

In the last chapter, the impact of reducing communication between distributed parts of a program has been investigated, which is especially important for parallel computing in wide area environments. An annotation-based solution has been presented that reduces the communication overhead of Charlotte programs, enables the use of simplified memory management mechanisms, and can serve as a first stepping stone towards responsive parallel computing in these complex settings. Additionally, these annotations can be interpreted as bridging the semantic gap between distributed shared memory and message passing systems. Experiments show that these annotations improve the efficiency of Charlotte programs by up to a factor of nine.

Based on the case of a concrete system and its requirements, some general models of program behavior with respect to responsiveness have been derived in this dissertation; analytical solutions for scheduling and checkpointing and standard-conforming solutions to questions regarding replication and resource management have been proposed and corroborated by experiments (over 10,000 machine hours were used to run numerical analyses and experiments). The need to carefully select assumptions has become apparent, resulting in guidelines for the development of programs suitable for responsive execution. Also, handling problems at different abstraction levels is of paramount importance: no solution for timeliness or dependability at any single abstraction level is sufficient, since statements made at one level can be jeopardized by system properties at another level. This dissertation has made the first step towards an integrated treatment of multiple levels, facilitated by applying responsiveness as a single metric, but research towards integration must still continue.

Many of the techniques proposed here are also applicable in other environments and represent both theoretical and practical contributions to questions in the responsive execution of parallel programs. However, much work remains to be done, and some directions for future research are considered in the following section.