

4 MegaMol

To recapitulate one important statement of the introduction: the purpose of visualization is to gain insight into large data sets from diverse applications (cf. [CMS99], [JH04], and Chapter 1). As such, visualization needs, apart from presenting some novelty, to work with real-world data sets and to be applied to real-world problems to advance the scientific field itself. To handle the continuously growing data sets and problem sizes, the complexity of the software required to deliver the solutions increases as well. Software systems emerge to cope with this situation and are quickly gaining importance, which can be seen in the fact that major visualization venues recently introduced the system paper type. However, such software systems tend to be huge, continuously growing, and more long-lasting than originally intended and expected [Phi98]. Creating such a large software system benefits from expertise beyond the field of visualization itself: namely from a deeper understanding of software design and the principles of software engineering (SE). In this chapter, this concept is detailed by the example of MegaMol [Meg], a visualization system focused on particle-based visualization for MD data sets. Almost all visualizations presented in this thesis were implemented as part of MegaMol.

4.1 Visualization System Engineering

Software developed in a visualization research environment usually takes the form of proof-of-concept prototypes, which require neither a clean architecture nor thorough error handling, as such an implementation is enough to produce performance results as well as a few screenshots to add to the publication. Especially with at-the-edge concepts like assembler shaders, GPGPU APIs, and the constantly evolving graphics card drivers and their bugs, researchers are often happy enough that the code works at all. And, as it is not the task of researchers to write production software to generate revenue, delivering proof of concept with disposable prototypes seems sufficient.

To cope with the increasing problems, however, disposable proof-of-concept prototypes are not acceptable. Suitable software systems exhibit the attributes of production software: sophisticated file formats and memory management to cope with large data sets of different formats and from inhomogeneous sources, an end-user-friendly interface for parameterization, and robustness against software and operating errors, which seems to contradict the priorities of a research environment. The situation is most pressing in the visual analytics community, where, by design, a large number of publications are systems-centred, since the visual analytics workflow itself utilizes techniques from many different fields to perform complex analysis tasks. This is duly recognized as a significant challenge with no immediate solution in [KKEM10]. But this issue is also gaining importance for the visualization community. There are many books on data structures and optimization at the algorithmic level for visualization [GP07], [KKEM10], and utilizing SE for the field is probably just the next step. Cleanly designing and implementing software is by no means impossible. It roughly corresponds to the effort invested in application papers, where the focus is partially shifted from the core algorithm to, e.g., the user interface. Systems, however, require significantly more time than writing a disposable prototype, as each of their parts or modules already has roughly the implementation complexity of a single disposable prototype, but needs to be more generic, which adds further overhead [VHS01]. The question arises how to efficiently use the available working resources to tackle this problem.

Employing existing programs or modules that other researchers offer as public domain (implying due acknowledgement) seems an acceptable solution. Integrating research modules into existing systems might actually be the best solution, if applicable. Each system, however, was designed with a specific goal in mind, and an integration of research code will only succeed if this goal coincides with the required research direction. AVS [AVS] pioneered the field of extensible, generic visualization software systems, as it features a thin framework allowing for the interactive composition of functional modules to define the resulting application. However, it offers only limited utility functionality and is not actively maintained, thus lacking modern rendering techniques. VTK [SML97] offers many classes for data types and algorithms, often with multiple CPU-based implementations. As the main goal of its developers is general applicability, optimizations for special cases or special hardware (even GPUs) are mostly omitted. ParaView [JAL05] is the most prominent reconfigurable frontend based on VTK. It offers good performance through customized modules and is ready for distributed and parallel rendering in cluster environments.

Representatives of generic frameworks with a focus on information visualization are, e.g., Prefuse [HCL05] and Improvise [Wea04]. They are extremely flexible and offer sophisticated data management, including object sharing, events for data manipulation and user interaction, and garbage collection, implemented through intrinsic features of Java. These tools include a wide range of frequently required functions, like data query languages, graph layout support, and flexible data tables, as well as most of the commonly used visualizations. Writing additional computation or visualization modules is relatively easy for all of these systems once the user gets acquainted with the corresponding architecture. However, if a certain concept of the system, like the data handling paradigm or the available data types, conflicts with the requirements of the researcher, the benefits from the framework's functionality are extremely limited. A secondary framework will most likely have to be grafted onto the existing one, which results in almost the same effort as writing everything from scratch. This is the case for particle-based visualization of large MD data sets.

Existing frameworks often focus on continuous grid-based or mesh-based data and almost always lack support for large particle-based data, e.g. linear block storage for fast rendering or streaming capabilities. Available open-source tools written especially for MD data or other particle data focus on specific problems, e.g. protein visualization, and are thus too inflexible (e.g. VMD builds up internal data structures for protein analysis tasks, which hamper high-performance rendering of very large particle data sets, cf. Chapter 2.3). To meet the requirements of the large research project funded through the SFB 716 [DFG], the MegaMol project was started with the goal of visualizing data sets with 10 particles from physics, material science, biochemistry, thermodynamics, and engineering. A program for point-based visualization of galaxy data sets [HE03] was available at the VIS8 research group and had been extended to handle MD data [GRVE07]. Being a classical research prototype, the program had an extremely monolithic structure and had nonetheless been extended several times. It had reached a state where maintaining its full functionality was no longer economically acceptable.
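To make the notion of linear block storage concrete, the following minimal C++ sketch (illustrative only; these are not MegaMol's actual data classes) shows a flat, fixed-stride particle block that can be streamed chunk-wise and handed to the graphics API in a single upload call, in contrast to the per-atom object hierarchies built by analysis-oriented tools.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// One particle with a fixed 16-byte stride; no pointers, no per-particle objects.
struct Particle {
    float x, y, z;          // position
    std::uint32_t type;     // type/colour index, resolved later in the shader
};

// A contiguous block of particles: trivially memory-mappable, streamable in
// chunks, and uploadable to the GPU as a single buffer (e.g. via glBufferData).
struct ParticleBlock {
    std::vector<Particle> data;

    const void* raw() const { return data.data(); }
    std::size_t sizeInBytes() const { return data.size() * sizeof(Particle); }
};
```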

But if there is the necessity to start a new software system, at least some publicly available components could be used to lower the implementation workload. While this is true to a certain extent, a system that is quickly put together from existing components only, as is common practice, e.g., in web programming, exhibits similar problems as a disposable prototype: while the functionality, the usability, and perhaps even the flexibility required of a system might be reached, the lack of an adequate software architecture is bound to decrease the scalability in terms of performance, as well as the stability in terms of robustness, extensibility, and maintainability of the software itself.

Scalability in terms of rendering performance and data set size is achieved, among other factors, by the scalability of the employed algorithms and the implemented memory management. Performance of the algorithms can be ensured even in a composed application, as it is mainly an issue of optimizing the individual modules internally. In contrast, optimizing the in-core storage must be accomplished on a larger scale. Issues arise from the fact that the different modules might require different data layouts and may even be written in different programming languages. The latter aspect is not a problem by itself, at most a conceptual challenge, but it might result in the need for data marshalling between the different language paradigms, which, again, can introduce overhead. In addition, many freely available implementations favour educational usefulness and thus simplicity of code over efficiency or scalability, and many algorithms are implemented in Java because of the popularity of the language, resulting in the well-known scalability issues and memory restrictions of un-optimized implementations. For example, the developers of Prefuse explicitly mention a performance bottleneck for visualizations of even medium-sized data sets with a few thousand items9.

8 Institute for Visualization and Interactive Systems, University of Stuttgart, Germany; http://www.vis.uni-stuttgart.de/

A more severe problem, however, is the need for different data layouts. This often results in data replication, which in turn, even assuming every component scales well in terms of performance, increases the required in-core memory with every additional module. Only a centralised base architecture with shared data handling can overcome this problem. This, however, requires either extensive work on the framework, e.g. providing facade access interfaces for all required modules and programming languages, or adaptation of the incorporated code modules, which were originally meant to remain unchanged.
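As a sketch of the centralised alternative, the following C++ fragment (hypothetical names; real frameworks differ in detail) keeps a single owner of the in-core particle data and hands modules only a narrow, read-only facade, so that adding modules does not multiply the memory footprint.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Single owner of the in-core data.
class ParticleStore {
public:
    explicit ParticleStore(std::vector<float> xyz) : pos_(std::move(xyz)) {}
    const float* positions() const { return pos_.data(); }   // interleaved x,y,z
    std::size_t count() const { return pos_.size() / 3; }
private:
    std::vector<float> pos_;
};

// Modules see only this read-only reference; they cannot replicate or mutate
// the store, so the memory footprint stays independent of the module count.
class BoundingBoxModule {
public:
    explicit BoundingBoxModule(const ParticleStore& store) : store_(store) {}
    float maxX() const {
        float m = store_.count() ? store_.positions()[0] : 0.0f;
        for (std::size_t i = 0; i < store_.count(); ++i)
            if (store_.positions()[3 * i] > m) m = store_.positions()[3 * i];
        return m;
    }
private:
    const ParticleStore& store_;
};
```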

Multiple imported modules, which must be considered semi-black boxes in a composed system, also introduce the problem of maintenance stability. When the different modules communicate with each other, a common case, e.g., for coordinated views or multi-pass algorithms (with different passes being realised by separate modules), the number of code paths between the modules calling each other inevitably grows. This interconnection is referred to as coupling [PJ88] and is an essential aspect of software quality. There are five different levels of coupling, but only the worst two need to be avoided to obtain maintainable software. One is content coupling, where modules can address internal data of other modules, which endangers the validity and consistency of the data. The other one is global coupling, referring to global data storage without explicit data ownership, so that every access potentially interferes with every module. The remaining three coupling variants influence internal control flow (control coupling), pass complex data types (stamp coupling), or pass just elementary data types (data coupling) between modules. It is usually not feasible to guarantee this minimum of coupling in research prototypes, as the interfaces of all modules would need to be known beforehand. Instead, modules communicate with each other in arbitrary fashion through interfaces written and extended on demand, in the worst case resulting in content coupling. Making changes to one module or adding a new module therefore has negative side effects on many other parts of the system, rapidly decreasing the maintainability and robustness of the whole software, which will thus eventually break down completely.

9 http://prefuse.org/doc/faq/#tec.1 (last visited 29.12.2011)
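The difference between the two extremes can be illustrated with a toy C++ example (purely illustrative, not taken from any of the systems discussed): content coupling lets one module manipulate another module's internal state, whereas data coupling passes only elementary values through a narrow interface.

```cpp
#include <vector>

// Content coupling: module B reaches into A's internals. Any change to A's
// data layout silently breaks B, and B can corrupt A's invariants.
struct ModuleABad {
    std::vector<float> buffer;   // raw internals exposed to everyone
};
void moduleBBad(ModuleABad& a) {
    a.buffer.clear();            // directly manipulates foreign state
}

// Data coupling: only elementary values cross the interface. A remains free
// to change its internal representation without affecting B.
class ModuleA {
public:
    float valueAt(int i) const { return data_[i]; }
    int size() const { return static_cast<int>(data_.size()); }
private:
    std::vector<float> data_{1.0f, 2.0f, 3.0f};
};
float moduleB(const ModuleA& a) {
    float sum = 0.0f;
    for (int i = 0; i < a.size(); ++i) sum += a.valueAt(i);
    return sum;
}
```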

Figure 79: Top: the classic bathtub curve as known from traditional reliability analysis. The integral of the curve up to a given point in time relates to the effort invested. Bottom: the bathtub curve for software [Rel96] exhibits two main differences: software does not decay and thus cannot wear out. However, it will be updated several times over its useful life, and each update will add new bugs. In the long run, functionality will no longer fit the original design and thus degrade quality, increasing the failure probability with each update.

For the formalization of this scenario, the bathtub curve known from traditional reliability analysis [Rel96] can be adapted to the life cycle of software (cf. Figure 79), which is not prone to physical fatigue, but which gets less and less maintainable with each update introducing a significant amount of new functionality. The increasing complexity of the software results in an increasing asymptote of the failure rate, i.e. the software quality steadily decays over time. The diagram also reflects the effort invested into the development process as the integral of the curve up to a given point in time. The authors of the software-adapted curve suggest improved SE as the only way to counter this effect and to reduce failure rates again. The question arises whether this is true for all types of software, or only for systems.
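The relation between the curve and the invested effort can be written down explicitly; the notation below (λ for the failure rate plotted in the bathtub curve, E for the accumulated effort) is introduced here only for illustration and does not appear in [Rel96].

```latex
\[
  E(t) \;=\; \int_{0}^{t} \lambda(\tau)\,\mathrm{d}\tau ,
  \qquad
  \frac{\mathrm{d}E}{\mathrm{d}t} \;=\; \lambda(t) \;\geq\; 0 .
\]
```

Under this reading, a failure rate whose baseline rises with every update necessarily translates into accumulated effort that grows faster than linearly over the life of the software.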

Figure 80: Bathtub curves for different kinds of software developed in the context of research. The blue curve represents the base workload required for the development of framework aspects, e.g. data set loading or the main loop. The green curve (identical in all three diagrams) adds the workload required for the algorithmic method development for each publication.

The software bathtub curve can be further adapted to represent the features of different types of software. These curves are shown in Figure 80. The top-most curve models the characteristics of proof-of-concept prototypes, which exhibit a one-time development effort, ideally resulting in a publication. Only the core algorithm is optimized for performance. The whole process is basically repeated anew for each publication, although some code elements can be copied and the new work benefits from lessons learned, resulting in slightly lower cost. Through reuse of code, features and bug fixes accumulate, but at the same time code will be reused for tasks it can be made to handle but was never intended for. This results in a reduction of software quality. If the resulting pseudo-system reaches a state where it is no longer usable, e.g. because the missing architecture makes adding new features impossible at acceptable cost, much of the code base is discarded and development is restarted from scratch. Because of this, research prototypes exhibit a nearly constant effort and nearly stable reliability. From an SE point of view, research software for application papers is just a variant of this, where part of the effort is shifted towards a viable user interface and increased robustness.

The centre diagram depicts the situation for systems composed from freely available components. The workload required for the system decreases for each publication, as later publications can utilize the existing framework functions.

However, these systems are started with a specific goal in mind, usually for a specific first publication. Further work will change this goal as research questions change over time. Nevertheless, just because the system is available, it will be used whenever possible. As the development time is going to be reduced because of the well-known time constraints, there will be no thoroughly designed architecture. Each time new functionality is added via a new module, either newly written or imported from existing code, it needs to be connected to the existing system. The lack of interfaces and the need for quick results will lead to code replication, redundancies, and, in the worst case, content coupling [PJ88]. The manipulation of internal data of most modules and ubiquitous side effects will thus deteriorate the stability of the framework and increase the failure rate, resulting in increased maintenance effort, as shown in the original software bathtub curve (bottom image of Figure 79). The system will eventually reach a point of instability where the architecture will break down and the maintenance cost will increase unacceptably.

Since a composed system has basically the same issues as proof-of-concept prototypes, it might also have the same justification of being sufficient for a single or only a few publications, rendering the overhead of SE for visualization systems superfluous. This, however, is not true. On the one hand, the effort required to set up a system, even one merely composed from existing code, is too high to be justified for a single or very few publications. On the other hand, the sheer availability of a system of any kind results in the desire to employ it for further applications. The life span of the software thus increases beyond the original plan, which results in tweaking of modules towards changing goals. The system's structure, however, will change only very little or not at all [AR99]. Because the framework was not designed properly, its structure will develop into a hindrance in the long run.

These effects of framework degeneration are countered by a well-thought-out design created before implementation starts, which yields the stable frameworks of engineered systems. These will also be extended for each publication, but because of the initial design effort the functional updates to the framework will be very limited and will allow for a slowly but steadily decreasing failure rate. The ideal situation is a non-periodic bathtub curve as a baseline, representing the effort invested into the framework, which becomes stable over time. This is sketched in the bottom diagram of Figure 80. Additional effort almost exclusively results from inserting research-grade prototype modules, which obtain good stability through correct interfacing with the system. The quality of these modules is not relevant for the quality of the system as a whole. Early publications will be more expensive, as the framework will still be under heavy development. However, when the system reaches a certain degree of stability and usefulness, publications will require significantly less development effort.

A stable system architecture obviously requires some effort up front. This is why SE is not very popular: it means shifting a lot of effort to very early stages of the development process, and the ensuing positive effect takes a relatively long time to manifest. What we need is just the right dose of SE. The framework needs to be cleanly designed and developed. Many other parts, however, like well-known basic algorithms required for some ground truth or baseline, can be implemented and integrated by anyone with a decent understanding. This can even be accomplished by involving students or interns. To control the influences of the different parts of the system, these parts simply need to be known first. As trivial as this sounds, this first step, installing a rudimentary configuration management, is often reduced to the simple use of version control systems for source code, e.g. Subversion10. As useful, even important, as such a version control system is in general, the development of a software system of the complexity discussed here requires more elaborate configuration management functions, e.g. test configuration definition, automatic testing, and approval or rejection of new code, which cannot be delivered by these tools alone.
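As an example of what such configuration management adds beyond plain version control, the following self-contained C++ smoke test sketches the idea of automatic testing against a fixed test configuration; all names and the gating policy are hypothetical and stand in for whatever test runner and review process a project actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical module under test: computes the axis-aligned bounding box of a
// particle position list (x, y, z interleaved).
struct BBox { float min[3], max[3]; };

BBox computeBBox(const std::vector<float>& xyz) {
    BBox b{{xyz[0], xyz[1], xyz[2]}, {xyz[0], xyz[1], xyz[2]}};
    for (std::size_t i = 0; i < xyz.size(); i += 3)
        for (int c = 0; c < 3; ++c) {
            if (xyz[i + c] < b.min[c]) b.min[c] = xyz[i + c];
            if (xyz[i + c] > b.max[c]) b.max[c] = xyz[i + c];
        }
    return b;
}

int main() {
    // "Test configuration": a fixed, known input and the expected result.
    const std::vector<float> data{0, 0, 0,  1, 2, 3,  -1, 0.5f, 2};
    const BBox b = computeBBox(data);
    assert(b.min[0] == -1.0f && b.max[1] == 2.0f && b.max[2] == 3.0f);
    std::cout << "smoke test passed\n";   // a failing test would reject the commit
    return 0;
}
```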