
7.3. Multilevel Parallelization of a Hydrodynamic Model

7.3.2. Level 1: RMA·Kalypso in a Cluster

On the cluster level, a simple yet effective improvement of RMA·Kalypso has been made by setting up the sparse linear system of equations in parallel and by using an MPI-parallel sparse solver. This improvement was made as part of this thesis and brings RMA·Kalypso up to the state of the art of parallel flow models.

A further advantage of RMA·Kalypso over other numerical models has been achieved by interfacing it with an extensible library, PETSc (see below), which acts as an adapter to a collection of distributed matrix management, preconditioning, and sparse system solver routines.

The PETSc Library for Parallel Computation

The Portable, Extensible Toolkit for Scientific Computation (PETSc, [BG+97; BB+11]) contains data structures for distributed sparse matrices and external interfaces to a large collection of preconditioners as well as direct and iterative sparse linear and nonlinear solvers. PETSc makes heavy use of MPI and thus requires a tightly coupled computing environment. For good performance, PETSc needs a fast, low-latency interconnect (faster than Gigabit Ethernet) and high per-CPU memory performance, which is typically not available in standard multi-core machines. The reason is that solving sparse systems depends more on fast memory than on fast processors [BB+11].

PETSc was chosen for its flexibility, good documentation, and easy integration into existing software.
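
To illustrate the kind of integration this involves, the following minimal C sketch (not taken from RMA·Kalypso, with a toy tridiagonal matrix and error checking omitted, assuming a recent PETSc version) assembles a small distributed matrix in PETSc's MPIAIJ format and solves a linear system with a KSP object whose solver and preconditioner can be chosen at run time:

    #include <petscksp.h>

    int main(int argc, char **argv)
    {
      Mat A;            /* distributed sparse matrix (MPIAIJ format) */
      Vec x, b;         /* solution and right-hand side */
      KSP ksp;          /* linear solver context */
      PetscInt i, rstart, rend, n = 100;

      PetscInitialize(&argc, &argv, NULL, NULL);

      /* Create a distributed n x n matrix; PETSc decides the row distribution. */
      MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, n,
                   3, NULL, 2, NULL, &A);
      MatGetOwnershipRange(A, &rstart, &rend);

      /* Each process inserts only the rows it owns (here: a simple tridiagonal stencil). */
      for (i = rstart; i < rend; i++) {
        PetscScalar v[3]    = {-1.0, 2.0, -1.0};
        PetscInt    cols[3] = {i - 1, i, i + 1};
        if (i == 0)          MatSetValues(A, 1, &i, 2, &cols[1], &v[1], INSERT_VALUES);
        else if (i == n - 1) MatSetValues(A, 1, &i, 2, cols, v, INSERT_VALUES);
        else                 MatSetValues(A, 1, &i, 3, cols, v, INSERT_VALUES);
      }
      MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
      MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

      /* Matching distributed vectors and a right-hand side of ones. */
      MatCreateVecs(A, &x, &b);
      VecSet(b, 1.0);

      /* Solver and preconditioner are selected via run-time options, e.g.
         -ksp_type bcgsl -pc_type asm  or  -ksp_type preonly -pc_type lu. */
      KSPCreate(PETSC_COMM_WORLD, &ksp);
      KSPSetOperators(ksp, A, A);
      KSPSetFromOptions(ksp);
      KSPSolve(ksp, b, x);

      KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
      PetscFinalize();
      return 0;
    }

The level 1 parallelization follows the same general pattern, with the matrix entries coming from the finite element assembly instead of a toy stencil.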

Solution Techniques for Sparse Linear Systems

In the case of a conservative weak formulation of the finite element discretization, the systems to be solved are of the general, unsymmetric block form

\[
\begin{pmatrix} A & B^{T} \\ B & C \end{pmatrix}
\]

(compare [Hei11]). Two alternative global solution methods in PETSc have been tested for this problem (see below): (1) a direct solve and (2) an additive Schwarz preconditioned, iterative domain decomposition method with direct solves on the subdomains (often referred to as Krylov-Schwarz). In both cases, MUMPS (MUltifrontal Massively Parallel sparse direct Solver) was used for the direct factorization and solution of the linear systems, and a stabilized biconjugate gradient method (BiCGstab) was used for the iterative solution.

MUMPS [AD+01] is a distributed multifrontal solver for general unsymmetric matrices arising from linear systems of equations. The software is based on MPI, is fully asynchronous, and has parallel LU factorization and solution phases. In order to reduce fill-in, it supports METIS or PARMETIS matrix reorderings¹, among others. PETSc provides an interface to MUMPS using its sparse, distributed matrix format (MPIAIJ).
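
Through the PETSc interface, a direct MUMPS solve corresponding to approach (1) can be requested roughly as in the following sketch; the matrix A and the vectors b and x are assumed to be set up as in the earlier example, and in older PETSc releases the call PCFactorSetMatSolverType is named PCFactorSetMatSolverPackage:

    /* Sketch: configure a KSP object so that the "solve" is a single
       parallel LU factorization and back-substitution done by MUMPS. */
    KSP ksp;
    PC  pc;

    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);          /* A is the assembled MPIAIJ matrix */
    KSPSetType(ksp, KSPPREONLY);         /* no Krylov iteration, apply the "preconditioner" once */
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCLU);                 /* full LU factorization */
    PCFactorSetMatSolverType(pc, MATSOLVERMUMPS);  /* delegate the factorization to MUMPS */
    KSPSetFromOptions(ksp);              /* still allow overrides from the options file */
    KSPSolve(ksp, b, x);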

BiCGstab(l) (BiConjugate Gradient stabilized) is a Krylov subspace method for unsymmetric systems developed by Sleijpen et al. [SF93; Sv95]. For brevity, the algorithm is not described here. In PETSc, BiCGstab(l) could moreover easily be replaced by other Krylov methods, such as GMRES. The Krylov method requires O(l) matrix-vector products to find the next search direction; these matrix-vector operations cause most of the network traffic.

Domain decomposition methods can be regarded as a family of hybrid methods between iterative and direct solvers where the problem is decomposed into subproblems on adjacent, possibly overlapping regions. Assigning one subproblem to each computing resource yields a natural parallelization of the problem. Domain decomposition can be done either algebraically (on the matrix) or geometrically (on the mesh). The additive Schwarz method is an algebraic domain decomposition method. It goes back to an iterative method originally developed by Schwarz [Sch70], which is now referred to as multiplicative Schwarz. Due to its practical applicability and natural parallelization, the additive Schwarz method has been rediscovered and improved several times to implement domain decomposition methods for the solution of partial differential equations [Bab57; DW87; SBG96; TW05]. If additive Schwarz is used as a preconditioner in PETSc, the global matrix is partitioned algebraically and distributed to all computing resources. On each resource, the local subdomain problem is either solved directly or by means of an iterative method. The results from all resources are gathered, added on the interface — giving the method its name — and scattered back.
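
A Krylov-Schwarz configuration corresponding to approach (2) can be sketched as follows; again A, b, and x are assumed to exist, and the particular overlap, l parameter, and subdomain solver settings are illustrative rather than the configuration actually used:

    /* Sketch: BiCGstab(l) as the outer Krylov method, preconditioned by an
       additive Schwarz decomposition with one subdomain per process, each
       subdomain problem being solved directly (here again by MUMPS). */
    KSP ksp;
    PC  pc;

    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetType(ksp, KSPBCGSL);           /* BiCGstab(l) */
    KSPBCGSLSetEll(ksp, 2);              /* the parameter l of BiCGstab(l) */
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCASM);                /* additive Schwarz preconditioner */
    PCASMSetOverlap(pc, 1);              /* one layer of overlap between subdomains */
    KSPSetFromOptions(ksp);
    /* The subdomain solvers are most easily chosen via run-time options, e.g.
       -sub_ksp_type preonly -sub_pc_type lu -sub_pc_factor_mat_solver_type mumps */
    KSPSolve(ksp, b, x);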

¹ http://glaros.dtc.umn.edu/gkhome/views/metis

Solution of the Nonlinear System

The shallow water equations result in a nonlinear system of equations. RMA·Kalypso employs Newton's method to transform the problem into a series of linear problems. For each linear solve, the Jacobian matrix, i. e. the matrix of all first-order partial derivatives, is assembled. Each computing resource is assigned a subset of elements for which it builds the local element stiffness matrices. In each Newton iteration, all resources add their local matrices to a global, distributed Jacobian matrix, which is managed by PETSc and used in the solution process. An inexact Newton method [DES82], in which the linear problems are solved only up to some error, can easily be implemented by relaxing the convergence limits of the inner Krylov iteration. Inexact Newton methods have the advantage that fewer inner iterations in total may be needed to obtain convergence of the outer loop.
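
The interplay between the element-level assembly and the PETSc solve within one Newton iteration can be pictured roughly as follows; the helper routines assemble_local_elements and update_solution are hypothetical placeholders, not actual RMA·Kalypso routines, and the tolerance value is purely illustrative:

    /* Sketch of one Newton iteration: every process contributes the stiffness
       matrices of its assigned elements to the global, distributed Jacobian J,
       then the linear correction is computed in parallel. J, residual, delta
       and ksp are PETSc objects created once outside the loop. */
    MatZeroEntries(J);                       /* reset the Jacobian for this iteration */
    assemble_local_elements(J, residual);    /* MatSetValues(..., ADD_VALUES) per local element */
    MatAssemblyBegin(J, MAT_FINAL_ASSEMBLY); /* exchange off-process contributions */
    MatAssemblyEnd(J, MAT_FINAL_ASSEMBLY);
    VecAssemblyBegin(residual);
    VecAssemblyEnd(residual);

    KSPSetOperators(ksp, J, J);
    /* Inexact Newton: solve the linear problem only to a loose relative tolerance. */
    KSPSetTolerances(ksp, 1.0e-3, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT);
    KSPSolve(ksp, residual, delta);          /* J * delta = residual */

    update_solution(delta);                  /* apply the Newton correction */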

Performance Comparison

PETSc allows the combination of a preconditioner and a Krylov solver to be set in a simple configuration file. This makes it easy to compare the performance of the two global approaches, i. e. Newton's method with (1) a direct linear solve or (2) an iterative Newton-Krylov-Schwarz solve, in which the Schwarz domain decomposition is treated as a parallel preconditioner of the Krylov method. The results of this performance comparison can be found in Appendix C on page 145. In the examined test case, the second approach performs slightly better than the first, but both show an overall improvement over the original shared-memory implementation. It was not a goal of this thesis to find the optimal configuration for the level 1 parallelization.
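
Such a configuration file might, for example, contain option sets along the following lines; the option names are illustrative and may differ slightly between PETSc versions (e. g. -pc_factor_mat_solver_type was formerly -pc_factor_mat_solver_package):

    # (1) direct solve of every linear system with MUMPS
    -ksp_type preonly
    -pc_type lu
    -pc_factor_mat_solver_type mumps

    # (2) Newton-Krylov-Schwarz: BiCGstab(l) with an additive Schwarz
    #     preconditioner and direct MUMPS solves on the subdomains
    -ksp_type bcgsl
    -pc_type asm
    -pc_asm_overlap 1
    -sub_ksp_type preonly
    -sub_pc_type lu
    -sub_pc_factor_mat_solver_type mumps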

Limitations and Suggestions for Improvement

Only parts of the calculation core RMA·Kalypso have been parallelized, while the rest of the program still runs sequentially, i. e. every computing resource performs the same operations on the same data. All data is kept on all computing resources, so the parallel application has the same memory requirements as if it were executed on a single computing resource. As a consequence, the simulation does not scale to arbitrarily large domains. An approach for data distribution could have been implemented based on the PETSc DMMesh or DMComplex distributed mesh objects [BB+11], but this was beyond the scope of this thesis.
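
As a purely illustrative starting point (and assuming a more recent PETSc release, in which DMComplex has been renamed DMPlex), such a mesh distribution could look roughly like this:

    /* Sketch: build an unstructured-mesh DM and let PETSc partition and
       migrate it, so that every process stores only its own part of the mesh. */
    DM dm, dmDist = NULL;

    DMCreate(PETSC_COMM_WORLD, &dm);
    DMSetType(dm, DMPLEX);                    /* unstructured-mesh DM */
    DMSetFromOptions(dm);                     /* mesh source/shape selected via run-time options */
    DMPlexDistribute(dm, 0, NULL, &dmDist);   /* partition and distribute the mesh */
    if (dmDist) { DMDestroy(&dm); dm = dmDist; }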

Although the presented level 1 MPI parallelization of RMA·Kalypso with PETSc could also be executed across several clusters (e. g. using MPICH-G2, see Subsection 2.1.3), performance would decrease dramatically. The reason for the expected degradation is the large number of messages and the amount of data that has to be exchanged in both the direct and the iterative solver. The major requirement of PETSc, a fast, low-latency interconnect, is not met in this case. However, no performance measurements have been made to demonstrate this performance loss, because MPICH-G2 is still not supported in the German D-Grid infrastructure.

The following section introduces a loosely coupled algorithm suited for multi-cluster environments (level 2). The computations on each cluster will then be parallelized using one of the level 1 algorithms presented above. This choice is independent of the level 2 algorithm.
