• Keine Ergebnisse gefunden

Efficient Local Resorting Techniques with Space Filling Curves

N/A
N/A
Protected

Academic year: 2022

Aktie "Efficient Local Resorting Techniques with Space Filling Curves"

Copied!
49
0
0

Wird geladen.... (Jetzt Volltext ansehen)

Volltext

(1)

Efficient Local Resorting Techniques with Space Filling Curves

Applied to a Parallel Tsunami Simulation Model

Natalja Rakowsky and Annika Fuchs AWI, Tsunami-Modelling-Group

The 10th International Workshop on Multiscale (Un-)structured Mesh Numerical Modelling for coastal, shelf and global ocean dynamics

Alfred Wegener Institute for Polar and Marine Research Bremerhaven, 22 - 25 August 2011

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 1 / 35

(2)

Outline

introducing TsunAWI motivation for resorting

construction of Hilbert space filling curve (SFC) ordering comparison to other sortings

conclusions

(3)

The AWI Tsunami Modell TsunAWI

TsunAWI in a nutshell

shallow water equations with inundation unstructuredP1−P1NCfinite element grid explicit time stepping scheme

OpenMP parallel Fortran90 code

Most important application:

German-Indonesian Tsunami Early Warning System 3470 scenarios for different prototypic ruptures 3h modeltime (10.800 timesteps of 1s)

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 3 / 35

(4)

The AWI Tsunami Modell TsunAWI

TsunAWI in a nutshell

shallow water equations with inundation unstructuredP1−P1NCfinite element grid explicit time stepping scheme

OpenMP parallel Fortran90 code

Most important application:

German-Indonesian Tsunami Early Warning System 3470 scenarios for different prototypic ruptures 3h modeltime (10.800 timesteps of 1s)

(5)

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 4 / 35

(6)

TsunAWI: example for a computational domain

regional grid for the Sunda Arc

(7)

TsunAWI: example for a computational domain

regional grid for the Sunda Arc

The computational grid discretizes the domain with

varying resolution 50m areas of interest 500m all other coastal areas 15km deep ocean

2.366.319 nodes 4.721.884 elements

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 6 / 35

(8)

TsunAWI: example for a computational domain

regional grid for the Sunda Arc, focus on Bali

(9)

TsunAWI: example for a computational domain

regional grid for the Sunda Arc, focus on Bali

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 8 / 35

(10)

TsunAWI: example for a computational domain

regional grid for the Sunda Arc, focus on Bali

(11)

TsunAWI: example for a computational domain

regional grid for the Sunda Arc, focus on Bali

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 10 / 35

(12)

TsunAWI: example for a computational domain

Original numbering of nodes as provided by the grid generator

(13)

adjacency matrix, original grid

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 12 / 35

(14)

adjacency matrix, original grid

(15)

Motivation for resorting

Data locality on the original grid isvery, verybad.

E.g., each computation on all nodes of one element results in at least one cache miss.

Most time consuming routines in every timestep: compute velocity at nodesv(node) = F(adjacent edges, elems) compute velocity v(edge) = F(adjacent elems, nodes) compute ssh ssh(node) = F(adjacent elems, nodes) compute gradient gradx,y(elem) = F(adjacent nodes)

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 14 / 35

(16)

Motivation for resorting

Data locality on the original grid isvery, verybad.

E.g., each computation on all nodes of one element results in at least one cache miss.

Most time consuming routines in every timestep:

compute velocity at nodesv(node) = F(adjacent edges, elems) compute velocity v(edge) = F(adjacent elems, nodes) compute ssh ssh(node) = F(adjacent elems, nodes) compute gradient gradx,y(elem) = F(adjacent nodes)

(17)

Ideas for resorting

SFC like Sierpinski curve in adaptive grid (J. Behrens et al., KlimaCampus Uni Hamburg) could help.

But how to derive SFC for highly unstructured grid?

@

@

@

@

@

@

@

@

@

@

@

@

@@

@

@

@

@@

@

@@

@

@@

Construct SFC like 3D Hilbert curve in particle code Gadget-2 (communication with T. Rung, TU

Hamburg-Harburg)

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 15 / 35

(18)

Ideas for resorting

SFC like Sierpinski curve in adaptive grid (J. Behrens et al., KlimaCampus Uni Hamburg) could help.

But how to derive SFC for highly unstructured grid?

@

@

@

@

@

@

@

@

@

@

@

@

@@

@

@

@

@@

@

@@

@

@@

Construct SFC like 3D Hilbert curve in particle code Gadget-2 (communication with T. Rung, TU

(19)

SFC construction

•n

0

1 2

3

0

1 2

3

1 0

2 3

For all nodesncalculate the index in the Hilbert curve as a quad number:

SFC index(n) =

132. . .

e.g. for 8 levels: SFC index(n) =

1

·48+

3

·47+

2

·46+. . .

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 16 / 35

(20)

SFC construction

•n

0

1 2

3

0

1 2

3

1 0

2 3

For all nodesncalculate the index in the Hilbert curve as a quad number:

SFC index(n) =

132. . .

e.g. for 8 levels: SFC index(n) =

1

·48+

3

·47+

2

·46+. . .

(21)

SFC construction

•n

0

1 2

3

0

1 2

3

1 0

2 3

For all nodesncalculate the index in the Hilbert curve as a quad number:

SFC index(n) =

1

32. . .

e.g. for 8 levels: SFC index(n) =

1

·48+

3

·47+

2

·46+. . .

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 16 / 35

(22)

SFC construction

•n

0

1 2

3

0

1 2

3

1 0

2 3

For all nodesncalculate the index in the Hilbert curve as a quad number:

SFC index(n) =

1

32. . .

e.g. for 8 levels: SFC index(n) =

1

·48+

3

·47+

2

·46+. . .

(23)

SFC construction

•n

0

1 2

3

0

1 2

3

1 0

2 3

For all nodesncalculate the index in the Hilbert curve as a quad number:

SFC index(n) =

13

2. . .

e.g. for 8 levels: SFC index(n) =

1

·48+

3

·47+

2

·46+. . .

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 16 / 35

(24)

SFC construction

•n

0

1 2

3

0

1 2

3

1 0

2 3

For all nodesncalculate the index in the Hilbert curve as a quad number:

SFC index(n) =

13

2. . .

e.g. for 8 levels: SFC index(n) =

1

·48+

3

·47+

2

·46+. . .

(25)

SFC construction

•n

0

1 2

3

0

1 2

3

1 0

2 3

For all nodesncalculate the index in the Hilbert curve as a quad number:

SFC index(n) =

132

. . .

e.g. for 8 levels: SFC index(n) =

1

·48+

3

·47+

2

·46+. . .

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 16 / 35

(26)

SFC construction

•n

0

1 2

3

0

1 2

3

1 0

2 3

For all nodesncalculate the index in the Hilbert curve as a quad number:

SFC index(n) =

132. . .

e.g. for 8 levels: SFC index(n) =

1

·48+

3

·47+

2

·46+. . .

(27)

SFC construction

•n

0

1 2

3

0

1 2

3

1 0

2 3

For all nodesncalculate the index in the Hilbert curve as a quad number:

SFC index(n) =

132. . .

e.g. for 8 levels:

SFC index(n) =

1

·48+

3

·47+

2

·46+. . .

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 16 / 35

(28)

SFC reordering

Reorder the nodes according to SFC index.

Reorder the elements by an SFC separatly, or numerically by node indicees (more efficient for TsunAWI)

Edges are constructed in TsunAWI (sorted along the nodes)

(29)

SFC ordering of the nodes

for TsunAWI regional indonesian grid

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 18 / 35

(30)

SFC ordering of the nodes

for TsunAWI regional indonesian grid

(31)

adjacency matrix for SFC sorted grid

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 20 / 35

(32)

adjacency matrix for SFC sorted grid

(33)

Comparison: RCM ordering

adjacency matrix

RCM (reverse Cuthill McKee) ordering obtained via adjacency matrix and Matlab symrcm for sparse matrices.

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 22 / 35

(34)

Comparison: RCM ordering

adjacency matrix

RCM (reverse Cuthill McKee) ordering obtained via adjacency matrix

(35)

Comparison: RCM ordering

for TsunAWI regional indonesian grid

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 24 / 35

(36)

Comparison: RCM ordering

for TsunAWI regional indonesian grid

(37)

Comparison: AMD ordering

adjacency matrix

AMD (approximate minimum degree) ordering obtained via adjacency matrix and Matlab symamd for sparse matrices.

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 26 / 35

(38)

Comparison: AMD ordering

adjacency matrix

AMD

(39)

Comparison: AMD ordering

for TsunAWI regional indonesian grid

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 28 / 35

(40)

Comparison: AMD ordering

for TsunAWI regional indonesian grid

(41)

SFC compared to unsorted, RCM, SymAMD

computation time: IBM Power6

Computational time [seconds] for timestep on a cluster node 1×IBM Power6 (4 Cores, 2×hyperthreading)

OMP NUM THREADS

1 2 4 8

orig. 9.77 4.08 2.91 1.57 RCM 2.78 1.77 0.97 0.69 AMD 2.76 1.42 0.95 0.66 SFC 2.69 1.58 0.92 0.60

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 30 / 35

(42)

SFC compared to unsorted, RCM, SymAMD

Hardware counters: IBM Power6

IBM Hardware counter hpmcount for 1000 timesteps on 1×IBM Power6 (4 Cores, 2×hyperthreading,

OMP NUM THREADS=8)

hpmcount event

L2 cache misses Number of loads per load miss orig. 274,478,564,540 17.8

RCM 57,244,100,260 64.0

AMD 54,709,662,295 65.6

SFC 49,980,798,689 88.5

(43)

SFC compared to unsorted, RCM, SymAMD

computation time: Intel Xeon Nehalem-EX

Computational time [seconds] for one timestep on

one blade SGI Altix UV (HLRN, ZIB Berlin and RRZN Hannover) 2×Intel Xeon 5570 (8 Cores, 2×hyperthreading)

OMP NUM THREADS

32, No

1 2 4 8 16 32

64 First Touch

orig. 3.84 2.16 1.48 0.89 0.52 0.40

1.63 0.51

RCM 1.64 1.12 0.59 0.35 0.20 0.19

0.37 0.32

AMD 1.47 0.77 0.50 0.30 0.18 0.16

0.32 0.19

SFC 1.47 0.90 0.51 0.31 0.17 0.14

0.30 0.18

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 32 / 35

(44)

SFC compared to unsorted, RCM, SymAMD

computation time: Intel Xeon Nehalem-EX

Computational time [seconds] for one timestep on

one blade SGI Altix UV (HLRN, ZIB Berlin and RRZN Hannover) 2×Intel Xeon 5570 (8 Cores, 2×hyperthreading)

OMP NUM THREADS

32, No

1 2 4 8 16 32 64

First Touch

orig. 3.84 2.16 1.48 0.89 0.52 0.40 1.63

0.51

RCM 1.64 1.12 0.59 0.35 0.20 0.19 0.37

0.32

AMD 1.47 0.77 0.50 0.30 0.18 0.16 0.32

0.19

SFC 1.47 0.90 0.51 0.31 0.17 0.14 0.30

0.18

(45)

SFC compared to unsorted, RCM, SymAMD

computation time: Intel Xeon Nehalem-EX

Computational time [seconds] for one timestep on

one blade SGI Altix UV (HLRN, ZIB Berlin and RRZN Hannover) 2×Intel Xeon 5570 (8 Cores, 2×hyperthreading)

OMP NUM THREADS 32, No

1 2 4 8 16 32 64 First Touch

orig. 3.84 2.16 1.48 0.89 0.52 0.40 1.63 0.51 RCM 1.64 1.12 0.59 0.35 0.20 0.19 0.37 0.32 AMD 1.47 0.77 0.50 0.30 0.18 0.16 0.32 0.19 SFC 1.47 0.90 0.51 0.31 0.17 0.14 0.30 0.18

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 32 / 35

(46)

Remark on OpenMP

importance of first touch for data locality

allocate(array(dim)) array(:) = 0.

!$OMP PARALLEL DO do n=1,dim

array(n) = 0. end do

!$OMP END PARALLEL DO

!$OMP PARALLEL DO do n=1,dim

array(n) = ...

end do

(47)

Remark on OpenMP

importance of first touch for data locality

allocate(array(dim))

array(:) = 0.

!$OMP PARALLEL DO do n=1,dim

array(n) = 0.

end do

!$OMP END PARALLEL DO

!$OMP PARALLEL DO do n=1,dim

array(n) = ...

end do

!$OMP END PARALLEL DO

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 33 / 35

(48)

properties of resorting by a SFC

SFC is a very valuable method, because

it is cheap to compute provides good data locality

onalllevels of the memory hierarchy

as domain decomposition, it keeps interfaces small (though not optimal)

(49)

work to do

Influence of SFC ordering on ILU based preconditioners

fill-in

computational load convergence rate

sparse matrix computations in general

SFC compared to generic partitioning algorithms (MeTiS, scotch,. . . )

TsunAWI

further optimize OpenMP parallelization MPI parallelization

N. Rakowsky, A. Fuchs SFC in TsunAWI IMUM 2011, Bremerhaven 35 / 35

Referenzen

ÄHNLICHE DOKUMENTE

We consider a new network design problem that general- izes the Hop and Diameter Constrained Minimum Spanning and Steiner Tree Problem as follows: given an edge-weighted

A real valued function is holomor- phic if and only if it is constant.. Consider the complex conjugate

Auf der N4000 können sowohl Programme gerechnet werden, die die schnellen PA-RISC Prozessoren für eine skalare Leistung nutzen, als auch parallele Pro- gramme, die auf bis zu

.Xauthority wird von SSH benutzt, um das Authorization Cookie für den X11- Server zu speichern; SSH überprüft, daß die X11-Forward-Verbin- dungen dieses Cookie betreffen; wenn

From To Function From To Function Receptacle a,nd Receptacle and Recepta.cle a.nd Receptacle and..

- Parameter a in eine Nachricht packt - Nachricht an Server S versendet - Timeout für die Antwort setzt - Antwort entgegennimmt (oder evtl. exception bei timeout auslöst)

We present a technique to infer lower bounds on the worst- case runtime complexity of integer programs.. To this end, we construct symbolic representations of program executions using

16. Meister des Boisser&eschen Bar- tholomäus, Die Heiligen Christina, Jacobus minor; Bartholomäus,Agnes, Cäcilia, Johannes Evangelista, Mar- garete. Meister des Todes der