
ScaFaCoS and P³M

Recent Developments

Olaf Lenz

June 3, 2013


Outline

ScaFaCoS

ScaFaCoS Methods

Performance Comparison

Recent P³M developments


ScaFaCoS

Scalable Fast Coulomb Solver

- "Highly scalable", MPI-parallelized library of different Coulomb solvers
- Common interface for all methods
- Developed by groups from Jülich, Wuppertal, Chemnitz, Bonn ... and Stuttgart
- BMBF project, officially ended 2011
- Source code has been on GitHub for two months (yay!)
- First publication will be submitted "soon" (as it has been for the past 6 months)


Interface

    #include <fcs.h>

    FCS handle = NULL;

    /* Initialize P3M */
    fcs_init(&handle, "p3m", MPI_COMM_WORLD);

    /* Set common parameters */
    fcs_set_common(handle, near_field, box_a, box_b, box_c,
                   offset, periodicity, total_particles);

    /* Set method-specific parameters */
    fcs_p3m_set_r_cut(handle, r_cut);

    /* Tune the method (optional) */
    fcs_tune(handle, N, max_particles, positions, charges);

    /* Run the method */
    fcs_run(handle, N, max_particles, positions, charges,
            fields, potentials);

    /* Finally destroy the handle */
    fcs_destroy(handle);
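For context, a minimal driver around the call sequence above might look as follows. This is a hypothetical sketch, not part of the slides: the box geometry, particle data and the near-field flag value are illustrative assumptions, and the exact argument types should be checked against the installed fcs.h (error checking via the returned result objects is omitted).

    #include <stdio.h>
    #include <mpi.h>
    #include <fcs.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);   /* sketch intended for a single MPI process */

        /* Two charges in a cubic, fully periodic box (illustrative values only) */
        fcs_float box_a[3]  = { 10.0,  0.0,  0.0 };
        fcs_float box_b[3]  = {  0.0, 10.0,  0.0 };
        fcs_float box_c[3]  = {  0.0,  0.0, 10.0 };
        fcs_float offset[3] = {  0.0,  0.0,  0.0 };
        fcs_int periodicity[3] = { 1, 1, 1 };
        fcs_int total_particles = 2, N = 2, max_particles = 2;

        fcs_float positions[6] = { 2.0, 5.0, 5.0,   8.0, 5.0, 5.0 };
        fcs_float charges[2]   = { 1.0, -1.0 };
        fcs_float fields[6], potentials[2];

        FCS handle = NULL;
        fcs_init(&handle, "p3m", MPI_COMM_WORLD);
        /* near-field flag set to 1: let the library compute the near field (assumption) */
        fcs_set_common(handle, 1, box_a, box_b, box_c,
                       offset, periodicity, total_particles);
        fcs_p3m_set_r_cut(handle, 3.0);
        fcs_tune(handle, N, max_particles, positions, charges);
        fcs_run(handle, N, max_particles, positions, charges, fields, potentials);

        printf("potential at particle 0: %f\n", (double)potentials[0]);

        fcs_destroy(handle);
        MPI_Finalize();
        return 0;
    }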


Methods

ScaFaCoS currently provides 11 methods:

DIRECT, EWALD, P3M, P2NFFT, VMG, PP3MG, PEPC, FMM, MEMD, MMM1D, MMM2D

In the following comparison, only the highlighted methods are considered (P3M, P2NFFT, VMG, PP3MG, FMM, MEMD).

We distinguish splitting methods, hierarchical methods and local methods (i.e. MEMD).

The other methods are included for reference purposes only (DIRECT, EWALD), target different periodicities (MMM1D, MMM2D; here, only fully periodic systems are considered), or performed too poorly (PEPC).


Splitting Methods

Problems of the electrostatic potential:
- Slow decay – bad for direct summation
- Singularity – bad for convergence-accelerating methods

[Figure: the Coulomb potential split into a near-field and a far-field contribution]

Idea of splitting methods: split the potential into a fast-decaying near field and a non-singular far field
- The near field can be computed directly (O(N))
- For the far field, other methods can be used
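As a concrete illustration, the standard Ewald splitting (with a splitting parameter α that is not given on the slide) realizes exactly this decomposition:

\[
\frac{1}{r}
= \underbrace{\frac{\operatorname{erfc}(\alpha r)}{r}}_{\text{near field: fast decay}}
+ \underbrace{\frac{\operatorname{erf}(\alpha r)}{r}}_{\text{far field: non-singular}}
\]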


Splitting Methods: Ewald and Particle-Mesh Ewald

Ewald's idea: compute the far field in Fourier space
- Ewald summation: O(N^(3/2))
- Particle-Mesh Ewald: O(N log N)
  - discretize the far-field charge distribution onto a mesh
  - use an FFT to Fourier-transform it
  - solve the Poisson equation in Fourier space
  - back-FFT to obtain the potential on the mesh
  - compute potentials or fields by interpolating the mesh potential
- In ScaFaCoS: P3M (ICP), P2NFFT (Chemnitz; uses a non-equidistant FFT)

[Portrait: P. P. Ewald (1888-1985)]
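As a worked sketch of the Fourier-space step (Gaussian units; the Gaussian factor follows from the Ewald splitting above and is not spelled out on the slide), the far-field mesh potential is obtained from the transformed charge density as

\[
\hat{\Phi}_{\mathrm{far}}(\mathbf{k}) = \frac{4\pi}{k^2}\,\hat{\rho}(\mathbf{k})\,e^{-k^2/(4\alpha^2)}
\qquad (\mathbf{k} \neq 0),
\]

followed by the back-FFT and the interpolation from the mesh to the particle positions.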


Splitting Methods: Multigrid

- Solve the Poisson equation in the far field with a multigrid PDE solver
- Use different levels of successively coarser meshes
- Solve the Poisson equation on these meshes by recursively improving the solution of the coarser mesh
- Complexity O(N)
- Can be extended to handle periodic BC
- In ScaFaCoS: PP3MG (Wuppertal), VMG (Bonn)

[Figure: V-cycle over mesh levels l = 4, 3, 2, 1, with restriction, prolongation and smoothing/solving steps]
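For illustration, one two-grid correction step of such a solver (standard multigrid notation, not taken from the slide; A_h u_h = f_h is the discretized Poisson problem on the fine mesh, R and P are restriction and prolongation, S is a smoother):

\begin{align*}
u_h &\leftarrow S(u_h, f_h) && \text{pre-smoothing}\\
r_H &= R\,(f_h - A_h u_h) && \text{restrict the residual to the coarse mesh}\\
A_H e_H &= r_H && \text{solve on the coarse mesh (or recurse)}\\
u_h &\leftarrow u_h + P\,e_H && \text{prolongate the correction}\\
u_h &\leftarrow S(u_h, f_h) && \text{post-smoothing}
\end{align*}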


Hierarchical Methods: Barnes-Hut Tree Code

- Multipole-expand successively larger clusters of particles
- Compute interactions with far-away clusters instead of with single particles
- Complexity O(N log N)
- Can be extended to handle periodic BC
- In ScaFaCoS: PEPC (Jülich)
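"Far away" is usually decided by the standard Barnes-Hut opening criterion (an added illustration, not stated on the slide): a cluster of spatial extent s at distance d from the target particle is replaced by its multipole expansion if

\[
\frac{s}{d} < \theta
\]

for a fixed opening angle θ; otherwise the cluster is opened and its children are examined recursively.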


Hierarchical Methods: Fast Multipole Method

- Extends Barnes-Hut: let clusters interact with each other
- Put everything on a grid
- Complexity O(N)
- Can be extended to handle periodic BC
- In ScaFaCoS: FMM (Jülich)


Local Methods: MEMD

- See the talk by Florian F.
- Purely local: should show very good parallel scaling
- Complexity O(N)


Benchmark Systems

- "Cloud-wall system" (ESPResSo test system), 300 charges
- "Silica melt", 12 960 charges


Benchmark Systems 2

- When larger systems were needed, the systems were replicated
- PEPC was removed (performed too poorly)
- Periodic systems
- Relatively homogeneous density
- Charge-neutral
- JUROPA (Linux cluster) for small to intermediate numbers of cores
- JUGENE (BlueGene/P HPC machine) for intermediate to large numbers of cores

Accuracies are given by the relative RMS potential error

\[
\varepsilon_{\mathrm{pot}} :=
\left(
\frac{\sum_{j=1}^{N} \bigl( \Phi_{\mathrm{ref}}(\vec{x}_j) - \Phi_{\mathrm{method}}(\vec{x}_j) \bigr)^2}
     {\sum_{j=1}^{N} \Phi_{\mathrm{ref}}(\vec{x}_j)^2}
\right)^{1/2}
\]
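A minimal sketch of how this error measure could be evaluated (a hypothetical helper, not part of ScaFaCoS; it assumes the reference and method potentials are already available as plain arrays, with illustrative values in main):

    #include <math.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Relative RMS potential error:
     * sqrt( sum_j (phi_ref - phi_method)^2 / sum_j phi_ref^2 ) */
    double rel_rms_potential_error(const double *phi_ref, const double *phi_method, size_t n)
    {
        double num = 0.0, den = 0.0;
        for (size_t j = 0; j < n; ++j) {
            const double diff = phi_ref[j] - phi_method[j];
            num += diff * diff;
            den += phi_ref[j] * phi_ref[j];
        }
        return sqrt(num / den);
    }

    int main(void)
    {
        /* Illustrative values only */
        const double phi_ref[]    = { 1.00, -2.00, 0.50 };
        const double phi_method[] = { 1.01, -1.99, 0.49 };
        printf("eps_pot = %g\n", rel_rms_potential_error(phi_ref, phi_method, 3));
        return 0;
    }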


Complexity

- P2NFFT, P³M and FMM are fastest
- MEMD and the multigrid methods are about 10× slower
- All algorithms show (close-to-)linear behavior
- The log N term of P2NFFT and P³M is invisible; no cross-over with FMM

[Plot: time per charge t/N vs. number of charges N for MEMD, P2NFFT, P3M, VMG, FMM, PP3MG; silica melt, ε_pot ≤ 10⁻³, P = 1 (JUROPA)]


Accuracy

- FMM and P2NFFT scale very well and can achieve very high accuracy
- P³M cannot (due to tuning)
- The multigrid methods suffer from the steep potential (or from bad tuning)
- MEMD cannot influence its accuracy to any great extent

[Plot: time per charge vs. relative RMS potential error ε_pot for MEMD, P2NFFT, P3M, VMG, FMM, PP3MG; N = 102 900 (cloud-wall), P = 1 (JUROPA)]


Scaling: Timing

- Execution time t vs. number of cores P is often used to display parallel scaling
- Shows the actual execution times
- Hides the actual scaling and the differences between the algorithms

[Plot: execution time t vs. number of cores P (2⁰ to 2¹²) for MEMD, P2NFFT, P3M, VMG, FMM, PP3MG; N = 1 012 500 (cloud-wall), ε_pot ≤ 10⁻³, JUROPA]


Scaling: Relative Parallel Efficiency

Parallel efficiency can be used to show scaling:

\[
e(P) = \frac{t_1}{P\, t_P}
\]

- e(P) ∈ [0, 1]; e(P) = 1 for optimal scaling
- Can be thought of as the "effective fraction of the CPUs used in parallel"

Relative parallel efficiency to compare algorithms:

\[
e(P) = \frac{P_{\mathrm{best}}\, t_{P_{\mathrm{best}}}}{P\, t_P}
\]
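A small sketch of this bookkeeping (the timings below are made-up placeholders, not benchmark data):

    #include <stdio.h>

    /* Relative parallel efficiency e(P) = (P_best * t_best) / (P * t_P),
     * where t_best is the fastest run over all methods, obtained on P_best cores. */
    double relative_parallel_efficiency(int p_best, double t_best, int p, double t_p)
    {
        return ((double)p_best * t_best) / ((double)p * t_p);
    }

    int main(void)
    {
        const int    p_best  = 1;
        const double t_best  = 120.0;                        /* placeholder reference timing */
        const int    cores[] = { 1, 2, 4, 8 };
        const double times[] = { 130.0, 70.0, 38.0, 22.0 };  /* placeholder timings of one method */

        for (int i = 0; i < 4; ++i)
            printf("P = %d  e(P) = %.2f\n", cores[i],
                   relative_parallel_efficiency(p_best, t_best, cores[i], times[i]));
        return 0;
    }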

[Plot: relative parallel efficiency e(P) vs. number of cores P (2⁰ to 2¹²) for MEMD, P2NFFT, P3M, VMG, FMM, PP3MG; N = 1 012 500 (cloud-wall), ε_pot ≤ 10⁻³, JUROPA]


Scaling: Comparing Methods

- Again: P2NFFT, FMM and P³M are within 2× of each other
- Scaling of P³M is better than that of P2NFFT and FMM
- Issue of P³M: tuning
- MEMD performs OK
- Scaling of the multigrid methods is very smooth and flat ... but e(P) < 10%!

[Plot: relative parallel efficiency e(P) vs. number of cores P for MEMD, P2NFFT, P3M, VMG, FMM, PP3MG; N = 1 012 500 (cloud-wall), ε_pot ≤ 10⁻³, JUROPA]


Scaling: Small Systems

[Plot: relative parallel efficiency e(P) vs. number of cores P (2⁰ to 2⁵) for MEMD, P2NFFT, P3M, VMG, FMM, PP3MG; N = 8100 (cloud-wall), ε_pot ≤ 10⁻³, JUROPA]

[Plot: relative parallel efficiency e(P) vs. number of cores P (2⁰ to 2⁹) for the same methods; N = 102 900 (cloud-wall), ε_pot ≤ 10⁻³, JUROPA]


Scaling: HPC machine

[Plot: relative parallel efficiency e(P) vs. number of cores P (2⁰ to 2¹²) for MEMD, P2NFFT, P3M, VMG, FMM, PP3MG; N = 1 012 500 (cloud-wall), ε_pot ≤ 10⁻³, JUROPA]

[Plot: relative parallel efficiency e(P) vs. number of cores P (2⁰ to 2¹²) for the same methods; N = 1 012 500 (cloud-wall), ε_pot ≤ 10⁻³, JUGENE]

- Older versions of both P³M and P2NFFT were used (JUGENE has since been decommissioned)
- All algorithms show better scaling
- JUGENE has slower cores but a better interconnect


Scaling: Large Systems

[Plot: relative parallel efficiency e(P) vs. number of cores P (2⁰ to 2¹⁴) for MEMD, P2NFFT, VMG, FMM, PP3MG; N = 9 830 400 (cloud-wall), ε_pot ≤ 10⁻³, JUGENE]

[Plot: relative parallel efficiency e(P) vs. number of cores P (2⁰ to 2²²) for P2NFFT, VMG, FMM, PP3MG; N = 1 012 500 000 (cloud-wall), ε_pot ≤ 10⁻³, JUGENE]

- Many algorithms cannot handle large systems
- For really large systems, FMM seems to be good
- FMM has done time steps with 3 trillion charges! ... whatever that's good for


ScaFaCoS: Conclusions

- Performance depends heavily on architecture, compiler and implementation ... and on tuning!
- 2× differences between algorithms are "normal"
- Within these limits, FMM, P³M and P2NFFT perform equally well
- MEMD is slightly worse (≈ 4×), but performs better for larger systems
- The multigrid methods seem to be worse (≈ 10×), apparently due to the large variation in the potential


P³M: Recent Developments

- Determined optimal P³M components, gained ≈ 4× (Florian W.)
- Improved tuning (Florian W.)
- CUDA P³M: coming to ESPResSo really soon (Florian W.)
- First interface to ScaFaCoS (with problems) (Andreas M.)
- Improved P³M code (Florian W., Olaf)
- In progress: improved code organization, a common code base for ScaFaCoS and ESPResSo (Florian W., Olaf)
- In progress: further improvements in tuning (April, Olaf)
