ScaFaCoS and P³M
Recent Developments
Olaf Lenz
June 3, 2013
Outline
ScaFaCoS
ScaFaCoS Methods
Performance Comparison
Recent P³M developments
ScaFaCoS
Scalable Fast Coulomb Solver
“Highly scalable”, MPI-parallelized library of different Coulomb solvers
Common interface for all methods
Developed by groups from Jülich, Wuppertal, Chemnitz, Bonn . . . and Stuttgart
BMBF project, officially ended in 2011
Source code has been on GitHub for two months (yay!)
First publication will be submitted “soon” (and has been “soon” for 6 months)
Interface
#include <fcs.h>

FCS handle = NULL;
/* Initialize P3M */
fcs_init(&handle, "p3m", MPI_COMM_WORLD);
/* Set common parameters */
fcs_set_common(handle, near_field, box_a, box_b, box_c,
               offset, periodicity, total_particles);
/* Set method-specific parameters */
fcs_p3m_set_r_cut(handle, r_cut);
/* Tune the method (optional) */
fcs_tune(handle, N, max_particles, positions, charges);
/* Run the method */
fcs_run(handle, N, max_particles, positions, charges,
        fields, potentials);
/* Finally destroy the handle */
fcs_destroy(handle);
Methods
ScaFaCoS currently provides 11 methods:
DIRECT, EWALD, P3M, P2NFFT, VMG, PP3MG, PEPC, FMM, MEMD, MMM1D, MMM2D
In the following comparison, only the bold methods are considered
Distinguish Splitting Methods, Hierarchical Methods and Local Methods (i.e. MEMD)
Other methods are included for reference purposes only (DIRECT, EWALD), target different periodicities (MMM*D; here, only fully periodic systems are considered), or performed too poorly (PEPC)
Splitting Methods
Problems of the electrostatic potential:
Slow decay – bad for direct summation
Singularity – bad for convergence-accelerating methods
[Figure: the Coulomb potential split into a near-field and a far-field contribution]
Idea of splitting methods: split the potential into a fast-decaying near field and a non-singular far field
The near field can be computed directly (O(N))
For the far field, other methods can be used
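One standard choice for this splitting (the Gaussian/Ewald splitting used by the Ewald-type solvers; shown here only as an illustration) is
\[
  \frac{1}{r} \;=\; \underbrace{\frac{\operatorname{erfc}(\alpha r)}{r}}_{\text{near field: decays fast}}
  \;+\; \underbrace{\frac{\operatorname{erf}(\alpha r)}{r}}_{\text{far field: non-singular}} ,
\]
where the splitting parameter \(\alpha\) controls how the work is distributed between the two parts.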
Splitting Methods: Ewald and Particle-Mesh Ewald
Ewald’s idea:
Compute the far field in Fourier space
Ewald summation: O(N^{3/2})
Particle-Mesh Ewald: O(N log N)
discretize the far-field charge distribution onto a mesh
use an FFT to Fourier-transform it
solve the Poisson equation in Fourier space
back-FFT to obtain the potential on the mesh
compute potentials or fields by interpolating the mesh potential
In ScaFaCoS: P³M (ICP), P2NFFT (Chemnitz; uses a non-equidistant FFT)
[Photo: P. P. Ewald (1888–1985)]
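For reference, the far-field step amounts to solving the Poisson equation algebraically in Fourier space; in Gaussian units the continuum solution reads
\[
  \hat{\Phi}(\vec{k}) \;=\; \frac{4\pi}{k^2}\, e^{-k^2/(4\alpha^2)}\, \hat{\rho}(\vec{k}),
  \qquad \vec{k} \neq \vec{0},
\]
where \(\alpha\) is the splitting parameter and \(\hat{\rho}\) the Fourier-transformed charge density. Mesh-based methods such as P³M replace this continuum Green's function by an optimized influence function to reduce discretization errors.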
Splitting Methods: Multigrid
Solve Poisson equation in far field with multigrid PDE solver
use different levels of successively coarser meshes
solve the Poisson equation on these meshes by recursively improving the solution from the coarser mesh
Complexity: O(N)
Can be extended to handle periodic BC
In ScaFaCoS: PP3MG (Wuppertal), VMG (Bonn)
[Figure: multigrid V-cycle over mesh levels l = 4, 3, 2, 1, with restriction, prolongation, and smoothing/solving steps]
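To make the recursive structure concrete, here is a minimal, self-contained V-cycle sketch; all operations are hypothetical stand-ins that only trace the cycle and are not part of the PP3MG or VMG interfaces:

#include <stdio.h>

/* Hypothetical placeholder operations; a real solver works on mesh data,
   here they only print which step of the V-cycle runs. */
static void smooth(int l)                { printf("smooth on level %d\n", l); }
static void restrict_residual(int l)     { printf("restrict %d -> %d\n", l, l - 1); }
static void prolongate_correction(int l) { printf("prolongate %d -> %d\n", l - 1, l); }
static void solve_coarsest(void)         { printf("direct solve on coarsest level\n"); }

/* Recursive V-cycle: pre-smooth, restrict, recurse, prolongate, post-smooth. */
static void v_cycle(int level)
{
    if (level == 0) { solve_coarsest(); return; }
    smooth(level);
    restrict_residual(level);
    v_cycle(level - 1);
    prolongate_correction(level);
    smooth(level);
}

int main(void)
{
    v_cycle(3);   /* start on the finest of a few mesh levels */
    return 0;
}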
Hierarchical Methods: Barnes-Hut Tree Code
Multipole expand successively larger clusters of particles
Compute interactions with far-away clusters instead of with single particles
Complexity: O(N log N)
Can be extended to handle periodic BC
In ScaFaCoS: PEPC (Jülich)
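A commonly used acceptance criterion (shown here only as an illustration; PEPC's exact criterion may differ): a cluster of extent \(s\) at distance \(d\) from the target particle is treated as a single multipole if
\[
  \frac{s}{d} < \theta ,
\]
where \(\theta\) is a fixed opening angle; smaller \(\theta\) means higher accuracy but more work.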
Hierarchical Methods: Fast Multipole Method
Expand Barnes-Hut: let clusters interact with each other
Put everything on a grid
Complexity: O(N)
Can be extended to handle periodic BC
In ScaFaCoS: FMM (Jülich)
Local Methods: MEMD
See the talk by Florian F.
Purely local: should show very nice parallel scaling
Complexity: O(N)
Benchmark Systems
“Cloud-wall system” (ESPResSo test system): 300 charges
“Silica melt”: 12 960 charges
Benchmark Systems 2
When larger systems were needed, the systems were replicated
PEPC was removed (performed too poorly)
Periodic systems
Relatively homogeneous density
Charge-neutral
JUROPA (Linux cluster) for small to intermediate numbers of cores
JUGENE (BlueGene/P HPC machine) for intermediate to large numbers of cores
Accuracies are given by the relative RMS potential error
\[
  \varepsilon_\mathrm{pot} :=
  \left(
    \frac{\sum_{j=1}^{N} \bigl( \Phi_\mathrm{ref}(\vec{x}_j) - \Phi_\mathrm{method}(\vec{x}_j) \bigr)^2}
         {\sum_{j=1}^{N} \Phi_\mathrm{ref}(\vec{x}_j)^2}
  \right)^{1/2}
\]
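For concreteness, this error measure could be evaluated with a small helper like the following (a hypothetical illustration, not part of the ScaFaCoS API; phi_ref and phi_method are assumed arrays of reference and method potentials):

#include <math.h>

/* Relative RMS potential error: sqrt( sum (ref - method)^2 / sum ref^2 ).
   Hypothetical helper for illustration only. */
double rms_potential_error(const double *phi_ref, const double *phi_method, int n)
{
    double num = 0.0, den = 0.0;
    for (int j = 0; j < n; j++) {
        const double diff = phi_ref[j] - phi_method[j];
        num += diff * diff;
        den += phi_ref[j] * phi_ref[j];
    }
    return sqrt(num / den);
}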
Complexity
P2NFFT, P³M, and FMM are the fastest
MEMD and the multigrid methods are ≈ 10× slower
All algorithms show (close-to-)linear behavior
The log N term of P2NFFT and P³M is invisible: no cross-over with FMM
[Plot: time per charge t/#charges vs. number of charges for MEMD, P2NFFT, P³M, VMG, FMM, PP3MG; silica melt, ε_pot ≤ 10⁻³, P = 1 (JUROPA)]
Accuracy
FMM and P2NFFT scale very well with accuracy and can achieve very high accuracy
P³M does not (due to tuning)
The multigrid methods suffer from the steep potential (or from bad tuning)
MEMD cannot influence its accuracy to any great extent
[Plot: time per charge t/#charges vs. relative RMS potential error ε_pot for MEMD, P2NFFT, P³M, VMG, FMM, PP3MG; N = 102 900 (cloud-wall), P = 1 (JUROPA)]
Scaling: Timing
Execution time t vs. number of cores P is often used to display parallel scaling
shows actual execution times
hides the actual scaling and the differences between the algorithms
[Plot: execution time t vs. number of cores P for MEMD, P2NFFT, P³M, VMG, FMM, PP3MG; N = 1 012 500 (cloud-wall), ε_pot ≤ 10⁻³, JUROPA]
Scaling: Relative Parallel Efficiency
Parallel efficiency can be used to show scaling:
\[
  e(P) = \frac{t_1}{t_P \, P}
\]
e(P) ∈ [0, 1]; e(P) = 1 for optimal scaling
Can be thought of as the “effective fraction of the CPUs used in parallel”
Relative parallel efficiency to compare algorithms:
\[
  e(P) = \frac{t_{P_\mathrm{best}} \, P_\mathrm{best}}{t_P \, P}
\]
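As a small illustration (hypothetical helper, not ScaFaCoS code), the relative parallel efficiency of a timing series could be computed as follows, where times[i] is the measured run time on cores[i] cores:

/* Relative parallel efficiency e(P) = (t_best * P_best) / (t_P * P),
   where "best" is the run with the smallest product of time and cores.
   Hypothetical helper for illustration only. */
void relative_efficiency(const double *times, const int *cores, int n, double *e)
{
    double best = times[0] * cores[0];
    for (int i = 1; i < n; i++) {
        const double work = times[i] * cores[i];
        if (work < best)
            best = work;
    }
    for (int i = 0; i < n; i++)
        e[i] = best / (times[i] * cores[i]);
}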
[Plot: relative parallel efficiency e(P) vs. number of cores P for MEMD, P2NFFT, P³M, VMG, FMM, PP3MG; N = 1 012 500 (cloud-wall), ε_pot ≤ 10⁻³, JUROPA]
Scaling: Comparing Methods
Again: P2NFFT, FMM, and P³M are within 2× of each other
Scaling of P³M is better than that of P2NFFT and FMM
Issue of P³M: tuning
MEMD performs OK
Scaling of the multigrid methods is very smooth and flat . . . but e(P) < 10%!
[Plot: relative parallel efficiency e(P) vs. number of cores P for MEMD, P2NFFT, P³M, VMG, FMM, PP3MG; N = 1 012 500 (cloud-wall), ε_pot ≤ 10⁻³, JUROPA]
Scaling: Small Systems
[Plot: relative parallel efficiency e(P) vs. number of cores P for MEMD, P2NFFT, P³M, VMG, FMM, PP3MG; N = 8100 (cloud-wall), ε_pot ≤ 10⁻³, JUROPA]
[Plot: relative parallel efficiency e(P) vs. number of cores P for MEMD, P2NFFT, P³M, VMG, FMM, PP3MG; N = 102 900 (cloud-wall), ε_pot ≤ 10⁻³, JUROPA]
Scaling: HPC machine
[Plot: relative parallel efficiency e(P) vs. number of cores P for MEMD, P2NFFT, P³M, VMG, FMM, PP3MG; N = 1 012 500 (cloud-wall), ε_pot ≤ 10⁻³, JUROPA]
[Plot: relative parallel efficiency e(P) vs. number of cores P for MEMD, P2NFFT, P³M, VMG, FMM, PP3MG; N = 1 012 500 (cloud-wall), ε_pot ≤ 10⁻³, JUGENE]
Older versions of both P³M and P2NFFT were used (JUGENE has since been decommissioned)
All algorithms show better scaling
JUGENE has slower cores but a better interconnect
Scaling: Large Systems
[Plot: relative parallel efficiency e(P) vs. number of cores P for MEMD, P2NFFT, VMG, FMM, PP3MG; N = 9 830 400 (cloud-wall), ε_pot ≤ 10⁻³, JUGENE]
[Plot: relative parallel efficiency e(P) vs. number of cores P (up to 2²²) for P2NFFT, VMG, FMM, PP3MG]