ScaFaCoS and P³M
Recent Developments
Olaf Lenz
June 3, 2013
Outline
ScaFaCoS
ScaFaCoS Methods
Performance Comparison
Recent P³M developments
ScaFaCoS
Scalable Fast Coulomb Solver
“Highly scalable”, MPI-parallelized library of different Coulomb solvers
Common interface for all methods
Developed by groups from Jülich, Wuppertal, Chemnitz, Bonn . . . and Stuttgart
BMBF project, officially ended in 2011
Source code has been on GitHub for two months (yay!)
First publication will be submitted “soon” (and has been “soon” for 6 months)
Interface
#include <fcs.h>

FCS handle = NULL;
/* Initialize P3M */
fcs_init(&handle, "p3m", MPI_COMM_WORLD);
/* Set common parameters */
fcs_set_common(handle, near_field, box_a, box_b, box_c,
               offset, periodicity, total_particles);
/* Set method-specific parameters */
fcs_p3m_set_r_cut(handle, r_cut);
/* Tune the method (optional) */
fcs_tune(handle, N, max_particles, positions, charges);
/* Run the method */
fcs_run(handle, N, max_particles, positions, charges,
        fields, potentials);
/* Finally destroy the handle */
fcs_destroy(handle);
Methods
ScaFaCoS currently provides 11 methods:
DIRECT, EWALD, P3M, P2NFFT, VMG, PP3MG, PEPC, FMM, MEMD, MMM1D, MMM2D
In the following comparison, only the bold methods are considered
Distinguish Splitting Methods, Hierarchical Methods and Local Methods (i.e. MEMD)
Other methods are included for reference purposes only (DIRECT, EWALD), target different periodicities (MMM*D; here, only fully periodic systems are considered), or performed too poorly (PEPC)
Splitting Methods
Problems of the electrostatic potential:
Slow decay – bad for direct summation
Singularity – bad for convergence-accelerating methods
[Figure: the Coulomb potential split into a near-field and a far-field contribution]
Idea of splitting methods: split the potential into a fast-decaying near field and a non-singular far field
The near field can be computed directly (O(N))
For the far field, other methods can be used
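One standard choice for this splitting (the Gaussian/Ewald splitting used by the Ewald-type solvers; shown here only as an illustration) is
\[
  \frac{1}{r} \;=\; \underbrace{\frac{\operatorname{erfc}(\alpha r)}{r}}_{\text{near field: decays fast}}
  \;+\; \underbrace{\frac{\operatorname{erf}(\alpha r)}{r}}_{\text{far field: non-singular}} ,
\]
where the splitting parameter \(\alpha\) controls how the work is distributed between the two parts.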
Splitting Methods: Ewald and Particle-Mesh Ewald
Ewald’s idea:
Compute the far field in Fourier space
Ewald summation: O(N^{3/2})
Particle-Mesh Ewald: O(N log N)
discretize the far-field charge distribution onto a mesh
use an FFT to Fourier-transform it
solve the Poisson equation in Fourier space
back-FFT to obtain the potential on the mesh
compute potentials or fields by interpolating the mesh potential
In ScaFaCoS: P³M (ICP), P2NFFT (Chemnitz; uses a non-equidistant FFT)
[Photo: P. P. Ewald (1888–1985)]
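For reference, the far-field step amounts to solving the Poisson equation algebraically in Fourier space; in Gaussian units the continuum solution reads
\[
  \hat{\Phi}(\vec{k}) \;=\; \frac{4\pi}{k^2}\, e^{-k^2/(4\alpha^2)}\, \hat{\rho}(\vec{k}),
  \qquad \vec{k} \neq \vec{0},
\]
where \(\alpha\) is the splitting parameter and \(\hat{\rho}\) the Fourier-transformed charge density. Mesh-based methods such as P³M replace this continuum Green's function by an optimized influence function to reduce discretization errors.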
Splitting Methods: Multigrid
Solve Poisson equation in far field with multigrid PDE solver
use different levels of successively coarser meshes
solve the Poisson equation on these meshes by recursively improving the solution from the coarser mesh
Complexity: O(N)
Can be extended to handle periodic BC
In ScaFaCoS: PP3MG (Wuppertal), VMG (Bonn)
[Figure: multigrid V-cycle over mesh levels l = 4, 3, 2, 1, with restriction, prolongation, and smoothing/solving steps]
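To make the recursive structure concrete, here is a minimal, self-contained V-cycle sketch; all operations are hypothetical stand-ins that only trace the cycle and are not part of the PP3MG or VMG interfaces:

#include <stdio.h>

/* Hypothetical placeholder operations; a real solver works on mesh data,
   here they only print which step of the V-cycle runs. */
static void smooth(int l)                { printf("smooth on level %d\n", l); }
static void restrict_residual(int l)     { printf("restrict %d -> %d\n", l, l - 1); }
static void prolongate_correction(int l) { printf("prolongate %d -> %d\n", l - 1, l); }
static void solve_coarsest(void)         { printf("direct solve on coarsest level\n"); }

/* Recursive V-cycle: pre-smooth, restrict, recurse, prolongate, post-smooth. */
static void v_cycle(int level)
{
    if (level == 0) { solve_coarsest(); return; }
    smooth(level);
    restrict_residual(level);
    v_cycle(level - 1);
    prolongate_correction(level);
    smooth(level);
}

int main(void)
{
    v_cycle(3);   /* start on the finest of a few mesh levels */
    return 0;
}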
Hierarchical Methods: Barnes-Hut Tree Code
Multipole expand successively larger clusters of particles
Compute interactions with far-away clusters instead of with single particles
Complexity: O(N log N)
Can be extended to handle periodic BC
In ScaFaCoS: PEPC (Jülich)
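A commonly used acceptance criterion (shown here only as an illustration; PEPC's exact criterion may differ): a cluster of extent \(s\) at distance \(d\) from the target particle is treated as a single multipole if
\[
  \frac{s}{d} < \theta ,
\]
where \(\theta\) is a fixed opening angle; smaller \(\theta\) means higher accuracy but more work.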
Hierarchical Methods: Fast Multipole Method
Expand Barnes-Hut: let clusters interact with each other
Put everything on a grid
Complexity: O(N)
Can be extended to handle periodic BC
In ScaFaCoS: FMM (Jülich)
Local Methods: MEMD
See the talk by Florian F.
Purely local: should show very nice parallel scaling
Complexity: O(N)
Benchmark Systems
“Cloud-wall system” (ESPResSo test system): 300 charges
“Silica melt”: 12 960 charges
Benchmark Systems 2
When larger systems were needed, the systems were replicated
PEPC was removed (performed too poorly)
Periodic systems
Relatively homogeneous density
Charge-neutral
JUROPA (Linux cluster) for small to intermediate numbers of cores
JUGENE (BlueGene/P HPC machine) for intermediate to large numbers of cores
Accuracies are given by the relative RMS potential error
\[
  \varepsilon_\mathrm{pot} :=
  \left(
    \frac{\sum_{j=1}^{N} \bigl( \Phi_\mathrm{ref}(\vec{x}_j) - \Phi_\mathrm{method}(\vec{x}_j) \bigr)^2}
         {\sum_{j=1}^{N} \Phi_\mathrm{ref}(\vec{x}_j)^2}
  \right)^{1/2}
\]
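For concreteness, this error measure could be evaluated with a small helper like the following (a hypothetical illustration, not part of the ScaFaCoS API; phi_ref and phi_method are assumed arrays of reference and method potentials):

#include <math.h>

/* Relative RMS potential error: sqrt( sum (ref - method)^2 / sum ref^2 ).
   Hypothetical helper for illustration only. */
double rms_potential_error(const double *phi_ref, const double *phi_method, int n)
{
    double num = 0.0, den = 0.0;
    for (int j = 0; j < n; j++) {
        const double diff = phi_ref[j] - phi_method[j];
        num += diff * diff;
        den += phi_ref[j] * phi_ref[j];
    }
    return sqrt(num / den);
}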
Complexity
P2NFFT, P³M, and FMM are the fastest
MEMD and the multigrid methods are ≈ 10× slower
All algorithms show (close-to-)linear behavior
The log N term of P2NFFT and P³M is invisible: no cross-over with FMM
[Plot: time per charge t/#charges vs. number of charges for MEMD, P2NFFT, P³M, VMG, FMM, PP3MG; silica melt, ε_pot ≤ 10⁻³, P = 1 (JUROPA)]
Accuracy
FMM and P2NFFT scale very well with accuracy and can achieve very high accuracy
P³M does not (due to tuning)
The multigrid methods suffer from the steep potential (or from bad tuning)
MEMD cannot influence its accuracy to any great extent
[Plot: time per charge t/#charges vs. relative RMS potential error ε_pot for MEMD, P2NFFT, P³M, VMG, FMM, PP3MG; N = 102 900 (cloud-wall), P = 1 (JUROPA)]
Scaling: Timing
Execution time t vs. number of cores P is often used to display parallel scaling
shows actual execution times
hides the actual scaling and the differences between the algorithms
[Plot: execution time t vs. number of cores P for MEMD, P2NFFT, P³M, VMG, FMM, PP3MG; N = 1 012 500 (cloud-wall), ε_pot ≤ 10⁻³, JUROPA]
Scaling: Relative Parallel Efficiency
Parallel efficiency can be used to show scaling:
\[
  e(P) = \frac{t_1}{t_P \, P}
\]
e(P) ∈ [0, 1]; e(P) = 1 for optimal scaling
Can be thought of as the “effective fraction of the CPUs used in parallel”
Relative parallel efficiency to compare algorithms:
\[
  e(P) = \frac{t_{P_\mathrm{best}} \, P_\mathrm{best}}{t_P \, P}
\]
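As a small illustration (hypothetical helper, not ScaFaCoS code), the relative parallel efficiency of a timing series could be computed as follows, where times[i] is the measured run time on cores[i] cores:

/* Relative parallel efficiency e(P) = (t_best * P_best) / (t_P * P),
   where "best" is the run with the smallest product of time and cores.
   Hypothetical helper for illustration only. */
void relative_efficiency(const double *times, const int *cores, int n, double *e)
{
    double best = times[0] * cores[0];
    for (int i = 1; i < n; i++) {
        const double work = times[i] * cores[i];
        if (work < best)
            best = work;
    }
    for (int i = 0; i < n; i++)
        e[i] = best / (times[i] * cores[i]);
}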
[Plot: relative parallel efficiency e(P) vs. number of cores P for MEMD, P2NFFT, P³M, VMG, FMM, PP3MG; N = 1 012 500 (cloud-wall), ε_pot ≤ 10⁻³, JUROPA]
Scaling: Comparing Methods
Again: P2NFFT, FMM, and P³M are within 2× of each other
Scaling of P³M is better than that of P2NFFT and FMM
Issue of P³M: tuning
MEMD performs OK
Scaling of the multigrid methods is very smooth and flat . . . but e(P) < 10%!
[Plot: relative parallel efficiency e(P) vs. number of cores P for MEMD, P2NFFT, P³M, VMG, FMM, PP3MG; N = 1 012 500 (cloud-wall), ε_pot ≤ 10⁻³, JUROPA]
Scaling: Small Systems
[Plot: relative parallel efficiency e(P) vs. number of cores P for MEMD, P2NFFT, P³M, VMG, FMM, PP3MG; N = 8100 (cloud-wall), ε_pot ≤ 10⁻³, JUROPA]
[Plot: relative parallel efficiency e(P) vs. number of cores P for MEMD, P2NFFT, P³M, VMG, FMM, PP3MG; N = 102 900 (cloud-wall), ε_pot ≤ 10⁻³, JUROPA]
Scaling: HPC machine
[Plot: relative parallel efficiency e(P) vs. number of cores P for MEMD, P2NFFT, P³M, VMG, FMM, PP3MG; N = 1 012 500 (cloud-wall), ε_pot ≤ 10⁻³, JUROPA]
[Plot: relative parallel efficiency e(P) vs. number of cores P for MEMD, P2NFFT, P³M, VMG, FMM, PP3MG; N = 1 012 500 (cloud-wall), ε_pot ≤ 10⁻³, JUGENE]
Older versions of both P³M and P2NFFT were used (JUGENE has since been decommissioned)
All algorithms show better scaling
JUGENE has slower cores but a better interconnect
Scaling: Large Systems
[Plot: relative parallel efficiency e(P) vs. number of cores P for MEMD, P2NFFT, VMG, FMM, PP3MG; N = 9 830 400 (cloud-wall), ε_pot ≤ 10⁻³, JUGENE]
[Plot: relative parallel efficiency e(P) vs. number of cores P (up to 2²²) for P2NFFT, VMG, FMM, PP3MG]