Object-Oriented Programming for Scientiﬁc Computing

(1)

Object-Oriented Programming for Scientific Computing

C++11 Multithreading

Ole Klein

Interdisciplinary Center for Scientific Computing Heidelberg University

ole.klein@iwr.uni-heidelberg.de

7. Juli 2015

(2)

Random numbers

C++11: Random Numbers

C++11 provides powerful random number generators that are able to work with different distributions, e.g. uniformly distributed random numbers:

# include< i o s t r e a m >

# include< iomanip >

# include< string >

# include< map >

# include< random >

# include< cmath >

int m a i n () {

// I m p l e m e n t a t i o n - d e p e n d e n t p r e d e f i n e d RNG std :: d e f a u l t _ r a n d o m _ e n g i n e e1 ;

// P r o d u c e u n i f o r m l y d i s t r i b u t e d i n t e g e r s b e t w e e n 1 and 6 std :: u n i f o r m _ i n t _ d i s t r i b u t i o n <int> u n i f o r m D i s t (1 ,6) ; // D r a w a n u m b e r

int m e a n = u n i f o r m D i s t ( e1 ) ;

std :: c o u t < < " R a n d o m l y g e n e r a t e d m e a n : " < < m e a n < < std :: e n d l ;

(3)

Random numbers

Normally Distributed Random Numbers

// G e n e r a t o r for non - d e t e r m i n i s t i c u n i f o r m l y d i s t r i b u t e d r a n d o m n u m b e r s

std :: r a n d o m _ d e v i c e rd ;

// A l t e r n a t i v e RNG b a s e d on M e r s e n n e T w i s t e r a l g o r i t h m std :: m t 1 9 9 3 7 e2 ( rd () ) ;

// C r e a t e n o r m a l d i s t r i b u t i o n a r o u n d m e a n std :: n o r m a l _ d i s t r i b u t i o n < > n o r m a l D i s t ( mean ,2) ; std :: map <int,int> h i s t ;

// C r e a t e r a n d o m n u m b e r s and c o u n t for (int n =0; n < 1 0 0 0 0 ; ++ n )

++ h i s t [ std :: r o u n d ( n o r m a l D i s t ( e2 ) ) ];

// P r i n t h i s t o g r a m to s c r e e n

std :: c o u t < < " N o r m a l d i s t r i b u t i o n a r o u n d m e a n " < < m e a n < < " : "

< < std :: e n d l ; for (a u t o p : h i s t )

std :: c o u t < < std :: f i x e d < < std :: s e t p r e c i s i o n (1) < < std :: s e t w (2)

< < p . f i r s t < < " " < < std :: s t r i n g ( p . s e c o n d /200 ,’ * ’)

< < std :: e n d l ; }

(4)

Random numbers

Own RNG for Uniformly Distributed Integers

c l a s s R a n d o m I n t {

p u b l i c:

R a n d o m I n t (int low , int high , u n s i g n e d int s e e d =0) :

r a n d _ { std :: b i n d ( std :: u n i f o r m _ i n t _ d i s t r i b u t i o n < >{ low , h i g h } , std :: d e f a u l t _ r a n d o m _ e n g i n e { s e e d }) }

{}

int o p e r a t o r() () c o n s t {

r e t u r n r a n d _ () ; }

p r i v a t e:

std :: f u n c t i o n <int() > r a n d _ ; };

(5)

Random numbers

Own RNG for Normally Distributed Doubles

This also works for other number types:

c l a s s R a n d o m D o u b l e {

p u b l i c:

R a n d o m D o u b l e (d o u b l e mean , d o u b l e stddev , u n s i g n e d int s e e d =0) : r a n d _ { std :: b i n d ( std :: n o r m a l _ d i s t r i b u t i o n < >{ mean , s t d d e v } ,

std :: m t 1 9 9 3 7 { s e e d }) } {}

d o u b l e o p e r a t o r() () c o n s t {

r e t u r n r a n d _ () ; }

p r i v a t e:

std :: f u n c t i o n <d o u b l e() > r a n d _ ; };

(6)

Random numbers

Application

int m a i n () {

// G e n e r a t o r for non - d e t e r m i n i s t i c u n i f o r m l y d i s t r i b u t e d r a n d o m n u m b e r s

std :: r a n d o m _ d e v i c e rd ; R a n d o m I n t r a n d I n t {1 ,6 , rd () };

int m e a n = r a n d I n t () ;

std :: c o u t < < " R a n d o m g e n e r a t e d m e a n : " < < m e a n < < std :: e n d l ; R a n d o m D o u b l e r a n d D o u b l e {(d o u b l e) mean ,2. , rd () };

std :: map <int,int> h i s t ;

++ h i s t [ std :: r o u n d ( r a n d D o u b l e () ) ];

< < std :: e n d l ;

(7)

Random numbers

Application

++ h i s t [ std :: r o u n d ( r a n d D o u b l e () ) ];

< < std :: e n d l ; }

(8)

Threads Introduction

Processes and threads in Unix

• The following information is associated with a process in UNIX:

•

IDs (process, user, group)

•

environment variables

•

code

•

registers, stack, heap

•

file descriptors, signals

•

message queues, pipes, shared-memory segments

•

shared libraries

• Each process has its own address space.

• Threads (small-scale processes) exist within a process and share the address space of the main process.

• A thread consists of:

•

ID

•

stack pointer

•

registers

•

scheduling properties

•

signals

• Switching between threads is faster

than switching between processes.

(9)

C++11 Threads

• POSIX threads (or pthreads) were introduced in 1995 and have been the standard model for threads on UNIX computers for a long time. They are created and managed via a C interface.

• C++11 defines its own threading concept, which basically is a wrapper for POSIX threads. In many cases this makes it easier to write multi-threading programs. The following concepts are supported:

Threads Thread classes and functions for the rejoining of threads are defined in the header

threads

. This header also contains functions that provide a thread ID and the functionality to detach threads that afterwards run as a separate program (fork).

Mutual Exclusion The header

mutex

provides classes for mutexes and locks

(also recursive variants and ones with timeouts), and

additionally functions that check if a lock is available.

(10)

C++11 Threads

Condition Variables The header

condition_variable

defines a class that allows the coordination of multiple threads with condition variables.

Futures The header

futures

defines classes and functions whose operations need not be executed immediately, but can be performed

asynchronously with the main part of the program. If possible, this is done in parallel, and if not, sequentially at the location where the result is needed.

Atomic Operations The classes from the header

atomic

allow certain operations

with integers, Boolean variables and pointers to be performed as an

atomic operation.

(11)

Threads C++11 Thread Production

C++11 Thread Production

# i n c l u d e < array >

# i n c l u d e < i o s t r e a m >

# i n c l u d e < thread >

v o i d H e l l o ( s i z e _ t n u m b e r ) {

std :: c o u t < < " H e l l o f r o m t h r e a d " < < n u m b e r < < std :: e n d l ; }

int m a i n () {

c o n s t int n u m T h r e a d s = 5;

std :: array < std :: thread , n u m T h r e a d s > t h r e a d s ; // S t a r t t h r e a d s

for ( s i z e _ t i = 0; i < t h r e a d s . s i z e () ; ++ i ) t h r e a d s [ i ] = std :: t h r e a d ( Hello , i ) ; std :: c o u t < < " H e l l o f r o m m a i n \ n ";

// R e j o i n threads , i m p l i c i t b a r r i e r

for ( s i z e _ t i = 0; i < t h r e a d s . s i z e () ; ++ i ) t h r e a d s [ i ]. j o i n () ;

r e t u r n 0;

}

(12)

Threads C++11 Thread Production

Output

H e l l o f r o m t h r e a d H e l l o f r o m t h r e a d H e l l o f r o m t h r e a d H e l l o f r o m t h r e a d H e l l o f r o m m a i n

0 H e l l o f r o m t h r e a d 213 4

(13)

Threads Example: Calculation of the Vector Norm

Calculation of the Vector Norm

# include< array >

# include< vector >

# include< thread >

v o i d N o r m (c o n s t std :: vector <double>& x , d o u b l e& norm , c o n s t s i z e _ t i , c o n s t s i z e _ t p )

{

s i z e _ t n u m E l e m = x . s i z e () / p ;

s i z e _ t f i r s t = n u m E l e m * i + std :: min ( i , x . s i z e () % p ) ; n u m E l e m += ( i <( x . s i z e () % p ) ? 1 : 0 ) ;

d o u b l e l o c a l N o r m = 0 . 0 ;

for ( s i z e _ t j =0; j < n u m E l e m ; ++ j )

l o c a l N o r m += x [ f i r s t + j ] * x [ f i r s t + j ];

n o r m += l o c a l N o r m ; // d a n g e r o u s ...

}

(14)

Calculation of the Vector Norm

int m a i n () {

c o n s t s i z e _ t n u m T h r e a d s = 5;

c o n s t s i z e _ t n u m V a l u e s = 1 0 0 0 0 0 0 ;

std :: array < std :: thread , n u m T h r e a d s > t h r e a d s ; std :: vector <double> x ( n u m V a l u e s , 2 . 0 ) ; d o u b l e n o r m = 0 . 0 ;

// C r e a t e t h r e a d s

for ( s i z e _ t i =0; i < t h r e a d s . s i z e () ; ++ i )

t h r e a d s [ i ] = std :: t h r e a d ( Norm , std :: c r e f ( x ) , std :: ref ( n o r m ) ,i , n u m T h r e a d s ) ; // R e j o i n t h r e a d s w i t h m a i n thread , b a r r i e r

for ( s i z e _ t i =0; i < t h r e a d s . s i z e () ; ++ i ) t h r e a d s [ i ]. j o i n () ;

std :: c o u t < < " N o r m is : " < < n o r m < < std :: e n d l ; r e t u r n 0;

}

(15)

Critical Sections

• A statement of the type s=s+t is not atomic, which means it will not be carried out in one step:

process Π

1

: process Π

2

:

1 load s in R1 3 load s in R1

load t in R2 load t in R2

add R1, R2, store in R3 add R1, R2, store in R3 2 store R3 in s 4 store R3 in s

• The order of execution of these instructions relative to each other is not fixed.

• This results in an exponentially growing sequence of possible execution orders.

(16)

Possible Execution Orders

@

@ @ 1

3 @

@ @

@

@ @ 2

3 1

4 @

@ @ 3 2 4 2 4 1

4 4 2 4 2 2

Result of the calculation s = t

Π₁

+ t

Π₂

s = t

Π₂

s = t

Π₁

s = t

Π₂

s = t

Π₁

s = t

_Π₁

+ t

_Π₂

Only some of the sequences lead to the right result!

(17)

Threads Mutual Exclusion/Locks

Mutual Exclusion

• An additional synchronization is necessary to prevent execution orders that do not provide the correct result.

• Critical sections must be executed under mutual exclusion.

• Mutual exclusion requires:

•

At most one process enters the critical section at a time.

•

There are no deadlocks (i.e. two processes that are waiting on each other).

•

No process waits at a free critical section.

•

When a process enters a critical section, it will leave it successfully in the end.

(18)

Mutual Exclusion/Locks

# include< vector >

# include< thread >

# include< mutex >

std :: m u t e x m ; v o i d e (int& sh ) {

m . l o c k () ;

sh += 1; // m o d i f y s h a r e d d a t a m . u n l o c k () ; // v e r y i m p o r t a n t }

• C++11 contains the construct

mutex

for this.

• A

mutex

can lock a section, so that it can only be accessed by one process.

• All other processes have to wait in front of the critical section. It should therefore be as small as possible.

• The same mutex must be available in all of the threads.

(19)

Lock Guard

v o i d f (int& sh ) {

std :: l o c k _ g u a r d < std :: mutex > l o c k e d { m };

sh += 1; // m o d i f y s h a r e d d a t a }

• There are helper classes for releasing a

lock

again (even if an exception is thrown).

• The simplest of these is a

lock_guard

.

• At initialization it gets a

mutex

and locks it, and it releases the lock when the destructor of

lock_guard

is called.

• A

lock_guard

should therefore be defined either at the end of a function or in

a separate block to call the destructor as early as possible.

(20)

Unique Lock

A

unique_lock

has more functionality than a

lock_guard

. It has functions that lock or release a

mutex

and can test whether a critical section can be entered, so that otherwise something else can be done. A

unique_lock

can also take over an already locked

mutex

or defer locking until it is needed.

v o i d g (int& sh ) {

std :: u n i q u e _ l o c k < std :: mutex > l o c k e d { m , std :: d e f e r _ l o c k }; // a d m i n i s t r a t e mutex , but don ’ t b l o c k it

b o o l s u c c e s s f u l = f a l s e; w h i l e (! s u c c e s s f u l ) {

if ( l o c k e d . t r y _ l o c k () ) // l o c k m u t e x if a b l e {

sh += 1; // m o d i f y s h a r e d d a t a s u c c e s s f u l = t r u e;

l o c k e d . u n l o c k () ;

// a c t u a l l y not n e c e s s a r y in t h i s e x a m p l e }

e l s e {

// do s o m e t h i n g d i f f e r e n t ...

} }

(21)

Main Program

int m a i n () {

c o n s t s i z e _ t n u m T h r e a d s = std :: t h r e a d :: h a r d w a r e _ c o n c u r r e n c y () ; std :: vector < std :: thread > t h r e a d s { n u m T h r e a d s };

int r e s u l t = 0;

// s t a r t t h r e a d s

t h r e a d s [ i ] = std :: t h r e a d { e , std :: ref ( r e s u l t ) };

// R e j o i n threads , i m p l i c i t b a r r i e r for ( s i z e _ t i =0; i < t h r e a d s . s i z e () ; ++ i )

t h r e a d s [ i ]. j o i n () ;

std :: c o u t < < " Y o u r h a r d w a r e s u p p o r t s " < < r e s u l t < < " t h r e a d s " < <

std :: e n d l ; r e t u r n 0;

}

(22)

Threads Calculation of the Vector Norm with a Mutex

Calculation of the Vector Norm with a Mutex

# include< array >

# include< vector >

# include< thread >

# include< mutex >

# include< cmath >

s t a t i c std :: m u t e x n L o c k ;

v o i d N o r m (c o n s t std :: vector <double>& x , d o u b l e& norm , c o n s t s i z e _ t i , c o n s t s i z e _ t p )

{

d o u b l e l o c a l N o r m = 0 . 0 ; for ( s i z e _ t j =0; j < n u m E l e m ;++ j )

l o c a l N o r m += x [ f i r s t + j ]* x [ f i r s t + j ];

std :: l o c k _ g u a r d < std :: mutex > b l o c k _ t h r e a d s _ u n t i l _ f i n i s h _ t h i s _ j o b ( n L o c k ) ; n o r m += l o c a l N o r m ;

}

(23)

Threads Calculation of the Vector Norm with a Mutex

Calculation of the Vector Norm with a Mutex

int m a i n () {

c o n s t s i z e _ t n u m T h r e a d s = 5;

c o n s t s i z e _ t n u m V a l u e s = 1 0 0 0 0 0 0 0 ;

std :: array < std :: thread , n u m T h r e a d s > t h r e a d s ; std :: vector <double> x ( n u m V a l u e s , 2 . 0 ) ; d o u b l e n o r m = 0 . 0 ;

// S t a r t t h r e a d s

for ( s i z e _ t i = 0; i < t h r e a d s . s i z e () ; ++ i ) t h r e a d s [ i ] =

std :: t h r e a d ( Norm , std :: c r e f ( x ) , std :: ref ( n o r m ) ,i , n u m T h r e a d s ) ; // R e j o i n t h r e a d s

for ( s i z e _ t i = 0; i < t h r e a d s . s i z e () ; ++ i ) t h r e a d s [ i ]. j o i n () ;

std :: c o u t < < " N o r m is : " < < s q r t ( n o r m ) < < std :: e n d l ; r e t u r n 0;

}

(24)

Threads Calculation of the Vector Norm with Tree Combine

Parallelization of the Sum

• The calculation of the global sum of the norm is not parallel when using a

mutex

for the calculation of s = s + t .

• It can be parallelized as follows (P = 8):

s = s

0

+ s

1

| {z }

s01

+ s

2

+ s

3

| {z }

s23

| {z }

s₀₁₂₃

+ s

4

+ s

5

| {z }

s45

+ s

6

+ s

7

| {z }

s67

| {z }

s₄₅₆₇

| {z }

s

• This reduces the complexity from O(P) to O(log

₂

P).

(25)

Tree Combine

If we use a binary representation of the process ID, the communication structure is a binary tree:

s

₀

000 s

₁

@ 001

@ I

s

₂

010 s

₃

@ 011

@ I

s

₄

100 s

₅

@ 101

@ I

s

₆

110 s

₇

@ 111

@ I s

₀

+ s

₁

000 *

s

₂

+ s

₃

H 010 H H H Y

s

₄

+ s

₅

100 *

s

₆

+ s

₇

H 110 H H H Y s

0

+ s

1

+ s

2

+ s

3

000 :

s

4

+ s

5

+ s

6

+ s

7

X 100

X X

y

P s

_i

000

(26)

Calculation of the Vector Norm with Tree Combine

# i n c l u d e < array >

# i n c l u d e < vector >

# i n c l u d e < i o s t r e a m >

# i n c l u d e < thread >

# i n c l u d e < cmath >

v o i d N o r m (c o n s t std :: vector <double>& x , std :: vector <double>& norm , std :: vector <bool>& flag ,c o n s t s i z e _ t i , c o n s t s i z e _ t d ) {

s i z e _ t p = pow (2 , d ) ;

for ( s i z e _ t j =0; j < n u m E l e m ;++ j ) n o r m [ i ]+= x [ f i r s t + j ]* x [ f i r s t + j ];

(27)

Calculation of the Vector Norm with Tree Combine

// T r e e c o m b i n e

for ( s i z e _ t j =0; j < d ;++ j ) {

s i z e _ t m = pow (2 , j ) ; if ( i & m )

{

f l a g [ i ]=t r u e; b r e a k; }

w h i l e (! f l a g [ i | m ]) ; n o r m [ i ] += n o r m [ i | m ];

} }

(28)

Calculation of the Vector Norm with Tree Combine

int m a i n () {

c o n s t s i z e _ t l o g T h r e a d s = 1;

c o n s t s i z e _ t n u m T h r e a d s = pow (2 , l o g T h r e a d s ) ; c o n s t s i z e _ t n u m V a l u e s = 1 0 0 0 0 0 0 0 ;

std :: array < std :: thread , n u m T h r e a d s > t h r e a d s ; std :: vector <double> x ( n u m V a l u e s , 2 . 0 ) ; std :: vector <bool> f l a g ( n u m T h r e a d s ,f a l s e) ; std :: vector <double> n o r m ( n u m T h r e a d s , 0 . 0 ) ; // S t a r t t h r e a d s

t h r e a d s [ i ] = std :: t h r e a d ( Norm , std :: c r e f ( x ) , std :: ref ( n o r m ) , std :: ref ( f l a g ) ,i , l o g T h r e a d s ) ; // R e j o i n t h r e a d s

for ( s i z e _ t i =0; i < t h r e a d s . s i z e () ; ++ i ) t h r e a d s [ i ]. j o i n () ;

std :: c o u t < < " N o r m is : " < < s q r t ( n o r m [ 0 ] ) < < std :: e n d l ; r e t u r n 0;

}

(29)

Condition Synchronization

• Tree combine is a form of condition synchronization.

• A process is waiting until a condition (Boolean expression) is true. The condition is made true by another process.

• Here several processes wait until their flags become true.

• The flags are also called condition variables. Their proper initialization is important.

• We implement the synchronization with busy wait. That’s probably not a good idea for multi-threading.

• If condition variables are used repeatedly (e.g. when several sums must be calculated in succession), the following rules should be followed:

•

A process which is waiting on a condition variable also resets it again.

•

A condition variable may only be set to true again if it is sure that it has been

reset earlier.

(30)

Threads Thread Creation withasync

Thread Creation with async

The function

async

allows the easy creation of a thread that performs its

calculation without interaction with the main program or other threads. Its result can later be queried using the function

get

.

# include< vector >

# include< numeric >

# include< thread >

# include< future >

t e m p l a t e<c l a s s T > s t r u c t A c c u m { // s i m p l e a c c u m u l a t o r f u n c t i o n o b j e c t

T * b ; T * e ;

T val ;

A c c u m ( T * bb , T * ee , T vv ) : b { bb } , e { ee } , val { vv } {}

T o p e r a t o r() () { r e t u r n std :: a c c u m u l a t e ( b , e , val ) ; } };

(31)

Threads Thread Creation withasync

Thread Creation with async

The function

async

allows the easy creation of a thread that performs its

calculation without interaction with the main program or other threads. Its result can later be queried using the function

get

.

d o u b l e comp ( s t d : : v e c t o r<d o u b l e>& v ) // spawn many t a s k s i f v i s l a r g e e n o u g h {

i f ( v . s i z e ( )<10000)

r e t u r n s t d : : a c c u m u l a t e ( v . b e g i n ( ) , v . end ( ) , 0 . 0 ) ; s t d : : f u t u r e<d o u b l e> f 0

{s t d : : a s y n c ( Accum<d o u b l e>{&v [ 0 ] , & v [ v . s i z e ( ) / 4 ] , 0 . 0})}; s t d : : f u t u r e<d o u b l e> f 1

{s t d : : a s y n c ( Accum<d o u b l e>{&v [ v . s i z e ( ) / 4 ] , & v [ v . s i z e ( ) / 2 ] , 0 . 0})}; s t d : : f u t u r e<d o u b l e> f 2

{s t d : : a s y n c ( Accum<d o u b l e>{&v [ v . s i z e ( ) / 2 ] , & v [ v . s i z e ( )∗3 / 4 ] , 0 . 0})};

s t d : : f u t u r e<d o u b l e> f 3

{s t d : : a s y n c ( Accum<d o u b l e>{&v [ v . s i z e ( )∗3 / 4 ] , & v [ v . s i z e ( ) ] , 0 . 0})}; // l o t s o f o t h e r c o d e c o u l d come h e r e . . .

r e t u r n f 0 . g e t ( )+f 1 . g e t ( )+f 2 . g e t ( )+f 3 . g e t ( ) ; }

i n t main ( ) {

s t d : : v e c t o r<d o u b l e> b l u b ( 1 0 0 0 0 0 , 1 . ) ;

(32)

Threads Atomics

Discrete Fourier Transform

• Let (z

i

, a

i

) be pairs of numbers where a

i

is the measured value of an unknown function at the location z

i

.

• Goal: calculate interpolation A(z ) = 1

N f

0

+ f

1

z + · · · + f

N−1

z

^N−1

• Here we use the N-th roots of unity z

i

= e

^2πi^N ^k

as nodes. The coefficients of the interpolation polynomial are then

f

k

=

N−1

X

j=0

a

j

· e

^−2πi^jk^N

for k = 0, . . . , N − 1

(33)

Threads Atomics

Fast Fourier Transform

• For even values of N, N = 2n, the sum

f

k

=

2n−1

X

j=0

a

j

· e

^−2πi²ⁿ^jk

for k = 0, . . . , 2n − 1

can be rearranged into

f

_k

=

n−1

X

j=0

a

_2j

· e

^−2πi^2jk²ⁿ

+

n−1

X

j=0

a

_2j+1

· e

^−2πi^k(2j+1)²ⁿ

(34)

Threads Atomics

Fast Fourier Transform

• Setting a

_k⁰

= a

2k

, f

_k⁰

= f

2k

, a

⁰⁰_k

= a

2k+1

and f

_k⁰⁰

= f

2k+1

, we have

f

k

=

n−1

X

j=0

a

⁰_j

· e

⁻^2πiⁿ ^jk

+ e

⁻^πiⁿ^k

n−1

X

j=0

a

⁰⁰_j

e

⁻^2πiⁿ ^jk

= (

f

_k⁰

+ e

⁻^πiⁿ^k

f

_k⁰⁰

if k < n f

_k−n⁰

+ e

⁻^πiⁿ^(k⁻ⁿ⁾

f

_k⁰⁰_−n

if k ≥ n

• This means we can solve the problem by calculating two Fourier transforms of length N/2.

• This can be applied recursively. Since the Fourier transform of a value is the value itself, N/2 sums of two values must be calculated on the lowest level.

• Since at each stage N complex multiplications with a unit root and N

complex additions are necessary, the complexity can be reduced from O(N

²

)

to O(N · log(N)).

(35)

Threads Atomics

Recursive Algorithm in Pseudocode

p r o c e d u r e R _ F F T ( X , Y , N , w ) if ( n = = 1 ) t h e n Y [0] := X [ 0 ] ; e l s e b e g i n

R _ F F T ( < X [0] , X [0] ,... , X [ N -2] > , < Q [0] , Q [1] ,... , Q [ N -2] > , N /2 , w ^2) ; R _ F F T ( < X [1] , X [3] ,... , X [ N -1] > , < T [0] , T [1] ,... , T [ N -2] > , N /2 , w ^2) ;

for k := 0 to N -1 do

Y [ k ] := Q [ k mod ( N /2) ] + w ^ k T [ k mod ( N /2) ];

end R _ F F T

with w = e

⁻^2πi^N

. But: recursion parallelizes poorly.

(36)

Threads Atomics

Iterative Algorithm in Pseudocode

p r o c e d u r e I T E R A T I V E _ F F T ( X , Y , N , w ) r := log N ;

for i := 0 to N -1 do R [ i ] := X [ i ];

for m := 0 to r -1 do

for i := 0 to N -1 do S [ i ] := R [ i ];

for i := 0 to N -1 do

/* Let ( b_0 , b_1 ,... , b_r -1) the b i n a r y r e p r e s e n t a t i o n of i */

j := ( b_0 ,... , b_m -1 ,0 , b_m +1 ,... , b_r -1) ; k := ( b_0 ,... , b_m -1 ,1 , b_m +1 ,... , b_r -1) ;

R [ i ] := S [ j ] + S [ k ] x w ^( b_m , b_m -1 ,... , b_0 ,0 ,... ,0) ; e n d f o r

e n d f o r

for i := 0 to N -1 do Y [ i ] := R [ i ];

end I T E R A T I V E _ F F T

with w = e

⁻^2πiⁿ

(37)

Threads Atomics

Sequential Implementation of FFT

std :: vector < std :: complex <double> > fft (c o n s t std :: vector < std :: complex <double> >& d a t a ) {

c o n s t s i z e _ t n u m L e v e l s =( s i z e _ t ) std :: l o g 2 ( d a t a . s i z e () ) ;

std :: vector < std :: complex <double> > r e s u l t ( d a t a . b e g i n () , d a t a . end () ) ; std :: vector < std :: complex <double> > r e s u l t N e w ( d a t a . s i z e () ) ;

std :: vector < std :: complex <double> > r o o t ( d a t a . s i z e () ) ; for ( s i z e _ t j =0; j < r o o t . s i z e () ;++ j )

r o o t [ j ]= u n i t r o o t ( j , r o o t . s i z e () ) ; for ( s i z e _ t i =0; i < n u m L e v e l s ;++ i ) {

s i z e _ t m a s k =1 < <( n u m L e v e l s -1 - i ) ; s i z e _ t i n v M a s k =~ m a s k ;

for ( s i z e _ t j =0; j < d a t a . s i z e () ;++ j ) {

s i z e _ t k = j & i n v M a s k ; s i z e _ t l = j | m a s k ;

r e s u l t N e w [ j ]= r e s u l t [ k ]+ r e s u l t [ l ]* r o o t [ R e v e r s a l ( j / mask , i +1) * m a s k ];

}

if ( i != n u m L e v e l s -1) r e s u l t . s w a p ( r e s u l t N e w ) ; }

r e t u r n r e s u l t N e w ; }

(38)

Threads Atomics

Sorting back Values by Bit Reversal

v o i d S o r t B i t r e v e r s a l ( std :: vector < std :: complex <double> >& r e s u l t ) {

std :: vector < std :: complex <double> > r e s u l t N e w ( r e s u l t . s i z e () ) ; c o n s t s i z e _ t n = std :: l o g 2 ( r e s u l t . s i z e () ) ;

for ( s i z e _ t j =0; j < r e s u l t . s i z e () ;++ j ) r e s u l t N e w [ R e v e r s a l ( j , n ) ]= r e s u l t [ j ];

r e s u l t . s w a p ( r e s u l t N e w ) ; }

v o i d F a s t S o r t B i t r e v e r s a l ( std :: vector < std :: complex <double> >& r e s u l t ) {

s i z e _ t n = r e s u l t . s i z e () ; c o n s t s i z e _ t t = std :: l o g 2 ( n ) ; s i z e _ t l = 1;

std :: vector < size_t > c ( n ) ; for ( s i z e _ t q =0; q < t ;++ q ) {

n = n /2;

for( s i z e _ t j =0; j < l ;++ j ) c [ l + j ]= c [ j ]+ n ; l =2* l ;

}

std :: vector < std :: complex <double> > r e s u l t N e w ( r e s u l t . s i z e () ) ; for ( s i z e _ t j =0; j < r e s u l t . s i z e () ;++ j )

r e s u l t N e w [ c [ j ]]= r e s u l t [ j ];

r e s u l t . s w a p ( r e s u l t N e w ) ;

(39)

Threads Atomics

Bit Reversal and Roots of Unity

s i z e _ t R e v e r s a l ( s i z e _ t k , s i z e _ t n ) {

s i z e _ t j =0;

s i z e _ t m a s k =1;

if ( k & m a s k ) ++ j ;

for ( s i z e _ t i =0; i <( n -1) ;++ i ) {

mask < <=1;

j < <=1;

if ( k & m a s k ) ++ j ; }

r e t u r n j ; }

i n l i n e std :: complex <double> u n i t r o o t ( s i z e _ t i , s i z e _ t N ) {

d o u b l e arg = -( i *2* M _ P I / N ) ;

r e t u r n std :: complex <double>( cos ( arg ) , sin ( arg ) ) ; }

(40)

Threads Atomics

Time Measurement

C++11 offers a builtin way to measure time. There are various clocks and data types to store times and timespans.

# include< chrono >

# include< thread >

u s i n g n a m e s p a c e std :: c h r o n o ;

int m a i n () {

s t e a d y _ c l o c k :: t i m e _ p o i n t s t a r t = s t e a d y _ c l o c k :: now () ; std :: t h i s _ t h r e a d :: s l e e p _ f o r ( s e c o n d s { 2 } ) ;

a u t o now = s t e a d y _ c l o c k :: now () ;

n a n o s e c o n d s d u r a t i o n = now - s t a r t ; // we w a n t the r e s u l t in ns m i l l i s e c o n d s d u r a t i o n M s = d u r a t i o n _ c a s t < m i l l i s e c o n d s >( d u r a t i o n ) ; std :: c o u t < < " s o m e t h i n g t o o k " < < d u r a t i o n . c o u n t ()

< < " ns w h i c h is " < < d u r a t i o n M s . c o u n t () < < " ms \ n "; s e c o n d s sec = h o u r s {2} + m i n u t e s { 3 5 } + s e c o n d s { 9 } ;

std :: c o u t < < " 2 h 35 m 9 s is " < < sec . c o u n t () < < " s \ n ";

}

(41)

Threads Atomics

Main Program for Sequential FFT

int m a i n () {

c o n s t s i z e _ t n u m P o i n t s = pow (2 ,16) ;

std :: vector < std :: complex <double> > d a t a ( n u m P o i n t s ) ; s i z e _ t i =1;

for (a u t o& x : d a t a ) x . r e a l ( cos ( i ++) ) ;

a u t o t0 = std :: c h r o n o :: s t e a d y _ c l o c k :: now () ;

std :: vector < std :: complex <double> > r e s u l t 1 = fft ( d a t a ) ; F a s t S o r t B i t r e v e r s a l ( r e s u l t 1 ) ;

a u t o t1 = std :: c h r o n o :: s t e a d y _ c l o c k :: now () ; for (a u t o& x : r e s u l t 1 )

std :: c o u t < < std :: abs ( x ) < < " " < < x . r e a l () < < " " < < x . i m a g () < <

std :: e n d l ;

std :: c o u t < < std :: e n d l < < std :: e n d l ; std :: c o u t < < " # S e q u e n t i a l fft t o o k " < <

std :: c h r o n o :: d u r a t i o n _ c a s t < std :: c h r o n o :: m i l l i s e c o n d s >( t1 - t0 ) . c o u n t ()

< < " m i l l i s e c o n d s . " < < std :: e n d l ; }

(42)

Threads Atomics

Barrier

c o n s t s i z e _ t n u m T h r e a d s =1;// std :: t h r e a d :: h a r d w a r e _ c o n c u r r e n c y () ; s t r u c t B a r r i e r

{

std :: atomic < size_t > c o u n t _ ; std :: atomic < size_t > s t e p _ ; B a r r i e r () : c o u n t _ (0) , s t e p _ (0) {}

v o i d b l o c k ( s i z e _ t n u m T h r e a d s ) {

if ( n u m T h r e a d s <2) r e t u r n;

s i z e _ t s t e p = s t e p _ . l o a d () ;

if ( c o u n t _ . f e t c h _ a d d (1) == ( n u m T h r e a d s -1) ) {

c o u n t _ . s t o r e (0) ; s t e p _ . f e t c h _ a d d (1) ; r e t u r n;

} e l s e {

w h i l e ( s t e p _ . l o a d () == s t e p ) std :: t h i s _ t h r e a d :: y i e l d () ; r e t u r n;

}

(43)

Threads Atomics

Parallel FFT: Threads

B a r r i e r b a r r i e r ;

v o i d f f t t h r e a d (c o n s t s i z e _ t t h r e a d N u m , std :: vector < std :: complex <double> >&

root , std :: vector < std :: complex <double> >& result , std :: vector < std :: complex <double> >& r e s u l t N e w ) {

c o n s t s i z e _ t N = std :: l o g 2 ( r e s u l t . s i z e () ) ; s i z e _ t n = r e s u l t . s i z e () / n u m T h r e a d s ; c o n s t s i z e _ t o f f s e t = t h r e a d N u m * n +

std :: min ( t h r e a d N u m , r e s u l t . s i z e () % n u m T h r e a d s ) ; n += ( t h r e a d N u m <( r e s u l t . s i z e () % n u m T h r e a d s ) ? 1 : 0 ) ; for ( s i z e _ t i = o f f s e t ; i < o f f s e t + n ;++ i )

r o o t [ i ]= u n i t r o o t ( i , r o o t . s i z e () ) ; b a r r i e r . b l o c k ( n u m T h r e a d s ) ;

for ( s i z e _ t l e v e l =0; level < N ;++ l e v e l ) {

s i z e _ t m a s k =1 < <( N - level -1) ; s i z e _ t i n v M a s k =~ m a s k ;

(44)

Threads Atomics

Parallel FFT: Threads

for ( s i z e _ t i =0; i < n ;++ i ) {

s i z e _ t j = o f f s e t + i ; s i z e _ t k = j & i n v M a s k ; s i z e _ t l = j | m a s k ;

s i z e _ t e = R e v e r s a l ( j / mask , l e v e l +1) * m a s k ;

std :: c o u t < < " # j : " < < j < < " k : " < < k < < " l : " < < l < < " e :

" < < e < < std :: e n d l ;

r e s u l t N e w [ j ]= r e s u l t [ k ]+ r e s u l t [ l ]* r o o t [ R e v e r s a l ( j / mask , l e v e l +1) * m a s k ];

}

b a r r i e r . b l o c k ( n u m T h r e a d s ) ; if ( t h r e a d N u m = = 0 )

r e s u l t . s w a p ( r e s u l t N e w ) ; b a r r i e r . b l o c k ( n u m T h r e a d s ) ; }

for ( s i z e _ t j = o f f s e t ; j < o f f s e t + n ;++ j ) r e s u l t N e w [ R e v e r s a l ( j , N ) ]= r e s u l t [ j ];

}

(45)

Threads Atomics

Parallel FFT: Threads

std :: vector < std :: complex <double> > fft (c o n s t std :: vector < std :: complex <double> >& d a t a ) {

std :: vector < std :: complex <double> > r e s u l t ( d a t a . b e g i n () , d a t a . end () ) ; std :: vector < std :: complex <double> > r e s u l t N e w ( d a t a . s i z e () ) ;

std :: vector < std :: complex <double> > r o o t ( d a t a . s i z e () ) ; std :: vector < std :: thread > t ( n u m T h r e a d s ) ;

for ( s i z e _ t p = 0; p < n u m T h r e a d s ;++ p ) t [ p ]= std :: t h r e a d

{ f f t t h r e a d , p , std :: ref ( r o o t ) , std :: ref ( r e s u l t ) , std :: ref ( r e s u l t N e w ) };

for ( s i z e _ t p = 0; p < t . s i z e () ;++ p ) t [ p ]. j o i n () ;

r e t u r n r e s u l t N e w ; }

(46)

Threads Atomics

Main Program for Parallel FFT

int m a i n () {

c o n s t s i z e _ t n u m P o i n t s = pow (2 ,2) ;

std :: vector < std :: complex <double> > d a t a ( n u m P o i n t s ) ; s i z e _ t i =1;

for (a u t o& x : d a t a ) x . r e a l ( cos ( i ++) ) ;

a u t o t0 = std :: c h r o n o :: s t e a d y _ c l o c k :: now () ;

std :: vector < std :: complex <double> > r e s u l t 1 = fft ( d a t a ) ; a u t o t1 = std :: c h r o n o :: s t e a d y _ c l o c k :: now () ;

for (a u t o& x : r e s u l t 1 )

std :: c o u t < < std :: abs ( x ) < < " " < < x . r e a l () < < " " < < x . i m a g () < <

std :: e n d l ;

std :: c o u t < < std :: e n d l < < std :: e n d l ;

std :: c o u t < < " # P a r a l l e l fft w i t h " < < n u m T h r e a d s < < " t h r e a d s t o o k " < <

std :: c h r o n o :: d u r a t i o n _ c a s t < std :: c h r o n o :: m i l l i s e c o n d s >( t1 - t0 ) . c o u n t ()

< < " m i l l i s e c o n d s . " < < std :: e n d l ; }

(47)

Threads Further Reading

Object-Oriented Programming for Scientiﬁc Computing

Object-Oriented Programming for Scientific Computing

C++11 Multithreading

Ole Klein

7. Juli 2015

C++11: Random Numbers

C++11 provides powerful random number generators that are able to work with different distributions, e.g. uniformly distributed random numbers:

Normally Distributed Random Numbers

Own RNG for Uniformly Distributed Integers

Own RNG for Normally Distributed Doubles

This also works for other number types:

Application

Application

Processes and threads in Unix

• The following information is associated with a process in UNIX:

IDs (process, user, group)

environment variables

directory

code

registers, stack, heap

file descriptors, signals

message queues, pipes, shared-memory segments

shared libraries

• Each process has its own address space.

• Threads (small-scale processes) exist within a process and share the address space of the main process.

• A thread consists of:

ID

stack pointer

registers

scheduling properties

signals

• Switching between threads is faster

than switching between processes.

C++11 Threads

• POSIX threads (or pthreads) were introduced in 1995 and have been the standard model for threads on UNIX computers for a long time. They are created and managed via a C interface.

• C++11 defines its own threading concept, which basically is a wrapper for POSIX threads. In many cases this makes it easier to write multi-threading programs. The following concepts are supported:

Threads Thread classes and functions for the rejoining of threads are defined in the header

. This header also contains functions that provide a thread ID and the functionality to detach threads that afterwards run as a separate program (fork).

Mutual Exclusion The header

provides classes for mutexes and locks

(also recursive variants and ones with timeouts), and

additionally functions that check if a lock is available.

C++11 Threads

Condition Variables The header

defines a class that allows the coordination of multiple threads with condition variables.

Futures The header

defines classes and functions whose operations need not be executed immediately, but can be performed

asynchronously with the main part of the program. If possible, this is done in parallel, and if not, sequentially at the location where the result is needed.

Atomic Operations The classes from the header

allow certain operations

with integers, Boolean variables and pointers to be performed as an

atomic operation.

C++11 Thread Production

Output

Calculation of the Vector Norm

Calculation of the Vector Norm

Critical Sections

• A statement of the type s=s+t is not atomic, which means it will not be carried out in one step:

process Π

: process Π

:

1 load s in R1 3 load s in R1

load t in R2 load t in R2

add R1, R2, store in R3 add R1, R2, store in R3 2 store R3 in s 4 store R3 in s

• The order of execution of these instructions relative to each other is not fixed.

• This results in an exponentially growing sequence of possible execution orders.

Possible Execution Orders

@

@

@ @ 1

3

@

@ @

@

@ @ 2

3 1

4

@

@ @ 3 2 4 2 4 1

4 4 2 4 2 2