• Keine Ergebnisse gefunden

Object-Oriented Programming for Scientific Computing

N/A
N/A
Protected

Academic year: 2021

Aktie "Object-Oriented Programming for Scientific Computing"

Copied!
47
0
0

Wird geladen.... (Jetzt Volltext ansehen)

Volltext

(1)

Object-Oriented Programming for Scientific Computing

C++11 Multithreading

Ole Klein

Interdisciplinary Center for Scientific Computing Heidelberg University

ole.klein@iwr.uni-heidelberg.de

7. Juli 2015

(2)

Random numbers

C++11: Random Numbers

C++11 provides powerful random number generators that are able to work with different distributions, e.g. uniformly distributed random numbers:

# include< i o s t r e a m >

# include< iomanip >

# include< string >

# include< map >

# include< random >

# include< cmath >

int m a i n () {

// I m p l e m e n t a t i o n - d e p e n d e n t p r e d e f i n e d RNG std :: d e f a u l t _ r a n d o m _ e n g i n e e1 ;

// P r o d u c e u n i f o r m l y d i s t r i b u t e d i n t e g e r s b e t w e e n 1 and 6 std :: u n i f o r m _ i n t _ d i s t r i b u t i o n <int> u n i f o r m D i s t (1 ,6) ; // D r a w a n u m b e r

int m e a n = u n i f o r m D i s t ( e1 ) ;

std :: c o u t < < " R a n d o m l y g e n e r a t e d m e a n : " < < m e a n < < std :: e n d l ;

(3)

Random numbers

Normally Distributed Random Numbers

// G e n e r a t o r for non - d e t e r m i n i s t i c u n i f o r m l y d i s t r i b u t e d r a n d o m n u m b e r s

std :: r a n d o m _ d e v i c e rd ;

// A l t e r n a t i v e RNG b a s e d on M e r s e n n e T w i s t e r a l g o r i t h m std :: m t 1 9 9 3 7 e2 ( rd () ) ;

// C r e a t e n o r m a l d i s t r i b u t i o n a r o u n d m e a n std :: n o r m a l _ d i s t r i b u t i o n < > n o r m a l D i s t ( mean ,2) ; std :: map <int,int> h i s t ;

// C r e a t e r a n d o m n u m b e r s and c o u n t for (int n =0; n < 1 0 0 0 0 ; ++ n )

++ h i s t [ std :: r o u n d ( n o r m a l D i s t ( e2 ) ) ];

// P r i n t h i s t o g r a m to s c r e e n

std :: c o u t < < " N o r m a l d i s t r i b u t i o n a r o u n d m e a n " < < m e a n < < " : "

< < std :: e n d l ; for (a u t o p : h i s t )

std :: c o u t < < std :: f i x e d < < std :: s e t p r e c i s i o n (1) < < std :: s e t w (2)

< < p . f i r s t < < " " < < std :: s t r i n g ( p . s e c o n d /200 ,’ * ’)

< < std :: e n d l ; }

(4)

Random numbers

Own RNG for Uniformly Distributed Integers

c l a s s R a n d o m I n t {

p u b l i c:

R a n d o m I n t (int low , int high , u n s i g n e d int s e e d =0) :

r a n d _ { std :: b i n d ( std :: u n i f o r m _ i n t _ d i s t r i b u t i o n < >{ low , h i g h } , std :: d e f a u l t _ r a n d o m _ e n g i n e { s e e d }) }

{}

int o p e r a t o r() () c o n s t {

r e t u r n r a n d _ () ; }

p r i v a t e:

std :: f u n c t i o n <int() > r a n d _ ; };

(5)

Random numbers

Own RNG for Normally Distributed Doubles

This also works for other number types:

c l a s s R a n d o m D o u b l e {

p u b l i c:

R a n d o m D o u b l e (d o u b l e mean , d o u b l e stddev , u n s i g n e d int s e e d =0) : r a n d _ { std :: b i n d ( std :: n o r m a l _ d i s t r i b u t i o n < >{ mean , s t d d e v } ,

std :: m t 1 9 9 3 7 { s e e d }) } {}

d o u b l e o p e r a t o r() () c o n s t {

r e t u r n r a n d _ () ; }

p r i v a t e:

std :: f u n c t i o n <d o u b l e() > r a n d _ ; };

(6)

Random numbers

Application

int m a i n () {

// G e n e r a t o r for non - d e t e r m i n i s t i c u n i f o r m l y d i s t r i b u t e d r a n d o m n u m b e r s

std :: r a n d o m _ d e v i c e rd ; R a n d o m I n t r a n d I n t {1 ,6 , rd () };

int m e a n = r a n d I n t () ;

std :: c o u t < < " R a n d o m g e n e r a t e d m e a n : " < < m e a n < < std :: e n d l ; R a n d o m D o u b l e r a n d D o u b l e {(d o u b l e) mean ,2. , rd () };

std :: map <int,int> h i s t ;

// C r e a t e r a n d o m n u m b e r s and c o u n t for (int n =0; n < 1 0 0 0 0 ; ++ n )

++ h i s t [ std :: r o u n d ( r a n d D o u b l e () ) ];

// P r i n t h i s t o g r a m to s c r e e n

std :: c o u t < < " N o r m a l d i s t r i b u t i o n a r o u n d m e a n " < < m e a n < < " : "

< < std :: e n d l ; for (a u t o p : h i s t )

std :: c o u t < < std :: f i x e d < < std :: s e t p r e c i s i o n (1) < < std :: s e t w (2)

< < p . f i r s t < < " " < < std :: s t r i n g ( p . s e c o n d /200 ,’ * ’)

< < std :: e n d l ;

(7)

Random numbers

Application

// C r e a t e r a n d o m n u m b e r s and c o u n t for (int n =0; n < 1 0 0 0 0 ; ++ n )

++ h i s t [ std :: r o u n d ( r a n d D o u b l e () ) ];

// P r i n t h i s t o g r a m to s c r e e n

std :: c o u t < < " N o r m a l d i s t r i b u t i o n a r o u n d m e a n " < < m e a n < < " : "

< < std :: e n d l ; for (a u t o p : h i s t )

std :: c o u t < < std :: f i x e d < < std :: s e t p r e c i s i o n (1) < < std :: s e t w (2)

< < p . f i r s t < < " " < < std :: s t r i n g ( p . s e c o n d /200 ,’ * ’)

< < std :: e n d l ; }

(8)

Threads Introduction

Processes and threads in Unix

• The following information is associated with a process in UNIX:

IDs (process, user, group)

environment variables

directory

code

registers, stack, heap

file descriptors, signals

message queues, pipes, shared-memory segments

shared libraries

• Each process has its own address space.

• Threads (small-scale processes) exist within a process and share the address space of the main process.

• A thread consists of:

ID

stack pointer

registers

scheduling properties

signals

• Switching between threads is faster

than switching between processes.

(9)

Threads Introduction

C++11 Threads

• POSIX threads (or pthreads) were introduced in 1995 and have been the standard model for threads on UNIX computers for a long time. They are created and managed via a C interface.

• C++11 defines its own threading concept, which basically is a wrapper for POSIX threads. In many cases this makes it easier to write multi-threading programs. The following concepts are supported:

Threads Thread classes and functions for the rejoining of threads are defined in the header

threads

. This header also contains functions that provide a thread ID and the functionality to detach threads that afterwards run as a separate program (fork).

Mutual Exclusion The header

mutex

provides classes for mutexes and locks

(also recursive variants and ones with timeouts), and

additionally functions that check if a lock is available.

(10)

Threads Introduction

C++11 Threads

Condition Variables The header

condition_variable

defines a class that allows the coordination of multiple threads with condition variables.

Futures The header

futures

defines classes and functions whose operations need not be executed immediately, but can be performed

asynchronously with the main part of the program. If possible, this is done in parallel, and if not, sequentially at the location where the result is needed.

Atomic Operations The classes from the header

atomic

allow certain operations

with integers, Boolean variables and pointers to be performed as an

atomic operation.

(11)

Threads C++11 Thread Production

C++11 Thread Production

# i n c l u d e < array >

# i n c l u d e < i o s t r e a m >

# i n c l u d e < thread >

v o i d H e l l o ( s i z e _ t n u m b e r ) {

std :: c o u t < < " H e l l o f r o m t h r e a d " < < n u m b e r < < std :: e n d l ; }

int m a i n () {

c o n s t int n u m T h r e a d s = 5;

std :: array < std :: thread , n u m T h r e a d s > t h r e a d s ; // S t a r t t h r e a d s

for ( s i z e _ t i = 0; i < t h r e a d s . s i z e () ; ++ i ) t h r e a d s [ i ] = std :: t h r e a d ( Hello , i ) ; std :: c o u t < < " H e l l o f r o m m a i n \ n ";

// R e j o i n threads , i m p l i c i t b a r r i e r

for ( s i z e _ t i = 0; i < t h r e a d s . s i z e () ; ++ i ) t h r e a d s [ i ]. j o i n () ;

r e t u r n 0;

}

(12)

Threads C++11 Thread Production

Output

H e l l o f r o m t h r e a d H e l l o f r o m t h r e a d H e l l o f r o m t h r e a d H e l l o f r o m t h r e a d H e l l o f r o m m a i n

0 H e l l o f r o m t h r e a d 213 4

(13)

Threads Example: Calculation of the Vector Norm

Calculation of the Vector Norm

# include< array >

# include< vector >

# include< i o s t r e a m >

# include< thread >

v o i d N o r m (c o n s t std :: vector <double>& x , d o u b l e& norm , c o n s t s i z e _ t i , c o n s t s i z e _ t p )

{

s i z e _ t n u m E l e m = x . s i z e () / p ;

s i z e _ t f i r s t = n u m E l e m * i + std :: min ( i , x . s i z e () % p ) ; n u m E l e m += ( i <( x . s i z e () % p ) ? 1 : 0 ) ;

d o u b l e l o c a l N o r m = 0 . 0 ;

for ( s i z e _ t j =0; j < n u m E l e m ; ++ j )

l o c a l N o r m += x [ f i r s t + j ] * x [ f i r s t + j ];

n o r m += l o c a l N o r m ; // d a n g e r o u s ...

}

(14)

Threads Example: Calculation of the Vector Norm

Calculation of the Vector Norm

int m a i n () {

c o n s t s i z e _ t n u m T h r e a d s = 5;

c o n s t s i z e _ t n u m V a l u e s = 1 0 0 0 0 0 0 ;

std :: array < std :: thread , n u m T h r e a d s > t h r e a d s ; std :: vector <double> x ( n u m V a l u e s , 2 . 0 ) ; d o u b l e n o r m = 0 . 0 ;

// C r e a t e t h r e a d s

for ( s i z e _ t i =0; i < t h r e a d s . s i z e () ; ++ i )

t h r e a d s [ i ] = std :: t h r e a d ( Norm , std :: c r e f ( x ) , std :: ref ( n o r m ) ,i , n u m T h r e a d s ) ; // R e j o i n t h r e a d s w i t h m a i n thread , b a r r i e r

for ( s i z e _ t i =0; i < t h r e a d s . s i z e () ; ++ i ) t h r e a d s [ i ]. j o i n () ;

std :: c o u t < < " N o r m is : " < < n o r m < < std :: e n d l ; r e t u r n 0;

}

(15)

Threads Example: Calculation of the Vector Norm

Critical Sections

• A statement of the type s=s+t is not atomic, which means it will not be carried out in one step:

process Π

1

: process Π

2

:

1 load s in R1 3 load s in R1

load t in R2 load t in R2

add R1, R2, store in R3 add R1, R2, store in R3 2 store R3 in s 4 store R3 in s

• The order of execution of these instructions relative to each other is not fixed.

• This results in an exponentially growing sequence of possible execution orders.

(16)

Threads Example: Calculation of the Vector Norm

Possible Execution Orders

@

@

@ @ 1

3

@

@ @

@

@ @ 2

3 1

4

@

@ @ 3 2 4 2 4 1

4 4 2 4 2 2

Result of the calculation s = t

Π1

+ t

Π2

s = t

Π2

s = t

Π1

s = t

Π2

s = t

Π1

s = t

Π1

+ t

Π2

Only some of the sequences lead to the right result!

(17)

Threads Mutual Exclusion/Locks

Mutual Exclusion

• An additional synchronization is necessary to prevent execution orders that do not provide the correct result.

• Critical sections must be executed under mutual exclusion.

• Mutual exclusion requires:

At most one process enters the critical section at a time.

There are no deadlocks (i.e. two processes that are waiting on each other).

No process waits at a free critical section.

When a process enters a critical section, it will leave it successfully in the end.

(18)

Threads Mutual Exclusion/Locks

Mutual Exclusion/Locks

# include< i o s t r e a m >

# include< vector >

# include< thread >

# include< mutex >

std :: m u t e x m ; v o i d e (int& sh ) {

m . l o c k () ;

sh += 1; // m o d i f y s h a r e d d a t a m . u n l o c k () ; // v e r y i m p o r t a n t }

• C++11 contains the construct

mutex

for this.

• A

mutex

can lock a section, so that it can only be accessed by one process.

• All other processes have to wait in front of the critical section. It should therefore be as small as possible.

• The same mutex must be available in all of the threads.

(19)

Threads Mutual Exclusion/Locks

Lock Guard

v o i d f (int& sh ) {

std :: l o c k _ g u a r d < std :: mutex > l o c k e d { m };

sh += 1; // m o d i f y s h a r e d d a t a }

• There are helper classes for releasing a

lock

again (even if an exception is thrown).

• The simplest of these is a

lock_guard

.

• At initialization it gets a

mutex

and locks it, and it releases the lock when the destructor of

lock_guard

is called.

• A

lock_guard

should therefore be defined either at the end of a function or in

a separate block to call the destructor as early as possible.

(20)

Threads Mutual Exclusion/Locks

Unique Lock

A

unique_lock

has more functionality than a

lock_guard

. It has functions that lock or release a

mutex

and can test whether a critical section can be entered, so that otherwise something else can be done. A

unique_lock

can also take over an already locked

mutex

or defer locking until it is needed.

v o i d g (int& sh ) {

std :: u n i q u e _ l o c k < std :: mutex > l o c k e d { m , std :: d e f e r _ l o c k }; // a d m i n i s t r a t e mutex , but don ’ t b l o c k it

b o o l s u c c e s s f u l = f a l s e; w h i l e (! s u c c e s s f u l ) {

if ( l o c k e d . t r y _ l o c k () ) // l o c k m u t e x if a b l e {

sh += 1; // m o d i f y s h a r e d d a t a s u c c e s s f u l = t r u e;

l o c k e d . u n l o c k () ;

// a c t u a l l y not n e c e s s a r y in t h i s e x a m p l e }

e l s e {

// do s o m e t h i n g d i f f e r e n t ...

} }

(21)

Threads Mutual Exclusion/Locks

Main Program

int m a i n () {

c o n s t s i z e _ t n u m T h r e a d s = std :: t h r e a d :: h a r d w a r e _ c o n c u r r e n c y () ; std :: vector < std :: thread > t h r e a d s { n u m T h r e a d s };

int r e s u l t = 0;

// s t a r t t h r e a d s

for ( s i z e _ t i =0; i < t h r e a d s . s i z e () ; ++ i )

t h r e a d s [ i ] = std :: t h r e a d { e , std :: ref ( r e s u l t ) };

// R e j o i n threads , i m p l i c i t b a r r i e r for ( s i z e _ t i =0; i < t h r e a d s . s i z e () ; ++ i )

t h r e a d s [ i ]. j o i n () ;

std :: c o u t < < " Y o u r h a r d w a r e s u p p o r t s " < < r e s u l t < < " t h r e a d s " < <

std :: e n d l ; r e t u r n 0;

}

(22)

Threads Calculation of the Vector Norm with a Mutex

Calculation of the Vector Norm with a Mutex

# include< array >

# include< vector >

# include< i o s t r e a m >

# include< thread >

# include< mutex >

# include< cmath >

s t a t i c std :: m u t e x n L o c k ;

v o i d N o r m (c o n s t std :: vector <double>& x , d o u b l e& norm , c o n s t s i z e _ t i , c o n s t s i z e _ t p )

{

s i z e _ t n u m E l e m = x . s i z e () / p ;

s i z e _ t f i r s t = n u m E l e m * i + std :: min ( i , x . s i z e () % p ) ; n u m E l e m += ( i <( x . s i z e () % p ) ? 1 : 0 ) ;

d o u b l e l o c a l N o r m = 0 . 0 ; for ( s i z e _ t j =0; j < n u m E l e m ;++ j )

l o c a l N o r m += x [ f i r s t + j ]* x [ f i r s t + j ];

std :: l o c k _ g u a r d < std :: mutex > b l o c k _ t h r e a d s _ u n t i l _ f i n i s h _ t h i s _ j o b ( n L o c k ) ; n o r m += l o c a l N o r m ;

}

(23)

Threads Calculation of the Vector Norm with a Mutex

Calculation of the Vector Norm with a Mutex

int m a i n () {

c o n s t s i z e _ t n u m T h r e a d s = 5;

c o n s t s i z e _ t n u m V a l u e s = 1 0 0 0 0 0 0 0 ;

std :: array < std :: thread , n u m T h r e a d s > t h r e a d s ; std :: vector <double> x ( n u m V a l u e s , 2 . 0 ) ; d o u b l e n o r m = 0 . 0 ;

// S t a r t t h r e a d s

for ( s i z e _ t i = 0; i < t h r e a d s . s i z e () ; ++ i ) t h r e a d s [ i ] =

std :: t h r e a d ( Norm , std :: c r e f ( x ) , std :: ref ( n o r m ) ,i , n u m T h r e a d s ) ; // R e j o i n t h r e a d s

for ( s i z e _ t i = 0; i < t h r e a d s . s i z e () ; ++ i ) t h r e a d s [ i ]. j o i n () ;

std :: c o u t < < " N o r m is : " < < s q r t ( n o r m ) < < std :: e n d l ; r e t u r n 0;

}

(24)

Threads Calculation of the Vector Norm with Tree Combine

Parallelization of the Sum

• The calculation of the global sum of the norm is not parallel when using a

mutex

for the calculation of s = s + t .

• It can be parallelized as follows (P = 8):

s = s

0

+ s

1

| {z }

s01

+ s

2

+ s

3

| {z }

s23

| {z }

s0123

+ s

4

+ s

5

| {z }

s45

+ s

6

+ s

7

| {z }

s67

| {z }

s4567

| {z }

s

• This reduces the complexity from O(P) to O(log

2

P).

(25)

Threads Calculation of the Vector Norm with Tree Combine

Tree Combine

If we use a binary representation of the process ID, the communication structure is a binary tree:

s

0

000

s

1

@ 001

@ I

s

2

010

s

3

@ 011

@ I

s

4

100

s

5

@ 101

@ I

s

6

110

s

7

@ 111

@ I s

0

+ s

1

000 *

s

2

+ s

3

H 010 H H H Y

s

4

+ s

5

100 *

s

6

+ s

7

H 110 H H H Y s

0

+ s

1

+ s

2

+ s

3

000 :

s

4

+ s

5

+ s

6

+ s

7

X 100

X X

X X

X X

y

P s

i

000

(26)

Threads Calculation of the Vector Norm with Tree Combine

Calculation of the Vector Norm with Tree Combine

# i n c l u d e < array >

# i n c l u d e < vector >

# i n c l u d e < i o s t r e a m >

# i n c l u d e < thread >

# i n c l u d e < cmath >

v o i d N o r m (c o n s t std :: vector <double>& x , std :: vector <double>& norm , std :: vector <bool>& flag ,c o n s t s i z e _ t i , c o n s t s i z e _ t d ) {

s i z e _ t p = pow (2 , d ) ;

s i z e _ t n u m E l e m = x . s i z e () / p ;

s i z e _ t f i r s t = n u m E l e m * i + std :: min ( i , x . s i z e () % p ) ; n u m E l e m += ( i <( x . s i z e () % p ) ? 1 : 0 ) ;

for ( s i z e _ t j =0; j < n u m E l e m ;++ j ) n o r m [ i ]+= x [ f i r s t + j ]* x [ f i r s t + j ];

(27)

Threads Calculation of the Vector Norm with Tree Combine

Calculation of the Vector Norm with Tree Combine

// T r e e c o m b i n e

for ( s i z e _ t j =0; j < d ;++ j ) {

s i z e _ t m = pow (2 , j ) ; if ( i & m )

{

f l a g [ i ]=t r u e; b r e a k; }

w h i l e (! f l a g [ i | m ]) ; n o r m [ i ] += n o r m [ i | m ];

} }

(28)

Threads Calculation of the Vector Norm with Tree Combine

Calculation of the Vector Norm with Tree Combine

int m a i n () {

c o n s t s i z e _ t l o g T h r e a d s = 1;

c o n s t s i z e _ t n u m T h r e a d s = pow (2 , l o g T h r e a d s ) ; c o n s t s i z e _ t n u m V a l u e s = 1 0 0 0 0 0 0 0 ;

std :: array < std :: thread , n u m T h r e a d s > t h r e a d s ; std :: vector <double> x ( n u m V a l u e s , 2 . 0 ) ; std :: vector <bool> f l a g ( n u m T h r e a d s ,f a l s e) ; std :: vector <double> n o r m ( n u m T h r e a d s , 0 . 0 ) ; // S t a r t t h r e a d s

for ( s i z e _ t i =0; i < t h r e a d s . s i z e () ; ++ i )

t h r e a d s [ i ] = std :: t h r e a d ( Norm , std :: c r e f ( x ) , std :: ref ( n o r m ) , std :: ref ( f l a g ) ,i , l o g T h r e a d s ) ; // R e j o i n t h r e a d s

for ( s i z e _ t i =0; i < t h r e a d s . s i z e () ; ++ i ) t h r e a d s [ i ]. j o i n () ;

std :: c o u t < < " N o r m is : " < < s q r t ( n o r m [ 0 ] ) < < std :: e n d l ; r e t u r n 0;

}

(29)

Threads Calculation of the Vector Norm with Tree Combine

Condition Synchronization

• Tree combine is a form of condition synchronization.

• A process is waiting until a condition (Boolean expression) is true. The condition is made true by another process.

• Here several processes wait until their flags become true.

• The flags are also called condition variables. Their proper initialization is important.

• We implement the synchronization with busy wait. That’s probably not a good idea for multi-threading.

• If condition variables are used repeatedly (e.g. when several sums must be calculated in succession), the following rules should be followed:

A process which is waiting on a condition variable also resets it again.

A condition variable may only be set to true again if it is sure that it has been

reset earlier.

(30)

Threads Thread Creation withasync

Thread Creation with async

The function

async

allows the easy creation of a thread that performs its

calculation without interaction with the main program or other threads. Its result can later be queried using the function

get

.

# include< i o s t r e a m >

# include< vector >

# include< numeric >

# include< thread >

# include< future >

t e m p l a t e<c l a s s T > s t r u c t A c c u m { // s i m p l e a c c u m u l a t o r f u n c t i o n o b j e c t

T * b ; T * e ;

T val ;

A c c u m ( T * bb , T * ee , T vv ) : b { bb } , e { ee } , val { vv } {}

T o p e r a t o r() () { r e t u r n std :: a c c u m u l a t e ( b , e , val ) ; } };

(31)

Threads Thread Creation withasync

Thread Creation with async

The function

async

allows the easy creation of a thread that performs its

calculation without interaction with the main program or other threads. Its result can later be queried using the function

get

.

d o u b l e comp ( s t d : : v e c t o r<d o u b l e>& v ) // spawn many t a s k s i f v i s l a r g e e n o u g h {

i f ( v . s i z e ( )<10000)

r e t u r n s t d : : a c c u m u l a t e ( v . b e g i n ( ) , v . end ( ) , 0 . 0 ) ; s t d : : f u t u r e<d o u b l e> f 0

{s t d : : a s y n c ( Accum<d o u b l e>{&v [ 0 ] , & v [ v . s i z e ( ) / 4 ] , 0 . 0})}; s t d : : f u t u r e<d o u b l e> f 1

{s t d : : a s y n c ( Accum<d o u b l e>{&v [ v . s i z e ( ) / 4 ] , & v [ v . s i z e ( ) / 2 ] , 0 . 0})}; s t d : : f u t u r e<d o u b l e> f 2

{s t d : : a s y n c ( Accum<d o u b l e>{&v [ v . s i z e ( ) / 2 ] , & v [ v . s i z e ( )∗3 / 4 ] , 0 . 0})};

s t d : : f u t u r e<d o u b l e> f 3

{s t d : : a s y n c ( Accum<d o u b l e>{&v [ v . s i z e ( )∗3 / 4 ] , & v [ v . s i z e ( ) ] , 0 . 0})}; // l o t s o f o t h e r c o d e c o u l d come h e r e . . .

r e t u r n f 0 . g e t ( )+f 1 . g e t ( )+f 2 . g e t ( )+f 3 . g e t ( ) ; }

i n t main ( ) {

s t d : : v e c t o r<d o u b l e> b l u b ( 1 0 0 0 0 0 , 1 . ) ;

(32)

Threads Atomics

Discrete Fourier Transform

• Let (z

i

, a

i

) be pairs of numbers where a

i

is the measured value of an unknown function at the location z

i

.

• Goal: calculate interpolation A(z ) = 1

N f

0

+ f

1

z + · · · + f

N−1

z

N−1

• Here we use the N-th roots of unity z

i

= e

2πiN k

as nodes. The coefficients of the interpolation polynomial are then

f

k

=

N−1

X

j=0

a

j

· e

−2πijkN

for k = 0, . . . , N − 1

(33)

Threads Atomics

Fast Fourier Transform

• For even values of N, N = 2n, the sum

f

k

=

2n−1

X

j=0

a

j

· e

−2πi2njk

for k = 0, . . . , 2n − 1

can be rearranged into

f

k

=

n−1

X

j=0

a

2j

· e

−2πi2jk2n

+

n−1

X

j=0

a

2j+1

· e

−2πik(2j+1)2n

(34)

Threads Atomics

Fast Fourier Transform

• Setting a

k0

= a

2k

, f

k0

= f

2k

, a

00k

= a

2k+1

and f

k00

= f

2k+1

, we have

f

k

=

n−1

X

j=0

a

0j

· e

2πin jk

+ e

πink

n−1

X

j=0

a

00j

e

2πin jk

= (

f

k0

+ e

πink

f

k00

if k < n f

k−n0

+ e

πin(k−n)

f

k00−n

if k ≥ n

• This means we can solve the problem by calculating two Fourier transforms of length N/2.

• This can be applied recursively. Since the Fourier transform of a value is the value itself, N/2 sums of two values must be calculated on the lowest level.

• Since at each stage N complex multiplications with a unit root and N

complex additions are necessary, the complexity can be reduced from O(N

2

)

to O(N · log(N)).

(35)

Threads Atomics

Recursive Algorithm in Pseudocode

p r o c e d u r e R _ F F T ( X , Y , N , w ) if ( n = = 1 ) t h e n Y [0] := X [ 0 ] ; e l s e b e g i n

R _ F F T ( < X [0] , X [0] ,... , X [ N -2] > , < Q [0] , Q [1] ,... , Q [ N -2] > , N /2 , w ^2) ; R _ F F T ( < X [1] , X [3] ,... , X [ N -1] > , < T [0] , T [1] ,... , T [ N -2] > , N /2 , w ^2) ;

for k := 0 to N -1 do

Y [ k ] := Q [ k mod ( N /2) ] + w ^ k T [ k mod ( N /2) ];

end R _ F F T

with w = e

2πiN

. But: recursion parallelizes poorly.

(36)

Threads Atomics

Iterative Algorithm in Pseudocode

p r o c e d u r e I T E R A T I V E _ F F T ( X , Y , N , w ) r := log N ;

for i := 0 to N -1 do R [ i ] := X [ i ];

for m := 0 to r -1 do

for i := 0 to N -1 do S [ i ] := R [ i ];

for i := 0 to N -1 do

/* Let ( b_0 , b_1 ,... , b_r -1) the b i n a r y r e p r e s e n t a t i o n of i */

j := ( b_0 ,... , b_m -1 ,0 , b_m +1 ,... , b_r -1) ; k := ( b_0 ,... , b_m -1 ,1 , b_m +1 ,... , b_r -1) ;

R [ i ] := S [ j ] + S [ k ] x w ^( b_m , b_m -1 ,... , b_0 ,0 ,... ,0) ; e n d f o r

e n d f o r

for i := 0 to N -1 do Y [ i ] := R [ i ];

end I T E R A T I V E _ F F T

with w = e

2πin

(37)

Threads Atomics

Sequential Implementation of FFT

std :: vector < std :: complex <double> > fft (c o n s t std :: vector < std :: complex <double> >& d a t a ) {

c o n s t s i z e _ t n u m L e v e l s =( s i z e _ t ) std :: l o g 2 ( d a t a . s i z e () ) ;

std :: vector < std :: complex <double> > r e s u l t ( d a t a . b e g i n () , d a t a . end () ) ; std :: vector < std :: complex <double> > r e s u l t N e w ( d a t a . s i z e () ) ;

std :: vector < std :: complex <double> > r o o t ( d a t a . s i z e () ) ; for ( s i z e _ t j =0; j < r o o t . s i z e () ;++ j )

r o o t [ j ]= u n i t r o o t ( j , r o o t . s i z e () ) ; for ( s i z e _ t i =0; i < n u m L e v e l s ;++ i ) {

s i z e _ t m a s k =1 < <( n u m L e v e l s -1 - i ) ; s i z e _ t i n v M a s k =~ m a s k ;

for ( s i z e _ t j =0; j < d a t a . s i z e () ;++ j ) {

s i z e _ t k = j & i n v M a s k ; s i z e _ t l = j | m a s k ;

r e s u l t N e w [ j ]= r e s u l t [ k ]+ r e s u l t [ l ]* r o o t [ R e v e r s a l ( j / mask , i +1) * m a s k ];

}

if ( i != n u m L e v e l s -1) r e s u l t . s w a p ( r e s u l t N e w ) ; }

r e t u r n r e s u l t N e w ; }

(38)

Threads Atomics

Sorting back Values by Bit Reversal

v o i d S o r t B i t r e v e r s a l ( std :: vector < std :: complex <double> >& r e s u l t ) {

std :: vector < std :: complex <double> > r e s u l t N e w ( r e s u l t . s i z e () ) ; c o n s t s i z e _ t n = std :: l o g 2 ( r e s u l t . s i z e () ) ;

for ( s i z e _ t j =0; j < r e s u l t . s i z e () ;++ j ) r e s u l t N e w [ R e v e r s a l ( j , n ) ]= r e s u l t [ j ];

r e s u l t . s w a p ( r e s u l t N e w ) ; }

v o i d F a s t S o r t B i t r e v e r s a l ( std :: vector < std :: complex <double> >& r e s u l t ) {

s i z e _ t n = r e s u l t . s i z e () ; c o n s t s i z e _ t t = std :: l o g 2 ( n ) ; s i z e _ t l = 1;

std :: vector < size_t > c ( n ) ; for ( s i z e _ t q =0; q < t ;++ q ) {

n = n /2;

for( s i z e _ t j =0; j < l ;++ j ) c [ l + j ]= c [ j ]+ n ; l =2* l ;

}

std :: vector < std :: complex <double> > r e s u l t N e w ( r e s u l t . s i z e () ) ; for ( s i z e _ t j =0; j < r e s u l t . s i z e () ;++ j )

r e s u l t N e w [ c [ j ]]= r e s u l t [ j ];

r e s u l t . s w a p ( r e s u l t N e w ) ;

(39)

Threads Atomics

Bit Reversal and Roots of Unity

s i z e _ t R e v e r s a l ( s i z e _ t k , s i z e _ t n ) {

s i z e _ t j =0;

s i z e _ t m a s k =1;

if ( k & m a s k ) ++ j ;

for ( s i z e _ t i =0; i <( n -1) ;++ i ) {

mask < <=1;

j < <=1;

if ( k & m a s k ) ++ j ; }

r e t u r n j ; }

i n l i n e std :: complex <double> u n i t r o o t ( s i z e _ t i , s i z e _ t N ) {

d o u b l e arg = -( i *2* M _ P I / N ) ;

r e t u r n std :: complex <double>( cos ( arg ) , sin ( arg ) ) ; }

(40)

Threads Atomics

Time Measurement

C++11 offers a builtin way to measure time. There are various clocks and data types to store times and timespans.

# include< chrono >

# include< i o s t r e a m >

# include< thread >

u s i n g n a m e s p a c e std :: c h r o n o ;

int m a i n () {

s t e a d y _ c l o c k :: t i m e _ p o i n t s t a r t = s t e a d y _ c l o c k :: now () ; std :: t h i s _ t h r e a d :: s l e e p _ f o r ( s e c o n d s { 2 } ) ;

a u t o now = s t e a d y _ c l o c k :: now () ;

n a n o s e c o n d s d u r a t i o n = now - s t a r t ; // we w a n t the r e s u l t in ns m i l l i s e c o n d s d u r a t i o n M s = d u r a t i o n _ c a s t < m i l l i s e c o n d s >( d u r a t i o n ) ; std :: c o u t < < " s o m e t h i n g t o o k " < < d u r a t i o n . c o u n t ()

< < " ns w h i c h is " < < d u r a t i o n M s . c o u n t () < < " ms \ n "; s e c o n d s sec = h o u r s {2} + m i n u t e s { 3 5 } + s e c o n d s { 9 } ;

std :: c o u t < < " 2 h 35 m 9 s is " < < sec . c o u n t () < < " s \ n ";

}

(41)

Threads Atomics

Main Program for Sequential FFT

int m a i n () {

c o n s t s i z e _ t n u m P o i n t s = pow (2 ,16) ;

std :: vector < std :: complex <double> > d a t a ( n u m P o i n t s ) ; s i z e _ t i =1;

for (a u t o& x : d a t a ) x . r e a l ( cos ( i ++) ) ;

a u t o t0 = std :: c h r o n o :: s t e a d y _ c l o c k :: now () ;

std :: vector < std :: complex <double> > r e s u l t 1 = fft ( d a t a ) ; F a s t S o r t B i t r e v e r s a l ( r e s u l t 1 ) ;

a u t o t1 = std :: c h r o n o :: s t e a d y _ c l o c k :: now () ; for (a u t o& x : r e s u l t 1 )

std :: c o u t < < std :: abs ( x ) < < " " < < x . r e a l () < < " " < < x . i m a g () < <

std :: e n d l ;

std :: c o u t < < std :: e n d l < < std :: e n d l ; std :: c o u t < < " # S e q u e n t i a l fft t o o k " < <

std :: c h r o n o :: d u r a t i o n _ c a s t < std :: c h r o n o :: m i l l i s e c o n d s >( t1 - t0 ) . c o u n t ()

< < " m i l l i s e c o n d s . " < < std :: e n d l ; }

(42)

Threads Atomics

Barrier

c o n s t s i z e _ t n u m T h r e a d s =1;// std :: t h r e a d :: h a r d w a r e _ c o n c u r r e n c y () ; s t r u c t B a r r i e r

{

std :: atomic < size_t > c o u n t _ ; std :: atomic < size_t > s t e p _ ; B a r r i e r () : c o u n t _ (0) , s t e p _ (0) {}

v o i d b l o c k ( s i z e _ t n u m T h r e a d s ) {

if ( n u m T h r e a d s <2) r e t u r n;

s i z e _ t s t e p = s t e p _ . l o a d () ;

if ( c o u n t _ . f e t c h _ a d d (1) == ( n u m T h r e a d s -1) ) {

c o u n t _ . s t o r e (0) ; s t e p _ . f e t c h _ a d d (1) ; r e t u r n;

} e l s e {

w h i l e ( s t e p _ . l o a d () == s t e p ) std :: t h i s _ t h r e a d :: y i e l d () ; r e t u r n;

}

(43)

Threads Atomics

Parallel FFT: Threads

B a r r i e r b a r r i e r ;

v o i d f f t t h r e a d (c o n s t s i z e _ t t h r e a d N u m , std :: vector < std :: complex <double> >&

root , std :: vector < std :: complex <double> >& result , std :: vector < std :: complex <double> >& r e s u l t N e w ) {

c o n s t s i z e _ t N = std :: l o g 2 ( r e s u l t . s i z e () ) ; s i z e _ t n = r e s u l t . s i z e () / n u m T h r e a d s ; c o n s t s i z e _ t o f f s e t = t h r e a d N u m * n +

std :: min ( t h r e a d N u m , r e s u l t . s i z e () % n u m T h r e a d s ) ; n += ( t h r e a d N u m <( r e s u l t . s i z e () % n u m T h r e a d s ) ? 1 : 0 ) ; for ( s i z e _ t i = o f f s e t ; i < o f f s e t + n ;++ i )

r o o t [ i ]= u n i t r o o t ( i , r o o t . s i z e () ) ; b a r r i e r . b l o c k ( n u m T h r e a d s ) ;

for ( s i z e _ t l e v e l =0; level < N ;++ l e v e l ) {

s i z e _ t m a s k =1 < <( N - level -1) ; s i z e _ t i n v M a s k =~ m a s k ;

(44)

Threads Atomics

Parallel FFT: Threads

for ( s i z e _ t i =0; i < n ;++ i ) {

s i z e _ t j = o f f s e t + i ; s i z e _ t k = j & i n v M a s k ; s i z e _ t l = j | m a s k ;

s i z e _ t e = R e v e r s a l ( j / mask , l e v e l +1) * m a s k ;

std :: c o u t < < " # j : " < < j < < " k : " < < k < < " l : " < < l < < " e :

" < < e < < std :: e n d l ;

r e s u l t N e w [ j ]= r e s u l t [ k ]+ r e s u l t [ l ]* r o o t [ R e v e r s a l ( j / mask , l e v e l +1) * m a s k ];

}

b a r r i e r . b l o c k ( n u m T h r e a d s ) ; if ( t h r e a d N u m = = 0 )

r e s u l t . s w a p ( r e s u l t N e w ) ; b a r r i e r . b l o c k ( n u m T h r e a d s ) ; }

for ( s i z e _ t j = o f f s e t ; j < o f f s e t + n ;++ j ) r e s u l t N e w [ R e v e r s a l ( j , N ) ]= r e s u l t [ j ];

}

(45)

Threads Atomics

Parallel FFT: Threads

std :: vector < std :: complex <double> > fft (c o n s t std :: vector < std :: complex <double> >& d a t a ) {

std :: vector < std :: complex <double> > r e s u l t ( d a t a . b e g i n () , d a t a . end () ) ; std :: vector < std :: complex <double> > r e s u l t N e w ( d a t a . s i z e () ) ;

std :: vector < std :: complex <double> > r o o t ( d a t a . s i z e () ) ; std :: vector < std :: thread > t ( n u m T h r e a d s ) ;

for ( s i z e _ t p = 0; p < n u m T h r e a d s ;++ p ) t [ p ]= std :: t h r e a d

{ f f t t h r e a d , p , std :: ref ( r o o t ) , std :: ref ( r e s u l t ) , std :: ref ( r e s u l t N e w ) };

for ( s i z e _ t p = 0; p < t . s i z e () ;++ p ) t [ p ]. j o i n () ;

r e t u r n r e s u l t N e w ; }

(46)

Threads Atomics

Main Program for Parallel FFT

int m a i n () {

c o n s t s i z e _ t n u m P o i n t s = pow (2 ,2) ;

std :: vector < std :: complex <double> > d a t a ( n u m P o i n t s ) ; s i z e _ t i =1;

for (a u t o& x : d a t a ) x . r e a l ( cos ( i ++) ) ;

a u t o t0 = std :: c h r o n o :: s t e a d y _ c l o c k :: now () ;

std :: vector < std :: complex <double> > r e s u l t 1 = fft ( d a t a ) ; a u t o t1 = std :: c h r o n o :: s t e a d y _ c l o c k :: now () ;

for (a u t o& x : r e s u l t 1 )

std :: c o u t < < std :: abs ( x ) < < " " < < x . r e a l () < < " " < < x . i m a g () < <

std :: e n d l ;

std :: c o u t < < std :: e n d l < < std :: e n d l ;

std :: c o u t < < " # P a r a l l e l fft w i t h " < < n u m T h r e a d s < < " t h r e a d s t o o k " < <

std :: c h r o n o :: d u r a t i o n _ c a s t < std :: c h r o n o :: m i l l i s e c o n d s >( t1 - t0 ) . c o u n t ()

< < " m i l l i s e c o n d s . " < < std :: e n d l ; }

(47)

Threads Further Reading

Further Reading

C++11 Multi-Threading Tutorial

http://solarianprogrammer.com/2011/012/16/cpp-11-thread-tutorial Overview over all C++11 thread classes and functions

http://en.cppreference.com/w/cpp/thread Overview over all C++11 atomic operations http://en.cppreference.com/w/cpp/atomic

Working draft of the C++11 standard (almost identical to the standard, but available for free)

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf

Referenzen

ÄHNLICHE DOKUMENTE

insert(p, t) Inserts the element t in front of the one the iterator p points to, and returns an iterator that points to the inserted element.. insert(p, i, j) As above, but for

replace_copy(b,e,out,v,v2) Create copy of all elements in range [b:e) re- placing elements which are equal to v with v2 , return iterator to end of copy.

• Traits can be used to specify types and values that depend on one or more template parameters. • Policies can be used to specify parts of algorithms as

Since the units are only used as a template parameter, this only affects the class type but does not require memory... Example: Numbers

and no exercise group, because things are still being set up, but you are welcome to attend if you have questions about the lecture or exercises or something else to discuss.. E

~List (); // clean up the list and all nodes Node* first() const; // return a pointer to the first entry Node* next( const Node* n) const; // return a pointer to the node after n

Please modify your implementation again to obtain a doubly linked list: each element should also point to its predecessor.. What is

To understand why the size of empty classes (according to standard) is as observed, consider the following class!. struct