Object-Oriented Programming for Scientific Computing
C++11 Multithreading
Ole Klein
Interdisciplinary Center for Scientific Computing Heidelberg University
ole.klein@iwr.uni-heidelberg.de
7. Juli 2015
Random numbers
C++11: Random Numbers
C++11 provides powerful random number generators that are able to work with different distributions, e.g. uniformly distributed random numbers:
# include< i o s t r e a m >
# include< iomanip >
# include< string >
# include< map >
# include< random >
# include< cmath >
int m a i n () {
// I m p l e m e n t a t i o n - d e p e n d e n t p r e d e f i n e d RNG std :: d e f a u l t _ r a n d o m _ e n g i n e e1 ;
// P r o d u c e u n i f o r m l y d i s t r i b u t e d i n t e g e r s b e t w e e n 1 and 6 std :: u n i f o r m _ i n t _ d i s t r i b u t i o n <int> u n i f o r m D i s t (1 ,6) ; // D r a w a n u m b e r
int m e a n = u n i f o r m D i s t ( e1 ) ;
std :: c o u t < < " R a n d o m l y g e n e r a t e d m e a n : " < < m e a n < < std :: e n d l ;
Random numbers
Normally Distributed Random Numbers
// G e n e r a t o r for non - d e t e r m i n i s t i c u n i f o r m l y d i s t r i b u t e d r a n d o m n u m b e r s
std :: r a n d o m _ d e v i c e rd ;
// A l t e r n a t i v e RNG b a s e d on M e r s e n n e T w i s t e r a l g o r i t h m std :: m t 1 9 9 3 7 e2 ( rd () ) ;
// C r e a t e n o r m a l d i s t r i b u t i o n a r o u n d m e a n std :: n o r m a l _ d i s t r i b u t i o n < > n o r m a l D i s t ( mean ,2) ; std :: map <int,int> h i s t ;
// C r e a t e r a n d o m n u m b e r s and c o u n t for (int n =0; n < 1 0 0 0 0 ; ++ n )
++ h i s t [ std :: r o u n d ( n o r m a l D i s t ( e2 ) ) ];
// P r i n t h i s t o g r a m to s c r e e n
std :: c o u t < < " N o r m a l d i s t r i b u t i o n a r o u n d m e a n " < < m e a n < < " : "
< < std :: e n d l ; for (a u t o p : h i s t )
std :: c o u t < < std :: f i x e d < < std :: s e t p r e c i s i o n (1) < < std :: s e t w (2)
< < p . f i r s t < < " " < < std :: s t r i n g ( p . s e c o n d /200 ,’ * ’)
< < std :: e n d l ; }
Random numbers
Own RNG for Uniformly Distributed Integers
c l a s s R a n d o m I n t {
p u b l i c:
R a n d o m I n t (int low , int high , u n s i g n e d int s e e d =0) :
r a n d _ { std :: b i n d ( std :: u n i f o r m _ i n t _ d i s t r i b u t i o n < >{ low , h i g h } , std :: d e f a u l t _ r a n d o m _ e n g i n e { s e e d }) }
{}
int o p e r a t o r() () c o n s t {
r e t u r n r a n d _ () ; }
p r i v a t e:
std :: f u n c t i o n <int() > r a n d _ ; };
Random numbers
Own RNG for Normally Distributed Doubles
This also works for other number types:
c l a s s R a n d o m D o u b l e {
p u b l i c:
R a n d o m D o u b l e (d o u b l e mean , d o u b l e stddev , u n s i g n e d int s e e d =0) : r a n d _ { std :: b i n d ( std :: n o r m a l _ d i s t r i b u t i o n < >{ mean , s t d d e v } ,
std :: m t 1 9 9 3 7 { s e e d }) } {}
d o u b l e o p e r a t o r() () c o n s t {
r e t u r n r a n d _ () ; }
p r i v a t e:
std :: f u n c t i o n <d o u b l e() > r a n d _ ; };
Random numbers
Application
int m a i n () {
// G e n e r a t o r for non - d e t e r m i n i s t i c u n i f o r m l y d i s t r i b u t e d r a n d o m n u m b e r s
std :: r a n d o m _ d e v i c e rd ; R a n d o m I n t r a n d I n t {1 ,6 , rd () };
int m e a n = r a n d I n t () ;
std :: c o u t < < " R a n d o m g e n e r a t e d m e a n : " < < m e a n < < std :: e n d l ; R a n d o m D o u b l e r a n d D o u b l e {(d o u b l e) mean ,2. , rd () };
std :: map <int,int> h i s t ;
// C r e a t e r a n d o m n u m b e r s and c o u n t for (int n =0; n < 1 0 0 0 0 ; ++ n )
++ h i s t [ std :: r o u n d ( r a n d D o u b l e () ) ];
// P r i n t h i s t o g r a m to s c r e e n
std :: c o u t < < " N o r m a l d i s t r i b u t i o n a r o u n d m e a n " < < m e a n < < " : "
< < std :: e n d l ; for (a u t o p : h i s t )
std :: c o u t < < std :: f i x e d < < std :: s e t p r e c i s i o n (1) < < std :: s e t w (2)
< < p . f i r s t < < " " < < std :: s t r i n g ( p . s e c o n d /200 ,’ * ’)
< < std :: e n d l ;
Random numbers
Application
// C r e a t e r a n d o m n u m b e r s and c o u n t for (int n =0; n < 1 0 0 0 0 ; ++ n )
++ h i s t [ std :: r o u n d ( r a n d D o u b l e () ) ];
// P r i n t h i s t o g r a m to s c r e e n
std :: c o u t < < " N o r m a l d i s t r i b u t i o n a r o u n d m e a n " < < m e a n < < " : "
< < std :: e n d l ; for (a u t o p : h i s t )
std :: c o u t < < std :: f i x e d < < std :: s e t p r e c i s i o n (1) < < std :: s e t w (2)
< < p . f i r s t < < " " < < std :: s t r i n g ( p . s e c o n d /200 ,’ * ’)
< < std :: e n d l ; }
Threads Introduction
Processes and threads in Unix
• The following information is associated with a process in UNIX:
•
IDs (process, user, group)
•
environment variables
•
directory
•
code
•
registers, stack, heap
•
file descriptors, signals
•
message queues, pipes, shared-memory segments
•
shared libraries
• Each process has its own address space.
• Threads (small-scale processes) exist within a process and share the address space of the main process.
• A thread consists of:
•
ID
•
stack pointer
•
registers
•
scheduling properties
•
signals
• Switching between threads is faster
than switching between processes.
Threads Introduction
C++11 Threads
• POSIX threads (or pthreads) were introduced in 1995 and have been the standard model for threads on UNIX computers for a long time. They are created and managed via a C interface.
• C++11 defines its own threading concept, which basically is a wrapper for POSIX threads. In many cases this makes it easier to write multi-threading programs. The following concepts are supported:
Threads Thread classes and functions for the rejoining of threads are defined in the header
threads. This header also contains functions that provide a thread ID and the functionality to detach threads that afterwards run as a separate program (fork).
Mutual Exclusion The header
mutexprovides classes for mutexes and locks
(also recursive variants and ones with timeouts), and
additionally functions that check if a lock is available.
Threads Introduction
C++11 Threads
Condition Variables The header
condition_variabledefines a class that allows the coordination of multiple threads with condition variables.
Futures The header
futuresdefines classes and functions whose operations need not be executed immediately, but can be performed
asynchronously with the main part of the program. If possible, this is done in parallel, and if not, sequentially at the location where the result is needed.
Atomic Operations The classes from the header
atomicallow certain operations
with integers, Boolean variables and pointers to be performed as an
atomic operation.
Threads C++11 Thread Production
C++11 Thread Production
# i n c l u d e < array >
# i n c l u d e < i o s t r e a m >
# i n c l u d e < thread >
v o i d H e l l o ( s i z e _ t n u m b e r ) {
std :: c o u t < < " H e l l o f r o m t h r e a d " < < n u m b e r < < std :: e n d l ; }
int m a i n () {
c o n s t int n u m T h r e a d s = 5;
std :: array < std :: thread , n u m T h r e a d s > t h r e a d s ; // S t a r t t h r e a d s
for ( s i z e _ t i = 0; i < t h r e a d s . s i z e () ; ++ i ) t h r e a d s [ i ] = std :: t h r e a d ( Hello , i ) ; std :: c o u t < < " H e l l o f r o m m a i n \ n ";
// R e j o i n threads , i m p l i c i t b a r r i e r
for ( s i z e _ t i = 0; i < t h r e a d s . s i z e () ; ++ i ) t h r e a d s [ i ]. j o i n () ;
r e t u r n 0;
}
Threads C++11 Thread Production
Output
H e l l o f r o m t h r e a d H e l l o f r o m t h r e a d H e l l o f r o m t h r e a d H e l l o f r o m t h r e a d H e l l o f r o m m a i n
0 H e l l o f r o m t h r e a d 213 4
Threads Example: Calculation of the Vector Norm
Calculation of the Vector Norm
# include< array >
# include< vector >
# include< i o s t r e a m >
# include< thread >
v o i d N o r m (c o n s t std :: vector <double>& x , d o u b l e& norm , c o n s t s i z e _ t i , c o n s t s i z e _ t p )
{
s i z e _ t n u m E l e m = x . s i z e () / p ;
s i z e _ t f i r s t = n u m E l e m * i + std :: min ( i , x . s i z e () % p ) ; n u m E l e m += ( i <( x . s i z e () % p ) ? 1 : 0 ) ;
d o u b l e l o c a l N o r m = 0 . 0 ;
for ( s i z e _ t j =0; j < n u m E l e m ; ++ j )
l o c a l N o r m += x [ f i r s t + j ] * x [ f i r s t + j ];
n o r m += l o c a l N o r m ; // d a n g e r o u s ...
}
Threads Example: Calculation of the Vector Norm
Calculation of the Vector Norm
int m a i n () {
c o n s t s i z e _ t n u m T h r e a d s = 5;
c o n s t s i z e _ t n u m V a l u e s = 1 0 0 0 0 0 0 ;
std :: array < std :: thread , n u m T h r e a d s > t h r e a d s ; std :: vector <double> x ( n u m V a l u e s , 2 . 0 ) ; d o u b l e n o r m = 0 . 0 ;
// C r e a t e t h r e a d s
for ( s i z e _ t i =0; i < t h r e a d s . s i z e () ; ++ i )
t h r e a d s [ i ] = std :: t h r e a d ( Norm , std :: c r e f ( x ) , std :: ref ( n o r m ) ,i , n u m T h r e a d s ) ; // R e j o i n t h r e a d s w i t h m a i n thread , b a r r i e r
for ( s i z e _ t i =0; i < t h r e a d s . s i z e () ; ++ i ) t h r e a d s [ i ]. j o i n () ;
std :: c o u t < < " N o r m is : " < < n o r m < < std :: e n d l ; r e t u r n 0;
}
Threads Example: Calculation of the Vector Norm
Critical Sections
• A statement of the type s=s+t is not atomic, which means it will not be carried out in one step:
process Π
1: process Π
2:
1 load s in R1 3 load s in R1
load t in R2 load t in R2
add R1, R2, store in R3 add R1, R2, store in R3 2 store R3 in s 4 store R3 in s
• The order of execution of these instructions relative to each other is not fixed.
• This results in an exponentially growing sequence of possible execution orders.
Threads Example: Calculation of the Vector Norm
Possible Execution Orders
@
@
@ @ 1
3
@
@ @
@
@ @ 2
3 1
4
@
@ @ 3 2 4 2 4 1
4 4 2 4 2 2
Result of the calculation s = t
Π1+ t
Π2s = t
Π2s = t
Π1s = t
Π2s = t
Π1s = t
Π1+ t
Π2Only some of the sequences lead to the right result!
Threads Mutual Exclusion/Locks
Mutual Exclusion
• An additional synchronization is necessary to prevent execution orders that do not provide the correct result.
• Critical sections must be executed under mutual exclusion.
• Mutual exclusion requires:
•
At most one process enters the critical section at a time.
•
There are no deadlocks (i.e. two processes that are waiting on each other).
•
No process waits at a free critical section.
•
When a process enters a critical section, it will leave it successfully in the end.
Threads Mutual Exclusion/Locks
Mutual Exclusion/Locks
# include< i o s t r e a m >
# include< vector >
# include< thread >
# include< mutex >
std :: m u t e x m ; v o i d e (int& sh ) {
m . l o c k () ;
sh += 1; // m o d i f y s h a r e d d a t a m . u n l o c k () ; // v e r y i m p o r t a n t }
• C++11 contains the construct
mutexfor this.
• A
mutexcan lock a section, so that it can only be accessed by one process.
• All other processes have to wait in front of the critical section. It should therefore be as small as possible.
• The same mutex must be available in all of the threads.
Threads Mutual Exclusion/Locks
Lock Guard
v o i d f (int& sh ) {
std :: l o c k _ g u a r d < std :: mutex > l o c k e d { m };
sh += 1; // m o d i f y s h a r e d d a t a }
• There are helper classes for releasing a
lockagain (even if an exception is thrown).
• The simplest of these is a
lock_guard.
• At initialization it gets a
mutexand locks it, and it releases the lock when the destructor of
lock_guardis called.
• A
lock_guardshould therefore be defined either at the end of a function or in
a separate block to call the destructor as early as possible.
Threads Mutual Exclusion/Locks
Unique Lock
A
unique_lockhas more functionality than a
lock_guard. It has functions that lock or release a
mutexand can test whether a critical section can be entered, so that otherwise something else can be done. A
unique_lockcan also take over an already locked
mutexor defer locking until it is needed.
v o i d g (int& sh ) {
std :: u n i q u e _ l o c k < std :: mutex > l o c k e d { m , std :: d e f e r _ l o c k }; // a d m i n i s t r a t e mutex , but don ’ t b l o c k it
b o o l s u c c e s s f u l = f a l s e; w h i l e (! s u c c e s s f u l ) {
if ( l o c k e d . t r y _ l o c k () ) // l o c k m u t e x if a b l e {
sh += 1; // m o d i f y s h a r e d d a t a s u c c e s s f u l = t r u e;
l o c k e d . u n l o c k () ;
// a c t u a l l y not n e c e s s a r y in t h i s e x a m p l e }
e l s e {
// do s o m e t h i n g d i f f e r e n t ...
} }
Threads Mutual Exclusion/Locks
Main Program
int m a i n () {
c o n s t s i z e _ t n u m T h r e a d s = std :: t h r e a d :: h a r d w a r e _ c o n c u r r e n c y () ; std :: vector < std :: thread > t h r e a d s { n u m T h r e a d s };
int r e s u l t = 0;
// s t a r t t h r e a d s
for ( s i z e _ t i =0; i < t h r e a d s . s i z e () ; ++ i )
t h r e a d s [ i ] = std :: t h r e a d { e , std :: ref ( r e s u l t ) };
// R e j o i n threads , i m p l i c i t b a r r i e r for ( s i z e _ t i =0; i < t h r e a d s . s i z e () ; ++ i )
t h r e a d s [ i ]. j o i n () ;
std :: c o u t < < " Y o u r h a r d w a r e s u p p o r t s " < < r e s u l t < < " t h r e a d s " < <
std :: e n d l ; r e t u r n 0;
}
Threads Calculation of the Vector Norm with a Mutex
Calculation of the Vector Norm with a Mutex
# include< array >
# include< vector >
# include< i o s t r e a m >
# include< thread >
# include< mutex >
# include< cmath >
s t a t i c std :: m u t e x n L o c k ;
v o i d N o r m (c o n s t std :: vector <double>& x , d o u b l e& norm , c o n s t s i z e _ t i , c o n s t s i z e _ t p )
{
s i z e _ t n u m E l e m = x . s i z e () / p ;
s i z e _ t f i r s t = n u m E l e m * i + std :: min ( i , x . s i z e () % p ) ; n u m E l e m += ( i <( x . s i z e () % p ) ? 1 : 0 ) ;
d o u b l e l o c a l N o r m = 0 . 0 ; for ( s i z e _ t j =0; j < n u m E l e m ;++ j )
l o c a l N o r m += x [ f i r s t + j ]* x [ f i r s t + j ];
std :: l o c k _ g u a r d < std :: mutex > b l o c k _ t h r e a d s _ u n t i l _ f i n i s h _ t h i s _ j o b ( n L o c k ) ; n o r m += l o c a l N o r m ;
}
Threads Calculation of the Vector Norm with a Mutex
Calculation of the Vector Norm with a Mutex
int m a i n () {
c o n s t s i z e _ t n u m T h r e a d s = 5;
c o n s t s i z e _ t n u m V a l u e s = 1 0 0 0 0 0 0 0 ;
std :: array < std :: thread , n u m T h r e a d s > t h r e a d s ; std :: vector <double> x ( n u m V a l u e s , 2 . 0 ) ; d o u b l e n o r m = 0 . 0 ;
// S t a r t t h r e a d s
for ( s i z e _ t i = 0; i < t h r e a d s . s i z e () ; ++ i ) t h r e a d s [ i ] =
std :: t h r e a d ( Norm , std :: c r e f ( x ) , std :: ref ( n o r m ) ,i , n u m T h r e a d s ) ; // R e j o i n t h r e a d s
for ( s i z e _ t i = 0; i < t h r e a d s . s i z e () ; ++ i ) t h r e a d s [ i ]. j o i n () ;
std :: c o u t < < " N o r m is : " < < s q r t ( n o r m ) < < std :: e n d l ; r e t u r n 0;
}
Threads Calculation of the Vector Norm with Tree Combine
Parallelization of the Sum
• The calculation of the global sum of the norm is not parallel when using a
mutex
for the calculation of s = s + t .
• It can be parallelized as follows (P = 8):
s = s
0+ s
1| {z }
s01
+ s
2+ s
3| {z }
s23
| {z }
s0123
+ s
4+ s
5| {z }
s45
+ s
6+ s
7| {z }
s67
| {z }
s4567
| {z }
s
• This reduces the complexity from O(P) to O(log
2P).
Threads Calculation of the Vector Norm with Tree Combine
Tree Combine
If we use a binary representation of the process ID, the communication structure is a binary tree:
s
0000
s
1@ 001
@ I
s
2010
s
3@ 011
@ I
s
4100
s
5@ 101
@ I
s
6110
s
7@ 111
@ I s
0+ s
1000 *
s
2+ s
3H 010 H H H Y
s
4+ s
5100 *
s
6+ s
7H 110 H H H Y s
0+ s
1+ s
2+ s
3000 :
s
4+ s
5+ s
6+ s
7X 100
X X
X X
X X
y
P s
i000
Threads Calculation of the Vector Norm with Tree Combine
Calculation of the Vector Norm with Tree Combine
# i n c l u d e < array >
# i n c l u d e < vector >
# i n c l u d e < i o s t r e a m >
# i n c l u d e < thread >
# i n c l u d e < cmath >
v o i d N o r m (c o n s t std :: vector <double>& x , std :: vector <double>& norm , std :: vector <bool>& flag ,c o n s t s i z e _ t i , c o n s t s i z e _ t d ) {
s i z e _ t p = pow (2 , d ) ;
s i z e _ t n u m E l e m = x . s i z e () / p ;
s i z e _ t f i r s t = n u m E l e m * i + std :: min ( i , x . s i z e () % p ) ; n u m E l e m += ( i <( x . s i z e () % p ) ? 1 : 0 ) ;
for ( s i z e _ t j =0; j < n u m E l e m ;++ j ) n o r m [ i ]+= x [ f i r s t + j ]* x [ f i r s t + j ];
Threads Calculation of the Vector Norm with Tree Combine
Calculation of the Vector Norm with Tree Combine
// T r e e c o m b i n e
for ( s i z e _ t j =0; j < d ;++ j ) {
s i z e _ t m = pow (2 , j ) ; if ( i & m )
{
f l a g [ i ]=t r u e; b r e a k; }
w h i l e (! f l a g [ i | m ]) ; n o r m [ i ] += n o r m [ i | m ];
} }
Threads Calculation of the Vector Norm with Tree Combine
Calculation of the Vector Norm with Tree Combine
int m a i n () {
c o n s t s i z e _ t l o g T h r e a d s = 1;
c o n s t s i z e _ t n u m T h r e a d s = pow (2 , l o g T h r e a d s ) ; c o n s t s i z e _ t n u m V a l u e s = 1 0 0 0 0 0 0 0 ;
std :: array < std :: thread , n u m T h r e a d s > t h r e a d s ; std :: vector <double> x ( n u m V a l u e s , 2 . 0 ) ; std :: vector <bool> f l a g ( n u m T h r e a d s ,f a l s e) ; std :: vector <double> n o r m ( n u m T h r e a d s , 0 . 0 ) ; // S t a r t t h r e a d s
for ( s i z e _ t i =0; i < t h r e a d s . s i z e () ; ++ i )
t h r e a d s [ i ] = std :: t h r e a d ( Norm , std :: c r e f ( x ) , std :: ref ( n o r m ) , std :: ref ( f l a g ) ,i , l o g T h r e a d s ) ; // R e j o i n t h r e a d s
for ( s i z e _ t i =0; i < t h r e a d s . s i z e () ; ++ i ) t h r e a d s [ i ]. j o i n () ;
std :: c o u t < < " N o r m is : " < < s q r t ( n o r m [ 0 ] ) < < std :: e n d l ; r e t u r n 0;
}
Threads Calculation of the Vector Norm with Tree Combine
Condition Synchronization
• Tree combine is a form of condition synchronization.
• A process is waiting until a condition (Boolean expression) is true. The condition is made true by another process.
• Here several processes wait until their flags become true.
• The flags are also called condition variables. Their proper initialization is important.
• We implement the synchronization with busy wait. That’s probably not a good idea for multi-threading.
• If condition variables are used repeatedly (e.g. when several sums must be calculated in succession), the following rules should be followed:
•
A process which is waiting on a condition variable also resets it again.
•
A condition variable may only be set to true again if it is sure that it has been
reset earlier.
Threads Thread Creation withasync
Thread Creation with async
The function
asyncallows the easy creation of a thread that performs its
calculation without interaction with the main program or other threads. Its result can later be queried using the function
get.
# include< i o s t r e a m >
# include< vector >
# include< numeric >
# include< thread >
# include< future >
t e m p l a t e<c l a s s T > s t r u c t A c c u m { // s i m p l e a c c u m u l a t o r f u n c t i o n o b j e c t
T * b ; T * e ;
T val ;
A c c u m ( T * bb , T * ee , T vv ) : b { bb } , e { ee } , val { vv } {}
T o p e r a t o r() () { r e t u r n std :: a c c u m u l a t e ( b , e , val ) ; } };
Threads Thread Creation withasync
Thread Creation with async
The function
asyncallows the easy creation of a thread that performs its
calculation without interaction with the main program or other threads. Its result can later be queried using the function
get.
d o u b l e comp ( s t d : : v e c t o r<d o u b l e>& v ) // spawn many t a s k s i f v i s l a r g e e n o u g h {
i f ( v . s i z e ( )<10000)
r e t u r n s t d : : a c c u m u l a t e ( v . b e g i n ( ) , v . end ( ) , 0 . 0 ) ; s t d : : f u t u r e<d o u b l e> f 0
{s t d : : a s y n c ( Accum<d o u b l e>{&v [ 0 ] , & v [ v . s i z e ( ) / 4 ] , 0 . 0})}; s t d : : f u t u r e<d o u b l e> f 1
{s t d : : a s y n c ( Accum<d o u b l e>{&v [ v . s i z e ( ) / 4 ] , & v [ v . s i z e ( ) / 2 ] , 0 . 0})}; s t d : : f u t u r e<d o u b l e> f 2
{s t d : : a s y n c ( Accum<d o u b l e>{&v [ v . s i z e ( ) / 2 ] , & v [ v . s i z e ( )∗3 / 4 ] , 0 . 0})};
s t d : : f u t u r e<d o u b l e> f 3
{s t d : : a s y n c ( Accum<d o u b l e>{&v [ v . s i z e ( )∗3 / 4 ] , & v [ v . s i z e ( ) ] , 0 . 0})}; // l o t s o f o t h e r c o d e c o u l d come h e r e . . .
r e t u r n f 0 . g e t ( )+f 1 . g e t ( )+f 2 . g e t ( )+f 3 . g e t ( ) ; }
i n t main ( ) {
s t d : : v e c t o r<d o u b l e> b l u b ( 1 0 0 0 0 0 , 1 . ) ;
Threads Atomics
Discrete Fourier Transform
• Let (z
i, a
i) be pairs of numbers where a
iis the measured value of an unknown function at the location z
i.
• Goal: calculate interpolation A(z ) = 1
N f
0+ f
1z + · · · + f
N−1z
N−1• Here we use the N-th roots of unity z
i= e
2πiN kas nodes. The coefficients of the interpolation polynomial are then
f
k=
N−1
X
j=0
a
j· e
−2πijkNfor k = 0, . . . , N − 1
Threads Atomics
Fast Fourier Transform
• For even values of N, N = 2n, the sum
f
k=
2n−1
X
j=0
a
j· e
−2πi2njkfor k = 0, . . . , 2n − 1
can be rearranged into
f
k=
n−1
X
j=0
a
2j· e
−2πi2jk2n+
n−1
X
j=0
a
2j+1· e
−2πik(2j+1)2nThreads Atomics
Fast Fourier Transform
• Setting a
k0= a
2k, f
k0= f
2k, a
00k= a
2k+1and f
k00= f
2k+1, we have
f
k=
n−1
X
j=0
a
0j· e
−2πin jk+ e
−πinkn−1
X
j=0
a
00je
−2πin jk= (
f
k0+ e
−πinkf
k00if k < n f
k−n0+ e
−πin(k−n)f
k00−nif k ≥ n
• This means we can solve the problem by calculating two Fourier transforms of length N/2.
• This can be applied recursively. Since the Fourier transform of a value is the value itself, N/2 sums of two values must be calculated on the lowest level.
• Since at each stage N complex multiplications with a unit root and N
complex additions are necessary, the complexity can be reduced from O(N
2)
to O(N · log(N)).
Threads Atomics
Recursive Algorithm in Pseudocode
p r o c e d u r e R _ F F T ( X , Y , N , w ) if ( n = = 1 ) t h e n Y [0] := X [ 0 ] ; e l s e b e g i n
R _ F F T ( < X [0] , X [0] ,... , X [ N -2] > , < Q [0] , Q [1] ,... , Q [ N -2] > , N /2 , w ^2) ; R _ F F T ( < X [1] , X [3] ,... , X [ N -1] > , < T [0] , T [1] ,... , T [ N -2] > , N /2 , w ^2) ;
for k := 0 to N -1 do
Y [ k ] := Q [ k mod ( N /2) ] + w ^ k T [ k mod ( N /2) ];
end R _ F F T
with w = e
−2πiN. But: recursion parallelizes poorly.
Threads Atomics
Iterative Algorithm in Pseudocode
p r o c e d u r e I T E R A T I V E _ F F T ( X , Y , N , w ) r := log N ;
for i := 0 to N -1 do R [ i ] := X [ i ];
for m := 0 to r -1 do
for i := 0 to N -1 do S [ i ] := R [ i ];
for i := 0 to N -1 do
/* Let ( b_0 , b_1 ,... , b_r -1) the b i n a r y r e p r e s e n t a t i o n of i */
j := ( b_0 ,... , b_m -1 ,0 , b_m +1 ,... , b_r -1) ; k := ( b_0 ,... , b_m -1 ,1 , b_m +1 ,... , b_r -1) ;
R [ i ] := S [ j ] + S [ k ] x w ^( b_m , b_m -1 ,... , b_0 ,0 ,... ,0) ; e n d f o r
e n d f o r
for i := 0 to N -1 do Y [ i ] := R [ i ];
end I T E R A T I V E _ F F T
with w = e
−2πinThreads Atomics
Sequential Implementation of FFT
std :: vector < std :: complex <double> > fft (c o n s t std :: vector < std :: complex <double> >& d a t a ) {
c o n s t s i z e _ t n u m L e v e l s =( s i z e _ t ) std :: l o g 2 ( d a t a . s i z e () ) ;
std :: vector < std :: complex <double> > r e s u l t ( d a t a . b e g i n () , d a t a . end () ) ; std :: vector < std :: complex <double> > r e s u l t N e w ( d a t a . s i z e () ) ;
std :: vector < std :: complex <double> > r o o t ( d a t a . s i z e () ) ; for ( s i z e _ t j =0; j < r o o t . s i z e () ;++ j )
r o o t [ j ]= u n i t r o o t ( j , r o o t . s i z e () ) ; for ( s i z e _ t i =0; i < n u m L e v e l s ;++ i ) {
s i z e _ t m a s k =1 < <( n u m L e v e l s -1 - i ) ; s i z e _ t i n v M a s k =~ m a s k ;
for ( s i z e _ t j =0; j < d a t a . s i z e () ;++ j ) {
s i z e _ t k = j & i n v M a s k ; s i z e _ t l = j | m a s k ;
r e s u l t N e w [ j ]= r e s u l t [ k ]+ r e s u l t [ l ]* r o o t [ R e v e r s a l ( j / mask , i +1) * m a s k ];
}
if ( i != n u m L e v e l s -1) r e s u l t . s w a p ( r e s u l t N e w ) ; }
r e t u r n r e s u l t N e w ; }
Threads Atomics
Sorting back Values by Bit Reversal
v o i d S o r t B i t r e v e r s a l ( std :: vector < std :: complex <double> >& r e s u l t ) {
std :: vector < std :: complex <double> > r e s u l t N e w ( r e s u l t . s i z e () ) ; c o n s t s i z e _ t n = std :: l o g 2 ( r e s u l t . s i z e () ) ;
for ( s i z e _ t j =0; j < r e s u l t . s i z e () ;++ j ) r e s u l t N e w [ R e v e r s a l ( j , n ) ]= r e s u l t [ j ];
r e s u l t . s w a p ( r e s u l t N e w ) ; }
v o i d F a s t S o r t B i t r e v e r s a l ( std :: vector < std :: complex <double> >& r e s u l t ) {
s i z e _ t n = r e s u l t . s i z e () ; c o n s t s i z e _ t t = std :: l o g 2 ( n ) ; s i z e _ t l = 1;
std :: vector < size_t > c ( n ) ; for ( s i z e _ t q =0; q < t ;++ q ) {
n = n /2;
for( s i z e _ t j =0; j < l ;++ j ) c [ l + j ]= c [ j ]+ n ; l =2* l ;
}
std :: vector < std :: complex <double> > r e s u l t N e w ( r e s u l t . s i z e () ) ; for ( s i z e _ t j =0; j < r e s u l t . s i z e () ;++ j )
r e s u l t N e w [ c [ j ]]= r e s u l t [ j ];
r e s u l t . s w a p ( r e s u l t N e w ) ;
Threads Atomics
Bit Reversal and Roots of Unity
s i z e _ t R e v e r s a l ( s i z e _ t k , s i z e _ t n ) {
s i z e _ t j =0;
s i z e _ t m a s k =1;
if ( k & m a s k ) ++ j ;
for ( s i z e _ t i =0; i <( n -1) ;++ i ) {
mask < <=1;
j < <=1;
if ( k & m a s k ) ++ j ; }
r e t u r n j ; }
i n l i n e std :: complex <double> u n i t r o o t ( s i z e _ t i , s i z e _ t N ) {
d o u b l e arg = -( i *2* M _ P I / N ) ;
r e t u r n std :: complex <double>( cos ( arg ) , sin ( arg ) ) ; }
Threads Atomics
Time Measurement
C++11 offers a builtin way to measure time. There are various clocks and data types to store times and timespans.
# include< chrono >
# include< i o s t r e a m >
# include< thread >
u s i n g n a m e s p a c e std :: c h r o n o ;
int m a i n () {
s t e a d y _ c l o c k :: t i m e _ p o i n t s t a r t = s t e a d y _ c l o c k :: now () ; std :: t h i s _ t h r e a d :: s l e e p _ f o r ( s e c o n d s { 2 } ) ;
a u t o now = s t e a d y _ c l o c k :: now () ;
n a n o s e c o n d s d u r a t i o n = now - s t a r t ; // we w a n t the r e s u l t in ns m i l l i s e c o n d s d u r a t i o n M s = d u r a t i o n _ c a s t < m i l l i s e c o n d s >( d u r a t i o n ) ; std :: c o u t < < " s o m e t h i n g t o o k " < < d u r a t i o n . c o u n t ()
< < " ns w h i c h is " < < d u r a t i o n M s . c o u n t () < < " ms \ n "; s e c o n d s sec = h o u r s {2} + m i n u t e s { 3 5 } + s e c o n d s { 9 } ;
std :: c o u t < < " 2 h 35 m 9 s is " < < sec . c o u n t () < < " s \ n ";
}
Threads Atomics
Main Program for Sequential FFT
int m a i n () {
c o n s t s i z e _ t n u m P o i n t s = pow (2 ,16) ;
std :: vector < std :: complex <double> > d a t a ( n u m P o i n t s ) ; s i z e _ t i =1;
for (a u t o& x : d a t a ) x . r e a l ( cos ( i ++) ) ;
a u t o t0 = std :: c h r o n o :: s t e a d y _ c l o c k :: now () ;
std :: vector < std :: complex <double> > r e s u l t 1 = fft ( d a t a ) ; F a s t S o r t B i t r e v e r s a l ( r e s u l t 1 ) ;
a u t o t1 = std :: c h r o n o :: s t e a d y _ c l o c k :: now () ; for (a u t o& x : r e s u l t 1 )
std :: c o u t < < std :: abs ( x ) < < " " < < x . r e a l () < < " " < < x . i m a g () < <
std :: e n d l ;
std :: c o u t < < std :: e n d l < < std :: e n d l ; std :: c o u t < < " # S e q u e n t i a l fft t o o k " < <
std :: c h r o n o :: d u r a t i o n _ c a s t < std :: c h r o n o :: m i l l i s e c o n d s >( t1 - t0 ) . c o u n t ()
< < " m i l l i s e c o n d s . " < < std :: e n d l ; }
Threads Atomics
Barrier
c o n s t s i z e _ t n u m T h r e a d s =1;// std :: t h r e a d :: h a r d w a r e _ c o n c u r r e n c y () ; s t r u c t B a r r i e r
{
std :: atomic < size_t > c o u n t _ ; std :: atomic < size_t > s t e p _ ; B a r r i e r () : c o u n t _ (0) , s t e p _ (0) {}
v o i d b l o c k ( s i z e _ t n u m T h r e a d s ) {
if ( n u m T h r e a d s <2) r e t u r n;
s i z e _ t s t e p = s t e p _ . l o a d () ;
if ( c o u n t _ . f e t c h _ a d d (1) == ( n u m T h r e a d s -1) ) {
c o u n t _ . s t o r e (0) ; s t e p _ . f e t c h _ a d d (1) ; r e t u r n;
} e l s e {
w h i l e ( s t e p _ . l o a d () == s t e p ) std :: t h i s _ t h r e a d :: y i e l d () ; r e t u r n;
}
Threads Atomics
Parallel FFT: Threads
B a r r i e r b a r r i e r ;
v o i d f f t t h r e a d (c o n s t s i z e _ t t h r e a d N u m , std :: vector < std :: complex <double> >&
root , std :: vector < std :: complex <double> >& result , std :: vector < std :: complex <double> >& r e s u l t N e w ) {
c o n s t s i z e _ t N = std :: l o g 2 ( r e s u l t . s i z e () ) ; s i z e _ t n = r e s u l t . s i z e () / n u m T h r e a d s ; c o n s t s i z e _ t o f f s e t = t h r e a d N u m * n +
std :: min ( t h r e a d N u m , r e s u l t . s i z e () % n u m T h r e a d s ) ; n += ( t h r e a d N u m <( r e s u l t . s i z e () % n u m T h r e a d s ) ? 1 : 0 ) ; for ( s i z e _ t i = o f f s e t ; i < o f f s e t + n ;++ i )
r o o t [ i ]= u n i t r o o t ( i , r o o t . s i z e () ) ; b a r r i e r . b l o c k ( n u m T h r e a d s ) ;
for ( s i z e _ t l e v e l =0; level < N ;++ l e v e l ) {
s i z e _ t m a s k =1 < <( N - level -1) ; s i z e _ t i n v M a s k =~ m a s k ;
Threads Atomics
Parallel FFT: Threads
for ( s i z e _ t i =0; i < n ;++ i ) {
s i z e _ t j = o f f s e t + i ; s i z e _ t k = j & i n v M a s k ; s i z e _ t l = j | m a s k ;
s i z e _ t e = R e v e r s a l ( j / mask , l e v e l +1) * m a s k ;
std :: c o u t < < " # j : " < < j < < " k : " < < k < < " l : " < < l < < " e :
" < < e < < std :: e n d l ;
r e s u l t N e w [ j ]= r e s u l t [ k ]+ r e s u l t [ l ]* r o o t [ R e v e r s a l ( j / mask , l e v e l +1) * m a s k ];
}
b a r r i e r . b l o c k ( n u m T h r e a d s ) ; if ( t h r e a d N u m = = 0 )
r e s u l t . s w a p ( r e s u l t N e w ) ; b a r r i e r . b l o c k ( n u m T h r e a d s ) ; }
for ( s i z e _ t j = o f f s e t ; j < o f f s e t + n ;++ j ) r e s u l t N e w [ R e v e r s a l ( j , N ) ]= r e s u l t [ j ];
}
Threads Atomics
Parallel FFT: Threads
std :: vector < std :: complex <double> > fft (c o n s t std :: vector < std :: complex <double> >& d a t a ) {
std :: vector < std :: complex <double> > r e s u l t ( d a t a . b e g i n () , d a t a . end () ) ; std :: vector < std :: complex <double> > r e s u l t N e w ( d a t a . s i z e () ) ;
std :: vector < std :: complex <double> > r o o t ( d a t a . s i z e () ) ; std :: vector < std :: thread > t ( n u m T h r e a d s ) ;
for ( s i z e _ t p = 0; p < n u m T h r e a d s ;++ p ) t [ p ]= std :: t h r e a d
{ f f t t h r e a d , p , std :: ref ( r o o t ) , std :: ref ( r e s u l t ) , std :: ref ( r e s u l t N e w ) };
for ( s i z e _ t p = 0; p < t . s i z e () ;++ p ) t [ p ]. j o i n () ;
r e t u r n r e s u l t N e w ; }
Threads Atomics
Main Program for Parallel FFT
int m a i n () {
c o n s t s i z e _ t n u m P o i n t s = pow (2 ,2) ;
std :: vector < std :: complex <double> > d a t a ( n u m P o i n t s ) ; s i z e _ t i =1;
for (a u t o& x : d a t a ) x . r e a l ( cos ( i ++) ) ;
a u t o t0 = std :: c h r o n o :: s t e a d y _ c l o c k :: now () ;
std :: vector < std :: complex <double> > r e s u l t 1 = fft ( d a t a ) ; a u t o t1 = std :: c h r o n o :: s t e a d y _ c l o c k :: now () ;
for (a u t o& x : r e s u l t 1 )
std :: c o u t < < std :: abs ( x ) < < " " < < x . r e a l () < < " " < < x . i m a g () < <
std :: e n d l ;
std :: c o u t < < std :: e n d l < < std :: e n d l ;
std :: c o u t < < " # P a r a l l e l fft w i t h " < < n u m T h r e a d s < < " t h r e a d s t o o k " < <
std :: c h r o n o :: d u r a t i o n _ c a s t < std :: c h r o n o :: m i l l i s e c o n d s >( t1 - t0 ) . c o u n t ()
< < " m i l l i s e c o n d s . " < < std :: e n d l ; }
Threads Further Reading