
§  Remarks:

§  __global__ defines a kernel function

§  Each '__' consists of two underscore characters

§  A kernel function must return void

§  __device__ and __host__ can be used together

Three different kinds of functions in CUDA:

                                   Only callable from:    Executed on:
__host__   float hostFunc();       host                   host
__global__ void  kernelFunc();     host/device            device
__device__ float deviceFunc();     device                 device


§  Example for the latter: make cuComplex usable on both device and host

struct cuComplex        // define a class for complex numbers
{
    float r, i;         // real, imaginary part

    __device__ __host__
    cuComplex( float a, float b ) : r(a), i(b) {}

    __device__ __host__
    float magnitude2( void ) {
        return r * r + i * i;
    }

    // etc. ...
};
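A small, hypothetical usage sketch (magKernel and the surrounding calls are an illustration, not the lecture's code): because both members are tagged __device__ __host__, the identical source compiles for host and device.

__global__
void magKernel( cuComplex c, float * out )
{
    *out = c.magnitude2();             // device-side call of the same method
}

int main( void )
{
    cuComplex c( 3.0f, 4.0f );
    float on_host = c.magnitude2();    // host-side call, same source code

    float * d_out;
    cudaMalloc( &d_out, sizeof(float) );
    magKernel<<<1,1>>>( c, d_out );    // cuComplex is passed by value to the kernel
    cudaFree( d_out );
    return 0;
}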


§  An "optimization":

§  The sequence of z_i can

§  either converge towards a single (complex) value,

§  or it can end up in a cycle of values,

§  or it can be chaotic.

§  Idea:

§  Try to recognize such cycles; if you realize that a thread is caught in a cycle, exit immediately (this should happen much earlier in most cases)

§  Maintain an array of the k most recent elements of the sequence (see the sketch after the figure note below)

§  Last time I checked: 4x slower than the brute-force version!

(Figure: in one region of the complex plane, all points converge towards a fixed point; in another region, all points converge towards a cycle of length 2.)
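A minimal sketch of this cycle-detection idea (a sketch under assumptions: MAX_ITER, the choice k = CYCLE_MEM = 8, and the exact floating-point comparison are not the lecture's code):

__device__
int mandelbrotIterations( float cr, float ci )
{
    const int MAX_ITER  = 256;
    const int CYCLE_MEM = 8;                    // k most recent elements
    float zr = 0.0f, zi = 0.0f;
    float recent_r[CYCLE_MEM], recent_i[CYCLE_MEM];

    for ( int i = 0; i < MAX_ITER; i ++ )
    {
        float new_zr = zr*zr - zi*zi + cr;      // z_{i+1} = z_i^2 + c
        zi = 2.0f*zr*zi + ci;
        zr = new_zr;

        if ( zr*zr + zi*zi > 4.0f )
            return i;                           // diverged -> point is outside

        // caught in a cycle? then the sequence will never diverge -> exit early
        for ( int j = 0; j < CYCLE_MEM && j < i; j ++ )
            if ( recent_r[j] == zr && recent_i[j] == zi )
                return MAX_ITER;

        recent_r[i % CYCLE_MEM] = zr;           // remember the k most recent z_i
        recent_i[i % CYCLE_MEM] = zi;
    }
    return MAX_ITER;
}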


Querying the Device for its Capabilities

§  How do you know how many threads can be in a block, etc.?

§  Query your GPU, like so:

int devID;
cudaGetDevice( &devID );                    // GPU currently in use
cudaDeviceProp props;
cudaGetDeviceProperties( &props, devID );
unsigned int threads_per_block = props.maxThreadsPerBlock;


For Your Reference: the Complete Table of the cudaDeviceProp

Some of these are self-explanatory; others bear some additional description, see Table 3.1.

Table 3.1  CUDA Device Properties

DEVICE PROPERTY                  DESCRIPTION
char name[256]                   An ASCII string identifying the device (e.g., "GeForce GTX 280")
size_t totalGlobalMem            The amount of global memory on the device, in bytes
size_t sharedMemPerBlock         The maximum amount of shared memory a single block may use, in bytes
int regsPerBlock                 The number of 32-bit registers available per block
int warpSize                     The number of threads in a warp
size_t memPitch                  The maximum pitch allowed for memory copies, in bytes
int maxThreadsPerBlock           The maximum number of threads that a block may contain
int maxThreadsDim[3]             The maximum number of threads allowed along each dimension of a block
int maxGridSize[3]               The number of blocks allowed along each dimension of a grid
size_t totalConstMem             The amount of available constant memory
int major                        The major revision of the device's compute capability
int minor                        The minor revision of the device's compute capability
int clockRate                    The clock frequency, in kilohertz
size_t textureAlignment          The device's requirement for texture alignment
int deviceOverlap                A boolean value representing whether the device can simultaneously perform a cudaMemcpy() and kernel execution
int multiProcessorCount          The number of multiprocessors on the device
int kernelExecTimeoutEnabled     A boolean value representing whether there is a runtime limit for kernels executed on this device
int integrated                   A boolean value representing whether the device is an integrated GPU (i.e., part of the chipset and not a discrete GPU)
int canMapHostMemory             A boolean value representing whether the device can map host memory into the CUDA device address space
int computeMode                  A value representing the device's computing mode: default, exclusive, or prohibited
int maxTexture1D                 The maximum size supported for 1D textures
int maxTexture2D[2]              The maximum dimensions supported for 2D textures
int maxTexture3D[3]              The maximum dimensions supported for 3D textures
int maxTexture2DArray[3]         The maximum dimensions supported for 2D texture arrays
int concurrentKernels            A boolean value representing whether the device supports executing multiple kernels within the same context simultaneously

We'd like to avoid going too far too fast down our rabbit hole, so we will not go into extensive detail about these properties now. In fact, the previous list is missing some important details about some of these properties, so you will want to consult the NVIDIA CUDA Programming Guide for more information. When you move on to write your own applications, these properties will prove extremely useful. However, for now we will simply show how to query each device and report the properties of each. So far, our device query looks something like this:

#include "../common/book.h"

int main( void ) {

cudaDeviceProp prop;

int count;

HANDLE_ERROR( cudaGetDeviceCount( &count ) );

for (int i=0; i< count; i++) {

HANDLE_ERROR( cudaGetDeviceProperties( &prop, i ) );

//Do something with our device's properties

} }



Problem Partitioning

§  Problem: what if your input, e.g. the vectors, is larger than the maximum allowed size along one dimension?

§  I.e., what if vec_len > maxThreadsDim[0] * maxGridSize[0] ?

§  Solution: partition the problem (in the figure, color = thread ID)

(Figure note: only these two partitionings are good for GPUs!)


Example: Adding Huge Vectors

§  Vectors of size 100,000,000 are not uncommon in high-performance computing (HPC) …

§  The thread layout:

dim3 threads(16,16);          // = 256 threads per block
int n_threads_pb = threads.x * threads.y;
int n_blocks = (vec_len + n_threads_pb - 1) / n_threads_pb;
int nb_sqrt = (int)( ceilf( sqrtf( n_blocks ) ) );
dim3 blocks( nb_sqrt, nb_sqrt );

§  Kernel launch:

addVectors<<< blocks, threads >>>( d_a, d_b, d_c, n );

§  Index computation in the kernel:

unsigned int tid_x = blockDim.x * blockIdx.x + threadIdx.x;
unsigned int tid_y = blockDim.y * blockIdx.y + threadIdx.y;
unsigned int i = tid_y * (blockDim.x * gridDim.x) + tid_x;
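For completeness, a minimal sketch of what the matching addVectors kernel could look like (the element type float and the exact signature are assumptions based on the launch above, not the lecture's code). The range check is essential, because the nb_sqrt x nb_sqrt grid of 16x16 blocks generally covers more than vec_len elements:

__global__
void addVectors( const float * a, const float * b, float * c, unsigned int n )
{
    // same index computation as above
    unsigned int tid_x = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int tid_y = blockDim.y * blockIdx.y + threadIdx.y;
    unsigned int i = tid_y * (blockDim.x * gridDim.x) + tid_x;

    if ( i < n )              // excess threads of the rounded-up grid do nothing
        c[i] = a[i] + b[i];
}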


§  Visualization of this index computation:

(Figure: the grid is a 2D arrangement of blocks, indexed by blockIdx.x / blockIdx.y; within each block, threads are indexed by threadIdx.x / threadIdx.y.)


Constant Memory

§  Why is it so important to declare constant variables/instances in C/C++ as const ?

§  It allows the compiler to …

§  optimize your program a lot

§  do more type-checking

§  Something similar exists in CUDA → constant memory


Example: a Simple Raytracer

§  The ray-tracing principle:

1.  Shoot rays from camera through every pixel into scene (primary rays)

2.  If a ray hits more than one object, then consider only the first hit

3.  From there, shoot rays to all light sources (shadow feelers)

4.  If a shadow feeler hits another object → the point is in shadow w.r.t. that light source.

Otherwise, evaluate a lighting model (e.g., Phong [see "Computer graphics"])

5.  If the hit object is glossy, then shoot reflected rays into scene (secondary rays) → recursion

6.  If the hit object is transparent, then shoot refracted ray → more recursion


§  Simplifications (for now):

§  Only primary rays

§  Camera is at infinity → primary rays are orthogonal to the image plane

§  Only spheres

-  They are so easy, every raytracer has them ☺

(Figure: orthogonal projection; image source: http://schabby.de/wp-content/uploads/2012/08/Orthogonal-Projection-plain.svg)


§  The data structures:

struct Sphere
{
    Vec3  center;       // center of sphere
    float radius;
    Color r, g, b;      // color of sphere

    __device__
    bool intersect( const Ray & ray, Hit * hit ) {
        ...
    }
};


§  The mechanics on the host side:

int main( void )
{
    // create host/device bitmaps (see Mandelbrot ex.)
    ...

    Sphere * h_spheres = new Sphere[n_spheres];
    // generate spheres, or read from file

    // transfer spheres to device (later)

    // generate image by launching kernel
    // assumption: img_size = multiple of block-size!
    dim3 threads(16,16);
    dim3 blocks( img_size/threads.x, img_size/threads.y );
    raytrace<<<blocks,threads>>>( d_bitmap );

    // display, clean up, and exit
}


The mechanics on the device side

__global__
void raytrace( unsigned char * bitmap )
{
    // map thread id to pixel position
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int offset = x + y * (gridDim.x * blockDim.x);

    Ray ray( x, y, camera );            // generate primary ray

    // check intersection with scene, take closest one
    float min_dist   = INF;
    int   hit_sphere = MAX_INT;
    Hit   hit;
    for ( int i = 0; i < n_spheres; i ++ )
    {
        if ( intersect(ray, i, &hit) )
        {
            if ( hit.dist < min_dist )
            {
                min_dist   = hit.dist;  // found a closer hit
                hit_sphere = i;         // remember sphere; hit info
            }                           // is already filled
        }
    }

    // compute color at hit point (if any) and set in bitmap
}
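As a purely hypothetical continuation of the final comment in the kernel above (the RGBA bitmap layout with 4 bytes per pixel is an assumption, and a real implementation would evaluate the lighting model from step 4 of the ray-tracing principle instead of flat shading):

// to be placed where the final comment sits, inside raytrace()
unsigned char shade = ( hit_sphere != MAX_INT ) ? 255 : 0;   // white on hit, black background
bitmap[offset*4 + 0] = shade;
bitmap[offset*4 + 1] = shade;
bitmap[offset*4 + 2] = shade;
bitmap[offset*4 + 3] = 255;                                  // opaque alpha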


Declaration & Transfer

§  Since it is constant memory, we declare it as such:

const int MAX_NUM_SPHERES = 1000;
__constant__ Sphere c_spheres[MAX_NUM_SPHERES];

§  Transfer now works by a different kind of memcpy:

int main( void )
{
    ...
    // transfer spheres to device
    cudaMemcpyToSymbol( c_spheres, h_spheres,
                        n_spheres * sizeof(Sphere) );
    ...
}


§  Access of constant memory on the device (i.e., from a kernel) works just like with any globally declared variable

§  Example:

__constant__ Sphere c_spheres[MAX_NUM_SPHERES];

__device__
bool intersect( const Ray & ray, int s, Hit * hit )
{
    Vec3 m( c_spheres[s].center - ray.orig );
    float q = m*m - c_spheres[s].radius * c_spheres[s].radius;
    float p = ...
    solve_quadratic( p, q, *t1, *t2 );
    ...
}

(Figure: ray-sphere intersection; M = sphere center, r = radius, d = normalized ray direction, m = vector from the ray origin to M, t_1 and t_2 = ray parameters of the two intersection points.)

(t·d − m)² = r²   ⟹   t² − 2t (m·d) + m·m − r² = 0

i.e., with p = −2 (m·d) and q = m·m − r², this is the normal form t² + p t + q = 0 that solve_quadratic solves.
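A hedged sketch of how the elided pieces might be filled in, assuming a unit-length ray direction ray.dir, a Vec3 operator* that is the dot product, and a solve_quadratic( p, q, &t1, &t2 ) returning false when there are no real roots and t1 ≤ t2 otherwise (all of these are assumptions, not the lecture's interface):

__device__
bool intersect( const Ray & ray, int s, Hit * hit )
{
    Vec3  m( c_spheres[s].center - ray.orig );     // vector from ray origin to center M
    float q = m*m - c_spheres[s].radius * c_spheres[s].radius;
    float p = -2.0f * ( m * ray.dir );             // coefficients of t^2 + p t + q = 0

    float t1, t2;
    if ( ! solve_quadratic( p, q, &t1, &t2 ) )     // no real roots -> ray misses the sphere
        return false;

    float t = ( t1 > 0.0f ) ? t1 : t2;             // nearest intersection in front of the origin
    if ( t <= 0.0f )
        return false;

    hit->dist = t;                                 // fill hit info (a real Hit has more fields)
    return true;
}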


Some Considerations on Constant Memory

§  Size of constant memory on the GPU is fairly limited (~64 kB)

§  Check cudaDeviceProp

§  Reads from constant memory can be very fast:

§  "Nearby" threads accessing the same constant memory location incur only a single read operation (saves bandwidth by up to factor 16!)

§  Constant memory is cached (i.e., consecutive reads will not incur additional traffic)

§  Caveats:

§  If "nearby" threads read from different memory locations

→ traffic jam!
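A hypothetical mini-kernel (not from the lecture) contrasting the two cases: the first read uses the same constant-memory address for all nearby threads and becomes a single broadcast read, while the second uses a per-thread address and gets serialized:

__constant__ float c_table[1024];

__global__
void constReadDemo( float * out, int k )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float fast = c_table[k];            // same location for all nearby threads -> one read
    float slow = c_table[i % 1024];     // different locations -> reads are serialized
    out[i] = fast + slow;
}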


New Terminology

§  "Nearby threads" = all threads within a warp

§  Warp := 32 threads next to each other

§  Each block's set of threads is partitioned into warps

§  All threads within a warp are executed on a single streaming multiprocessor (SM) in lockstep

§  If all threads in a warp read from the same memory location → one read instruction by the SM

§  If all threads in a warp read from random memory locations → 32 different read instructions by SM, one after another!

§  In our raytracing example, everything is fine (if there is no bug ☺)

For more details: see "Performance with constant memory" on course web page


Overview of a GPU's Architecture

Nvidia's Kepler architecture as of 2012 (192 single-precision cores per SM, 15 SMs)


One Streaming Multiprocessor


Thread Divergence Revisited

§  This execution of threads in lockstep fashion on one SM (think SIMD) is the reason why thread divergence is so bad

§  Thread divergence can occur at each occurrence of if-then-else, while, for, and switch (all flow control statements)

§  Example: with an if-then-else, the threads of a warp that take the "then" branch are executed in a 1st pass, and the threads that take the "else" branch in a 2nd pass (see the sketch below)
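A hypothetical kernel (not the lecture's) showing where the two passes come from: even and odd threads of the same warp take different branches, so the warp executes one path after the other:

__global__
void divergenceDemo( float * data )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( threadIdx.x % 2 == 0 )
        data[i] = sqrtf( data[i] );      // 1st pass: even lanes
    else
        data[i] = data[i] * data[i];     // 2nd pass: odd lanes
}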


Control Flow Divergence

(Figure: a control-flow graph with several branches leading to paths A, B, and C.)

§  The more complex your control flow graph (this is called its cyclomatic complexity), the more thread divergence can occur!


Consequences for You as an Algorithm Designer / Programmer

§  Try to devise algorithms that consist of kernels with very low cyclomatic complexity

§  Avoid recursion (would probably further increase thread divergence)

§  The other reason is that we would need one stack per thread

§  If your algorithm heavily relies on recursion, then it may not be well suited for massive (data) parallelism!
