Prof. G. Zachmann A. Srinivas

University of Bremen School of Computer Science

CGVR Group May 8, 2014

Summer Semester 2014

Assignment on Massively Parallel Algorithms - Sheet 4

Due Date: 21.05.2014

Exercise 1 (Reduce operations, 8 Credits)

The framework reduce_max_sum sets up a large array in global memory on the GPU.

The goal of this exercise is to write a program that computes the sum, the max, and the argmax for each block in one kernel call. The intermediate (per-block) results are then combined on the CPU.

In this exercise you can assume that the input vector has a power-of-two length (and all elements are valid).

Example:

Input array: 1, 3, -10, 0.2, 42.17, -0.1

Sum: 36.27, Max: 42.17, Argmax: 4 (zero-based index)

Your tasks are the following:

a) Implement a version of the kernel that makes use of shared memory.

b) Implement another version of the kernel using global memory only for all intermediate results.

Note: CUDA does not support synchronization across different blocks of a kernel call.

c) Write a CPU reference implementation to compute the sum, max, and argmax. Compare the running times of all three solutions.

d) Describe (in pseudo code) how you could change your kernel so that it can handle vectors of arbitrary length. (You don’t have to implement this version.) What could be detrimental to the performance of this modified kernel?

Exercise 2 (Find the Synchronization Bug, 2 Credits)

In the following kernel for the dot product, there is a bug that will cause erratic errors, which might look suspiciously like a race condition.

a) Find the bug. Hint: read the code carefully until the very last line.

b) Fix the bug by only adding one line of code. You don’t have to implement anything.

__global__
void dotproduct( float *a, float *b, float *c, int N )
{
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if ( tid < N )
        cache[threadIdx.x] = a[tid] * b[tid];

    // for reductions, threadsPerBlock must be a power of 2!
    int i = blockDim.x / 2;
    while ( i != 0 )
    {
        __syncthreads();   // wait until all input data is ready
        if ( threadIdx.x < i )
            cache[threadIdx.x] += cache[threadIdx.x + i];
        i /= 2;
    }

    // last thread copies partial sum to global memory
    if ( threadIdx.x == ( blockDim.x - 1 ) )
        c[blockIdx.x] = cache[0];
}
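To follow the halving-stride loop in the kernel, it can help to trace it sequentially on the CPU. The sketch below is plain C++ and only illustrates the access pattern. The inner loop stands in for the threads with `threadIdx.x < i`, and the function name `tree_reduce` is made up for this illustration. It says nothing about the timing of threads on the GPU.

```cpp
#include <cstddef>
#include <vector>

// Sequential trace of the halving-stride reduction loop.
// Requires a non-empty input whose length is a power of two,
// like threadsPerBlock in the kernel.
float tree_reduce(std::vector<float> cache) {
    for (std::size_t stride = cache.size() / 2; stride != 0; stride /= 2) {
        // On the GPU, each t below would be handled by a separate thread.
        for (std::size_t t = 0; t < stride; ++t) {
            cache[t] += cache[t + stride];
        }
    }
    return cache[0];  // the block's partial sum ends up in element 0
}
```

For an input of length 8 the strides are 4, 2, and 1, so the result is available after three passes.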
