Prof. G. Zachmann A. Srinivas

University of Bremen School of Computer Science

CGVR Group May 8, 2014

Summer Semester 2014

Assignment on Massively Parallel Algorithms - Sheet 4

Due Date 21. 05. 2014

Exercise 1 (Reduce operations, 8 Credits)

The framework reduce_max_sum sets up a large array in global memory on the GPU.

The goal of this exercise is to write a program that computes the sum, the max, and the argmax for each block in one kernel call. These intermediate (per-block) results are then combined on the CPU.
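The CPU-side combination step can be sketched as follows. This is only an illustration under stated assumptions: the record type BlockResult, its field names, and the function combine are hypothetical and not part of the reduce_max_sum framework, and each block is assumed to report its argmax as an index into the whole input array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-block record; names are illustrative, not from the framework.
struct BlockResult {
    float sum;           // sum of the block's elements
    float max;           // largest element in the block
    std::size_t argmax;  // assumed to be the GLOBAL index of that element
};

// Combine the per-block partial results into one result for the whole array.
BlockResult combine(const std::vector<BlockResult>& blocks) {
    assert(!blocks.empty());
    BlockResult total = blocks[0];
    for (std::size_t b = 1; b < blocks.size(); ++b) {
        total.sum += blocks[b].sum;
        // Strict ">" keeps the earlier block's maximum when two blocks tie.
        if (blocks[b].max > total.max) {
            total.max = blocks[b].max;
            total.argmax = blocks[b].argmax;
        }
    }
    return total;
}
```

If the kernel stores a block-local index instead, the offset blockIdx.x * blockDim.x would have to be added before or during this step.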

In this exercise you can assume that the input vector has a power-of-two length (and all elements are valid).

Example:

Input Array: 1, 3, -10, 0.2, 42.17, -0.1

Sum: 36.27, Max: 42.17, Argmax: 4 (zero-based index)
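To make the three quantities concrete, the following sequential sketch evaluates them on the example input. The names Reduction and reduce_sequential are hypothetical, and the element type float is an assumption about the framework. Ties are resolved in favor of the first occurrence, which the sheet does not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type; names are illustrative only.
struct Reduction {
    float sum;
    float max;
    std::size_t argmax;  // zero-based index of the maximum
};

// Single pass over the data: accumulate the sum and track the running maximum.
Reduction reduce_sequential(const std::vector<float>& v) {
    assert(!v.empty());
    Reduction r{v[0], v[0], 0};
    for (std::size_t i = 1; i < v.size(); ++i) {
        r.sum += v[i];
        if (v[i] > r.max) {  // strict ">" keeps the first maximum on ties
            r.max = v[i];
            r.argmax = i;
        }
    }
    return r;
}
```

Calling reduce_sequential on {1, 3, -10, 0.2, 42.17, -0.1} gives argmax 4 and max 42.17; the sum agrees with 36.27 up to float rounding.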

Your tasks are the following:

a) Implement a version of the kernel that makes use of shared memory.

b) Implement another version of the kernel using global memory only for all intermediate results.

Note: CUDA does not support synchronization across different blocks of a kernel call.

c) Write a CPU reference implementation to compute the sum, max, and argmax. Compare the running times of all three solutions.

d) Describe (in pseudo code) how you could change your kernel so that it can handle vectors of arbitrary length. (You don’t have to implement this version.) What could be detrimental to the performance of this modified kernel?

Exercise 2 (Find the Synchronization Bug, 2 Credits)

In the following kernel for the dot product, there is a bug that will cause erratic errors, which might look suspiciously like a race condition.

a) Find the bug. Hint: read the code carefully until the very last line.

b) Fix the bug by adding only one line of code. You don’t have to implement anything.

    __global__
    void dotproduct( float *a, float *b, float *c, int N )
    {
        __shared__ float cache[threadsPerBlock];
        int tid = threadIdx.x + blockIdx.x * blockDim.x;

        if ( tid < N )
            cache[threadIdx.x] = a[tid] * b[tid];

        // for reductions, threadsPerBlock must be a power of 2!
        int i = blockDim.x / 2;
        while ( i != 0 )
        {
            __syncthreads();   // wait until all input data is ready
            if ( threadIdx.x < i )
                cache[threadIdx.x] += cache[threadIdx.x + i];
            i /= 2;
        }

        // last thread copies partial sum to global memory
        if ( threadIdx.x == (blockDim.x - 1) )
            c[blockIdx.x] = cache[0];
    }
