Assignment on Massively Parallel Algorithms - Sheet 8

Academic year: 2021

Prof. G. Zachmann A. Srinivas

University of Bremen School of Computer Science

CGVR Group June 27, 2014

Summer Semester 2014

Assignment on Massively Parallel Algorithms - Sheet 8

Due Date: 02.07.2014

Exercise 1 (Prefix Sums Theoretical Exercise, 8 Credits)

a) Analyze the parallel scan kernel given below, which is launched with a single block. Show that thread divergence occurs only in the first warp, and only for stride values up to half the warp size. That is, for a warp size of 32, control divergence occurs in the iterations with stride values 1, 2, 4, 8, and 16.

Hint: Refer to the Hillis-Steele algorithm in the lecture slides.

    template<int INPUT_SIZE>
    __global__ void hillis_steele_scan_kernel(float *X, float *Y, int InputSize)
    {
        __shared__ float XY[INPUT_SIZE];
        int i = threadIdx.x;

        if (i < InputSize) {
            XY[threadIdx.x] = X[i];
        }

        // the code below performs iterative scan on XY
        for (unsigned int stride = 1; stride <= threadIdx.x; stride *= 2) {
            __syncthreads();
            XY[threadIdx.x] += XY[threadIdx.x - stride];
        }

        Y[i] = XY[threadIdx.x];
    }

    // kernel launch
    hillis_steele_scan_kernel<INPUT_SIZE><<<1, INPUT_SIZE>>>(X, Y, InputSize);
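For part a), one way to check a hand analysis is to evaluate the loop condition `stride <= threadIdx.x` on the host for every thread of every warp. The following is a minimal C++ sketch under stated assumptions, not part of the original sheet: the helper names `warp_diverges` and `divergent_iterations` are hypothetical, the warp size defaults to 32, and the block size is assumed to be a multiple of the warp size. A warp counts as divergent in an iteration when some of its threads still satisfy the loop condition and others do not.

```cpp
#include <utility>
#include <vector>

// True if the threads of `warp` disagree on the loop condition
// `stride <= threadIdx.x` for the given stride.
// Thread t still executes the loop body for stride s exactly when s <= t.
bool warp_diverges(unsigned warp, unsigned stride, unsigned warp_size = 32) {
    const unsigned first = warp * warp_size;       // lowest thread index in the warp
    const unsigned last = first + warp_size - 1;   // highest thread index in the warp
    const bool some_active = stride <= last;       // at least the last thread keeps looping
    const bool some_inactive = stride > first;     // at least the first thread has left the loop
    return some_active && some_inactive;
}

// Collects all (warp, stride) pairs with divergence for one block.
// Assumption: block_size is a multiple of warp_size.
std::vector<std::pair<unsigned, unsigned>> divergent_iterations(
        unsigned block_size, unsigned warp_size = 32) {
    std::vector<std::pair<unsigned, unsigned>> result;
    for (unsigned stride = 1; stride < block_size; stride *= 2) {
        for (unsigned warp = 0; warp < block_size / warp_size; ++warp) {
            if (warp_diverges(warp, stride, warp_size)) {
                result.emplace_back(warp, stride);
            }
        }
    }
    return result;
}
```

Calling `divergent_iterations(1024)` lists the divergent iterations for a 1024-thread block. The enumeration only supports the argument; the exercise still asks for a proof based on the thread index ranges of each warp.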

b) For the work-efficient scan kernel (Blelloch algorithm), assume that the input has 2048 elements. How many add operations are performed in the up-sweep phase, and how many in the down-sweep phase?
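For part b), the additions can be counted by running the algorithm sequentially on the host. The sketch below assumes the classic exclusive Blelloch scan on a power-of-two input length (up-sweep reduction, then a down-sweep that swaps and adds). The variant in the lecture slides may differ, for example an inclusive formulation performs fewer additions in its second phase. The names `ScanCounts` and `blelloch_exclusive_scan` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

struct ScanCounts {
    std::size_t upsweep_adds;
    std::size_t downsweep_adds;
};

// In-place exclusive scan of `a`, counting additions per phase.
// Assumption: a.size() is a power of two and at least 1.
ScanCounts blelloch_exclusive_scan(std::vector<long long>& a) {
    const std::size_t n = a.size();
    ScanCounts counts{0, 0};

    // Up-sweep: build partial sums in a balanced tree, one level per d.
    for (std::size_t d = 1; d < n; d *= 2) {
        for (std::size_t k = 2 * d - 1; k < n; k += 2 * d) {
            a[k] += a[k - d];
            ++counts.upsweep_adds;
        }
    }

    // Down-sweep: clear the root, then push prefixes back down the tree.
    a[n - 1] = 0;
    for (std::size_t d = n / 2; d >= 1; d /= 2) {
        for (std::size_t k = 2 * d - 1; k < n; k += 2 * d) {
            const long long left = a[k - d];
            a[k - d] = a[k];   // left child receives the parent's prefix
            a[k] += left;      // right child receives prefix plus left subtree sum
            ++counts.downsweep_adds;
        }
    }
    return counts;
}
```

Running this on a vector of 2048 elements gives the per-phase counts for this particular variant, which can be compared against a closed-form answer derived by hand.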

c) Analyze the Blelloch algorithm for arbitrary-length input presented in the lecture slides (hierarchical parallel scan algorithm). Show that it is work-efficient and that the total number of additions is no more than 4N - 3, where N is the number of input elements.
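For part c), a bound such as 4N - 3 can be spot-checked by counting additions in a host-side simulation. The sketch below makes several assumptions that may not match the lecture slides: a two-level scheme only (scan each block, scan the block sums, add each block's offset to its elements), an exclusive Blelloch scan inside each block, and power-of-two values for both the block size and the number of blocks. The names `scan_segment` and `hierarchical_scan` are hypothetical. It is a sanity check for specific sizes, not a proof.

```cpp
#include <cstddef>
#include <vector>

// Exclusive Blelloch scan of a[lo, lo + len); len must be a power of two.
// Returns the segment total and increments `adds` once per addition.
long long scan_segment(std::vector<long long>& a, std::size_t lo,
                       std::size_t len, std::size_t& adds) {
    for (std::size_t d = 1; d < len; d *= 2) {
        for (std::size_t k = 2 * d - 1; k < len; k += 2 * d) {
            a[lo + k] += a[lo + k - d];
            ++adds;
        }
    }
    const long long total = a[lo + len - 1];  // root of the reduction tree
    a[lo + len - 1] = 0;
    for (std::size_t d = len / 2; d >= 1; d /= 2) {
        for (std::size_t k = 2 * d - 1; k < len; k += 2 * d) {
            const long long left = a[lo + k - d];
            a[lo + k - d] = a[lo + k];
            a[lo + k] += left;
            ++adds;
        }
    }
    return total;
}

// Two-level exclusive scan; returns the total number of additions.
// Assumptions: a.size() is a multiple of `block`; `block` and the
// number of blocks are powers of two.
std::size_t hierarchical_scan(std::vector<long long>& a, std::size_t block) {
    const std::size_t blocks = a.size() / block;
    std::size_t adds = 0;

    // Step 1: scan every block independently and record its total.
    std::vector<long long> sums(blocks);
    for (std::size_t b = 0; b < blocks; ++b) {
        sums[b] = scan_segment(a, b * block, block, adds);
    }

    // Step 2: scan the block totals to obtain per-block offsets.
    scan_segment(sums, 0, blocks, adds);

    // Step 3: add each block's offset to its elements (block 0 has offset 0).
    for (std::size_t b = 1; b < blocks; ++b) {
        for (std::size_t j = 0; j < block; ++j) {
            a[b * block + j] += sums[b];
            ++adds;
        }
    }
    return adds;
}
```

Comparing the returned count with `4 * N - 3` for a few values of N and block sizes shows how much slack the bound leaves under these assumptions.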

d) Describe a massively parallel algorithm that computes the minimum of an array with depth complexity O(log n).
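For part d), one common approach is a tree reduction in which every round halves the number of candidates. The sketch below simulates it sequentially in C++. The inner loop iterations are independent of each other, so in a real kernel each could be handled by one thread, with a barrier between rounds. The names `MinResult` and `parallel_min` are hypothetical, and the input is assumed to be non-empty.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct MinResult {
    int value;           // minimum of the input
    std::size_t rounds;  // number of reduction rounds (the depth)
};

// Tree reduction for the minimum. Assumption: `a` is non-empty.
MinResult parallel_min(std::vector<int> a) {
    std::size_t n = a.size();
    std::size_t rounds = 0;
    while (n > 1) {
        const std::size_t half = (n + 1) / 2;
        // Each i is independent: conceptually one thread per i.
        for (std::size_t i = 0; i + half < n; ++i) {
            a[i] = std::min(a[i], a[i + half]);
        }
        n = half;   // only the first `half` entries remain candidates
        ++rounds;   // a barrier would separate rounds on the GPU
    }
    return {a[0], rounds};
}
```

For example, `parallel_min({7, 3, 9, -2, 5})` needs three rounds for five elements, matching the ceiling of log2(5).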

e) Explain why it is necessary for the ⊕ operator in the definition of any prefix sum to be associative.
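For part e), a small experiment can illustrate the issue: a parallel scan combines elements in a tree-shaped order, while the definition of a prefix sum combines them left to right. The sketch below compares both orders for addition (associative) and subtraction (not associative). The helper names `fold_sequential` and `fold_tree` are hypothetical, and the input length is assumed to be a power of two.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Op = std::function<long long(long long, long long)>;

// Left-to-right combination: ((a0 op a1) op a2) op a3 ...
// Assumption: `a` is non-empty.
long long fold_sequential(const std::vector<long long>& a, const Op& op) {
    long long acc = a[0];
    for (std::size_t i = 1; i < a.size(); ++i) {
        acc = op(acc, a[i]);
    }
    return acc;
}

// Pairwise tree combination: (a0 op a1) op (a2 op a3) ...
// Assumption: a.size() is a power of two.
long long fold_tree(std::vector<long long> a, const Op& op) {
    for (std::size_t n = a.size(); n > 1; n /= 2) {
        for (std::size_t i = 0; i < n / 2; ++i) {
            a[i] = op(a[2 * i], a[2 * i + 1]);
        }
    }
    return a[0];
}
```

With the input `{1, 2, 3, 4}`, both functions return 10 for addition. For subtraction, `fold_sequential` returns -8 while `fold_tree` returns 0, so the regrouping that a parallel scan relies on changes the result.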
