Massively Parallel Algorithms

Academic year: 2021
(1) Massively Parallel Algorithms: Parallel Sorting. G. Zachmann, University of Bremen, Germany. cgvr.cs.uni-bremen.de

(2) Sorting using Spaghetti in O(1) (?)

§ Is O(n) really the lower bound for sorting?
§ Consider the following thought experiment:
  B. For each number x in the list, cut a spaghetto to length x → the list becomes a bundle of spaghetti (a unary representation)
  C. Hold the spaghetti loosely in your hand and tap them on the kitchen table → takes O(1)!
  D. Lower your other hand from above until it meets a spaghetto; this one is clearly the longest
  E. Remove this spaghetto and insert it at the front of the output list
  F. Repeat
§ If we could use this mechanical computer, then sorting would be O(1).

G. Zachmann. Massively Parallel Algorithms. SS. 21 September 2020. Sorting.

(3) Difficulties With Parallel Implementation of Standard Sequential Algorithms

§ Insertion sort: considers only one element at a time.
§ Quicksort:
  § Yes, some parallelism at lower levels of the recursion tree.
  § But we would need the median as the pivot element → hard to find.
  § Otherwise, a random pivot element causes varying sub-array sizes.
§ Heapsort:
  § Only one element at a time.
  § The heap (a recursive data structure) is difficult on a massively parallel architecture.
§ Radix sort:
  § Yes, we have seen that already; it works well.
  § But it can handle only fixed-length keys.

(4) Assumptions

§ In this chapter, we will always assume that n = 2^k.
§ Elements can have any type for which there is a comparison operator.

(5) Sorting Networks

§ Informal definition of comparator networks:
  § They consist of a bundle of "wires".
  § Each wire i carries a data element Di (e.g., a float) from left to right.
  § Two wires can be connected vertically by a comparator.
  § If Di > Dj ∧ i < j (i.e., wrong order), then Di and Dj are swapped by the comparator before they move on along the wires.
§ Observation: every comparator network is data-independent, i.e., the arrangement of comparators and the running time are always the same!
§ Goal: find a "small" comparator network that performs sorting for any input → sorting network.
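A comparator network can be simulated in a few lines. The following Python sketch (not from the slides; the stage layout `net4` is an odd-even-transposition example chosen for illustration) represents a network as a list of stages, each stage being a list of comparators `(i, j)` with i < j:

```python
# A minimal sketch: simulating a comparator network.
# A network is a list of stages; each stage is a list of comparators (i, j), i < j.

def apply_network(network, data):
    """Run a comparator network on a copy of `data` and return the result."""
    d = list(data)
    for stage in network:
        for (i, j) in stage:          # comparators in one stage act in parallel
            if d[i] > d[j]:           # wrong order -> swap
                d[i], d[j] = d[j], d[i]
    return d

# Example: a 4-wire sorting network (odd-even transposition style, 4 stages)
net4 = [[(0, 1), (2, 3)], [(1, 2)], [(0, 1), (2, 3)], [(1, 2)]]
print(apply_network(net4, [3, 1, 4, 2]))   # → [1, 2, 3, 4]
```

Note that the comparator arrangement is fixed in advance, which is exactly the data-independence mentioned above.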

(6) Example

§ [Figure: an 8-wire comparator network, lanes 0-7; comparators that can act simultaneously form one stage / step.]

(7) The 0-1 Principle

§ Definition (monotone function): Let A, B be two sets with a total ordering relation, and let f : A → B be a mapping. f is called monotone iff
  ∀ a1, a2 ∈ A : a1 ≤ a2 ⇒ f(a1) ≤ f(a2)
§ Lemma: Let f : A → B be monotone. Then f and min commute, i.e.,
  ∀ a1, a2 ∈ A : f( min(a1, a2) ) = min( f(a1), f(a2) )
  Analogously for max.
§ Proof:
  Case 1: a1 ≤ a2 ⇒ f(a1) ≤ f(a2). Then min(a1, a2) = a1 and min( f(a1), f(a2) ) = f(a1), hence f( min(a1, a2) ) = f(a1) = min( f(a1), f(a2) ).
  Case 2: a1 > a2, analogously.

(8) § Extension of f : A → B to sequences over A and B, resp.: f(a) = f(a0), …, f(an-1).
§ Lemma C: Let f be a monotone mapping and N a comparator network. Then N and f commute, i.e.,
  ∀ a = a0, …, an-1 : N( f(a) ) = f( N(a) )

(9) § Proof:
§ Let a = a0, …, an-1 be a sequence.
§ Notation: we write a comparator connecting wires i and j like so: a' = [i : j](a)
  [Figure: wires a0, …, ai, …, aj, …, an-1 entering the comparator [i : j], leaving as a'0, …, a'i, …, a'j, …, a'n-1]
§ Now the following is true:
  [i : j] f(a) = [i : j] ( f(a0), …, f(an-1) )
               = f(a0), …, min( f(ai), f(aj) ), …, max( f(ai), f(aj) ), …, f(an-1)     (positions i and j)
               = f(a0), …, f( min(ai, aj) ), …, f( max(ai, aj) ), …, f(an-1)
               = f( a0, …, min(ai, aj), …, max(ai, aj), …, an-1 )
               = f( [i : j](a) )

(10) § Theorem (0-1 principle): Let N be a comparator network. If N sorts every sequence of 0's and 1's, then it also sorts every sequence of arbitrary elements!
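The theorem can be exploited directly for testing: instead of checking all n! input permutations, it suffices to check the 2^n zero-one inputs. A Python sketch (the simulator and the example networks `net4`/`bad` are illustrations, not from the slides):

```python
# Verifying a comparator network via the 0-1 principle: if it sorts all 2^n
# binary sequences, the theorem guarantees it sorts arbitrary inputs.
from itertools import product

def apply_network(network, data):
    d = list(data)
    for stage in network:
        for i, j in stage:
            if d[i] > d[j]:
                d[i], d[j] = d[j], d[i]
    return d

def is_sorting_network(network, n):
    """Check only the 2^n zero-one inputs (0-1 principle)."""
    return all(apply_network(network, bits) == sorted(bits)
               for bits in product((0, 1), repeat=n))

net4 = [[(0, 1), (2, 3)], [(1, 2)], [(0, 1), (2, 3)], [(1, 2)]]
print(is_sorting_network(net4, 4))    # → True
bad = [[(0, 1), (2, 3)], [(1, 2)]]    # too few stages
print(is_sorting_network(bad, 4))     # → False
```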

(11) § Proof (by contradiction):
§ Assumption: N sorts all 0-1 sequences, but does not sort some sequence a, i.e., b = N(a) is not sorted correctly: there is an index k with bk > bk+1.
§ Define f : A → {0,1} as follows: f(x) = 0 if x < bk, and f(x) = 1 otherwise.
§ Now, the following holds (f monotone, Lemma C): f(b) = f( N(a) ) = N( f(a) ) = N(a'), where a' = f(a) is a 0-1 sequence.
§ But f(b) is not sorted, because f(bk) = 1 and f(bk+1) = 0.
§ Therefore, N(a') is not sorted either; in other words, we have constructed a 0-1 sequence a' that is not sorted correctly by N.

(12) Batcher's Odd-Even Mergesort [1968]

§ In the following, we will always assume that the length n of a sequence a0, …, an-1 is a power of 2, i.e., n = 2^k.
§ First of all, we define the sub-routine "odd-even merge":

  oem( a0, …, an-1 ):
      precondition:  a0, …, an/2-1 and an/2, …, an-1 are both sorted
      postcondition: a0, …, an-1 is sorted
      if n = 2:
          compare [a0 : a1]                       (1)
      if n > 2:
          ā ← a0, a2, …, an-2                     // = even sub-sequence
          â ← a1, a3, …, an-1                     // = odd sub-sequence
          ā ← oem( ā )                            (2)
          â ← oem( â )
          copy ā → a0, a2, …, an-2
          copy â → a1, a3, …, an-1
          for i ∈ {1, 3, 5, …, n-3}:              (3)
              compare [ai : ai+1]
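A runnable Python sketch of this pseudocode (returning a new list instead of working in place, for clarity):

```python
# Batcher's odd-even merge, following the pseudocode above.

def oem(a):
    """Merge: both halves of `a` must already be sorted; len(a) is a power of 2."""
    n = len(a)
    if n == 1:
        return a
    if n == 2:
        return [min(a), max(a)]                    # the single comparator (1)
    even = oem(a[0::2])                            # recursively merge even ...
    odd  = oem(a[1::2])                            # ... and odd sub-sequences (2)
    a = [None] * n
    a[0::2], a[1::2] = even, odd
    for i in range(1, n - 2, 2):                   # final comparator stage (3)
        if a[i] > a[i + 1]:
            a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(oem([1, 4, 6, 7, 2, 3, 5, 8]))   # → [1, 2, 3, 4, 5, 6, 7, 8]
```

Note how the precondition propagates: the even (and odd) sub-sequence again consists of two sorted halves, so the recursive calls are legal.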

(13) § Proof of correctness: by induction and the 0-1 principle.
§ Base case: n = 2, a single comparator sorts correctly.
§ Induction step: n = 2^k, k > 1:
  § Consider a 0-1 sequence a0, …, an-1 whose first half and second half are both sorted (the precondition).
  § Write it in two columns; visualize 0 = white, 1 = grey.
  § Obviously, both ā (even sub-sequence) and â (odd sub-sequence) consist of two sorted halves → the precondition of the recursive oem calls is met.
  § After line (2), the odd sub-sequence can have at most two 1's more than the even sub-sequence.

(14) § In loop (3), the comparisons [ai : ai+1] for odd i are made, and there can be only 3 cases (the odd sub-sequence has 0, 1, or 2 more 1's than the even one).
§ Afterwards, one of these two situations has been established: the sequence is already sorted, or the last out-of-place pair has just been swapped.
§ Result: the output sequence is sorted.
§ Conclusion: every 0-1 sequence (meeting the preconditions) is sorted correctly.
§ Running time (sequential): C(n) = 2 C(n/2) + n/2 − 1 comparisons, with C(2) = 1, i.e., O(n log n).

(15) § The complete general sorting algorithm:

  oemSort( a0, …, an-1 ):
      if n = 1:
          return
      a0, …, an/2-1 ← oemSort( a0, …, an/2-1 )
      an/2, …, an-1 ← oemSort( an/2, …, an-1 )
      oem( a0, …, an-1 )

§ Running time (sequential): T(n) = 2 T(n/2) + O(n log n), i.e., T(n) ∈ O(n log² n).
§ Note: in a real implementation, no copying is done!
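The full sorter in runnable form; the odd-even merge from the previous slide is repeated here (in a compact variant) so the snippet is self-contained:

```python
# Complete odd-even mergesort (oemSort) as a Python sketch.

def oem(a):
    n = len(a)
    if n <= 2:
        return sorted(a)                 # base case: at most one comparator
    merged = [None] * n
    merged[0::2], merged[1::2] = oem(a[0::2]), oem(a[1::2])
    for i in range(1, n - 2, 2):         # final comparator stage
        if merged[i] > merged[i + 1]:
            merged[i], merged[i + 1] = merged[i + 1], merged[i]
    return merged

def oem_sort(a):
    """Sort a list whose length is a power of 2."""
    n = len(a)
    if n == 1:
        return a
    return oem(oem_sort(a[:n // 2]) + oem_sort(a[n // 2:]))

print(oem_sort([5, 3, 8, 1, 7, 2, 6, 4]))   # → [1, 2, 3, 4, 5, 6, 7, 8]
```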

(16) Mapping the Recursion to a Massively Parallel Architecture

§ Load data onto the GPU (global memory).
§ The CPU executes the following controlling program (N = total size of the input):

  oemSort( n ):
      if n = 1: return
      oemSort( n/2 )
      oem( n, 1 )

  oem( n, stride ):
      if n = 2:
          launch oemBaseCaseKernel( stride )     // launch N (not n) threads
      else:
          oem( n/2, stride*2 )
          launch oemRecursionKernel( stride )

§ With the stride parameter, we can achieve sorting "in situ".

(17) § The kernel for line (3) of the original function oem():

  oemRecursionKernel( stride ):
      if tid < stride || tid ≥ n - stride:
          output SortData[tid]                   // pass through
      else:
          a_i ← SortData[tid]
          a_j ← SortData[tid + stride]
          if tid/stride is even:
              output max( a_i, a_j )
          else:
              output min( a_i, a_j )

§ As usual, tid = thread ID = 0, …, n-1.

(18) § Kernel for line (1) of the function oem():

  oemBaseCaseKernel( stride ):
      i = tid                           // = thread ID
      if tid/stride is even:            // are we on the even or the odd side?
          j = i + stride
      else:
          j = i - stride
      a0 ← SortData[i]                  // SortData = global array
      a1 ← SortData[j]
      if on even side:
          SortData[i] = min( a0, a1 )   // write output back
      else:
          SortData[i] = max( a0, a1 )

§ Reminder: this kernel is executed in parallel for each index tid = 0, …, n-1.
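To see what this data-parallel kernel does, here is a sequential Python simulation over all thread IDs (an illustration, not the CUDA code; `tid // stride` stands in for the GPU's integer division):

```python
# Sequential simulation of oemBaseCaseKernel across all "threads".

def oem_base_case_kernel(data, stride):
    out = list(data)
    n = len(data)
    for tid in range(n):                              # each iteration = one thread
        side_even = (tid // stride) % 2 == 0          # even or odd side?
        j = tid + stride if side_even else tid - stride
        a0, a1 = data[tid], data[j]
        out[tid] = min(a0, a1) if side_even else max(a0, a1)
    return out

print(oem_base_case_kernel([3, 1, 4, 2], 1))   # → [1, 3, 2, 4]
```

With stride 1, every adjacent pair ends up sorted, which is exactly the base-case comparator [a0 : a1] applied across the whole array.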

(19) § Depth complexity: D(n) = D(n/2) + log n = ½ log n (log n + 1), i.e., O(log² n) passes.
§ E.g., for 2^20 elements this amounts to 210 passes.

(20) Bitonic Sorting

§ Definition (bitonic sequence): A sequence of numbers a0, …, an-1 is bitonic ⇔ there is an index i such that
  - a0, …, ai is monotonically increasing, and
  - ai+1, …, an-1 is monotonically decreasing;
  OR there is a cyclic shift of this sequence such that this is the case.
§ Because of the latter "OR", we understand all index arithmetic in the following modulo n, and/or we assume in the following that the sequence(s) have been cyclically shifted as described above.

(21) § Examples of bitonic sequences:
  § 0 2 4 8 10 9 7 5 3; also: 2 4 8 10 9 7 5 3 0; also: 4 8 10 9 7 5 3 0 2; …
  § 10 12 14 20 95 90 60 40 35 23 18 0 3 5 8 9
  § 1 2 3 4 5
  § the empty sequence []
  § 0 0 0 0 0 1 1 1 1 1 0 0 0 0; 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1; 1 1 1 1 1 0 0 0 0 0; 0 0 0 0 1 1 1 1 1
§ These sequences are NOT bitonic:
  § 1 2 3 1 2 3
  § 1 2 3 0 1 2
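The definition translates directly into a (brute-force) test: a sequence is bitonic iff some cyclic shift is increasing-then-decreasing. A Python sketch, checked against the examples above:

```python
# Brute-force bitonic test: try all n cyclic shifts.

def is_bitonic(a):
    n = len(a)
    if n <= 2:
        return True                    # includes the empty sequence
    def inc_then_dec(s):
        i = 1
        while i < len(s) and s[i - 1] <= s[i]:   # rising part
            i += 1
        while i < len(s) and s[i - 1] >= s[i]:   # falling part
            i += 1
        return i == len(s)
    return any(inc_then_dec(a[k:] + a[:k]) for k in range(n))

print(is_bitonic([0, 2, 4, 8, 10, 9, 7, 5, 3]))   # → True
print(is_bitonic([1, 2, 3, 1, 2, 3]))             # → False
```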

(22) § Visual representation of bitonic sequences: plotting value over index 0, …, n-1 gives one rising and one falling part.
§ Because of the "modulo" index arithmetic, we can also visualize them on a circle or cylinder (lanes a0, a1, …, an/2, … arranged around the circle).
§ Clearly, bitonic sequences have exactly two inflection points.

(23) Properties of Bitonic Sequences

§ Any sub-sequence of a bitonic sequence is a bitonic sequence.
  § More precisely: assume a0, …, an-1 is bitonic and we have indices 0 ≤ i1 ≤ i2 ≤ … ≤ im < n. Then a_i1, a_i2, …, a_im is bitonic, too.
§ If a0, …, an-1 is bitonic, then an-1, …, a0 is bitonic, too.
§ If we mirror a bitonic sequence "upside down", then the new sequence is bitonic, too.
§ A bitonic sequence has exactly one local minimum and one local maximum.

(24) Some Notions and Definitions

§ More precise graphical notation of a comparator: inputs a, b; outputs min(a,b) on the upper wire and max(a,b) on the lower wire.
§ Definition (rotation operator): Let a = (a0, …, an-1) and j ∈ [1, n-1]. We define the rotation operator Rj acting on a as
  Rj a = (aj, aj+1, …, aj+n-1)
  (indices mod n).

(25) § Definition (L / U operator):
  La = ( min(a0, an/2), …, min(an/2-1, an-1) )
  Ua = ( max(a0, an/2), …, max(an/2-1, an-1) )
§ Lemma: The L/U operators are rotation invariant, i.e.,
  La = R-j L Rj a   and   Ua = R-j U Rj a
  (remember that indices are always meant mod n).
§ Proof: We need to show that Rj La = L Rj a. This is trivially the case:
  L Rj a = ( min(aj, aj+n/2), …, min(an/2-1, an-1), …, min(aj-1, aj-1+n/2) ) = Rj La
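A small Python sketch of the L and U operators, including a numerical check of the rotation invariance (the outer rotation acts on sequences of length n/2, so it is taken mod n/2):

```python
# L/U operators and rotation invariance.

def L(a):
    h = len(a) // 2
    return [min(a[i], a[i + h]) for i in range(h)]

def U(a):
    h = len(a) // 2
    return [max(a[i], a[i + h]) for i in range(h)]

def rot(a, j):                 # Rj: cyclic rotation by j
    return a[j:] + a[:j]

a = [0, 2, 4, 8, 10, 9, 7, 5]          # a bitonic sequence, n = 8
print(L(a), U(a))                      # → [0, 2, 4, 5] [10, 9, 7, 8]

# Rotation invariance: L(Rj a) is just a rotation of L(a).
assert all(L(rot(a, j)) == rot(L(a), j % 4) for j in range(8))
```

Note that on this example both La and Ua are again bitonic and max(La) = 5 ≤ min(Ua) = 7, anticipating Theorem 1 below.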

(26) § Definition (half-cleaner): A network that takes a as input and outputs (La, Ua) is called a half-cleaner.
§ The network that realizes a half-cleaner: comparators [ai : ai+n/2] for i = 0, …, n/2-1; wires 0, …, n/2-1 output La, wires n/2, …, n-1 output Ua.
§ Because of the rotation invariance, we can depict a half-cleaner on a circle: it always produces La and Ua, no matter how a is rotated around the circle!

(27) § Theorem 1: Given a bitonic input sequence a, the output of a half-cleaner has the following properties:
  1. La and Ua are bitonic, too;
  2. max{La} ≤ min{Ua}.

(28) Proof

§ The half-cleaner does the following:
  1. Shift (only conceptually) the right half of a over to the left.
  2. Take the point-wise min/max → La, Ua.
  3. Shift Ua back to the right.
§ Because a is bitonic, there can be only one cross-over point.
§ By construction, both La and Ua must have length n/2.
§ Property 1 follows from the sub-sequence property.

(29) The Bitonic Merger

§ The half-cleaner is the basic (and only) building block for the bitonic sorting network!
§ The recursive definition of a bitonic merger BM↑(n):
  § Input: a bitonic sequence a0, …, an-1.
  § One half-cleaner stage produces La (upper half) and Ua (lower half).
  § Recurse with BM↑(n/2) on each half.
  § Output: the sequence sorted in ascending order.
§ Analogously, we can define BM↓(n) for descending order.
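A runnable Python sketch of the recursive bitonic merger: one half-cleaner stage, then recursion on both halves (an illustration of the construction, not GPU code):

```python
# Bitonic merger: half-cleaner stage + recursion.

def bitonic_merge(a, ascending=True):
    """`a` must be bitonic; returns it fully sorted."""
    a = list(a)
    n = len(a)
    if n == 1:
        return a
    h = n // 2
    for i in range(h):                      # the half-cleaner stage
        if (a[i] > a[i + h]) == ascending:
            a[i], a[i + h] = a[i + h], a[i]
    return (bitonic_merge(a[:h], ascending) +
            bitonic_merge(a[h:], ascending))

print(bitonic_merge([0, 2, 4, 8, 10, 9, 7, 5]))   # → [0, 2, 4, 5, 7, 8, 9, 10]
```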

(30) Visualization of a Bitonic Merger

(31) Mapping to a Massively Parallel Architecture

§ We have n = 2^k "lanes" = threads.
§ At each step, each thread needs to figure out its partner for compare/exchange.
§ This can be done by considering the ID of each thread in binary:
  § At step j, j = 1, …, k: partner ID = own ID with bit (k-j) flipped.
§ Example for k = 3 (IDs 000, …, 111): in step 1, partners differ in bit k-1 (000↔100, 001↔101, …); in step 2, in bit k-2 (000↔010, …); in step 3, in bit k-3 (000↔001, …).
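Flipping one bit is a single XOR with a one-bit mask, so the partner computation is trivial per thread. A small sketch (the function name `partner` is an illustration):

```python
# Partner lane at step j (1-based): flip bit (k - j) of the own thread ID.

def partner(tid, step, k):
    return tid ^ (1 << (k - step))

k = 3
print([partner(t, 1, k) for t in range(8)])   # → [4, 5, 6, 7, 0, 1, 2, 3]
print([partner(t, 3, k) for t in range(8)])   # → [1, 0, 3, 2, 5, 4, 7, 6]
```

The relation is symmetric: the partner of my partner is myself, so both threads of a pair agree on whom to compare with.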

(32) The Bitonic Sorter

§ The recursive definition of a bitonic sorter BS↑(n):
  § Input: an unsorted sequence a0, …, an-1.
  § BS↑(n/2) sorts the first half ascending, BS↓(n/2) sorts the second half descending → the whole sequence is bitonic.
  § BM↑(n) then merges it into a sorted sequence.
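The complete sorter in runnable form; the merger is repeated here so the snippet is self-contained. The input is the example sequence used on the following slides:

```python
# Recursive bitonic sorter: sort the halves in opposite directions
# (making the whole sequence bitonic), then bitonic-merge.

def bitonic_merge(a, ascending=True):
    a = list(a)
    n = len(a)
    if n == 1:
        return a
    h = n // 2
    for i in range(h):                      # half-cleaner stage
        if (a[i] > a[i + h]) == ascending:
            a[i], a[i + h] = a[i + h], a[i]
    return (bitonic_merge(a[:h], ascending) +
            bitonic_merge(a[h:], ascending))

def bitonic_sort(a, ascending=True):
    """len(a) must be a power of 2."""
    n = len(a)
    if n == 1:
        return list(a)
    h = n // 2
    left  = bitonic_sort(a[:h], True)       # ascending half
    right = bitonic_sort(a[h:], False)      # descending half -> whole is bitonic
    return bitonic_merge(left + right, ascending)

print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))   # → [1, 2, 3, 4, 5, 6, 7, 8]
```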

(33) Visualizing Bitonic Sorting on a Linear Array

§ Initial data sequence.
§ 1: Sort the half-arrays in opposite directions.
§ 2: Compare the half-arrays.
§ 3: Send the larger item of each pair to the right.
§ Perform steps 2 & 3 recursively on each half.

(34) Example Bitonic Sorting Network

§ Lanes (threads) are labeled 0000, …, 1111; blue box = low-to-high sorter, red box = high-to-low sorter; the network consists of stages 1 to 4.

(35) Example Run

§ Input: 3 7 4 8 6 2 1 5
§ 8x monotonic lists: (3) (7) (4) (8) (6) (2) (1) (5)
§ 4x bitonic lists: (3,7) (4,8) (6,2) (1,5)

(36) § Sort the bitonic lists (each list = 2 elements → trivially bitonic).

(37) § After stage 1: 3 7 8 4 2 6 5 1
§ 4x monotonic lists: (3,7) (8,4) (2,6) (5,1)
§ 2x bitonic lists: (3,7,8,4) (2,6,5,1)

(38) § Sort the bitonic lists.

(39) § After stage 2: 3 4 7 8 6 5 2 1
§ 2x monotonic lists: (3,4,7,8) (6,5,2,1)
§ 1x bitonic list: (3,4,7,8,6,5,2,1)

(40) § Sort the bitonic lists.

(41) § Sort the bitonic lists (continued).

(42) Done!

§ Wire-by-wire trace (columns = wire contents after each comparator step):

  3  3  3  3  3  2  1
  7  7  4  4  4  1  2
  4  8  8  7  2  3  3
  8  4  7  8  1  4  4
  6  2  5  6  6  6  5
  2  6  6  5  5  5  6
  1  5  2  2  7  7  7
  5  1  1  1  8  8  8

(43) Complexity of the Bitonic Sorter

§ Depth complexity (= parallel time complexity):
  § Bitonic merger: O(log n)
  § Bitonic sorter: O(log² n)
§ Work complexity of the bitonic merger:
  § Here, this means the number of comparators C(n).
  § Recursive equation for C: C(n) = 2 C(n/2) + n/2, with C(2) = 1.
  § Overall: C(n) = ½ n log n.
§ Remark: there must be some redundancy in the sorting network, because we know (from merge sort) that n comparisons are sufficient for merging two sorted sequences.
§ Reason for the redundancy? → Because the network is data-independent!
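The closed form can be checked numerically against the recurrence:

```python
# Numerical check that C(n) = 2 C(n/2) + n/2 with C(2) = 1
# solves to C(n) = (n/2) * log2(n).

def C(n):
    return 1 if n == 2 else 2 * C(n // 2) + n // 2

for k in range(1, 11):
    n = 2 ** k
    assert C(n) == n // 2 * k    # (n/2) * log2(n)
print(C(1024))   # → 5120
```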

(44) Remarks on Bitonic Sorting

§ Probably the most well-known parallel sorting algorithm / network.
§ Fastest algorithm for "small" arrays.
§ The lower bound on depth complexity for parallel sorting is O( n log n / n ) = O( log n ), assuming we have n processors.

(45) § A nice property: comparators in a bitonic sorter network only ever compare lanes whose labels (= binary lane numbers) differ by exactly one bit!
§ Consequences for the implementation:
  § One kernel for all threads.
  § Each thread only needs to determine which bit of its own thread ID to "flip" → gives the "other" lane with which to compare.
§ Hence, bitonic sorting is sometimes pictured as well suited for a log(n)-dimensional hypercube parallel architecture:
  § Each node of the hypercube = one processor.
  § Each processor is connected directly to log(n) other processors.
  § In each step, each processor talks to one of its direct neighbors.

(46) Optimal Sorting Networks

§ Optimal = minimal depth.
§ Known up to depth 11 [2013], and depth 40 [2014].
§ Optimal depth d = 9 for n = 16.
§ Would it improve performance on the GPU?
§ [Figure: optimal sorting network for n = 16]

(47) Adaptive Bitonic Sorting

§ Theorem 2: Let a be a bitonic sequence. Then we can always find an index q such that
  max{ aq, …, aq+n/2-1 } ≤ min{ aq+n/2, …, aq-1 }
  (indices mod n).

(48) § Sketch of proof:
§ Assume (for the sake of simplicity) that all elements of a are distinct.
§ Imagine the bitonic sequence as a "line" on a cylinder.
§ Since a is bitonic → only two inflection points → each horizontal plane cuts the sequence at exactly 2 points, and both sub-sequences are contiguous.
§ Use the median m as the "cut plane" → each sub-sequence has length n/2, and max("lower sequence") ≤ m ≤ min("upper sequence").
§ The index of m is exactly the index q of Theorem 2.
§ These two sub-sequences must be La and Ua, resp.

(49) § Visualization of the theorem: [Figure: values over indices 0, q, q+n/2, n-1, with the median m as a horizontal cut line]
§ Theorem 3: Any bitonic sequence a can be partitioned into four sub-sequences (a1, a2, a3, a4) = a such that
  |a1| + |a2| = |a3| + |a4| = n/2,  |a1| = |a3|,  |a2| = |a4|
  and either (La, Ua) = (a1, a4, a3, a2) or (La, Ua) = (a3, a2, a1, a4).

(50) Visual "Proof"

§ Claim: |a1| + |a2| = |a3| + |a4| = n/2, |a1| = |a3|, |a2| = |a4|, and either (La, Ua) = (a1, a4, a3, a2) or (La, Ua) = (a3, a2, a1, a4).
§ 1. Input sequence (bitonic, with median m).
§ 2. Find q and partition the sequence at indices q and q+n/2 into a1, a2, a3, a4.
§ 3. Swap parts.
§ 4. Result: (La, Ua).

(51) Complexity

§ Finding the median in a bitonic sequence → log n steps.
§ Remark: this algorithm is no longer data-independent!
§ Depth complexity: → exercise / research.
§ Work complexity of the adaptive bitonic merger (number of comparisons):
  C(n) = 2 C(n/2) + log n = Σ_{i=0}^{log n - 1} 2^i log(n/2^i) = 2n − log n − 2
§ This is optimal!
§ A trick is needed to avoid actually copying the sub-sequences.
  - Otherwise the total complexity of an ABM(n) would be O(n log n).
§ Trick = bitonic tree (see the original paper for details).

(52) How to Find the Median in a Bitonic Sequence

§ We have median(a) = min(Ua) or median(a) = max(La) (depending on the definition of the median).
§ Finding the minimum in a bitonic sequence takes log(n) steps (binary search).

(53) Overall Algorithm for Adaptive Bitonic Sorting

§ Same as bitonic sorting, except we replace the half-cleaner by
  1. finding the median, and
  2. swapping sub-sequences (only conceptually).

  adaptiveBitonicSort( a0, …, an-1 ):
      do parallel:
          sort a0, …, an/2-1 ascending
          sort an/2, …, an-1 descending
      adaptiveBitonicMerge( a0, …, an-1 )

  adaptiveBitonicMerge( a0, …, an-1 ):
      precondition: a0, …, an-1 is bitonic
      find index q of the median
      swap sub-sequences as per Theorem 2 and its proof
      do parallel:
          adaptiveBitonicMerge( a0, …, an/2-1 )
          adaptiveBitonicMerge( an/2, …, an-1 )
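An illustrative Python sketch of the adaptive bitonic merge (assumptions: q is found by brute force instead of the O(log n) median search, and the sub-sequence "swap" is done by plain list copying rather than the bitonic-tree trick):

```python
# Adaptive bitonic merge, simplified for illustration.

def adaptive_bitonic_merge(a):
    """`a` must be bitonic; returns the sorted sequence."""
    n = len(a)
    if n == 1:
        return a
    h = n // 2
    # Theorem 2: find q such that the cyclic block a_q .. a_{q+h-1}
    # holds the n/2 smallest elements.
    for q in range(n):
        block = [a[(q + i) % n] for i in range(h)]
        rest  = [a[(q + h + i) % n] for i in range(h)]
        if max(block) <= min(rest):
            break
    # Theorem 3: reassemble (La, Ua) by swapping two sub-sequences.
    if q < h:
        La = a[h:h + q] + a[q:h]          # (a3, a2)
        Ua = a[0:q] + a[h + q:n]          # (a1, a4)
    else:
        p = q - h
        La = a[0:p] + a[q:n]              # (a1, a4)
        Ua = a[h:q] + a[p:h]              # (a3, a2)
    return adaptive_bitonic_merge(La) + adaptive_bitonic_merge(Ua)

print(adaptive_bitonic_merge([3, 7, 8, 4]))   # → [3, 4, 7, 8]
```

Note that only the q search touches the data; no n/2-comparator half-cleaner stage is executed, which is where the work saving of the adaptive variant comes from.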

(54) Topics for Master's Theses

§ There are lots of different parallel sorting algorithms.
§ What is the performance of Adaptive Bitonic Sorting using CUDA?
§ Do you love algorithms? Thinking about them? Proving properties? Implementing them super-fast?
§ Then we should talk about a possible master's thesis topic! 😀

(55) Applications of Sorting: BVH Construction

§ Bounding volume hierarchies (BVHs): a very important data structure for accelerating geometric queries.
§ Applications: ray-scene intersection, collision detection, spatial databases, etc.
§ Database people usually call it an "R-tree".
§ Frequently used types of bounding volumes (BVs): sphere; box / AABB (R*-trees); k-DOPs / slabs; OBB (oriented bounding box).

(56) Bounding Volume Hierarchy

§ Schematic example; three levels of a k-DOP BVH.

(57) Example Application of BVHs: Collision Detection

§ Two objects, each with its own BVH (nodes A, B, …, G and 1, 2, …, 7):

  traverse( node X, node Y ):
      if X, Y do not overlap:
          return
      if X, Y are leaves:
          check all pairs of polygons
      else:
          for all children pairs do:
              traverse( Xi, Yj )

(58) Parallel Construction of BVHs

§ First idea: linearize 3D points/objects by a space-filling curve.
§ Definition (curve): A curve (with endpoints) is a continuous function with domain in the unit interval [0, 1] and range in some d-dimensional space.
§ Definition (space-filling curve): A space-filling curve is a curve whose range covers the entire 2-dimensional unit square (or, more generally, an n-dimensional hypercube).

(59) Examples of Space-Filling Curves

§ Peano curve; Hilbert curve; Z-order curve (a.k.a. Morton curve); Z-order curve in 3D.

(60) § Benefit: a space-filling curve gives a mapping from every point in the unit square onto a point in the unit interval.
§ Example: Z-order for point data; the Z-order maps quadrants recursively in the order NW, NE, SW, SE.
§ At least, the limit curve does that …
§ We can construct a "space-filling" curve only up to some specific (recursion) level, i.e., in practice space-filling curves are never really space-filling.

(61) Construction of the Z-Order Curve in 3D

1. Choose a level k.
2. Construct a regular lattice of points in the unit cube, with 2^k points along each dimension.
3. Represent the coordinates of a lattice point p as integer/binary numbers, i.e., k bits per coordinate, e.g., px = bx,k … bx,1.
4. Define the Morton code of p as the interleaved bits of the coordinates, i.e., m(p) = bz,k by,k bx,k … bz,1 by,1 bx,1.
5. Connect the points in the order of their Morton codes → z-order curve at level k.
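Step 4 is plain bit interleaving. A Python sketch of the 3D Morton code, plus the 2D variant (bits interleaved as yxyx) matching the lattice example on the next slide:

```python
# Morton codes by bit interleaving.

def morton3(x, y, z, k):
    """3D Morton code of k-bit lattice coordinates: ... bz by bx per level."""
    code = 0
    for bit in range(k):
        code |= ((x >> bit) & 1) << (3 * bit)        # bx -> position 3*bit
        code |= ((y >> bit) & 1) << (3 * bit + 1)    # by
        code |= ((z >> bit) & 1) << (3 * bit + 2)    # bz
    return code

def morton2(x, y, k):
    """2D variant: interleave as ... by bx (yxyx pattern)."""
    return sum((((x >> b) & 1) << (2 * b)) | (((y >> b) & 1) << (2 * b + 1))
               for b in range(k))

print(bin(morton2(0b10, 0b11, 2)))   # → 0b1110
```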

(62) Example (in 2D)

§ 4×4 lattice at the lowest level; each cell is labeled with its interleaved code (pattern yxyx), x = 00, 01, 10, 11 from left to right, y = 00, 01, 10, 11 from bottom to top:

  1010  1011  1110  1111
  1000  1001  1100  1101
  0010  0011  0110  0111
  0000  0001  0100  0101

(63) § Note: the Z-curve induces a grid (actually, a complete quadtree).

(64) Properties of Morton Codes

§ The Morton code of each point is 3k bits long.
§ All points p with Morton code m(p) = 0xxx… lie below the plane z = 0.5.
§ All points with m(p) = 111xxx… lie in the upper right octant of the cube.
§ If we build a quadtree/octree on top of the grid, then the Morton code encodes the path of a point from the root to the leaf that contains the point ("0" = left, "1" = right).
§ The Morton codes of two points differ for the first time, when read from left to right, at bit position h ⇔ the paths in the binary tree over the grid split at level h.

(65) Construction of Linear BVHs

§ Scale all polygons such that the bounding box of the scene = unit cube.
§ Replace polygons by their "center point".
§ E.g., center point = barycenter, or center point = center of the polygon's bbox.

(66) § Assign Morton codes to the points according to their enclosing grid cell.
§ Assign those Morton codes to the original polygons, too.

(67) § Now we have a list of pairs ⟨polygon ID, Morton code⟩.
§ Example (unsorted): Morton codes 1010, 1000, 1001, 0010, 0000, 1110, 0011, 1101.
§ Sort the list according to the Morton code, i.e., along the z-curve → linearization:

  Morton code:  0000  0010  0011  1000  1001  1010  1101  1110
  Array index:     0     1     2     3     4     5     6     7

§ Next: find index intervals representing BVH nodes at different levels.

(68) § Root of the BVH = polygons in index range 0, …, N-1.
§ All polygons with first bit of the Morton code = 0/1 are below/above the plane z = 0.5, resp.
§ In the sorted array, find the index i where the first bit (MSB) changes from "0" to "1":
  § left child of the root = polygons in index range 0, …, i-1,
  § right child of the root = polygons in index range i, …, N-1.
§ In general (recursive formulation):
  § Given: level h, and an index range i, …, j in the sorted array such that the Morton codes are identical for all polygons in that range up to bit h.
  § Find the index k in [i, j] where the bit at position h' (h' > h) of the Morton codes changes from "0" to "1".
  § This can be achieved quickly by binary search and CUDA's __clz() function (= "count number of leading zeros").
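A Python sketch of this split search (an illustration of the idea, not the CUDA code; `clz` is a stand-in for `__clz` on w-bit words, and the function name `find_split` is an assumption):

```python
# Find the split index in a range of sorted Morton codes.

def clz(x, w):
    """Count leading zeros of x in a w-bit word."""
    return w - x.bit_length()

def find_split(codes, i, j, w):
    """Return k in (i, j] where the highest differing bit flips from 0 to 1."""
    prefix = clz(codes[i] ^ codes[j], w)   # length of the common prefix
    lo, hi = i, j
    while lo + 1 < hi:                     # binary search for the flip index
        mid = (lo + hi) // 2
        if clz(codes[i] ^ codes[mid], w) > prefix:
            lo = mid                       # mid still shares a longer prefix with i
        else:
            hi = mid
    return hi

codes = [0b0000, 0b0010, 0b0011, 0b1000, 0b1001, 0b1010, 0b1101, 0b1110]
print(find_split(codes, 0, 7, 4))   # → 3  (the MSB changes from 0 to 1 here)
```

This is the same XOR-plus-count-leading-zeros pattern the slide describes: the XOR of two codes has its first set bit exactly where the codes first differ.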

(69)
§ Consider arbitrary polygons at positions i and i+1 in the array
§ Condition for "same node": polygons i and i+1 are in the same node of the BVH at level h ⇔ their Morton codes are identical up to bit h
§ Define a split marker := ⟨index i, level h⟩
§ Parallel computation of all split markers ⟶ "split list":
§ Each thread i checks polygons i and i+1
§ Compares their Morton codes from left to right ⟶ h = left-most bit position where the two Morton codes differ
  - Can be calculated in one step using XOR and __clz
§ Outputs split markers ⟨i,h⟩, …, ⟨i,3k⟩ (seems like a bit of overkill)
§ There can be at most 3k split markers per thread ⟶ static memory allocation works
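The per-thread work can be sketched sequentially like this (Python's `bit_length` plays the role of `__clz` on a fixed word width):

```python
def split_markers(codes, nbits):
    """For each adjacent pair (i, i+1) of sorted Morton codes, emit the split
    markers (i, h), ..., (i, nbits), where h is the first bit position (1-based,
    counted from the MSB) at which the two codes differ; h is found in one step
    via XOR and count-leading-zeros."""
    markers = []
    for i in range(len(codes) - 1):
        diff = codes[i] ^ codes[i + 1]
        h = nbits - diff.bit_length() + 1  # 1-based position of first differing bit
        markers.extend((i, level) for level in range(h, nbits + 1))
    return markers
```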

(70)
§ Example (sorted Morton codes, 4 bits):
  Array index i:   0    1    2    3    4    5    6    7
  Morton code:  0000 0010 0011 1000 1001 1010 1101 1110
§ Split markers ⟨i,h⟩ with i ∈ [0,N-1], h ∈ [1,3k], emitted per thread:
  thread 0: (0,3) (0,4)
  thread 1: (1,4)
  thread 2: (2,1) (2,2) (2,3) (2,4)
  thread 3: (3,4)
  thread 4: (4,3) (4,4)
  thread 5: (5,2) (5,3) (5,4)
  thread 6: (6,3) (6,4)

(71)
§ Last steps:
  1. Compact the split list
  2. Sort the split list by level h
     - Must be a stable sort!
§ For each level h, we now have ranges of indices in the resulting list; all primitives within a range are in the same node on that level h
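Step 2 can be illustrated with Python's built-in sort, which is stable, so markers with equal level h keep their relative (index) order, which is exactly the property required here. The split list below is a hypothetical compacted example:

```python
# Hypothetical compacted split list: (index i, level h) markers in array order
splits = [(0, 3), (0, 4), (1, 4), (2, 1), (2, 2), (2, 3), (2, 4), (3, 4)]

# Stable sort by level h only; equal-h markers stay ordered by index
splits.sort(key=lambda m: m[1])
```

On the GPU, a stable radix sort by the level key achieves the same guarantee.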

(72)
§ Example: [figure: the split list of slide (70) after compaction and stable sorting by level h]

(73)
§ Final steps:
§ Remove singleton BVH nodes
§ Compute bounding boxes for each node (i.e., index interval)
§ Convert to a "regular" BVH with pointers
§ Limitations:
§ Not optimized for ray tracing
§ Morton code only approximates locality

(74) Application of Collision Detection (Video)

(75) Collision Detection Without Auxiliary Data Structures
§ Goal: collision detection for deformable objects
§ Consequence: auxiliary (acceleration) data structures could potentially slow down the whole method
§ Given: a large set of AABBs (each enclosing one polygon)
§ Sought: pairs of AABBs that intersect (overlap) = potentially intersecting pairs of polygons
§ Could be boxes of different objects ⟶ regular collision detection
§ Could be boxes of the same object ⟶ self-collision / self-intersection
§ Simplification here: ignore the problem of box pairs whose triangles are adjacent in the same mesh
§ These need to be filtered out before doing the actual intersection tests

(76)
§ General idea: dimension reduction by plane sweep
§ Sweep a plane through space along an axis
§ Consider only boxes that intersect that plane
§ Check intersection of those boxes in 2D
§ Alternative description:
§ Project all boxes onto the (sweep) axis ⟶ set of intervals
§ Find pairs of intervals that overlap
§ The sweep/projection axis can be chosen arbitrarily
[Figure: boxes in the x/y plane; their projected interval endpoints on the x-axis appear in the order S1 S2 E1 E2 S3 S4 E4 E3]

(77) The Algorithm

parallel for all triangles:
    compute AABB
sort all end points Si and Ei of all AABBs in one common array
    (key = x-coord., value = triangle ID)
create list C of overlapping intervals                    (x)
parallel for all pairs in C:                              (xx)
    perform AABB overlap test along y- and z-axis
    if no overlap: remove pair from list C
perform stream compaction on C
parallel for all pairs (Ti, Tj) in C:
    if (Ti, Tj) share an edge: remove pair from C
    if (Ti, Tj) do not intersect: remove pair from C
perform stream compaction on C
output C

§ Remark 1: it is faster to omit the overlap test on the z-axis
§ Remark 2: we will compute (x) and (xx) at the same time
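The core of step (x) is finding all pairs of overlapping 1D intervals. A sequential reference sketch via endpoint sorting and a sweep (the parallel formulation replaces the running "active set" with the prefix-sum trick described on the following slides):

```python
def overlapping_pairs(intervals):
    """Return all pairs (i, j) of interval indices whose [s, e] ranges overlap,
    found by sorting the 2n endpoints and sweeping once left to right."""
    events = []
    for tid, (s, e) in enumerate(intervals):
        events.append((s, 0, tid))  # 0 = start point (sorts before an end at same coord)
        events.append((e, 1, tid))  # 1 = end point
    events.sort()
    active, pairs = set(), []
    for _, kind, tid in events:
        if kind == 0:
            pairs.extend((other, tid) for other in active)  # overlaps every open interval
            active.add(tid)
        else:
            active.discard(tid)
    return pairs
```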

(78) Step (x): create list C of overlapping intervals
§ Idea:
§ Consider all starting points Si
§ Find all intervals [Sj, Ej] with Si ∈ [Sj, Ej]
§ Do not consider the endpoints Ei, otherwise each overlapping pair would be found twice
§ Naïve parallelization: one thread per triangle
§ Thread starts at the position of "its" Si in the sorted array of start/end points
§ Scans the array from there to the right
§ Goal for parallelization: one thread per overlapping pair
§ Problem: number of threads and amount of memory for C is unknown

(79)
§ Trick: prefix sum over the array of start/end flags
§ Similar to the "split" operation in radix sort
§ Also, all triangles know their start/end index in the sorted array of endpoints
§ Number of threads for triangle i = number of potentially overlapping intervals/boxes = pT[Ei] − pT[Si] − 1, where pT = exclusive prefix sum over the start flags
§ A reduction over these counts yields the total number of threads
§ Example (cf. Figure 5.12: determination of the minimal number of threads needed to identify all possible colliding primitive pairs, and the worst-case memory usage to store all these pairs):
  Sorted endpoints:  SA SC SB EC EA EB SD ED
  Array index:        0  1  2  3  4  5  6  7
  Start flag:         1  1  1  0  0  0  1  0
  Prefix sum pT:      0  1  2  3  3  3  3  4
  Triangle:               A  B  C  D
  Start index S:          0  2  1  6
  End index E:            4  5  3  7
  pT[E] − pT[S] − 1:      2  0  1  0   ⟶ total number of threads = 3
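The thread-count trick can be reconstructed sequentially as follows (on the GPU, the prefix sum and the final reduction are the usual parallel primitives; the dictionary lookups stand in for each triangle knowing its own start/end index):

```python
def threads_per_triangle(sorted_endpoints):
    """Given the sorted endpoint array as (triangle ID, is_end) tuples, return for
    each triangle its number of candidate partners = pT[E] - pT[S] - 1, where pT
    is the exclusive prefix sum over the start flags."""
    pT, count = [], 0
    for _, is_end in sorted_endpoints:
        pT.append(count)              # exclusive prefix sum of the start flags
        count += (0 if is_end else 1)
    s_idx, e_idx = {}, {}
    for pos, (tid, is_end) in enumerate(sorted_endpoints):
        (e_idx if is_end else s_idx)[tid] = pos
    return {tid: pT[e_idx[tid]] - pT[s_idx[tid]] - 1 for tid in s_idx}
```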

(80) The Cluster-PCA-Based Sweep Plane Method
§ Problem: sweep plane method ⟶ dimension reduction by projection ⟶ false positives
§ Idea: utilize the fact that the sweep/sorting axis can be chosen arbitrarily
§ Use the axis that minimizes the number of overlapping projected AABBs ⟶ heuristic: longest axis of the PCA (= axis of largest variation)
§ Further problem: could still produce lots of false positives (e.g., triangles of the front side vs. triangles of the back side of a thin object)

(81)
§ Second idea to further reduce false positives: partition the objects into clusters, perform the previous method in parallel for all clusters
[Figure: a mesh partitioned into 2 clusters and into 16 clusters]
§ Problem: need to find overlapping AABBs between clusters, too
§ Solution: assign polygons along cluster borders to both clusters
§ Method: fuzzy c-means (a variant of the k-means algorithm)

(82) Overall Algorithm

parallel for all triangles:
    compute center points
subdivide scene into c (overlapping) clusters
parallel for all clusters:
    compute PCA
    transform all points into PCA coord. system
    perform rest of collision detection as before

(83) Results: Cloth on Ball Benchmark
§ Cloth (92k triangles) drops down on a rotating ball (760 triangles)

(84) Results: Funnel Benchmark
§ Ball (1.7k triangles) pushes a cloth (14k triangles) through a funnel (2k triangles)

(85) Movie

(86) Faster Ray-Tracing by Sorting
§ Recap: the principle of ray-tracing
§ Shoot one (or many) primary rays per pixel into the scene
§ Find the first intersection (can be accelerated, e.g., by a 3D grid)
§ Generate secondary rays (in order to collect light from all different directions)
§ Recursion ⟶ ray tree
§ Ray-tracing is "embarrassingly parallel":
§ Just start one thread per primary ray
§ Or is it that simple?

(87)
§ The ray tree for one primary ray:
[Figure: ray tree with shadow rays and reflection rays; Ei = primary ray, Ri = reflected ray, Si = shadow ray, Ti = transmitted ray]
§ Problem for massive parallelization by one thread per primary ray:
§ Each thread traverses its own ray tree
§ The set of rays currently being followed by all active threads goes in all kinds of different directions
§ Consequence: thread divergence!
§ Another problem: each thread needs its own stack!

(88)
§ Definition of coherent rays:
  Two rays that have approximately the same origin and the same direction are said to be coherent rays.
  A set of coherent rays is sometimes called a coherent ray packet.
§ Observations:
§ Coherent rays are likely to hit the same object in the scene
§ Coherent rays will likely hit the same cells in an acceleration data structure (e.g., grid or kd-tree)

(89) Approach to Solve the Divergence Problem
§ Take an array of rays as input
§ Can be an arbitrary mix of primary, secondary, tertiary, shadow rays, …
§ Arrange them into packets of coherent rays  (in the following, we will look at this step)
§ Compute ray-scene intersections:
§ One thread per ray
§ Each block of threads processes one coherent ray packet
§ Each thread traverses the acceleration data structure
§ At the end of this procedure, each thread generates a number of new rays

(90) Identifying Coherent Rays
§ General approach: classification by discretization
§ Here: compute a (trivial) hash value per ray:
§ Discretize the ray origin by a 3D grid over the scene's bounding box ⟶ first part of the hash value
§ Discretize the ray direction by a direction cube ⟶ second part of the hash value
§ Merge the two parts into one (e.g., 32-bit) hash value
§ Rays that map to the same hash value are considered coherent in 3D space
§ Can be done in parallel for each ray
[Figure 2: the quantization of ray origin and direction used to compute a hash value for a given ray]
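A minimal sketch of such a ray hash, assuming the origin lies inside the scene's bounding box. The exact bit layout, grid resolution, and direction quantization (here: just the dominant axis and its sign, i.e., one of six cube faces) are illustrative choices, not the paper's exact scheme:

```python
def ray_hash(origin, direction, scene_min, scene_size, grid_bits=4, dir_bits=3):
    """Quantize a ray into a hash: origin into a uniform grid over the scene's
    bbox (grid_bits per axis), direction into a coarse 'direction cube' face."""
    h = 0
    for o, mn, sz in zip(origin, scene_min, scene_size):
        cell = min(int((o - mn) / sz * (1 << grid_bits)), (1 << grid_bits) - 1)
        h = (h << grid_bits) | cell
    # dominant axis and its sign select one of 6 cube faces
    axis = max(range(3), key=lambda a: abs(direction[a]))
    face = axis * 2 + (0 if direction[axis] >= 0 else 1)
    return (h << dir_bits) | face
```

Nearby rays with similar directions hash to the same value, while a reversed direction changes the hash.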

(91)
§ Note: often, there are many consecutive rays (in the input array) that are coherent, i.e., that will map to the same ray hash value
§ For instance, shadow rays
§ Multiple secondary rays from glossy surfaces, etc.

(92) Pipeline
§ Can we sort the array of rays yet?
§ We could, but we'd perform way too much work!
§ Idea: "compression – sorting – decompression" (CSD) scheme:
  1. Compress the array (similar to run-length coding)
  2. Sort the compressed chunks
  3. Unpack
§ Since consecutive rays often receive the same hash value, the compressed data is much smaller and sorting (e.g., radix sort [SHG09]) becomes faster
[Figure 3: the overall ray sorting scheme –
  ray IDs 0…19 with their hash values;
  compress ⟶ chunk bases 0 3 7 9 14 17, chunk sizes 3 4 2 5 3 3;
  sort (radix, bitonic) by chunk hash ⟶ bases 3 9 7 17 0 14, sizes 4 5 2 3 3 3;
  decompress ⟶ reordered IDs 3 4 5 6 9 10 11 12 13 7 8 17 18 19 0 1 2 14 15 16]

(93) Compression
1. Set HeadFlags[i] = 1 wherever HashValue[i-1] ≠ HashValue[i] (and HeadFlags[0] = 1), else HeadFlags[i] = 0
2. Apply an exclusive prefix sum to the HeadFlags array ⟶ ScanHeadFlags
   § Now, ScanHeadFlags[i] contains the new position in the chunk arrays
3. For all i with HeadFlags[i] == 1:
   ChunkHash[ ScanHeadFlags[i] ] = HashValue[i]
   ChunkBase[ ScanHeadFlags[i] ] = i
4. Set ChunkSize[i] = ChunkBase[i+1] − ChunkBase[i]
[Figure 4: compression example – head flags mark run starts at indices 0, 3, 7, 9, 14, 17 ⟶ chunk bases 0 3 7 9 14 17, chunk sizes 3 4 2 5 3 3]
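Steps 1–4 can be mirrored sequentially as follows (each loop corresponds to one data-parallel primitive: map, exclusive scan, scatter):

```python
def compress(hashes):
    """Run-length compress a hash array into chunks via head flags + exclusive scan."""
    n = len(hashes)
    # 1. head flags: 1 where a new run of equal hash values starts
    head = [1 if i == 0 or hashes[i] != hashes[i - 1] else 0 for i in range(n)]
    # 2. exclusive prefix sum gives each run head its output position
    scan, acc = [], 0
    for f in head:
        scan.append(acc)
        acc += f
    chunk_hash, chunk_base = [0] * acc, [0] * acc
    # 3. scatter hash value and base index of every run head
    for i in range(n):
        if head[i]:
            chunk_hash[scan[i]] = hashes[i]
            chunk_base[scan[i]] = i
    # 4. sizes from adjacent bases (last chunk ends at n)
    chunk_size = [b2 - b1 for b1, b2 in zip(chunk_base, chunk_base[1:] + [n])]
    return chunk_hash, chunk_base, chunk_size
```

The hash values below are made up, but chosen so the run structure reproduces the bases/sizes of the slide's example.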

(94) Decompression
§ Compute an exclusive prefix sum on ChunkSize ⟶ ScanChunkSize
§ ScanChunkSize[i] contains the first index, in the output array, of the range of ray IDs that chunk i represents
§ Initialize array S ("skeleton") with 1's, initialize array HeadFlags with 0's (both arrays are as large as the Hash values array, i.e., #rays)
[Figure 5: decompression example – sorted chunks with bases 3 9 7 17 0 14 and sizes 4 5 2 3 3 3; Scan(ChunkSize) = 0 4 9 11 14 17]

(95)
§ For all i = 0, …, #chunks−1:
  S[ ScanChunkSize[i] ] = ChunkBase[i]
  HeadFlags[ ScanChunkSize[i] ] = 1
§ Perform an inclusive segmented prefix sum on S, with segment bounds specified by HeadFlags ⟶ SegScan array
§ Each segment starts with a chunk base b and then sums up 1's, producing b, b+1, b+2, …
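The skeleton / segmented-scan trick, written out sequentially (the single loop at the end stands in for the parallel inclusive segmented sum scan):

```python
def decompress(chunk_base, chunk_size, total):
    """Expand sorted chunks back into the reordered ID array: scatter each chunk's
    base at its output start, mark segment heads, then a segmented inclusive sum
    over a skeleton of 1's yields base, base+1, base+2, ... per segment."""
    starts, acc = [], 0
    for s in chunk_size:          # exclusive prefix sum of the chunk sizes
        starts.append(acc)
        acc += s
    skel = [1] * total
    flags = [0] * total
    for st, base in zip(starts, chunk_base):
        skel[st] = base           # scatter chunk base into the skeleton
        flags[st] = 1             # segment head
    out = []
    for v, f in zip(skel, flags): # inclusive segmented sum scan
        out.append(v if f else out[-1] + v)
    return out
```

With the sorted chunks of the slides' example, this reproduces the reordered ID array of Figure 5.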

(96)
§ For all i in [0, #rays−1]: set Output[i] = RayID[ SegScan[i] ]
§ Result = array of re-ordered rays, ordered by their hash value
[Figure: decompression example, continued – SegScan(S, F) = 3 4 5 6 9 10 11 12 13 7 8 17 18 19 0 1 2 14 15 16 = reordered ray IDs]

(97) Partition Into Ray Packets
§ Remaining problem: the sets of rays with the same hash value can have very different lengths
§ Solution: partition into ray packets
§ Definition of ray packet:
  Ray packet = index range (in the array of re-ordered rays) such that
  1. all rays have the same hash value, and
  2. the number of rays in the range < maximum packet size (e.g., MaxSize = 256)
§ 1 ray packet = 1 thread block
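The packet extraction can be sketched as a split of each chunk into fixed-capacity ranges (function name and the `(start, length)` representation are illustrative):

```python
def make_packets(chunk_size, max_size=256):
    """Split each chunk of coherent rays (sizes in sorted-chunk order) into packets
    of at most max_size rays; returns (start, length) ranges in the reordered ray
    array. Each packet is later processed by one thread block."""
    packets, start = [], 0
    for size in chunk_size:
        for off in range(0, size, max_size):
            packets.append((start + off, min(max_size, size - off)))
        start += size
    return packets
```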

(98) Workflow

(99) Results
§ Comparison with depth-first ray tracing (our implementation of [Aila & Laine 2009]) – here only(!) for primary and shadow rays
§ The gap between the two approaches is bigger for soft shadow rays, which are less coherent, since the sorting pipeline reduces warp-wise branching
§ The "new method" [Garanzha & Loop, 2010] contains some further tricks not described here (e.g., ray masking in the intersection stage)
[Figure 14: performance comparison (Mrays/sec, bigger is better) of the ray sorting pipeline vs. the depth-first tracer on the scenes of Fig. 11 (a)–(d), for primary rays at 1024×768 and soft shadow rays at 1024×768 with 16 samples]
§ Master thesis: real & thorough comparison?
