Massively Parallel Algorithms
Parallel Prefix Sum And Its Applications
G. Zachmann
University of Bremen, Germany
cgvr.cs.uni-bremen.de
§
Remember the reduction operation§ Extremely important/frequent operation → Google's MapReduce
§
Definition prefix sum:Given an input sequence
,
the (inclusive) prefix sum of this sequence is the output sequence where ⨁ is an arbitrary binary associative operator.
The exclusive prefix sum is
whereιis the identity/zero element, e.g., 0 for the + operator.
§
The prefix sum operation is sometimes also called a scan (operation)A = (a
0, a
1, a
2, . . . , a
n 1)
Aˆ = (a0, a1 a0,a2 a1 a0, . . .,an 1 · · · a0)
Aˆ0 = (◆, a0,a1 a0, . . .,an 2 · · · a0)
§
Example:§ Input:
§ Inclusive prefix sum:
§ Exclusive prefix sum:
§
Further variant: backward scan§
Applications: many!§ For example: polynomial evaluation (Horner's scheme)
§ In general: "What came before/after me?"
§ "Where do I start writing my data?"
§
The prefix sum problem appears to be "inherently sequential"A = (3 1 7 0 4 1 6 3)
Aˆ = (3 4 11 11 15 16 22 25) Aˆ0 = (0 3 4 11 11 15 16 22)
Variation: Segmented Scan
§
Input: segments of numbers in one large vector§
Task: scan (prefix-sum) within each segment§
Output: prefix-sums for each segment, in one vector§
Forms the basis for a wide variety of algorithms:§ E.g., Quicksort, Sparse Matrix-Vector Multiply, Convex Hull
§ Note: take care to store the flags array space- and
3
0 0 7 7 0 1 7
1
3 7 0 4 1 6 3
0
1 1 0 0 1 0 0 Flags array
(head-tail flags,
"1" = new segment) Payload data
Application from "Everyday" Life
§
Given:§ A 100-inch sandwich
§ 10 persons
§ We know how many inches each person wants: [3 5 2 7 10 4 3 0 8 1]
§
Task: cut the sandwich quickly§
Sequential method: one cut after another(3 inches first, 5 inches next, …)
§
Parallel method:§ Compute prefix sum
§ Make cuts in parallel with 10 knives
§ How quickly can we compute the prefix sum?
Importance of the Scan Operation
§
Assume the scan operation is a primitive that has unit time costs, then the following algorithms have the following complexities:38 CHAPTER 3. THE SCAN PRIMITIVES
Model
Algorithm EREW CRCW Scan
Graph Algorithms
(nvertices,medges,mprocessors)
Minimum Spanning Tree lg2n lgn lgn
Connected Components lg2n lgn lgn
Maximum Flow n2lgn n2lgn n2
Maximal Independent Set lg2n lg2n lgn
Biconnected Components lg2n lgn lgn
Sorting and Merging (nkeys,nprocessors)
Sorting lgn lgn lgn
Merging lgn lg lgn lg lgn
Computational Geometry (npoints,nprocessors)
Convex Hull lg2n lgn lgn
Building aK-D Tree lg2n lg2n lgn
Closest Pair in the Plane lg2n lgnlg lgn lgn
Line of Sight lgn lgn 1
Matrix Manipulation (n×nmatrix,n2processors)
Matrix×Matrix n n n
Vector×Matrix lgn lgn 1
Matrix Inversion nlgn nlgn n
EREW=
exclusive-read,
exclusive-write PRAM CRCW=
concurrent-read,
concurrent-write PRAM Scan =
EREW with scan as unit-cost primitive
Guy E. Blelloch:
Vector Models for
Data-Parallel Computing
§
Actually, prefix-sum (a.k.a. scan) was considered such animportant operation, that it was implemented as a primitive in the CM-2 Connection Machine (of Thinking Machines Corp.)
Example: Line-of-Sight
§
Given:§ Terrain as grid of height values (height map)
§ Point X in the grid (our "viewpoint", has a height, too)
§ Viewing direction, we can look up and down, but not to the left or right
§
Problem: find all visible points in the grid along the view direction§
Assumption: we have already extracted a vector of heights from the grid containing all cells' heights that are in our viewing direction3.3. EXAMPLE: LINE-OF-SIGHT 41
Figure 3.1: The line-of-sight problem for a single ray. The X marks the observation point.
The visible points are shaded. A point on the ray is visible if no previous point has a greater angle. The angle is calculated as arctan(altitude/distance).
§
The algorithm:1. Convert height vector to vertical angles (as seen from X) → A
- One thread per vector element
2. Perform max-scan on angle vector (i.e., prefix sum with the max operator) → Â
3. Test âi < ai , if true then grid point is visible form X
Height vector
3.3. EXAMPLE: LINE-OF-SIGHT 41
Figure 3.1: The line-of-sight problem for a single ray. The X marks the observation point.
The visible points are shaded. A point on the ray is visible if no previous point has a greater angle. The angle is calculated as arctan(altitude/distance).
Angle vector (A)
3.3. EXAMPLE: LINE-OF-SIGHT 41
Figure 3.1: The line-of-sight problem for a single ray. The X marks the observation point.
The visible points are shaded. A point on the ray is visible if no previous point has a greater angle. The angle is calculated as arctan(altitude/distance).
Max-scan of angle vector (Â)
The Hillis-Steele Algorithm
§
Iterate log(n) times:§
Notes:§ Blue = active threads
§ Each thread reads from "another" thread, too → must use barrier sync
- Can save one barrier by double buffering
A: 3 1 7 0 4 1 6 3
B: 3 4 8 7 4 5 7 9
d = 0, stride 1
A: 3 4 11 11 12 12 11 14
B: 3 4 11 11 15 16 22 25
d = 1, stride 2
d = 2, stride 4
§
The algorithm as pseudo-code:§ Note: barrier synchronization omitted for clarity
§
Remark: precision is usually better than the naïve sequential algo§ Because, in the parallel version, summands (in each iteration) tend to be of the same order
§
Algorithmic technique: recursive/iterative doubling technique ="Accesses or actions are governed by increasing powers of 2"
§ Remember the algo for maintaining dynamic arrays? (2nd/1st semester)
forall i in parallel do // n threads for d = 0...log(n)-1:
if i >= 2^d :
x[i] = x[ i − 2^d ] + x[i]
Definitions
§
Depth D(n) = "#iterations" = parallel running time Tp(n)§ (Think of the loops unrolled and "baked" into a hardware pipeline)
§ Sometimes also called step complexity
§
Work W(n) = total number of operations performed by all threads§ With sequential algorithms, work complexity = time complexity
§
Work-efficient:A parallel algorithm is called work-efficient, if it performs no more work than the sequential one
§
Visual definition of depth/work complexity:§ Express computation as a dependence graph of parallel tasks:
§ Work complexity = total amount of work performed by all tasks
§ Depth complexity = length of the "critical path" in the graph
§
Parallel algorithms should be always both work and depth efficient!Parallel tasks
§
Complexity of the Hillis-Steele algorithm:§ Depth d = Tp(n) = # iterations = log(n) → good
§ In iteration d:
§ Total number of adds = work complexity
§
Conclusion: not work-efficient§ A factor of log(n) can hurt: 20x for 106 elements
n 2d 1 adds
W(n) =
log2n
X
d=1
(n 2d 1) =
log2n
X
d=1
n
log2n
X
d=1
2d 1 = n·logn n 2 O n log n
The Blelloch Algorithm (for Exclusive Scan)
§
Consists of two phases: up-sweep (= reduction) and down-sweep 1. Up-sweep:§
Note: no double-buffering needed! (sync is still needed, of course)3 1 7 0 4 1 6 3
3 4 7 7 4 5 6 9
3 4 7 11 4 5 6 14
3 4 7 11 4 5 6 25
d = 0, stride 1
d = 1, stride 2
d = 2, stride 4
2. Down-sweep:
§ First: zero last element (might seem strange at first thought)
§ Dashed line means "copy over" (overwriting previous content)
3 4 7 0 4 5 6 11
3 4 7 11 4 5 6 0
3 0 7 4 4 11 6 16
0 3 4 11 11 15 16 22
d = 0, stride 4
d = 1, stride 2
d = 2, stride 1
§
Depth complexity:§ Performs 2.log(n) iterations
§ D(n) ∈ O( log n )
§
Work-efficiency:§ Number of adds: n/2 + n/4 + .. + 1 + 1 + … + n/4 + n/2
§ Work complexity W(n) = 2.n = O(n)
§ The Blelloch algorithm is work efficient
§
This up-sweep followed by down-sweep is a very common pattern in massively parallel algorithms!§
Limitations so far:§ Only one block of threads (what if the array is larger?)
§ Only arrays with power-of-2 size
Working on Arbitrary Length Input
§
Partition array into b blocks§ Choose fairly small block size = 2k, so we can easily pad array to b.2k
1. Run up-sweep on each block
2. Each block writes the sum of its partition (= last element after up- sweep) into a PartialSums array at blockIdx.x
3. Run prefix sum on the PartialSums array 4. Perform down-sweep on each block
5. Add PartialSums[blockIdx.x] to each element in "next"
array section blockIdx.x+1
Up-sweep block 0 Up-sweep block 1 Up-sweep block 2 Up-sweep block 3
Down-sweep block 0 Down-sweep block 1 Down-sweep block 2 Down-sweep block 3
Store block sums to auxiliary array PartialSums
Scan auxiliary array PartialSums
"Seed" last value in block i+1 with PartialSums[i], instead of 0
Further Optimizations
§
A simple & effective technique:§ Each thread i loads 4 floats from global memory → float4 x
§ Store Σj=0…3 x[i][j] in shared memory a[i]
§ Compute the prefix-sum on a ⟶ â
§ Store 4 values back in global memory:
-A[4*i] = â[i-1] + x[0]
-A[4*i+1] = â[i-1] + x[0] + x[1]
-A[4*i+2] = â[i-1] + x[0] + x[1] + x[2]
-A[4*i+3] = â[i-1] + x[0] + x[1] + x[2] + x[3]
§ Experience shows: 2x faster
§ Why does this improve performance? → Brent's theorem exclusive
Brent's Theorem
§
Frequent assumption when formulating parallel algorithms: we have arbitrarily many processors§ E.g., O(n) many processors for input of size n
§ Kernel launch even reflects that:
- Often, we run as many threads as there are input elements - I.e., CUDA/GPU provide us with this (nice) abstraction
§
Real hardware: only has fixed number p of processors§ E.g., on current GPUs: p ≈ 200–2000 (depending on viewpoint)
§
Question: how fast can an implementation of a parallel algorithm really be?§
Assumptions for Brent's theorem: PRAM model§ No explicit synchronization needed
§ Memory access = free (no cost)
§
Brent's Theorem:Given a massively parallel algorithm A; let D(n) = its depth (i.e., parallel time) complexity, and W(n) = its work complexity.
Then, A can be run on a p-processor PRAM in time at most
(Note the "≤")
T(n, p) W(n) p
⌫
+ D(n)
§
Proof:§ For each iteration step i, 1 ≤ i ≤ D(n), let Wi(n) = number of operations in that step
§ In each iteration, distribute those Wi(n) operations on p processors:
- Execute operations on each of the p processors in parallel
- Takes time steps on the PRAM
§ Overall :
T(n,p) =
D(n)X
i=1
⇠Wi(n) p
⇡
D(n)X
i=1
Wi(n) p
⌫
+ 1 W (n) p
⌫
+ D(n)
⇠Wi(n) p
⇡
⇠Wi(n) p
⇡
Application of Brent's Theorem to our Optimization of Prefix-Sum
§
Assume that the optimized version loads f floats into local registers§
Work complexity:§ Without optimization:
§ With optimization:
§
Depth complexity:§ Without optimization:
§ With optimization:
§
If f = 2, then W2 = W1 and D2 = D1, i.e., we gain nothing§
If f > 2, speedup of version 2 (opt.) over version 1 (orig.):D
1(n) = 2 log(n) W
1(n) = 2n
W2(n) = 2nf + nf ·f = n 1 + 2f
⇡ 2np
n
p 1 + 2f = 2f f + 2
D
2(n) = 2 log(
nf) + 2f = 2 log n 2 log f + 2f
Speedup(n) = T1(n) T2(n) =
W1(n)
p + D1(n)
W2(n)
p + D2(n)
<latexit sha1_base64="5WkMZq87Tct349y3qHAl0YsNb3E=">AAACxHicZVHbbhMxEPVugZZwaVpeQLxYVEitENFuXuAFqSoV4rGIpqnUjSLHO9ta9U32GIhW2xe+jw9AfAF/gTeXlrQjWXNm5njGPjOxUnjMst9Junbv/oP1jYedR4+fPN3sbm2feBMchwE30rjTCfMghYYBCpRwah0wNZEwnFx+bOvDb+C8MPoYpxZGip1rUQnOMKbG3Z+FseAYGqeZgvqrBSiDbXb1Hv1Ai8oxXtPjcd7GTYv6M3RTW/rhNcfG8ht6eB3fMPq3GItenXF3J+tlM6N3Qb4AO/uHv/5QQsjReCvZK0rDgwKNXDLvz/K+xVHNHAouoekUwYNl/JKdQz3TqKGvY6qklXHxaKSz7ApPTRXDi0hsnf+/dBawej+qhbYBQfN5rypIioa2mtJSOOAop5RxHp8UGMZR/ILFf2PUfmWMZ9rPBxVL2CkcaPjOjVJMl3VRMSXktISKBYlNXfhqiVc7iaAF/ohJLzxgsHXc5FtlSojraTVvN9xEcfPbUt4FJ/1envXyL1HlAzK3DfKSvCK7JCfvyD75TI7IgHDyN9lMnicv0k+pTH0a5tQ0Wdx5RlYsvfoHpkPY3w==</latexit><latexit sha1_base64="c7dBg+NdHIt6oHYWZ/7Tc500Zjo=">AAACxHicZVFLaxRBEO4ZH4nrIxu9KF6aBCFBXGb2ohchaBCPEbPZQGZZe3tqkib9ortaXYbx4u8TvIpnD/4Le/aRuElBU19VfV3V/dXESuExy34l6Y2bt26vrd/p3L13/8FGd/PhkTfBcRhwI407njAPUmgYoEAJx9YBUxMJw8n527Y+/AzOC6MPcWphpNipFpXgDGNq3P1eGAuOoXGaKag/WoAy2GZH79LXtKgc4zU9HOdt3LSoP0OXtaUfXnBsLD+n+xfxJaN/hbHo1Rl3t7NeNjN6HeQLsL23/+P31s8/nw7Gm8luURoeFGjkknl/kvctjmrmUHAJTacIHizj5+wU6plGDX0WUyWtjItHI51lV3hqqhieRWLr/P+lk4DVq1EttA0Ims97VUFSNLTVlJbCAUc5pYzz+KTAMI7iZyz+G6P2K2M8034+qFjCTuFAwxdulGK6rIuKKSGnJVQsSGzqwldLvNpJBC3wa0x64QGDreMmXyhTQlxPq3m74SaKm1+V8jo46vfyrJd/iCq/IXNbJ0/JFtkhOXlJ9sh7ckAGhJO/yUbyOHmSvktl6tMwp6bJ4s4jsmLpt38pldr2</latexit><latexit sha1_base64="c7dBg+NdHIt6oHYWZ/7Tc500Zjo=">AAACxHicZVFLaxRBEO4ZH4nrIxu9KF6aBCFBXGb2ohchaBCPEbPZQGZZe3tqkib9ortaXYbx4u8TvIpnD/4Le/aRuElBU19VfV3V/dXESuExy34l6Y2bt26vrd/p3L13/8FGd/PhkTfBcRhwI407njAPUmgYoEAJx9YBUxMJw8n527Y+/AzOC6MPcWphpNipFpXgDGNq3P1eGAuOoXGaKag/WoAy2GZH79LXtKgc4zU9HOdt3LSoP0OXtaUfXnBsLD+n+xfxJaN/hbHo1Rl3t7NeNjN6HeQLsL23/+P31s8/nw7Gm8luURoeFGjkknl/kvctjmrmUHAJTacIHizj5+wU6plGDX0WUyWtjItHI51lV3hqqhieRWLr/P+lk4DVq1EttA0Ims97VUFSNLTVlJbCAUc5pYzz+KTAMI7iZyz+G6P2K2M8034+qFjCTuFAwxdulGK6rIuKKSGnJVQsSGzqwldLvNpJBC3wa0x64QGDreMmXyhTQlxPq3m74SaKm1+V8jo46vfyrJd/iCq/IXNbJ0/JFtkhOXlJ9sh7ckAGhJO/yUbyOHmSvktl6tMwp6bJ4s4jsmLpt38pldr2</latexit><latexit sha1_base64="c7dBg+NdHIt6oHYWZ/7Tc500Zjo=">AAACxHicZVFLaxRBEO4ZH4nrIxu9KF6aBCFBXGb2ohchaBCPEbPZQGZZe3tqkib9ortaXYbx4u8TvIpnD/4Le/aRuElBU19VfV3V/dXESuExy34l6Y2bt26vrd/p3L13/8FGd/PhkTfBcRhwI407njAPUmgYoEAJx9YBUxMJw8n527Y+/AzOC6MPcWphpNipFpXgDGNq3P1eGAuOoXGaKag/WoAy2GZH79LXtKgc4zU9HOdt3LSoP0OXtaUfXnBsLD+n+xfxJaN/hbHo1Rl3t7NeNjN6HeQLsL23/+P31s8/nw7Gm8luURoeFGjkknl/kvctjmrmUHAJTacIHizj5+wU6plGDX0WUyWtjItHI51lV3hqqhieRWLr/P+lk4DVq1EttA0Ims97VUFSNLTVlJbCAUc5pYzz+KTAMI7iZyz+G6P2K2M8034+qFjCTuFAwxdulGK6rIuKKSGnJVQsSGzqwldLvNpJBC3wa0x64QGDreMmXyhTQlxPq3m74SaKm1+V8jo46vfyrJd/iCq/IXNbJ0/JFtkhOXlJ9sh7ckAGhJO/yUbyOHmSvktl6tMwp6bJ4s4jsmLpt38pldr2</latexit><latexit sha1_base64="KJvsRPzR+AV6bsz6POjq0fOL+lc=">AAACxHicZVHbbhMxEPUutxIuTcsLiBeLCKkVItrNC7wgVYAQj0U0TaVuFDne2daqb7LH0Gi1vPCVfAJ/gTfZtKQdyZozM8cz9pm5lcJjlv1J0jt3791/sPWw9+jxk6fb/Z3dY2+C4zDmRhp3MmcepNAwRoESTqwDpuYSJvOLT2198gOcF0Yf4cLCVLEzLSrBGcbUrP+7MBYcQ+M0U1B/twBlsM2e3qcfaFE5xmt6NMvbuGnRaImua2s/ueLYWH5DP1/F14zRDUbXqzfrD7JhtjR6G+QdGJDODmc7yX5RGh4UaOSSeX+ajyxOa+ZQcAlNrwgeLOMX7AzqpUYNfR1TJa2Mi0cjXWY3eGqhGJ5HYuv8/6XTgNX7aS20DQiar3pVQVI0tNWUlsIBR7mgjPP4pMAwjuLnLP4bo/YbYzzTfjWoWMNe4UDDT26UYrqsi4opIRclVCxIbOrCV2u82UkELfAyJr3wgMHWcZNvlSkhrqfVvN1wE8XNb0p5GxyPhnk2zL9lg4OPncxb5CV5RfZITt6RA/KVHJIx4eRvsp08T16kX1KZ+jSsqGnS3XlGNiz99Q/BhtaT</latexit>
Other Consequences of Brent's Theorem
§
Obviously,§
In the sequential world, time = work:§
In the parallel world:§
Our speedup is§
Assume,i.e., our parallel algorithm would do asymptotically more work
§
Then,because, on real hardware, p is bounded
§
This is the reason why we want work-efficient parallel algorithms!Speedup(n) p
T
S(n) = W
S(n)
TP(n) = WPp(n) + D(n)W
P(n) 2 ⌦ ( W
S(n) )
Speedup(n) = TTS(n)
P(n) = WPW(n)S(n)
p +D(n)
Speedup(n) = WS(n)
⌦(WS(n) ) + D(n) ! 0 as n ! 1
§
Now, look at work-efficient parallel algorithms, i.e.§
Then,§
In this situation, we will achieve the optimal speedup of O(p), so long as§
Consequence: given two work-efficient parallel algorithms, the one with the smaller depth complexity is better, because we can run it on hardware with more processors (cores) and still obtain a speedup of p over the sequential algorithm (in theory).We say this algorithm scales better.
W
P(n) 2 ⇥ ( W
S(n) )
Speedup(n) = W (n)
W(n)
p + D(n) = pW(n)
W(n) + pD(n)
p 2 O W (n) D(n)
Limitations of Brent's Theorem
§
Brent's theorem is based on the PRAM model§
That model makes a number of unrealistic assumptions:§ Memory access has zero latency
§ Memory bandwidth is infinite
§ No synchronization among processors (threads) is necessary
§ Arithmetic operations cost unit time
§
With current hardware, rather the opposite is realisticDigression: Radix-Sort
§
Modelled after sorting machines of post routing centers (but with a twist!)§
Disadvantages:§ Not generic like Quicksort, which require only a compare operator on pairs of elements
§ Works only on elements with a known, pre-defined
numeric representation (e.g., 32 bits)
§ Different representation require different versions of radix sort
§
Vorteil: sehr effizient!§
Observation: integers can be represented with any base r§
Naive (intuitive) idea:§ Sort all elements according to the most significand digit into bins (one bin per digit)
§ Sort bin 0 using radix sort recursively
§ Sort bin 1 recursively, etc. …
§
This is called MSD radix sort (MSD = most significant digit)§
For the algorithm on the next slide:§ Choose radix r and fix it
§ Define z(t,a) = t-tth digit of number a when represented over base r, where t=0 denotes the least significand digit
The Algorithm (in Python)
A = array of numbers
i = current digit used for sorting ( 0 <= i <= d-1 ) d = total number of digits (same for all keys)
def msd_radix_sort( A, i, d ):
# init array of r empty lists = [ [], [], [], … ] bin = r * [[]]
# distribute all A's in bins according to z(i,.) for j in range(0, len(A) ):
bin[ z(i, A[j]) ].append( A[j] )
# sort bins if i >= 0:
for j in range(0, r):
msd_radix_sort( bin[j], i-1, d )
# gather bins A = []
for j in range(0, r):
A.extend( bin[j] ) bin[j] = []
Optional
Example
§
Keys = integers with 64 bits§
Size of input = 224 (ca. 16 mio)§
We choose r = 28 = 256 as base§ E.g. "digits" = characters in fixed-length strings
§
On the first recursion level, the algo checks the left-most 8 bits of the keys and distributes each key into one of 256 bins§
Average (expected) size of the bins (assuming uniform distribution of the keys) = 224 / 28 = 216 =65536
§
Recursion tree:§
Problem: in each recursion, we need to save r-1 many bins§ Lots of house keeping necessary
§ Either using marker arrays like with Counting Sort
§ Or using arrays of lists (lots of allocations / deallocations)
§
Solution: LSD Radix-Sort (aka. Backward Radix-Sort)§ First, sort according to least-significand digit, then according to least but second digit, etc.
§
Let d = number of digits, digit 0 = least-significand one§
The algorithm:§ Use, e.g., Counting Sort inside the loop lsd_radix_sort( A ):
for i = 0, ..., d-1:
do a stable sort on A with the
i-th digit of the elements as the key
Example
§
Sort 12 letters according to the post code (zip code)§
In the first iteration, consider only the last digit§ Notice that letters with the same digit did not change their position relative to each other!
Letters before the first iteration Letters after the first iteration
§
Sort by last but second digitLetters before the second iteration Letters after the second iteration
Parallel Radix Sort, Based on the Split Operation
§
The split operation: rearrange elements according to a flag§ Note: split maintains order within each group! (i.e., it is stable)
§
Radix sort (massively parallel):where split(i,a) rearranges a by moving all keys that have bit i = 0 to the front, and all keys that have bit i = 1 to the back
(bit 0 = LSB)
§ Reminder: stability of split is essential!
1 1
1 0
0 0
0 0
1 0 0
0 1
0 0
1
radix_sort( array a, int len ):
for i = 0...numbits-1: // important: go from low to high bit!
split(i, a) // split a, based on bit i of keys
Flags.
Usually, there are payload data, too (omitted here)
Algorithm for the Split Operation
§
Split's job:§ Determine new index for each element
§ Then perform the permutation
§
Algorithm (by way of the example):§ Consider lowest bit of the keys 1. Compute exclusive "0"-scan:
fi = # 0's in (a0, …, ai-1)
2. Set F = total number of "0"s
3. Construct d = new pos. of ai's
- If ai's bit = 0 → new pos. di = fi - If ai = 1 → new pos. di = F + (i – fi),
because i – fi = # 1's to the left of i
100 111 010 110 001 101 001 000
0 1 2 3 4 5 6 7
a:
i:
=
(fn 1 + 1 an 1 = 0 fn 1 an 1 = 1
0 1 1 2 3 3 3 3
f:
4+(1-1) 4+(4-3)4+(5-3)4+(6-3)
dfor "1"s:
0 1 2 3
dfor "0"s: F=4
0 4 1 2 5 6 7 3
d:
4 7 2 6 1 5 1 0
0 1 2 3 4 5 6 7
a:
i:
4 2 6 0 7 1 5 1
a':
Example: split based on bit 0
§
A conceptual algorithm for the "0"-scan:§ Extract the relevant bit (conceptually only)
§ Invert the bit
§ Compute regular prefix sum with +-operation
§
In a real implementation, you would, of course, implement this as a native "0"-scan routine with a special "+" operation!§
Depth complexity:§ Amounts to O( b2 ) , or O( log2(n) )
100 111 010 110 001 101 001 000
a:
1 0 1 1 0 0 0 1
a':
0 1 1 2 3 3 3 3
f:
O(b * log(n) ), where b = #bits per integer, and n = # elements
Stream Compaction
§
Given: input stream A, and a flag/predicate for each ai§
Goal: output stream A' that contains only ai's, for which flag = true§
Example:§ Given: array of upper and lower case letters
§ Goal: delete lower case letters and compact the others
to the front of the array
§
Solution:§ Just like with the split operation, except we don't compute indices for the
"to-be-deleted" elements
§
Frequent task: e.g., collision detection,§
Sometimes also called list packing, or stream packingA x C P h w b Z
a:
1 0 1 1 0 0 0 1
a':
A C P Z
b:
Sparse Matrices
§
"Unstructured" sparse matrices:§ Most common storage format is Compressed Sparse Row (CSR)
§ Matrix M, size m×n , k non-zero elements
§ Stored in three arrays V, C, R
- Row i of matrix M is stored in
- Element Vj in M's i-th row represents element
V C
R
VRi,. . .,VRi+1 1
<latexit sha1_base64="adkwkxf9pqSM4uu7pOYREbzTdT8=">AAACX3icZVBNbxMxEJ0sHy0BSgonhIQsAhKINFr3Ui5IEVw4FkTSSt1o5XhnW6v+WNljSrTaG7+GK/wZbvBP2E1aidAnWX56M543fotKq0Bp+quX3Lh56/bW9p3+3Xv3dx4Mdh/Ogote4lQ67fzxQgTUyuKUFGk8rjwKs9B4tDh/39WPvqAPytnPtKxwbsSpVaWSglopHzyd5fWnXDUjlunCURixlVCr17zZ400+GKbjdAV2nfBLMpw8z0a/AeAw3+29ygono0FLUosQTvh+RfNaeFJSY9PPYsBKyHNxivVq/Ya9aKWClc63xxJbqRt9ZmkEnbWN3RX+LZ1EKt/Ma2WrSGjlelYZNSPHuu+yQnmUpJdMSNmuFAW1VvJMeCGpjWXDJggb1kbZFe1nHi1eSGeMsEWdlcIovSywFFFTU2ehvOKbk1S0ir62YlABKVZ1hX7PuALZW1Z23m34Xbj8/yivk9n+mKdj/pEPJ+9gjW14As/gJXA4gAl8gEOYgoRv8B1+wM/en2Qr2UkG69akd/nmEWwgefwXIDi6WQ==</latexit><latexit sha1_base64="asGJRMYiMnpeSnTu0+CRvxlZcwo=">AAACX3icZVBNbxMxEHW2QEqANoUTQkIWAQlEGq17KZdKEVw4FkTSSt1o5XhnW6v+WNljaLTaW/9E/wJX+DMckfgheJNWIvRJlp/ejOeN37xS0mOa/uokG3fu3utu3u89ePhoa7u/83jqbXACJsIq647n3IOSBiYoUcFx5YDruYKj+fmHtn70FZyX1nzBRQUzzU+NLKXgGKW8/3ya159z2QxppgqLfkiXQi3fsmaXNXl/kI7SJehtwq7JYPwyG/4prrqH+U7nTVZYETQYFIp7f8L2KpzV3KEUCppeFjxUXJzzU6iX6zf0VZQKWloXj0G6VNf69EJzPIuN7eX/LZ0ELN/NammqgGDEalYZFEVL2+/SQjoQqBaUCxFXChyjlTjjjguMsazZeG78yii7ob3MgYFvwmrNTVFnJddSLQooeVDY1Jkvb/j6JBmMxIsoeukBQ1VX4Ha1LYAe0LL1juG34bL/o7xNpnsjlo7YJzYYvycrbJJn5AV5TRjZJ2PykRySCRHkknwnP8jPzu+km2wl/VVr0rl+84SsIXn6F9wcu2E=</latexit><latexit sha1_base64="asGJRMYiMnpeSnTu0+CRvxlZcwo=">AAACX3icZVBNbxMxEHW2QEqANoUTQkIWAQlEGq17KZdKEVw4FkTSSt1o5XhnW6v+WNljaLTaW/9E/wJX+DMckfgheJNWIvRJlp/ejOeN37xS0mOa/uokG3fu3utu3u89ePhoa7u/83jqbXACJsIq647n3IOSBiYoUcFx5YDruYKj+fmHtn70FZyX1nzBRQUzzU+NLKXgGKW8/3ya159z2QxppgqLfkiXQi3fsmaXNXl/kI7SJehtwq7JYPwyG/4prrqH+U7nTVZYETQYFIp7f8L2KpzV3KEUCppeFjxUXJzzU6iX6zf0VZQKWloXj0G6VNf69EJzPIuN7eX/LZ0ELN/NammqgGDEalYZFEVL2+/SQjoQqBaUCxFXChyjlTjjjguMsazZeG78yii7ob3MgYFvwmrNTVFnJddSLQooeVDY1Jkvb/j6JBmMxIsoeukBQ1VX4Ha1LYAe0LL1juG34bL/o7xNpnsjlo7YJzYYvycrbJJn5AV5TRjZJ2PykRySCRHkknwnP8jPzu+km2wl/VVr0rl+84SsIXn6F9wcu2E=</latexit><latexit sha1_base64="asGJRMYiMnpeSnTu0+CRvxlZcwo=">AAACX3icZVBNbxMxEHW2QEqANoUTQkIWAQlEGq17KZdKEVw4FkTSSt1o5XhnW6v+WNljaLTaW/9E/wJX+DMckfgheJNWIvRJlp/ejOeN37xS0mOa/uokG3fu3utu3u89ePhoa7u/83jqbXACJsIq647n3IOSBiYoUcFx5YDruYKj+fmHtn70FZyX1nzBRQUzzU+NLKXgGKW8/3ya159z2QxppgqLfkiXQi3fsmaXNXl/kI7SJehtwq7JYPwyG/4prrqH+U7nTVZYETQYFIp7f8L2KpzV3KEUCppeFjxUXJzzU6iX6zf0VZQKWloXj0G6VNf69EJzPIuN7eX/LZ0ELN/NammqgGDEalYZFEVL2+/SQjoQqBaUCxFXChyjlTjjjguMsazZeG78yii7ob3MgYFvwmrNTVFnJddSLQooeVDY1Jkvb/j6JBmMxIsoeukBQ1VX4Ha1LYAe0LL1juG34bL/o7xNpnsjlo7YJzYYvycrbJJn5AV5TRjZJ2PykRySCRHkknwnP8jPzu+km2wl/VVr0rl+84SsIXn6F9wcu2E=</latexit><latexit sha1_base64="vYhuK9XKkcgO7w6XSWvd5Am83JA=">AAACX3icZVBNTxsxEHW2pUBKaWhPCKmyGiFRFaI1F3qphMqFI1RNQGKjleOdBQt/rOxxabTaW39Nr+2f6ZF/gjcJUlOeZPnpzXje+E0qJT2m6d9O8uz5yovVtfXuy41Xm697W29G3gYnYCissu5ywj0oaWCIEhVcVg64nii4mNyetPWL7+C8tOYbTisYa35tZCkFxyjlvXejvP6ay2afZqqw6PfpTKjlR9YcsCbv9dNBOgN9StiC9MkCZ/lW50NWWBE0GBSKe3/FDisc19yhFAqabhY8VFzc8muoZ+s3dDdKBS2ti8cgnalLfXqqOd7Exvby/5auApafxrU0VUAwYj6rDIqipe13aSEdCFRTyoWIKwWO0UrccMcFxliWbDw3fm6UPdJu5sDAnbBac1PUWcm1VNMCSh4UNnXmy0e+PEkGI/FHFL30gKGqK3AH2hZAP9Oy9Y7ht+Gy/6N8SkaHA5YO2DnrH39ZxLxGdsh7skcYOSLH5JSckSER5Cf5RX6TP537ZDXZTHrz1qSzePOWLCHZfgAVe7iN</latexit>
Mi,Cj
<latexit sha1_base64="L7j+dafKqDJxwSNIwdYEgYgZeTQ=">AAACS3icZVBNaxRBEK3ZGI3rVxKPXhqXgIIuM7noJbAkFy9CBDcJZJalt6cmadNfdFdHl2H+Rq7mD/kD/B3exIM9uwm45kHTj1fV9arfzCkZKM9/Zr21e+v3H2w87D96/OTps82t7aNgoxc4FlZZfzLjAZU0OCZJCk+cR65nCo9nFwdd/fgSfZDWfKa5w4nmZ0bWUnBKUvlx2sg3rDmYfmnb6eYgH+YLsLukuCGD0U4RrwHgcLqVvS4rK6JGQ0LxEE6LXUeThnuSQmHbL2NAx8UFP8NmsWrLdpJUsdr6dAyxhbrSp+ea03lq7K7wb+k0Uv1+0kjjIqERy1l1VIws677GKulRkJozLkRaKXJKVuKcey4oRbBiE7gJS6PylvZLjwa/Cqs1N1VT1lxLNa+w5lFR25ShvuWrk2Q0kr4lMciAFF3j0L/VtkK2x+rOOwXdhVv8H+VdcrQ7LPJh8akYjPZhiQ14AS/hFRTwDkbwAQ5hDAIcXMF3uM5+ZL+y39mfZWsvu3nzHFbQW/8LhRC2Uw==</latexit><latexit sha1_base64="KhP90KnRRx+Q0keEcupMlIPLdGc=">AAACS3icZVDLahRBFK0ejcbxkUSXbgqHgIIO3dnoRhjMxo0QwUkC6WG4U307KVMvqm6pQ9NL/0Dc6h/kS/IB+Q534sLqmQQcc6Cow7m37rl1Zk7JQHl+kfVu3Fy7dXv9Tv/uvfsPNja3Hu4HG73AsbDK+sMZBFTS4JgkKTx0HkHPFB7MTne7+sEn9EFa84HmDicajo2spQBKUvlu2sjnvNmdfmzb6eYgH+YL8OukuCSD0XYRz86+ftubbmXPysqKqNGQUBDCUbHjaNKAJykUtv0yBnQgTuEYm8WqLd9OUsVr69MxxBfqSp+ea6CT1Nhd4d/SUaT61aSRxkVCI5az6qg4Wd59jVfSoyA15yBEWikCJStxAh4EpQhWbAKYsDQqr2i/9Gjws7Bag6masgYt1bzCGqKitilDfcVXJ8loJH1JYpABKbrGoX+hbYX8Na877xR0F27xf5TXyf7OsMiHxftiMHrDllhnj9kT9pQV7CUbsbdsj42ZYI59Zz/Yz+w8+5X9zv4sW3vZ5ZtHbAW9tb+cmLgq</latexit><latexit sha1_base64="KhP90KnRRx+Q0keEcupMlIPLdGc=">AAACS3icZVDLahRBFK0ejcbxkUSXbgqHgIIO3dnoRhjMxo0QwUkC6WG4U307KVMvqm6pQ9NL/0Dc6h/kS/IB+Q534sLqmQQcc6Cow7m37rl1Zk7JQHl+kfVu3Fy7dXv9Tv/uvfsPNja3Hu4HG73AsbDK+sMZBFTS4JgkKTx0HkHPFB7MTne7+sEn9EFa84HmDicajo2spQBKUvlu2sjnvNmdfmzb6eYgH+YL8OukuCSD0XYRz86+ftubbmXPysqKqNGQUBDCUbHjaNKAJykUtv0yBnQgTuEYm8WqLd9OUsVr69MxxBfqSp+ea6CT1Nhd4d/SUaT61aSRxkVCI5az6qg4Wd59jVfSoyA15yBEWikCJStxAh4EpQhWbAKYsDQqr2i/9Gjws7Bag6masgYt1bzCGqKitilDfcVXJ8loJH1JYpABKbrGoX+hbYX8Na877xR0F27xf5TXyf7OsMiHxftiMHrDllhnj9kT9pQV7CUbsbdsj42ZYI59Zz/Yz+w8+5X9zv4sW3vZ5ZtHbAW9tb+cmLgq</latexit><latexit sha1_base64="KhP90KnRRx+Q0keEcupMlIPLdGc=">AAACS3icZVDLahRBFK0ejcbxkUSXbgqHgIIO3dnoRhjMxo0QwUkC6WG4U307KVMvqm6pQ9NL/0Dc6h/kS/IB+Q534sLqmQQcc6Cow7m37rl1Zk7JQHl+kfVu3Fy7dXv9Tv/uvfsPNja3Hu4HG73AsbDK+sMZBFTS4JgkKTx0HkHPFB7MTne7+sEn9EFa84HmDicajo2spQBKUvlu2sjnvNmdfmzb6eYgH+YL8OukuCSD0XYRz86+ftubbmXPysqKqNGQUBDCUbHjaNKAJykUtv0yBnQgTuEYm8WqLd9OUsVr69MxxBfqSp+ea6CT1Nhd4d/SUaT61aSRxkVCI5az6qg4Wd59jVfSoyA15yBEWikCJStxAh4EpQhWbAKYsDQqr2i/9Gjws7Bag6masgYt1bzCGqKitilDfcVXJ8loJH1JYpABKbrGoX+hbYX8Na877xR0F27xf5TXyf7OsMiHxftiMHrDllhnj9kT9pQV7CUbsbdsj42ZYI59Zz/Yz+w8+5X9zv4sW3vZ5ZtHbAW9tb+cmLgq</latexit><latexit sha1_base64="jCzWxUv57dGH0FjiSJfbfYFtFF4=">AAACS3icZVBNaxRBEO3ZGI2r0SQevTQugoIuM3vRixDMxYuQgJsEMstS21OTdNJfdFeryzB/w6v+IX+Av8ObeEjP7gZc86Dpx6vqetVv5pQMlOe/st7Gnc2797bu9x883H70eGd37zjY6AWOhVXWn84goJIGxyRJ4anzCHqm8GR2ddDVTz6jD9KaTzR3ONFwbmQtBVCSyo/TRr7izcH0sm2nO4N8mC/Ab5NiRQZshcPpbvayrKyIGg0JBSGcFSNHkwY8SaGw7ZcxoANxBefYLFZt+fMkVby2Ph1DfKGu9em5BrpIjd0V/i2dRarfThppXCQ0YjmrjoqT5d3XeCU9ClJzDkKklSJQshIX4EFQimDNJoAJS6PyhvZLjwa/CKs1mKopa9BSzSusISpqmzLUN3x9koxG0tckBhmQomsc+tfaVsjf8brzTkF34Rb/R3mbHI+GRT4sjorB/vtVzFvsKXvGXrCCvWH77AM7ZGMmmGPf2Hf2I/uZ/c7+ZH+Xrb1s9eYJW0Nv8xqyDbSo</latexit>
§
Example:M = 0 B B B B
@
a
00 0 a
10 0 a
20 0 0 0 0 a
30 0 0 a
40 a
5a
60 0 0 0 a
71 C C C C A
<latexit sha1_base64="+MdAF5W+5Bxum+MuL62xLQ3xNW0=">AAAC2nicZVFLb9NAEF6bVzGPpnDkskpUBAeCHR7lghTBhQtSkUhbqRtZ4/U4XXW9tvYBRFYkxA1x5cJP49fAxm7apB1pvZ+/b2Zn9E1WS2FsHP8NwmvXb9y8tXU7unP33v3t3s6DA1M5zXHCK1npowwMSqFwYoWVeFRrhDKTeJidvl/qh19QG1Gpz3Ze47SEmRKF4GA9lfb+fKRvKctwJlRTl2C1+LaIII3pY7o6kCYtYizqfkdr4jndSS8u05C+PBdftd/X6xUXlXs0YqjyiyHS3iAexm3QqyA5A4Nxn/W/E0L2053gKcsr7kpUlksw5jgZ1XbagLaCS1xEzBmsgZ/CDJvWuQXd9VROi0r7oyxt2Y28cu7nOfGJy8usS8fOFm+mjVC1s6h491bhJLUVXTpNc6GRWzmnwLkfyYH1rfgJaODWb2SjjQFlukZsBSOmUeFXXpUleF9YAaWQ8xwLcNIuGmaKFd58STglrPePGWHQurqpUT8rqxz9ootlb7/3hTc3uWzlVXAwGibxMPmUDMbvSBdb5BHpkyckIXtkTD6QfTIhnPwLdoNh8Dxk4Y/wZ/irSw2Ds5qHZCPC3/8BV2nTpw==</latexit><latexit sha1_base64="QGNbtg4yXTSMe0CB/cZNiFv2vBc=">AAAC2nicZVFNb9QwEHXCVwlfWzhysXZVBAeWZKGUC9IKLlyQisS2lepVNHEmW6uOE9lOYRXlwq3iioT4PfwIxK8Bb9Jtd9uRHL+8N+MZvUlKKYwNw7+ef+36jZu3Nm4Hd+7eu/+gt/lwzxSV5jjhhSz0QQIGpVA4scJKPCg1Qp5I3E+O3y/0/RPURhTqs52XOM1hpkQmOFhHxb1fH+lbyhKcCVWXOVgtvjYBxCF9QpcH4qhFjAXd72hFPKc76eVlGuJX5+J2+329WnFRuUMDhiq9GCLuDcJh2Aa9CqIzMBj3Wf/05+8/u/Gm94ylBa9yVJZLMOYwGpV2WoO2gktsAlYZLIEfwwzr1rmGbjkqpVmh3VGWtuxaXj538xy5xMVlVqXDymZvprVQZWVR8e6trJLUFnThNE2FRm7lnALnbqQKrGvFj0ADt24ja20MKNM1YksYMI0Kv/Aiz8H5wjLIhZynmEElbVMzky3x+kuiUsI6/5gRBm1V1iXq53mRolt0tujt9t44c6PLVl4Fe6NhFA6jT9Fg/I50sUEekz55SiKyQ8bkA9klE8LJP2/LG3ovfOZ/80/9712q753VPCJr4f/4DxTj1eE=</latexit><latexit sha1_base64="QGNbtg4yXTSMe0CB/cZNiFv2vBc=">AAAC2nicZVFNb9QwEHXCVwlfWzhysXZVBAeWZKGUC9IKLlyQisS2lepVNHEmW6uOE9lOYRXlwq3iioT4PfwIxK8Bb9Jtd9uRHL+8N+MZvUlKKYwNw7+ef+36jZu3Nm4Hd+7eu/+gt/lwzxSV5jjhhSz0QQIGpVA4scJKPCg1Qp5I3E+O3y/0/RPURhTqs52XOM1hpkQmOFhHxb1fH+lbyhKcCVWXOVgtvjYBxCF9QpcH4qhFjAXd72hFPKc76eVlGuJX5+J2+329WnFRuUMDhiq9GCLuDcJh2Aa9CqIzMBj3Wf/05+8/u/Gm94ylBa9yVJZLMOYwGpV2WoO2gktsAlYZLIEfwwzr1rmGbjkqpVmh3VGWtuxaXj538xy5xMVlVqXDymZvprVQZWVR8e6trJLUFnThNE2FRm7lnALnbqQKrGvFj0ADt24ja20MKNM1YksYMI0Kv/Aiz8H5wjLIhZynmEElbVMzky3x+kuiUsI6/5gRBm1V1iXq53mRolt0tujt9t44c6PLVl4Fe6NhFA6jT9Fg/I50sUEekz55SiKyQ8bkA9klE8LJP2/LG3ovfOZ/80/9712q753VPCJr4f/4DxTj1eE=</latexit><latexit sha1_base64="QGNbtg4yXTSMe0CB/cZNiFv2vBc=">AAAC2nicZVFNb9QwEHXCVwlfWzhysXZVBAeWZKGUC9IKLlyQisS2lepVNHEmW6uOE9lOYRXlwq3iioT4PfwIxK8Bb9Jtd9uRHL+8N+MZvUlKKYwNw7+ef+36jZu3Nm4Hd+7eu/+gt/lwzxSV5jjhhSz0QQIGpVA4scJKPCg1Qp5I3E+O3y/0/RPURhTqs52XOM1hpkQmOFhHxb1fH+lbyhKcCVWXOVgtvjYBxCF9QpcH4qhFjAXd72hFPKc76eVlGuJX5+J2+329WnFRuUMDhiq9GCLuDcJh2Aa9CqIzMBj3Wf/05+8/u/Gm94ylBa9yVJZLMOYwGpV2WoO2gktsAlYZLIEfwwzr1rmGbjkqpVmh3VGWtuxaXj538xy5xMVlVqXDymZvprVQZWVR8e6trJLUFnThNE2FRm7lnALnbqQKrGvFj0ADt24ja20MKNM1YksYMI0Kv/Aiz8H5wjLIhZynmEElbVMzky3x+kuiUsI6/5gRBm1V1iXq53mRolt0tujt9t44c6PLVl4Fe6NhFA6jT9Fg/I50sUEekz55SiKyQ8bkA9klE8LJP2/LG3ovfOZ/80/9712q753VPCJr4f/4DxTj1eE=</latexit><latexit sha1_base64="ZwOyoS2mY6NU3G4wy39H97CAraM=">AAAC2nicZVFLb9NAEF6bVzGPpnDksiIqggPBDo9yQargwgWpSKSt1I2s8XqcrrpeW/sAIssXbogrF34avwY2dtIm7Ujr/fx9Mzujb7JaCmPj+G8QXrt+4+atrdvRnbv37m8Pdh4cmsppjhNeyUofZ2BQCoUTK6zE41ojlJnEo+zsw0I/+oraiEp9sfMapyXMlCgEB+updPDnE31HWYYzoZq6BKvF9zaCNKZP6OpAmnSIsaj/Ha+J53QvvbxMQ/rqXHzdfd+sV1xU7tGIocovhkgHw3gUd0GvgmQJhmQZB+lO8IzlFXclKsslGHOSjGs7bUBbwSW2EXMGa+BnMMOmc66lu57KaVFpf5SlHbuRV879PKc+cXGZdenE2eLttBGqdhYV798qnKS2ogunaS40civnFDj3IzmwvhU/BQ3c+o1stDGgTN+IrWDENCr8xquyBO8LK6AUcp5jAU7atmGmWOHNl4RTwnr/mBEGraubGvXzssrRL7pY9PZ7b725yWUrr4LD8SiJR8nnZLj/fmnzFnlEHpOnJCF7ZJ98JAdkQjj5F+wGo+BFyMIf4c/wV58aBsuah2Qjwt//AQML0kc=</latexit>
V
<latexit sha1_base64="Wx0lZ1PH1zcrcU3neZTK48t8oqc=">AAAChXicZVHbbhMxEJ0sl5ZwaQqPSMgiQmqlNtoNpeUFEQEPPLYSSSt1o8jxzrZWfVnZYyBa5Yf4Aj6D1/LEp+AkW4nQkaxzNDOeY5+ZVkp6StPrVnLn7r37G5sP2g8fPX6y1dl+OvI2OIFDYZV1Z1PuUUmDQ5Kk8KxyyPVU4en06uOifvoVnZfWfKFZhWPNL4wspeAUU5POpxF7x3YYn6R7LN+LmDXYb/B1gwcNvmnwsMEjttuedLppL10Gu02yhnQHL36e/AGA48l2azcvrAgaDQnFvT/P+hWNa+5ICoXzdh48Vlxc8Qusl5+cs1cxVbDSungMsWV2rU/PNKfL2LgA/2/pPFD5dlxLUwVCI1azyqAYWbYwhRXSoSA1Y1yI+KTAKUqJS+64oGjemoznxq+E8hvazh0a/Cas1twUdV5yLdWswJIHRfM69+UNX58kg5H0PSa99Eihqit0+9oWGHdSLrTjiubR3Ox/K2+TUb+Xpb3sJOsOPsAqNuE5vIQdyOAIBvAZjmEIAn7AL7iG38lGsp8cJIer1qTV3HkGa5G8/wvTEMCt</latexit><latexit sha1_base64="mzuhQbPf0fpuUaUNzxhDHLedmi8=">AAAChXicZVHbahsxEJW3l6TuzUkfC0HUFBJIzK6b20uoafvQxwRqJ5A1ZqydTUS00iKNkprFj/2ZfkE+o6/pL/QnKtsbqJsBcQ4zoznSmXGppKM4vmtEjx4/ebqy+qz5/MXLV69ba+sDZ7wV2BdGGXs2BodKauyTJIVnpUUoxgpPx1efZ/XTa7ROGv2NJiUOC7jQMpcCKKRGrS8DfsQ3OYzibZ5uB0xq7Nb4ocbdGvdq3K/xgG81R6123InnwR+SpCbt3sbtyZ8fG7fHo7XGVpoZ4QvUJBQ4d550SxpWYEkKhdNm6h2WIK7gAqv5J6f8fUhlPDc2HE18nl3qKyYF0GVonIH7t3TuKT8cVlKXnlCLxazcK06Gz0zhmbQoSE04CBGe5IGClLgEC4KCeUsyDrRbCKX3tJla1HgjTFGAzqo0h0KqSYY5eEXTKnX5PV+eJL2W9D0knXRIvqxKtDuFyTDsJJ9phxVNg7nJ/1Y+JINuJ4k7yUnS7n1ii1hlb9k7tskSdsB67Cs7Zn0m2E/2i92x39FKtBPtRvuL1qhR33nDliL6+BcsdMIT</latexit><latexit sha1_base64="mzuhQbPf0fpuUaUNzxhDHLedmi8=">AAAChXicZVHbahsxEJW3l6TuzUkfC0HUFBJIzK6b20uoafvQxwRqJ5A1ZqydTUS00iKNkprFj/2ZfkE+o6/pL/QnKtsbqJsBcQ4zoznSmXGppKM4vmtEjx4/ebqy+qz5/MXLV69ba+sDZ7wV2BdGGXs2BodKauyTJIVnpUUoxgpPx1efZ/XTa7ROGv2NJiUOC7jQMpcCKKRGrS8DfsQ3OYzibZ5uB0xq7Nb4ocbdGvdq3K/xgG81R6123InnwR+SpCbt3sbtyZ8fG7fHo7XGVpoZ4QvUJBQ4d550SxpWYEkKhdNm6h2WIK7gAqv5J6f8fUhlPDc2HE18nl3qKyYF0GVonIH7t3TuKT8cVlKXnlCLxazcK06Gz0zhmbQoSE04CBGe5IGClLgEC4KCeUsyDrRbCKX3tJla1HgjTFGAzqo0h0KqSYY5eEXTKnX5PV+eJL2W9D0knXRIvqxKtDuFyTDsJJ9phxVNg7nJ/1Y+JINuJ4k7yUnS7n1ii1hlb9k7tskSdsB67Cs7Zn0m2E/2i92x39FKtBPtRvuL1qhR33nDliL6+BcsdMIT</latexit><latexit sha1_base64="mzuhQbPf0fpuUaUNzxhDHLedmi8=">AAAChXicZVHbahsxEJW3l6TuzUkfC0HUFBJIzK6b20uoafvQxwRqJ5A1ZqydTUS00iKNkprFj/2ZfkE+o6/pL/QnKtsbqJsBcQ4zoznSmXGppKM4vmtEjx4/ebqy+qz5/MXLV69ba+sDZ7wV2BdGGXs2BodKauyTJIVnpUUoxgpPx1efZ/XTa7ROGv2NJiUOC7jQMpcCKKRGrS8DfsQ3OYzibZ5uB0xq7Nb4ocbdGvdq3K/xgG81R6123InnwR+SpCbt3sbtyZ8fG7fHo7XGVpoZ4QvUJBQ4d550SxpWYEkKhdNm6h2WIK7gAqv5J6f8fUhlPDc2HE18nl3qKyYF0GVonIH7t3TuKT8cVlKXnlCLxazcK06Gz0zhmbQoSE04CBGe5IGClLgEC4KCeUsyDrRbCKX3tJla1HgjTFGAzqo0h0KqSYY5eEXTKnX5PV+eJL2W9D0knXRIvqxKtDuFyTDsJJ9phxVNg7nJ/1Y+JINuJ4k7yUnS7n1ii1hlb9k7tskSdsB67Cs7Zn0m2E/2i92x39FKtBPtRvuL1qhR33nDliL6+BcsdMIT</latexit><latexit sha1_base64="MTqSy52YPbqzv8WpaOUdwGZdKPg=">AAAChXicZVFNTxsxEHWWttD0g0CPvViNKoEE0W5KoZeqqO2BI5WagMRG0cQ7CxZee2WPodEqf6i/plf6a+okRiJlJOs9zYzn2W8mtZKO0vSulaw9efpsfeN5+8XLV683O1vbQ2e8FTgQRhl7PgGHSmockCSF57VFqCYKzybX3+b1sxu0Thr9k6Y1jiq41LKUAiikxp3vQ/6Z73AYp3s83wuYRexH/BDxIOLHiIcRj/hue9zppr10EfwxySLpshin463Wbl4Y4SvUJBQ4d5H1axo1YEkKhbN27h3WIK7hEpvFJ2f8fUgVvDQ2HE18kV3pq6YV0FVonIN7WLrwVH4aNVLXnlCL5azSK06Gz03hhbQoSE05CBGe5IGClLgCC4KCeSsyDrRbCuX3tJ1b1HgrTFWBLpq8hEqqaYEleEWzJnflPV+dJL2W9CsknXRIvm5qtPuVKTDspJxrhxXNgrnZ/1Y+JsN+L0t72Y+se/w12rzB3rJ3bIdl7IgdsxN2ygZMsN/sD7tjf5P1ZD85SA6XrUkr3nnDViL58g/36b5m</latexit>= (a
0, a
1, a
2, a
3, a
4, a
5, a
6, a
7) C
<latexit sha1_base64="XWQfpAJykbxFZMbbLxAsv0iSe4E=">AAACdHicZVFNaxsxEB1vv1L3y2mPLUXUBBJIza5TaC8F01x6TKBOAlljZO1sIqKPRRo1MYt/SP9Jb722P6G3/oqeK+86EDcDGj3ejOZJT7NKSU9p+ruT3Ll77/6DjYfdR4+fPH3W23x+5G1wAsfCKutOZtyjkgbHJEnhSeWQ65nC49nF/rJ+/BWdl9Z8oXmFE83PjCyl4BSpaW9vn31k2yxlu/ku22ty1uThDdzy71Z5Z9rrp4O0CXYbZCvQH73+fvgHAA6mm52dvLAiaDQkFPf+NBtWNKm5IykULrp58FhxccHPsG7etGBbkSpYaV1chljDrvXpueZ0HhuXm79ZOg1UfpjU0lSB0Ih2VhkUI8uWHrBCOhSk5owLEa8UOEUpcc4dFxS9WpPx3PhWKL+G3dyhwUthteamqPOSa6nmBZY8KFrUuS+v8fokGYykq0h66ZFCVVfo3mpbYPyCcqkdf2QRzc3+t/I2OBoOsnSQHWb90SdoYwNewhvYhgzewwg+wwGMQcA3+AE/4Vfnb/Iq6SdbbWvSWZ15AWuRDP4B4qq62w==</latexit><latexit sha1_base64="TdGzCAQxiza6XKZ1+0X1bRj4HtA=">AAACdHicZVFNTxsxEHW2tIX0g9AeQchqhAQSjXYDEr1UiuDCESQCSGwUOd5ZsPDHyh4XolWO/RH9J9x6pT+hv6H3nnF2g0RgJI+f3ozn2c+jQgqHcfy3Eb1aeP3m7eJS8937Dx+XWyufTp3xlkOfG2ns+Yg5kEJDHwVKOC8sMDWScDa6PpjWz36AdcLoExwXMFDsUotccIaBGrZ2Duh3ukljup1u050qJ1XuPsE1vzvLW8NWO+7EVdCXIJmBdm/97vjfz/W7o+FKYyvNDPcKNHLJnLtIugUOSmZRcAmTZuodFIxfs0soqzdN6EagMpobG5ZGWrFzfWqsGF6FxunmnpYuPObfBqXQhUfQvJ6Ve0nR0KkHNBMWOMoxZZyHK3mGQYpfMcs4Bq/mZBzTrhZKH2EztaDhhhulmM7KNGdKyHEGOfMSJ2Xq8kc8P0l4LfA2kE44QF+UBdivymQQviCfaocfmQRzk+dWvgSn3U4Sd5LjpN3bJ3UsklXyhWyShOyRHjkkR6RPOPlFfpN78qfxP1qL2tFG3Ro1Zmc+k7mIOg88DrxB</latexit><latexit sha1_base64="TdGzCAQxiza6XKZ1+0X1bRj4HtA=">AAACdHicZVFNTxsxEHW2tIX0g9AeQchqhAQSjXYDEr1UiuDCESQCSGwUOd5ZsPDHyh4XolWO/RH9J9x6pT+hv6H3nnF2g0RgJI+f3ozn2c+jQgqHcfy3Eb1aeP3m7eJS8937Dx+XWyufTp3xlkOfG2ns+Yg5kEJDHwVKOC8sMDWScDa6PpjWz36AdcLoExwXMFDsUotccIaBGrZ2Duh3ukljup1u050qJ1XuPsE1vzvLW8NWO+7EVdCXIJmBdm/97vjfz/W7o+FKYyvNDPcKNHLJnLtIugUOSmZRcAmTZuodFIxfs0soqzdN6EagMpobG5ZGWrFzfWqsGF6FxunmnpYuPObfBqXQhUfQvJ6Ve0nR0KkHNBMWOMoxZZyHK3mGQYpfMcs4Bq/mZBzTrhZKH2EztaDhhhulmM7KNGdKyHEGOfMSJ2Xq8kc8P0l4LfA2kE44QF+UBdivymQQviCfaocfmQRzk+dWvgSn3U4Sd5LjpN3bJ3UsklXyhWyShOyRHjkkR6RPOPlFfpN78qfxP1qL2tFG3Ro1Zmc+k7mIOg88DrxB</latexit><latexit sha1_base64="TdGzCAQxiza6XKZ1+0X1bRj4HtA=">AAACdHicZVFNTxsxEHW2tIX0g9AeQchqhAQSjXYDEr1UiuDCESQCSGwUOd5ZsPDHyh4XolWO/RH9J9x6pT+hv6H3nnF2g0RgJI+f3ozn2c+jQgqHcfy3Eb1aeP3m7eJS8937Dx+XWyufTp3xlkOfG2ns+Yg5kEJDHwVKOC8sMDWScDa6PpjWz36AdcLoExwXMFDsUotccIaBGrZ2Duh3ukljup1u050qJ1XuPsE1vzvLW8NWO+7EVdCXIJmBdm/97vjfz/W7o+FKYyvNDPcKNHLJnLtIugUOSmZRcAmTZuodFIxfs0soqzdN6EagMpobG5ZGWrFzfWqsGF6FxunmnpYuPObfBqXQhUfQvJ6Ve0nR0KkHNBMWOMoxZZyHK3mGQYpfMcs4Bq/mZBzTrhZKH2EztaDhhhulmM7KNGdKyHEGOfMSJ2Xq8kc8P0l4LfA2kE44QF+UBdivymQQviCfaocfmQRzk+dWvgSn3U4Sd5LjpN3bJ3UsklXyhWyShOyRHjkkR6RPOPlFfpN78qfxP1qL2tFG3Ro1Zmc+k7mIOg88DrxB</latexit><latexit sha1_base64="WCPO+XRvuACTZwa8/unq82HLC90=">AAACdHicZVFNTxsxEHUW2tLQj1CO7cEiQgKJRruhUnuphMqlRyo1gMRG0cQ7Cxb+WNnjttEqP4Rf02v7E/gjPdfZLBKBkTx+ejOeZz9PKyU9peltJ1lbf/L02cbz7uaLl69e97benHobnMCRsMq68yl4VNLgiCQpPK8cgp4qPJteHy/qZz/QeWnNd5pVONZwaWQpBVCkJr3DY/6Z7/GUH+QH/LDJWZOH9/CS/9Dm/Umvnw7SJvhjkLWgz9o4mWx19vPCiqDRkFDg/UU2rGhcgyMpFM67efBYgbiGS6ybN835bqQKXloXlyHesCt9eqaBrmLjYvP3SxeByk/jWpoqEBqxnFUGxcnyhQe8kA4FqRkHIeKVAlCUElfgQFD0akXGg/FLofwOdnOHBn8KqzWYos5L0FLNCiwhKJrXuS/v8OokGYykX5H00iOFqq7Qvde2wPgF5UI7/sg8mps9tPIxOB0OsnSQfcv6R19amzfYW7bD9ljGPrIj9pWdsBET7Ib9Zn/Y386/5F3ST3aXrUmnPbPNViIZ/AcHkriU</latexit>= (0, 3, 1, 2, 1, 3, 4, 4)
R
<latexit sha1_base64="NEj1efQu+P/ZTiabQ8C52re8q4o=">AAACaHicZVFNaxsxEB1vv1L3y0kPpRSKiClNIDW7biC5FEx76TEpdRLIGiNrZxMRfSzSKKlZ/BP6a3rpofkhufXcX1F5nUDdDOjp8TSapxlNKiU9pelVK7lz9979BysP248eP3n6rLO6duBtcAKHwirrjibco5IGhyRJ4VHlkOuJwsPJ2af5+eE5Oi+t+UrTCkeanxhZSsEpSuPO2y/sA9tgKdvKt1i/wfcNbje40+Au2xx3umkvbYLdJtk16Q5e/9z/DQB749XWZl5YETQaEop7f5z1KxrV3JEUCmftPHisuDjjJ1g3fczYmygVrLQuLkOsUZfy9FRzOo2J883/e3QcqNwd1dJUgdCIRa0yKEaWzftmhXQoSE0ZFyI+KXCKVuKUOy4ozmfJxnPjF0b5DW3nDg1eCKs1N0Wdl1xLNS2w5EHRrM59ecOXK8lgJH2LopceKVR1he6dtgXGsZdz7/gLszjc7P9R3iYH/V6W9rL9rDv4CItYgVewDhuQwQ4M4DPswRAEfIcf8AsuW3+STvIieblITVrXd57DUiTrfwGk8rgw</latexit><latexit sha1_base64="nkM+NWRgN5uHCfx2fNGUh85tnQc=">AAACaHicZVFNbxMxEHWWrxK+UjgghFRZjRCt1Ea7oVJ7QYrgwrFFpK3UjaKJd7a16o+VPaZEqxw58mu49AA/hL8AfwJn00qEjuTnp+fxPM94UinpKU1/tZJbt+/cvbdyv/3g4aPHTzqrTw+9DU7gUFhl3fEEPCppcEiSFB5XDkFPFB5Nzt/Pz48+o/PSmk80rXCk4dTIUgqgKI07rz/yt3yDp3wr3+L9Bt80uNPgboN7fHPc6aa9tAl+k2RXpDtYuzz483Xtcn+82trMCyuCRkNCgfcnWb+iUQ2OpFA4a+fBYwXiHE6xbvqY8VdRKnhpXVyGeKMu5empBjqLifPN/3t0EqjcG9XSVIHQiEWtMihOls/75oV0KEhNOQgRnxSAopU4AweC4nyWbDwYvzDKr2k7d2jwQlitwRR1XoKWalpgCUHRrM59ec2XK8lgJH2JopceKVR1hW5b2wLj2Mu5d/yFWRxu9v8ob5LDfi9Le9lB1h28Y4tYYS/ZOttgGdtlA/aB7bMhE+wb+85+sJ+t30kneZ68WKQmras7z9hSJOt/Af5HuZY=</latexit><latexit sha1_base64="nkM+NWRgN5uHCfx2fNGUh85tnQc=">AAACaHicZVFNbxMxEHWWrxK+UjgghFRZjRCt1Ea7oVJ7QYrgwrFFpK3UjaKJd7a16o+VPaZEqxw58mu49AA/hL8AfwJn00qEjuTnp+fxPM94UinpKU1/tZJbt+/cvbdyv/3g4aPHTzqrTw+9DU7gUFhl3fEEPCppcEiSFB5XDkFPFB5Nzt/Pz48+o/PSmk80rXCk4dTIUgqgKI07rz/yt3yDp3wr3+L9Bt80uNPgboN7fHPc6aa9tAl+k2RXpDtYuzz483Xtcn+82trMCyuCRkNCgfcnWb+iUQ2OpFA4a+fBYwXiHE6xbvqY8VdRKnhpXVyGeKMu5empBjqLifPN/3t0EqjcG9XSVIHQiEWtMihOls/75oV0KEhNOQgRnxSAopU4AweC4nyWbDwYvzDKr2k7d2jwQlitwRR1XoKWalpgCUHRrM59ec2XK8lgJH2JopceKVR1hW5b2wLj2Mu5d/yFWRxu9v8ob5LDfi9Le9lB1h28Y4tYYS/ZOttgGdtlA/aB7bMhE+wb+85+sJ+t30kneZ68WKQmras7z9hSJOt/Af5HuZY=</latexit><latexit sha1_base64="nkM+NWRgN5uHCfx2fNGUh85tnQc=">AAACaHicZVFNbxMxEHWWrxK+UjgghFRZjRCt1Ea7oVJ7QYrgwrFFpK3UjaKJd7a16o+VPaZEqxw58mu49AA/hL8AfwJn00qEjuTnp+fxPM94UinpKU1/tZJbt+/cvbdyv/3g4aPHTzqrTw+9DU7gUFhl3fEEPCppcEiSFB5XDkFPFB5Nzt/Pz48+o/PSmk80rXCk4dTIUgqgKI07rz/yt3yDp3wr3+L9Bt80uNPgboN7fHPc6aa9tAl+k2RXpDtYuzz483Xtcn+82trMCyuCRkNCgfcnWb+iUQ2OpFA4a+fBYwXiHE6xbvqY8VdRKnhpXVyGeKMu5empBjqLifPN/3t0EqjcG9XSVIHQiEWtMihOls/75oV0KEhNOQgRnxSAopU4AweC4nyWbDwYvzDKr2k7d2jwQlitwRR1XoKWalpgCUHRrM59ec2XK8lgJH2JopceKVR1hW5b2wLj2Mu5d/yFWRxu9v8ob5LDfi9Le9lB1h28Y4tYYS/ZOttgGdtlA/aB7bMhE+wb+85+sJ+t30kneZ68WKQmras7z9hSJOt/Af5HuZY=</latexit><latexit sha1_base64="CN1PzUBD912+KrjMegh6xIwccW0=">AAACaHicZVHbThsxEHW2tKXpLZSHCvFiEVUFiUa7KRK8VELtSx8BNYDERtHEOwsWvqzsMW20yif0a3htP6S/0K+oswkSKSN5fHQ8nuM5HldKekrTP63k0crjJ09Xn7Wfv3j56nVn7c2pt8EJHAirrDsfg0clDQ5IksLzyiHoscKz8fWX2fnZDTovrflGkwqHGi6NLKUAitSo8/6Ef+LbPOW7+S7vN/ljk/eavN/kA74z6nTTXtoEfwiyBeiyRRyN1lo7eWFF0GhIKPD+IutXNKzBkRQKp+08eKxAXMMl1s0cU/4uUgUvrYvLEG/YpTo90UBXsXC2+ftHF4HKg2EtTRUIjZj3KoPiZPlsbl5Ih4LUhIMQ8UkBKEqJK3AgKPqzJOPB+LlQfgfbuUOD34XVGkxR5yVoqSYFlhAUTevcl3d4uZMMRtKPSHrpkUJVV+g+aFtgtL2cacdfmEZzs/+tfAhO+70s7WXHWffw88LmVbbJttg2y9g+O2Rf2REbMMF+slv2i/1u/U06ydtkY16atBZ31tlSJFv/AMnLtek=</latexit>= (0, 2, 3, 4, 7, 8)
§
Implementation in C:where
n_rows = m , nnz = k ,
val = V ,
col_idx = C , row_start = R
struct {
int n_rows; // number of rows
int nnz; // total number of non-zero elements int row_start[n_rows+1];
int col_idx[nnz];
double val[nnz];
}
row_start col_idx
val nnz