Theory of Parallel and Distributed Systems (WS 2016/17)
Chapter 1
First Algorithms for the PRAM
Walter Unger
Lehrstuhl für Informatik 1
12:00, 30 January 2017
Contents
1 Motivation and History: Systolic Arrays and Vector Computers, Transputer, Parallel Computers, PRAM
2 PRAM Introduction: Definition, Or, Sum, Matrices, Prefix Sum, Maximum, Identify Root, Situation
3 Efficiency: Definition, Overview
4 Selection: Idea for the k-th Element, Examples, Algorithm and Running Time
5 Merging: Sequential Merging, Parallel Merging
Motivation
1 There are limits to the computing power of a single computer.
2 Computers become cheaper.
3 Specialized computers are expensive.
4 There are tasks with large amounts of data.
5 Many problems are very complex:
1 Weather and other simulations
2 Crash tests
3 Military applications
4 Large data (SETI, ...)
5 More similar problems
6 Thus there is a need for computers with more than one CPU
7 or a quantum computer?
Pipeline (Systolic Array)
P1 P2 P3 P4 P5 P6 P7 P8 P9
input → output
There is a sequence of processors (P_i), 1 ≤ i ≤ n.
Processor P_1 receives the input.
The output of P_1 is passed as the input of P_2.
The output of P_i is passed as the input of P_{i+1}, 1 ≤ i < n.
Processor P_n delivers the final output.
The processors may be different.
The processors may run different programs.
Intermediate outputs may be buffered.
Pipelining is one important type of parallel system (in practice).
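The stage-by-stage structure above can be sketched in software. The following is an illustrative simulation, not part of the slides; the concrete stage functions are made up for the example:

```python
# Illustrative sketch: a software pipeline P1 -> P2 -> ... -> Pn,
# modelled with Python generators. Each stage transforms the stream
# and passes its output on as the input of the next stage.
def stage(f, stream):
    for item in stream:
        yield f(item)

def pipeline(stages, inputs):
    stream = iter(inputs)      # P1 receives the input
    for f in stages:           # output of P_i feeds P_{i+1}
        stream = stage(f, stream)
    return list(stream)        # Pn delivers the final output

# The stages may run different programs:
result = pipeline([lambda x: x + 1, lambda x: 2 * x, str], [1, 2, 3])
print(result)  # ['4', '6', '8']
```

While one item is in a late stage, the next items already occupy the earlier stages; that overlap is the source of the speed-up in practice.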
Systolic Arrays
[figure: a row of identical processors; input streams enter from several sides and output streams leave]
Idea: use more than one data stream.
The data streams may intersect each other.
Each processor is the same.
There is a global synchronisation.
The processors may run simple programs.
Advantage: really fast (for special applications).
Systolic Array with three data streams
[figure: a two-dimensional systolic array; three data streams of inputs enter from different sides and the corresponding outputs leave on the opposite sides]
Vector Computer
A vector of processors.
Each processor has different data.
But each processor executes the same program.
Addition of two vectors:
1 Read vector A
2 Read vector B
3 Add (each processor)
4 Output the sum
Single Instruction, Multiple Data: SIMD computer.
Aim: Multiple Instruction, Multiple Data: MIMD computer.
I.e. fast processors with fast connections.
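The four-step vector addition can be illustrated as one common instruction applied lane by lane; a minimal sketch (not from the slides):

```python
# Illustrative sketch of a SIMD step: conceptually, lane i holds
# A[i] and B[i]; all lanes execute the same instruction (add) on
# different data in a single parallel step.
def simd_add(A, B):
    return [a + b for a, b in zip(A, B)]   # one instruction, many data

print(simd_add([1, 2, 3], [10, 20, 30]))   # [11, 22, 33]
```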
Example: Transputer
[diagram: CPU and memory attached to a bus, with four links L1, L2, L3, L4]
Advantage: very flexible, any fixed network of degree 4 is possible.
Disadvantage: long wires may be necessary, and only a fixed network is possible.
Example: Transputer II
[diagram: CPU and memory attached to a bus; the links are connected through a switch]
Parallel Computers I
[diagram: CPU and memory attached to a bus, with link and switch]
Advantage: "normal" CPUs.
Advantage: fast links are possible.
Advantage: no special hardware.
Advantage: a variable network, which may change during execution.
Advantage: very large networks may be possible.
Disadvantage: still a limited degree for the network.
Disadvantage: large networks are complicated.
Problem: cooling large systems.
Problem: fault tolerance.
Problem: constructing such a system.
Problem: generating good data throughput with a constant-degree network.
Problem: do the program structures fit the structure of the network?
Parallel Computers II (Goodput)
[diagram: CPU and memory attached to a bus, with link and switch]
Look for good networks:
Trees, Grids, Pyramids, ...
HQ(n), CCC(n), BF(n), SE(n), DB(n), ...
Pancake Network and Burned Pancake Network.
Problem: physical placement of the processors.
Problem: length of the wires.
Problem: does the network have a nice structure?
If the network becomes too large, we may lose efficiency.
Solution: choose a mixed network structure.
Parallel Computers III (Network)
[diagram: CPUs with memory connected via links and switches]
Parallel Computers IV (Network)
[diagram: a larger network of CPUs, links and switches]
Parallel Computers V (Network)
Two ways to organize CPUs and memory:
1 CPU and memory are one logical unit:
a network connects nodes, each consisting of a CPU with its RAM.
2 CPUs and memory are connected by a network:
the network sits between the CPUs and the RAM modules.
The difference is more on the practical side.
PRAM (theoretical model)
[diagram: processors P1–P8 above memory cells M1–M8]
Ignore/unify the costs of each computation step.
Ignore/unify the costs of each communication step.
Definition: RAM
RAM: Random Access Machine. The CPU may access any memory cell; the memory is unlimited.
Complexity measurements:
uniform: each operation costs one unit.
logarithmic: costs are measured according to the size of the numbers.
Idea of the PRAM
[diagram: global control above processors P0–P8, which share a common memory]
Many processors. A common program.
The program may select single processors. A common memory.
Definition: PRAM
Consists of processors P_i with 1 ≤ i ≤ p (processor P_i has id i).
Consists of registers R_j with 1 ≤ j ≤ m.
Each processor has some local registers.
Each processor P_i may access each register R_j.
Each processor executes the same program.
The program is synchronized, thus each processor executes the same instructions.
A selection is possible by using the processor id.
The input of length n is written to the registers R_j with 1 ≤ j ≤ n.
The output is placed in some known registers.
The registers contain words (numbers) in the uniform cost measurement.
The registers contain bits in the logarithmic cost measurement.
Definition: PRAM (continued)
The following instructions are possible:
1 Processor P_i reads register R_j: R_j → P_i(x).
2 Processor P_i writes the value of x into register R_j: P_i(x) → R_j.
3 A processor may do some local computation using its local registers, e.g. x := y * 5.
For the access to the registers we have the following variations:
EREW: Exclusive Read, Exclusive Write
CREW: Concurrent Read, Exclusive Write
CRCW: Concurrent Read, Concurrent Write
ERCW: Exclusive Read, Concurrent Write
Write conflicts may be resolved using the following rules:
Arbitrary: some one processor gets access to the register.
Common: all processors writing to the same register have to write the same value.
Priority: the processor with the smallest id gets access to the register.
Computation of an "Or" (Idea)
Input bits: 0, 1, 0, 0, 1, 0, 0, 1
0 ∨ 1 ∨ 0 ∨ 0 ∨ 1 ∨ 0 ∨ 0 ∨ 1 → 1
Computing an "Or"
Task: Compute x = x_1 ∨ x_2 ∨ ... ∨ x_n.
Input: x_i is in register R_i (1 ≤ i ≤ n).
Output: computed in R_{n+1}.
Model: CRCW (Arbitrary, Common or Priority).
Program: Or
  for all P_i with 1 ≤ i ≤ n do in parallel
    R_i → P_i(x)
    if x = true then P_i(x) → R_{n+1}
Running time: O(1) (exactly 2 steps).
Number of processors: n.
Memory: n + 1.
Also possible models: ERCW (Arbitrary, Common or Priority).
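A step-synchronous simulation of this program may help; this is an illustrative sketch, where R[n] plays the role of R_{n+1} and the Common rule makes the concurrent write legal because every writer writes the same value:

```python
# Sketch: simulate the CRCW program "Or". All processors read in
# step 1; in step 2 every processor whose bit is true writes true
# into the result register (Common rule: all writers agree).
def crcw_or(xs):
    n = len(xs)
    R = list(xs) + [False]              # R_1..R_n = input, R_{n+1} = false
    local = [R[i] for i in range(n)]    # step 1: R_i -> P_i(x)
    for i in range(n):                  # step 2, "in parallel"
        if local[i]:
            R[n] = True                 # concurrent write of the same value
    return R[n]

print(crcw_or([False, True, False]))    # True
```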
Computing an "Or" (EREW)
Problem:
no two processors may write to the same register at the same time.
Idea: combine the results pairwise.
With this idea, computing the sum is also possible.
Thus computing the "Or" is just a special case of computing a sum.
Computing the Sum (Idea)
S1 S2 S3 S4 S5 S6 S7 S8
S1..2 S3..4 S5..6 S7..8
S1..4 S5..8
S1..8
Computing the Sum (Idea)
[figure: four processors P1–P4 add pairs of the input values and then combine the partial sums pairwise until the total sum remains]
Computing the Sum (EREW)
Assume w.l.o.g. n = 2^k for k ∈ N.
Task: compute x = x_1 + x_2 + ... + x_n with n = 2^k.
Input: x_i is in register R_i (1 ≤ i ≤ n).
Output: should be in R_1 (the input may be overwritten).
Model: EREW.
Program: Sum
  for all P_i with 1 ≤ i ≤ n/2 do in parallel
    R_{2i−1} → P_i(x)
    for j = 1 to k do
      if (i−1) ≡ 0 (mod 2^{j−1}) then
        R_{2i−1+2^{j−1}} → P_i(y)
        x := x + y
        P_i(x) → R_{2i−1}
Running time: O(k) = O(log n) (precisely 3k + 1 steps).
Number of processors: n/2.
Size of memory: n.
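A sequential simulation of the rounds of this program, as an illustrative sketch (0-based indices: processor p holds register R[2p]):

```python
# Sketch: simulate the EREW program "Sum" for n = 2^k. In round j,
# the active processors read a register 2^{j-1} positions to the
# right of their own; all reads and writes of a round touch
# pairwise distinct registers (EREW).
def erew_sum(values):
    R = list(values)
    n = len(R)                       # assumed: n = 2^k
    k = n.bit_length() - 1
    x = [R[2 * p] for p in range(n // 2)]   # P_p loads R[2p]
    for j in range(1, k + 1):
        d = 2 ** (j - 1)
        for p in range(n // 2):      # "in parallel"
            if p % d == 0:
                x[p] += R[2 * p + d] # exclusive read
                R[2 * p] = x[p]      # exclusive write
    return R[0]                      # the sum ends up in R_1

print(erew_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```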
Addition of Matrices
Assume w.l.o.g. n = 2^k for k ∈ N.
Let A, B be two (n × n) matrices.
The sum A + B is computable with n^2 processors in time O(1) on an EREW PRAM.
R_1 to R_{n^2} contain A (one row after the other).
R_{1+n^2} to R_{2n^2} contain B (one row after the other).
The result is placed in R_{1+2n^2} to R_{3n^2}.
Program: MatSum
  for all P_i with 1 ≤ i ≤ n^2 do in parallel
    R_i → P_i(x)
    R_{i+n^2} → P_i(y)
    x := x + y
    P_i(x) → R_{i+2n^2}
Running time: O(1).
Number of processors: O(n^2).
Size of memory: O(n^2).
Multiplication of Matrices
Assume w.l.o.g. n = 2^k for k ∈ N.
Let A, B be two (n × n) matrices.
R_1 to R_{n^2} contain A (one row after the other).
R_{1+n^2} to R_{2n^2} contain B (one row after the other).
The result is placed in R_{1+2n^2} to R_{3n^2}.
Register A_{i,j} = R_{(i−1)n+j} (1 ≤ i,j ≤ n).
Register B_{i,j} = R_{(i−1)n+j+n^2} (1 ≤ i,j ≤ n).
Register C_{i,j} = R_{(i−1)n+j+2n^2} (1 ≤ i,j ≤ n).
Processor P_{i,j} = P_{(i−1)n+j} (1 ≤ i,j ≤ n).
Use the above notation to simplify the algorithm.
Each processor has to do some hidden local computation to implement the above expressions.
Multiplication of Matrices
A_{i,j} = R_{(i−1)n+j}, B_{i,j} = R_{(i−1)n+j+n^2}, C_{i,j} = R_{(i−1)n+j+2n^2}, P_{i,j} = P_{(i−1)n+j}
Let A, B be two (n × n) matrices.
The product A · B is computable with n^2 processors in time O(n) on a CREW PRAM.
Program: MatrProd 1
  for all P_{i,j} with 1 ≤ i,j ≤ n do in parallel
    h := 0
    for l = 1 to n do
      A_{i,l} → P_{i,j}(a)
      B_{l,j} → P_{i,j}(b)
      h := h + a · b
    P_{i,j}(h) → C_{i,j}
Running time: O(n).
Number of processors: O(n^2).
Size of memory: O(n^2).
Multiplication of Matrices
A_{i,j} = R_{(i−1)n+j}, B_{i,j} = R_{(i−1)n+j+n^2}, C_{i,j} = R_{(i−1)n+j+2n^2}, P_{i,j} = P_{(i−1)n+j}
Let A, B be two (n × n) matrices.
The product A · B is computable with n^2 processors in time O(n) on an EREW PRAM.
Program: MatrProd 2
  for all P_{i,j} with 1 ≤ i,j ≤ n do in parallel
    h := 0
    for l = 1 to n do
      A_{i,l'} → P_{i,j}(a) with l' = ((i + j + l) mod n) + 1
      B_{l',j} → P_{i,j}(b)
      h := h + a · b
    P_{i,j}(h) → C_{i,j}
Each processor starts its loop at a different offset l', so in every step no two processors read the same register; the sum h is unchanged, only the order of the additions differs.
Running time: O(n).
Number of processors: O(n^2).
Size of memory: O(n^2).
Compute the Prefix Sum
Problem:
Task: Compute s_i = x_1 + ... + x_i for 1 ≤ i ≤ n.
Input: x_j is in register R_j (1 ≤ j ≤ n).
Output: s_i should be in register R_i for 1 ≤ i ≤ n.
Computing the Prefix Sum (Idea)
[figure: in round j, every position i with i > 2^{j−1} adds the partial sum from position i − 2^{j−1}; after log n rounds every position i holds x_{1..i}]
Done!
Computing the Prefix Sum
Task: Compute s_i = x_1 + ... + x_i for 1 ≤ i ≤ n.
Input: x_j is in register R_j (1 ≤ j ≤ n).
Output: s_i should be in register R_i for 1 ≤ i ≤ n.
Model: EREW.
Program: PrefixSum
  for all P_i with 1 ≤ i ≤ n do in parallel
    R_i → P_i(x)
    for j = 1 to k do (with k = ⌈log n⌉)
      if i > 2^{j−1} then
        R_{i−2^{j−1}} → P_i(y)
        x := x + y
      P_i(x) → R_i
Running time: O(k) = O(log n) (precisely 3k + 1 steps).
Number of processors: n.
Size of memory: n.
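A round-by-round simulation of this program, as an illustrative sketch; the snapshot of all reads models the synchronous step in which all reads happen before all writes:

```python
# Sketch: simulate the EREW prefix-sum program. In round j, every
# position i with i >= 2^{j-1} (0-based) adds the value stored
# 2^{j-1} positions to its left; after ceil(log2 n) rounds R[i]
# holds x_1 + ... + x_{i+1}.
def erew_prefix_sum(values):
    R = list(values)
    n = len(R)
    k = max(1, (n - 1).bit_length())        # number of rounds
    x = list(R)                             # P_i loads R_i
    for j in range(1, k + 1):
        d = 2 ** (j - 1)
        y = [R[i - d] if i >= d else 0 for i in range(n)]  # all reads first
        for i in range(n):                  # then all writes
            if i >= d:
                x[i] += y[i]
            R[i] = x[i]
    return R

print(erew_prefix_sum([1, 2, 3, 4]))        # [1, 3, 6, 10]
```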
Compute the Maximum
Task: Compute m = max_{j=1..n} x_j with n = 2^k.
Input: x_j is in register R_j (1 ≤ j ≤ n).
Output: m should be in register R_{n+1}.
Possible with n processors in time O(log n) using an EREW PRAM.
Question: could it be done faster (e.g. on a CRCW PRAM)?
The maximum is larger than or equal to all other values.
Idea: compare all pairs of numbers.
The maximum will always win.
Compute the Maximum (Idea)
[figure: the 16 input values 34, 12, 14, 56, 23, 67, 49, 27, 61, 52, 57, 59, 26, 41, 33, 22 are compared in all pairs; a 0/1 matrix records for every value whether it has lost a comparison]
Compute the Maximum (Idea)
[figure: after all pairwise comparisons, only the maximum 67 has not lost any comparison; its flag remains 1 while all other flags are 0]
Computing the Maximum
Task: Compute m = max_{j=1..n} x_j with n = 2^k.
Input: x_j is in register R_j (1 ≤ j ≤ n).
Output: m in register R_{n+1}.
Model: CRCW.
Program: Maximum
  for all P_{i,1} with 1 ≤ i ≤ n do in parallel
    P_{i,1}(1) → W_i
  for all P_{i,j} with 1 ≤ i,j ≤ n do in parallel
    R_i → P_{i,j}(a)
    R_j → P_{i,j}(b)
    if a < b then P_{i,j}(0) → W_i
  for all P_{i,1} with 1 ≤ i ≤ n do in parallel
    W_i → P_{i,1}(h)
    if h = 1 then
      R_i → P_{i,1}(h)
      P_{i,1}(h) → R_{n+1}
Computing the Maximum (Costs)
For the program Maximum above:
Running time: O(1).
Number of processors: O(n^2).
Memory: O(n).
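A simulation of the program with the n^2 comparisons written as nested loops; an illustrative sketch, where W is the list of flag registers and the Common rule applies since every losing write is a 0:

```python
# Sketch: simulate the CRCW program "Maximum". Processor P_{i,j}
# compares R_i with R_j; whenever R_i < R_j it clears the flag W_i.
# Only maxima keep W_i = 1, and one of them writes the result.
def crcw_max(R):
    n = len(R)
    W = [1] * n                      # P_{i,1} writes 1 into W_i
    for i in range(n):               # all P_{i,j} "in parallel"
        for j in range(n):
            if R[i] < R[j]:          # concurrent reads of R_i, R_j
                W[i] = 0             # concurrent write (Common: all write 0)
    result = None
    for i in range(n):               # a P_{i,1} with W_i = 1 writes R_{n+1}
        if W[i] == 1:
            result = R[i]
    return result

print(crcw_max([3, 7, 2, 7]))        # 7
```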
Identify the Roots of a Forest
The nodes are identified by numbers from 1 to n.
Input: the father of node i is written in register R_i.
For a root i we have: register R_i contains i itself.
Program: Ranking
  for all P_i with 1 ≤ i ≤ n do in parallel
    for j = 1 to ⌈log n⌉ do
      R_i → P_i(h)
      R_h → P_i(h)
      P_i(h) → R_i
Running time: O(log n).
Number of processors: O(n).
Memory: O(n).
Model: CREW.
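The pointer-jumping rounds can be simulated directly; an illustrative sketch with 0-based node ids, where parent[i] == i marks a root:

```python
# Sketch: simulate the CREW program "Ranking" by pointer jumping.
# In every round each node replaces its pointer by the pointer of
# its current father, doubling the distance jumped per round.
import math

def find_roots(parent):
    R = list(parent)
    n = len(R)
    rounds = max(1, math.ceil(math.log2(n))) if n > 1 else 1
    for _ in range(rounds):
        R = [R[R[i]] for i in range(n)]   # concurrent reads of R[h]
    return R

print(find_roots([0, 0, 1, 2]))           # chain 3 -> 2 -> 1 -> 0: [0, 0, 0, 0]
```

After r rounds each pointer spans a distance of up to 2^r, so ⌈log n⌉ rounds reach the root from every node.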
Short Summary
Problem     processors     memory  time        model
Or          O(n/t)         O(n)    O(t)        ERCW
Or          O(n/log n)     O(n)    O(log n)    EREW
Maximum     O(n^2/t)       O(n)    O(t)        CRCW
Sum         O(n/log n)     O(n)    O(log n)    EREW
Ranking     O(n/log n)     O(n)    O(log n)    CREW
Prefixsum   O(n/log n)     O(n)    O(log n)    EREW
Mat.sum     O(n^2/log n)   O(n^2)  O(log n)    EREW
Mat.prod.   O(n^2/log n)   O(n^2)  O(n·log n)  CREW
Mat.prod.   O(n^3/log n)   O(n^2)  O(log n)    CREW
Mat.prod.   O(n^3/log n)   O(n^2)  O(log n)    EREW
Question: May we save some processors?
May we do this saving in any situation?
How do we estimate the efficiency of a parallel algorithm?
Cost Measurement
Let A be any parallel algorithm; we denote:
T_A(n): the running time of A.
P_A(n): the number of processors used by A.
R_A(n): the number of registers used by A.
W_A(n): the number of accesses to registers done by A.
ST(n): the running time of the best [known] sequential algorithm.
Eff_A(n) := ST(n) / (P_A(n) · T_A(n)): the efficiency of A.
AEff_A(n) := W_A(n) / (P_A(n) · T_A(n)): the usage efficiency of A.
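As a numeric illustration of these definitions (a sketch; the values ST(n) = n, P(n) = n/log n, T(n) = log n, W(n) = n are those of the EREW sum algorithm from this chapter):

```python
# Sketch: the two efficiency measures as plain formulas, evaluated
# for the EREW sum algorithm with P(n) = n/log n processors.
import math

def efficiency(st, p, t):           # Eff = ST(n) / (P(n) * T(n))
    return st / (p * t)

def usage_efficiency(w, p, t):      # AEff = W(n) / (P(n) * T(n))
    return w / (p * t)

n = 65536                           # power of two, so log2 is exact
lg = math.log2(n)                   # 16.0
print(efficiency(n, n / lg, lg))        # 1.0 -> no processor is wasted
print(usage_efficiency(n, n / lg, lg))  # 1.0
```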
Efficiency
Problem     processors     time        W(n)    AEff  model
Or          O(n/t)         O(t)        O(n)    1     ERCW
Or          O(n/log n)     O(log n)    O(n)    1     EREW
Maximum     O(n^2/t)       O(t)        O(n^2)  1     CRCW
Sum         O(n/log n)     O(log n)    O(n)    1     EREW
Ranking     O(n/log n)     O(log n)    O(n)    1     CREW
Prefixsum   O(n/log n)     O(log n)    O(n)    1     EREW
Mat.sum     O(n^2/log n)   O(log n)    O(n^2)  1     EREW
Mat.prod.   O(n^2/log n)   O(n·log n)  O(n^3)  1     CREW
Mat.prod.   O(n^3/log n)   O(log n)    O(n^3)  1     CREW
Mat.prod.   O(n^3/log n)   O(log n)    O(n^3)  1     EREW
Efficiency
Problem     processors     time        ST(n)       Eff          model
Or          O(n/t)         O(t)        O(n)        1            ERCW
Or          O(n/log n)     O(log n)    O(n)        1            EREW
Maximum     O(n^2/t)       O(t)        O(n)        O(1/n)       CRCW
Sum         O(n/log n)     O(log n)    O(n)        1            EREW
Ranking     O(n/log n)     O(log n)    O(n)        1            CREW
Prefixsum   O(n/log n)     O(log n)    O(n)        1            EREW
Mat.sum     O(n^2/log n)   O(log n)    O(n^2)      1            EREW
Mat.prod.   O(n^2/log n)   O(n·log n)  O(n^2.376)  O(n^−0.624)  CREW
Mat.prod.   O(n^3/log n)   O(log n)    O(n^2.376)  O(n^−0.624)  CREW
Mat.prod.   O(n^3/log n)   O(log n)    O(n^2.376)  O(n^−0.624)  EREW
k-th Element
Task: Compute the k-th (k-smallest) element of an unsorted sequence S = {s_1, ..., s_n}.
Lower bound: n − 1 comparisons.
Start with a nice sequential algorithm.
Program: Select(k, S)
  if |S| ≤ 50 then return the k-th number in S
  Split S into ⌈n/5⌉ sub-sequences H_i of size ≤ 5
  Sort each H_i
  Let M be the sequence of the middle elements of the H_i
  m := Select(⌈|M|/2⌉, M)
  S1 := {s ∈ S | s < m}
  S2 := {s ∈ S | s = m}
  S3 := {s ∈ S | s > m}
  if |S1| ≥ k then return Select(k, S1)
  if |S1| + |S2| ≥ k then return m
  return Select(k − |S1| − |S2|, S3)
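A direct Python transcription of Select(k, S), with k 1-based; the cut-off 50 and the groups of five are as above:

```python
# Sketch: sequential Select(k, S), the median-of-medians algorithm.
def select(k, S):
    if len(S) <= 50:
        return sorted(S)[k - 1]          # k-th number in S
    groups = [sorted(S[i:i + 5]) for i in range(0, len(S), 5)]
    M = [g[(len(g) - 1) // 2] for g in groups]   # middle elements
    m = select((len(M) + 1) // 2, M)             # median of medians
    S1 = [s for s in S if s < m]
    S2 = [s for s in S if s == m]
    S3 = [s for s in S if s > m]
    if len(S1) >= k:
        return select(k, S1)
    if len(S1) + len(S2) >= k:
        return m
    return select(k - len(S1) - len(S2), S3)

print(select(5, list(range(200, 0, -1))))        # 5
```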
Example for the k-th Element (Slow Motion)
[figure: an input sequence of 105 numbers; M, the sequence of the group medians; and M in sorted order, from which the median of medians is taken]
Example for the k-th Element
[figure: another input sequence with its median sequence M and sorted M]
Example for the k-th Element (Worst Case)
[figure: a worst-case input sequence with its median sequence M and sorted M]
Running Time
(program Select(k, S) as on the previous slides)
For some constants c, d we get:
T(n) ≤ d · n for n ≤ 50
T(n) ≤ c · n + T(n/5) + T(3n/4) for n > 50
The recursive call on M has size n/5; since at least a quarter of the elements of S are ≤ m and at least a quarter are ≥ m, the final recursive call has size at most 3n/4.
Running Time
Claim: T(n) ≤ 20 · r · n with r = max(c, d).
Proof:
n ≤ 50:
T(n) ≤ d · n ≤ 20 · r · n.
n > 50:
T(n) ≤ c · n + T(n/5) + T(3n/4)
     ≤ c · n + 20 · r · n/5 + 20 · r · 3n/4
     = c · n + 4 · r · n + 15 · r · n ≤ 20 · r · n.
Thus the running time T(n) is in O(n).
Parallel k-Select
Input: S = {s_1, ..., s_n}.
Processors P_1, P_2, ..., P_{⌈n^{1−x}⌉}, thus P(n) = ⌈n^{1−x}⌉.
Each P_i knows n and P(n).
Each P_i works on ⌈n^x⌉ elements.
We will now create a parallel version of the program Select(k, S).
We will get a parallel recursive program.
1 Easy solution for small S.
2 Split S into small sub-sequences for the processors.
3 Compute in parallel the medians of the sub-sequences.
4 Compute in parallel and recursively the median of the medians.
5 Compute the splitting into the three sub-sequences.
6 Do the final recursion.
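The six steps can be simulated sequentially; this is an illustrative sketch where the choice x = 1/2 and the cut-off 50 are assumptions, and a library sort stands in for the per-processor Select:

```python
# Sketch: sequential simulation of the six steps of ParSelect(k, S)
# with about sqrt(n) "processors", each holding about sqrt(n) elements.
import math

def seq_select(k, S):                  # stand-in for sequential Select
    return sorted(S)[k - 1]

def par_select(k, S):
    n = len(S)
    if n <= 50:                        # step 1: easy solution for small S
        return seq_select(k, S)
    p = math.ceil(n ** 0.5)            # chunk size ~ n^x with x = 1/2
    chunks = [S[i:i + p] for i in range(0, n, p)]           # step 2
    M = [seq_select((len(c) + 1) // 2, c) for c in chunks]  # step 3
    m = par_select((len(M) + 1) // 2, M)                    # step 4
    L = [s for s in S if s < m]        # step 5 (on a PRAM, parallel
    E = [s for s in S if s == m]       # prefix sums compact L, E, G)
    G = [s for s in S if s > m]
    if len(L) >= k:                    # step 6: final recursion
        return par_select(k, L)
    if len(L) + len(E) >= k:
        return m
    return par_select(k - len(L) - len(E), G)

print(par_select(7, list(range(100, 0, -1))))               # 7
```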
Example for the k-th Element
[figure: 35 processors P1–P35, each holding 15 input values; M, the sequence of their medians, and M in sorted order, from which the median of medians is taken]
Parallel k-Select
Program: ParSelect(k, S)
1:
  if |S| ≤ k_1 then P_1 returns Select(k, S).
2:
  S is split into ⌈|S|^{1−x}⌉ sub-sequences S_i with |S_i| ≤ ⌈n^x⌉.
  P_i stores the start address of S_i.
3:
  for all P_i with 1 ≤ i ≤ ⌈n^{1−x}⌉ do in parallel
    m_i := Select(⌈|S_i|/2⌉, S_i)
    P_i(m_i) → R_i
  Assume in the following that M is the sequence of these values.
4:
  m := ParSelect(⌈|M|/2⌉, M).
5: More to come!
Parallel k-Select
Program: ParSelect(k, S), step 5
5.1:
  Distribute m via broadcast to all P_i.
  for all P_i with 1 ≤ i ≤ ⌈n^{1−x}⌉ do in parallel
    L_i := {s ∈ S_i | s < m}
    E_i := {s ∈ S_i | s = m}
    G_i := {s ∈ S_i | s > m}
5.2:
  Compute with parallel prefix:
  l_i := |L_1| + ... + |L_i| for all 1 ≤ i ≤ ⌈n^{1−x}⌉.
  e_i := |E_1| + ... + |E_i| for all 1 ≤ i ≤ ⌈n^{1−x}⌉.
  g_i := |G_1| + ... + |G_i| for all 1 ≤ i ≤ ⌈n^{1−x}⌉.
  Let l_0 = e_0 = g_0 = 0.
5.3: Even more to come!
Parallel k-Select
Program: ParSelect(k, S), steps 5 + 6
5.3:
  Compute L = {s ∈ S | s < m}, E = {s ∈ S | s = m}
  and G = {s ∈ S | s > m} as follows:
  for all P_i with 1 ≤ i ≤ ⌈n^{1−x}⌉ do in parallel
    P_i writes L_i into R_{l_{i−1}+1}, ..., R_{l_i}.
    P_i writes E_i into R_{e_{i−1}+1}, ..., R_{e_i}.
    P_i writes G_i into R_{g_{i−1}+1}, ..., R_{g_i}.
6:
  if |L| ≥ k then return ParSelect(k, L)
  if |L| + |E| ≥ k then return m
  return ParSelect(k − |L| − |E|, G)
Parallel k-Select (Running Time)
Program: ParSelect(k, S)
1: O(1)
  if |S| ≤ k_1 then P_1 returns Select(k, S).
2: O(log(|S|^{1−x})), thus O(log n)
  S is split into ⌈|S|^{1−x}⌉ sub-sequences S_i with |S_i| ≤ ⌈n^x⌉.
  P_i stores the start address of S_i.
3: O(n^x)
  for all P_i with 1 ≤ i ≤ ⌈n^{1−x}⌉ do in parallel
    m_i := Select(⌈|S_i|/2⌉, S_i)
    P_i(m_i) → R_i
  Assume in the following that M is the sequence of these values.
4: T_ParSelect(n^{1−x})
  m := ParSelect(⌈|M|/2⌉, M).
Parallel k-Select (Running Time)
Program: ParSelect(k, S), step 5
5.1a: O(log(n^{1−x}))
  Distribute m via broadcast to all P_i.
5.1b: O(|S_i|), thus O(n^x)
  for all P_i with 1 ≤ i ≤ ⌈n^{1−x}⌉ do in parallel
    L_i := {s ∈ S_i | s < m}
    E_i := {s ∈ S_i | s = m}
    G_i := {s ∈ S_i | s > m}
5.2: O(log(n^{1−x}))
  Compute with parallel prefix:
  l_i := |L_1| + ... + |L_i| for all 1 ≤ i ≤ ⌈n^{1−x}⌉.
  e_i := |E_1| + ... + |E_i| for all 1 ≤ i ≤ ⌈n^{1−x}⌉.
  g_i := |G_1| + ... + |G_i| for all 1 ≤ i ≤ ⌈n^{1−x}⌉.
  Let l_0 = e_0 = g_0 = 0.
Parallel k-Select (Running Time)
Program: ParSelect(k, S), steps 5 + 6
5.3: O(n^x)
  Compute L = {s ∈ S | s < m}, E = {s ∈ S | s = m}
  and G = {s ∈ S | s > m} as follows:
  for all P_i with 1 ≤ i ≤ ⌈n^{1−x}⌉ do in parallel
    P_i writes L_i into R_{l_{i−1}+1}, ..., R_{l_i}.
    P_i writes E_i into R_{e_{i−1}+1}, ..., R_{e_i}.
    P_i writes G_i into R_{g_{i−1}+1}, ..., R_{g_i}.
6: T_ParSelect(3n/4)
  if |L| ≥ k then return ParSelect(k, L)
  if |L| + |E| ≥ k then return m
  return ParSelect(k − |L| − |E|, G)
Parallel k-Select (Running Time)
Adding it all up we get:
T_ParSelect(n) = c_1 · log n + c_2 · n^x + T_ParSelect(n^{1−x}) + T_ParSelect(3n/4).
T_ParSelect(n) = O(n^x) with P_ParSelect(n) = O(n^{1−x}).
Eff_ParSelect(n) = O(n) / (O(n^x) · O(n^{1−x})) = O(1).
Sequential Merging
Input:
A = (a_1, a_2, ..., a_r) and B = (b_1, b_2, ..., b_s), two sorted sequences.
Output:
C = (c_1, c_2, ..., c_n), the sorted merge of A and B with n = r + s.
Program: Merge
  i := 1; j := 1; n := r + s
  for k := 1 to n do
    if a_i < b_j
      then c_k := a_i; i := i + 1
      else c_k := b_j; j := j + 1
The algorithm does not care about the special cases (running past the end of A or B).
Running time: at most r + s comparisons, i.e. O(n).
The lower bound on the number of comparisons is r + s, i.e. Ω(n).
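A transcription with ∞ sentinels appended, so the comparison is always defined even after one sequence is exhausted; an illustrative sketch:

```python
# Sketch: the Merge program; a sentinel "infinity" at the end of each
# sequence handles the special cases the pseudocode ignores.
def merge(A, B):
    INF = float("inf")
    a = list(A) + [INF]
    b = list(B) + [INF]
    C = []
    i = j = 0
    for _ in range(len(A) + len(B)):    # n = r + s steps
        if a[i] < b[j]:
            C.append(a[i]); i += 1
        else:
            C.append(b[j]); j += 1
    return C

print(merge([1, 3, 5], [2, 4]))         # [1, 2, 3, 4, 5]
```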
Idea for Parallel Merging (CREW)
[figure: two sorted sequences; border lines connect each block boundary of A to its position in B]
The border lines may not intersect each other.
Thus we may separate the two sequences into disjoint blocks.
Let A_i be the i-th block, of size ⌈r/p⌉.
Let B̂_i be the block in B which should be merged with A_i.
Thus we may use a PRAM easily (in this case).
Idea for Parallel Merging (CREW)
[figure: blocks of A and B together with the matching blocks in the other sequence]
Let A_i [resp. B_i] be the i-th block, of size ⌈r/p⌉ [resp. ⌈s/p⌉].
Let B̂_i [resp. Â_i] be the block in B [resp. A] which should be merged with A_i [resp. B_i].
P_i takes care of A_i and B̂_i if |B̂_i| ≤ ⌈r/p⌉.
Let C be those elements which some P_j already takes care of.
P_i takes care of A_i \ C and B̂_i \ C.
Parallel Merging (CREW)
1 Use P(n) processors.
2 Each processor P_i computes for A [B] its part of size r/P(n) [s/P(n)].
3 Each processor P_i computes the part of B [A] which should be merged with its A-block [B-block].
4 Each processor computes its A- or B-block for which only it is responsible.
5 This block has size O(n/P(n)).
6 Each processor merges its block into the resulting sequence.
7 Time: O(log n + n/P(n)).
8 Efficiency: n / (O(P(n)) · O(log n + n/P(n))).
9 The efficiency is 1 for P(n) ≤ n/log n.
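A sequential simulation of the block scheme, as an illustrative sketch: binary search finds where each A-block boundary falls in B, and since the border lines do not cross, the resulting pieces of B are disjoint:

```python
# Sketch: CREW-style parallel merging, simulated sequentially.
# Processor i takes the i-th block of A and the piece of B that
# falls between consecutive block boundaries of A.
import bisect, math

def parallel_merge(A, B, p):
    size = math.ceil(len(A) / p)
    starts = [A[i * size] for i in range(p) if i * size < len(A)]
    # cut B at the block boundaries of A
    cuts = [bisect.bisect_left(B, s) for s in starts] + [len(B)]
    C = B[:cuts[0]]                       # B-elements before the first block
    for i in range(len(starts)):          # each processor, "in parallel"
        A_i = A[i * size:(i + 1) * size]
        B_i = B[cuts[i]:cuts[i + 1]]
        C.extend(sorted(A_i + B_i))       # local merge of the block pair
    return C

print(parallel_merge([1, 4, 7, 10], [2, 3, 8, 9, 20], 2))
# [1, 2, 3, 4, 7, 8, 9, 10, 20]
```

The local `sorted` stands in for the sequential merge of the two blocks, each of size O(n/P(n)).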
Idea for Merging (EREW)
[figure: the two sorted sequences are split recursively at the common median into pairs of blocks of the same size]
Do some splitting into pairs of blocks of the same size.
Split recursively into pairs of blocks of the same size.
Thus we may avoid read conflicts.
Merging (EREW)
1 Use P(n) processors.
2 Compute the median m of the sequences A and B together.
3 Split the sequences A and B into two sub-sequences each, of the "same" size (the sizes differ by at most 1).
4 Continue recursively until all sub-sequences are smaller than n/P(n).
5 Do the merging in the same way as before.
Remaining problem: find the median of two sorted sequences.
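One way to solve this sub-problem is sketched below; this is an illustration, not necessarily the slides' construction: it assumes integer values and binary-searches on the value, counting the elements ≤ v with two binary searches per step:

```python
# Sketch: the (lower) median of the union of two sorted integer
# sequences, found by binary search on the value range.
import bisect

def median_of_two(A, B):
    n = len(A) + len(B)
    k = (n + 1) // 2                    # position of the (lower) median
    lo, hi = min(A[0], B[0]), max(A[-1], B[-1])
    while lo < hi:
        mid = (lo + hi) // 2
        # how many elements of A and B are <= mid?
        if bisect.bisect_right(A, mid) + bisect.bisect_right(B, mid) >= k:
            hi = mid
        else:
            lo = mid + 1
    return lo

print(median_of_two([1, 3, 5], [2, 4, 6]))   # 3
```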
Example for the Median of two Sorted Sequences
[figure: two sorted sequences with their medians a and b marked]
The sequences A and B are sorted.
Compute the median a of A and the median b of B.