Theory of Parallel and Distributed Systems (WS 2016/17)
Chapter 1
First Algorithms for the PRAM
Walter Unger
Lehrstuhl für Informatik 1
12:00, 30 January 2017
Contents
1 Motivation and History: Systolic Arrays and Vector Computers, Transputer, Parallel Computers, PRAM
2 PRAM Introduction: Definition, Or, Sum, Matrices, Prefix Sum, Maximum, Identify Root, Situation
3 Efficiency: Definition, Overview
4 Selection: Idea for the k-th Element, Examples, Algorithm and Running Time
5 Merging: Sequential Merging, Parallel Merging
Motivation
1 There are limits to the computing power of a single computer.
2 Computers become cheaper.
3 Specialized computers are expensive.
4 There are tasks with large amounts of data.
5 Many problems are very complex:
1 Weather and other simulations
2 Crash tests
3 Military applications
4 Large data (SETI, ...)
5 More similar problems
6 Thus there is a need for computers with more than one CPU
7 or a quantum computer?
Pipeline (Systolic Array)
P1 P2 P3 P4 P5 P6 P7 P8 P9
input → output
There is a sequence of processors (P_i), 1 ≤ i ≤ n.
Processor P_1 receives the input.
The output of P_1 is passed as the input of P_2.
The output of P_i is passed as the input of P_{i+1}, 1 ≤ i < n.
Processor P_n delivers the final output.
The processors may be different.
The processors may run different programs.
Intermediate outputs may be buffered.
Pipelining is one important type of parallel system (in practice).
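The stage-by-stage structure above can be sketched in software. The following is an illustrative simulation, not part of the slides; the concrete stage functions are made up for the example:

```python
# Illustrative sketch: a software pipeline P1 -> P2 -> ... -> Pn,
# modelled with Python generators. Each stage transforms the stream
# and passes its output on as the input of the next stage.
def stage(f, stream):
    for item in stream:
        yield f(item)

def pipeline(stages, inputs):
    stream = iter(inputs)      # P1 receives the input
    for f in stages:           # output of P_i feeds P_{i+1}
        stream = stage(f, stream)
    return list(stream)        # Pn delivers the final output

# The stages may run different programs:
result = pipeline([lambda x: x + 1, lambda x: 2 * x, str], [1, 2, 3])
print(result)  # ['4', '6', '8']
```

While one item is in a late stage, the next items already occupy the earlier stages; that overlap is the source of the speed-up in practice.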
Systolic Arrays
[figure: a row of identical processors; input streams enter from several sides and output streams leave]
Idea: use more than one data stream.
The data streams may intersect each other.
Each processor is the same.
There is a global synchronisation.
The processors may run simple programs.
Advantage: really fast (for special applications).
Systolic Array with three data streams
[figure: a two-dimensional systolic array; three data streams of inputs enter from different sides and the corresponding outputs leave on the opposite sides]
Vector Computer
A vector of processors.
Each processor has different data.
But each processor executes the same program.
Addition of two vectors:
1 Read vector A
2 Read vector B
3 Add (each processor)
4 Output the sum
Single Instruction, Multiple Data: SIMD computer.
Aim: Multiple Instruction, Multiple Data: MIMD computer.
I.e. fast processors with fast connections.
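The four-step vector addition can be illustrated as one common instruction applied lane by lane; a minimal sketch (not from the slides):

```python
# Illustrative sketch of a SIMD step: conceptually, lane i holds
# A[i] and B[i]; all lanes execute the same instruction (add) on
# different data in a single parallel step.
def simd_add(A, B):
    return [a + b for a, b in zip(A, B)]   # one instruction, many data

print(simd_add([1, 2, 3], [10, 20, 30]))   # [11, 22, 33]
```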
Example: Transputer
[diagram: CPU and memory attached to a bus, with four links L1, L2, L3, L4]
Advantage: very flexible, any fixed network of degree 4 is possible.
Disadvantage: long wires may be necessary, and only a fixed network is possible.
Example: Transputer II
[diagram: CPU and memory attached to a bus; the links are connected through a switch]
Parallel Computers I
[diagram: CPU and memory attached to a bus, with link and switch]
Advantage: "normal" CPUs.
Advantage: fast links are possible.
Advantage: no special hardware.
Advantage: a variable network, which may change during execution.
Advantage: very large networks may be possible.
Disadvantage: still a limited degree for the network.
Disadvantage: large networks are complicated.
Problem: cooling large systems.
Problem: fault tolerance.
Problem: constructing such a system.
Problem: generating good data throughput with a constant-degree network.
Problem: do the program structures fit the structure of the network?
Parallel Computers II (Goodput)
[diagram: CPU and memory attached to a bus, with link and switch]
Look for good networks:
Trees, Grids, Pyramids, ...
HQ(n), CCC(n), BF(n), SE(n), DB(n), ...
Pancake Network and Burned Pancake Network.
Problem: physical placement of the processors.
Problem: length of the wires.
Problem: does the network have a nice structure?
If the network becomes too large, we may lose efficiency.
Solution: choose a mixed network structure.
Parallel Computers III (Network)
[diagram: CPUs with memory connected via links and switches]
Parallel Computers IV (Network)
[diagram: a larger network of CPUs, links and switches]
Parallel Computers V (Network)
Two ways to organize CPUs and memory:
1 CPU and memory are one logical unit:
a network connects nodes, each consisting of a CPU with its RAM.
2 CPUs and memory are connected by a network:
the network sits between the CPUs and the RAM modules.
The difference is more on the practical side.
PRAM (theoretical model)
[diagram: processors P1–P8 above memory cells M1–M8]
Ignore/unify the costs of each computation step.
Ignore/unify the costs of each communication step.
Definition: RAM
RAM: Random Access Machine. The CPU may access any memory cell; the memory is unlimited.
Complexity measurements:
uniform: each operation costs one unit.
logarithmic: costs are measured according to the size of the numbers.
Idea of the PRAM
[diagram: global control above processors P0–P8, which share a common memory]
Many processors. A common program.
The program may select single processors. A common memory.
Definition: PRAM
Consists of processors P_i with 1 ≤ i ≤ p (processor P_i has id i).
Consists of registers R_j with 1 ≤ j ≤ m.
Each processor has some local registers.
Each processor P_i may access each register R_j.
Each processor executes the same program.
The program is synchronized, thus each processor executes the same instructions.
A selection is possible by using the processor id.
The input of length n is written to the registers R_j with 1 ≤ j ≤ n.
The output is placed in some known registers.
The registers contain words (numbers) in the uniform cost measurement.
The registers contain bits in the logarithmic cost measurement.
Definition: PRAM (continued)
The following instructions are possible:
1 Processor P_i reads register R_j: R_j → P_i(x).
2 Processor P_i writes the value of x into register R_j: P_i(x) → R_j.
3 A processor may do some local computation using its local registers, e.g. x := y * 5.
For the access to the registers we have the following variations:
EREW: Exclusive Read, Exclusive Write
CREW: Concurrent Read, Exclusive Write
CRCW: Concurrent Read, Concurrent Write
ERCW: Exclusive Read, Concurrent Write
Write conflicts may be resolved using the following rules:
Arbitrary: some one processor gets access to the register.
Common: all processors writing to the same register have to write the same value.
Priority: the processor with the smallest id gets access to the register.
Computation of an "Or" (Idea)
Input bits: 0, 1, 0, 0, 1, 0, 0, 1
0 ∨ 1 ∨ 0 ∨ 0 ∨ 1 ∨ 0 ∨ 0 ∨ 1 → 1
Computing an "Or"
Task: Compute x = x_1 ∨ x_2 ∨ ... ∨ x_n.
Input: x_i is in register R_i (1 ≤ i ≤ n).
Output: computed in R_{n+1}.
Model: CRCW (Arbitrary, Common or Priority).
Program: Or
  for all P_i with 1 ≤ i ≤ n do in parallel
    R_i → P_i(x)
    if x = true then P_i(x) → R_{n+1}
Running time: O(1) (exactly 2 steps).
Number of processors: n.
Memory: n + 1.
Also possible models: ERCW (Arbitrary, Common or Priority).
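A step-synchronous simulation of this program may help; this is an illustrative sketch, where R[n] plays the role of R_{n+1} and the Common rule makes the concurrent write legal because every writer writes the same value:

```python
# Sketch: simulate the CRCW program "Or". All processors read in
# step 1; in step 2 every processor whose bit is true writes true
# into the result register (Common rule: all writers agree).
def crcw_or(xs):
    n = len(xs)
    R = list(xs) + [False]              # R_1..R_n = input, R_{n+1} = false
    local = [R[i] for i in range(n)]    # step 1: R_i -> P_i(x)
    for i in range(n):                  # step 2, "in parallel"
        if local[i]:
            R[n] = True                 # concurrent write of the same value
    return R[n]

print(crcw_or([False, True, False]))    # True
```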
Computing an "Or" (EREW)
Problem:
no two processors may write to the same register at the same time.
Idea: combine the results pairwise.
With this idea, computing the sum is also possible.
Thus computing the "Or" is just a special case of computing a sum.
Computing the Sum (Idea)
S1 S2 S3 S4 S5 S6 S7 S8
S1..2 S3..4 S5..6 S7..8
S1..4 S5..8
S1..8
Computing the Sum (Idea)
[figure: four processors P1–P4 add pairs of the input values and then combine the partial sums pairwise until the total sum remains]
Computing the Sum (EREW)
Assume w.l.o.g. n = 2^k for k ∈ N.
Task: compute x = x_1 + x_2 + ... + x_n with n = 2^k.
Input: x_i is in register R_i (1 ≤ i ≤ n).
Output: should be in R_1 (the input may be overwritten).
Model: EREW.
Program: Sum
  for all P_i with 1 ≤ i ≤ n/2 do in parallel
    R_{2i−1} → P_i(x)
    for j = 1 to k do
      if (i−1) ≡ 0 (mod 2^{j−1}) then
        R_{2i−1+2^{j−1}} → P_i(y)
        x := x + y
        P_i(x) → R_{2i−1}
Running time: O(k) = O(log n) (precisely 3k + 1 steps).
Number of processors: n/2.
Size of memory: n.
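A sequential simulation of the rounds of this program, as an illustrative sketch (0-based indices: processor p holds register R[2p]):

```python
# Sketch: simulate the EREW program "Sum" for n = 2^k. In round j,
# the active processors read a register 2^{j-1} positions to the
# right of their own; all reads and writes of a round touch
# pairwise distinct registers (EREW).
def erew_sum(values):
    R = list(values)
    n = len(R)                       # assumed: n = 2^k
    k = n.bit_length() - 1
    x = [R[2 * p] for p in range(n // 2)]   # P_p loads R[2p]
    for j in range(1, k + 1):
        d = 2 ** (j - 1)
        for p in range(n // 2):      # "in parallel"
            if p % d == 0:
                x[p] += R[2 * p + d] # exclusive read
                R[2 * p] = x[p]      # exclusive write
    return R[0]                      # the sum ends up in R_1

print(erew_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```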
Addition of Matrices
Assume w.l.o.g. n = 2^k for k ∈ N.
Let A, B be two (n × n) matrices.
The sum A + B is computable with n^2 processors in time O(1) on an EREW PRAM.
R_1 to R_{n^2} contain A (one row after the other).
R_{1+n^2} to R_{2n^2} contain B (one row after the other).
The result is placed in R_{1+2n^2} to R_{3n^2}.
Program: MatSum
  for all P_i with 1 ≤ i ≤ n^2 do in parallel
    R_i → P_i(x)
    R_{i+n^2} → P_i(y)
    x := x + y
    P_i(x) → R_{i+2n^2}
Running time: O(1).
Number of processors: O(n^2).
Size of memory: O(n^2).
Multiplication of Matrices
Assume w.l.o.g. n = 2^k for k ∈ N.
Let A, B be two (n × n) matrices.
R_1 to R_{n^2} contain A (one row after the other).
R_{1+n^2} to R_{2n^2} contain B (one row after the other).
The result is placed in R_{1+2n^2} to R_{3n^2}.
Register A_{i,j} = R_{(i−1)n+j} (1 ≤ i,j ≤ n).
Register B_{i,j} = R_{(i−1)n+j+n^2} (1 ≤ i,j ≤ n).
Register C_{i,j} = R_{(i−1)n+j+2n^2} (1 ≤ i,j ≤ n).
Processor P_{i,j} = P_{(i−1)n+j} (1 ≤ i,j ≤ n).
Use the above notation to simplify the algorithm.
Each processor has to do some hidden local computation to implement the above expressions.
Multiplication of Matrices
A_{i,j} = R_{(i−1)n+j}, B_{i,j} = R_{(i−1)n+j+n^2}, C_{i,j} = R_{(i−1)n+j+2n^2}, P_{i,j} = P_{(i−1)n+j}
Let A, B be two (n × n) matrices.
The product A · B is computable with n^2 processors in time O(n) on a CREW PRAM.
Program: MatrProd 1
  for all P_{i,j} with 1 ≤ i,j ≤ n do in parallel
    h := 0
    for l = 1 to n do
      A_{i,l} → P_{i,j}(a)
      B_{l,j} → P_{i,j}(b)
      h := h + a · b
    P_{i,j}(h) → C_{i,j}
Running time: O(n).
Number of processors: O(n^2).
Size of memory: O(n^2).
Multiplication of Matrices
A_{i,j} = R_{(i−1)n+j}, B_{i,j} = R_{(i−1)n+j+n^2}, C_{i,j} = R_{(i−1)n+j+2n^2}, P_{i,j} = P_{(i−1)n+j}
Let A, B be two (n × n) matrices.
The product A · B is computable with n^2 processors in time O(n) on an EREW PRAM.
Program: MatrProd 2
  for all P_{i,j} with 1 ≤ i,j ≤ n do in parallel
    h := 0
    for l = 1 to n do
      A_{i,l'} → P_{i,j}(a) with l' = ((i + j + l) mod n) + 1
      B_{l',j} → P_{i,j}(b)
      h := h + a · b
    P_{i,j}(h) → C_{i,j}
Each processor starts its loop at a different offset l', so in every step no two processors read the same register; the sum h is unchanged, only the order of the additions differs.
Running time: O(n).
Number of processors: O(n^2).
Size of memory: O(n^2).
Compute the Prefix Sum
Problem:
Task: Compute s_i = x_1 + ... + x_i for 1 ≤ i ≤ n.
Input: x_j is in register R_j (1 ≤ j ≤ n).
Output: s_i should be in register R_i for 1 ≤ i ≤ n.
Computing the Prefix Sum (Idea)
[figure: in round j, every position i with i > 2^{j−1} adds the partial sum from position i − 2^{j−1}; after log n rounds every position i holds x_{1..i}]
Done!
Computing the Prefix Sum
Task: Compute s_i = x_1 + ... + x_i for 1 ≤ i ≤ n.
Input: x_j is in register R_j (1 ≤ j ≤ n).
Output: s_i should be in register R_i for 1 ≤ i ≤ n.
Model: EREW.
Program: PrefixSum
  for all P_i with 1 ≤ i ≤ n do in parallel
    R_i → P_i(x)
    for j = 1 to k do (with k = ⌈log n⌉)
      if i > 2^{j−1} then
        R_{i−2^{j−1}} → P_i(y)
        x := x + y
      P_i(x) → R_i
Running time: O(k) = O(log n) (precisely 3k + 1 steps).
Number of processors: n.
Size of memory: n.
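A round-by-round simulation of this program, as an illustrative sketch; the snapshot of all reads models the synchronous step in which all reads happen before all writes:

```python
# Sketch: simulate the EREW prefix-sum program. In round j, every
# position i with i >= 2^{j-1} (0-based) adds the value stored
# 2^{j-1} positions to its left; after ceil(log2 n) rounds R[i]
# holds x_1 + ... + x_{i+1}.
def erew_prefix_sum(values):
    R = list(values)
    n = len(R)
    k = max(1, (n - 1).bit_length())        # number of rounds
    x = list(R)                             # P_i loads R_i
    for j in range(1, k + 1):
        d = 2 ** (j - 1)
        y = [R[i - d] if i >= d else 0 for i in range(n)]  # all reads first
        for i in range(n):                  # then all writes
            if i >= d:
                x[i] += y[i]
            R[i] = x[i]
    return R

print(erew_prefix_sum([1, 2, 3, 4]))        # [1, 3, 6, 10]
```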
Compute the Maximum
Task: Compute m = max_{j=1..n} x_j with n = 2^k.
Input: x_j is in register R_j (1 ≤ j ≤ n).
Output: m should be in register R_{n+1}.
Possible with n processors in time O(log n) using an EREW PRAM.
Question: could it be done faster (e.g. on a CRCW PRAM)?
The maximum is larger than or equal to all other values.
Idea: compare all pairs of numbers.
The maximum will always win.
Compute the Maximum (Idea)
[figure: the 16 input values 34, 12, 14, 56, 23, 67, 49, 27, 61, 52, 57, 59, 26, 41, 33, 22 are compared in all pairs; a 0/1 matrix records for every value whether it has lost a comparison]
Compute the Maximum (Idea)
[figure: after all pairwise comparisons, only the maximum 67 has not lost any comparison; its flag remains 1 while all other flags are 0]
Computing the Maximum
Task: Compute m = max_{j=1..n} x_j with n = 2^k.
Input: x_j is in register R_j (1 ≤ j ≤ n).
Output: m in register R_{n+1}.
Model: CRCW.
Program: Maximum
  for all P_{i,1} with 1 ≤ i ≤ n do in parallel
    P_{i,1}(1) → W_i
  for all P_{i,j} with 1 ≤ i,j ≤ n do in parallel
    R_i → P_{i,j}(a)
    R_j → P_{i,j}(b)
    if a < b then P_{i,j}(0) → W_i
  for all P_{i,1} with 1 ≤ i ≤ n do in parallel
    W_i → P_{i,1}(h)
    if h = 1 then
      R_i → P_{i,1}(h)
      P_{i,1}(h) → R_{n+1}
Computing the Maximum (Costs)
For the program Maximum above:
Running time: O(1).
Number of processors: O(n^2).
Memory: O(n).
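A simulation of the program with the n^2 comparisons written as nested loops; an illustrative sketch, where W is the list of flag registers and the Common rule applies since every losing write is a 0:

```python
# Sketch: simulate the CRCW program "Maximum". Processor P_{i,j}
# compares R_i with R_j; whenever R_i < R_j it clears the flag W_i.
# Only maxima keep W_i = 1, and one of them writes the result.
def crcw_max(R):
    n = len(R)
    W = [1] * n                      # P_{i,1} writes 1 into W_i
    for i in range(n):               # all P_{i,j} "in parallel"
        for j in range(n):
            if R[i] < R[j]:          # concurrent reads of R_i, R_j
                W[i] = 0             # concurrent write (Common: all write 0)
    result = None
    for i in range(n):               # a P_{i,1} with W_i = 1 writes R_{n+1}
        if W[i] == 1:
            result = R[i]
    return result

print(crcw_max([3, 7, 2, 7]))        # 7
```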
Identify the Roots of a Forest
The nodes are identified by numbers from 1 to n.
Input: the father of node i is written in register R_i.
For a root i we have: register R_i contains i itself.
Program: Ranking
  for all P_i with 1 ≤ i ≤ n do in parallel
    for j = 1 to ⌈log n⌉ do
      R_i → P_i(h)
      R_h → P_i(h)
      P_i(h) → R_i
Running time: O(log n).
Number of processors: O(n).
Memory: O(n).
Model: CREW.
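The pointer-jumping rounds can be simulated directly; an illustrative sketch with 0-based node ids, where parent[i] == i marks a root:

```python
# Sketch: simulate the CREW program "Ranking" by pointer jumping.
# In every round each node replaces its pointer by the pointer of
# its current father, doubling the distance jumped per round.
import math

def find_roots(parent):
    R = list(parent)
    n = len(R)
    rounds = max(1, math.ceil(math.log2(n))) if n > 1 else 1
    for _ in range(rounds):
        R = [R[R[i]] for i in range(n)]   # concurrent reads of R[h]
    return R

print(find_roots([0, 0, 1, 2]))           # chain 3 -> 2 -> 1 -> 0: [0, 0, 0, 0]
```

After r rounds each pointer spans a distance of up to 2^r, so ⌈log n⌉ rounds reach the root from every node.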
Short Summary
Problem     processors     memory  time        model
Or          O(n/t)         O(n)    O(t)        ERCW
Or          O(n/log n)     O(n)    O(log n)    EREW
Maximum     O(n^2/t)       O(n)    O(t)        CRCW
Sum         O(n/log n)     O(n)    O(log n)    EREW
Ranking     O(n/log n)     O(n)    O(log n)    CREW
Prefixsum   O(n/log n)     O(n)    O(log n)    EREW
Mat.sum     O(n^2/log n)   O(n^2)  O(log n)    EREW
Mat.prod.   O(n^2/log n)   O(n^2)  O(n·log n)  CREW
Mat.prod.   O(n^3/log n)   O(n^2)  O(log n)    CREW
Mat.prod.   O(n^3/log n)   O(n^2)  O(log n)    EREW
Question: May we save some processors?
May we do this saving in any situation?
How do we estimate the efficiency of a parallel algorithm?
Cost Measurement
Let A be any parallel algorithm; we denote:
T_A(n): the running time of A.
P_A(n): the number of processors used by A.
R_A(n): the number of registers used by A.
W_A(n): the number of accesses to registers done by A.
ST(n): the running time of the best [known] sequential algorithm.
Eff_A(n) := ST(n) / (P_A(n) · T_A(n)): the efficiency of A.
AEff_A(n) := W_A(n) / (P_A(n) · T_A(n)): the usage efficiency of A.
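As a numeric illustration of these definitions (a sketch; the values ST(n) = n, P(n) = n/log n, T(n) = log n, W(n) = n are those of the EREW sum algorithm from this chapter):

```python
# Sketch: the two efficiency measures as plain formulas, evaluated
# for the EREW sum algorithm with P(n) = n/log n processors.
import math

def efficiency(st, p, t):           # Eff = ST(n) / (P(n) * T(n))
    return st / (p * t)

def usage_efficiency(w, p, t):      # AEff = W(n) / (P(n) * T(n))
    return w / (p * t)

n = 65536                           # power of two, so log2 is exact
lg = math.log2(n)                   # 16.0
print(efficiency(n, n / lg, lg))        # 1.0 -> no processor is wasted
print(usage_efficiency(n, n / lg, lg))  # 1.0
```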
Efficiency
Problem     processors     time        W(n)    AEff  model
Or          O(n/t)         O(t)        O(n)    1     ERCW
Or          O(n/log n)     O(log n)    O(n)    1     EREW
Maximum     O(n^2/t)       O(t)        O(n^2)  1     CRCW
Sum         O(n/log n)     O(log n)    O(n)    1     EREW
Ranking     O(n/log n)     O(log n)    O(n)    1     CREW
Prefixsum   O(n/log n)     O(log n)    O(n)    1     EREW
Mat.sum     O(n^2/log n)   O(log n)    O(n^2)  1     EREW
Mat.prod.   O(n^2/log n)   O(n·log n)  O(n^3)  1     CREW
Mat.prod.   O(n^3/log n)   O(log n)    O(n^3)  1     CREW
Mat.prod.   O(n^3/log n)   O(log n)    O(n^3)  1     EREW
Efficiency
Problem     processors     time        ST(n)       Eff          model
Or          O(n/t)         O(t)        O(n)        1            ERCW
Or          O(n/log n)     O(log n)    O(n)        1            EREW
Maximum     O(n^2/t)       O(t)        O(n)        O(1/n)       CRCW
Sum         O(n/log n)     O(log n)    O(n)        1            EREW
Ranking     O(n/log n)     O(log n)    O(n)        1            CREW
Prefixsum   O(n/log n)     O(log n)    O(n)        1            EREW
Mat.sum     O(n^2/log n)   O(log n)    O(n^2)      1            EREW
Mat.prod.   O(n^2/log n)   O(n·log n)  O(n^2.376)  O(n^−0.624)  CREW
Mat.prod.   O(n^3/log n)   O(log n)    O(n^2.376)  O(n^−0.624)  CREW
Mat.prod.   O(n^3/log n)   O(log n)    O(n^2.376)  O(n^−0.624)  EREW
k-th Element
Task: Compute the k-th (k-smallest) element of an unsorted sequence S = {s_1, ..., s_n}.
Lower bound: n − 1 comparisons.
Start with a nice sequential algorithm.
Program: Select(k, S)
  if |S| ≤ 50 then return the k-th number in S
  Split S into ⌈n/5⌉ sub-sequences H_i of size ≤ 5
  Sort each H_i
  Let M be the sequence of the middle elements of the H_i
  m := Select(⌈|M|/2⌉, M)
  S1 := {s ∈ S | s < m}
  S2 := {s ∈ S | s = m}
  S3 := {s ∈ S | s > m}
  if |S1| ≥ k then return Select(k, S1)
  if |S1| + |S2| ≥ k then return m
  return Select(k − |S1| − |S2|, S3)
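A direct Python transcription of Select(k, S), with k 1-based; the cut-off 50 and the groups of five are as above:

```python
# Sketch: sequential Select(k, S), the median-of-medians algorithm.
def select(k, S):
    if len(S) <= 50:
        return sorted(S)[k - 1]          # k-th number in S
    groups = [sorted(S[i:i + 5]) for i in range(0, len(S), 5)]
    M = [g[(len(g) - 1) // 2] for g in groups]   # middle elements
    m = select((len(M) + 1) // 2, M)             # median of medians
    S1 = [s for s in S if s < m]
    S2 = [s for s in S if s == m]
    S3 = [s for s in S if s > m]
    if len(S1) >= k:
        return select(k, S1)
    if len(S1) + len(S2) >= k:
        return m
    return select(k - len(S1) - len(S2), S3)

print(select(5, list(range(200, 0, -1))))        # 5
```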
Example for the k-th Element (Slow Motion)
[figure: an input sequence of 105 numbers; M, the sequence of the group medians; and M in sorted order, from which the median of medians is taken]
Example for the k-th Element
[figure: another input sequence with its median sequence M and sorted M]
Example for the k-th Element (Worst Case)
[figure: a worst-case input sequence with its median sequence M and sorted M]
Running Time
(program Select(k, S) as on the previous slides)
For some constants c, d we get:
T(n) ≤ d · n for n ≤ 50
T(n) ≤ c · n + T(n/5) + T(3n/4) for n > 50
The recursive call on M has size n/5; since at least a quarter of the elements of S are ≤ m and at least a quarter are ≥ m, the final recursive call has size at most 3n/4.
Running Time
Claim: T(n) ≤ 20 · r · n with r = max(c, d).
Proof:
n ≤ 50:
T(n) ≤ d · n ≤ 20 · r · n.
n > 50:
T(n) ≤ c · n + T(n/5) + T(3n/4)
     ≤ c · n + 20 · r · n/5 + 20 · r · 3n/4
     = c · n + 4 · r · n + 15 · r · n ≤ 20 · r · n.
Thus the running time T(n) is in O(n).
Parallel k-Select
Input: S = {s_1, ..., s_n}.
Processors P_1, P_2, ..., P_{⌈n^{1−x}⌉}, thus P(n) = ⌈n^{1−x}⌉.
Each P_i knows n and P(n).
Each P_i works on ⌈n^x⌉ elements.
We will now create a parallel version of the program Select(k, S).
We will get a parallel recursive program.
1 Easy solution for small S.
2 Split S into small sub-sequences for the processors.
3 Compute in parallel the medians of the sub-sequences.
4 Compute in parallel and recursively the median of the medians.
5 Compute the splitting into the three sub-sequences.
6 Do the final recursion.
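The six steps can be simulated sequentially; this is an illustrative sketch where the choice x = 1/2 and the cut-off 50 are assumptions, and a library sort stands in for the per-processor Select:

```python
# Sketch: sequential simulation of the six steps of ParSelect(k, S)
# with about sqrt(n) "processors", each holding about sqrt(n) elements.
import math

def seq_select(k, S):                  # stand-in for sequential Select
    return sorted(S)[k - 1]

def par_select(k, S):
    n = len(S)
    if n <= 50:                        # step 1: easy solution for small S
        return seq_select(k, S)
    p = math.ceil(n ** 0.5)            # chunk size ~ n^x with x = 1/2
    chunks = [S[i:i + p] for i in range(0, n, p)]           # step 2
    M = [seq_select((len(c) + 1) // 2, c) for c in chunks]  # step 3
    m = par_select((len(M) + 1) // 2, M)                    # step 4
    L = [s for s in S if s < m]        # step 5 (on a PRAM, parallel
    E = [s for s in S if s == m]       # prefix sums compact L, E, G)
    G = [s for s in S if s > m]
    if len(L) >= k:                    # step 6: final recursion
        return par_select(k, L)
    if len(L) + len(E) >= k:
        return m
    return par_select(k - len(L) - len(E), G)

print(par_select(7, list(range(100, 0, -1))))               # 7
```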
Example for the k-th Element
[figure: 35 processors P1–P35, each holding 15 input values; M, the sequence of their medians, and M in sorted order, from which the median of medians is taken]
Parallel k-Select
Program: ParSelect(k, S)
1:
  if |S| ≤ k_1 then P_1 returns Select(k, S).
2:
  S is split into ⌈|S|^{1−x}⌉ sub-sequences S_i with |S_i| ≤ ⌈n^x⌉.
  P_i stores the start address of S_i.
3:
  for all P_i with 1 ≤ i ≤ ⌈n^{1−x}⌉ do in parallel
    m_i := Select(⌈|S_i|/2⌉, S_i)
    P_i(m_i) → R_i
  Assume in the following that M is the sequence of these values.
4:
  m := ParSelect(⌈|M|/2⌉, M).
5: More to come!
Parallel k-Select
Program: ParSelect(k, S), step 5
5.1:
  Distribute m via broadcast to all P_i.
  for all P_i with 1 ≤ i ≤ ⌈n^{1−x}⌉ do in parallel
    L_i := {s ∈ S_i | s < m}
    E_i := {s ∈ S_i | s = m}
    G_i := {s ∈ S_i | s > m}
5.2:
  Compute with parallel prefix:
  l_i := |L_1| + ... + |L_i| for all 1 ≤ i ≤ ⌈n^{1−x}⌉.
  e_i := |E_1| + ... + |E_i| for all 1 ≤ i ≤ ⌈n^{1−x}⌉.
  g_i := |G_1| + ... + |G_i| for all 1 ≤ i ≤ ⌈n^{1−x}⌉.
  Let l_0 = e_0 = g_0 = 0.
5.3: Even more to come!
Parallel k-Select
Program: ParSelect(k, S), steps 5 + 6
5.3:
  Compute L = {s ∈ S | s < m}, E = {s ∈ S | s = m}
  and G = {s ∈ S | s > m} as follows:
  for all P_i with 1 ≤ i ≤ ⌈n^{1−x}⌉ do in parallel
    P_i writes L_i into R_{l_{i−1}+1}, ..., R_{l_i}.
    P_i writes E_i into R_{e_{i−1}+1}, ..., R_{e_i}.
    P_i writes G_i into R_{g_{i−1}+1}, ..., R_{g_i}.
6:
  if |L| ≥ k then return ParSelect(k, L)
  if |L| + |E| ≥ k then return m
  return ParSelect(k − |L| − |E|, G)
Parallel k-Select (Running Time)
Program: ParSelect(k, S)
1: O(1)
  if |S| ≤ k_1 then P_1 returns Select(k, S).
2: O(log(|S|^{1−x})), thus O(log n)
  S is split into ⌈|S|^{1−x}⌉ sub-sequences S_i with |S_i| ≤ ⌈n^x⌉.
  P_i stores the start address of S_i.
3: O(n^x)
  for all P_i with 1 ≤ i ≤ ⌈n^{1−x}⌉ do in parallel
    m_i := Select(⌈|S_i|/2⌉, S_i)
    P_i(m_i) → R_i
  Assume in the following that M is the sequence of these values.
4: T_ParSelect(n^{1−x})
  m := ParSelect(⌈|M|/2⌉, M).
Parallel k-Select (Running Time)
Program: ParSelect(k, S), step 5
5.1a: O(log(n^{1−x}))
  Distribute m via broadcast to all P_i.
5.1b: O(|S_i|), thus O(n^x)
  for all P_i with 1 ≤ i ≤ ⌈n^{1−x}⌉ do in parallel
    L_i := {s ∈ S_i | s < m}
    E_i := {s ∈ S_i | s = m}
    G_i := {s ∈ S_i | s > m}
5.2: O(log(n^{1−x}))
  Compute with parallel prefix:
  l_i := |L_1| + ... + |L_i| for all 1 ≤ i ≤ ⌈n^{1−x}⌉.
  e_i := |E_1| + ... + |E_i| for all 1 ≤ i ≤ ⌈n^{1−x}⌉.
  g_i := |G_1| + ... + |G_i| for all 1 ≤ i ≤ ⌈n^{1−x}⌉.
  Let l_0 = e_0 = g_0 = 0.
Parallel k-Select (Running Time)
Program: ParSelect(k, S), steps 5 + 6
5.3: O(n^x)
  Compute L = {s ∈ S | s < m}, E = {s ∈ S | s = m}
  and G = {s ∈ S | s > m} as follows:
  for all P_i with 1 ≤ i ≤ ⌈n^{1−x}⌉ do in parallel
    P_i writes L_i into R_{l_{i−1}+1}, ..., R_{l_i}.
    P_i writes E_i into R_{e_{i−1}+1}, ..., R_{e_i}.
    P_i writes G_i into R_{g_{i−1}+1}, ..., R_{g_i}.
6: T_ParSelect(3n/4)
  if |L| ≥ k then return ParSelect(k, L)
  if |L| + |E| ≥ k then return m
  return ParSelect(k − |L| − |E|, G)
Parallel k-Select (Running Time)
Adding it all up we get:
T_ParSelect(n) = c_1 · log n + c_2 · n^x + T_ParSelect(n^{1−x}) + T_ParSelect(3n/4).
T_ParSelect(n) = O(n^x) with P_ParSelect(n) = O(n^{1−x}).
Eff_ParSelect(n) = O(n) / (O(n^x) · O(n^{1−x})) = O(1).
Sequential Merging
Input:
A = (a_1, a_2, ..., a_r) and B = (b_1, b_2, ..., b_s), two sorted sequences.
Output:
C = (c_1, c_2, ..., c_n), the sorted merge of A and B with n = r + s.
Program: Merge
  i := 1; j := 1; n := r + s
  for k := 1 to n do
    if a_i < b_j
      then c_k := a_i; i := i + 1
      else c_k := b_j; j := j + 1
The algorithm does not care about the special cases (running past the end of A or B).
Running time: at most r + s comparisons, i.e. O(n).
The lower bound on the number of comparisons is r + s, i.e. Ω(n).
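A transcription with ∞ sentinels appended, so the comparison is always defined even after one sequence is exhausted; an illustrative sketch:

```python
# Sketch: the Merge program; a sentinel "infinity" at the end of each
# sequence handles the special cases the pseudocode ignores.
def merge(A, B):
    INF = float("inf")
    a = list(A) + [INF]
    b = list(B) + [INF]
    C = []
    i = j = 0
    for _ in range(len(A) + len(B)):    # n = r + s steps
        if a[i] < b[j]:
            C.append(a[i]); i += 1
        else:
            C.append(b[j]); j += 1
    return C

print(merge([1, 3, 5], [2, 4]))         # [1, 2, 3, 4, 5]
```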
Idea for Parallel Merging (CREW)
[figure: two sorted sequences; border lines connect each block boundary of A to its position in B]
The border lines may not intersect each other.
Thus we may separate the two sequences into disjoint blocks.
Let A_i be the i-th block, of size ⌈r/p⌉.
Let B̂_i be the block in B which should be merged with A_i.
Thus we may use a PRAM easily (in this case).
Idea for Parallel Merging (CREW)
[figure: blocks of A and B together with the matching blocks in the other sequence]
Let A_i [resp. B_i] be the i-th block, of size ⌈r/p⌉ [resp. ⌈s/p⌉].
Let B̂_i [resp. Â_i] be the block in B [resp. A] which should be merged with A_i [resp. B_i].
P_i takes care of A_i and B̂_i if |B̂_i| ≤ ⌈r/p⌉.
Let C be those elements which some P_j already takes care of.
P_i takes care of A_i \ C and B̂_i \ C.
Parallel Merging (CREW)
1 Use P(n) processors.
2 Each processor P_i computes for A [B] its part of size r/P(n) [s/P(n)].
3 Each processor P_i computes the part of B [A] which should be merged with its A-block [B-block].
4 Each processor computes its A- or B-block for which only it is responsible.
5 This block has size O(n/P(n)).
6 Each processor merges its block into the resulting sequence.
7 Time: O(log n + n/P(n)).
8 Efficiency: n / (O(P(n)) · O(log n + n/P(n))).
9 The efficiency is 1 for P(n) ≤ n/log n.
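A sequential simulation of the block scheme, as an illustrative sketch: binary search finds where each A-block boundary falls in B, and since the border lines do not cross, the resulting pieces of B are disjoint:

```python
# Sketch: CREW-style parallel merging, simulated sequentially.
# Processor i takes the i-th block of A and the piece of B that
# falls between consecutive block boundaries of A.
import bisect, math

def parallel_merge(A, B, p):
    size = math.ceil(len(A) / p)
    starts = [A[i * size] for i in range(p) if i * size < len(A)]
    # cut B at the block boundaries of A
    cuts = [bisect.bisect_left(B, s) for s in starts] + [len(B)]
    C = B[:cuts[0]]                       # B-elements before the first block
    for i in range(len(starts)):          # each processor, "in parallel"
        A_i = A[i * size:(i + 1) * size]
        B_i = B[cuts[i]:cuts[i + 1]]
        C.extend(sorted(A_i + B_i))       # local merge of the block pair
    return C

print(parallel_merge([1, 4, 7, 10], [2, 3, 8, 9, 20], 2))
# [1, 2, 3, 4, 7, 8, 9, 10, 20]
```

The local `sorted` stands in for the sequential merge of the two blocks, each of size O(n/P(n)).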
Idea for Merging (EREW)
[figure: the two sorted sequences are split recursively at the common median into pairs of blocks of the same size]
Do some splitting into pairs of blocks of the same size.
Split recursively into pairs of blocks of the same size.
Thus we may avoid read conflicts.
Merging (EREW)
1 Use P(n) processors.
2 Compute the median m of the sequences A and B together.
3 Split the sequences A and B into two sub-sequences each, of the "same" size (the sizes differ by at most 1).
4 Continue recursively until all sub-sequences are smaller than n/P(n).
5 Do the merging in the same way as before.
Remaining problem: find the median of two sorted sequences.
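One way to solve this sub-problem is sketched below; this is an illustration, not necessarily the slides' construction: it assumes integer values and binary-searches on the value, counting the elements ≤ v with two binary searches per step:

```python
# Sketch: the (lower) median of the union of two sorted integer
# sequences, found by binary search on the value range.
import bisect

def median_of_two(A, B):
    n = len(A) + len(B)
    k = (n + 1) // 2                    # position of the (lower) median
    lo, hi = min(A[0], B[0]), max(A[-1], B[-1])
    while lo < hi:
        mid = (lo + hi) // 2
        # how many elements of A and B are <= mid?
        if bisect.bisect_right(A, mid) + bisect.bisect_right(B, mid) >= k:
            hi = mid
        else:
            lo = mid + 1
    return lo

print(median_of_two([1, 3, 5], [2, 4, 6]))   # 3
```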
Example for the Median of two Sorted Sequences
[figure: two sorted sequences with their medians a and b marked]
The sequences A and B are sorted.
Compute the median a of A and the median b of B.