Dipl.-Inf., Dipl.-Ing. (FH) Michael Wilhelm

(1)

Parallel Algorithms

Hochschule Harz

Department of Automation and Computer Science

mwilhelm@hs-harz.de

Room 2.202

Tel. 03943 / 659 338

(2)

Contents

1. Introduction, Literature, Motivation

2. Architecture of Parallel Computers

3. Software

4. OpenMP

5. MIMD

6. Algorithms

7. Computer Numerics

8. PS/3

(3)

Outline

Parallel Programming with the PS/3

1. Compiler / linker with gcc or g++

2. Debugger

3. Threads

4. SIMD

(4)

Architecture of the Cell Broadband Engine

Various superchips:

Cell computer from Sony, Toshiba, and IBM

General-purpose GPUs: ATI Radeon HD 2900 XT

Nvidia GeForce 8800 Ultra

Terascale processor Polaris from Intel (80 cores)

Multi-Threaded Array Processor from ClearSpeed

All of them require a master CPU

(5)

Properties

■ Multi-core / multi-thread architecture

■ Support for multiple operating systems

■ High bandwidth to main memory and peripherals

■ Flexible I/O interface

■ Resource management for real-time applications

■ 90 nm SOI technology

■ Power-saving technology

■ 64-bit processor, 4 GHz, PPC: 256 MB

■ Instruction pipeline: 21 stages

■ L1 cache: 32 KB / 32 KB

■ L2 cache: 512 KB (data and code)

■ Memory bandwidth: max. 25.6 GB/s

(6)

Parallel Programming PS/3

[Figure: block diagram. The Power PC Processor Element (PPE) and eight Synergistic Processor Elements (SPEs) are attached to the Element Interconnect Bus (EIB).]

(7)

Architecture of the Cell

(8)

Description

The Cell architecture is based on the 64-bit IBM POWER architecture, with 8 Synergistic Processing Units (SPUs) or Synergistic Processing Elements (SPEs), referred to below simply as SPE/SPU.

Each SPE consists of the execution unit SXU and a small local memory of 256 KB (Local Storage). The SXU consists of a floating-point unit and an integer unit as well as a permutation unit and a load/store unit. Each SPE has 128 registers, each 128 bits wide.

The SPEs are controlled by a 64-bit POWER processor (PPE), which comprises 2 × 32 KB of L1 cache, 32 vector registers (128 bits wide), a floating-point unit, a VMX extension for vector floating-point computation, and dual-thread SMT, which is comparable to Intel's Hyper-Threading. Under Linux, two CPUs are visible.

The POWER processor has 512 KB of L2 cache and a 21-stage in-order pipeline (instructions are not reordered). With its 8 SPEs, the Cell processor as a whole executes up to 8 instruction streams simultaneously.

(9)

Description

A system bus running at half the system clock, the Element Interconnect Bus (EIB), connects the processor units. It transfers 96 bytes per clock cycle, with the bandwidth of a single processor unit limited to 16 bytes per clock cycle.

The Cell integrates a memory controller (Memory Interface Controller, MIC) that is attached to main memory through Rambus dual-channel XDR with a width of 72 bits. This interface runs at 3.2 GHz and transfers 25.6 GB/s over its two channels.

The connection to peripherals such as the graphics card is the Flex I/O, also from Rambus. Seven 8-bit-wide links connect to the Nvidia chip (four for writing, three for reading). Each link has a transfer rate of 6.4 GB/s. In total, the interface to the peripherals provides 76.8 GB/s. The controller is attached to the internal bus, the EIB, with 2 × 16 bytes per clock cycle.

(10)

Description of the SPE block diagram

Each SPE consists of:

The execution unit SXU

- The SPU consists of a floating-point unit and an integer unit

- as well as a permutation unit and a load/store unit (channel unit).

- 2 × 8-bit-wide links each, with a transfer rate of 6.4 GB/s.

A small local memory of 256 KB (Local Storage)

In total, the SPE has 23 pipeline stages (12 × fetch, 2 × ib, 3 × decode, 2 × issue, 2 × reg access, 6 × execute, 1 × write back)

128 registers; the full register width is always addressed, but smaller data types are permitted

Data types

- 16 × 8-bit integer

- 8 × 16-bit integer

- 4 × 32-bit integer

- 4 × 32-bit floating point

- 2 × 64-bit floating point

Performance

- 32-bit floating point: four FMAC instructions per clock cycle per SPE, i.e. 8 floating-point operations per cycle

- 8 floating-point operations × 8 SPEs × 3.2 GHz = 204.8 GFLOPS; adding 25 GFLOPS gives roughly 230 GFLOPS

(11)

Further links

■ http://de.wikipedia.org/wiki/Cell_(Prozessor)

■ http://www-1.ibm.com/businesscenter/venturedevelopment/us/en/featurearticle/gcl_xmlid/8649/nav_id/emerging

■ http://www-01.ibm.com/chips/techlib/techlib.nsf/products/Cell_Broadband_Engine

■ http://www.golem.de/0608/46880.html

■ "Cell computer" as a PCI card

(12)

Parallel Programming PS/3

Example code: printing a text

#include <stdio.h>

int main(void) {
    puts("Hello Wernigerode");
    return 0;
}

Compile with:

• gcc bsp1.c -o bsp1

• -S compile only (stop after generating assembly code)

• -c compile and assemble only, do not link

• -o <file> name of the output file (after compiling, assembling, and linking)

• -O optimize

• -lname link against the library libname (e.g. -lpthread for threads)

(13)

Example code: creating a thread

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <pthread.h>

// standard signature of a thread function
void *calc(void *param) {
    intptr_t retcode = 0;
    puts("in thread");
    pthread_exit((void *) retcode);
}

int main(int argc, char *argv[]) {
    pthread_t pt1;
    if (pthread_create(&pt1, NULL, calc, NULL)) {
        fprintf(stderr, "Error creating thread 1\n");
        exit(EXIT_FAILURE);
    }
    pthread_join(pt1, NULL); // wait until the thread has finished
    return EXIT_SUCCESS;
}

(14)

Example code: creating several threads

// thread function
void *calc(void *param) {
    …
}

int main(int argc, char *argv[]) {
    pthread_t pt1, pt2;
    if (pthread_create(&pt1, NULL, calc, NULL)) {
        fprintf(stderr, "Error creating thread 1\n");
        exit(EXIT_FAILURE);
    }
    …
    pthread_join(pt1, NULL); // join, acts as a barrier
    pthread_join(pt2, NULL);
    return EXIT_SUCCESS;
}

(15)

Debugger

Program: gdb

Console-oriented

Procedure:

vi b.c // write and save the source

gcc b.c -ggdb3 -o b // compile with debug information

gdb b // start the debugger

run // start the program

- l // list: show the source around main

- r // run: start the program

- n // next: execute the next line, stepping over function calls

- s // step: execute the next line, stepping into function calls

(16)

Options

■ r // run the program to completion; output: summe: 55

■ l // list the source around main

■ l 1,30 // list lines 1 to 30

■ b 23 // set a breakpoint at line 23

■ r // run to the first breakpoint, or to completion if there is none

■ // gdb ALWAYS displays the statement that will be executed next

■ n // next statement; repeat until the end

■ s // step: next statement, entering function calls

■ print i // show the value of i

■ print s // show the value of s

■ whatis i // show the type of i

■ set variable s=0 // change the sum

■ print s // show the value of s again to check the change

(17)

Test example for debugging

#include <stdio.h>

int summe(int n) {
    int i, s;
    s = 0;
    for (i = 0; i <= n; i++) s += i;
    return s;
}

int main(int argc, char *argv[]) {
    int i, n, s;
    int f[10];
    n = 10;
    s = summe(n);
    printf("summe: %d \n", s);
    for (i = 0; i < 10; i++) {
        f[i] = i;
    }
    return 0;
}

(18)

Parallel Programming with the PS/3

First variant: multi-CPU / threads

Configuration                          Time     Lab program
Single thread on one SPU               75.3 s   pi0
Threads on six SPUs                    12.7 s   pi()
Speedup factor: 5.93

Single thread on the Power PC          9.7 s    pi1, pi
Two threads on the Power PC            6.7 s    pi2, pi
Two threads on the PPC and six SPUs    2.66 s   pi2, pi

(19)

Parallel Programming with the PS/3

Second variant: SIMD

A(0): 1 A(1): 2 A(2): 3 A(3): 4

B(0): 10 B(1): 20 B(2): 30 B(3): 40

[Figure: the element-wise sum A + B, computed for all four elements with one operation]

■ Every four consecutive elements of a float array are packed into one vector of four floats

■ The basic operations are then carried out on all four values at once

(20)

(21)

■ Vector integer arithmetic instructions (add, sub, div, mult)

■ Vector integer compare instructions

■ Vector integer rotate and shift instructions (rol, ror, shift)

■ Vector floating-point instructions

■ Vector floating-point arithmetic instructions

■ Vector floating-point rounding and conversion instructions

■ Vector floating-point compare instruction

■ Vector floating-point estimate instructions

■ Vector memory access instructions

Vector integer instructions, two vectors

(22)

d = spu_add(a, b)          Vector add; d = a + b
d = spu_addx(a, b, c)      Vector add extended
d = spu_genb(a, b)         Vector generate borrow
d = spu_genbx(a, b, c)     Vector generate borrow extended
d = spu_genc(a, b)         Vector generate carry
d = spu_gencx(a, b, c)     Vector generate carry extended
d = spu_madd(a, b, c)      Vector multiply and add; d = a*b + c
d = spu_mhhadd(a, b, c)    Vector multiply high high and add
d = spu_msub(a, b, c)      Vector multiply and subtract; d = a*b - c
d = spu_mul(a, b)          Vector multiply; d = a*b
d = spu_mulh(a, b)         Vector multiply high
d = spu_mulhh(a, b)        Vector multiply high high
d = spu_mulo(a, b)         Vector multiply odd
d = spu_mulsr(a, b)        Vector multiply and shift right
d = spu_nmadd(a, b, c)     Negative vector multiply and add
d = spu_nmsub(a, b, c)     Negative vector multiply and subtract
d = spu_re(a)              Vector floating-point reciprocal estimate
d = spu_rsqrte(a)          Vector floating-point reciprocal square root estimate
d = spu_sub(a, b)          Vector subtract; d = a - b
d = spu_subx(a, b, c)      Vector subtract extended

(23)

spu_addx: Vector Add Extended

• d = spu_addx(a, b, c)

• Each element of vector a is added to the corresponding element of vector b and to the least significant bit of the corresponding element of vector c. The result is returned in the corresponding element of vector d.

spu_genb: Vector Generate Borrow

• d = spu_genb(a, b)

• Each element of vector b is subtracted from the corresponding element of vector a. The resulting borrow out is placed in the least significant bit of the corresponding element of vector d. The remaining bits of d are set to 0.

spu_mhhadd: Vector Multiply High High and Add

• d = spu_mhhadd(a, b, c)

• Each even element of vector a is multiplied by the corresponding even element of vector b, and the 32-bit result is

(24)

spu_re: Vector Floating-Point Reciprocal Estimate

• d = spu_re(a)

• For each element of vector a, an estimate of its floating-point reciprocal is computed, and the result is returned in the corresponding element of vector d. The resulting estimate is accurate to 12 bits.

spu_rsqrte: Vector Floating-Point Reciprocal Square Root Estimate

• d = spu_rsqrte(a)

• For each element of vector a, an estimate of its floating-point reciprocal square root is computed, and the result is returned in the corresponding element of vector d. The resulting estimate is accurate to 12 bits.

(25)


(26)

Parallel Programming PS/3

PS/3 lab

■ Write an SPU program

■ Computation: d² = a² * b² + c² // Euler / determinant

Structure:

■ Declare the four dynamic arrays a, b, c, and d

■ Initialize the three arrays a, b, and c

■ 1) Plain loop with timing

■ Declare the vector variables a4, b4, c4, d4 (four floats each)

■ 2) Loop with the assignment a4 = { a[i], a[i+1], a[i+2], a[i+3] }

■ a4 = a4*a4; b4 = b4*b4; c4 = c4*c4; d4 = a4*b4 + c4

(27)

Simple example code for general vector variables:

vector float vA = {1, 2, 3, 4};
vector float vB = {3, 4, 5, 6};
vector float vC = {0, 0, 0, 0};

vC = spu_add(vA, vB);        // or: vC = vA + vB;
print_f_vector("vA", vA);
print_f_vector("vB", vB);
print_f_vector("vC", vC);    // 4, 6, 8, 10

vC = spu_mul(vA, vB);        // or: vC = vA * vB;
print_f_vector("vA", vA);

(28)

PS/3 lab: source code, plain version

#define N 10000
#define MAX 10000

for (i = 0; i < MAX; i++) {
    a[i] = (i+1) % 100;
    b[i] = (i+2) % 100;
    c[i] = (i+3) % 100;
}

for (k = 0; k < N; k++) {
    for (i = 0; i < MAX; i++) {
        d[i] = a[i]*a[i] * b[i]*b[i] + c[i]*c[i];
    }
}

Variant 1: 6.3 s

(29)

PS/3 lab: source code, memcpy

Variant 2

vector float a4, b4, c4, d4; // four single-precision floats each
for (i = 0; i < MAX; i += 4) {
    // copy from the arrays into the vector variables
    memcpy(&a4, &a[i], 4*sizeof(float));
    memcpy(&b4, &b[i], 4*sizeof(float));
    memcpy(&c4, &c[i], 4*sizeof(float));
    a4 = a4*a4; b4 = b4*b4; c4 = c4*c4;
    d4 = a4*b4 + c4;
    memcpy(&d[i], &d4, 4*sizeof(float));
}

Variant 1: 6.3 s
Variant 2: 10.9 s
Variant 3: 4.5 s

(30)

PS/3 lab: source code

Variant 3: dynamic arrays

typedef union {
    float *fVal;
    vector float *myVec;
} floatVec;

float *a;
floatVec uA;
a = malloc(…)
uA.fVal = a;

Index    0  1  2  3  4  5  6  7  8  9 10 11
*fVal   11 12 13 14 15 16 17 18 19 20 21 22

[Figure: *myVec refers to four consecutive floats at once, starting at the element fVal points to. Initially these are elements 0 to 3; after each uA.fVal += 4; it refers to the next group of four, i.e. elements 4 to 7 and then 8 to 11.]

uA.fVal += 4;

uA.fVal += 4;

(31)

PS/3 lab: source code

Variant 3: dynamic arrays

float *a, *b, *c, *d;
floatVec a4, b4, c4, d4;

a4.fVal = a; b4.fVal = b; // root references
c4.fVal = c; d4.fVal = d;

for (i = 0; i < MAX; i += 4) {
    *a4.myVec = *a4.myVec * *a4.myVec;
    *b4.myVec = *b4.myVec * *b4.myVec;
    *c4.myVec = *c4.myVec * *c4.myVec;
    *d4.myVec = *a4.myVec * *b4.myVec + *c4.myVec;
    a4.fVal += 4; b4.fVal += 4; // increment
    c4.fVal += 4; d4.fVal += 4;
}

Variant 1: 7.96 s
Variant 2: 2.05 s

(32)

PS/3 lab: source code

Variant 4: dynamic arrays, SPU function

float *a, *b, *c, *d;
floatVec a4a, b4a, c4a, d4a;

a4a.fVal = a; b4a.fVal = b; // root references
c4a.fVal = c; d4a.fVal = d;

for (i = 0; i < MAX; i += 4) {
    *a4a.myVec = *a4a.myVec * *a4a.myVec;
    *b4a.myVec = *b4a.myVec * *b4a.myVec;
    *c4a.myVec = *c4a.myVec * *c4a.myVec;
    // spu_funktion: placeholder name for an SPU intrinsic
    *d4a.myVec = spu_funktion(*a4a.myVec, *b4a.myVec, *c4a.myVec);
    a4a.fVal += 4; b4a.fVal += 4; // increment
    c4a.fVal += 4; d4a.fVal += 4;
}

Variant 1: 7.96 s
Variant 2: 2.05 s
Variant 3: 2.04 s

(33)

PS/3 lab: source code

Variant 5:

• dynamic arrays

• SPU function

• better loop

Variant 1: 7.96 s
Variant 2: 2.05 s
Variant 3: 2.04 s

(34)

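[Figure: build flow. spu-gcc turns sum_spu.c via sum_spu.o into the SPU executable sum_spu; ppu-embedspu wraps it as sum_spu_csf.o; ppu-gcc compiles sum.c to sum.o and links it with sum_spu_csf.o into the program sum.]

• spu-gcc sum_spu.c -o sum_spu

• ppu-embedspu sum_distance_handle sum_spu sum_spu_csf.o

• ppu-gcc sum.c sum_spu_csf.o -lspe -o sum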

(35)

[Figure: the build flow for sum_spu.c and sum.c, shown here with spu-g++ and ppu-embedspu.]

• spu-gcc sum_spu.c -o sum_spu

(36)

sum.c

#include <stdio.h>
#include <libspe.h>
#include "sum.h"

extern spe_program_handle_t sum_distance_handle;

#define SPU_THREADS 2

int main(void) {
    int i;
    dma_packet cb[SPU_THREADS] __attribute__((aligned(16))); // alignment is important
    speid_t spe_ids[SPU_THREADS]; // type provided by IBM's library
    long long summe;

    for (i = 0; i < SPU_THREADS; i++) {
        cb[i].id = i; // fill the control block (struct)
        cb[i].anz = MAX;
        cb[i].summe = 0;
        spe_ids[i] = spe_create_thread(0, &sum_distance_handle, &cb[i], NULL, -1, 0);
    } // for

    for (i = 0; i < SPU_THREADS; i++) {
        spe_wait(spe_ids[i], NULL, 0);
    }

    summe = 0;
    for (i = 0; i < SPU_THREADS; i++) {
        summe += cb[i].summe;
        printf(" id: %d sum: %lld\n", i, cb[i].summe);
    }
    printf("Total sum: %lld\n", summe);
    return 0;
}

sum.h

typedef struct {
    int id;
    int anz;
    long long summe;
} dma_packet;

(37)