Notes on the discussion of the paper:


Stefan de Lorenzo

Pearls of Database Literature


MapReduce: Simplified Data Processing on Large Clusters

by J. Dean and S. Ghemawat

Why was MapReduce developed?

In recent years and decades, the amount of data to be processed has grown exponentially. This development makes parallel processing of that data desirable. MapReduce is a programming model that tries to hide as much complexity as possible (parallelization, fault tolerance, load balancing, etc.) from the user. This makes it possible to write programs that are practically indistinguishable from those designed to run on a single processing unit, which leads to generic and manageable solutions. A further advantage of this approach is its good horizontal scalability, which makes it easy to add new nodes to the existing system.

What is the basic idea behind the MapReduce programming model, and how are the map and reduce functions defined?

The roots of MapReduce lie in functional programming, where we also find the eponymous map and reduce functions. Let us consider a simple example to illustrate how they are used in MapReduce. The goal is to determine how often a given word occurs in a document.

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

The map function emits each word together with the number of times it occurs (in this simple case “1”). Finally, the reduce function sums all emitted values that are associated with a particular word. Formally, both functions can be described as follows.


map: (k1, v1) → list(k2, v2)

reduce: (k2, list(v2)) → list(v2)
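The word-count pseudocode and the two signatures can be tried out with a small Python sketch. It is only an illustration: `run_sequential` is a hypothetical single-process driver that groups the intermediate pairs by key, which is the job the MapReduce library would do across machines.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_fn(key: str, value: str) -> Iterator[tuple[str, str]]:
    """map: (k1, v1) -> list(k2, v2). key is a document name, value its contents."""
    for word in value.split():
        yield word, "1"


def reduce_fn(key: str, values: Iterable[str]) -> Iterator[str]:
    """reduce: (k2, list(v2)) -> list(v2). Sums the counts emitted for one word."""
    yield str(sum(int(v) for v in values))


def run_sequential(documents: dict[str, str]) -> dict[str, list[str]]:
    """Hypothetical driver: apply map_fn, group by intermediate key, apply reduce_fn."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            grouped[word].append(count)
    return {word: list(reduce_fn(word, counts)) for word, counts in grouped.items()}


counts = run_sequential({"doc1": "to be or not to be", "doc2": "to do"})
print(counts["to"])  # ['3']
```

Note that the input key type (document name) and the intermediate key type (word) differ, which is why the signatures use k1 and k2.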

The map invocations are distributed across multiple machines. The input is divided into M partitions, which are processed in parallel. Later, the intermediate key space is divided into R pieces (e.g. by hash(key) mod R). R reduce tasks follow, each of which ultimately produces one output file.
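A minimal sketch of the partitioning step hash(key) mod R follows. The choice of `zlib.crc32` as the hash is an assumption made for this example (Python's built-in `hash()` for strings varies between processes), and `partition` and `split_by_partition` are hypothetical helper names.

```python
import zlib


def partition(key: str, r: int) -> int:
    """Return the reduce partition (0..r-1) for an intermediate key: hash(key) mod R.

    zlib.crc32 is an assumed stand-in for the hash function; it returns the
    same value in every process, so all workers agree on the partition.
    """
    return zlib.crc32(key.encode("utf-8")) % r


def split_by_partition(pairs, r):
    """Distribute intermediate (key, value) pairs into r buckets."""
    buckets = [[] for _ in range(r)]
    for key, value in pairs:
        buckets[partition(key, r)].append((key, value))
    return buckets


buckets = split_by_partition([("to", "1"), ("be", "1"), ("to", "1")], r=4)
```

Because the partition depends only on the key, all values for one key end up in the same bucket and therefore at the same reduce task.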

What are the steps when MapReduce is called by the user program?

1. The input file(s) are divided into M partitions. Then a corresponding number of copies of the program are started on the cluster.

2. One of these copies is special: the master. The rest are workers. There are M map tasks and R reduce tasks. The master picks idle workers and assigns each of them a task.

3. When a worker is assigned a map task, it reads the contents of the input partition assigned to it. It parses the key/value pairs and passes them to the user-defined map function.

4. The intermediate pairs are written to local disk bit by bit. The location of this data is then reported to the master.

5. After a reduce worker has been informed by the master about the location of the required files, it reads them and sorts the data by key. This is necessary because typically many different keys map to the same reduce task.

6. The reduce worker now iterates over the sorted data. For each unique key it finds, the worker passes the key and the corresponding intermediate values to the user-defined reduce function.

7. Once all map and reduce tasks are finished, the master notifies the user program.
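The data flow of the steps above can be sketched as a single-process simulation. This is an illustration under strong assumptions: `run_mapreduce` and its parameters are hypothetical names, in-memory lists stand in for the files on local disk, the input is split round-robin rather than by file size, and there is no master, no worker pool and no network.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, map_fn, reduce_fn, m, r):
    """Walk through the steps in one process; records is a list of (key, value)."""
    # Step 1: split the input into m pieces (here: round-robin over records).
    splits = [records[i::m] for i in range(m)]

    # Steps 3-4: each map task partitions its intermediate pairs into r regions.
    # The lists stand in for files on the map worker's local disk.
    regions = [[] for _ in range(r)]
    for split in splits:
        for key, value in split:
            for ikey, ivalue in map_fn(key, value):
                regions[zlib.crc32(ikey.encode("utf-8")) % r].append((ikey, ivalue))

    # Steps 5-6: each reduce task sorts its region by key, then calls reduce_fn
    # once per unique key with all values for that key.
    outputs = []
    for region in regions:
        region.sort(key=itemgetter(0))
        out = []
        for ikey, group in groupby(region, key=itemgetter(0)):
            for result in reduce_fn(ikey, (v for _, v in group)):
                out.append((ikey, result))
        outputs.append(out)  # one output "file" per reduce task
    return outputs


def wc_map(name, text):
    return [(word, "1") for word in text.split()]


def wc_reduce(word, counts):
    return [str(sum(int(c) for c in counts))]


result = run_mapreduce([("a", "x y x"), ("b", "y z")], wc_map, wc_reduce, m=2, r=3)
```

The sort in step 5 is what makes `groupby` work here: it only groups adjacent equal keys, so the region has to be ordered by key first.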

What is a “straggler”, what problem do stragglers cause, and how is it solved in MapReduce?

One reason a MapReduce operation can drag on is so-called “stragglers”. These are machines that take unusually long to complete a task assigned to them. Stragglers can arise for many reasons (e.g. because of faulty storage devices).

This problem is solved as follows: as soon as a MapReduce operation is close to completion, the master schedules backup executions of the tasks that are still in progress. A task is marked as completed when either the primary or a backup execution finishes. This significantly reduces the execution time.
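The effect of backup executions can be illustrated with a small deterministic model. All numbers and the function `completion_time` are hypothetical; the model assumes every primary execution starts at time 0 and that a backup is launched at `backup_start` for each task still running then.

```python
def completion_time(durations, backup_start=None, backup_duration=None):
    """Time at which the last task is marked completed.

    durations: seconds each primary execution would need, all started at t=0.
    If backup_start is given, every task still running at that time gets a
    backup execution that needs backup_duration seconds; the task completes
    as soon as either the primary or the backup execution finishes.
    """
    finish_times = []
    for primary in durations:
        finish = primary
        if backup_start is not None and primary > backup_start:
            finish = min(primary, backup_start + backup_duration)
        finish_times.append(finish)
    return max(finish_times)


# Hypothetical workload: four normal tasks and one straggler.
durations = [10, 11, 12, 10, 60]
without_backup = completion_time(durations)                                   # 60
with_backup = completion_time(durations, backup_start=12, backup_duration=10)  # 22
```

In this toy workload a single straggler determines the total running time, and one backup execution cuts it from 60 to 22 seconds.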


What can be said from the test results of the sort example regarding backup tasks and failures of individual machines? What can be said about machine failures?

In graph (b) we see the effect that disabling backup tasks has on the execution time. It is quite similar to (a), with the difference that in (b) relatively few read and write operations take place towards the end, which nevertheless delay the completion of the MapReduce operation considerably. After 960 seconds, all but 5 reduce tasks are finished. The last few stragglers only finish 300 seconds later. This increases the execution time by 44%.

Graph (c) shows the execution of a program in which 200 of the 1746 workers drop out shortly after the start. The cluster scheduler immediately restarts new worker processes on the affected machines. The re-execution of the tasks proceeds relatively quickly; the entire computation takes 933 seconds, 5% longer than a normal execution.
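The two percentages can be checked with a short calculation. The 933 seconds come from the notes above; the normal running time of 891 seconds and the 1283 seconds without backup tasks are figures reported in the paper and do not appear in these notes.

```python
def percent_increase(baseline_s: float, measured_s: float) -> float:
    """Relative increase of measured over baseline, in percent."""
    return (measured_s - baseline_s) / baseline_s * 100.0


NORMAL_S = 891          # normal execution, as reported in the paper
NO_BACKUP_S = 1283      # backup tasks disabled, as reported in the paper
WITH_FAILURES_S = 933   # 200 worker processes killed

no_backup_pct = percent_increase(NORMAL_S, NO_BACKUP_S)      # about 44
failures_pct = percent_increase(NORMAL_S, WITH_FAILURES_S)   # about 4.7, rounded to 5
```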

If a master failure is detected, the operation is aborted. However, the master can be configured to write checkpoints.


Pearls of Database Literature

Discussion Protocol

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

Summary by Vasker Pokhrel

Salzburg, June 10, 2015

We had four related questions:

1) What were the reasons for developing MapReduce? (by Mario)

2) Give an overview of the execution process of MapReduce. (by Mario)

3) What are the steps when MapReduce is called by the user program? (by Robert)

4) What is the basic idea behind the MapReduce programming model, and what do the map and reduce functions take and produce, respectively? (by Daniel)

Answer:

MapReduce is a programming model that allows users to express simple computations while hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library.

This abstraction, with user-specified map and reduce operations, makes it easy to parallelize large computations and to use re-execution as the primary mechanism for fault tolerance.

Programming Model

The computation takes a set of input key/value pairs, and produces a set of output key/value pairs.

Two functions: Map and Reduce.

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user’s reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.
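The point about the iterator can be sketched in Python with a generator. `value_stream` is a hypothetical stand-in for the library reading intermediate values one at a time; the reduce function never sees the whole list at once.

```python
from typing import Iterable, Iterator


def value_stream(n: int) -> Iterator[int]:
    """Hypothetical source of intermediate values, produced one at a time."""
    for _ in range(n):
        yield 1


def reduce_sum(key: str, values: Iterable[int]) -> Iterator[int]:
    """Consume values one by one; only the running total is kept in memory."""
    total = 0
    for value in values:
        total += value
    yield total


result = list(reduce_sum("word", value_stream(100_000)))
```

Since `reduce_sum` only iterates, it works the same whether the values come from a list in memory or from a stream backed by disk.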

map (k1,v1) → list(k2,v2)

reduce (k2,list(v2)) → list(v2)

Execution Overview

The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines.

Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user.

When the user program calls the MapReduce function, the following sequence of actions occurs:

1) Split the input files into M pieces of 16 to 64 MB per piece.

2) The master assigns the M map tasks and R reduce tasks to idle workers.

3) A worker that is assigned a map task parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs are buffered in memory.

On worker failure: The master pings every worker periodically. If no response is received within a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system.

4) The buffered pairs are periodically written to local disk. Their locations are passed back to the master, which is responsible for forwarding them to the reduce workers.

5) The reduce worker uses remote procedure calls to read the buffered data. Then it sorts it by the intermediate keys.

Subquestion: why are M and R much larger than the number of worker machines?

Answer: Having each worker perform many different tasks improves dynamic load balancing, and also speeds up recovery when a worker fails.

6) The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user’s Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.

7) After all map and reduce tasks have completed, the master wakes up the user program, and the MapReduce call returns back to the user code.
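The worker-failure rule described between steps 3 and 4 can be sketched as a small piece of master-side bookkeeping. The names `State`, `Task` and `handle_worker_failure` are hypothetical and only illustrate the state transitions; pinging, timeouts and rescheduling are left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"


@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: State = State.IDLE
    worker: Optional[str] = None  # worker the task is or was assigned to


def handle_worker_failure(tasks: list[Task], failed_worker: str) -> None:
    """Reset tasks of a failed worker so the master can reschedule them."""
    for task in tasks:
        if task.worker != failed_worker:
            continue
        in_progress = task.state is State.IN_PROGRESS
        # Completed map output lives on the failed machine's local disk and is
        # lost; completed reduce output is already in the global file system.
        lost_map_output = task.kind == "map" and task.state is State.COMPLETED
        if in_progress or lost_map_output:
            task.state = State.IDLE
            task.worker = None
```

For example, if worker "w1" fails, its completed map task and its in-progress reduce task go back to idle, while its completed reduce task stays completed.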

Two equivalent questions

1) What is a straggler in the context of the paper, and how is the straggler problem solved? (by Mario)

2) What is the straggler problem and what is the approach to alleviate it? (by Daniel)

Answer

A straggler is a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation. Stragglers can arise for many reasons: for example, a machine may have a bad disk, the cluster scheduling system may have scheduled other tasks on the machine, causing it to execute the MapReduce code more slowly, or there may be a bug in the machine initialization code.

The general mechanism to alleviate the problem of stragglers is this: when a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks.

The task is marked as completed whenever either the primary or the backup execution completes.

Two equivalent questions

1) What can be said from the test results of the sort example regarding backup tasks and failures of individual machines? (by Robert)

2) What do the performance tests tell us about the effects of backup tasks and machine failures? (by Daniel)

Answer

In general: the input rate is higher than the shuffle rate and the output rate because most data is read from a local disk and bypasses the relatively bandwidth-constrained network. The shuffle rate is higher than the output rate because the output phase writes two copies of the sorted data.

Effect of backup tasks: the paper shows an execution of the sort program with backup tasks disabled. The execution flow is similar to the normal run, except that there is a long tail at the end where hardly any write activity occurs. Because the last few stragglers take much longer to finish, the entire computation shows an increase of 44% in elapsed time.

Machine failures: the paper shows an execution of the sort program in which 200 out of 1746 worker processes were killed several minutes into the computation. The underlying cluster scheduler immediately restarted new worker processes on these machines. The worker deaths show up as a negative input rate, since some previously completed map work disappears (the corresponding map workers were killed) and needs to be redone. The entire computation shows an increase of 5% over the normal execution time.

Small talk on: Semantics in the Presence of Failures

When the map and reduce operators are deterministic, the distributed implementation produces the same output as would have been produced by a non-faulting sequential execution of the entire program.

This property relies on atomic commits of map and reduce task outputs. Each in-progress task writes its output to private temporary files. A reduce task produces one such file, and a map task produces R such files (one per reduce task). When a map task completes, the worker sends a message to the master and includes the names of the R temporary files in the message. If the master receives a completion message for an already completed map task, it ignores the message. Otherwise, it records the names of the R files in a master data structure.
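How the master handles repeated completion messages can be sketched as follows. `MapTaskTable` and `on_map_completed` are hypothetical names for the master data structure mentioned above; the real structure is not described in more detail in these notes.

```python
class MapTaskTable:
    """Hypothetical master-side record of completed map tasks."""

    def __init__(self) -> None:
        # Maps a map task id to the names of its R temporary files.
        self.completed: dict[int, list[str]] = {}

    def on_map_completed(self, task_id: int, temp_files: list[str]) -> bool:
        """Record the R file names for a task; return False for a duplicate."""
        if task_id in self.completed:
            return False  # task already completed: ignore the message
        self.completed[task_id] = list(temp_files)
        return True
```

With this rule, a backup execution that finishes after the primary one cannot overwrite the file names that reduce workers may already be reading from.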

When a reduce task completes, the reduce worker atomically renames its temporary output file to the final output file. If the same reduce task is executed on multiple machines, multiple rename calls will be executed for the same final output file. The implementation relies on the atomic rename operation provided by the underlying file system to guarantee that the final file system state contains just the data produced by one execution of the reduce task.
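The write-then-rename pattern can be sketched on a local POSIX file system, which here is only a stand-in for the global file system used by MapReduce. `commit_reduce_output` is a hypothetical helper; `os.replace` performs the rename and is atomic on POSIX systems.

```python
import os
import tempfile


def commit_reduce_output(final_path: str, data: str) -> None:
    """Write output to a private temporary file, then atomically rename it."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Create the temporary file in the same directory so that the rename
    # stays on one file system.
    fd, temp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-reduce-")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(data)
    # os.replace overwrites an existing destination; readers see either the
    # old complete file or the new complete file, never a partial one.
    os.replace(temp_path, final_path)
```

If two executions of the same reduce task call `commit_reduce_output` for the same path, the final file holds the complete output of exactly one of them, which is sufficient when the reduce operator is deterministic.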