Distributed Data Management

(1)

Christoph Lofi José Pinto

Institut für Informationssysteme

Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de

Distributed Data Management

(2)

12.3 Google Spanner

13.1 Map & Reduce

13.2 Cloud beyond Storage

13.3 Computing as a Service
– SaaS, PaaS, IaaS

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 2

12.0 The Cloud

(3)

12.3 Spanner

• Outline and Key Features

• System Architecture:

– Software Stack
– Directories
– Data Model
– TrueTime

• Evaluation

• Case Study

[Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J. J., … Woodford, D. (2012). Spanner: Google’s Globally-Distributed Database. In Proceedings of OSDI’12: Tenth Symposium on Operating System Design and Implementation (pp. 251–264). http://doi.org/10.1145/2491245]

(4)

12.3 Motivation: Social Network

[Figure: user posts and friend lists partitioned across datacenters in the US, Brazil, Russia, and Spain (San Francisco, Seattle, Arizona, Sao Paulo, Santiago, Buenos Aires, Moscow, Berlin, Krakow, London, Paris, Madrid, Lisbon), with the partitions scaled out by factors of x1000]

(5)

12.3 Spanner

• “We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.”

(6)

12.3 Outline

• The next step from Bigtable towards an RDBMS, with strong time semantics

• Key Features:
– Temporal multi-version database
– Externally consistent global write-transactions with synchronous replication
– Transactions across datacenters
– Lock-free read-only transactions
– Schematized, semi-relational (tabular) data model
– SQL-like query interface

(7)

12.3 Key Features cont.

– Auto-sharding, auto-rebalancing, automatic failure response
– Exposes control of data replication and placement to the user/application
– Enables transaction serialization via global timestamps
– Acknowledges clock uncertainty and guarantees a bound on it
– Uses the novel TrueTime API for concurrency control
– Uses GPS devices and atomic clocks to obtain accurate time

(8)

12.3 Server configuration

Universe: a Spanner deployment

Zones: analogous to deployments of Bigtable servers (units of physical isolation)

(9)

12.3 Spanserver Software Stack

2-phase commit across groups (when needed)

(10)

(key:string, timestamp:int64) → string

• Back end: Colossus (the successor of GFS)

• To support replication:
– each spanserver implements a Paxos state machine on top of each tablet; the state machine stores the metadata and logs of its tablet

• The set of replicas of a tablet collectively forms a Paxos group

12.3 Spanserver Software Stack
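To make the (key:string, timestamp:int64) → string mapping above concrete, here is a small Python sketch of a multi-version tablet store; the class and example keys are purely illustrative, not Spanner's implementation.

# Hypothetical sketch of a tablet's multi-version mapping:
# (key: string, timestamp: int64) -> string
class Tablet:
    def __init__(self):
        self.versions = {}  # key -> list of (timestamp, value), kept sorted by timestamp

    def write(self, key, timestamp, value):
        self.versions.setdefault(key, []).append((timestamp, value))
        self.versions[key].sort()  # keep versions ordered by timestamp

    def read_latest(self, key):
        # newest version of the value stored under this key
        return self.versions[key][-1][1]

tablet = Tablet()
tablet.write("user:42", 1001, "old profile")
tablet.write("user:42", 1005, "new profile")
print(tablet.read_latest("user:42"))  # -> "new profile"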

(11)

A leader is chosen among the replicas of a Paxos group; all write requests for that group are initiated at the leader.

• At every replica that is a leader, each spanserver additionally implements:
– a lock table and
– a transaction manager

12.3 Spanserver Software Stack

(12)

• Directory – analogous to a bucket in Bigtable
– Smallest unit of data placement
– Smallest unit for defining replication properties

• A directory might in turn be sharded into fragments if it grows too large.

12.3 Directories

(13)

12.3 Data model

• Query language expanded from SQL.

• Multi-version database: every value stored in a column is versioned with a timestamp.

• Supports transactions and provides strong consistency.

• A database can contain an unlimited number of schematized tables.

(14)

• Not purely relational:

– Requires rows to have names
– Names are nothing but a set (possibly a singleton) of primary keys
– In a way, it is a key-value store with primary keys mapped to non-key columns as values (see the sketch below)

12.3 Data model
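A toy Python illustration of this key-value view; all table, column, and value names are made up.

# Rows are addressed by their primary key(s); the non-key columns act as the value.
users_table = {
    # primary key (uid,)  -> non-key columns
    (1,): {"name": "Alice", "email": "alice@example.com"},
    (2,): {"name": "Bob",   "email": "bob@example.com"},
}
print(users_table[(2,)]["email"])  # look up a row by its primary key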

(15)

12.3 Data model

Implications of Interleave: hierarchy

(16)

12.3 TrueTime

• Novel API behind Spanner’s core innovation

• Leverages hardware features like GPS and Atomic Clocks

• Implemented via TrueTime API.

Method        Returns
TT.now()      TTinterval: [earliest, latest]
TT.after(t)   True if t has passed
TT.before(t)  True if t has not arrived
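A hedged Python sketch of the interface in the table above; using the local clock and a fixed uncertainty bound EPSILON is an assumption for illustration, not how Google's time masters work.

import time

EPSILON = 0.007  # assumed uncertainty bound in seconds (illustrative only)

class TrueTime:
    def now(self):
        t = time.time()                    # local clock as a stand-in
        return (t - EPSILON, t + EPSILON)  # TTinterval: [earliest, latest]

    def after(self, t):
        # True only if t has definitely passed
        earliest, _ = self.now()
        return earliest > t

    def before(self, t):
        # True only if t has definitely not arrived yet
        _, latest = self.now()
        return latest < t

TT = TrueTime()
earliest, latest = TT.now()
print(TT.after(earliest - 1.0), TT.before(latest + 1.0))  # True True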

(17)

• “Global wall-clock time” with bounded uncertainty: TT.now() returns an interval [earliest, latest] of width 2ε around the true time

12.3 TrueTime

(18)

• A set of time master servers per datacenter and a time slave daemon per machine.

• The majority of the time masters are GPS-fitted; a few others are fitted with atomic clocks (Armageddon masters).

• Each daemon polls a variety of masters and reaches a consensus about the correct timestamp.

12.3 TrueTime implementation

(19)

12.3 TrueTime implementation

(20)

12.3 TrueTime implementation

(21)

12.3 TrueTime Architecture

[Figure: Datacenter 1 … Datacenter n, each running several GPS timemasters and an Atomic-clock timemaster; a client daemon polls masters across datacenters and computes a reference interval [earliest, latest] = now ± ε]

(22)

12.3 TrueTime

• TrueTime uses both GPS and atomic clocks because they have different failure rates and failure scenarios.

• The other two boolean methods of the API are
– After(t) – returns TRUE if t has definitely passed
– Before(t) – returns TRUE if t has definitely not arrived

• TrueTime uses these methods for concurrency control and to serialize transactions.

(23)

12.3 TrueTime

• After() is used for Paxos leader leases
– A leader uses after(smax) to check whether smax has passed before it abdicates.

• A Paxos leader cannot assign timestamps si greater than smax to transactions Ti, and clients cannot see the data committed by a transaction Ti until after(si) is true.

• Replicas maintain a timestamp tsafe, the maximum timestamp at which that replica is up to date.

(24)

12.3 Concurrency control

1. Read-Write – requires locks.

2. Read-Only – lock-free.
– Requires declaration before the start of the transaction.
– Reads information that is up to date.

3. Snapshot Read – reads information from the past by specifying a timestamp or a bound.
– The user specifies a specific timestamp from the past, or a timestamp bound, so that data up to that point is read (sketched below).
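A minimal sketch of a snapshot read over a multi-version store, assuming versions are stored as (timestamp, value) pairs; the data is hypothetical.

# key -> sorted list of (timestamp, value) versions (hypothetical data)
store = {"x": [(3, "a"), (8, "b"), (15, "c")]}

def snapshot_read(key, ts):
    """Return the newest value of key with a version timestamp <= ts."""
    candidates = [(t, v) for (t, v) in store[key] if t <= ts]
    if not candidates:
        return None
    return max(candidates)[1]

print(snapshot_read("x", 10))  # -> "b" (version written at timestamp 8)
print(snapshot_read("x", 2))   # -> None (no version existed yet)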

(25)

12.3 Timestamps

• Strict two-phase locking for write transactions

• Assign the timestamp while locks are held

[Figure: transaction T acquires locks, picks s = now() while holding them, then releases the locks]

(26)

12.3 Timestamp Invariants

• Timestamp order == commit order

• Timestamp order respects global wall-time order

[Figure: example transactions T1 to T4 whose timestamp order matches their commit order and the global wall-time order]

(27)

12.3 Timestamps and TrueTime

[Figure: transaction T acquires locks, picks s = TT.now().latest, then holds the locks during the commit wait until TT.now().earliest > s (each side of the interval is about ε on average) before releasing them; this rule is sketched below]
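A hedged sketch of the commit-wait rule from the figure, reusing a TrueTime-style interval built from the local clock and an assumed EPSILON; it only illustrates the ordering rule, not Spanner's full commit protocol.

import time

EPSILON = 0.007  # assumed clock uncertainty bound (illustrative)

def tt_now():
    t = time.time()
    return t - EPSILON, t + EPSILON  # [earliest, latest]

def commit(apply_writes):
    # Locks are assumed to be held by the caller at this point.
    _, latest = tt_now()
    s = latest                   # pick commit timestamp s = TT.now().latest
    while tt_now()[0] <= s:      # commit wait: until TT.now().earliest > s
        time.sleep(EPSILON / 10)
    apply_writes(s)              # s is now guaranteed to be in the past everywhere
    # ... locks would be released here
    return s

print(commit(lambda s: print("applied at", s)))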

(28)

12.3 Commit Wait and Replication

[Figure: transaction T acquires locks, picks s, starts Paxos consensus, achieves consensus, completes the commit wait, notifies the slaves, and releases the locks]

(29)

12.3 Commit Wait and 2-Phase Commit

[Figure: two-phase commit with commit wait: coordinator TC and participants TP1, TP2 each acquire locks; the participants compute a prepare timestamp s, log it, and report "prepared"; TC computes the overall s, sends s, completes the commit wait, commits, and notifies the participants of s; then all release their locks]

(30)

12.3 Example

[Figure: transaction TC ("Remove X from my friend list") with participant TP ("Remove myself from X's friend list") prepares with sC=6 and sP=8 and commits with overall timestamp s=8; a later transaction T2 ("Risky post P") commits at s=15; a snapshot read at time < 8 still sees X in "My friends" and me in "X's friends", at timestamp 8 both friend lists are empty, and the post P appears in "My posts" only at 15]

(31)

12.3 Evaluation

• Evaluated for replication, transactions, and availability.

• Results on the epsilon of TrueTime.

• Benchmarked on a Spanner system with
– 50 Paxos groups
– 250 directories
– Clients (applications) and zones at a network distance of 1 ms

(32)

12.3 Evaluation - Availability

(33)

12.3 Evaluation - Epsilon

“…bad CPUs are 6 times more likely than bad clocks…”

(34)

12.3 Case Study

• Spanner is currently in production use by Google’s advertising backend F1.

• F1 previously used a MySQL database that was manually sharded many ways.

• Spanner provides synchronous replication and automatic failover for F1.

(35)

12.3 Case Study cont.

• Enables F1 to specify data placement via Spanner directories, based on its needs.

• F1 operation latencies were measured over 24 hours.

(36)

12.3 Summary

• A multi-version, scalable, globally distributed, and synchronously replicated database.

• Key enabling technology: TrueTime
– Interval-based global time

• The first system to distribute data at global scale and support externally consistent distributed transactions.

• Implementation key points: the integration of concurrency control, replication, and 2PC.

(37)

13.1 Map & Reduce

13.2 Cloud beyond Storage

13.3 Computing as a Service
– SaaS, PaaS, IaaS

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 37

13.0 The Cloud

(38)

• Just storing massive amounts of data is often not enough!
– Often, we also need to process and transform that data

• Large-scale data processing
– Use thousands of worker nodes within a computation cluster to process large data batches
– But without the hassle of managing the distribution ourselves

• Map & Reduce provides
– Automatic parallelization & distribution
– Fault tolerance
– I/O scheduling
– Monitoring & status updates

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 38

13.1 Map & Reduce

(39)

• Initially implemented by Google for building the Google search index
– i.e. crawling the web, building the inverted word index, computing PageRank, etc.

• A general framework for parallel, high-volume data processing

[J. Dean, S. Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters”, Symp. on Operating System Design and Implementation (OSDI), San Francisco, USA, 2004]

– Also available as an open-source implementation as part of Apache Hadoop
http://hadoop.apache.org/mapreduce/

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 39

13.1 Map & Reduce

(40)

Base idea

– There is a large amount of input data, identified by keys
  i.e. the input is given as key-value pairs
  e.g. all web pages of the internet, identified by their URL

– A map operation is a simple function which accepts one input key-value pair
  A map operation runs as an autonomous thread on one single node of a cluster
  Many map jobs can run in parallel on different input keys
  It returns, for a single input key-value pair, a set of intermediate key-value pairs:
  map(key, value) → set of intermediate (key, value) pairs
  After a map job is finished, the node is free to perform another map job for the next input key-value pair
  A central controller distributes map jobs to free nodes

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 40

13.1 Map & Reduce

(41)

– After the input data is mapped, reduce jobs can start
– reduce(key, values) is run for each unique key emitted by map()
  Each reduce job also runs autonomously on one single node
  Many reduce jobs can run in parallel on different intermediate key groups
  Reduce emits the final output of the map-reduce operation
  Each reduce job takes all map tuples with a given key as input
  It usually generates one, but possibly more, output tuples

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 41

13.1 Map & Reduce

(42)

• Each reduce is executed on a set of intermediate map results which share the same key
– To efficiently select that set, the intermediate key-value pairs are usually shuffled,
  i.e. simply sorted and grouped by their respective key
– After shuffling, the reduce input data can be selected by a simple range scan (shuffling is sketched below)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 42

13.1 Map & Reduce
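A small stand-alone sketch of the shuffle step in plain Python (sort the intermediate pairs by key, then group them); the sample pairs are made up.

from itertools import groupby
from operator import itemgetter

# Hypothetical intermediate (key, value) pairs emitted by map tasks
intermediate = [("db", 1), ("and", 1), ("db", 1), ("p2p", 1), ("and", 1)]

# Shuffle: sort by key, then group so each reduce sees all values for one key
intermediate.sort(key=itemgetter(0))
for key, group in groupby(intermediate, key=itemgetter(0)):
    values = [v for _, v in group]
    print(key, values)   # e.g. and [1, 1] / db [1, 1] / p2p [1]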

(43)

• Example: Counting words in documents

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 43

13.1 Map & Reduce

map(key, value):
  // key: doc name; value: text of doc
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: list of counts
  result = 0
  for each v in values:
    result += v
  emit(key, result)
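A runnable single-process Python rendering of the pseudocode above, using the two example documents from the next slide; it only shows the data flow, there is no cluster involved.

from collections import defaultdict

def map_fn(doc_name, text):
    # key: doc name; value: text of doc
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # key: a word; values: list of counts
    yield word, sum(counts)

docs = {"doc1": "distributed db and p2p",
        "doc2": "map and reduce is a distributed processing technique for db"}

groups = defaultdict(list)              # shuffle: group intermediate pairs by key
for name, text in docs.items():
    for word, count in map_fn(name, text):
        groups[word].append(count)

for word, counts in sorted(groups.items()):
    for key, total in reduce_fn(word, counts):
        print(key, total)               # e.g. and 2, db 2, distributed 2, ...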

(44)

• Example: Counting words in documents

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 44

13.1 Map & Reduce

doc1: “distributed db and p2p”
doc2: “map and reduce is a distributed processing technique for db”

map(key, value) emits, among others:
(distributed, 1), (db, 1), (and, 1), (p2p, 1), (map, 1), (and, 1), (reduce, 1), (is, 1), (a, 1), (distributed, 1), …

reduce(key, values) then emits:
(distributed, 2), (db, 2), (and, 2), (p2p, 1), (map, 1), (reduce, 1), (is, 1), …

(45)

• Improvement: Combiners
– Combiners are mini-reducers that run in memory after the map phase
– Used to group rare map keys into larger groups
  e.g. word counts: group multiple extremely rare words under one key (and mark that they are grouped…)
– Used to reduce network and worker-scheduling overhead (see the sketch below)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 45

13.1 Map & Reduce
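A hedged sketch of a combiner as a local, in-memory pre-aggregation of one map task's output before it is shipped to the reducers; the rare-key grouping mentioned above is a refinement not shown here.

from collections import Counter

def map_fn(text):
    for word in text.split():
        yield word, 1

def combine(pairs):
    # mini-reduce on the mapper's own output: sum counts per word locally
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

# One map task's raw output vs. the combined output actually sent over the network
raw = list(map_fn("db and db and db"))
print(raw)            # [('db', 1), ('and', 1), ('db', 1), ('and', 1), ('db', 1)]
print(combine(raw))   # [('db', 3), ('and', 2)]  -- far fewer pairs to transfer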

(46)

• Responsibilities of the map and reduce master
– Often also called the scheduler
– Assigns map and reduce tasks to workers on nodes
  Usually, map tasks are assigned to worker nodes as a batch and not one by one
  Such a batch is often called a split, i.e. a subset of the whole input data
  A split is often implemented by a simple hash function with as many buckets as worker nodes (sketched below)
  The full split data is assigned to a worker node, which starts a map task for each input key-value pair
– Checks for node failures
– Checks for task completion
– Routes map results to reduce tasks

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 46

13.1 Map & Reduce
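A toy sketch of the hash-based split assignment mentioned above; the worker count and input keys are assumptions for illustration.

NUM_WORKERS = 4  # assumed number of worker nodes

def assign_split(input_key):
    # simple hash partitioning: one bucket per worker node
    return hash(input_key) % NUM_WORKERS

splits = {}
for url in ["http://a.example", "http://b.example", "http://c.example"]:
    splits.setdefault(assign_split(url), []).append(url)
print(splits)  # worker id -> its batch (split) of input keys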

(47)

• Map and Reduce overview

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 47

13.1 Map & Reduce

(48)

• The master is responsible for worker-node fault tolerance
– Handled via re-execution
  Detect failures via periodic heartbeats
  Re-execute completed + in-progress map tasks
  Re-execute in-progress reduce tasks
  Task completion is committed through the master
– Robust: once lost 1600 of 1800 machines and still finished OK

• Master failures are not handled
– Unlikely due to redundant hardware…

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 48

13.1 Map & Reduce

(49)

• Showcase: machine usage during web indexing
– Fine-granularity tasks: number of map tasks >> number of machines
  Minimizes time for fault recovery
  Shuffling can be pipelined with map execution
  Better dynamic load balancing
– The showcase uses 200,000 map & 5,000 reduce tasks
– Running on 2,000 machines

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 49

13.1 Map & Reduce

(50)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 50 [Source: Bill Howe, UW]

13.1 Google Systems Summary

(51)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 51

[Timeline 2004–2012 of Google data systems and their open-source counterparts: MapReduce, BigTable, Hadoop, HBase, Dremel, Pregel, Megastore, Tenzing, Spanner]

13.1 Google Systems Summary

(52)

PageRank is one of the major algorithms behind Google Search
– See our wonderful IRWS lecture (No. 12)!!

Key question: how important is a given website?
– Importance is independent of the query

Idea: other pages “vote” for a site by linking to it
– Also called “giving credit to”
– Pages with many votes are probably important
– If an important site “votes” for another site, that vote has a higher weight than when an unimportant site votes

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 52

13.1 MR - PageRank

[Figure: pages t1, t2, t3 linking to page x]

(53)

• Given a page $x$ with in-bound links $t_1, \dots, t_n$, where
– $C(t)$ is the out-degree of $t$
– $\alpha$ is the probability of a random jump
– $N$ is the total number of nodes in the graph

$PR(x) = \alpha \cdot \frac{1}{N} + (1 - \alpha) \cdot \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}$

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 53

13.1 MR - PageRank

(54)

• Properties of PageRank
– Can be computed iteratively
– The effects of each iteration are local

• Sketch of the algorithm:
– Start with seed PRi values
– Each page distributes its PRi “credit” to all pages it links to
– Each target page adds up the “credit” from its multiple in-bound links to compute PRi+1
– Iterate until the values converge

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 54

13.1 MR - PageRank

(55)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 55

13.1 MR - PageRank

Map step: distribute PageRank “credit” to link targets

Reduce step: gather up PageRank “credit” from multiple sources to compute the new PageRank value (see the sketch below)
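A compact Python sketch of one PageRank iteration in map/reduce form, following the two steps named above; the damping factor and the three-node example graph are assumptions for illustration.

from collections import defaultdict

ALPHA, N = 0.15, 3                                   # random-jump probability, number of nodes
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}    # hypothetical link graph
pr = {p: 1.0 / N for p in graph}                     # seed PageRank values

def map_fn(page, rank):
    # distribute this page's PageRank "credit" to its link targets
    out = graph[page]
    for target in out:
        yield target, rank / len(out)

def reduce_fn(page, credits):
    # gather credit from all in-bound links and apply the PageRank formula
    yield page, ALPHA / N + (1 - ALPHA) * sum(credits)

credits = defaultdict(list)
for page, rank in pr.items():
    for target, credit in map_fn(page, rank):
        credits[target].append(credit)

pr = dict(kv for page in graph for kv in reduce_fn(page, credits[page]))
print(pr)   # new PageRank values after one iteration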

(56)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 56

• Dryad (Microsoft)

– Relational Algebra

• Pig (Yahoo)

– Near Relational Algebra over MapReduce

• HIVE (Facebook)

– SQL over MapReduce

• Cascading

– University of Wisconsin

• HBase

– Indexing on HDFS

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig

13.1 MapReduce Contemporaries

(57)

• An engine for executing programs on top of Hadoop.

• It provides a language, Pig Latin, to specify these programs.

• An Apache open-source project: http://pig.apache.org

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 57

13.2 Pig

(58)

• Suppose you have user data in one file and website data in another, and you need to find the top 5 most-visited sites by users aged 18–25.

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 58

13.2 Pig: motivation

[Dataflow: Load Users and Load Pages → Filter Users by age → Join on name → Group on url → Count clicks → Order by clicks → Take top 5]

(59)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 59

13.2 In MapReduce

170 lines of code, 4 hours to write

(60)

Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';

9 lines of code, 15 minutes to write

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 60

13.2 In Pig Latin

(61)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 61

13.2 Pig System Overview

Pig Latin program:

A = LOAD 'file' AS (sid, pid, mass, px:double);
B = LOAD 'file2' AS (sid, pid, mass, px:double);
C = FILTER A BY px < 1.0;
D = JOIN C BY sid, B BY sid;
STORE D INTO 'output.txt';

[Figure: the Pig parser turns the program into a parsed program; the Pig compiler turns that into an execution plan: LOAD (DISK A) and LOAD (DISK B) → FILTER → JOIN]

(62)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 62

13.2 Pig Performance vs Map-Reduce

How fast is Pig compared to a pure Map-Reduce implementation?

(63)

• Atom: integer, string, etc.

• Tuple:
– A sequence of fields
– Each field can be of any type

• Bag:
– A collection of tuples
– Not necessarily of the same type
– Duplicates allowed

• Map:
– String literal keys mapped to values of any type

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 63

13.2 Data model

(64)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 64

13.2 Pig Latin statement

A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
X = FOREACH A GENERATE name, $2;

                                               First field   Second field   Third field
Data type                                      chararray     int            float
Positional notation (generated by system)      $0            $1             $2
Possible name (assigned by user via a schema)  name          age            gpa

(65)

• Map-Reduce: Iterative Jobs

– Iterative jobs involve a lot of disk I/O for each repetition

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 65

13.2 Apache Spark motivation

(66)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 66

13.2 Apache Spark motivation

Using Map Reduce for complex jobs, interactive queries and online processing involves lots of disk I/O

Idea: keep more data in memory!

(67)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 67

13.2 Use memory instead of disk

(68)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 68

13.2 In memory data sharing

(69)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 69

13.2 MapReduce Programmability

• Most real applications require multiple MR steps:
– Google indexing pipeline: 21 steps
– Analytics queries (e.g. count clicks & top-k): 2–5 steps
– Iterative algorithms (e.g. PageRank): tens of steps

• Multi-step jobs create spaghetti code
– 21 MR steps → 21 mapper and reducer classes

(70)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 70

13.2 Programmability

(71)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 71

13.2 Performance

[Source: Daytona GraySort benchmark, sortbenchmark.org]

(72)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 72

13.2 Apache Spark

• Open-source processing engine.

• Originally developed at UC Berkeley in 2009.

• More than 100 operators for transforming data.

• World record for large-scale on-disk sorting.

• Built-in support for many data sources (HDFS, RDBMS, S3, Cassandra).

[Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010). Spark: Cluster Computing with Working Sets. In HotCloud’10: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing.]

[Zaharia, M., Chowdhury, M., Das, T., Dave, A., et al. (2012). Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In NSDI’12: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation.]

(73)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 73

13.2 Spark Tools

(74)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 74

13.2 Resilient Distributed Datasets (RDDs)

Write programs in terms of distributed datasets and operations on them

• Resilient Distributed Datasets
– Collections of objects spread across a cluster, stored in RAM or on disk
– Built through parallel transformations
– Automatically rebuilt on failure

• Operations (see the sketch below)
– Transformations (e.g. map, filter, groupBy)
– Actions (e.g. count, collect, save)
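A short PySpark sketch of this workflow (lazy transformations, caching, and actions); it assumes a local Spark installation, and the inline sample log lines stand in for a real sc.textFile() input so the sketch stays self-contained.

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# In a real cluster this would be sc.textFile("hdfs://..."); a tiny in-memory
# sample keeps the example runnable on its own.
lines = sc.parallelize([
    "INFO  request served",
    "ERROR timeout contacting backend",
    "ERROR disk full on node 7",
])

errors = lines.filter(lambda l: l.startswith("ERROR"))  # transformation (lazy)
errors.cache()                                          # keep this RDD in memory

print(errors.count())                                   # action: triggers computation -> 2
print(errors.filter(lambda l: "timeout" in l).count())  # reuses the cached data -> 1

sc.stop()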

(75)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 75

13.2 Working with RDDs

(76)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 76

13.2 Spark and Map Reduce Differences

                          Hadoop MapReduce    Spark
Storage                   Disk only           In-memory or on disk
Operations                Map and Reduce      Map, Reduce, Join, Sample, etc.
Execution model           Batch               Batch, interactive, streaming
Programming environments  Java                Scala, Java, R, and Python

(77)

Distributed Data Management – Christoph Lofi – IfIS – TU Braunschweig 77

Sample case

13.2 Log mining

(78)–(94)

13.2 Log mining
