(1)

Prof. Dr. Wolf-Tilo Balke

Institut für Informationssysteme

Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de

Distributed Data Management

(2)

12.0 Trade-Offs in Distributed Databases 12.1 The PAXOS Protocol

12.2 Google Megastore

12.3 Outlook: Google Spanner

12 Advanced Consistency

(3)

• For large-scale Web data management, features which are hard to get in an RDBMS may be quite interesting

– Linear scalability and elasticity

We want to add new machines with only little overhead; the performance of the DB should increase with each addition

– Full availability

The DB should never fail, and should always accept reads and writes

The DB should be disaster-proof

– Flexibility during development and support for prototyping

The DB should not enforce design-time decisions only, but should allow for changes to the data model at runtime

12.0 Trade-Offs in Big Data Storage

(4)

These new requirements often conflict with typical RDBMS behavior and features

– We need to find suitable trade-offs

Example: Amazon Dynamo

– Focus on availability, scalability, and latency, but…

No guaranteed consistency between replicas

Asynchronous replication with eventual consistency

Simple key-value store

No complex data model, no query language

No traditional transaction support

12.0 Trade-Offs in Big Data Storage

(5)

Example: Google BigTable

– Still focus on availability and scalability, but with a slightly more powerful data model

One big “table” with multiple attributes

– But…

No guaranteed consistency between replicas

Replication is asynchronous

Separation of data mutations and control flow decreases conflicts

Append-only write model decreases conflict potential

No complex query languages

No traditional transaction support

12.0 Trade-Offs in Big Data Storage

(6)

• Core problems for performance seem to be

replica consistency and transaction support

– What is the exact problem?

12.0 Trade-Offs in Big Data Storage

[Figure: partition timeline — the system starts in state S; when a partition starts (and is detected), both sides enter partition mode and diverge into states S1 and S2; after the partition ends, partition recovery merges them back into a state S']

(7)

There are three common approaches for ensuring replica consistency

Synchronous Master-Slave Replication

One master server holds the primary copy

Each mutation is pushed to all slaves (replicas)

Mutation is only acknowledged after each slave was mutated successfully

e.g., 2-Phase Commit Protocol

Advantage:

Replicas always consistent, even during failure

Very suitable for developing proper transaction protocols

Disadvantage:

Potentially unable to deal with partitions

System can be unavailable during failure states, bad scalability?

Failing master requires expensive repairs

12.0 Trade-Offs in Big Data Storage

(8)

Asynchronous Master-Slave

Master writes mutations to at least one slave

Usually using write-ahead logs

Acknowledge the write when the log write is acknowledged by the slave

For improved consistency, wait for more log acknowledgements

Propagate mutation to other slaves as soon as possible

e.g., Google BigTable

Advantages:

Can be used for efficient transactions (with minor problems)

Disadvantages:

Could result in data loss or inconsistencies

» Loss of the master and/or the first slave

» Reads during mutation propagation

Repairs after losing the master can be complex

» Needs a consensus protocol

12.0 Trade-Offs in Big Data Storage

(9)

Optimistic Replication

All nodes are homogeneous, there are no masters and no slaves

Any node can accept a mutation,

which is then asynchronously pushed to all replicas

e.g., Amazon Dynamo

Acknowledgements after the first n writes (n may be 1); see the sketch after this slide

Advantages:

Very available

Very low latency

Disadvantages:

Consistency problems are quite common

No transaction protocols can be built on top, because the global mutation order is unknown

12.0 Trade-Offs in Big Data Storage
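To make the "acknowledge after the first n writes" idea concrete, here is a minimal Python sketch of a leaderless (optimistic) write path. All names such as OptimisticStore and ack_threshold are illustrative assumptions, not Dynamo's actual API: the client is acknowledged as soon as n replicas hold the value, the remaining replicas are updated asynchronously.

```python
import threading

class OptimisticStore:
    """Sketch of leaderless replication: any node accepts a write, the client
    is acknowledged after the first n replica writes, and the remaining
    replicas are updated asynchronously (eventual consistency)."""

    def __init__(self, replicas, ack_threshold=1):
        self.replicas = replicas            # list of dicts standing in for replica nodes
        self.ack_threshold = ack_threshold  # n: acks required before answering the client

    def put(self, key, value):
        acks = 0
        for replica in self.replicas:
            if acks < self.ack_threshold:
                replica[key] = value        # synchronous write, counted as an ack
                acks += 1
            else:
                # push to the remaining replicas in the background
                threading.Thread(target=replica.__setitem__, args=(key, value)).start()
        return acks                         # acknowledged before all replicas have converged

# usage: three replicas, acknowledge after the first successful write
store = OptimisticStore([{}, {}, {}], ack_threshold=1)
print(store.put("user:42", "Alice"))        # -> 1
```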

(10)

Core problem:

If the system crashes while using asynchronous replication, which state is correct?

– In Master-Slave settings, the master is always correct

If the master fails, an expensive master failover recovery is necessary

– In homogeneous settings,…?!

There is no perfect system for determining the global order of mutations

Normal timestamps via internal clocks are too unreliable

Vector clocks help, but are still not good enough (see the sketch after this slide)

Any chance to rely on consensus?

12.1 The Paxos Protocol
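The remark about vector clocks can be illustrated with a short sketch: vector clocks can detect that two updates are concurrent, but they only give a partial order, so two conflicting replicas may remain incomparable. The helper names below are assumptions for illustration.

```python
def vc_leq(a, b):
    """True if vector clock a happened-before (or equals) vector clock b."""
    nodes = set(a) | set(b)
    return all(a.get(n, 0) <= b.get(n, 0) for n in nodes)

def compare(a, b):
    if vc_leq(a, b) and vc_leq(b, a):
        return "equal"
    if vc_leq(a, b):
        return "a happened-before b"
    if vc_leq(b, a):
        return "b happened-before a"
    return "concurrent (conflict, no global order)"

# Two replicas accepted independent writes during a partition:
clock_a = {"node1": 2, "node2": 1}
clock_b = {"node1": 1, "node2": 2}
print(compare(clock_a, clock_b))   # -> concurrent (conflict, no global order)
```

The last case is exactly the problem: without a global order of mutations, no node can decide which state is correct, which motivates a consensus protocol.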

(11)

The PAXOS Problem

How to reach consensus/data consistency in distributed systems that can tolerate non-malicious failures?

Original paper:

L. Lamport: “The part-time parliament”. ACM Transactions on Computer Systems, 1998. doi:10.1145/279227.279229

– Paxos made simple:

http://courses.cs.vt.edu/cs5204/fall10-kafura-NVC/Papers/FaultTolerance/Paxos-Simple-Lamport.pdf

12.1 The Paxos Protocol

(12)

• Actually, Paxos is a family of protocols for solving consensus problems in a network of unreliable processors

– Unreliable, but not malicious!

For malicious processors, see Byzantine Agreements

• Consensus protocols are the basis for the state machine replication approach

12.1 The Paxos Protocol

(13)

• Why should we care about PAXOS?

– It can only find a consensus on a simple single value…

This seems rather…useless..?!

But, transaction logs can be replicated to multiple nodes, and then PAXOS can find a consensus on log entries!

This can solve the distributed log problem

i.e. determining a serializable order of operations

Allows developing proper log-based transaction and recovery algorithms

12.1 The Paxos Protocol

(14)

Basic problem: find a consensus about the value of some data item

Safety

Only values actually proposed by some node may be chosen

At each time only a single value is chosen

Only chosen values are replicated to (or learned by) nodes

Liveness

Some proposed value is eventually chosen

If some value has been chosen, each node eventually learns about this choice

12.1 The Paxos Protocol

(15)

• Classes of agents

Proposers

Take a client request and start a voting phase, then act as coordinator of the voting phase

Acceptors (or Voters)

Acceptors vote on the value of an item

Usually organized in quorums

i.e., a majority group of acceptors which can reliably vote on an item

A subset of all acceptors

Any two quorums share at least one member

Learners

Learners replicate some functionality in the PAXOS protocol for reliability

e.g., communicate with client, repeating vote results, etc.

• Each node can act as more than one agent

12.1 Paxos Notation

(16)

12.1 Paxos Algorithm

[Figure: Paxos message flow between two proposers, three acceptors, and a learner]

(17)

Phase 1 (prepare)

A proposer (acting as the leader) creates a new proposal with an increasing number n

Send “prepare” request to all acceptors

Acceptor receives a prepare request with number n

If n is greater than the number of any previous prepare request, it accepts it and promises not to accept any lower-numbered requests

If the acceptor has accepted a lower-numbered proposal in the past, it returns that proposal's number and value to the proposer

If n is smaller than any previous prepare request, decline

12.1 Paxos Algorithm

(18)

• Phase 2 (accept)

Proposer

If proposer receives enough promises from acceptors, it sets a value for its proposal

“Enough” means from a valid majority of acceptors (quorum)

If any acceptor had accepted a different proposal before, the proposer already got the respective value in the prepare phase

» Use the value of the highest returned proposal

» If no proposals returned by acceptors, choose a value

Send accept message to all acceptors in quorum with the chosen value

Acceptor

The acceptor accepts the proposal if and only if it has not promised to only accept higher-numbered proposals

i.e., it accepts the value if it has not already promised to vote on a newer proposal

It registers the value and sends accept messages to the proposer and all learners

12.1 Paxos Algorithm
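To tie the two phases together, here is a compact single-decree Paxos sketch in Python. It runs one proposer round against in-memory acceptors and omits real message passing, failures, and learners; all class and function names are my own, not from the lecture.

```python
class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest prepare number promised so far
        self.accepted_n = -1        # highest proposal number accepted so far
        self.accepted_value = None

    def prepare(self, n):
        """Phase 1: promise not to accept proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            # return any previously accepted proposal so the proposer must adopt its value
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", None, None)

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered prepare was promised meanwhile."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors, n, value):
    """One proposer round; returns the chosen value or None if no quorum."""
    quorum = len(acceptors) // 2 + 1

    # Phase 1: prepare
    promises = [a.prepare(n) for a in acceptors]
    granted = [(an, av) for kind, an, av in promises if kind == "promise"]
    if len(granted) < quorum:
        return None                  # no quorum of promises: retry with a higher n

    # Use the value of the highest-numbered previously accepted proposal, if any
    prev_n, prev_value = max(granted, key=lambda p: p[0])
    if prev_n >= 0:
        value = prev_value

    # Phase 2: accept
    accepted = sum(1 for a in acceptors if a.accept(n, value))
    return value if accepted >= quorum else None


acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, n=2, value=8))   # -> 8 (accepted by a majority, hence chosen)
print(propose(acceptors, n=4, value=5))   # -> 8 (the already chosen value survives)
```

The second call illustrates the safety property: even with a newer proposal number and a different value, the proposer must adopt the value that was already chosen.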

(19)

A value is chosen at proposal number n, if and only if a majority of acceptors accept that value in phase 2

As reported by learners

• No choice if…

…multiple proposers send conflicting prepare messages

…there is no quorum of responses

In that case: restart with a higher proposal number

• Eventually, there will be a consensus

12.1 Definition of “chosen”

(20)

Three important properties must hold

– Any proposal number is unique (see the sketch after this slide)

– Any two sets of acceptors (quorums) have at least one acceptor in common

– The value sent out in phase 2 is always the value of the highest-numbered proposal of all the responses in phase 1

12.1 Paxos Properties
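The first property (unique proposal numbers) is commonly achieved by combining a local round counter with a unique node id. A small sketch of one possible encoding; the encoding itself is an assumption, Paxos only requires that numbers are unique and totally ordered.

```python
class ProposalNumbers:
    """Generate globally unique, increasing proposal numbers by pairing a
    local round counter with a unique node id."""

    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.num_nodes = num_nodes
        self.round = 0

    def next(self):
        self.round += 1
        # round-major encoding: numbers from different nodes can never collide
        return self.round * self.num_nodes + self.node_id

gen_a = ProposalNumbers(node_id=0, num_nodes=5)
gen_b = ProposalNumbers(node_id=1, num_nodes=5)
print(gen_a.next(), gen_b.next())   # -> 5 6  (distinct and totally ordered)
```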

(21)

12.1 Paxos by Example

[Figure: Proposer A sends prepare requests [n=2, v=8] and Proposer B sends prepare requests [n=4, v=5] to the acceptors X, Y, and Z]

Example taken from http://angus.nyc/2012/paxos-by-example

(22)

12.1 Paxos by Example

[Figure: Acceptors X and Y answer Proposer A's prepare request [n=2, v=8] with prepare response [no previous]]

(23)

Acceptor Z ignores A's request since it has already received a higher-numbered request from B

Proposer A sends accept requests with n=2 and value v=8 to acceptors X and Y

They are ignored! (in the meantime, X and Y have already promised a proposal with n=4)

12.1 Paxos by Example

[Figure: Acceptor Y answers Proposer B's prepare request with prepare response [n=2, v=8], the highest-numbered proposal it has seen; the acceptors have now promised proposal n=4]

(24)

Proposer B sends an accept request with the highest received previous value v=8 to its quorum

12.1 Paxos by Example

[Figure: Proposer B sends accept request [n=4, v=8] to its quorum; the acceptors record [n=4, v=8]]

(25)

If an acceptor receives an accept request with a number higher than or equal to its highest seen proposal, it sends its value to each learner

A value is chosen when a learner gets messages from a majority of the acceptors

12.1 Paxos by Example

[Figure: The acceptors forward the accepted value [v=8] to the learner; v=8 is chosen]

(26)

There are some options to learn a chosen value

– Whenever an acceptor accepts a proposal, it informs all the learners

Acceptors may inform a distinguished learner (usually the proposer), and the distinguished learner broadcasts the result

12.1 Learning a Chosen Value

(27)

• Chubby lock service

M. Burrows: “The Chubby lock service for loosely-coupled distributed systems”. In Procs. of the 7th Symp. on Operating Systems Design and Implementation (OSDI), 2006.

http://portal.acm.org/citation.cfm?id=1298487

• Petal: Distributed virtual disks

E. Lee & C. Thekkath:“Petal: Distributed virtual disks” In Procs. of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS),1996.

doi:10.1145/237090.237157

• Frangipani: A scalable distributed file system

C. Thekkath, T. Mann & E. Lee: “Frangipani: a scalable distributed file system”. In SIGOPS Operating Systems Review, vol. 31, 1997.

doi:10.1145/269005.266694

12.1 Early applications

(28)

Often, PAXOS is considered to have a very limited scalability

– Some authors argued that it is not adaptable to Cloud applications

– But…

Google Megastore!

– Actually used PAXOS for building consistent, scalable data stores

ACID transactions!

Baker, Jason, et al. “Megastore: Providing Scalable, Highly Available Storage for Interactive Services.” CIDR, vol. 11, 2011.

12.1 Using PAXOS in Practice

(29)

Megastore was originally designed to run Google’s AppEngine

– AppEngine developers demanded some more traditional DB features

Support for (simple) schemas

Support for secondary indexes

Support for (simple) joins

Support for (simple) transactions

– Still, they wanted Google’s infrastructure

Built on top of BigTable and GFS

Still, they needed high scalability

12.2 Megastore

(30)

Megastore stores tables in Bigtable

Megastore is a geo-scale structured database

Bigtable is a cluster-level structured storage

12.2 Megastore

(31)

Each schema has a set of tables

Each table has a set of entities

Each entity has a set of strongly typed properties

Can be required, optional, or repeated (multi-value)

So, no 1st normal form…

Each table needs a (usually composite) primary key

• Each table is either an entity group root table or a child table

There is a single foreign key between children and their root

An entity group is a root entity with all its child entities

12.2 Megastore - Schema

(32)

• Example schema (sketched below):

12.2 Megastore - Schema
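The example schema itself only appears as a figure in the slides; below is a rough Python rendering of the kind of schema the next slide refers to — a User root table with a Photo child table. The dataclass form and the concrete property names are illustrative assumptions; Megastore actually defines schemas in its own DDL.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class User:
    """Entity group root table, primary key (user_id)."""
    user_id: int                     # required property
    name: str                        # required property

@dataclass
class Photo:
    """Child table, primary key (user_id, photo_id);
    user_id is the single foreign key to its entity group root (User)."""
    user_id: int                     # required, references User
    photo_id: int                    # required
    time: int                        # required
    full_url: str                    # required
    thumbnail_url: str = ""          # optional property
    tags: List[str] = field(default_factory=list)   # repeated property (no 1st normal form)

# The entity group of user 42 is the User(42) entity plus all Photo
# entities whose user_id is 42.
```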

(33)

Map Megastore schemas to Bigtable

Each entity is mapped to a single Bigtable row

Use primary keys to cluster entities which will likely be read and accessed together

Entity groups will be stored consecutively

“IN TABLE” command forces an entity into a specific Bigtable

Bigtable column name = Megastore table name + property name

In this example: photos are stored close to users

12.2 Megastore - Schema
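A minimal sketch of the mapping rules above: the Bigtable row key is derived from the Megastore primary key, so the entities of one group sort next to each other, and column names are built as table name + property name. The helper function and the concrete key encoding are assumptions for illustration only.

```python
def to_bigtable_row(table_name, primary_key, properties):
    """Map one Megastore entity to a single Bigtable row.

    The row key encodes the primary key, so child entities (whose key is
    prefixed by the root's key) are stored right next to their root entity.
    Column names are the Megastore table name plus the property name."""
    row_key = "/".join(str(part) for part in primary_key)
    columns = {f"{table_name}.{prop}": value for prop, value in properties.items()}
    return row_key, columns

# Root entity and one of its children share the row-key prefix "42"
print(to_bigtable_row("User", (42,), {"name": "Alice"}))
print(to_bigtable_row("Photo", (42, 7), {"time": 1, "full_url": "http://example.org/p7"}))
# ('42',   {'User.name': 'Alice'})
# ('42/7', {'Photo.time': 1, 'Photo.full_url': 'http://example.org/p7'})
```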

(34)

• You can have two levels of indexes

Local indexes within each entity group

Stored in the group, and updated consistently and atomically

Global index

Spans all entity groups of a table

Not necessarily consistent

Consistent update of global index

Requires additional latching (index locking)

Expensive

Many Web applications don’t need this anyway…

12.2 Megastore - Schema

(35)

• Considerations

We want geo-replication across multiple datacenters

But also replication within a datacenter

Data can probably be sharded by user location

Most reads and writes for a data item will go to one specific datacenter

“Master datacenter”

Within one data center, there should be no master

Reads and writes can be issued to any replica node

Better performance with respect to availability

12.2 Megastore - Transactions

(36)

Each entity group is independently and synchronously replicated

Uses a low latency version of Paxos (not original Paxos!)

Writes only require two inter-replica roundtrips, reads require one inter-replica roundtrip

12.2 Megastore - Transactions

(37)

• Each region has an app server, the local data center is the local replica

Transaction management happens within the local replica

Remote changes are pushed to a special replication server

12.2 Megastore - Transactions

(38)

• Simple assumption: Each entity group is a “Mini-Database”

Have serializable ACID semantics within each entity group

Serializable ACID >> Strict Consistency >> Eventual Consistency

Megastore uses the MVCC protocol

“MultiVersion Concurrency Control” with transaction timestamps

Timestamping can be used because transactions are handled within a single “master” datacenter

12.2 Megastore - Transactions

(39)

• What are MVCC transactions?

Data is not replaced, but instead new versions are written

Use timestamps to label versions

Increases concurrency (old versions can be read while work on new versions is ongoing)

Used by e.g. BigTable

…but also optionally found in many normal RDBMS

Transactions with multiversions

You can use multiversion 2-Phase-Commits

You can use multiversion timestamp ordering!

Reading: Select appropriate version of data based on timestamp

12.2 Megastore - Transactions
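A short sketch of the multiversion idea: each write appends a new timestamped version instead of overwriting, and a read at timestamp t returns the newest version that is not younger than t. The class is illustrative, not Megastore's or Bigtable's implementation.

```python
import bisect

class MVCCItem:
    """Keep all versions of one data item, ordered by commit timestamp."""

    def __init__(self):
        self.timestamps = []   # sorted commit timestamps
        self.values = []       # value written at the corresponding timestamp

    def write(self, timestamp, value):
        # new versions are appended instead of replacing old ones
        pos = bisect.bisect(self.timestamps, timestamp)
        self.timestamps.insert(pos, timestamp)
        self.values.insert(pos, value)

    def read(self, timestamp):
        # return the latest version with commit timestamp <= read timestamp
        pos = bisect.bisect_right(self.timestamps, timestamp)
        return self.values[pos - 1] if pos else None

item = MVCCItem()
item.write(10, "v1")
item.write(20, "v2")
print(item.read(15))   # -> 'v1'  (old version still readable while newer ones exist)
print(item.read(25))   # -> 'v2'
```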

(40)

Read consistency: Three read levels

Current: Wait for uncommitted writes, then read last committed values

We know from the write-ahead logs what is not yet committed

Might delay a read request, higher latency

Snapshot: Does not wait, just read the last committed value

That is the value of the last transaction known to have committed successfully

Inconsistent: Just read from any node the last available value

Might be stale, might be dirty, might be otherwise problematic

12.2 Megastore - Transactions
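A toy illustration of how the three read levels differ, using a single replica object with a committed value, a possibly stale applied value, and a pending (uncommitted) write-ahead-log entry. Waiting for the pending commit is only simulated; all names are assumptions for illustration.

```python
class ReplicaReader:
    """Illustrative replica state for the three read levels."""

    def __init__(self, committed, latest_applied, pending_commit=None):
        self.committed = committed            # last value known to be committed
        self.latest_applied = latest_applied  # whatever this (possibly lagging) replica applied
        self.pending_commit = pending_commit  # uncommitted write-ahead-log entry, if any

    def read(self, level):
        if level == "current":
            # wait for the pending write to commit, then read the new committed value
            if self.pending_commit is not None:
                self.committed = self.pending_commit   # stand-in for "wait for commit"
                self.pending_commit = None
            return self.committed
        if level == "snapshot":
            return self.committed                      # no waiting, last committed value
        if level == "inconsistent":
            return self.latest_applied                 # may be stale or dirty
        raise ValueError(level)

replica = ReplicaReader(committed="v1", latest_applied="v0", pending_commit="v2")
print(replica.read("snapshot"))      # -> 'v1'
print(replica.read("inconsistent"))  # -> 'v0'  (stale)
print(replica.read("current"))       # -> 'v2'  (after waiting for the pending commit)
```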

(41)

Write consistency:

Current read: Get timestamp and log position of last committed transaction

Determine the next available log position

Gather writes into a log entry

Commit: Use Paxos to achieve consensus for appending the log entry to log

Use Paxos to settle on the next timestamp and a proper log position

Finding the next timestamp can be tricky in a concurrent distributed environment…

Optimistic concurrency: If multiple concurrent writes try to modify the same log position, only one will win!

The others will realize this and retry

All logs will be in sync

Write data mutations to Bigtable entities and indexes

Clean up after successful operation

12.2 Megastore - Transactions
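A sketch of the optimistic-concurrency step in this write path: concurrent writers race for the same next log position, exactly one append wins, and the losers retry at a later position. In Megastore the winner of a position is settled via Paxos; here a plain in-memory dict stands in for the replicated log, and all names are illustrative.

```python
class ReplicatedLog:
    """Stand-in for the Paxos-replicated write-ahead log: at most one
    log entry can win each log position."""

    def __init__(self):
        self.entries = {}     # log position -> committed log entry

    def next_position(self):
        return len(self.entries)

    def try_append(self, position, entry):
        # in Megastore this decision is reached via Paxos; here: first writer wins
        if position in self.entries:
            return False      # somebody else already won this position
        self.entries[position] = entry
        return True


def commit(log, mutations, max_retries=5):
    for _ in range(max_retries):
        pos = log.next_position()           # determine the next available log position
        if log.try_append(pos, mutations):  # only one concurrent writer succeeds
            return pos                      # now apply mutations to Bigtable, then clean up
    raise RuntimeError("too much contention, transaction aborted")

log = ReplicatedLog()
print(commit(log, {"photo:7": "new tag"}))   # -> 0
print(commit(log, {"user:42": "new name"}))  # -> 1
```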

(42)

• Use limited consistency guarantees

• Messages (mutations) within an entity group are synchronous

– Use a two-phase locking protocol

– Potentially slow with high latency

• Messages between entity groups are asynchronous

– Use message queues

– Fast but might miss some updates

12.2 Megastore - Transactions

(43)

• Operations across entity groups

12.2 Megastore - Transactions

(44)

• Timeline for read on local replica A

12.2 Megastore - Transactions

(45)

• Timeline for writes

12.2 Megastore - Transactions

(46)

Key Features of Megastore are…

Schematized semi-relational tables

Replicated ACID transactions

Only within entity groups, sloppy consistency when spanning multiple groups

Heavy reliance on Paxos to ensure proper ordering of operations in logs

Synchronous replication support across data-centers.

Still, we have…

Limited scalability

Lack of query language

Manual partitioning of the data

12.3 Summary Megastore

(47)

Google's Spanner is a highly available, global-scale distributed database

Addresses some of Megastore's shortcomings

Complex system architecture

Software Stack

Directories

Data Model

TrueTime

J. C. Corbett, J. Dean, M. Epstein, et al.: “Spanner: Google's Globally-Distributed Database”. In Procs. of the 10th Symp. on Operating System Design and Implementation (OSDI), 2012.

http://doi.org/10.1145/2491245

12.3 Google Spanner

(48)

Key features are…

– Schematized, semi-relational (tabular) data model

– SQL-like query interface

Enables transaction serialization via global timestamps

Uses the novel TrueTime API to accomplish concurrency control

Acknowledges clock uncertainty and guarantees a strict bound on it

Uses GPS devices and atomic clocks to get accurate time

12.3 Key Features

(49)

Servers work on two levels

Universe: Spanner deployment

Zones: analogous to deployments of Bigtable servers (units of physical isolation)

12.3 Server Configuration

(50)

(key:string, timestamp:int64) → string

Back End: Colossus (successor of GFS)

• To support replication:

each spanserver implements a Paxos state machine on top of each tablet, and the state machine stores the metadata and log of its tablet

The set of replicas is collectively a Paxos group

12.3 Spanserver Software Stack

(51)

12.3 Spanserver Software Stack

(52)

A directory is a set of contiguous keys sharing a common prefix

Smallest unit of data placement

Smallest unit to define replication properties

• Each directory can be sharded into fragments if it grows too large

12.3 Directories

(53)

12.3 Data Model

[Figure: example schema — the INTERLEAVE IN declarations imply a parent-child table hierarchy used for data locality]

(54)

Novel API distributing a globally synchronized proper time

– Leverages hardware features like GPS and atomic clocks

– Implemented via the TrueTime API

12.3 TrueTime

Method — Returns

TT.now() — TTinterval: [earliest, latest]

TT.after(t) — true if t has definitely passed

TT.before(t) — true if t has definitely not arrived

TTinterval is guaranteed to contain the absolute time during which TT.now() was invoked
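A toy model of this interface in Python: TT.now() returns an interval whose half-width is the current uncertainty ε, and after()/before() only answer positively when the interval leaves no doubt. The concrete ε value is made up; the real implementation is backed by GPS and atomic clock masters, as described on the next slides.

```python
import time
from collections import namedtuple

TTinterval = namedtuple("TTinterval", ["earliest", "latest"])

class TrueTime:
    """Toy TrueTime: the local clock wrapped in an explicit uncertainty bound."""

    def __init__(self, epsilon=0.007):    # assumed ~7 ms uncertainty, illustrative only
        self.epsilon = epsilon

    def now(self):
        t = time.time()
        return TTinterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        # true only if t has definitely passed, whatever the real clock error is
        return t < self.now().earliest

    def before(self, t):
        # true only if t has definitely not arrived yet
        return t > self.now().latest

tt = TrueTime()
interval = tt.now()
print(interval.latest - interval.earliest)   # interval width = 2 * epsilon
print(tt.after(interval.earliest - 1.0))     # -> True, clearly in the past
```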

(55)

12.3 TrueTime Implementation

timeslave daemon on each machine

set of time master machines per datacenter

The majority of masters feature GPS receivers with dedicated antennas

Some of the masters feature atomic clocks (“Armageddon”)


(56)

The daemon polls a variety of masters

Chosen from nearby datacenters and from far-off datacenters

“Armageddon masters”

A daemon’s poll interval is 30 seconds

• Daemon reaches a consensus about correct timestamp

• Between synchronizations, each daemon advertises a slowly increasing time uncertainty (ε)

12.3 TrueTime implementation

(57)

TrueTime allows globally meaningful commit timestamps for distributed transactions

If A happens-before B, then

timestamp(A) < timestamp(B)

A happens-before B if A's effects become visible, in real time, before B begins

Visible: received by client or updates were applied to some replica

Begins: first request arrived at Spanner server

12.3 Transactions in Spanner
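One mechanism behind this guarantee, as described in the Spanner paper, is commit wait: take the commit timestamp from the upper end of the TrueTime interval and delay making the transaction's effects visible until that timestamp has definitely passed. A sketch mirroring the toy TrueTime model from the earlier slide; all names and the uncertainty value are assumptions.

```python
import time

class ToyTrueTime:
    """Same toy TrueTime idea as in the earlier sketch (uncertainty is assumed)."""
    def __init__(self, epsilon=0.007):
        self.epsilon = epsilon
    def now_latest(self):
        return time.time() + self.epsilon
    def after(self, t):
        # t has definitely passed if even the earliest possible "now" is beyond it
        return t < time.time() - self.epsilon

def commit_with_wait(tt, apply_mutations):
    """Commit wait: choose s = TT.now().latest as the commit timestamp and
    block until TT.after(s) holds before making the effects visible."""
    s = tt.now_latest()           # commit timestamp
    apply_mutations()             # replicate / apply the writes (details omitted)
    while not tt.after(s):        # wait until s lies in the past for every correct clock
        time.sleep(0.001)
    return s                      # only now may the commit become visible

tt = ToyTrueTime()
ts_a = commit_with_wait(tt, lambda: None)
ts_b = commit_with_wait(tt, lambda: None)
assert ts_a < ts_b                # A finished before B started => smaller timestamp
```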

(58)

• Data Management in the Cloud

– Map-Reduce Frameworks

– Cloud Storage

– Everything as a Service

Next Lecture
