Prof. Dr. Wolf-Tilo Balke
Institut für Informationssysteme
Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de
Distributed Data Management
12.0 Trade-Offs in Distributed Databases
12.1 The PAXOS Protocol
12.2 Google Megastore
12.3 Outlook: Google Spanner
12 Advanced Consistency
• For large-scale Web data management, features which are hard to get in an RDBMS may be quite interesting
– Linear scalability and elasticity
• We want to add new machines with little overhead; the performance of the DB should increase with each addition
– Full availability
• DB should never fail, and should always accept reads and writes
• DB should be disaster-proof
– Flexibility during development and support for prototyping
• DB should not enforce design-time-only decisions, but should allow for changes to the data model at runtime
12.0 Trade-Offs in Big Data Storage
• These new requirements are often conflicting with typical RDBMS behavior and features
– We need to find suitable trade-offs
• Example: Amazon Dynamo
– Focus on availability, scalability, and latency, but…
• No guaranteed consistency between replicas
– Asynchronous replication with eventual consistency
• Simple key-value store
– No complex data model, no query language
• No traditional transaction support
12.0 Trade-Offs in Big Data Storage
• Example: Google BigTable
– Still focus on availability and scalability, but with a slightly more powerful data model
• One big “table” with multiple attributes
– But…
• No guaranteed consistency between replicas
– Replication is asynchronous
– Separation of data mutations and control flow decreases conflicts
– Append-only write model decreases conflict potential
• No complex query languages
• No traditional transaction support
12.0 Trade-Offs in Big Data Storage
• Core problems for performance seem to be
replica consistency and transaction support
– What is the exact problem?
12.0 Trade-Offs in Big Data Storage
[Figure: timeline of a network partition. At partition start, state S diverges into S1 and S2 (partition mode); after the partition is detected and ends, partition recovery reconciles the replicas into state S’]
• There are three common approaches for ensuring replica consistency
– Synchronous Master-Slave Replication
• One master server holds the primary copy
• Each mutation is pushed to all slaves (replicas)
• Mutation is only acknowledged after each slave was mutated successfully
– e.g., 2-Phase Commit Protocol
• Advantage:
– Replicas always consistent, even during failure
– Very suitable for developing proper transaction protocols
• Disadvantage:
– Potentially unable to deal with partitions
– System can be unavailable during failure states, bad scalability?
– Failing master requires expensive repairs
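As an illustration of the synchronous scheme, a minimal sketch in Python (the Master and Slave classes and all method names are invented for this example, not taken from any concrete system): the master acknowledges a write only after every replica has applied it.

class Slave:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value
        return True                      # acknowledge the mutation

class Master:
    def __init__(self, slaves):
        self.data = {}
        self.slaves = slaves

    def write(self, key, value):
        # Push the mutation to every slave first ...
        acks = [slave.apply(key, value) for slave in self.slaves]
        if not all(acks):
            raise RuntimeError("a replica failed; abort (cf. 2PC)")
        # ... and acknowledge only once all replicas are consistent.
        self.data[key] = value
        return "ack"

If any slave is unreachable, write() can never acknowledge, which mirrors the availability problem during partitions noted above.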
12.0 Trade-Offs in Big Data Storage
– Asynchronous Master-Slave
• Master writes mutations to at least one slave
– Usually using write-ahead logs
– Acknowledge write when log write is acknowledged by slave
– For improved consistency, wait for more log acknowledgements
• Propagate mutation to other slaves as soon as possible
– e.g., Google BigTable
• Advantages:
– Can be used for efficient transactions (with minor problems)
• Disadvantages:
– Could result in data loss or inconsistencies
» Loss of master and/or first slave
» Reads during mutation propagation
– Repairs after losing master can be complex
» Needs a consensus protocol
12.0 Trade-Offs in Big Data Storage
– Optimistic Replication
• All nodes are homogeneous, no masters and no slaves
• Any node can accept a mutation,
which is then asynchronously pushed to all replicas
– e.g., Amazon Dynamo
• Acknowledgements after the first n writes (n may be 1)
• Advantages:
– Very available
– Very low latency
• Disadvantages:
– Consistency problems are quite common
– No transaction protocols can be built on top, because the global mutation order is unknown
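For contrast, a sketch of the optimistic scheme under the same invented naming: any node may accept a write, the client is acknowledged after the first n replicas have applied it, and the remaining replicas are updated asynchronously.

import threading

class Node:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Cluster:
    def __init__(self, nodes, n=1):
        self.nodes, self.n = nodes, n

    def write(self, key, value):
        # Apply synchronously on the first n replicas only ...
        for node in self.nodes[:self.n]:
            node.apply(key, value)
        # ... and push to the remaining replicas in the background;
        # until that finishes, reads there may return stale values.
        for node in self.nodes[self.n:]:
            threading.Thread(target=node.apply, args=(key, value)).start()
        return "ack"                     # acknowledged after only n writes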
12.0 Trade-Offs in Big Data Storage
• Core problem:
When using asynchronous replication and the system crashes, which state is correct?
– In Master-Slave settings, the master is always correct
• If the master fails, an expensive master failover recovery is necessary
– In homogeneous settings, …?!
• There is no perfect system for determining the global order of mutations
– Normal timestamps via internal clocks are too unreliable
– Vector clocks help, but are still not good enough
• Any chance to rely on consensus?
12.1 The Paxos Protocol
• The PAXOS Problem
– How to reach consensus/data
consistency in distributed systems
that can tolerate non-malicious failures?
• Original paper:
L. Lamport: “The part-time parliament”. ACM Transactions on Computer Systems, 1998. doi:10.1145/279227.279229
– Paxos made simple:
• http://courses.cs.vt.edu/cs5204/fall10-kafura-NVC/Papers/FaultTolerance/Paxos-Simple-Lamport.pdf
12.1 The Paxos Protocol
• Actually, Paxos is a family of protocols for solving consensus problems in a
network of unreliable processors
– Unreliable, but not malicious!
– For malicious processors, see Byzantine Agreements
• Consensus protocols are the basis for the state machine replication approach
12.1 The Paxos Protocol
• Why should we care about PAXOS?
– It can only find a consensus on a simple single value…
• This seems rather… useless?!
– But, transaction logs can be replicated to multiple nodes, and then PAXOS can find a consensus on log entries!
• This can solve the distributed log problem
– i.e. determining a serializable order of operations
• Allows developing proper log-based transaction and recovery algorithms
12.1 The Paxos Protocol
• Basic problem: find a consensus about the value of some data item
– Safety
• Only values actually proposed by some node may be chosen
• At any time, only a single value is chosen
• Only chosen values are replicated to (or learned by) nodes
– Liveness
• Some proposed value is eventually chosen
• If some value has been chosen, each node eventually learns about this choice
12.1 The Paxos Protocol
• Classes of agents
– Proposers
• Take a client request and start a voting phase, then act as coordinator of the voting phase
– Acceptors (or Voters)
• Acceptors vote on the value of an item
• Usually organized in quorums
– i.e., a majority group of acceptors which can reliably vote on an item
– Subset of acceptors
– Any two quorums share at least one member
– Learners
• Learners replicate some functionality in the PAXOS protocol for reliability
– e.g., communicate with client, repeating vote results, etc.
• Each node can act as more than one agent
12.1 Paxos Notation
12.1 Paxos Algorithm
[Figure: message flow between two proposers, three acceptors, and a learner]
• Phase 1 (prepare)
– A proposer (acting as the leader) creates a new proposal with a monotonically increasing number n
• Send “prepare” request to all acceptors
– Acceptor receives a prepare request with number n
• If n is greater than any previous prepare request, it accepts it and promises not to accept any lower numbered requests
– If the acceptor ever accepted a lower numbered request in the past, return proposal number and value to proposer
• If n is not greater than the highest previously seen prepare request, decline
12.1 Paxos Algorithm
• Phase 2 (accept)
– Proposer
• If proposer receives enough promises from acceptors, it sets a value for its proposal
– “Enough” means from a valid majority of acceptors (quorum)
– If any acceptor had accepted a different proposal before, the proposer already got the respective value in the prepare phase
» Use the value of the highest returned proposal
» If no proposals were returned by acceptors, choose any value
• Send accept message to all acceptors in quorum with the chosen value
– Acceptor
• Acceptor accepts the proposal if and only if it has not promised to only accept higher-numbered proposals
– i.e., accept the value if not already promised to vote on a newer proposal
– Registers the value, and sends accept messages to the proposer and all learners
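Both phases fit into a compact single-decree sketch (in-process Python without networking or failure handling; class and function names are invented for illustration):

class Acceptor:
    def __init__(self):
        self.promised_n = -1    # highest prepare number promised so far
        self.accepted = None    # (n, value) of the highest accepted proposal

    def prepare(self, n):
        # Phase 1: promise to ignore lower-numbered proposals and
        # report any previously accepted (n, value).
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted)
        return ("decline", None)

    def accept(self, n, value):
        # Phase 2: accept unless a higher-numbered prepare was promised.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, value)
            return "accepted"
        return "declined"

def propose(acceptors, n, value):
    # Phase 1: gather promises from a majority (quorum).
    answers = [a.prepare(n) for a in acceptors]
    granted = [acc for verdict, acc in answers if verdict == "promise"]
    if len(granted) <= len(acceptors) // 2:
        return None                 # no quorum; retry with a higher n
    # Use the value of the highest returned proposal, if any.
    returned = [acc for acc in granted if acc is not None]
    if returned:
        value = max(returned)[1]
    # Phase 2: ask the acceptors to accept the chosen value.
    acks = sum(a.accept(n, value) == "accepted" for a in acceptors)
    return value if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, n=1, value=8))   # -> 8, chosen by the quorum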
12.1 Paxos Algorithm
• A value is chosen at proposal
number n, if and only if a majority of acceptors accept that value in phase 2
– As reported by learners
• No choice if…
– …multiple proposers send conflicting prepare messages
– …there is no quorum of responses
– In that case: restart with a higher proposal number
• Eventually, there will be a consensus
12.1 Definition of “chosen”
• Three important properties must hold
– Any proposal number is unique
– Any two sets of acceptors (quorums) have at least one acceptor in common
– The value sent out in phase 2 is always the value of the
highest-numbered proposal of all the responses in phase 1
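The second property is what makes majority quorums work: any two majorities over the same acceptor set must share a member, so every later quorum contains at least one acceptor that knows about an earlier accepted value. A quick Python check:

from itertools import combinations

acceptors = {"X", "Y", "Z"}
# All majority quorums (size 2 or 3 out of 3 acceptors)
quorums = [set(q) for k in (2, 3) for q in combinations(acceptors, k)]
# Any two quorums have at least one acceptor in common
assert all(q1 & q2 for q1 in quorums for q2 in quorums)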
12.1 Paxos Properties
12.1 Paxos by Example
[Figure: Proposer A sends prepare requests [n=2, v=8] and Proposer B sends prepare requests [n=4, v=5] to the acceptors X, Y, and Z]
Example taken from http://angus.nyc/2012/paxos-by-example
12.1 Paxos by Example
[Figure: Acceptors X and Y send prepare responses [no previous] to Proposer A; Acceptor Z sends a prepare response [no previous] to Proposer B]
• Acceptor Z ignores A’s request since it has already received a higher-numbered request from B
• Proposer A sends accept requests with n=2 and value v=8 to acceptors X and Y
– They are ignored! (in the meantime, X and Y have already promised a proposal with n=4)
12.1 Paxos by Example
[Figure: Acceptors X and Y answer Proposer B’s prepare request [n=4, v=5] with prepare responses reporting the earlier proposal [n=2, v=8]]
• Proposer B sends an accept request with the highest previously received value v=8 to its quorum
12.1 Paxos by Example
[Figure: Proposer B sends accept requests [n=4, v=8] to acceptors X, Y, and Z]
• If an acceptor receives an accept request with a number higher than or equal to its highest seen proposal, it accepts it and sends its value to each learner
• A value is chosen when a learner gets messages from a majority of the acceptors
12.1 Paxos by Example
[Figure: Acceptors X, Y, and Z send the accepted value [v=8] to the learner]
• There are some options to learn a chosen value
– Whenever an acceptor accepts a proposal, it informs all the learners
– Acceptors may inform a distinguished learner (usually the proposer), and the distinguished learner broadcasts the result
12.1 Learning a Chosen Value
• Chubby lock service
– M. Burrows: “The Chubby lock service for loosely-coupled distributed systems”. In Procs. of the 7th Symp. on Operating Systems Design and Implementation (OSDI), 2006.
http://portal.acm.org/citation.cfm?id=1298487
• Petal: Distributed virtual disks
– E. Lee & C. Thekkath: “Petal: Distributed virtual disks”. In Procs. of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1996.
doi:10.1145/237090.237157
• Frangipani: A scalable distributed file system
– C. Thekkath, T. Mann & E. Lee: “Frangipani: a scalable distributed file system”. In SIGOPS Operation Systems Review, vol. 31, 1997.
doi:10.1145/269005.266694
12.1 Early applications
• Often, PAXOS is considered to have very limited scalability
– Some authors argued that it is not adaptable to Cloud applications
– But…
• Google Megastore!
– Actually used PAXOS for building a consistent, scalable data store with ACID transactions!
• Baker, Jason, et al. “Megastore: Providing Scalable, Highly Available Storage for Interactive Services.” CIDR, vol. 11, 2011.
12.1 Using PAXOS in Practice
• Megastore was originally designed to power Google’s AppEngine
– AppEngine developers demanded some more traditional DB features
• Support for (simple) schemas
• Support for secondary indexes
• Support for (simple) joins
• Support for (simple) transactions
– Still, they wanted Google’s infrastructure
• Built on top of BigTable and GFS
• Still, they needed high scalability
12.2 Megastore
• Megastore stores tables in Bigtable
– Megastore is a geo-scale structured database
– Bigtable is a cluster-level structured storage
12.2 Megastore
• Each schema has a set of tables
– Each table has a set of entities
– Each entity has a set of strongly typed properties
• Can be required, optional, or repeated (multi-value)
• So, no 1st normal form…
– Each table needs a (usually composite) primary key
• Each table is either an entity group root table or a child table
– There is a single foreign key between children and their root
– An entity group is a root entity with all its child entities
12.2 Megastore - Schema
• Example schema:
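The schema figure is not reproduced here; as a stand-in, the sample photo-sharing schema from the cited Megastore paper (Baker et al., 2011) illustrates root tables, child tables, and the property modifiers:

CREATE TABLE User {
  required int64 user_id;
  required string name;
} PRIMARY KEY(user_id), ENTITY GROUP ROOT;

CREATE TABLE Photo {
  required int64 user_id;
  required int32 photo_id;
  required string time;
  required string full_url;
  optional string thumbnail_url;
  repeated string tag;
} PRIMARY KEY(user_id, photo_id),
  IN TABLE User,
  ENTITY GROUP KEY(user_id) REFERENCES User;

Each Photo carries the user_id of its root User entity as the leading primary key component, making User the entity group root and Photo its child table.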
12.2 Megastore - Schema
• Map Megastore schemas to Bigtable
– Each entity is mapped to a single Bigtable row
• Use primary keys to cluster entities which will likely be read and accessed together
• Entity groups will be stored consecutively
– “IN TABLE” command forces an entity into a specific Bigtable
• Bigtable column name = Megastore table name + property name
• In this example: photos are stored close to users
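A small Python sketch of this key layout (the row-key encoding is invented for illustration): because the entities of a group share a row-key prefix and Bigtable stores rows in lexicographic key order, a user and all of that user's photos end up adjacent.

bigtable = {
    "101":     {"User.name": "Alice"},
    "101,500": {"Photo.time": "2011-01-01", "Photo.tag": "beach"},
    "101,501": {"Photo.time": "2011-02-14", "Photo.tag": "city"},
    "102":     {"User.name": "Bob"},
}
# A range scan in sorted row-key order visits user 101 and all of
# 101's photos before it reaches user 102.
for row_key in sorted(bigtable):
    print(row_key, list(bigtable[row_key]))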
12.2 Megastore - Schema
• You can have two levels of indexes
– Local indexes within each entity group
• Stored in the group, and updated consistently and atomically
– Global index
• Spans all entity groups of a table
• Not necessarily consistent
• Consistent updates of the global index require additional latching (index locking)
– Expensive
– Many Web applications don’t need this anyway…
12.2 Megastore - Schema
• Considerations
– We want geo-replication across multiple datacenters
• But also replication within a datacenter
– Data can probably be sharded by user location
• Most read and writes for a data item will go to one specific datacenter
– “Master datacenter”
• Within one data center, there should be no master
– Reads and writes can be issued to any replica node
– Better performance with respect to availability
12.2 Megastore - Transactions
• Each entity group is independently and synchronously replicated
– Uses a low latency version of Paxos (not original Paxos!)
• Writes require only two inter-replica roundtrips, reads require one inter-replica roundtrip
12.2 Megastore - Transactions
• Each region has an app server; the local datacenter is the local replica
– Transaction management happens within the local replica
– Remote changes are pushed to a special replication server
12.2 Megastore - Transactions
• Simple assumption: Each entity group is a “Mini-Database”
– Have serializable ACID semantics within each entity group
• Serializable ACID >> Strict Consistency >> Eventual Consistency
– Megastore uses the MVCC protocol
• “MultiVersion Concurrency Control” with transaction timestamps
– Timestamping can be used because transactions are handled within a single “master” datacenter
12.2 Megastore - Transactions
• What are MVCC transactions?
– Data is not replaced, but instead new versions are written
• Use timestamps to label versions
• Increases concurrency (old versions can be read while work on new versions is ongoing)
• Used by e.g. BigTable
– …but also optionally found in many normal RDBMS
– Transactions with multiversions
• You can use multiversion 2-phase locking
• You can use multiversion timestamp ordering!
– Reading: Select appropriate version of data based on timestamp
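A minimal sketch of such multiversion reads (data structures invented for this example): every write appends a timestamped version, and a read at timestamp t returns the newest version committed at or before t, so old snapshots stay readable while newer transactions write.

class MVCCItem:
    def __init__(self):
        self.versions = []              # list of (commit_ts, value)

    def write(self, commit_ts, value):
        self.versions.append((commit_ts, value))
        self.versions.sort()            # keep versions ordered by timestamp

    def read(self, ts):
        # Select the newest version with commit_ts <= ts
        visible = [v for c, v in self.versions if c <= ts]
        return visible[-1] if visible else None

item = MVCCItem()
item.write(10, "v1")
item.write(20, "v2")
assert item.read(15) == "v1"            # a snapshot at t=15 still sees v1
assert item.read(25) == "v2"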
12.2 Megastore - Transactions
• Read consistency: Three read levels
– Current: Wait for uncommitted writes, then read last committed values
• We know from the write-ahead logs what is yet uncommitted
• Might delay a read request, higher latency
– Snapshot: Does not wait, just read the last committed value
• That’s the value of the last transaction known to be successfully committed
– Inconsistent: Just read from any node the last available value
• Might be stale, might be dirty, might be otherwise problematic
12.2 Megastore - Transactions
• Write consistency:
– Current read: Get timestamp and log position of last committed transaction
• Determine the next available log position
– Gather writes into a log entry
– Commit: Use Paxos to achieve consensus for appending the log entry to log
• Use Paxos to settle on the next timestamp and a proper log position
– Finding the next timestamp can be tricky in a concurrent distributed environment…
• Optimistic concurrency: If multiple concurrent writes try to modify the same log position, only one will win!
– The other will realize and retry
– All logs will be in sync
– Write data mutations to Bigtable entities and indexes
• Clean up after successful operation
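The optimistic commit step can be sketched as a race for the next log position (names invented; the compare-and-set stands in for the Paxos round that settles one log position): only one concurrent writer wins a position, the loser notices and retries.

class ReplicatedLog:
    def __init__(self):
        self.entries = {}               # log position -> committed entry

    def compare_and_set(self, position, entry):
        # Stand-in for the Paxos consensus on this log position
        if position in self.entries:
            return False                # another writer won this position
        self.entries[position] = entry
        return True

def commit(log, mutations):
    while True:
        position = len(log.entries)     # next available log position
        if log.compare_and_set(position, mutations):
            return position             # our log entry is committed
        # else: lost the race; re-validate the reads, then retry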
12.2 Megastore - Transactions
• Use limited consistency guarantees
• Messages (mutations) within an entity group are synchronous
– Use two-phase locking protocol
– Potentially slow with high latency
• Messages between entity groups are asynchronous
– Use message queues
– Fast but might miss some updates
12.2 Megastore - Transactions
• Operations across entity groups
12.2 Megastore - Transactions
• Timeline for read on local replica A
12.2 Megastore - Transactions
• Timeline for writes
12.2 Megastore - Transactions
• Key Features of Megastore are…
– Schematized semi-relational tables
– Replicated ACID transactions
• Only within entity groups, sloppy consistency when spanning multiple groups
• Heavy reliance on Paxos to ensure proper ordering of operations in logs
– Synchronous replication support across datacenters
– Still, we have…
• Limited scalability
• Lack of query language
• Manual partitioning of the data
12.3 Summary Megastore
• Google’s Spanner is a highly available global-scale distributed database
– Addresses some of Megastore’s shortcomings
– Complex system architecture
• Software Stack
• Directories
• Data Model
• TrueTime
– J. C. Corbett, J. Dean, M. Epstein, et al. “Spanner: Google’s
Globally-Distributed Database”. In Procs. of the 10th Symp. on Operating System Design and Implementation (OSDI), 2012.
• http://doi.org/10.1145/2491245
12.3 Google Spanner
• Key features are…
– Schematized, semi-relational (tabular) data model
– SQL-like query interface
– Enables transaction serialization via global timestamps
• Uses the novel TrueTime API to accomplish concurrency control
• Acknowledges clock uncertainty and guarantees a strict bound on it
• Uses GPS devices and atomic clocks to get accurate time
12.3 Key Features
• Servers work on two levels
– Universe: Spanner deployment
– Zones: analogous to deployments of BigTable servers (units of physical isolation)
12.3 Server Configuration
• Each tablet implements the mapping (key:string, timestamp:int64) → string
• Back End: Colossus (successor of GFS)
• To support replication:
– Each spanserver implements a Paxos state machine on top of each tablet; the state machine stores the metadata and logs of its tablet
• Set of replicas is collectively a Paxos group
12.3 Spanserver Software Stack
• A directory is a set of contiguous keys sharing a common prefix
– Smallest unit of data placement
– Smallest unit to define replication properties
• Each directory can be sharded into fragments if it grows too large
12.3 Directories
12.3 Data Model
[Figure: implications of INTERLEAVE; the interleaved tables form a parent-child hierarchy]
• Novel API distributing a globally synchronized proper time
– Leverages hardware features like GPS and atomic clocks
– Implemented via the TrueTime API
12.3 TrueTime
Method Returns
TT.now() TTinterval: [earliest, latest]
TT.after(t) true if t has definitely passed
TT.before(t) true if t has definitely not arrived
TTinterval is guaranteed to contain the absolute time during which TT.now() was invoked
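The semantics can be sketched in a few lines of Python (the ε value below is invented for illustration; real TrueTime derives the bound from the clock infrastructure described next): now() returns an uncertainty interval guaranteed to contain the true absolute time.

import time

class TrueTime:
    def __init__(self, epsilon=0.007):      # e.g., a ~7 ms uncertainty bound
        self.epsilon = epsilon

    def now(self):
        t = time.time()
        return (t - self.epsilon, t + self.epsilon)   # TTinterval

    def after(self, t):
        return self.now()[0] > t            # t has definitely passed

    def before(self, t):
        return self.now()[1] < t            # t has definitely not arrived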
12.3 TrueTime Implementation
• A timeslave daemon runs on each machine
• A set of time master machines exists per datacenter
– The majority of masters feature GPS receivers with dedicated antennas
– Some of the masters feature atomic clocks (“Armageddon masters”)
• The daemon polls a variety of masters
– Chosen from nearby datacenters and from far-off datacenters
– “Armageddon masters”
– A daemon’s poll interval is 30 seconds
• Daemon reaches a consensus about correct timestamp
• Between synchronizations, each daemon advertises a slowly increasing time uncertainty (ε)
12.3 TrueTime Implementation
• TrueTime allows globally meaningful commit timestamps for distributed transactions
– If A happens-before B, then
• timestamp(A) < timestamp(B)
– A happens-before B, if its effects become visible before B begins, in real time
• Visible: received by client or updates were applied to some replica
• Begins: first request arrived at Spanner server
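Spanner enforces this ordering with “commit wait”, sketched here on top of the TrueTime sketch from above: a transaction takes its commit timestamp from the latest bound of TT.now() and delays visibility until that timestamp has definitely passed, so any transaction starting afterwards must receive a larger timestamp.

def commit_wait(tt, apply_mutations):
    s = tt.now()[1]                # commit timestamp: the latest bound
    while not tt.after(s):         # wait out the clock uncertainty
        time.sleep(0.001)          # poll; a real system would do better
    apply_mutations()              # effects become visible only now
    return s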
12.3 Transactions in Spanner
• Data Management in the Cloud
– Map-Reduce Frameworks
– Cloud Storage
– Everything as a Service