Prof. Dr. Wolf-Tilo Balke
Institut für Informationssysteme
Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de
Distributed Data Management
12.0 Trade-Offs in Distributed Databases
12.1 The PAXOS Protocol
12.2 Google Megastore
12.3 Outlook: Google Spanner
12 Advanced Consistency
• For large-scale Web data management, features which are hard to get in an RDBMS may be quite interesting
– Linear scalability and elasticity
• We want to add new machines with little overhead; the performance of the DB should increase with each addition
– Full availability
• DB should never fail, and should always accept reads and writes
• DB should be disaster-proof
– Flexibility during development and support for prototyping
• DB should not enforce design-time-only decisions, but should allow for changes to the data model at runtime
12.0 Trade-Offs in Big Data Storage
• These new requirements are often conflicting with typical RDBMS behavior and features
– We need to find suitable trade-offs
• Example: Amazon Dynamo
– Focus on availability, scalability, and latency, but…
• No guaranteed consistency between replicas
– Asynchronous replication with eventual consistency
• Simple key-value store
– No complex data model, no query language
• No traditional transaction support
12.0 Trade-Offs in Big Data Storage
• Example: Google BigTable
– Still focus on availability and scalability, but with a slightly more powerful data model
• One big “table” with multiple attributes
– But…
• No guaranteed consistency between replicas
– Replication is asynchronous
– Separation of data mutations and control flow decreases conflicts
– Append-only write model decreases conflict potential
• No complex query languages
• No traditional transaction support
12.0 Trade-Offs in Big Data Storage
• Core problems for performance seem to be
replica consistency and transaction support
– What is the exact problem?
12.0 Trade-Offs in Big Data Storage
[Figure: timeline of a network partition. At partition start, state S diverges into S1 and S2 (partition mode); after the partition is detected and ends, partition recovery reconciles the replicas into state S’]
• There are three common approaches for ensuring replica consistency
– Synchronous Master-Slave Replication
• One master server holds the primary copy
• Each mutation is pushed to all slaves (replicas)
• Mutation is only acknowledged after each slave was mutated successfully
– e.g., 2-Phase Commit Protocol
• Advantage:
– Replicas always consistent, even during failure
– Very suitable for developing proper transaction protocols
• Disadvantage:
– Potentially unable to deal with partitions
– System can be unavailable during failure states, bad scalability?
– Failing master requires expensive repairs
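As an illustration of the synchronous scheme, a minimal sketch in Python (the Master and Slave classes and all method names are invented for this example, not taken from any concrete system): the master acknowledges a write only after every replica has applied it.

class Slave:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value
        return True                      # acknowledge the mutation

class Master:
    def __init__(self, slaves):
        self.data = {}
        self.slaves = slaves

    def write(self, key, value):
        # Push the mutation to every slave first ...
        acks = [slave.apply(key, value) for slave in self.slaves]
        if not all(acks):
            raise RuntimeError("a replica failed; abort (cf. 2PC)")
        # ... and acknowledge only once all replicas are consistent.
        self.data[key] = value
        return "ack"

If any slave is unreachable, write() can never acknowledge, which mirrors the availability problem during partitions noted above.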
12.0 Trade-Offs in Big Data Storage
– Asynchronous Master-Slave
• Master writes mutations to at least one slave
– Usually using write-ahead logs
– Acknowledge write when log write is acknowledged by slave
– For improved consistency, wait for more log acknowledgements
• Propagate mutation to other slaves as soon as possible
– e.g., Google BigTable
• Advantages:
– Can be used for efficient transactions (with minor problems)
• Disadvantages:
– Could result in data loss or inconsistencies
» Loss of master and/or first slave
» Reads during mutation propagation
– Repairs after losing master can be complex
» Needs a consensus protocol
12.0 Trade-Offs in Big Data Storage
– Optimistic Replication
• All nodes are homogeneous, no masters and no slaves
• Any node can accept a mutation,
which is then asynchronously pushed to all replicas
– e.g., Amazon Dynamo
• Acknowledgements after the first n writes (n may be 1)
• Advantages:
– Very available
– Very low latency
• Disadvantages:
– Consistency problems are quite common
– No transaction protocols can be built on top, because the global mutation order is unknown
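For contrast, a sketch of the optimistic scheme under the same invented naming: any node may accept a write, the client is acknowledged after the first n replicas have applied it, and the remaining replicas are updated asynchronously.

import threading

class Node:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Cluster:
    def __init__(self, nodes, n=1):
        self.nodes, self.n = nodes, n

    def write(self, key, value):
        # Apply synchronously on the first n replicas only ...
        for node in self.nodes[:self.n]:
            node.apply(key, value)
        # ... and push to the remaining replicas in the background;
        # until that finishes, reads there may return stale values.
        for node in self.nodes[self.n:]:
            threading.Thread(target=node.apply, args=(key, value)).start()
        return "ack"                     # acknowledged after only n writes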
12.0 Trade-Offs in Big Data Storage
• Core problem:
When using asynchronous replication and the system crashes, which state is correct?
– In Master-Slave settings, the master is always correct
• If the master fails, an expensive master failover recovery is necessary
– In homogeneous settings, …?!
• There is no perfect system for determining the global order of mutations
– Normal timestamps via internal clocks are too unreliable
– Vector clocks help, but are still not good enough
• Any chance to rely on consensus?
12.1 The Paxos Protocol
• The PAXOS Problem
– How to reach consensus/data
consistency in distributed systems
that can tolerate non-malicious failures?
• Original paper:
L. Lamport: “The part-time parliament”. ACM Transactions on Computer Systems, 1998. doi:10.1145/279227.279229
– Paxos made simple:
• http://courses.cs.vt.edu/cs5204/fall10-kafura-NVC/Papers/FaultTolerance/Paxos-Simple-Lamport.pdf
12.1 The Paxos Protocol
• Actually, Paxos is a family of protocols for solving consensus problems in a
network of unreliable processors
– Unreliable, but not malicious!
– For malicious processors, see Byzantine Agreements
• Consensus protocols are the basis for the state machine replication approach
12.1 The Paxos Protocol
• Why should we care about PAXOS?
– It can only find a consensus on a simple single value…
• This seems rather… useless?!
– But, transaction logs can be replicated to multiple nodes, and then PAXOS can find a consensus on log entries!
• This can solve the distributed log problem
– i.e. determining a serializable order of operations
• Allows developing proper log-based transaction and recovery algorithms
12.1 The Paxos Protocol
• Basic problem: find a consensus about the value of some data item
– Safety
• Only values actually proposed by some node may be chosen
• At any time, only a single value is chosen
• Only chosen values are replicated to (or learned by) nodes
– Liveness
• Some proposed value is eventually chosen
• If some value has been chosen, each node eventually learns about this choice
12.1 The Paxos Protocol
• Classes of agents
– Proposers
• Take a client request and start a voting phase, then act as coordinator of the voting phase
– Acceptors (or Voters)
• Acceptors vote on the value of an item
• Usually organized in quorums
– i.e., a majority group of acceptors which can reliably vote on an item
– Subset of acceptors
– Any two quorums share at least one member
– Learners
• Learners replicate some functionality in the PAXOS protocol for reliability
– e.g., communicate with client, repeating vote results, etc.
• Each node can act as more than one agent
12.1 Paxos Notation
12.1 Paxos Algorithm
[Figure: message flow between two proposers, three acceptors, and a learner]
• Phase 1 (prepare)
– A proposer (acting as the leader) creates a new proposal with a monotonically increasing number n
• Send “prepare” request to all acceptors
– Acceptor receives a prepare request with number n
• If n is greater than any previous prepare request, it accepts it and promises not to accept any lower numbered requests
– If the acceptor ever accepted a lower numbered request in the past, return proposal number and value to proposer
• If n is not greater than the highest previously seen prepare request, decline
12.1 Paxos Algorithm
• Phase 2 (accept)
– Proposer
• If proposer receives enough promises from acceptors, it sets a value for its proposal
– “Enough” means from a valid majority of acceptors (quorum)
– If any acceptor had accepted a different proposal before, the proposer already got the respective value in the prepare phase
» Use the value of the highest returned proposal
» If no proposals were returned by acceptors, choose any value
• Send accept message to all acceptors in quorum with the chosen value
– Acceptor
• Acceptor accepts the proposal if and only if it has not promised to only accept higher-numbered proposals
– i.e., accept the value if not already promised to vote on a newer proposal
– Registers the value, and sends accept messages to the proposer and all learners
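Both phases fit into a compact single-decree sketch (in-process Python without networking or failure handling; class and function names are invented for illustration):

class Acceptor:
    def __init__(self):
        self.promised_n = -1    # highest prepare number promised so far
        self.accepted = None    # (n, value) of the highest accepted proposal

    def prepare(self, n):
        # Phase 1: promise to ignore lower-numbered proposals and
        # report any previously accepted (n, value).
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted)
        return ("decline", None)

    def accept(self, n, value):
        # Phase 2: accept unless a higher-numbered prepare was promised.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, value)
            return "accepted"
        return "declined"

def propose(acceptors, n, value):
    # Phase 1: gather promises from a majority (quorum).
    answers = [a.prepare(n) for a in acceptors]
    granted = [acc for verdict, acc in answers if verdict == "promise"]
    if len(granted) <= len(acceptors) // 2:
        return None                 # no quorum; retry with a higher n
    # Use the value of the highest returned proposal, if any.
    returned = [acc for acc in granted if acc is not None]
    if returned:
        value = max(returned)[1]
    # Phase 2: ask the acceptors to accept the chosen value.
    acks = sum(a.accept(n, value) == "accepted" for a in acceptors)
    return value if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, n=1, value=8))   # -> 8, chosen by the quorum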
12.1 Paxos Algorithm
• A value is chosen at proposal
number n, if and only if a majority of acceptors accept that value in phase 2
– As reported by learners
• No choice if…
– …multiple proposers send conflicting prepare messages
– …there is no quorum of responses
– In that case: restart with a higher proposal number
• Eventually, there will be a consensus
12.1 Definition of “chosen”
• Three important properties must hold
– Any proposal number is unique
– Any two sets of acceptors (quorums) have at least one acceptor in common
– The value sent out in phase 2 is always the value of the
highest-numbered proposal of all the responses in phase 1
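The second property is what makes majority quorums work: any two majorities over the same acceptor set must share a member, so every later quorum contains at least one acceptor that knows about an earlier accepted value. A quick Python check:

from itertools import combinations

acceptors = {"X", "Y", "Z"}
# All majority quorums (size 2 or 3 out of 3 acceptors)
quorums = [set(q) for k in (2, 3) for q in combinations(acceptors, k)]
# Any two quorums have at least one acceptor in common
assert all(q1 & q2 for q1 in quorums for q2 in quorums)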
12.1 Paxos Properties
12.1 Paxos by Example
[Figure: Proposer A sends prepare requests [n=2, v=8] and Proposer B sends prepare requests [n=4, v=5] to the acceptors X, Y, and Z]
Example taken from http://angus.nyc/2012/paxos-by-example
12.1 Paxos by Example
[Figure: Acceptors X and Y send prepare responses [no previous] to Proposer A; Acceptor Z sends a prepare response [no previous] to Proposer B]
• Acceptor Z ignores A’s request since it has already received a higher-numbered request from B
• Proposer A sends accept requests with n=2 and value v=8 to acceptors X and Y
– They are ignored! (in the meantime, X and Y have already promised a proposal with n=4)
12.1 Paxos by Example
[Figure: Acceptors X and Y answer Proposer B’s prepare request [n=4, v=5] with prepare responses reporting the earlier proposal [n=2, v=8]]
• Proposer B sends an accept request with the highest previously received value v=8 to its quorum
12.1 Paxos by Example
[Figure: Proposer B sends accept requests [n=4, v=8] to acceptors X, Y, and Z]
• If an acceptor receives an accept request with a number higher than or equal to its highest seen proposal, it accepts it and sends its value to each learner
• A value is chosen when a learner gets messages from a majority of the acceptors
12.1 Paxos by Example
[Figure: Acceptors X, Y, and Z send the accepted value [v=8] to the learner]
• There are some options to learn a chosen value
– Whenever an acceptor accepts a proposal, it informs all the learners
– Acceptors may inform a distinguished learner (usually the proposer), and the distinguished learner broadcasts the result
12.1 Learning a Chosen Value
• Chubby lock service
– M. Burrows: “The Chubby lock service for loosely-coupled distributed systems”. In Procs. of the 7th Symp. on Operating Systems Design and Implementation (OSDI), 2006.
http://portal.acm.org/citation.cfm?id=1298487
• Petal: Distributed virtual disks
– E. Lee & C. Thekkath: “Petal: Distributed virtual disks”. In Procs. of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1996.
doi:10.1145/237090.237157
• Frangipani: A scalable distributed file system
– C. Thekkath, T. Mann & E. Lee: “Frangipani: a scalable distributed file system”. In SIGOPS Operation Systems Review, vol. 31, 1997.
doi:10.1145/269005.266694
12.1 Early applications
• Often, PAXOS is considered to have very limited scalability
– Some authors argued that it is not adaptable to Cloud applications
– But…
• Google Megastore!
– Actually used PAXOS for building a consistent, scalable data store with ACID transactions!
• Baker, Jason, et al. “Megastore: Providing Scalable, Highly Available Storage for Interactive Services.” CIDR, vol. 11, 2011.
12.1 Using PAXOS in Practice
• Megastore was originally designed to power Google’s AppEngine
– AppEngine developers demanded some more traditional DB features
• Support for (simple) schemas
• Support for secondary indexes
• Support for (simple) joins
• Support for (simple) transactions
– Still, they wanted Google’s infrastructure
• Built on top of BigTable and GFS
• Still, they needed high scalability
12.2 Megastore
• Megastore stores tables in Bigtable
– Megastore is a geo-scale structured database
– Bigtable is a cluster-level structured storage
12.2 Megastore
• Each schema has a set of tables
– Each table has a set of entities
– Each entity has a set of strongly typed properties
• Can be required, optional, or repeated (multi-value)
• So, no 1st normal form…
– Each table needs a (usually composite) primary key
• Each table is either an entity group root table or a child table
– There is a single foreign key between children and their root
– An entity group is a root entity with all its child entities
12.2 Megastore - Schema
• Example schema:
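The schema figure is not reproduced here; as a stand-in, the sample photo-sharing schema from the cited Megastore paper (Baker et al., 2011) illustrates root tables, child tables, and the property modifiers:

CREATE TABLE User {
  required int64 user_id;
  required string name;
} PRIMARY KEY(user_id), ENTITY GROUP ROOT;

CREATE TABLE Photo {
  required int64 user_id;
  required int32 photo_id;
  required string time;
  required string full_url;
  optional string thumbnail_url;
  repeated string tag;
} PRIMARY KEY(user_id, photo_id),
  IN TABLE User,
  ENTITY GROUP KEY(user_id) REFERENCES User;

Each Photo carries the user_id of its root User entity as the leading primary key component, making User the entity group root and Photo its child table.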
12.2 Megastore - Schema
• Map Megastore schemas to Bigtable
– Each entity is mapped to a single Bigtable row
• Use primary keys to cluster entities which will likely be read and accessed together
• Entity groups will be stored consecutively
– “IN TABLE” command forces an entity into a specific Bigtable
• Bigtable column name = Megastore table name + property name
• In this example: photos are stored close to users
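A small Python sketch of this key layout (the row-key encoding is invented for illustration): because the entities of a group share a row-key prefix and Bigtable stores rows in lexicographic key order, a user and all of that user's photos end up adjacent.

bigtable = {
    "101":     {"User.name": "Alice"},
    "101,500": {"Photo.time": "2011-01-01", "Photo.tag": "beach"},
    "101,501": {"Photo.time": "2011-02-14", "Photo.tag": "city"},
    "102":     {"User.name": "Bob"},
}
# A range scan in sorted row-key order visits user 101 and all of
# 101's photos before it reaches user 102.
for row_key in sorted(bigtable):
    print(row_key, list(bigtable[row_key]))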
12.2 Megastore - Schema
• You can have two levels of indexes
– Local indexes within each entity group
• Stored in the group, and updated consistently and atomically
– Global index
• Spans all entity groups of a table
• Not necessarily consistent
• Consistent updates of the global index require additional latching (index locking)
– Expensive
– Many Web applications don’t need this anyway…
12.2 Megastore - Schema
• Considerations
– We want geo-replication across multiple datacenters
• But also replication within a datacenter
– Data can probably be sharded by user location
• Most read and writes for a data item will go to one specific datacenter
– “Master datacenter”
• Within one data center, there should be no master
– Reads and writes can be issued to any replica node
– Better performance with respect to availability
12.2 Megastore - Transactions
• Each entity group is independently and synchronously replicated
– Uses a low latency version of Paxos (not original Paxos!)
• Writes require only two inter-replica roundtrips, reads require one inter-replica roundtrip
12.2 Megastore - Transactions
• Each region has an app server; the local datacenter is the local replica
– Transaction management happens within the local replica
– Remote changes are pushed to a special replication server
12.2 Megastore - Transactions
• Simple assumption: Each entity group is a “Mini-Database”
– Have serializable ACID semantics within each entity group
• Serializable ACID >> Strict Consistency >> Eventual Consistency
– Megastore uses the MVCC protocol
• “MultiVersion Concurrency Control” with transaction timestamps
– Timestamping can be used because transactions are handled within a single “master” datacenter
12.2 Megastore - Transactions
• What are MVCC transactions?
– Data is not replaced, but instead new versions are written
• Use timestamps to label versions
• Increases concurrency (old versions can be read while work on new versions is ongoing)
• Used by e.g. BigTable
– …but also optionally found in many normal RDBMS
– Transactions with multiversions
• You can use multiversion 2-phase locking
• You can use multiversion timestamp ordering!
– Reading: Select appropriate version of data based on timestamp
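A minimal sketch of such multiversion reads (data structures invented for this example): every write appends a timestamped version, and a read at timestamp t returns the newest version committed at or before t, so old snapshots stay readable while newer transactions write.

class MVCCItem:
    def __init__(self):
        self.versions = []              # list of (commit_ts, value)

    def write(self, commit_ts, value):
        self.versions.append((commit_ts, value))
        self.versions.sort()            # keep versions ordered by timestamp

    def read(self, ts):
        # Select the newest version with commit_ts <= ts
        visible = [v for c, v in self.versions if c <= ts]
        return visible[-1] if visible else None

item = MVCCItem()
item.write(10, "v1")
item.write(20, "v2")
assert item.read(15) == "v1"            # a snapshot at t=15 still sees v1
assert item.read(25) == "v2"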
12.2 Megastore - Transactions
• Read consistency: Three read levels
– Current: Wait for uncommitted writes, then read last committed values
• We know from the write-ahead logs what is yet uncommitted
• Might delay a read request, higher latency
– Snapshot: Does not wait, just read the last committed value
• That’s the value of the last transaction known to be successfully committed
– Inconsistent: Just read from any node the last available value
• Might be stale, might be dirty, might be otherwise problematic
12.2 Megastore - Transactions
• Write consistency:
– Current read: Get timestamp and log position of last committed transaction
• Determine the next available log position
– Gather writes into a log entry
– Commit: Use Paxos to achieve consensus for appending the log entry to log
• Use Paxos to settle on the next timestamp and a proper log position
– Finding the next timestamp can be tricky in a concurrent distributed environment…
• Optimistic concurrency: If multiple concurrent writes try to modify the same log position, only one will win!
– The other will realize and retry
– All logs will be in sync
– Write data mutations to Bigtable entities and indexes
• Clean up after successful operation
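The optimistic commit step can be sketched as a race for the next log position (names invented; the compare-and-set stands in for the Paxos round that settles one log position): only one concurrent writer wins a position, the loser notices and retries.

class ReplicatedLog:
    def __init__(self):
        self.entries = {}               # log position -> committed entry

    def compare_and_set(self, position, entry):
        # Stand-in for the Paxos consensus on this log position
        if position in self.entries:
            return False                # another writer won this position
        self.entries[position] = entry
        return True

def commit(log, mutations):
    while True:
        position = len(log.entries)     # next available log position
        if log.compare_and_set(position, mutations):
            return position             # our log entry is committed
        # else: lost the race; re-validate the reads, then retry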
12.2 Megastore - Transactions
• Use limited consistency guarantees
• Messages (mutations) within an entity group are synchronous
– Use two-phase locking protocol
– Potentially slow with high latency
• Messages between entity groups are asynchronous
– Use message queues
– Fast but might miss some updates
12.2 Megastore - Transactions
• Operations across entity groups
12.2 Megastore - Transactions
• Timeline for read on local replica A
12.2 Megastore - Transactions
• Timeline for writes
12.2 Megastore - Transactions
• Key Features of Megastore are…
– Schematized semi-relational tables
– Replicated ACID transactions
• Only within entity groups, sloppy consistency when spanning multiple groups
• Heavy reliance on Paxos to ensure proper ordering of operations in logs
– Synchronous replication support across datacenters
– Still, we have…
• Limited scalability
• Lack of query language
• Manual partitioning of the data
12.3 Summary Megastore
• Google’s Spanner is a highly available global-scale distributed database
– Addresses some of Megastore’s shortcomings
– Complex system architecture
• Software Stack
• Directories
• Data Model
• TrueTime
– J. C. Corbett, J. Dean, M. Epstein, et al. “Spanner: Google’s
Globally-Distributed Database”. In Procs. of the 10th Symp. on Operating System Design and Implementation (OSDI), 2012.
• http://doi.org/10.1145/2491245
12.3 Google Spanner
• Key features are…
– Schematized, semi-relational (tabular) data model
– SQL-like query interface
– Enables transaction serialization via global timestamps
• Uses the novel TrueTime API to accomplish concurrency control
• Acknowledges clock uncertainty and guarantees a strict bound on it
• Uses GPS devices and atomic clocks to get accurate time
12.3 Key Features
• Servers work on two levels
– Universe: Spanner deployment
– Zones: analogous to deployments of BigTable servers (units of physical isolation)
12.3 Server Configuration
• Each tablet implements the mapping (key:string, timestamp:int64) → string
• Back End: Colossus (successor of GFS)
• To support replication:
– Each spanserver implements a Paxos state machine on top of each tablet; the state machine stores the metadata and logs of its tablet
• Set of replicas is collectively a Paxos group
12.3 Spanserver Software Stack
• A directory is a set of contiguous keys sharing a common prefix
– Smallest unit of data placement
– Smallest unit to define replication properties
• Each directory can be sharded into fragments if it grows too large
12.3 Directories
12.3 Data Model
[Figure: implications of INTERLEAVE; the interleaved tables form a parent-child hierarchy]
• Novel API distributing a globally synchronized proper time
– Leverages hardware features like GPS and atomic clocks
– Implemented via the TrueTime API
12.3 TrueTime
Method Returns
TT.now() TTinterval: [earliest, latest]
TT.after(t) true if t has definitely passed
TT.before(t) true if t has definitely not arrived
TTinterval is guaranteed to contain the absolute time during which TT.now() was invoked
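The semantics can be sketched in a few lines of Python (the ε value below is invented for illustration; real TrueTime derives the bound from the clock infrastructure described next): now() returns an uncertainty interval guaranteed to contain the true absolute time.

import time

class TrueTime:
    def __init__(self, epsilon=0.007):      # e.g., a ~7 ms uncertainty bound
        self.epsilon = epsilon

    def now(self):
        t = time.time()
        return (t - self.epsilon, t + self.epsilon)   # TTinterval

    def after(self, t):
        return self.now()[0] > t            # t has definitely passed

    def before(self, t):
        return self.now()[1] < t            # t has definitely not arrived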
12.3 TrueTime Implementation
• A timeslave daemon runs on each machine
• A set of time master machines exists per datacenter
– The majority of masters feature GPS receivers with dedicated antennas
– Some of the masters feature atomic clocks (“Armageddon masters”)
• The daemon polls a variety of masters
– Chosen from nearby datacenters and from far-off datacenters
– “Armageddon masters”
– A daemon’s poll interval is 30 seconds
• Daemon reaches a consensus about correct timestamp
• Between synchronizations, each daemon advertises a slowly increasing time uncertainty (ε)
12.3 TrueTime Implementation
• TrueTime allows globally meaningful commit timestamps for distributed transactions
– If A happens-before B, then
• timestamp(A) < timestamp(B)
– A happens-before B, if its effects become visible before B begins, in real time
• Visible: received by client or updates were applied to some replica
• Begins: first request arrived at Spanner server
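Spanner enforces this ordering with “commit wait”, sketched here on top of the TrueTime sketch from above: a transaction takes its commit timestamp from the latest bound of TT.now() and delays visibility until that timestamp has definitely passed, so any transaction starting afterwards must receive a larger timestamp.

def commit_wait(tt, apply_mutations):
    s = tt.now()[1]                # commit timestamp: the latest bound
    while not tt.after(s):         # wait out the clock uncertainty
        time.sleep(0.001)          # poll; a real system would do better
    apply_mutations()              # effects become visible only now
    return s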
12.3 Transactions in Spanner
• Data Management in the Cloud
– Map-Reduce Frameworks
– Cloud Storage
– Everything as a Service