Christoph Lofi José Pinto
Institut für Informationssysteme
Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de
Distributed Data Management
12.3 Google Spanner
13.1 Map & Reduce
13.2 Cloud beyond Storage
13.3 Computing as a Service
– SaaS
– PaaS
– IaaS
12.0 The Cloud
12.3 Spanner
• Outline and Key Features
• System Architecture:
– Software Stack
– Directories
– Data Model
– TrueTime
• Evaluation
• Case Study
[Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J. J., … Woodford, D. (2012). Spanner: Google’s Globally-Distributed Database. In Proceedings of OSDI’12: Tenth Symposium on Operating System Design and Implementation (pp. 251–264). http://doi.org/10.1145/2491245]
12.3 Motivation: Social Network
[Figure: users’ posts and friend lists, sharded (×1000) and replicated across datacenters in the US, Brazil, Russia, and Spain — e.g. San Francisco, Seattle, Arizona, São Paulo, Santiago, Buenos Aires, Moscow, Berlin, Krakow, London, Paris, Madrid, Lisbon]
12.3 Spanner
• “We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.”
12.3 Outline
• The next step after Bigtable on the path towards an RDBMS, with strong time semantics
• Key Features:
– Temporal Multi-version database
– Externally consistent global write-transactions with synchronous replication.
– Transactions across Datacenters.
– Lock-free read-only transactions.
– Schematized, semi-relational (tabular) data model.
– SQL-like query interface.
12.3 Key Features cont.
– Auto-sharding, auto-rebalancing, automatic failure response.
– Exposes control of data replication and placement to user/application.
– Enables transaction serialization via global timestamps
– Acknowledges clock uncertainty and guarantees a bound on it
– Uses novel TrueTime API to accomplish concurrency control
– Uses GPS devices and Atomic clocks to get accurate time
12.3 Server configuration
Universe: a Spanner deployment
Zones: analogous to deployments of Bigtable servers (units of physical isolation)
12.3 Spanserver Software Stack
2-phase commit across groups (when needed)
(key:string, timestamp:int64) → string
• Back End: Colossus (successor of GFS)
• To support replication:
– each spanserver implements a Paxos state machine on top of each tablet; the state machine stores the metadata and log of its tablet.
• Set of replicas is collectively a Paxos group
12.3 Spanserver Software Stack
• A leader is chosen among the replicas in a Paxos group; all write requests for replicas in that group are initiated at the leader.
• At every replica that is a leader each spanserver implements:
– a lock table and
– a transaction manager
12.3 Spanserver Software Stack
• Directory – analogous to bucket in BigTable
– Smallest unit of data placement
– Smallest unit to define replication properties
• Directory might in turn be sharded into Fragments if it grows too large.
12.3 Directories
12.3 Data model
• Query language expanded from SQL.
• Multi-version database: uses a version when storing data in a column (time stamp).
• Supports transactions and provides strong consistency.
• A database can contain an unbounded number of schematized tables
• Not purely relational:
– Requires rows to have names
– Names are nothing but a set (possibly a singleton) of primary keys
– In a way, it’s a key value store with primary keys mapped to non-key columns as values
12.3 Data model
Implications of INTERLEAVE: hierarchy (see the sketch below)
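A hedged illustration of this hierarchy (hypothetical table and key names, not Spanner syntax): interleaving a child table under its parent makes the child’s key start with the parent’s primary key, so parent and child rows are stored and partitioned together.

# Hypothetical sketch: row keys under interleaving. Each Albums row is keyed
# by its parent Users key followed by its own id, so a user's albums sort
# (and are co-located) directly behind the user row.
rows = [
    ("Users",  (1,)),      # user 1
    ("Albums", (1, 1)),    # user 1, album 1
    ("Albums", (1, 2)),    # user 1, album 2
    ("Users",  (2,)),      # user 2
    ("Albums", (2, 1)),    # user 2, album 1
]
# Sorting by the key prefix reproduces exactly this parent/child order:
assert rows == sorted(rows, key=lambda r: r[1])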
12.3 TrueTime
• Novel API behind Spanner’s core innovation
• Leverages hardware features like GPS and Atomic Clocks
• Implemented via TrueTime API.
Method          Returns
TT.now()        TTinterval: [earliest, latest]
TT.after(t)     True if t has passed
TT.before(t)    True if t has not arrived
• “Global wall-clock time” with bounded uncertainty
[Figure: TT.now() returns an interval [earliest, latest] of width 2·ε around the true time]
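A minimal sketch of these semantics in Python (illustrative only, not Google’s implementation; the ε value is an assumption):

import time

EPSILON = 0.007  # assumed clock uncertainty bound in seconds

class TTInterval:
    def __init__(self, earliest, latest):
        self.earliest, self.latest = earliest, latest

class TrueTime:
    def now(self):
        t = time.time()                           # local clock reading
        return TTInterval(t - EPSILON, t + EPSILON)
    def after(self, t):
        return t < self.now().earliest            # t has definitely passed
    def before(self, t):
        return t > self.now().latest              # t has definitely not arrived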
12.3 TrueTime
• A set of time master servers per datacenter and a time slave daemon per machine.
• The majority of time masters are fitted with GPS receivers; a few others are fitted with atomic clocks (Armageddon masters).
• A daemon polls a variety of masters and reaches a consensus about the correct timestamp.
12.3 TrueTime implementation
12.3 TrueTime Architecture
[Figure: datacenters 1…n each run GPS timemasters and atomic-clock timemasters; a client polls masters across datacenters and computes its reference interval [earliest, latest] = now ± ε]
12.3 TrueTime
• TrueTime uses both GPS and atomic clocks since they have different failure rates and failure scenarios.
• Two other boolean methods in API are
– After(t) – returns TRUE if t is definitely passed
– Before(t) – returns TRUE if t is definitely not arrived
• TrueTime uses these methods for concurrency control and to serialize transactions.
12.3 TrueTime
• After() is used for Paxos leader leases
– Uses after(smax) to check that smax has passed before a Paxos leader may abdicate
• Paxos leaders cannot assign a timestamp si greater than smax to a transaction Ti, and clients cannot see the data committed by Ti until after(si) is true.
• Replicas maintain a timestamp tsafe, the maximum timestamp at which that replica is up to date; a replica can serve a read at timestamp t if t ≤ tsafe.
12.3 Concurrency control
1. Read-Write – requires lock.
2. Read-Only – lock free.
– Requires declaration before start of transaction.
– Reads information that is up to date
3. Snapshot Read – reads information from the past at a specified timestamp or timestamp bound
– The user specifies a specific past timestamp, or a bound, so that data up to that point is read (see the sketch below)
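A hypothetical sketch of the snapshot-read rule (names are illustrative, not Spanner’s API): a replica may serve a read at timestamp t only if t ≤ tsafe, and it returns the newest version written at or before t.

def snapshot_read(replica, key, t):
    # serve only if this replica is up to date through t
    if t > replica.t_safe:
        raise RuntimeError("replica not caught up to t: wait or pick another replica")
    # multi-version store: return the newest version written at or before t
    versions = replica.versions[key]              # assumed: list of (timestamp, value)
    return max((v for v in versions if v[0] <= t), key=lambda v: v[0])[1]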
12.3 Timestamps
• Strict two-phase locking for write transactions
• Assign timestamp while locks are held
[Figure: transaction T picks its timestamp s = now() at some point between acquiring and releasing its locks]
12.3 Timestamp Invariants
• Timestamp order == commit order
• Timestamp order respects global wall-time order
[Figure: example transactions T1–T4 illustrating these invariants]
12.3 Timestamps and TrueTime
[Figure: transaction T picks s = TT.now().latest while holding its locks, then commit-waits until TT.now().earliest > s (on average about 2·ε in total) before releasing them]
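A hedged sketch of this commit-wait rule, reusing the TrueTime sketch from above (apply_write is a placeholder for logging/replicating the write):

import time

def commit_write(tt, apply_write):
    # locks are assumed to be held here (strict two-phase locking)
    s = tt.now().latest                   # pick s = TT.now().latest
    apply_write(s)                        # log / replicate the write at timestamp s
    while not tt.after(s):                # commit wait: until TT.now().earliest > s
        time.sleep(0.001)
    # only now release locks and make the commit visible to clients
    return s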
12.3 Commit Wait and Replication
[Figure: while transaction T holds its locks, the leader picks s, starts Paxos consensus, achieves consensus, and notifies the slaves; commit wait completes before the locks are released]
12.3 Commit Wait and 2-Phase Commit
[Figure: 2-phase commit with commit wait — coordinator TC and participants TP1, TP2 each acquire and later release locks; each participant computes its s, starts and finishes logging, and reports prepared; TC computes the overall s, waits out the commit wait, commits, and notifies the participants of s]
12.3 Example
[Figure: a 2PC transaction (coordinator TC, participant TP) removes X from my friend list and removes me from X’s friend list; with proposals sC=6 and sP=8 the overall commit timestamp is s=8. A later transaction T2 posts the risky post P at s=15. Reads at a time < 8 still see [X] in “My friends” and [me] in “X’s friends”; at timestamp 8 both friend lists are empty; at 15 “My posts” contains [P].]
12.3 Evaluation
• Evaluated for replication, transactions and availability.
• Results on epsilon of TrueTime
• Benchmarked on a Spanner system with
– 50 Paxos groups
– 250 directories
– Clients (applications) and zones at a network distance of 1 ms
12.3 Evaluation - Availability
12.3 Evaluation - Epsilon
“…bad CPUs are 6 times more likely than bad clocks…”
12.3 Case Study
• Spanner is currently used in production by Google’s advertising backend F1.
• F1 previously used a MySQL database that was manually sharded many ways.
• Spanner provides synchronous replication and automatic failover for F1.
12.3 Case Study cont.
• Enabled F1 to specify data placement via Spanner directories, according to its needs.
• F1 operation latencies measured over 24 hours
12.3 Summary
• Multi-version, scalable, globally distributed and synchronously replicated database.
• Key enabling technology: TrueTime
– Interval-based global time
• The first system to distribute data at global scale and support externally consistent distributed transactions.
• Implementation key points: integration of concurrency control, replication, and 2PC
13.1 Map & Reduce
13.2 Cloud beyond Storage
13.3 Computing as a Service
– SaaS
– PaaS
– IaaS
13.0 The Cloud
• Just storing massive amounts of data is often not enough!
– Often, we also need to process and transform that data
• Large-Scale Data Processing
– Use thousands of worker nodes within a computation cluster to process large data batches
• But don’t want hassle of managing things
• Map & Reduce provides
– Automatic parallelization & distribution
– Fault tolerance
– I/O scheduling
– Monitoring & status updates
13.1 Map & Reduce
• Initially, implemented by Google for building the Google search index
– i.e. crawling the web, building an inverted word index, computing PageRank, etc.
• General framework for parallel high volume data processing
– J. Dean, S. Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters”, Symp. on Operating System Design and Implementation, San Francisco, USA, 2004
– Also available as Open Source implementation as part of Apache Hadoop
• http://hadoop.apache.org/mapreduce/
13.1 Map & Reduce
• Base idea
– There is a large amount of input data, identified by keys
• i.e. input given as key-value pairs
• e.g. all web pages of the internet identified by their URL
– A map operation is a simple function which accepts one input key-value pair
• A map operation runs as an autonomous thread on a single node of a cluster
– Many map jobs can run in parallel on different input keys
• Returns for a single input key-value pair a set of intermediate key-value pairs
– map(key, value) → Set of intermediate (key, value)
• After a map job is finished, the node is free to perform another map job for the next input key-value pair
– A central controller distributes map jobs to free nodes
13.1 Map & Reduce
– After the input data is mapped, reduce jobs can start
– reduce(key, values) is run for each unique key emitted by map()
• Each reduce job also runs autonomously on one single node
– Many reduce jobs can run in parallel on different intermediate key groups
• Reduce emits the final output of the map-reduce operation
• Each reduce job takes all map tuples with a given key as input
• It usually generates one, but possibly more, output tuples
13.1 Map & Reduce
• Each reduce is executed on a set of intermediate map results which have the same key
– To efficiently select that set, the intermediate key-value pairs are usually shuffled
• i.e. just sorted and grouped by their respective key
– After shuffling, reduce input data can be selected by a simple range scan
13.1 Map & Reduce
• Example: Counting words in documents
13.1 Map & Reduce
• Example: Counting words in documents

map(key, value):
  // key: doc name
  // value: text of doc
  for each word w in value:
    emit(w, 1);

reduce(key, values):
  // key: a word
  // values: list of counts
  result = 0;
  for each v in values:
    result += v;
  emit(key, result);
13.1 Map & Reduce
doc1: “distributed db and p2p”
doc2: “map and reduce is a distributed processing technique for db”

map(key, value) emits, among others:
(distributed, 1), (db, 1), (and, 1), (p2p, 1), (map, 1), (and, 1), (reduce, 1), (is, 1), (a, 1), (distributed, 1), …

reduce(key, values) then produces:
(distributed, 2), (db, 2), (and, 2), (p2p, 1), (map, 1), (reduce, 1), (is, 1), …
• Improvement: Combiners
– Combiners are mini-reducers that run in-memory after the map phase
– Used to group rare map keys into larger groups
• e.g. word counts: group multiple extremely rare words under one key (and mark that they are grouped…)
– Used to reduce network and worker scheduling overhead (a minimal sketch follows below)
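A minimal sketch of a standard word-count combiner (local pre-aggregation on the mapper’s node before the shuffle; illustrative only, not Hadoop’s Combiner interface):

from collections import defaultdict

def combine(intermediate_pairs):
    # merge the (word, 1) pairs emitted by one map task into (word, local_count)
    local_counts = defaultdict(int)
    for word, count in intermediate_pairs:
        local_counts[word] += count
    return list(local_counts.items())

# e.g. combine([("db", 1), ("and", 1), ("db", 1)]) yields [("db", 2), ("and", 1)]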
13.1 Map & Reduce
• Responsibility of the map and reduce master
• Often, also called scheduler
– Assign Map and Reduce tasks to workers on nodes
• Usually, map tasks are assigned to worker nodes as a batch and not one by one
– Often called a split, i.e. subset of the whole input data
– Split often implemented by a simple hash function with as many buckets as worker nodes
– Full split data is assigned to worker node which starts a map task for each input key-value pair
– Check for node failure
– Check for task completion
– Route map results to reduce tasks
13.1 Map & Reduce
• Map and Reduce overview
13.1 Map & Reduce
• Master is responsible for worker node fault tolerance
– Handled via re-execution
• Detect failure via periodic heartbeats
• Re-execute completed + in-progress map tasks
• Re-execute in progress reduce tasks
• Task completion committed through master
– Robust: in one run, 1,600 of 1,800 machines were lost, yet the job finished ok
• Master failures are not handled
– Unlikely due to redundant hardware…
13.1 Map & Reduce
• Showcase: machine usage during web indexing
– Fine granularity tasks: map tasks >> machines
• Minimizes time for fault recovery
• Can pipeline shuffling with map execution
• Better dynamic load balancing
– Showcase uses 200,000 map & 5,000 reduce tasks
– Running on 2,000 machines
13.1 Map & Reduce
[Source: Bill Howe, UW]
13.1 Google Systems Summary
[Figure: timeline 2004–2012 of Google systems — MapReduce, BigTable, Megastore, Tenzing, Dremel, Pregel, Spanner — and the open-source counterparts Hadoop and HBase]
13.1 Google Systems Summary
• PageRank is one of the major algorithms behind Google Search
– See our wonderful IRWS lecture (No 12)!!
– Key Question: How important is a given website?
• Importance independent of query
– Idea: other pages “vote” for a site by linking to it
• also called “giving credit to”
• Pages with many votes are probably important
– If an important site “votes” for another site, that vote has a higher weight than when an unimportant site votes
13.1 MR - PageRank
[Figure: page x with in-bound links t1, t2, t3]
• Given page $x$ with in-bound links $t_1, \dots, t_n$, where
– $C(t)$ is the out-degree of $t$
– $\alpha$ is the probability of a random jump
– $N$ is the total number of nodes in the graph
– $PR(x) = \alpha \cdot \frac{1}{N} + (1 - \alpha) \cdot \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}$
13.1 MR - PageRank
• Properties of PageRank
– Can be computed iteratively
– Effects at each iteration are local
• Sketch of algorithm:
– Start with seed PRi values
– Each page distributes PRi “credit” to all pages it links to
– Each target page adds up “credit” from multiple in-bound links to compute PRi+1
– Iterate until values converge
13.1 MR - PageRank
Map Step: Distribute Page Rank “Credits” to link targets
Reduce Step: gather up PageRank “credit” from multiple sources to compute new PageRank value
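A hedged sketch of one such iteration as map and reduce functions (simplified: dangling-node handling is omitted; input records are assumed to be (page, (current_rank, [outlinks])), and ALPHA/N are assumptions for a toy graph):

ALPHA = 0.15   # assumed random-jump probability
N = 4          # assumed number of pages in the toy graph

def pagerank_map(page, value):
    rank, outlinks = value
    yield (page, ("links", outlinks))              # pass the graph structure along
    for target in outlinks:                        # distribute PR "credit" to link targets
        yield (target, ("credit", rank / len(outlinks)))

def pagerank_reduce(page, values):
    outlinks, credit = [], 0.0
    for kind, v in values:
        if kind == "links":
            outlinks = v
        else:
            credit += v                            # add up credit from in-bound links
    new_rank = ALPHA / N + (1 - ALPHA) * credit    # PR(x) = α·1/N + (1−α)·Σ PR(t_i)/C(t_i)
    yield (page, (new_rank, outlinks))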
13.1 MapReduce Contemporaries
• Dryad (Microsoft)
– Relational Algebra
• Pig (Yahoo)
– Near Relational Algebra over MapReduce
• HIVE (Facebook)
– SQL over MapReduce
• Cascading
– University of Wisconsin
• Hbase
– Indexing on HDFS
13.2 Pig
• An engine for executing programs on top of Hadoop.
• It provides a language, Pig Latin, to specify these programs.
• An Apache open source project: http://pig.apache.org
13.2 Pig: motivation
• Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited sites by users aged 18-25
[Conceptual dataflow: Load Users, Load Pages → Filter by age → Join on name → Group on url → Count clicks → Order by clicks → Take top 5]
13.2 In MapReduce
[Figure: the equivalent hand-written Map-Reduce implementation]
170 lines of code, 4 hours to write
13.2 In Pig Latin
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
9 lines of code, 15 minutes to write
13.2 Pig System Overview
Pig Latin program:
A = LOAD 'file' AS (sid, pid, mass, px:double);
B = LOAD 'file2' AS (sid, pid, mass, px:double);
C = FILTER A BY px < 1.0;
D = JOIN C BY sid, B BY sid;
STORE D INTO 'output.txt';
[Figure: the Pig parser produces a parsed program; the Pig compiler turns it into an execution plan — LOAD (disk A) and LOAD (disk B), then FILTER, then JOIN]
13.2 Pig Performance vs Map-Reduce
How fast is Pig compared to a pure Map-Reduce implementation?
13.2 Data model
• Atom: integer, string, etc.
• Tuple:
– Sequence of fields
– Each field of any type
• Bag:
– A collection of tuples
– Not necessarily of the same type
– Duplicates allowed
• Map:
– String literal keys mapped to any type (illustrated with Python literals below)
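Illustrated with Python literals (only an analogy to Pig’s nested data model, not Pig syntax):

atom = "alice"                                  # Atom: integer, string, ...
tup  = ("alice", 20, 3.9)                       # Tuple: sequence of fields of any type
bag  = [("alice", 20), ("bob", "unknown", 4)]   # Bag: collection of tuples, mixed
                                                #      shapes and duplicates allowed
m    = {"age": 20, "friends": bag}              # Map: string keys -> values of any type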
13.2 Pig Latin statement
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
X = FOREACH A GENERATE name, $2;

                                                    First Field   Second Field   Third Field
Data type                                           chararray     int            float
Positional notation (generated by system)           $0            $1             $2
Possible name (assigned by user using a schema)     name          age            gpa
13.2 Apache Spark motivation
• Map-Reduce: Iterative Jobs
– Iterative jobs involve a lot of disk I/O for each repetition
• Using Map-Reduce for complex jobs, interactive queries, and online processing involves lots of disk I/O
• Idea: keep more data in memory!
13.2 Use memory instead of disk
13.2 In memory data sharing
13.2 MapReduce Programmability
• Most real applications require multiple MR steps:
– Google indexing pipeline: 21 steps
– Analytics queries (e.g. count clicks & top-k): 2-5 steps
– Iterative algorithms (e.g. PageRank): 10’s of steps
• Multi-step jobs create spaghetti code
– 21 MR steps -> 21 mapper and reducer classes
13.2 Programmability
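For contrast, a hedged PySpark sketch (assumes a SparkContext sc and an illustrative HDFS path): a pipeline that would need several chained MapReduce jobs becomes a few chained transformations plus one action.

top5 = (sc.textFile("hdfs://.../docs")
          .flatMap(lambda line: line.split())         # "map" step: emit words
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b)            # "reduce" step: sum counts
          .sortBy(lambda kv: kv[1], ascending=False)  # extra step: order by count
          .take(5))                                   # action: triggers execution
print(top5)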
13.2 Performance
[Source: Daytona GraySort benchmark, sortbenchmark.org]
13.2 Apache Spark
• Open source processing engine.
• Originally developed at UC Berkeley in 2009.
• More than 100 operators for transforming data.
• World record for large-scale on disk sorting.
• Built-in support for many data sources (HDFS, RDBMS, S3, Cassandra)
[Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010). Spark: Cluster Computing with Working Sets. In HotCloud’10: 2nd USENIX Conference on Hot Topics in Cloud Computing.]
[Zaharia, M., Chowdhury, M., Das, T., Dave, A., et al. (2012). Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In NSDI’12: 9th USENIX Conference on Networked Systems Design and Implementation.]
13.2 Spark Tools
13.2 Resilient Distributed Datasets (RDDs)
Write programs in terms of distributed datasets and operations on them
Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on disk.
• Built through parallel transformations
• Automatically rebuilt on failure
Operations
• Transformations (e.g. map, filter, groupBy).
• Actions (e.g. count, collect, save)
13.2 Working with RDDs
13.2 Spark and Map Reduce Differences

                           Hadoop Map Reduce     Spark
Storage                    Disk only             In-memory or on disk
Operations                 Map and Reduce        Map, Reduce, Join, Sample, etc.
Execution model            Batch                 Batch, interactive, streaming
Programming environments   Java                  Scala, Java, R, and Python
Sample case
13.2 Log mining
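The original slides build this example up step by step as figures; a hedged PySpark sketch of the log-mining pattern they illustrate (the SparkContext sc, file path, and search terms are assumptions) could look like this:

lines    = sc.textFile("hdfs://.../server.log")
errors   = lines.filter(lambda line: line.startswith("ERROR"))   # transformation (lazy)
messages = errors.map(lambda line: line.split("\t")[1])
messages.cache()             # keep the filtered dataset in cluster memory

# Actions trigger execution; later queries are served from the cached RDD.
print(messages.filter(lambda m: "mysql" in m).count())
print(messages.filter(lambda m: "php" in m).count())

Because the filtered messages are cached after the first action, subsequent queries run at memory speed and the full log file is read from disk only once.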