Christoph Lofi José Pinto
Institut für Informationssysteme
Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de
Distributed Data Management
12.3 Google Spanner
13.1 Map & Reduce
13.2 Cloud beyond Storage
13.3 Computing as a Service
– SaaS
– PaaS
– IaaS
12.0 The Cloud
12.3 Spanner
• Outline and Key Features
• System Architecture:
– Software Stack
– Directories
– Data Model
– TrueTime
• Evaluation
• Case Study
[Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J. J., … Woodford, D. (2012). Spanner: Google’s Globally-Distributed Database. In Proceedings of OSDI’12: Tenth Symposium on Operating System Design and Implementation (pp. 251–264). http://doi.org/10.1145/2491245]
12.3 Motivation: Social Network
[Figure: users’ posts and friend lists, sharded (×1000) and replicated across datacenters in the US, Brazil, Russia, and Spain — e.g. San Francisco, Seattle, Arizona, São Paulo, Santiago, Buenos Aires, Moscow, Berlin, Krakow, London, Paris, Madrid, Lisbon]
12.3 Spanner
• “We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.”
12.3 Outline
• The next step after Bigtable on the path towards an RDBMS, with strong time semantics
• Key Features:
– Temporal Multi-version database
– Externally consistent global write-transactions with synchronous replication.
– Transactions across Datacenters.
– Lock-free read-only transactions.
– Schematized, semi-relational (tabular) data model.
– SQL-like query interface.
12.3 Key Features cont.
– Auto-sharding, auto-rebalancing, automatic failure response.
– Exposes control of data replication and placement to user/application.
– Enables transaction serialization via global timestamps
– Acknowledges clock uncertainty and guarantees a bound on it
– Uses novel TrueTime API to accomplish concurrency control
– Uses GPS devices and Atomic clocks to get accurate time
12.3 Server configuration
Universe: a Spanner deployment
Zones: analogous to deployments of Bigtable servers (units of physical isolation)
12.3 Spanserver Software Stack
2-phase commit across groups (when needed)
(key:string, timestamp:int64) → string
• Back End: Colossus (successor of GFS)
• To support replication:
– each spanserver implements a Paxos state machine on top of each tablet; the state machine stores the metadata and log of its tablet.
• Set of replicas is collectively a Paxos group
12.3 Spanserver Software Stack
• A leader is chosen among the replicas in a Paxos group; all write requests for replicas in that group are initiated at the leader.
• At every replica that is a leader each spanserver implements:
– a lock table and
– a transaction manager
12.3 Spanserver Software Stack
• Directory – analogous to bucket in BigTable
– Smallest unit of data placement
– Smallest unit to define replication properties
• Directory might in turn be sharded into Fragments if it grows too large.
12.3 Directories
12.3 Data model
• Query language expanded from SQL.
• Multi-version database: uses a version when storing data in a column (time stamp).
• Supports transactions and provides strong consistency.
• A database can contain an unbounded number of schematized tables
• Not purely relational:
– Requires rows to have names
– Names are nothing but a set (possibly a singleton) of primary keys
– In a way, it’s a key value store with primary keys mapped to non-key columns as values
12.3 Data model
Implications of INTERLEAVE: hierarchy (see the sketch below)
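A hedged illustration of this hierarchy (hypothetical table and key names, not Spanner syntax): interleaving a child table under its parent makes the child’s key start with the parent’s primary key, so parent and child rows are stored and partitioned together.

# Hypothetical sketch: row keys under interleaving. Each Albums row is keyed
# by its parent Users key followed by its own id, so a user's albums sort
# (and are co-located) directly behind the user row.
rows = [
    ("Users",  (1,)),      # user 1
    ("Albums", (1, 1)),    # user 1, album 1
    ("Albums", (1, 2)),    # user 1, album 2
    ("Users",  (2,)),      # user 2
    ("Albums", (2, 1)),    # user 2, album 1
]
# Sorting by the key prefix reproduces exactly this parent/child order:
assert rows == sorted(rows, key=lambda r: r[1])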
12.3 TrueTime
• Novel API behind Spanner’s core innovation
• Leverages hardware features like GPS and Atomic Clocks
• Implemented via TrueTime API.
Method          Returns
TT.now()        TTinterval: [earliest, latest]
TT.after(t)     True if t has passed
TT.before(t)    True if t has not arrived
• “Global wall-clock time” with bounded uncertainty
[Figure: TT.now() returns an interval [earliest, latest] of width 2·ε around the true time]
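A minimal sketch of these semantics in Python (illustrative only, not Google’s implementation; the ε value is an assumption):

import time

EPSILON = 0.007  # assumed clock uncertainty bound in seconds

class TTInterval:
    def __init__(self, earliest, latest):
        self.earliest, self.latest = earliest, latest

class TrueTime:
    def now(self):
        t = time.time()                           # local clock reading
        return TTInterval(t - EPSILON, t + EPSILON)
    def after(self, t):
        return t < self.now().earliest            # t has definitely passed
    def before(self, t):
        return t > self.now().latest              # t has definitely not arrived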
12.3 TrueTime
• A set of time master servers per datacenter and a time slave daemon per machine.
• The majority of time masters are fitted with GPS receivers; a few others are fitted with atomic clocks (Armageddon masters).
• A daemon polls a variety of masters and reaches a consensus about the correct timestamp.
12.3 TrueTime implementation
12.3 TrueTime Architecture
[Figure: datacenters 1…n each run GPS timemasters and atomic-clock timemasters; a client polls masters across datacenters and computes its reference interval [earliest, latest] = now ± ε]
12.3 TrueTime
• TrueTime uses both GPS and atomic clocks since they have different failure rates and failure scenarios.
• Two other boolean methods in API are
– After(t) – returns TRUE if t is definitely passed
– Before(t) – returns TRUE if t is definitely not arrived
• TrueTime uses these methods for concurrency control and to serialize transactions.
12.3 TrueTime
• After() is used for Paxos leader leases
– Uses after(smax) to check that smax has passed before a Paxos leader may abdicate
• Paxos leaders cannot assign a timestamp si greater than smax to a transaction Ti, and clients cannot see the data committed by Ti until after(si) is true.
• Replicas maintain a timestamp tsafe, the maximum timestamp at which that replica is up to date; a replica can serve a read at timestamp t if t ≤ tsafe.
12.3 Concurrency control
1. Read-Write – requires lock.
2. Read-Only – lock free.
– Requires declaration before start of transaction.
– Reads information that is up to date
3. Snapshot Read – reads information from the past at a specified timestamp or timestamp bound
– The user specifies a specific past timestamp, or a bound, so that data up to that point is read (see the sketch below)
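A hypothetical sketch of the snapshot-read rule (names are illustrative, not Spanner’s API): a replica may serve a read at timestamp t only if t ≤ tsafe, and it returns the newest version written at or before t.

def snapshot_read(replica, key, t):
    # serve only if this replica is up to date through t
    if t > replica.t_safe:
        raise RuntimeError("replica not caught up to t: wait or pick another replica")
    # multi-version store: return the newest version written at or before t
    versions = replica.versions[key]              # assumed: list of (timestamp, value)
    return max((v for v in versions if v[0] <= t), key=lambda v: v[0])[1]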
12.3 Timestamps
• Strict two-phase locking for write transactions
• Assign timestamp while locks are held
[Figure: transaction T picks its timestamp s = now() at some point between acquiring and releasing its locks]
12.3 Timestamp Invariants
• Timestamp order == commit order
• Timestamp order respects global wall-time order
[Figure: example transactions T1–T4 illustrating these invariants]
12.3 Timestamps and TrueTime
[Figure: transaction T picks s = TT.now().latest while holding its locks, then commit-waits until TT.now().earliest > s (on average about 2·ε in total) before releasing them]
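A hedged sketch of this commit-wait rule, reusing the TrueTime sketch from above (apply_write is a placeholder for logging/replicating the write):

import time

def commit_write(tt, apply_write):
    # locks are assumed to be held here (strict two-phase locking)
    s = tt.now().latest                   # pick s = TT.now().latest
    apply_write(s)                        # log / replicate the write at timestamp s
    while not tt.after(s):                # commit wait: until TT.now().earliest > s
        time.sleep(0.001)
    # only now release locks and make the commit visible to clients
    return s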
12.3 Commit Wait and Replication
[Figure: while transaction T holds its locks, the leader picks s, starts Paxos consensus, achieves consensus, and notifies the slaves; commit wait completes before the locks are released]
12.3 Commit Wait and 2-Phase Commit
[Figure: 2-phase commit with commit wait — coordinator TC and participants TP1, TP2 each acquire and later release locks; each participant computes its s, starts and finishes logging, and reports prepared; TC computes the overall s, waits out the commit wait, commits, and notifies the participants of s]
12.3 Example
[Figure: a 2PC transaction (coordinator TC, participant TP) removes X from my friend list and removes me from X’s friend list; with proposals sC=6 and sP=8 the overall commit timestamp is s=8. A later transaction T2 posts the risky post P at s=15. Reads at a time < 8 still see [X] in “My friends” and [me] in “X’s friends”; at timestamp 8 both friend lists are empty; at 15 “My posts” contains [P].]
12.3 Evaluation
• Evaluated for replication, transactions and availability.
• Results on epsilon of TrueTime
• Benchmarked on a Spanner system with
– 50 Paxos groups
– 250 directories
– Clients (applications) and zones at a network distance of 1 ms
12.3 Evaluation - Availability
12.3 Evaluation - Epsilon
“…bad CPUs are 6 times more likely than bad clocks…”
12.3 Case Study
• Spanner is currently used in production by Google’s advertising backend F1.
• F1 previously used a MySQL database that was manually sharded many ways.
• Spanner provides synchronous replication and automatic failover for F1.
12.3 Case Study cont.
• Enabled F1 to specify data placement via Spanner directories, according to its needs.
• F1 operation latencies measured over 24 hours
12.3 Summary
• Multi-version, scalable, globally distributed and synchronously replicated database.
• Key enabling technology: TrueTime
– Interval-based global time
• The first system to distribute data at global scale and support externally consistent distributed transactions.
• Implementation key points: integration of concurrency control, replication, and 2PC
13.1 Map & Reduce
13.2 Cloud beyond Storage
13.3 Computing as a Service
– SaaS
– PaaS
– IaaS
13.0 The Cloud
• Just storing massive amounts of data is often not enough!
– Often, we also need to process and transform that data
• Large-Scale Data Processing
– Use thousands of worker nodes within a computation cluster to process large data batches
• But don’t want hassle of managing things
• Map & Reduce provides
– Automatic parallelization & distribution
– Fault tolerance
– I/O scheduling
– Monitoring & status updates
13.1 Map & Reduce
• Initially, implemented by Google for building the Google search index
– i.e. crawling the web, building an inverted word index, computing PageRank, etc.
• General framework for parallel high volume data processing
– J. Dean, S. Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters”, Symp. on Operating System Design and Implementation, San Francisco, USA, 2004
– Also available as Open Source implementation as part of Apache Hadoop
• http://hadoop.apache.org/mapreduce/
13.1 Map & Reduce
• Base idea
– There is a large amount of input data, identified by keys
• i.e. input given as key-value pairs
• e.g. all web pages of the internet identified by their URL
– A map operation is a simple function which accepts one input key-value pair
• A map operation runs as an autonomous thread on a single node of a cluster
– Many map jobs can run in parallel on different input keys
• Returns for a single input key-value pair a set of intermediate key-value pairs
– map(key, value) → Set of intermediate (key, value)
• After a map job is finished, the node is free to perform another map job for the next input key-value pair
– A central controller distributes map jobs to free nodes
13.1 Map & Reduce
– After the input data is mapped, reduce jobs can start
– reduce(key, values) is run for each unique key emitted by map()
• Each reduce job also runs autonomously on one single node
– Many reduce jobs can run in parallel on different intermediate key groups
• Reduce emits the final output of the map-reduce operation
• Each reduce job takes all map tuples with a given key as input
• It usually generates one, but possibly more, output tuples
13.1 Map & Reduce
• Each reduce is executed on a set of intermediate map results which have the same key
– To efficiently select that set, the intermediate key-value pairs are usually shuffled
• i.e. just sorted and grouped by their respective key
– After shuffling, reduce input data can be selected by a simple range scan
13.1 Map & Reduce
• Example: Counting words in documents
13.1 Map & Reduce
• Example: Counting words in documents

map(key, value):
  // key: doc name
  // value: text of doc
  for each word w in value:
    emit(w, 1);

reduce(key, values):
  // key: a word
  // values: list of counts
  result = 0;
  for each v in values:
    result += v;
  emit(key, result);
13.1 Map & Reduce
doc1: “distributed db and p2p”
doc2: “map and reduce is a distributed processing technique for db”

map(key, value) emits, among others:
(distributed, 1), (db, 1), (and, 1), (p2p, 1), (map, 1), (and, 1), (reduce, 1), (is, 1), (a, 1), (distributed, 1), …

reduce(key, values) then produces:
(distributed, 2), (db, 2), (and, 2), (p2p, 1), (map, 1), (reduce, 1), (is, 1), …
• Improvement: Combiners
– Combiners are mini-reducers that run in-memory after the map phase
– Used to group rare map keys into larger groups
• e.g. word counts: group multiple extremely rare words under one key (and mark that they are grouped…)
– Used to reduce network and worker scheduling overhead (a minimal sketch follows below)
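A minimal sketch of a standard word-count combiner (local pre-aggregation on the mapper’s node before the shuffle; illustrative only, not Hadoop’s Combiner interface):

from collections import defaultdict

def combine(intermediate_pairs):
    # merge the (word, 1) pairs emitted by one map task into (word, local_count)
    local_counts = defaultdict(int)
    for word, count in intermediate_pairs:
        local_counts[word] += count
    return list(local_counts.items())

# e.g. combine([("db", 1), ("and", 1), ("db", 1)]) yields [("db", 2), ("and", 1)]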
13.1 Map & Reduce
• Responsibility of the map and reduce master
• Often, also called scheduler
– Assign Map and Reduce tasks to workers on nodes
• Usually, map tasks are assigned to worker nodes as a batch and not one by one
– Often called a split, i.e. subset of the whole input data
– Split often implemented by a simple hash function with as many buckets as worker nodes
– Full split data is assigned to worker node which starts a map task for each input key-value pair
– Check for node failure
– Check for task completion
– Route map results to reduce tasks
13.1 Map & Reduce
• Map and Reduce overview
13.1 Map & Reduce
• Master is responsible for worker node fault tolerance
– Handled via re-execution
• Detect failure via periodic heartbeats
• Re-execute completed + in-progress map tasks
• Re-execute in progress reduce tasks
• Task completion committed through master
– Robust: in one run, 1,600 of 1,800 machines were lost, yet the job finished ok
• Master failures are not handled
– Unlikely due to redundant hardware…
13.1 Map & Reduce
• Showcase: machine usage during web indexing
– Fine granularity tasks: map tasks >> machines
• Minimizes time for fault recovery
• Can pipeline shuffling with map execution
• Better dynamic load balancing
– Showcase uses 200,000 map & 5,000 reduce tasks
– Running on 2,000 machines
13.1 Map & Reduce
[Source: Bill Howe, UW]
13.1 Google Systems Summary
[Figure: timeline 2004–2012 of Google systems — MapReduce, BigTable, Megastore, Tenzing, Dremel, Pregel, Spanner — and the open-source counterparts Hadoop and HBase]
13.1 Google Systems Summary
• PageRank is one of the major algorithms behind Google Search
– See our wonderful IRWS lecture (No 12)!!
– Key Question: How important is a given website?
• Importance independent of query
– Idea: other pages “vote” for a site by linking to it
• also called “giving credit to”
• Pages with many votes are probably important
– If an important site “votes” for another site, that vote has a higher weight than when an unimportant site votes
13.1 MR - PageRank
[Figure: page x with in-bound links t1, t2, t3]
• Given page $x$ with in-bound links $t_1, \dots, t_n$, where
– $C(t)$ is the out-degree of $t$
– $\alpha$ is the probability of a random jump
– $N$ is the total number of nodes in the graph
– $PR(x) = \alpha \cdot \frac{1}{N} + (1 - \alpha) \cdot \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}$
13.1 MR - PageRank
• Properties of PageRank
– Can be computed iteratively
– Effects at each iteration are local
• Sketch of algorithm:
– Start with seed PRi values
– Each page distributes PRi “credit” to all pages it links to
– Each target page adds up “credit” from multiple in-bound links to compute PRi+1
– Iterate until values converge
13.1 MR - PageRank
Map Step: Distribute Page Rank “Credits” to link targets
Reduce Step: gather up PageRank “credit” from multiple sources to compute new PageRank value
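A hedged sketch of one such iteration as map and reduce functions (simplified: dangling-node handling is omitted; input records are assumed to be (page, (current_rank, [outlinks])), and ALPHA/N are assumptions for a toy graph):

ALPHA = 0.15   # assumed random-jump probability
N = 4          # assumed number of pages in the toy graph

def pagerank_map(page, value):
    rank, outlinks = value
    yield (page, ("links", outlinks))              # pass the graph structure along
    for target in outlinks:                        # distribute PR "credit" to link targets
        yield (target, ("credit", rank / len(outlinks)))

def pagerank_reduce(page, values):
    outlinks, credit = [], 0.0
    for kind, v in values:
        if kind == "links":
            outlinks = v
        else:
            credit += v                            # add up credit from in-bound links
    new_rank = ALPHA / N + (1 - ALPHA) * credit    # PR(x) = α·1/N + (1−α)·Σ PR(t_i)/C(t_i)
    yield (page, (new_rank, outlinks))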
13.1 MapReduce Contemporaries
• Dryad (Microsoft)
– Relational Algebra
• Pig (Yahoo)
– Near Relational Algebra over MapReduce
• HIVE (Facebook)
– SQL over MapReduce
• Cascading
– University of Wisconsin
• Hbase
– Indexing on HDFS
13.2 Pig
• An engine for executing programs on top of Hadoop.
• It provides a language, Pig Latin, to specify these programs.
• An Apache open source project: http://pig.apache.org
13.2 Pig: motivation
• Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited sites by users aged 18-25
[Conceptual dataflow: Load Users, Load Pages → Filter by age → Join on name → Group on url → Count clicks → Order by clicks → Take top 5]
13.2 In MapReduce
[Figure: the equivalent hand-written Map-Reduce implementation]
170 lines of code, 4 hours to write
13.2 In Pig Latin
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
9 lines of code, 15 minutes to write
13.2 Pig System Overview
Pig Latin program:
A = LOAD 'file' AS (sid, pid, mass, px:double);
B = LOAD 'file2' AS (sid, pid, mass, px:double);
C = FILTER A BY px < 1.0;
D = JOIN C BY sid, B BY sid;
STORE D INTO 'output.txt';
[Figure: the Pig parser produces a parsed program; the Pig compiler turns it into an execution plan — LOAD (disk A) and LOAD (disk B), then FILTER, then JOIN]
13.2 Pig Performance vs Map-Reduce
How fast is Pig compared to a pure Map-Reduce implementation?
13.2 Data model
• Atom: integer, string, etc.
• Tuple:
– Sequence of fields
– Each field of any type
• Bag:
– A collection of tuples
– Not necessarily of the same type
– Duplicates allowed
• Map:
– String literal keys mapped to any type (illustrated with Python literals below)
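Illustrated with Python literals (only an analogy to Pig’s nested data model, not Pig syntax):

atom = "alice"                                  # Atom: integer, string, ...
tup  = ("alice", 20, 3.9)                       # Tuple: sequence of fields of any type
bag  = [("alice", 20), ("bob", "unknown", 4)]   # Bag: collection of tuples, mixed
                                                #      shapes and duplicates allowed
m    = {"age": 20, "friends": bag}              # Map: string keys -> values of any type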
13.2 Pig Latin statement
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
X = FOREACH A GENERATE name, $2;

                                                    First Field   Second Field   Third Field
Data type                                           chararray     int            float
Positional notation (generated by system)           $0            $1             $2
Possible name (assigned by user using a schema)     name          age            gpa
13.2 Apache Spark motivation
• Map-Reduce: Iterative Jobs
– Iterative jobs involve a lot of disk I/O for each repetition
• Using Map-Reduce for complex jobs, interactive queries, and online processing involves lots of disk I/O
• Idea: keep more data in memory!
13.2 Use memory instead of disk
13.2 In memory data sharing
13.2 MapReduce Programmability
• Most real applications require multiple MR steps:
– Google indexing pipeline: 21 steps
– Analytics queries (e.g. count clicks & top-k): 2-5 steps
– Iterative algorithms (e.g. PageRank): 10’s of steps
• Multi-step jobs create spaghetti code
– 21 MR steps -> 21 mapper and reducer classes
13.2 Programmability
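For contrast, a hedged PySpark sketch (assumes a SparkContext sc and an illustrative HDFS path): a pipeline that would need several chained MapReduce jobs becomes a few chained transformations plus one action.

top5 = (sc.textFile("hdfs://.../docs")
          .flatMap(lambda line: line.split())         # "map" step: emit words
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b)            # "reduce" step: sum counts
          .sortBy(lambda kv: kv[1], ascending=False)  # extra step: order by count
          .take(5))                                   # action: triggers execution
print(top5)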
13.2 Performance
[Source: Daytona GraySort benchmark, sortbenchmark.org]
13.2 Apache Spark
• Open source processing engine.
• Originally developed at UC Berkeley in 2009.
• More than 100 operators for transforming data.
• World record for large-scale on disk sorting.
• Built-in support for many data sources (HDFS, RDBMS, S3, Cassandra)
[Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010). Spark: Cluster Computing with Working Sets. In HotCloud’10: 2nd USENIX Conference on Hot Topics in Cloud Computing.]
[Zaharia, M., Chowdhury, M., Das, T., Dave, A., et al. (2012). Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In NSDI’12: 9th USENIX Conference on Networked Systems Design and Implementation.]
13.2 Spark Tools
13.2 Resilient Distributed Datasets (RDDs)
Write programs in terms of distributed datasets and operations on them
Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on disk.
• Built through parallel transformations
• Automatically rebuilt on failure
Operations
• Transformations (e.g. map, filter, groupBy).
• Actions (e.g. count, collect, save)
13.2 Working with RDDs
13.2 Spark and Map Reduce Differences

                           Hadoop Map Reduce     Spark
Storage                    Disk only             In-memory or on disk
Operations                 Map and Reduce        Map, Reduce, Join, Sample, etc.
Execution model            Batch                 Batch, interactive, streaming
Programming environments   Java                  Scala, Java, R, and Python
Sample case
13.2 Log mining
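The original slides build this example up step by step as figures; a hedged PySpark sketch of the log-mining pattern they illustrate (the SparkContext sc, file path, and search terms are assumptions) could look like this:

lines    = sc.textFile("hdfs://.../server.log")
errors   = lines.filter(lambda line: line.startswith("ERROR"))   # transformation (lazy)
messages = errors.map(lambda line: line.split("\t")[1])
messages.cache()             # keep the filtered dataset in cluster memory

# Actions trigger execution; later queries are served from the cached RDD.
print(messages.filter(lambda m: "mysql" in m).count())
print(messages.filter(lambda m: "php" in m).count())

Because the filtered messages are cached after the first action, subsequent queries run at memory speed and the full log file is read from disk only once.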