Christoph Lofi José Pinto
Christian Nieke
Institut für Informationssysteme
Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de
Distributed Data Management
14.0 The Cloud
14.1 Cloud beyond Storage
14.2 Computing as a Service
– SaaS
– PaaS
– IaaS
14.1 The Cloud
• The term “cloud computing” is often seen as a successor of client-server architectures
– Often used as a synonym for centralized, on-demand, pay-what-you-use provisioning of general computation resources
• e.g. compared to utility providers like electric power grids or water supply
• “Computing as a commodity”
– “Cloud” is used as a metaphor for the Internet
• Users or applications “just use” computation resources provided on the internet instead of using local hardware or software
14.1 The Cloud
• “Computation resources” can mean a lot of things:
• Dynamic access to “raw metal”
– Raw storage space or CPU time
– Fully operational servers are provided by the cloud
• Low-level services and platforms
– e.g. runtime platforms like the Java JRE
» Users can run applications directly on the cloud platform
» No own servers or platform software needed
– e.g. abstracted storage space like space within a database or a file system
» This is what we did in the last weeks!
14.1 The Cloud
• Software services
– i.e. some functionality required by user software is provided “by the cloud”
» Used via web service remote procedure calls
» e.g. delegate the rendering of a map in a user application to Google Maps
• Full software functionality
– e.g. rented web applications replacing traditional server or desktop applications
» e.g. rent CRM software online from SalesForce, use Google Apps instead of MS Office, etc.
14.1 The Cloud
• Underlying base problem
– Successfully running IT departments and IT infrastructure can be very difficult and expensive for companies
– High fixed costs
• Acquiring and paying competent IT staff
– “Competent” is often very hard to get…
• Buying and maintaining servers
• Correctly hosting hardware
– Proper power and cooling facilities, network connections, server racks, etc.
• Buying and maintaining software
14.1 The Cloud
– Load and Utilization Issues
• How many hardware resources are required by each application and / or service?
• How to handle scaling issues?
– What happens if demand increases or declines?
– How to handle spike loads?
– “Digg Effect”
• Traditional data centers are notoriously underutilized, often idle 85% of the time
– Over-provisioning for future growth or spikes
– Insufficient capacity planning and sizing
– Improper understanding of scalability requirements, etc.
14.1 The Cloud
• Cloud computing centrally unifies computation resources and provides them on-demand
– Degree of centralization and provision may differ
• Centralize hardware within a department? A company? A number of companies? Globally?
• Provide resources only to oneself? To some partners? To anybody?
• How to compensate for resource usage?
– Provide resources by a rental model (e.g. monthly fee)?
– Provide resources metered on what-is-used basis (e.g. similar to electricity or water?)
– Provide resources for free?
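The rental and metered models above differ mainly in when each one is cheaper. A tiny break-even sketch (illustrative prices, not from any real provider):

```python
# Break-even between a flat monthly rental and metered pay-per-use.
# Both prices are made-up illustrative values.
FLAT_MONTHLY = 50.0   # fixed fee, unlimited use
METERED_RATE = 0.10   # per hour of use

def cheaper_model(hours_per_month: float) -> str:
    """Return which billing model is cheaper for the given usage."""
    metered = hours_per_month * METERED_RATE
    return "metered" if metered < FLAT_MONTHLY else "flat"

print(cheaper_model(100))  # light use favors metering
print(cheaper_model(720))  # an always-on server favors the flat fee
```

Light users benefit from metering; heavy, constant users are better off with a flat rental, which is why providers typically offer both.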
14.1 The Cloud
• Usually, three types of clouds are distinguished
– Public Cloud
– Private Cloud
– Hybrid Cloud
14.1 The Cloud
– Public Cloud
• “Traditional” cloud computing
• Services and resources are offered via the internet to anybody willing to pay for them
– User just pays for services, usually no acquisition, administration or maintenance of hardware / software necessary
• Services usually provided by off-site 3rd party providers
– Open for use by general public
• Exist beyond firewall, fully hosted and managed by the vendor
• Customers are individuals, corporations and others
• e.g. Amazon's Web Services and Google AppEngine
• Offers startups and SMBs quick setup, scalability, flexibility and automated management. The pay-as-you-go model helps startups to start small and grow big
– Security and compliance?
– Reliability and privacy concerns hinder the adoption of the cloud
• Amazon S3 services were down for 6 hours in 2010
• What will Amazon do with all the data?
14.1 The Cloud
– Private Cloud
• Cloud computing hardware is located on the premises of a company behind the corporate firewall
• Resources are only provided internally to various departments
• Private clouds are still fully bought, built, and maintained by the company using them
– But usually not exclusive to single departments!
– Still, costs can be prohibitive and may far exceed those of public clouds
• Fine grained control over resources
• More secure as they are internal to the organization
• Schedule and reshuffle resources based on business demands
• Ideal for apps with tight security or regulatory requirements
• Development requires hardware investments and in-house expertise
14.1 The Cloud
– Hybrid Cloud
• Both private and public cloud services or even non-cloud services are used or offered simultaneously
• “State of the art” for most companies relying on cloud technology
14.1 The Cloud
• Properties promised by Cloud computing
– Agility
• Resources are quickly available when needed
– i.e. no servers need to be ordered and built, no software needs to be configured and installed, etc.
– Costs
• Capital expenditure is converted to operational expenditure
– Independence
• Services are available everywhere and for any device
14.1 The Cloud
– Multi-tenancy
• Resources are shared by a larger pool of users
• Resources can be centralized which reduces the costs
• Load distribution of users differs
– Peak loads can usually be distributed
– Overall utilization and efficiency of resources is better
– Reliability
• Most cloud services promise durable and reliable resources due to distribution and replication
– Scalability
• If a user needs more resources or performance, they can easily be provisioned
14.1 The Cloud
– Low maintenance
• Cloud services or applications are not installed on user’s machines, but maintained centrally by specialized staff
– Transparency and metering
• Costs for computation resources are directly visible and transparent
• “Pay-what-you-use” models
• Cloud computing generally promises to be beneficial for fast growing startups, SMBs and enterprises alike.
– Cost-effective solutions to key business demands
– Improved overall efficiency
14.1 The Cloud
• The cloud heavily encourages a self-service model
– Users can simply request the resources they need
14.1 The Cloud
• Anything-as-a-Service
– XaaS = “X as a service”
– In general, cloud providers offer any computation resources “as a service”
– In the long run, all computation needs of a company should be modeled, provided and used as a service
• e.g. in Amazon’s private and public cloud infrastructures: everything is a service!
14.2 XaaS
– Services provide a strictly defined functionality with certain guarantees
• Service description and service-level agreement (SLA)
• The service description explains what is offered by the service
• SLA further clarifies the provisioning guarantees
– Often: performance, latency, reliability, availability, etc.
14.2 XaaS
• Usually, three main resources may be offered “as a service”
– Software as a Service
• SaaS
– Platform as a Service
• PaaS
– Infrastructure as a Service
• IaaS
14.2 XaaS
(Figure: service stack — Client on top of Application, Platform, Infrastructure, and Server layers)
• Application Services (services on demand)
– Gmail, Google Calendar
– Payroll, HR, CRM, etc.
– Sugar CRM, IBM Lotus Live
• Platform Services (resources on demand)
– Middleware, integration, messaging, information, connectivity, etc.
– Amazon AWS, Boomi, CastIron, Google AppEngine
• Infrastructure as services (physical assets as services)
– IBM Blue House, VMWare Cloud Edition, Amazon EC2, Microsoft Azure Platform, …
14.2 XaaS
(Figure: cloud overview — individuals, corporations, and non-commercial users access the cloud through middleware that handles storage, OS, network, and service provisioning, plus SLA monitoring, security, billing, and payment, on top of services, storage, network, and OS resources)
• Infrastructure as a Service (IaaS)
– Provides raw computation infrastructure, i.e. usually a virtual server
• e.g. see hardware virtualization (VMWare & co.)
• Successor to dedicated server rental
– For the user, a virtual server is similar to a real server
• Has CPU cores, main memory, hard disk space, etc.
• Usually provided as a “self-service” raw machine
• The user is responsible for installing and maintaining software such as the operating system, databases, or server applications
• User does not need to buy, host or maintain the actual hardware
14.2 IaaS
• The IaaS provider can host multiple virtual servers on a single, real machine
– Often, 10-30 virtual servers per real server
– Virtualization is used to abstract server hardware for virtual servers
• Virtual systems are also often called virtual machines (neutral term) or appliances (usually suggesting a preinstalled OS and software)
– Virtualization of hardware is usually handled by a so-called hypervisor,
• e.g. Xen, KVM, VMWare, HyperV, …
14.2 IaaS
• In short, IaaS is virtualization on multiple hardware machines
– Normal Server
• 1 machine with one OS
– Traditional virtualization
• 1 machine hosting multiple virtual servers
– Distributed Application
• 1 appliance running on multiple machines
– IaaS
• Multiple machines running multiple virtual servers
• Dynamic load balancing between machines
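The load-balancing decision above — which physical machine hosts which virtual server — can be sketched as a greedy placement. This is a deliberately simplified illustration, not how any real hypervisor or cloud controller works:

```python
# Illustrative greedy VM placement: each VM goes to the machine with
# the most free capacity (a single abstract "capacity unit" model).
def place_vms(machines, vms):
    """machines: dict machine name -> free capacity units.
    vms: list of (vm_name, required units).
    Returns a dict vm_name -> machine name."""
    placement = {}
    free = dict(machines)
    for vm, need in sorted(vms, key=lambda v: -v[1]):  # biggest VMs first
        host = max(free, key=free.get)                 # most free capacity
        if free[host] < need:
            raise RuntimeError(f"no capacity left for {vm}")
        free[host] -= need
        placement[vm] = host
    return placement

machines = {"host1": 16, "host2": 16}
vms = [("web", 4), ("db", 8), ("cache", 2), ("batch", 10)]
print(place_vms(machines, vms))
```

Real schedulers additionally consider RAM, network, disk, anti-affinity rules, and live migration, but the core idea is the same bin-packing problem.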
14.2 IaaS
(Figure: classification by #machines vs. #appliances — “normal” server: 1 machine, 1 appliance; “traditional” virtualization: 1 machine, many appliances; distributed appliance: many machines, 1 appliance; IaaS: many machines, many appliances)
• Hypervisor is responsible for allocating available resources to VMs
– Dispatch VMs to machines
– Relocate VMs to balance load
– Distribute resources
• Network adaptors, logical discs, RAM, CPU cores, etc…
14.2 IaaS
• Usually, virtual machines offered by IaaS infrastructures cannot grow arbitrarily big
– Usually capped by actual server size or a smaller server group
• Really big applications are usually deployed in so- called Pods
– Similar to database shards
– Group of machines running one or multiple appliances
– Machines within a Pod are very tightly networked
14.2 IaaS
– i.e. each Pod is a full copy of given virtual machines with full OS and application installed
• Usually, there are multiple copies of a given Pod (and its VMs)
• Each Pod is responsible for a disjoint part of the whole workload
– Pods are usually scattered across availability zones (e.g. data centers or a certain rack)
• Physically separated, usually with own power / network, etc.
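Scattering Pods over availability zones can be sketched as a round-robin placement, so that no two replicas share a zone until every zone already holds one (a simplified illustration, not a real placement policy):

```python
from itertools import cycle

# Sketch: spread Pod replicas round-robin over availability zones.
def place_replicas(zones, n_replicas):
    """Return the zone for each replica, cycling through the zones."""
    return [zone for zone, _ in zip(cycle(zones), range(n_replicas))]

print(place_replicas(["zone-a", "zone-b", "zone-c"], 4))
```

With three zones and four replicas, the fourth replica is the first one forced to share a zone — losing one zone (power, network) then still leaves replicas elsewhere.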
14.2 IaaS
• IaaS Pods (figure)
14.2 IaaS
– Simplified Pod example: GoogleMail
• Multiple Pods, each Pod running on multiple machines with a full and independent installation of Gmail software
• Load balancer decides during user log-in which Pod will handle the user session
– Users are distributed across Pods
• Pods stay flexible by using the shared GFS file system
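The log-in step above can be sketched as a stable hash from user to Pod, so a user lands on the same Pod across sessions (hypothetical sketch; Google's real balancer also considers load, locality, etc.):

```python
import hashlib

# Hypothetical Pod names for illustration.
PODS = ["pod-a", "pod-b", "pod-c"]

def pod_for_user(user_id: str) -> str:
    """Map a user to a Pod via a stable hash of the user id."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return PODS[int(digest, 16) % len(PODS)]

print(pod_for_user("alice@example.com"))
```

Because the hash is deterministic, the same user is always routed to the same Pod, which keeps that user's session and cached state local to one installation.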
14.2 IaaS
• Mission-critical applications should be designed such that they run in multiple availability zones on multiple Pods
– Cloud control system (CCS) responsible for distribution and replication
14.2 IaaS
• Pod Architectures
– Each pod consists of multiple machines with mainboards, CPUs, and main memory
– Question: where to put secondary storage?
– Usually, three options
• Storage area network (SAN)
• Direct attached storage (DAS)
• Network attached storage (NAS)
– or… a storage service! (e.g. GFS & co.)
14.2 IaaS
• SAN Pods
– Individual servers don’t have their own secondary storage
– The storage area network provides shared hard disk storage for all machines of a Pod
– Pro
• All machines have access to the same data
• Allows for dynamic load balancing or migration of appliances
– e.g. VMware vMotion
– Con
• Very very expensive
• Higher latency than direct attached storage
14.2 IaaS
• SAN Pods (figure)
14.2 IaaS
• DAS Pods
– Each server has its own set of hard drives
– Accessing data from other servers may be difficult
– Pro
• Cheap
• Low latency for accessing local data
– Con
• Usually, no shared data access
• Usually, difficult to live-migrate appliances (due to no shared data)
– But: by using clever storage abstractions, common problems can be circumvented
• Use distributed file system or a distributed data store!
– e.g. Amazon S3 & SimpleDB, Google GFS & BigTable, Apache HBase & HDFS, etc.
14.2 IaaS
• DAS Pods (figure)
14.2 IaaS
• IaaS example: Amazon EC2
– The Elastic Compute Cloud is one of the core services of the Amazon Cloud Infrastructure
• Public IaaS Cloud
– Customers may rent virtual servers hosted in Amazon’s data centers
• Can freely install OS and applications as needed
– Virtual servers are offered in different sizes and are paid by CPU usage
• Basic storage is offered within the VM, but applications usually use additional storage services, which cost extra
– e.g. S3, SimpleDB, or Dynamo DB
14.2 Amazon EC2
• Example: t2.micro
– 1.0 GB memory
– 1 vCPU
• 1 virtual core
• 1 vCPU is roughly one 2.5 GHz Xeon core
– No dedicated storage
• Has to use AWS network storage
– Burstable performance: 6 CPU credits per hour
• 1 CPU credit = 1 minute of full CPU performance
– Costs $0.013 per hour
• $9.30 per month
– Usually many users start with the small instance; also heavily used for testing
14.2 Amazon EC2
• Example: m3.xlarge
– 15 GB memory
– 4 vCPU units
• Total of 13 ECU (Elastic Compute Units)
• 1 ECU is roughly equal to a 1.5 GHz Xeon core
– 80 GB instance storage on SSD
• More storage via AWS
– Costs $0.28 per hour
• $201 per month
14.2 Amazon EC2
• Example: i2.8xlarge
– 244 GB of memory
– 32 vCPU
• Total of 104 ECU units
– 6400 GB of instance storage on SSD
– Costs $6.82 per hour
• $4910 per month
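The monthly figures on the three EC2 example slides follow directly from the hourly rates, approximating a month as 720 hours (30 days):

```python
HOURS_PER_MONTH = 24 * 30  # rough approximation of one month

def monthly_cost(hourly_rate: float) -> float:
    """Convert an hourly instance price to an approximate monthly price."""
    return round(hourly_rate * HOURS_PER_MONTH, 2)

print(monthly_cost(0.013))  # t2.micro   -> about $9.36
print(monthly_cost(0.28))   # m3.xlarge  -> about $201.60
print(monthly_cost(6.82))   # i2.8xlarge -> about $4910.40

# t2.micro burst budget: 6 CPU credits per hour accrue to
# 6 * 24 = 144 minutes of full CPU performance per day.
print(6 * 24)
```

The small rounding differences against the slides' figures come from how many hours one counts per month.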
14.2 Amazon EC2
• Rough Estimations (Oct 2009)
– Roughly 40,000 servers
– Uses standard server racks with 16 machines per rack
• Mostly packed with 2U dual-socket Quad-Core Intel Xeons
– Roughly matches the High-Mem Quad XL instance…
– Uses around 8× 500 GB RAID-0 disks
– Target cost around $2,500 per machine on average
– 75% of the machines are in the US, the remainder in Europe and Asia
– Amazon aims at a utilization rate of 75%
– Very rough guesses state that Amazon may earn $25,264 per hour with EC2!
• http://cloudscaling.com/blog/cloud-computing/amazons-ec2-generating-220m-annually
14.2 Amazon EC2
• Platform as a Service (PaaS)
– Provides software platforms on demand
• e.g. runtime engines (Java VM, .NET runtime, etc.), storage systems (distributed file systems or databases), web services, communication services, etc.
– PaaS systems are usually used to develop and host web applications or web services
• User applications run on the provided platform
– In contrast to IaaS, no installation and maintenance of the operating system and server applications is necessary
• Centrally managed and maintained
• Services or runtimes are directly usable
14.2 PaaS
• Google AppEngine provides users with a managed Python or Java runtime
– Web applications can be directly hosted in AppEngine
• Just upload your WAR file and you are done…
– Users are billed by resource usage
• Some free resources provided every day
– 1 GB in- and out traffic, 6.5 hours CPU, 500 MB storage overall
14.2 Google AppEngine
Resource            | Unit       | Unit cost
Outgoing bandwidth  | GB         | $0.12
Incoming bandwidth  | GB         | $0.10
CPU time            | CPU hours  | $0.10
Stored data         | GB / month | $0.15
• Each application can access system resources up to a fixed maximum
– AppEngine is not fully scalable!
– AppEngine max values (2010)
• CPU: 1730 hours CPU per day; 72 minutes CPU per minute
• Data in or out: 1 TB per day; 10 GB per minute
• Requests: 43M web service calls per day, 30K calls per minute
• Data storage: no limit (uses BigTable which can scale in size!!)
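Given the rate table and the free daily quota above, a day's bill can be estimated as follows. This is a simplified sketch: it assumes 1 GB of free traffic each way, and real AppEngine billing had more quota categories:

```python
# Simplified daily billing sketch using the AppEngine rates above.
RATES = {"out_gb": 0.12, "in_gb": 0.10, "cpu_h": 0.10}   # unit costs
FREE  = {"out_gb": 1.0,  "in_gb": 1.0,  "cpu_h": 6.5}    # free per day

def daily_bill(usage: dict) -> float:
    """Charge only the usage exceeding the free daily quota."""
    total = 0.0
    for key, rate in RATES.items():
        billable = max(0.0, usage.get(key, 0.0) - FREE[key])
        total += billable * rate
    return round(total, 2)

# e.g. 5 GB out, 3 GB in, 20 CPU hours in one day:
print(daily_bill({"out_gb": 5, "in_gb": 3, "cpu_h": 20}))
```

For that example: (5−1)·$0.12 + (3−1)·$0.10 + (20−6.5)·$0.10 = $2.03 for the day.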
14.2 Google AppEngine
• Amazon SimpleDB is a data storage system roughly similar to Google BigTable
– http://aws.amazon.com/simpledb
– Simple table-centric database engine
• SimpleDB is directly ready to use
– No user configuration or administration
– Accessible via web service
• SimpleDB is highly available, uses flexible schemas, and eventual consistency
– Similar to HBase or BigTable
14.2 Amazon SimpleDB
– Any application may use SimpleDB for data storage
• Simple web service provided to interact with Simple DB
• Create or delete a table (called domain)
• Put and delete rows
• Query for rows
– Users pay for storage, data transfer, and computation time
• 25 hours of computation time (for querying) are free per month
– Beyond that: $0.154 per machine hour in 2009; $0.140 in 2014
• 1 GB of data transfer is free per month
– Beyond that: $0.15 per GB in 2009; $0.12 in 2014
• 1 GB of data storage is free per month
– Beyond that: $0.28 per GB in 2009; $0.25 in 2014
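The domain/item/attribute model described above can be illustrated with a tiny in-memory mock. This is not the real SimpleDB web service API — just a sketch of its data model (flexible schema, no fixed columns):

```python
# Minimal in-memory mock of SimpleDB's data model: domains contain
# items, and each item holds attribute name -> value pairs.
class MockSimpleDB:
    def __init__(self):
        self.domains = {}

    def create_domain(self, name):
        self.domains.setdefault(name, {})

    def put(self, domain, item, attributes):
        self.domains[domain].setdefault(item, {}).update(attributes)

    def delete(self, domain, item):
        self.domains[domain].pop(item, None)

    def query(self, domain, attr, value):
        # Return names of items whose attribute matches (simplified select).
        return [i for i, attrs in self.domains[domain].items()
                if attrs.get(attr) == value]

db = MockSimpleDB()
db.create_domain("users")
db.put("users", "u1", {"name": "Ada", "city": "Braunschweig"})
db.put("users", "u2", {"name": "Bob", "city": "Braunschweig"})
print(db.query("users", "city", "Braunschweig"))
```

Note that items in a domain need not share the same attributes — that is the "flexible schema" property the slides mention.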
14.2 Amazon Simple DB
• Software as a Service (SaaS)
– Full applications are offered on-demand
• Users just need to consume the software; no installation or maintenance necessary
– All administrative and maintenance tasks are performed by the Cloud provider
• e.g. hosting physical hardware, maintaining platforms, maintaining software, dealing with security, scalability, etc.
14.2 SaaS
• Salesforce.com On-Demand CRM software
– Customer-Relationship-Management
• Cooperation with Google Apps in early summer
– Provides simple online services for
• Customer database
• Lead management
• Call center
• Customer portal
• Knowledge Bases
• Collaboration environments
• Etc.
14.2 SalesForce
• Bills per month and user, based on edition
14.2 SalesForce
• Google Apps
– Provides standard office applications on-demand
• i.e. targeting the lower end of the customer base of Microsoft Office
– MS counters with Office 365
– Google Apps provides
• Email & Groupware
• Spreadsheets
• Documents
• Presentations
• Online Forms
• Drawings
• etc.
14.2 Google Apps
Grid Computing at CERN
Christian Nieke CERN IT-DSS-DT IfIS Braunschweig
• European Organization for Nuclear Research
– Running the Large Hadron Collider (LHC)
– A proton-proton collider to create short-lived exotic particles
CERN
Data Taken by Experiment Detectors
Reconstruction
• Turn RAW data into physics events
• Easy? (Figure: reconstructed tracks)
• Not THAT easy actually (Figure: RAW data)
• Comparing to the model
– Simulated architecture of the detector
• Very complex, high precision
• Every sensor, wall and bolt, with their density and material properties
• Up to 10 µm precision
– Monte-Carlo Simulation
• Create random particle decays
• Based on probability according to standard model
• Simulate sensor responses
Simulation
Data Acquisition
• Distributed Tier Architecture
Processing in the Grid
Tier-0 (CERN):
•Data recording
•Initial data reconstruction
•Data distribution
Tier-1 (12 centres + Russia):
• Permanent storage
• Re-processing
• Analysis
Tier-2 (~140 centres):
• Simulation
• End-user analysis
• ~ 160 sites, 35 countries
• 300,000 cores
• 200 PB of storage
• 2 Million jobs/day
• 10 Gbps links
• Embarrassingly Parallel
– One event = one collision of bunches of protons
– The next event is independent
– One event is about 8MB
We have a lot of very tiny packages of data
Easy to distribute to several (virtual) machines
Processing Event Data
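Because each ~8 MB event is independent, processing parallelizes trivially. A minimal sketch with Python's multiprocessing — illustrative only, not CERN's actual framework, and `reconstruct()` is a made-up stand-in for real physics code:

```python
from multiprocessing import Pool

# Each event is independent, so reconstruction is a plain parallel map.
def reconstruct(event):
    """Stand-in for turning RAW event data into physics objects."""
    return sum(event) % 1000  # placeholder computation

if __name__ == "__main__":
    # Fake "events": small lists standing in for 8 MB data packages.
    events = [list(range(i, i + 8)) for i in range(100)]
    with Pool(4) as pool:                 # 4 worker processes
        results = pool.map(reconstruct, events)
    print(len(results))
```

No worker ever needs another worker's data, which is exactly what "embarrassingly parallel" means and why the grid can distribute events freely across (virtual) machines.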
• Simple Approach for a simple problem
– Multicore hypervisors run one virtual machine per core
– Scheduler starts a pilot job
• Load image of OS, libraries, configurations (remote storage, security tokens)
• Load shared data sets (e.g. detector geometry, several GB)
• Fetch job requests, load specific data, run job, repeat
Batch Processing
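The pilot-job steps above can be sketched as a simple loop. The function names here are made up for illustration; real grid pilots are provided by the experiments' frameworks:

```python
# Hypothetical pilot-job sketch: boot once (OS image, libraries,
# shared data sets), then repeatedly fetch and run job requests.
def run_pilot(boot, fetch_job, run_job):
    boot()                 # one-time setup of the worker VM
    done = 0
    while True:
        job = fetch_job()  # ask the scheduler for the next job request
        if job is None:
            break          # no more work (simplified termination)
        run_job(job)       # load job-specific data and process it
        done += 1
    return done

# Usage with toy callbacks and an in-memory job queue:
queue = [{"id": 1}, {"id": 2}]
print(run_pilot(lambda: None,
                lambda: queue.pop(0) if queue else None,
                lambda job: None))
```

The point of the pattern is amortization: the expensive boot (OS image, several GB of detector geometry) happens once, after which the pilot cheaply processes many small jobs.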
• Under consideration, but:
– Network storage services
• Security
• Specific configurations
• Payment for transfers into / out of the cloud
– Costs
• CERN is still cheaper than Amazon (non-profit)
• But the gap is closing
– Political reasons
• Sites provide resources to “buy into” the CERN cooperation
Why not a Cloud Provider?
• Traditional Approach
– Archiving on tape
• Low cost (medium, energy, fault tolerance)
– Online data in distributed disk storage
• EOS distributed file system
– Large namespace (500GB in memory) – Security and authentication
– Interfaces to FUSE, http, WebDAV, …
– Based on the xRoot transport protocol for redirection, failover, locality-awareness, …
– Object Stores coming up
Storage
• What do you pay?
– CERN (and other sites) provide computing resources to the experiments
– Payment per CPU second
• But not every CPU second is worth the same!
• CPU seconds are scaled by performance of the computing node
Accounting
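The scaling rule above — CPU seconds weighted by node performance — can be expressed in a few lines. The benchmark scores below are made-up illustrative values, not real HepSpec06 results:

```python
# Accounting sketch: raw CPU seconds scaled by a per-node benchmark
# factor, so faster nodes are credited more per wall-clock second.
NODE_SCORE = {"old-node": 8.0, "new-node": 16.0}  # illustrative scores

def accounted_seconds(node: str, raw_seconds: float) -> float:
    """Scale raw CPU seconds by the node's benchmark score."""
    return raw_seconds * NODE_SCORE[node]

# The same wall-clock hour is worth twice as much on the faster node:
print(accounted_seconds("old-node", 3600))
print(accounted_seconds("new-node", 3600))
```

This is why the benchmark must stay current: if a node's score is stale, every second it delivers is mis-priced.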
• Active Benchmark: HepSpec06
– Intended to represent typical experiment workload
– Expensive to perform
• ~8 hours on every machine
– Requires empty hypervisor
• Test once at commissioning
Benchmark is sometimes not up to date with actual configuration
Benchmarking
• Passive Benchmark
– Use the actual workload as benchmark
– “For free” from existing logs
– Can be repeated at any time
– BUT: Requires a minimal amount of observed jobs
• Cold start problem
Benchmarking
• Simple in Theory
– Embarrassingly parallel problem
– Mature technologies
• Tapes, disks, virtual machines, …
• But the Devil is in the Details…
– Accounting
– Politics
– Security
Summary
• Deductive Databases
• Information Retrieval
• Seminar: Linked Open Data
Next Semester
Distributed Data Management
Thank you for your attention!