Christoph Lofi José Pinto
Christian Nieke
Institut für Informationssysteme
Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de
Distributed Data Management
14.0 The Cloud
14.1 Cloud beyond Storage
14.2 Computing as a Service
– SaaS
– PaaS
– IaaS
14.1 The Cloud
• The term “cloud computing” is often seen as a successor of client-server architectures
– Often used as a synonym for centralized, on-demand, pay-what-you-use provisioning of general computation resources
• e.g. compared to utility providers like electric power grids or water supply
• “Computing as a commodity”
– “Cloud” is used as a metaphor for the Internet
• Users or applications “just use” computation resources provided on the internet instead of using local hardware or software
14.1 The Cloud
• “Computation resources” can mean a lot of things:
• Dynamic access to “raw metal”
– Raw storage space or CPU time
– Fully operational servers are provided by the cloud
• Low-level services and platforms
– e.g. runtime platforms like the Java JRE
» Users can run applications directly on the cloud platform
» No own servers or platform software needed
– e.g. abstracted storage space like space within a database or a file system
» This is what we did in the last weeks!
14.1 The Cloud
• Software services
– i.e. some functionality required by user software is provided “by the cloud”
» Used via web service remote procedure calls
» e.g. delegate the rendering of a map in a user application to Google Maps
• Full software functionality
– e.g. rented web applications replacing traditional server or desktop applications
» e.g. rent CRM software online from SalesForce, use Google Apps instead of MS Office, etc.
14.1 The Cloud
• Underlying base problem
– Successfully running IT departments and IT infrastructure can be very difficult and expensive for companies
– High fixed costs
• Acquiring and paying competent IT staff
– “Competent” is often very hard to get…
• Buying and maintaining servers
• Correctly hosting hardware
– Proper power and cooling facilities, network connections, server racks, etc.
• Buying and maintaining software
14.1 The Cloud
– Load and Utilization Issues
• How many hardware resources are required by each application and / or service?
• How to handle scaling issues?
– What happens if demand increases or declines?
– How to handle spike loads?
– “Digg Effect”
• Traditional data centers are notoriously underutilized, often idle 85% of the time
– Over-provisioning for future growth or spikes
– Insufficient capacity planning and sizing
– Improper understanding of scalability requirements, etc.
14.1 The Cloud
• Cloud computing centrally unifies computation resources and provides them on-demand
– Degree of centralization and provision may differ
• Centralize hardware within a department? A company? A number of companies? Globally?
• Provide resources only to oneself? To some partners? To anybody?
• How to compensate for resource usage?
– Provide resources by a rental model (e.g. monthly fee)?
– Provide resources metered on what-is-used basis (e.g. similar to electricity or water?)
– Provide resources for free?
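The rental and metered models above differ mainly in when each one is cheaper. A tiny break-even sketch (illustrative prices, not from any real provider):

```python
# Break-even between a flat monthly rental and metered pay-per-use.
# Both prices are made-up illustrative values.
FLAT_MONTHLY = 50.0   # fixed fee, unlimited use
METERED_RATE = 0.10   # per hour of use

def cheaper_model(hours_per_month: float) -> str:
    """Return which billing model is cheaper for the given usage."""
    metered = hours_per_month * METERED_RATE
    return "metered" if metered < FLAT_MONTHLY else "flat"

print(cheaper_model(100))  # light use favors metering
print(cheaper_model(720))  # an always-on server favors the flat fee
```

Light users benefit from metering; heavy, constant users are better off with a flat rental, which is why providers typically offer both.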
14.1 The Cloud
• Usually, three types of clouds are distinguished
– Public Cloud
– Private Cloud
– Hybrid Cloud
14.1 The Cloud
– Public Cloud
• “Traditional” cloud computing
• Services and resources are offered via the internet to anybody willing to pay for them
– User just pays for services, usually no acquisition, administration or maintenance of hardware / software necessary
• Services usually provided by off-site 3rd party providers
– Open for use by general public
• Exist beyond firewall, fully hosted and managed by the vendor
• Customers are individuals, corporations and others
• e.g. Amazon's Web Services and Google AppEngine
• Offers startups and SMBs quick setup, scalability, flexibility and automated management. The pay-as-you-go model helps startups to start small and grow big
– Security and compliance?
– Reliability and privacy concerns hinder the adoption of the cloud
• Amazon S3 services were down for 6 hours in 2010
• What will Amazon do with all the data?
14.1 The Cloud
– Private Cloud
• Cloud computing hardware is located on the premises of a company behind the corporate firewall
• Resources are only provided internally to various departments
• Private clouds are still fully bought, built, and maintained by the company using them
– But usually not exclusive to single departments!
– Still, costs can be prohibitive and may far exceed those of public clouds
• Fine grained control over resources
• More secure as they are internal to the organization
• Schedule and reshuffle resources based on business demands
• Ideal for apps with tight security or regulatory requirements
• Development requires hardware investments and in-house expertise
14.1 The Cloud
– Hybrid Cloud
• Both private and public cloud services or even non-cloud services are used or offered simultaneously
• “State of the art” for most companies relying on cloud technology
14.1 The Cloud
• Properties promised by Cloud computing
– Agility
• Resources are quickly available when needed
– i.e. no servers need to be ordered and built, no software needs to be configured and installed, etc.
– Costs
• Capital expenditure is converted to operational expenditure
– Independence
• Services are available everywhere and for any device
14.1 The Cloud
– Multi-tenancy
• Resources are shared by a larger pool of users
• Resources can be centralized which reduces the costs
• Load distribution of users differs
– Peak loads can usually be distributed
– Overall utilization and efficiency of resources is better
– Reliability
• Most cloud services promise durable and reliable resources due to distribution and replication
– Scalability
• If a user needs more resources or performance, they can easily be provisioned
14.1 The Cloud
– Low maintenance
• Cloud services or applications are not installed on user’s machines, but maintained centrally by specialized staff
– Transparency and metering
• Costs for computation resources are directly visible and transparent
• “Pay-what-you-use” models
• Cloud computing generally promises to be beneficial for fast growing startups, SMBs and enterprises alike.
– Cost-effective solutions to key business demands
– Improved overall efficiency
14.1 The Cloud
• The cloud heavily encourages a self-service model
– Users can simply request the resources they need
14.1 The Cloud
• Anything-as-a-Service
– XaaS = “X as a service”
– In general, cloud providers offer any computation resources “as a service”
– In the long run, all computation needs of a company should be modeled, provided and used as a service
• e.g. in Amazon’s private and public cloud infrastructures: everything is a service!
14.2 XaaS
– Services provide a strictly defined functionality with certain guarantees
• Service description and service-level agreement (SLA)
• The service description explains what is offered by the service
• SLA further clarifies the provisioning guarantees
– Often: performance, latency, reliability, availability, etc.
14.2 XaaS
• Usually, three main resources may be offered “as a service”
– Software as a Service
• SaaS
– Platform as a Service
• PaaS
– Infrastructure as a Service
• IaaS
14.2 XaaS
(Figure: service stack — Client on top of Application, Platform, Infrastructure, and Server layers)
• Application Services (services on demand)
– Gmail, Google Calendar
– Payroll, HR, CRM, etc.
– Sugar CRM, IBM Lotus Live
• Platform Services (resources on demand)
– Middleware, integration, messaging, information, connectivity, etc.
– Amazon AWS, Boomi, CastIron, Google AppEngine
• Infrastructure as services (physical assets as services)
– IBM Blue House, VMWare Cloud Edition, Amazon EC2, Microsoft Azure Platform, …
14.2 XaaS
(Figure: cloud overview — individuals, corporations, and non-commercial users access the cloud through middleware that handles storage, OS, network, and service provisioning, plus SLA monitoring, security, billing, and payment, on top of services, storage, network, and OS resources)
• Infrastructure as a Service (IaaS)
– Provides raw computation infrastructure, i.e. usually a virtual server
• e.g. see hardware virtualization (VMWare & co.)
• Successor to dedicated server rental
– For the user, a virtual server is similar to a real server
• Has CPU cores, main memory, hard disk space, etc.
• Usually provided as a “self-service” raw machine
• The user is responsible for installing and maintaining software such as the operating system, databases, or server applications
• User does not need to buy, host or maintain the actual hardware
14.2 IaaS
• The IaaS provider can host multiple virtual servers on a single, real machine
– Often, 10-30 virtual servers per real server
– Virtualization is used to abstract server hardware for virtual servers
• Virtual systems are also often called virtual machines (neutral term) or appliances (usually suggesting a preinstalled OS and software)
– Virtualization of hardware is usually handled by a so-called hypervisor,
• e.g. Xen, KVM, VMWare, HyperV, …
14.2 IaaS
• In short, IaaS is virtualization on multiple hardware machines
– Normal Server
• 1 machine with one OS
– Traditional virtualization
• 1 machine hosting multiple virtual servers
– Distributed Application
• 1 appliance running on multiple machines
– IaaS
• Multiple machines running multiple virtual servers
• Dynamic load balancing between machines
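The load-balancing decision above — which physical machine hosts which virtual server — can be sketched as a greedy placement. This is a deliberately simplified illustration, not how any real hypervisor or cloud controller works:

```python
# Illustrative greedy VM placement: each VM goes to the machine with
# the most free capacity (a single abstract "capacity unit" model).
def place_vms(machines, vms):
    """machines: dict machine name -> free capacity units.
    vms: list of (vm_name, required units).
    Returns a dict vm_name -> machine name."""
    placement = {}
    free = dict(machines)
    for vm, need in sorted(vms, key=lambda v: -v[1]):  # biggest VMs first
        host = max(free, key=free.get)                 # most free capacity
        if free[host] < need:
            raise RuntimeError(f"no capacity left for {vm}")
        free[host] -= need
        placement[vm] = host
    return placement

machines = {"host1": 16, "host2": 16}
vms = [("web", 4), ("db", 8), ("cache", 2), ("batch", 10)]
print(place_vms(machines, vms))
```

Real schedulers additionally consider RAM, network, disk, anti-affinity rules, and live migration, but the core idea is the same bin-packing problem.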
14.2 IaaS
(Figure: classification by #machines vs. #appliances — “normal” server: 1 machine, 1 appliance; “traditional” virtualization: 1 machine, many appliances; distributed appliance: many machines, 1 appliance; IaaS: many machines, many appliances)
• Hypervisor is responsible for allocating available resources to VMs
– Dispatch VMs to machines
– Relocate VMs to balance load
– Distribute resources
• Network adaptors, logical discs, RAM, CPU cores, etc…
14.2 IaaS
• Usually, virtual machines offered by IaaS infrastructures cannot grow arbitrarily big
– Usually capped by actual server size or a smaller server group
• Really big applications are usually deployed in so- called Pods
– Similar to database shards
– Group of machines running one or multiple appliances
– Machines within a Pod are very tightly networked
14.2 IaaS
– i.e. each Pod is a full copy of given virtual machines with full OS and application installed
• Usually, there are multiple copies of a given Pod (and its VMs)
• Each Pod is responsible for a disjoint part of the whole workload
– Pods are usually scattered across availability zones (e.g. data centers or a certain rack)
• Physically separated, usually with own power / network, etc.
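Scattering Pods over availability zones can be sketched as a round-robin placement, so that no two replicas share a zone until every zone already holds one (a simplified illustration, not a real placement policy):

```python
from itertools import cycle

# Sketch: spread Pod replicas round-robin over availability zones.
def place_replicas(zones, n_replicas):
    """Return the zone for each replica, cycling through the zones."""
    return [zone for zone, _ in zip(cycle(zones), range(n_replicas))]

print(place_replicas(["zone-a", "zone-b", "zone-c"], 4))
```

With three zones and four replicas, the fourth replica is the first one forced to share a zone — losing one zone (power, network) then still leaves replicas elsewhere.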
14.2 IaaS
• IaaS Pods (figure)
14.2 IaaS
– Simplified Pod example: GoogleMail
• Multiple Pods, each Pod running on multiple machines with a full and independent installation of Gmail software
• Load balancer decides during user log-in which Pod will handle the user session
– Users are distributed across Pods
• Pods stay flexible by using the shared GFS file system
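The log-in step above can be sketched as a stable hash from user to Pod, so a user lands on the same Pod across sessions (hypothetical sketch; Google's real balancer also considers load, locality, etc.):

```python
import hashlib

# Hypothetical Pod names for illustration.
PODS = ["pod-a", "pod-b", "pod-c"]

def pod_for_user(user_id: str) -> str:
    """Map a user to a Pod via a stable hash of the user id."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return PODS[int(digest, 16) % len(PODS)]

print(pod_for_user("alice@example.com"))
```

Because the hash is deterministic, the same user is always routed to the same Pod, which keeps that user's session and cached state local to one installation.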
14.2 IaaS
• Mission-critical applications should be designed such that they run in multiple availability zones on multiple Pods
– Cloud control system (CCS) responsible for distribution and replication
14.2 IaaS
• Pod Architectures
– Each pod consists of multiple machines with mainboards, CPUs, and main memory
– Question: where to put secondary storage?
– Usually, three options
• Storage area network (SAN)
• Direct attached storage (DAS)
• Network attached storage (NAS)
– or… a storage service! (e.g. GFS & co.)
14.2 IaaS
• SAN Pods
– Individual servers don’t have their own secondary storage
– The storage area network provides shared hard disk storage for all machines of a Pod
– Pro
• All machines have access to the same data
• Allows for dynamic load balancing or migration of appliances
– e.g. VMware vMotion
– Con
• Very very expensive
• Higher latency than direct attached storage
14.2 IaaS
• SAN Pods (figure)
14.2 IaaS
• DAS Pods
– Each server has its own set of hard drives
– Accessing data from other servers may be difficult
– Pro
• Cheap
• Low latency for accessing local data
– Con
• Usually, no shared data access
• Usually, difficult to live-migrate appliances (due to no shared data)
– But: by using clever storage abstractions, common problems can be circumvented
• Use distributed file system or a distributed data store!
– e.g. Amazon S3 & SimpleDB, Google GFS & BigTable, Apache HBase & HDFS, etc.
14.2 IaaS
• DAS Pods (figure)
14.2 IaaS
• IaaS example: Amazon EC2
– The Elastic Compute Cloud is one of the core services of the Amazon Cloud Infrastructure
• Public IaaS Cloud
– Customers may rent virtual servers hosted in Amazon’s data centers
• Can freely install OS and applications as needed
– Virtual servers are offered in different sizes and are paid by CPU usage
• Basic storage is offered within the VM, but applications usually use additional storage services, which cost extra
– e.g. S3, SimpleDB, or Dynamo DB
14.2 Amazon EC2
• Example: t2.micro
– 1.0 GB memory
– 1 vCPU
• 1 virtual core
• 1 vCPU is roughly one 2.5 GHz Xeon core
– No dedicated storage
• Has to use AWS network storage
– Burstable performance: 6 CPU credits per hour
• 1 CPU credit = 1 minute of full CPU performance
– Costs $0.013 per hour
• $9.30 per month
– Usually many users start with the small instance; also heavily used for testing
14.2 Amazon EC2
• Example: m3.xlarge
– 15 GB memory
– 4 vCPU units
• Total of 13 ECU (Elastic Compute Units)
• 1 ECU is roughly equal to a 1.5 GHz Xeon core
– 80 GB instance storage on SSD
• More storage via AWS
– Costs $0.28 per hour
• $201 per month
14.2 Amazon EC2
• Example: i2.8xlarge
– 244 GB of memory
– 32 vCPU
• Total of 104 ECU units
– 6400 GB of instance storage on SSD
– Costs $6.82 per hour
• $4910 per month
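The monthly figures on the three EC2 example slides follow directly from the hourly rates, approximating a month as 720 hours (30 days):

```python
HOURS_PER_MONTH = 24 * 30  # rough approximation of one month

def monthly_cost(hourly_rate: float) -> float:
    """Convert an hourly instance price to an approximate monthly price."""
    return round(hourly_rate * HOURS_PER_MONTH, 2)

print(monthly_cost(0.013))  # t2.micro   -> about $9.36
print(monthly_cost(0.28))   # m3.xlarge  -> about $201.60
print(monthly_cost(6.82))   # i2.8xlarge -> about $4910.40

# t2.micro burst budget: 6 CPU credits per hour accrue to
# 6 * 24 = 144 minutes of full CPU performance per day.
print(6 * 24)
```

The small rounding differences against the slides' figures come from how many hours one counts per month.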
14.2 Amazon EC2
• Rough Estimations (Oct 2009)
– Roughly 40,000 servers
– Uses standard server racks with 16 machines per rack
• Mostly packed with 2U dual-socket Quad-Core Intel Xeons
– Roughly matches the High-Mem Quad XL instance…
– Uses around 8× 500 GB RAID-0 disks
– Target cost around $2,500 per machine on average
– 75% of the machines are in the US, the remainder in Europe and Asia
– Amazon aims at a utilization rate of 75%
– Very rough guesses state that Amazon may earn $25,264 per hour with EC2!
• http://cloudscaling.com/blog/cloud-computing/amazons-ec2-generating-220m-annually
14.2 Amazon EC2
• Platform as a Service (PaaS)
– Provides software platforms on demand
• e.g. runtime engines (Java VM, .NET runtime, etc.), storage systems (distributed file systems or databases), web services, communication services, etc.
– PaaS systems are usually used to develop and host web applications or web services
• User applications run on the provided platform
– In contrast to IaaS, no installation and maintenance of the operating system and server applications is necessary
• Centrally managed and maintained
• Services or runtimes are directly usable
14.2 PaaS
• Google AppEngine provides users with a managed Python or Java runtime
– Web applications can be directly hosted in AppEngine
• Just upload your WAR file and you are done…
– Users are billed by resource usage
• Some free resources provided every day
– 1 GB in- and out traffic, 6.5 hours CPU, 500 MB storage overall
14.2 Google AppEngine
Resource            | Unit       | Unit cost
Outgoing bandwidth  | GB         | $0.12
Incoming bandwidth  | GB         | $0.10
CPU time            | CPU hours  | $0.10
Stored data         | GB / month | $0.15
• Each application can access system resources up to a fixed maximum
– AppEngine is not fully scalable!
– AppEngine max values (2010)
• CPU: 1730 hours CPU per day; 72 minutes CPU per minute
• Data in or out: 1 TB per day; 10 GB per minute
• Requests: 43M web service calls per day, 30K calls per minute
• Data storage: no limit (uses BigTable which can scale in size!!)
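Given the rate table and the free daily quota above, a day's bill can be estimated as follows. This is a simplified sketch: it assumes 1 GB of free traffic each way, and real AppEngine billing had more quota categories:

```python
# Simplified daily billing sketch using the AppEngine rates above.
RATES = {"out_gb": 0.12, "in_gb": 0.10, "cpu_h": 0.10}   # unit costs
FREE  = {"out_gb": 1.0,  "in_gb": 1.0,  "cpu_h": 6.5}    # free per day

def daily_bill(usage: dict) -> float:
    """Charge only the usage exceeding the free daily quota."""
    total = 0.0
    for key, rate in RATES.items():
        billable = max(0.0, usage.get(key, 0.0) - FREE[key])
        total += billable * rate
    return round(total, 2)

# e.g. 5 GB out, 3 GB in, 20 CPU hours in one day:
print(daily_bill({"out_gb": 5, "in_gb": 3, "cpu_h": 20}))
```

For that example: (5−1)·$0.12 + (3−1)·$0.10 + (20−6.5)·$0.10 = $2.03 for the day.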
14.2 Google AppEngine
• Amazon SimpleDB is a data storage system roughly similar to Google BigTable
– http://aws.amazon.com/simpledb
– Simple table-centric database engine
• SimpleDB is directly ready to use
– No user configuration or administration
– Accessible via web service
• SimpleDB is highly available, uses flexible schemas, and eventual consistency
– Similar to HBase or BigTable
14.2 Amazon SimpleDB
– Any application may use SimpleDB for data storage
• Simple web service provided to interact with Simple DB
• Create or delete a table (called domain)
• Put and delete rows
• Query for rows
– Users pay for storage, data transfer, and computation time
• 25 hours of computation time (for querying) are free per month
– Beyond that: $0.154 per machine hour in 2009; $0.140 in 2014
• 1 GB of data transfer is free per month
– Beyond that: $0.15 per GB in 2009; $0.12 in 2014
• 1 GB of data storage is free per month
– Beyond that: $0.28 per GB in 2009; $0.25 in 2014
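The domain/item/attribute model described above can be illustrated with a tiny in-memory mock. This is not the real SimpleDB web service API — just a sketch of its data model (flexible schema, no fixed columns):

```python
# Minimal in-memory mock of SimpleDB's data model: domains contain
# items, and each item holds attribute name -> value pairs.
class MockSimpleDB:
    def __init__(self):
        self.domains = {}

    def create_domain(self, name):
        self.domains.setdefault(name, {})

    def put(self, domain, item, attributes):
        self.domains[domain].setdefault(item, {}).update(attributes)

    def delete(self, domain, item):
        self.domains[domain].pop(item, None)

    def query(self, domain, attr, value):
        # Return names of items whose attribute matches (simplified select).
        return [i for i, attrs in self.domains[domain].items()
                if attrs.get(attr) == value]

db = MockSimpleDB()
db.create_domain("users")
db.put("users", "u1", {"name": "Ada", "city": "Braunschweig"})
db.put("users", "u2", {"name": "Bob", "city": "Braunschweig"})
print(db.query("users", "city", "Braunschweig"))
```

Note that items in a domain need not share the same attributes — that is the "flexible schema" property the slides mention.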
14.2 Amazon Simple DB
• Software as a Service (SaaS)
– Full applications are offered on-demand
• Users just need to consume the software; no installation or maintenance necessary
– All administrative and maintenance tasks are performed by the Cloud provider
• e.g. hosting physical hardware, maintaining platforms, maintaining software, dealing with security, scalability, etc.
14.2 SaaS
• Salesforce.com On-Demand CRM software
– Customer-Relationship-Management
• Cooperation with Google Apps in early summer
– Provides simple online services for
• Customer database
• Lead management
• Call center
• Customer portal
• Knowledge Bases
• Collaboration environments
• Etc.
14.2 SalesForce
• Bills per month and user, based on edition
14.2 SalesForce
• Google Apps
– Provides standard office applications on-demand
• i.e. targeting the lower end of the customer base of Microsoft Office
– MS counters with Office 365
– Google Apps provides
• Email & Groupware
• Spreadsheets
• Documents
• Presentations
• Online Forms
• Drawings
• etc.
14.2 Google Apps
Grid Computing at CERN
Christian Nieke CERN IT-DSS-DT IfIS Braunschweig
• European Organization for Nuclear Research
– Running the Large Hadron Collider (LHC)
– A proton-proton collider to create short-lived exotic particles
CERN
Data Taken by Experiment Detectors
Reconstruction
• Turn RAW data into physics events
• Easy? (Figure: reconstructed tracks)
• Not THAT easy actually (Figure: RAW data)
• Comparing to the model
– Simulated architecture of the detector
• Very complex, high precision
• Every sensor, wall and bolt, with their density and material properties
• Up to 10 µm precision
– Monte-Carlo Simulation
• Create random particle decays
• Based on probability according to standard model
• Simulate sensor responses
Simulation
Data Acquisition
• Distributed Tier Architecture
Processing in the Grid
Tier-0 (CERN):
•Data recording
•Initial data reconstruction
•Data distribution
Tier-1 (12 centres + Russia):
• Permanent storage
• Re-processing
• Analysis
Tier-2 (~140 centres):
• Simulation
• End-user analysis
• ~ 160 sites, 35 countries
• 300,000 cores
• 200 PB of storage
• 2 Million jobs/day
• 10 Gbps links
• Embarrassingly Parallel
– One event = one collision of bunches of protons
– The next event is independent
– One event is about 8MB
We have a lot of very tiny packages of data
Easy to distribute to several (virtual) machines
Processing Event Data
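Because each ~8 MB event is independent, processing parallelizes trivially. A minimal sketch with Python's multiprocessing — illustrative only, not CERN's actual framework, and `reconstruct()` is a made-up stand-in for real physics code:

```python
from multiprocessing import Pool

# Each event is independent, so reconstruction is a plain parallel map.
def reconstruct(event):
    """Stand-in for turning RAW event data into physics objects."""
    return sum(event) % 1000  # placeholder computation

if __name__ == "__main__":
    # Fake "events": small lists standing in for 8 MB data packages.
    events = [list(range(i, i + 8)) for i in range(100)]
    with Pool(4) as pool:                 # 4 worker processes
        results = pool.map(reconstruct, events)
    print(len(results))
```

No worker ever needs another worker's data, which is exactly what "embarrassingly parallel" means and why the grid can distribute events freely across (virtual) machines.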
• Simple Approach for a simple problem
– Multicore hypervisors run one virtual machine per core
– Scheduler starts a pilot job
• Load image of OS, libraries, configurations (remote storage, security tokens)
• Load shared data sets (e.g. detector geometry, several GB)
• Fetch job requests, load specific data, run job, repeat
Batch Processing
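The pilot-job steps above can be sketched as a simple loop. The function names here are made up for illustration; real grid pilots are provided by the experiments' frameworks:

```python
# Hypothetical pilot-job sketch: boot once (OS image, libraries,
# shared data sets), then repeatedly fetch and run job requests.
def run_pilot(boot, fetch_job, run_job):
    boot()                 # one-time setup of the worker VM
    done = 0
    while True:
        job = fetch_job()  # ask the scheduler for the next job request
        if job is None:
            break          # no more work (simplified termination)
        run_job(job)       # load job-specific data and process it
        done += 1
    return done

# Usage with toy callbacks and an in-memory job queue:
queue = [{"id": 1}, {"id": 2}]
print(run_pilot(lambda: None,
                lambda: queue.pop(0) if queue else None,
                lambda job: None))
```

The point of the pattern is amortization: the expensive boot (OS image, several GB of detector geometry) happens once, after which the pilot cheaply processes many small jobs.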
• Under consideration, but:
– Network storage services
• Security
• Specific configurations
• Payment for transfers into / out of the cloud
– Costs
• CERN is still cheaper than Amazon (non-profit)
• But the gap is closing
– Political reasons
• Sites provide resources to “buy into” the CERN cooperation
Why not a Cloud Provider?
• Traditional Approach
– Archiving on tape
• Low cost (medium, energy, fault tolerance)
– Online data in distributed disk storage
• EOS distributed file system
– Large namespace (500GB in memory) – Security and authentication
– Interfaces to FUSE, http, WebDAV, …
– Based on the xRoot transport protocol for redirection, failover, locality-awareness, …
– Object Stores coming up
Storage
• What do you pay?
– CERN (and other sites) provide computing resources to the experiments
– Payment per CPU second
• But not every CPU second is worth the same!
• CPU seconds are scaled by performance of the computing node
Accounting
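The scaling rule above — CPU seconds weighted by node performance — can be expressed in a few lines. The benchmark scores below are made-up illustrative values, not real HepSpec06 results:

```python
# Accounting sketch: raw CPU seconds scaled by a per-node benchmark
# factor, so faster nodes are credited more per wall-clock second.
NODE_SCORE = {"old-node": 8.0, "new-node": 16.0}  # illustrative scores

def accounted_seconds(node: str, raw_seconds: float) -> float:
    """Scale raw CPU seconds by the node's benchmark score."""
    return raw_seconds * NODE_SCORE[node]

# The same wall-clock hour is worth twice as much on the faster node:
print(accounted_seconds("old-node", 3600))
print(accounted_seconds("new-node", 3600))
```

This is why the benchmark must stay current: if a node's score is stale, every second it delivers is mis-priced.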
• Active Benchmark: HepSpec06
– Intended to represent typical experiment workload
– Expensive to perform
• ~8 hours on every machine
– Requires empty hypervisor
• Test once at commissioning
Benchmark is sometimes not up to date with actual configuration
Benchmarking
• Passive Benchmark
– Use the actual workload as benchmark
– “For free” from existing logs
– Can be repeated at any time
– BUT: Requires a minimal amount of observed jobs
• Cold start problem
Benchmarking
• Simple in Theory
– Embarrassingly parallel problem
– Mature technologies
• Tapes, disks, virtual machines, …
• But the Devil is in the Details…
– Accounting
– Politics
– Security
Summary
• Deductive Databases
• Information Retrieval
• Seminar: Linked Open Data
Next Semester
Distributed Data Management
Thank you for your attention!