Prof. Dr. Wolf-Tilo Balke
Institut für Informationssysteme
Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de
Distributed Data Management
13.1 Map & Reduce
13.2 The Cloud
13.3 Computing as a Service
– SaaS
– PaaS
– IaaS
13.0 The Cloud
• Just storing massive amounts of data is not enough!
– Often, we also need to process and transform that data
• Large-Scale Data Processing
– Use thousands of worker nodes within a computation cluster to process large data batches
• Preferably without the hassle of managing things
• Map & Reduce provides
– Automatic parallelization & distribution
– Fault tolerance
– I/O scheduling
– Monitoring & status updates
13.1 Map & Reduce
• Initially, implemented by Google for building the Google search index
– i.e. crawling the Web, building the inverted word index, computing PageRank, etc.
• General framework for parallel high volume data processing
– J. Dean, S. Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters”, Symp. on Operating Systems Design and Implementation (OSDI), San Francisco, USA, 2004
– Also available as Open Source implementation as part of Apache Hadoop
• http://hadoop.apache.org/mapreduce/
13.1 Map & Reduce
• Base idea
– There is a large amount of input data, identified by a key
• i.e. input given as key-value pairs
• e.g. all web pages of the internet identified by their URL
– A map operation is a simple function which accepts one key-value pair as input
• A map operation runs as autonomous thread on one single node of a cluster
– Many map jobs can run in parallel with different input keys
• Returns for a single input key-value pair a set of intermediate key-value pairs
– map(key, value) → Set of intermediate (key, value)
• After map job is finished, the node is free to perform another map job for the next input key-value pair
– A central controller distributes map jobs to free nodes
13.1 Map & Reduce
– After input data is mapped, reduce jobs can start
– reduce(key, values) is run for each unique key emitted by map()
• Each reduce job is also run autonomously on one single node
– Many reduce jobs can run in parallel on different intermediate key groups
• Reduce emits final output of the map-reduce operation
– Each reduce job…
• Takes all map tuples with a given key as input
• Usually generates one, sometimes more, output tuples
13.1 Map & Reduce
• Each reduce is executed on a set of intermediate map results which have the same key
– To efficiently select this set, the intermediate key-value pairs are usually shuffled
• i.e. sorted and grouped by their respective key
– After shuffling, reduce input data can be selected by a simple range scan
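The shuffle step can be sketched in a few lines of Python: sort the intermediate pairs by key, then group them, so each group becomes the input of one reduce call (a toy illustration with made-up pairs):

```python
from itertools import groupby
from operator import itemgetter

# intermediate (key, value) pairs as emitted by several map tasks
intermediate = [("db", 1), ("and", 1), ("db", 1), ("map", 1), ("and", 1)]

# shuffle: sort by key, then group -- each group feeds one reduce call
intermediate.sort(key=itemgetter(0))
grouped = {key: [v for _, v in pairs]
           for key, pairs in groupby(intermediate, key=itemgetter(0))}

print(grouped)  # {'and': [1, 1], 'db': [1, 1], 'map': [1]}
```

After this sort-and-group pass, selecting all values for one key is indeed just a range scan over the sorted data.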
13.1 Map & Reduce
• Example: Counting words in documents
13.1 Map & Reduce
map(key, value):
// key: doc name
// value: text of doc
for each word w in value:
  emit(w, 1);

reduce(key, values):
// key: a word
// values: list of counts
result = 0;
for each v in values:
  result += v;
emit(key, result);
13.1 Map & Reduce
• Example: Counting words in documents
doc1: “distributed db and p2p”
doc2: “map and reduce is a distributed processing technique for db”
map(key, value) emits one (word, 1) pair per word:
doc1 → (distributed, 1), (db, 1), (and, 1), (p2p, 1)
doc2 → (map, 1), (and, 1), (reduce, 1), (is, 1), (a, 1), (distributed, 1), …
reduce(key, values) then sums the counts per word:
(distributed, 2), (db, 2), (and, 2), (p2p, 1), (map, 1), (reduce, 1), (is, 1), …
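This data flow can be reproduced with a minimal single-process sketch of the word-count job (plain Python standing in for the framework; the function names are our own):

```python
from collections import defaultdict

def map_fn(doc_name, text):
    # emit (word, 1) for every word in the document
    return [(w, 1) for w in text.split()]

def reduce_fn(word, counts):
    return (word, sum(counts))

docs = {
    "doc1": "distributed db and p2p",
    "doc2": "map and reduce is a distributed processing technique for db",
}

# map phase
intermediate = []
for name, text in docs.items():
    intermediate.extend(map_fn(name, text))

# shuffle: group values by key
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# reduce phase
result = dict(reduce_fn(w, cs) for w, cs in groups.items())
print(result["distributed"], result["db"], result["and"])  # 2 2 2
```

In the real framework, the map calls run on many nodes in parallel and the shuffle moves data across the network; the per-key logic is exactly this.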
• Improvement: Combiners
– Combiners are mini-reducers that run in-memory after the map phase
– Used to group rare map keys into larger groups
• e.g. word counts: group multiple extremely rare words under one key (and mark that they are grouped…)
– Used to reduce network and worker scheduling overhead
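A combiner for the word-count job can be sketched as a local reduce that runs on a single map task's output before anything crosses the network (plain Python; function names are our own):

```python
from collections import Counter

def map_fn(text):
    # plain word-count mapper: one (word, 1) pair per word
    return [(w, 1) for w in text.split()]

def combine(pairs):
    # local pre-aggregation on the map node: same logic as the reducer,
    # but applied only to this single task's output
    acc = Counter()
    for w, c in pairs:
        acc[w] += c
    return list(acc.items())

pairs = map_fn("to be or not to be")
combined = combine(pairs)
print(len(pairs), "->", len(combined))  # 6 pairs shrink to 4
```

The reducers receive the same totals either way; the combiner only cuts down how many intermediate pairs have to be shuffled.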
13.1 Map & Reduce
• Responsibility of the map and reduce master
• Often called scheduler
– Assign Map and Reduce tasks to workers on nodes
• Usually, map tasks are assigned to worker nodes as a batch and not one by one
– Often called a split, i.e. a subset of the whole input data
– Splits are often implemented by a simple hash function with as many buckets as worker nodes
– Full split data is assigned to some worker node, which starts a map task for each input key-value pair
– Check for node failure
– Check for task completion
– Route map results to reduce tasks
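The hash-based split assignment mentioned above can be sketched as follows (a toy illustration using a deterministic CRC32 hash; real schedulers also weigh data locality and current load):

```python
from zlib import crc32

def assign_split(input_key: str, num_workers: int) -> int:
    # hash partitioning: as many buckets as worker nodes
    return crc32(input_key.encode()) % num_workers

keys = ["http://a.example", "http://b.example",
        "http://c.example", "http://d.example"]
splits = {}
for k in keys:
    splits.setdefault(assign_split(k, 3), []).append(k)
# every key lands in exactly one of the 3 splits
```

Each bucket is then handed to one worker node as a batch, and the worker starts a map task per key-value pair in its split.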
13.1 Map & Reduce
• Map and Reduce overview
13.1 Map & Reduce
• Master is responsible for worker node fault tolerance
– Handled via re-execution
• Detect failure via periodic heartbeats
• Re-execute completed + in-progress map tasks
• Re-execute in progress reduce tasks
• Task completion committed through master
– Robust: one job once lost 1,600 of 1,800 machines and still finished successfully
• Master failures are not handled
– Unlikely due to redundant hardware…
13.1 Map & Reduce
• Showcase: machine usage during web indexing
– Fine granularity tasks: map tasks >> machines
• Minimizes time for fault recovery
• Can pipeline shuffling with map execution
• Better dynamic load balancing
– Showcase uses 200,000 map & 5,000 reduce tasks
– Running on 2,000 machines
13.1 Map & Reduce
• PageRank is one of the major algorithms behind Google Search
– See our wonderful IRWS lecture (No 12)!!
– Key Question: How important is a given website?
• Importance independent of query
– Idea: other pages “vote” for a site by linking to it
• also called “giving credit to”
• Pages with many votes are probably important
– If an important site “votes” for another site, that vote has a higher weight than when an unimportant site votes
13.1 MR - PageRank
• Given page x with in-bound links t1, …, tn, where
– C(t) is the out-degree of t
– α is the probability of a random jump
– N is the total number of nodes in the graph

PR(x) = α · (1/N) + (1 − α) · Σ i=1..n ( PR(ti) / C(ti) )
13.1 MR - PageRank
• Properties of PageRank
– Can be computed iteratively
– Effects at each iteration are local
• Sketch of algorithm:
– Start with seed PR_0 values
– Each page distributes its PR_i “credit” to all pages it links to
– Each target page adds up the “credit” from its multiple in-bound links to compute PR_{i+1}
– Iterate until values converge
13.1 MR - PageRank
Map Step: Distribute Page Rank “Credits” to link targets
Reduce Step: gather up PageRank “credit” from multiple sources to compute new PageRank value
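One round of these two steps can be sketched in plain Python (the 3-node toy graph and α = 0.15 are our own choices, not from the lecture):

```python
from collections import defaultdict

ALPHA, N = 0.15, 3                                 # random-jump prob., #nodes
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # toy web graph
pr = {p: 1.0 / N for p in links}                   # seed values PR_0

def map_fn(page):
    # Map step: distribute this page's credit evenly over its link targets
    share = pr[page] / len(links[page])
    return [(target, share) for target in links[page]]

def reduce_fn(page, credits):
    # Reduce step: gathered credit plus the random-jump term
    return ALPHA / N + (1 - ALPHA) * sum(credits)

# one full map / shuffle / reduce round
inbox = defaultdict(list)
for p in links:
    for target, share in map_fn(p):
        inbox[target].append(share)
pr = {p: reduce_fn(p, inbox[p]) for p in links}
# rank mass stays 1 because every page here has outgoing links
```

Iterating this round until the values stop changing yields the converged PageRank vector.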
• Dryad (Microsoft)
– Relational Algebra
• Pig (Yahoo)
– Near Relational Algebra over MapReduce
• HIVE (Facebook)
– SQL over MapReduce
• Cascading
– Java API for defining data-processing workflows over MapReduce
• Hbase
– Indexing on HDFS
13.1 MapReduce Contemporaries
• An engine for executing programs on top of Hadoop.
• It provides a language, Pig Latin, to specify these programs.
• An Apache open source project http://pig.apache.org
13.1 Pig
• Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited sites by users aged 18-25
13.1 Pig: Motivation
Load Users Load Pages
Filter by age
Join on name
Group on url Count clicks
Order by clicks
13.1 In MapReduce
170 lines of code, 4 hours to write
Users = load 'users' as (name, age);
Fltrd = filter Users by
age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
9 lines of code, 15 minutes to write
13.1 In Pig Latin
13.1 Pig System Overview
Pig Latin program
A = LOAD 'file1' AS (sid, pid, mass, px:double);
B = LOAD 'file2' AS (sid, pid, mass, px:double);
C = FILTER A BY px < 1.0;
D = JOIN C BY sid, B BY sid;
STORE D INTO 'output.txt';
Pig parser → parsed program → Pig compiler → execution plan:
LOAD (disk A) and LOAD (disk B), then FILTER, then JOIN
13.1 Comparing Performance
How fast is Pig compared to a pure Map-Reduce implementation?
• Atom:
– Integer, string, etc.
• Tuple:
– Sequence of fields
– Each field of any type
• Bag:
– Collection of tuples, not necessarily of the same type
– Duplicates are allowed
• Map:
– String literal keys mapped to any type
13.1 Data Model
13.1 Pig Latin Statement
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
X = FOREACH A GENERATE name, $2;

Field          Data type   Positional notation (generated by system)   Possible name (assigned by user using a schema)
First field    chararray   $0                                          name
Second field   int         $1                                          age
Third field    float       $2                                          gpa
• Map-Reduce: Iterative Jobs
– Iterative jobs involve a lot of disk I/O for each repetition
13.1 Apache Spark Motivation
Using Map Reduce for complex jobs, interactive queries and online processing involves lots of disk I/O
Idea: keep more data in memory!
13.1 Use Memory instead of Disk
13.1 In-Memory Data Sharing
• Most real applications require multiple MR steps:
– Google indexing pipeline: 21 steps
– Analytics queries (e.g. count clicks & top-k): 2-5 steps
– Iterative algorithms (e.g. PageRank): 10's of steps
• Multi step jobs create spaghetti code
– 21 MR steps -> 21 mapper and reducer classes
13.1 Programmability
13.1 Performance
[Source: Daytona GraySort benchmark, sortbenchmark.org]
• Open source processing engine.
• Originally developed at UC Berkeley in 2009.
• More than 100 operators for transforming data.
• World record for large-scale on disk sorting.
• Built-in support for many data sources (HDFS, RDBMS, S3, Cassandra)
13.1 Apache Spark
[Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., Stoica, I.: “Spark: Cluster Computing with Working Sets”, HotCloud'10, 2nd USENIX Workshop on Hot Topics in Cloud Computing, 2010]
[Zaharia, M., Chowdhury, M., Das, T., Dave, A., et al.: “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”, NSDI'12, 9th USENIX Conference on Networked Systems Design and Implementation, 2012]
13.1 Spark Tools
• Write programs in terms of distributed datasets and operations on them
– Resilient Distributed Datasets (RDDs)
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
– Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
13.1 Resilient Distributed Datasets
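The transformation/action split can be imitated in plain Python with lazy generators (a toy sketch, not Spark's actual API; the class and its behavior are our own simplification):

```python
class ToyRDD:
    """Tiny stand-in for an RDD: lazy transformations, eager actions."""
    def __init__(self, data):
        self._data = data          # an iterable; nothing materialized yet

    # transformations: return a new ToyRDD, evaluate nothing
    def map(self, f):
        return ToyRDD(f(x) for x in self._data)

    def filter(self, pred):
        return ToyRDD(x for x in self._data if pred(x))

    # actions: force evaluation
    def count(self):
        return sum(1 for _ in self._data)

    def collect(self):
        return list(self._data)

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

Real RDDs additionally record the lineage of transformations, which is what lets Spark rebuild a lost partition on failure instead of checkpointing everything to disk.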
13.1 Working with RDDs
13.1 Spark vs. Map Reduce
– Storage: Hadoop MapReduce: disk only; Spark: in-memory or on disk
– Operations: Hadoop MapReduce: Map and Reduce; Spark: map, reduce, join, sample, etc.
– Execution model: Hadoop MapReduce: batch; Spark: batch, interactive, streaming
– Programming environments: Hadoop MapReduce: Java; Spark: Scala, Java, R, and Python
• The term “cloud computing” is often seen as a successor of client-server architectures
– Often used as synonym for centralized, on-demand, pay-what-you-use provisioning of general computation resources
• Comparable to utility providers like electric power grids or water supply
• “Computing as a commodity”
– “Cloud” is used as a metaphor for the Internet
• Users or applications “just use” computation resources provided in the Internet instead of using local hardware or software
13.2 The Cloud
• “Computation resources” can mean a lot of things:
• Dynamic access to “raw metal”
– Raw storage space or CPU time
– Fully operational servers are provided by the cloud
• Low-level services and platforms
– e.g. runtime platforms like the Java JRE
» Users can run applications directly on a cloud platform
» No own servers or platform software needed
– e.g. abstracted storage space, like space within a database or a file system
» This is what we did in the last weeks!
13.2 The Cloud
• Software services
– i.e. some functionality required by user software is provided “by the cloud”
» Used via web service remote procedure calls
» e.g. delegate the rendering of a map in some user application to Google Maps
• Full software functionality
– e.g. rented Web applications replacing traditional server or desktop applications
» e.g. rent CRM software online from SalesForce, use Google apps instead of MS Office, etc.
13.2 The Cloud
• Underlying base problem…
– Successfully running IT departments and IT infrastructures can be very difficult and expensive for companies
– High fixed costs
• Acquiring and paying competent IT staff
– “Competent” is often very hard to get…
• Buying and maintaining servers
• Correctly hosting hardware
– Proper power and cooling facilities, network connections, server racks, etc.
• Buying and maintaining software
13.2 The Cloud
– Load and Utilization Issues
• How much hardware resources are
required by each application and/or service?
• How to handle scaling issues?
– What happens if demand increases or declines?
– How to handle spike loads?
– “Digg Effect”
• Traditional data centers are notoriously underutilized, often idle 85% of the time
– Over-provisioning for future growth or spikes
– Insufficient capacity planning and sizing
– Improper understanding of scalability requirements, etc.
13.2 The Cloud
• Cloud computing centrally unifies computation resources and
provides them on-demand
– Degree of centralization and provision may differ
• Centralize hardware within a department? A company? A number of companies? Globally?
• Provide resources only oneself? To some partners?
To anybody?
• How to compensate providers for resource usage?
– Provide resources with a rental model (e.g. monthly fee)?
– Provide resources metered on a what-is-used basis (similar to electricity or water)?
– Provide resources for free?
13.2 The Cloud
• Usually, three types of clouds are distinguished
– Public Clouds
– Private Clouds
– Hybrid Clouds
13.2 The Cloud
– Public Clouds
• “Traditional” cloud computing
• Services and resources are offered via the Internet to anybody willing to pay for them
– User just pays for services, usually no acquisition, administration or maintenance of hardware / software necessary
• Services usually provided by off-site 3rd-party providers
– Open for use by general public
• Exist beyond firewall, fully hosted and managed by the vendor
• Customers are individuals, corporations and others
• e.g. Amazon's Web Services and Google AppEngine
• Offers start-ups and SMBs quick setup, scalability, flexibility, and automated management. The pay-as-you-go model helps start-ups to start small and grow big
– Security and compliance?
– Reliability and privacy concerns hinder the adoption of the cloud
• Amazon S3 services were down for 6 hours in 2010
• What will Amazon do with all the data?
13.2 The Cloud
– Private Clouds
• Cloud computing hardware is within the premises of a company, behind the corporate firewall
• Resources are only provided internally for various departments
• Private clouds are still fully bought, built, and maintained by the company using them
– But usually not exclusive to single departments
– Still, costs could be prohibitive and may by far exceed those of public clouds
• Fine grained control over resources
• More secure as they are internal to organization
• Schedule and reshuffle resources based on business demands
• Ideal for apps requiring tight security and regulatory concerns
• Development requires hardware investments and in-house expertise
13.2 The Cloud
– Hybrid Clouds
• Both private and public cloud services, or even non-cloud services, are used or offered simultaneously
• “State of the art” for most companies relying on cloud technology
13.2 The Cloud
• Properties promised by cloud computing
– Agility
• Resources are quickly available when needed
– i.e. servers need not be ordered and built, software doesn't need to be configured and installed, etc.
– Costs
• Capital expenditure is converted to operational expenditure
– Independence
• Services are available everywhere and for any device
13.2 The Cloud
– Multi-tenancy
• Resources are shared by larger pool of users
• Resources can be centralized which reduces the costs
• Load distribution of users differs
– Peak loads can usually be distributed
– Overall utilization and efficiency of resources is better
– Reliability
• Most cloud services promise durable and reliable resources due to distribution and replication
– Scalability
• If a user needs more resources or performance, they can easily be provisioned
13.2 The Cloud
– Low maintenance
• Cloud services or applications are not installed on users' machines, but maintained centrally by specialized staff
– Transparency and metering
• Costs for computation resources are directly visible and transparent
• “Pay-what-you-use” models
• Cloud computing generally promises to be beneficial for fast growing start-ups, SMBs and enterprises alike
– Cost-effective solutions to key business demands
– Improved overall efficiency
13.2 The Cloud
• The cloud encourages a self-service model
– Users can simply request the resources they need
13.2 The Cloud
• Anything-as-a-Service
– XaaS=“X as a service”
– In general, cloud providers offer any computation resources “as a service”
– In the long run, all computation needs of a company should be modeled, provided and used “as a service”
• e.g. in Amazon’s private and public cloud infrastructures:
everything is a service!
13.3 XaaS
– Services provide a strictly defined functionality with certain guarantees
• Service description and service-level agreements (SLAs)
• The service description explains what is offered by the service
• SLAs further clarify the provisioning guarantees
– Often: performance, latency, reliability, availability, etc.
13.3 XaaS
• Usually, three main resources may be offered
“as a service”
– Software as a Service
• SaaS
– Platform as a Service
• PaaS
– Infrastructure as a Service
• IaaS
13.3 XaaS
(Layered stack: Client – Application – Platform – Infrastructure – Server)
• Application Services (services on demand)
– Gmail, Google Calendar
– Payroll, HR, CRM, etc.
– Sugar CRM, IBM Lotus Live
• Platform Services (resources on demand)
– Middleware, integration, messaging, information, connectivity, etc.
– Amazon AWS, Boomi, CastIron, Google AppEngine
• Infrastructure as services (physical assets as services)
– IBM Blue House, VMWare Cloud Edition, Amazon EC2, Microsoft Azure Platform, …
13.3 XaaS
13.3 XaaS
(Diagram: individuals, corporations, and non-commercial users access the cloud; a cloud middleware layer handles storage provisioning, OS provisioning, network provisioning, and service (app) provisioning, plus SLA monitoring, security, billing, and payment, on top of the underlying services, storage, network, and OS resources)
• Infrastructure as a Service (IaaS)
– Provides raw computation infrastructure, i.e. usually a virtual server
• e.g. see hardware virtualization (VMWare & co.)
• Successor to dedicated server rental
– For the user, a virtual server is similar to a real server
• Has CPU cores, main memory, hard disc space, etc.
• Usually provided as “self-service” raw machine
• User is responsible for installing and maintaining software such as the operating system, databases, or server software
• User does not need to buy, host, or maintain the actual hardware
13.3 IaaS
• The IaaS provider can host multiple virtual servers on a single, real machine
– Often, 10-30 virtual servers per real server
– Virtualization is used to abstract server hardware for virtual servers
• Virtual systems are also often called virtual machines (a neutral term) or appliances (usually suggesting a preinstalled OS and software)
– Virtualization of hardware is usually handled by a so-called hypervisor
• e.g., Xen, KVM, VMWare, HyperV, …
13.3 IaaS
• In short, IaaS is a virtualization on multiple hardware machines
– Normal Server
• 1 machine with one OS
– Traditional virtualization
• 1 machine hosting multiple virtual servers
– Distributed Application
• 1 appliance running on multiple machines
– IaaS
• Multiple machines running multiple virtual servers
• Dynamic load balancing between machines
13.3 IaaS
(Matrix over #machines × #appliances: 1×1 “normal” server, 1×many traditional virtualization, many×1 distributed appliance, many×many IaaS)
• Hypervisor is responsible for allocating available resources to VMs
– Dispatch VMs to machines
– Relocate VMs to balance load
– Distribute resources
• Network adaptors, logical discs, RAM, CPU cores, etc…
13.3 IaaS
• Usually, virtual machines offered by IaaS
infrastructures cannot grow arbitrarily big
– Capped by the actual server size or the size of a smaller server group
• Really big applications are usually deployed in so-called Pods
– Similar to database shards
– Group of machines running one or multiple appliances
– Machines within a Pod are very tightly networked
13.3 IaaS
– i.e. each Pod is a full copy of the given virtual machines with full OS and applications installed
• Usually, there are multiple copies of a given Pod (and its VMs)
• Each Pod is responsible for a disjoint part of the whole workload
– Pods are usually scattered across availability zones (e.g. data centers or a certain rack)
• Physically separated, usually with own power / network, etc.
13.3 IaaS
• IaaS Pods
13.3 IaaS
– Simplified Pod example: GoogleMail
• Multiple Pods, each Pod running on multiple machines with a full and independent installation of Gmail software
• Load balancer decides during user log-in which Pod will handle the user session
– Users are distributed across Pods
• Pods are flexible by using shared GFS file system
13.3 IaaS
• Mission critical applications should be designed such that they run in multiple availability
zones on multiple Pods
– Cloud control system (CCS) responsible for distribution and replication
13.3 IaaS
• Pod Architectures
– Each pod consists of multiple machines with mainboards, CPUs, and main memory
– Question: where to put secondary storage?
– Usually, three options
• Storage area network (SAN)
• Direct attached storage (DAS)
• Network attached storage (NAS)
– or… a storage service! (e.g. GFS & co.)
13.3 IaaS
• SAN Pods
– Individual servers don't have their own secondary storage
– A storage area network provides shared hard-disk storage for all machines of a Pod
– Pro
• All machines have access to the same data
• Allows for dynamic load balancing or migration of appliances
– e.g. VMware vMotion
– Con
• Very, very expensive
• Higher latency than direct attached storage
13.3 IaaS
• SAN Pods
13.3 IaaS
• DAS Pods
– Each server has its own set of hard drives
– Accessing data from other servers may be difficult
– Pro
• Cheap
• Low latency for accessing local data
– Con
• Usually, no shared data access
• Usually, difficult to live-migrate appliances (due to no shared data)
– But: by using clever storage abstractions, common problems can be circumvented
• Use a distributed file system or a distributed data store!
– e.g. Amazon S3 & SimpleDB, Google GFS & BigTable, Apache HBase & HDFS, etc.
13.3 IaaS
• DAS Pods
13.3 IaaS
• IaaS example: Amazon EC2
– The Elastic Compute Cloud is one of the core services of the Amazon Cloud Infrastructure
• Public IaaS Cloud
– Customers may rent virtual servers hosted at Amazon's data centers
• Can freely install OS and applications as needed
– Virtual servers are offered in different sizes and are paid by CPU usage
• Basic storage is offered within the VM, but usually additional storage services are used by application which cost extra
– e.g. S3, SimpleDB, or Dynamo DB
13.3 Amazon EC2
• Example: t2.micro
– 1.0 GB memory
– 1 vCPU
• 1 virtual core
• 1 vCPU is roughly one 2.5 GHz Xeon core
– No dedicated storage
• Has to use AWS network storage
– Burstable performance: 6 CPU credits per hour
• 1 CPU credit = 1 minute of full CPU performance, i.e. a sustained baseline of about 10% of one core
– Costs $0.013 per hour
• About $9.30 per month
– Usually many users start with the small instance; also heavily used for testing
13.3 Amazon EC2
• Example: m3.xlarge
– 15 GB memory
– 4 vCPUs
• Total of 13 ECU (Elastic Compute Units)
• 1 ECU is roughly equal to a 1.5 GHz Xeon core
– 80 GB instance storage on SSD
• More storage via AWS
– Costs $0.28 per hour
• About $201 per month
13.3 Amazon EC2
• Example: i2.8xlarge
– 244 GB memory
– 32 vCPUs
• Total of 104 ECU
– 6,400 GB of instance storage on SSD
– Costs $6.82 per hour
• About $4,910 per month
13.3 Amazon EC2
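Assuming a 720-hour billing month (24 h × 30 days, our assumption), the monthly figures quoted for the instance examples follow directly from the hourly rates:

```python
def monthly_cost(hourly_rate, hours=720):   # 24 h x 30 days
    return hourly_rate * hours

print(int(monthly_cost(0.28)))        # m3.xlarge:  ~$201 per month
print(int(monthly_cost(6.82)))        # i2.8xlarge: ~$4910 per month
print(round(monthly_cost(0.013), 2))  # t2.micro: $9.36, close to the quoted $9.30
```

The small deviation for t2.micro suggests the slide rounded or used a slightly shorter month.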
• Rough Estimations (Oct 2009)
– Roughly 40,000 servers
– Uses standard server racks with 16 machines per rack
• Mostly packed with 2U dual-socket Quad-Core Intel Xeons
– Roughly matches the High-Mem Quad XL instance…
– Uses around 8 × 500 GB RAID-0 disks per machine
– Target cost around $2,500 per machine on average
– 75% of the machines are in the US, the remainder in Europe and Asia
– Amazon aims at a utilization rate of 75%
– Very rough guesses state that Amazon may earn
$25,264 per hour with EC2!
• http://cloudscaling.com/blog/cloud-computing/amazons-ec2-generating-220m-annually
13.3 Amazon EC2
• Platform as a Service (PaaS)
– Provides software platforms on demand
• e.g. runtime engines (JavaVM, .Net Runtime, etc.), storage systems (distributed file system, or databases), web services,
communication services, etc.
– PaaS systems are usually used to develop and host web applications or web services
• User applications run on the provided platform
– In contrast to IaaS, no installation and maintenance of the operating system and server applications is necessary
• Centrally managed and maintained
• Services or runtimes are directly usable
13.3 PaaS
• Google AppEngine provides users a managed Python or Java runtime
– Web applications can be directly hosted in AppEngine
• Just upload your WAR file and you are done…
– Users are billed by resource usage
• Some free resources provided every day
– 1 GB in- and out-traffic, 6.5 CPU hours, 500 MB storage overall
13.3 Google AppEngine
Resource Unit Unit cost
Outgoing Bandwidth GB $0.12
Incoming Bandwidth GB $0.10
CPU Time CPU hours $0.10
Stored Data GB / month $0.15
• Each application can access system resources up to a fixed maximum
– AppEngine is not fully scalable!
– AppEngine max values (2010)
• CPU: 1730 hours CPU per day; 72 minutes CPU per minute
• Data in or out: 1 TB per day; 10 GB per minute
• Request: 43M web service calls per day, 30K calls per minute
• Data storage: no limit (uses BigTable which can scale in size!!)
13.3 Google AppEngine
• Amazon SimpleDB is a data storage system roughly similar to Google BigTable
– http://aws.amazon.com/simpledb
– Simple table-centric database engine
• SimpleDB is directly ready to use
– No user configuration or administration
– Accessible via web service
• SimpleDB is highly available, uses flexible schemas, and eventual consistency
– Similar to HBase or BigTable
13.3 Amazon SimpleDB
– Any application may use SimpleDB for data storage
• A simple web service is provided to interact with SimpleDB
• Create or delete a table (called domain)
• Put and delete rows
• Query for rows
– Users pay for storage, data transfer, and computation time
• 25 hours of computation time (for querying) are free per month
– Beyond that: $0.154 per machine hour in 2009
– Beyond that: $0.140 per machine hour in 2014
• 1 GB of data transfer is free per month
– Beyond that: $0.15 per GB in 2009
– Beyond that: $0.12 per GB in 2014
• 1 GB of data storage is free per month
– Beyond that: $0.28 per GB in 2009
– Beyond that: $0.25 per GB in 2014
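The domain/put/query interface described above can be imitated with an in-memory sketch (plain Python; the class and method names are our own simplification, not the real SimpleDB web-service API):

```python
class ToyDomain:
    """In-memory stand-in for a SimpleDB domain: flexible-schema rows."""
    def __init__(self, name):
        self.name, self.items = name, {}

    def put(self, item_name, **attributes):
        # rows need not share a schema -- any attributes are accepted
        self.items.setdefault(item_name, {}).update(attributes)

    def delete(self, item_name):
        self.items.pop(item_name, None)

    def query(self, pred):
        # return the names of all items whose attributes match the predicate
        return [n for n, attrs in self.items.items() if pred(attrs)]

users = ToyDomain("users")
users.put("u1", name="alice", age=23)
users.put("u2", name="bob")            # no age attribute -- still fine
print(users.query(lambda a: a.get("age", 0) >= 18))  # ['u1']
```

The flexible schema is the point: unlike a relational table, two items in the same domain may carry entirely different attribute sets.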
13.3 Amazon SimpleDB
• Software as a Service (SaaS)
– Full applications are offered on-demand
• Users just need to consume the software; no installation or maintenance necessary
– All administrative and maintenance tasks are performed by the Cloud provider
• e.g. hosting physical hardware, maintaining platforms,
maintaining software, dealing with security, scalability, etc.
13.3 SaaS
• Salesforce.com On-Demand CRM software
– Customer-Relationship-Management
• Cooperation with Google Apps in early summer
– Provides simple online services for
• Customer database
• Lead management
• Call center
• Customer portal
• Knowledge Bases
• Collaboration environments
• Etc.