Prof. Dr. Wolf-Tilo Balke
Institut für Informationssysteme
Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de
Distributed Data Management
13.1 Map & Reduce
13.2 The Cloud
13.3 Computing as a Service
– SaaS
– PaaS
– IaaS
13.0 The Cloud
• Just storing massive amounts of data is not enough!
– Often, we also need to process and transform that data
• Large-Scale Data Processing
– Use thousands of worker nodes within a computation cluster to process large data batches
• Preferably without the hassle of managing things
• Map & Reduce provides
– Automatic parallelization & distribution
– Fault tolerance
– I/O scheduling
– Monitoring & status updates
13.1 Map & Reduce
• Initially, implemented by Google for building the Google search index
– i.e. crawling the Web, building the inverted word index, computing PageRank, etc.
• General framework for parallel high volume data processing
– J. Dean, S. Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters”, Symp. on Operating Systems Design and Implementation (OSDI), San Francisco, USA, 2004
– Also available as Open Source implementation as part of Apache Hadoop
• http://hadoop.apache.org/mapreduce/
13.1 Map & Reduce
• Base idea
– There is a large amount of input data, identified by a key
• i.e. input given as key-value pairs
• e.g. all web pages of the internet identified by their URL
– A map operation is a simple function which accepts one key-value pair as input
• A map operation runs as autonomous thread on one single node of a cluster
– Many map jobs can run in parallel with different input keys
• Returns for a single input key-value pair a set of intermediate key-value pairs
– map(key, value) → Set of intermediate (key, value)
• After map job is finished, the node is free to perform another map job for the next input key-value pair
– A central controller distributes map jobs to free nodes
13.1 Map & Reduce
– After input data is mapped, reduce jobs can start
– reduce(key, values) is run for each unique key emitted by map()
• Each reduce job is also run autonomously on one single node
– Many reduce jobs can run in parallel on different intermediate key groups
• Reduce emits final output of the map-reduce operation
– Each reduce job…
• Takes all map tuples with a given key as input
• Usually generates one, sometimes more, output tuples
13.1 Map & Reduce
• Each reduce is executed on a set of intermediate map results which have the same key
– To efficiently select this set, the intermediate key-value pairs are usually shuffled
• i.e. sorted and grouped by their respective key
– After shuffling, reduce input data can be selected by a simple range scan
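The shuffle step can be sketched in a few lines of Python: sort the intermediate pairs by key, then group them, so each group becomes the input of one reduce call (a toy illustration with made-up pairs):

```python
from itertools import groupby
from operator import itemgetter

# intermediate (key, value) pairs as emitted by several map tasks
intermediate = [("db", 1), ("and", 1), ("db", 1), ("map", 1), ("and", 1)]

# shuffle: sort by key, then group -- each group feeds one reduce call
intermediate.sort(key=itemgetter(0))
grouped = {key: [v for _, v in pairs]
           for key, pairs in groupby(intermediate, key=itemgetter(0))}

print(grouped)  # {'and': [1, 1], 'db': [1, 1], 'map': [1]}
```

After this sort-and-group pass, selecting all values for one key is indeed just a range scan over the sorted data.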
13.1 Map & Reduce
• Example: Counting words in documents
13.1 Map & Reduce
map(key, value):
// key: doc name
// value: text of doc
for each word w in value:
  emit(w, 1);

reduce(key, values):
// key: a word
// values: list of counts
result = 0;
for each v in values:
  result += v;
emit(key, result);
13.1 Map & Reduce
• Example: Counting words in documents
doc1: “distributed db and p2p”
doc2: “map and reduce is a distributed processing technique for db”
map(key, value) emits one (word, 1) pair per word:
doc1 → (distributed, 1), (db, 1), (and, 1), (p2p, 1)
doc2 → (map, 1), (and, 1), (reduce, 1), (is, 1), (a, 1), (distributed, 1), …
reduce(key, values) then sums the counts per word:
(distributed, 2), (db, 2), (and, 2), (p2p, 1), (map, 1), (reduce, 1), (is, 1), …
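This data flow can be reproduced with a minimal single-process sketch of the word-count job (plain Python standing in for the framework; the function names are our own):

```python
from collections import defaultdict

def map_fn(doc_name, text):
    # emit (word, 1) for every word in the document
    return [(w, 1) for w in text.split()]

def reduce_fn(word, counts):
    return (word, sum(counts))

docs = {
    "doc1": "distributed db and p2p",
    "doc2": "map and reduce is a distributed processing technique for db",
}

# map phase
intermediate = []
for name, text in docs.items():
    intermediate.extend(map_fn(name, text))

# shuffle: group values by key
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# reduce phase
result = dict(reduce_fn(w, cs) for w, cs in groups.items())
print(result["distributed"], result["db"], result["and"])  # 2 2 2
```

In the real framework, the map calls run on many nodes in parallel and the shuffle moves data across the network; the per-key logic is exactly this.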
• Improvement: Combiners
– Combiners are mini-reducers that run in-memory after the map phase
– Used to group rare map keys into larger groups
• e.g. word counts: group multiple extremely rare words under one key (and mark that they are grouped…)
– Used to reduce network and worker scheduling overhead
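A combiner for the word-count job can be sketched as a local reduce that runs on a single map task's output before anything crosses the network (plain Python; function names are our own):

```python
from collections import Counter

def map_fn(text):
    # plain word-count mapper: one (word, 1) pair per word
    return [(w, 1) for w in text.split()]

def combine(pairs):
    # local pre-aggregation on the map node: same logic as the reducer,
    # but applied only to this single task's output
    acc = Counter()
    for w, c in pairs:
        acc[w] += c
    return list(acc.items())

pairs = map_fn("to be or not to be")
combined = combine(pairs)
print(len(pairs), "->", len(combined))  # 6 pairs shrink to 4
```

The reducers receive the same totals either way; the combiner only cuts down how many intermediate pairs have to be shuffled.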
13.1 Map & Reduce
• Responsibility of the map and reduce master
• Often called scheduler
– Assign Map and Reduce tasks to workers on nodes
• Usually, map tasks are assigned to worker nodes as a batch and not one by one
– Often called a split, i.e. a subset of the whole input data
– Splits are often implemented by a simple hash function with as many buckets as worker nodes
– Full split data is assigned to some worker node, which starts a map task for each input key-value pair
– Check for node failure
– Check for task completion
– Route map results to reduce tasks
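The hash-based split assignment mentioned above can be sketched as follows (a toy illustration using a deterministic CRC32 hash; real schedulers also weigh data locality and current load):

```python
from zlib import crc32

def assign_split(input_key: str, num_workers: int) -> int:
    # hash partitioning: as many buckets as worker nodes
    return crc32(input_key.encode()) % num_workers

keys = ["http://a.example", "http://b.example",
        "http://c.example", "http://d.example"]
splits = {}
for k in keys:
    splits.setdefault(assign_split(k, 3), []).append(k)
# every key lands in exactly one of the 3 splits
```

Each bucket is then handed to one worker node as a batch, and the worker starts a map task per key-value pair in its split.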
13.1 Map & Reduce
• Map and Reduce overview
13.1 Map & Reduce
• Master is responsible for worker node fault tolerance
– Handled via re-execution
• Detect failure via periodic heartbeats
• Re-execute completed + in-progress map tasks
• Re-execute in progress reduce tasks
• Task completion committed through master
– Robust: one job once lost 1,600 of 1,800 machines and still finished successfully
• Master failures are not handled
– Unlikely due to redundant hardware…
13.1 Map & Reduce
• Showcase: machine usage during web indexing
– Fine granularity tasks: map tasks >> machines
• Minimizes time for fault recovery
• Can pipeline shuffling with map execution
• Better dynamic load balancing
– Showcase uses 200,000 map & 5,000 reduce tasks
– Running on 2,000 machines
13.1 Map & Reduce
• PageRank is one of the major algorithms behind Google Search
– See our wonderful IRWS lecture (No 12)!!
– Key Question: How important is a given website?
• Importance independent of query
– Idea: other pages “vote” for a site by linking to it
• also called “giving credit to”
• Pages with many votes are probably important
– If an important site “votes” for another site, that vote has a higher weight than when an unimportant site votes
13.1 MR - PageRank
• Given page x with in-bound links t1, …, tn, where
– C(t) is the out-degree of t
– α is the probability of a random jump
– N is the total number of nodes in the graph

PR(x) = α · (1/N) + (1 − α) · Σ i=1..n ( PR(ti) / C(ti) )
13.1 MR - PageRank
• Properties of PageRank
– Can be computed iteratively
– Effects at each iteration are local
• Sketch of algorithm:
– Start with seed PR_0 values
– Each page distributes its PR_i “credit” to all pages it links to
– Each target page adds up the “credit” from its multiple in-bound links to compute PR_{i+1}
– Iterate until values converge
13.1 MR - PageRank
Map Step: Distribute Page Rank “Credits” to link targets
Reduce Step: gather up PageRank “credit” from multiple sources to compute new PageRank value
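One round of these two steps can be sketched in plain Python (the 3-node toy graph and α = 0.15 are our own choices, not from the lecture):

```python
from collections import defaultdict

ALPHA, N = 0.15, 3                                 # random-jump prob., #nodes
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # toy web graph
pr = {p: 1.0 / N for p in links}                   # seed values PR_0

def map_fn(page):
    # Map step: distribute this page's credit evenly over its link targets
    share = pr[page] / len(links[page])
    return [(target, share) for target in links[page]]

def reduce_fn(page, credits):
    # Reduce step: gathered credit plus the random-jump term
    return ALPHA / N + (1 - ALPHA) * sum(credits)

# one full map / shuffle / reduce round
inbox = defaultdict(list)
for p in links:
    for target, share in map_fn(p):
        inbox[target].append(share)
pr = {p: reduce_fn(p, inbox[p]) for p in links}
# rank mass stays 1 because every page here has outgoing links
```

Iterating this round until the values stop changing yields the converged PageRank vector.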
• Dryad (Microsoft)
– Relational Algebra
• Pig (Yahoo)
– Near Relational Algebra over MapReduce
• HIVE (Facebook)
– SQL over MapReduce
• Cascading
– Java API for defining data-processing workflows over MapReduce
• Hbase
– Indexing on HDFS
13.1 MapReduce Contemporaries
• An engine for executing programs on top of Hadoop.
• It provides a language, Pig Latin, to specify these programs.
• An Apache open source project http://pig.apache.org
13.1 Pig
• Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited sites by users aged 18-25
13.1 Pig: Motivation
Load Users Load Pages
Filter by age
Join on name
Group on url Count clicks
Order by clicks
13.1 In MapReduce
170 lines of code, 4 hours to write
Users = load 'users' as (name, age);
Fltrd = filter Users by
age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
9 lines of code, 15 minutes to write
13.1 In Pig Latin
13.1 Pig System Overview
Pig Latin program
A = LOAD 'file1' AS (sid, pid, mass, px:double);
B = LOAD 'file2' AS (sid, pid, mass, px:double);
C = FILTER A BY px < 1.0;
D = JOIN C BY sid, B BY sid;
STORE D INTO 'output.txt';
Pig parser → parsed program → Pig compiler → execution plan:
LOAD (disk A) and LOAD (disk B), then FILTER, then JOIN
13.1 Comparing Performance
How fast is Pig compared to a pure Map-Reduce implementation?
• Atom:
– Integer, string, etc.
• Tuple:
– Sequence of fields
– Each field of any type
• Bag:
– Collection of tuples, not necessarily of the same type
– Duplicates are allowed
• Map:
– String literal keys mapped to any type
13.1 Data Model
13.1 Pig Latin Statement
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
X = FOREACH A GENERATE name, $2;

Field          Data type   Positional notation (generated by system)   Possible name (assigned by user using a schema)
First field    chararray   $0                                          name
Second field   int         $1                                          age
Third field    float       $2                                          gpa
• Map-Reduce: Iterative Jobs
– Iterative jobs involve a lot of disk I/O for each repetition
13.1 Apache Spark Motivation
Using Map Reduce for complex jobs, interactive queries and online processing involves lots of disk I/O
Idea: keep more data in memory!
13.1 Use Memory instead of Disk
13.1 In-Memory Data Sharing
• Most real applications require multiple MR steps:
– Google indexing pipeline: 21 steps
– Analytics queries (e.g. count clicks & top-k): 2-5 steps
– Iterative algorithms (e.g. PageRank): 10's of steps
• Multi step jobs create spaghetti code
– 21 MR steps -> 21 mapper and reducer classes
13.1 Programmability
13.1 Performance
[Source: Daytona GraySort benchmark, sortbenchmark.org]
• Open source processing engine.
• Originally developed at UC Berkeley in 2009.
• More than 100 operators for transforming data.
• World record for large-scale on disk sorting.
• Built-in support for many data sources (HDFS, RDBMS, S3, Cassandra)
13.1 Apache Spark
[Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., Stoica, I.: “Spark: Cluster Computing with Working Sets”, HotCloud'10, 2nd USENIX Workshop on Hot Topics in Cloud Computing, 2010]
[Zaharia, M., Chowdhury, M., Das, T., Dave, A., et al.: “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”, NSDI'12, 9th USENIX Conference on Networked Systems Design and Implementation, 2012]
13.1 Spark Tools
• Write programs in terms of distributed datasets and operations on them
– Resilient Distributed Datasets (RDDs)
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
– Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
13.1 Resilient Distributed Datasets
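The transformation/action split can be imitated in plain Python with lazy generators (a toy sketch, not Spark's actual API; the class and its behavior are our own simplification):

```python
class ToyRDD:
    """Tiny stand-in for an RDD: lazy transformations, eager actions."""
    def __init__(self, data):
        self._data = data          # an iterable; nothing materialized yet

    # transformations: return a new ToyRDD, evaluate nothing
    def map(self, f):
        return ToyRDD(f(x) for x in self._data)

    def filter(self, pred):
        return ToyRDD(x for x in self._data if pred(x))

    # actions: force evaluation
    def count(self):
        return sum(1 for _ in self._data)

    def collect(self):
        return list(self._data)

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

Real RDDs additionally record the lineage of transformations, which is what lets Spark rebuild a lost partition on failure instead of checkpointing everything to disk.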
13.1 Working with RDDs
13.1 Spark vs. Map Reduce
– Storage: Hadoop MapReduce: disk only; Spark: in-memory or on disk
– Operations: Hadoop MapReduce: Map and Reduce; Spark: map, reduce, join, sample, etc.
– Execution model: Hadoop MapReduce: batch; Spark: batch, interactive, streaming
– Programming environments: Hadoop MapReduce: Java; Spark: Scala, Java, R, and Python
• The term “cloud computing” is often seen as a successor of client-server architectures
– Often used as synonym for centralized, on-demand, pay-what-you-use provisioning of general computation resources
• Comparable to utility providers like electric power grids or water supply
• “Computing as a commodity”
– “Cloud” is used as a metaphor for the Internet
• Users or applications “just use” computation resources provided in the Internet instead of using local hardware or software
13.2 The Cloud
• “Computation resources” can mean a lot of things:
• Dynamic access to “raw metal”
– Raw storage space or CPU time
– Fully operational servers are provided by the cloud
• Low-level services and platforms
– e.g. runtime platforms like the Java JRE
» Users can run applications directly on a cloud platform
» No own servers or platform software needed
– e.g. abstracted storage space, like space within a database or a file system
» This is what we did in the last weeks!
13.2 The Cloud
• Software services
– i.e. some functionality required by user software is provided “by the cloud”
» Used via web service remote procedure calls
» e.g. delegate the rendering of a map in some user application to Google Maps
• Full software functionality
– e.g. rented Web applications replacing traditional server or desktop applications
» e.g. rent CRM software online from SalesForce, use Google apps instead of MS Office, etc.
13.2 The Cloud
• Underlying base problem…
– Successfully running IT departments and IT infrastructures can be very difficult and expensive for companies
– High fixed costs
• Acquiring and paying competent IT staff
– “Competent” is often very hard to get…
• Buying and maintaining servers
• Correctly hosting hardware
– Proper power and cooling facilities, network connections, server racks, etc.
• Buying and maintaining software
13.2 The Cloud
– Load and Utilization Issues
• How much hardware resources are
required by each application and/or service?
• How to handle scaling issues?
– What happens if demand increases or declines?
– How to handle spike loads?
– “Digg Effect”
• Traditional data centers are notoriously underutilized, often idle 85% of the time
– Over-provisioning for future growth or spikes
– Insufficient capacity planning and sizing
– Improper understanding of scalability requirements, etc.
13.2 The Cloud
• Cloud computing centrally unifies computation resources and
provides them on-demand
– Degree of centralization and provision may differ
• Centralize hardware within a department? A company? A number of companies? Globally?
• Provide resources only oneself? To some partners?
To anybody?
• How to compensate providers for resource usage?
– Provide resources with a rental model (e.g. monthly fee)?
– Provide resources metered on a what-is-used basis (similar to electricity or water)?
– Provide resources for free?
13.2 The Cloud
• Usually, three types of clouds are distinguished
– Public Clouds
– Private Clouds
– Hybrid Clouds
13.2 The Cloud
– Public Clouds
• “Traditional” cloud computing
• Services and resources are offered via the Internet to anybody willing to pay for them
– User just pays for services, usually no acquisition, administration or maintenance of hardware / software necessary
• Services usually provided by off-site 3rd-party providers
– Open for use by general public
• Exist beyond firewall, fully hosted and managed by the vendor
• Customers are individuals, corporations and others
• e.g. Amazon's Web Services and Google AppEngine
• Offers start-ups and SMBs quick setup, scalability, flexibility, and automated management. The pay-as-you-go model helps start-ups to start small and grow big
– Security and compliance?
– Reliability and privacy concerns hinder the adoption of the cloud
• Amazon S3 services were down for 6 hours in 2010
• What will Amazon do with all the data?
13.2 The Cloud
– Private Clouds
• Cloud computing hardware is within the premises of a company, behind the corporate firewall
• Resources are only provided internally for various departments
• Private clouds are still fully bought, built, and maintained by the company using them
– But usually not exclusive to single departments
– Still, costs could be prohibitive and may by far exceed those of public clouds
• Fine grained control over resources
• More secure as they are internal to organization
• Schedule and reshuffle resources based on business demands
• Ideal for apps requiring tight security and regulatory concerns
• Development requires hardware investments and in-house expertise
13.2 The Cloud
– Hybrid Clouds
• Both private and public cloud services, or even non-cloud services, are used or offered simultaneously
• “State of the art” for most companies relying on cloud technology
13.2 The Cloud
• Properties promised by cloud computing
– Agility
• Resources are quickly available when needed
– i.e. servers need not be ordered and built, software doesn't need to be configured and installed, etc.
– Costs
• Capital expenditure is converted to operational expenditure
– Independence
• Services are available everywhere and for any device
13.2 The Cloud
– Multi-tenancy
• Resources are shared by larger pool of users
• Resources can be centralized which reduces the costs
• Load distribution of users differs
– Peak loads can usually be distributed
– Overall utilization and efficiency of resources is better
– Reliability
• Most cloud services promise durable and reliable resources due to distribution and replication
– Scalability
• If a user needs more resources or performance, they can easily be provisioned
13.2 The Cloud
– Low maintenance
• Cloud services or applications are not installed on users' machines, but maintained centrally by specialized staff
– Transparency and metering
• Costs for computation resources are directly visible and transparent
• “Pay-what-you-use” models
• Cloud computing generally promises to be beneficial for fast growing start-ups, SMBs and enterprises alike
– Cost-effective solutions to key business demands
– Improved overall efficiency
13.2 The Cloud
• The cloud encourages a self-service model
– Users can simply request the resources they need
13.2 The Cloud
• Anything-as-a-Service
– XaaS=“X as a service”
– In general, cloud providers offer any computation resources “as a service”
– In the long run, all computation needs of a company should be modeled, provided and used “as a service”
• e.g. in Amazon’s private and public cloud infrastructures:
everything is a service!
13.3 XaaS
– Services provide a strictly defined functionality with certain guarantees
• Service description and service-level agreements (SLAs)
• The service description explains what is offered by the service
• SLAs further clarify the provisioning guarantees
– Often: performance, latency, reliability, availability, etc.
13.3 XaaS
• Usually, three main resources may be offered
“as a service”
– Software as a Service
• SaaS
– Platform as a Service
• PaaS
– Infrastructure as a Service
• IaaS
13.3 XaaS
(Layered stack: Client – Application – Platform – Infrastructure – Server)
• Application Services (services on demand)
– Gmail, Google Calendar
– Payroll, HR, CRM, etc.
– Sugar CRM, IBM Lotus Live
• Platform Services (resources on demand)
– Middleware, integration, messaging, information, connectivity, etc.
– Amazon AWS, Boomi, CastIron, Google AppEngine
• Infrastructure as services (physical assets as services)
– IBM Blue House, VMWare Cloud Edition, Amazon EC2, Microsoft Azure Platform, …
13.3 XaaS
13.3 XaaS
(Diagram: individuals, corporations, and non-commercial users access the cloud; a cloud middleware layer handles storage provisioning, OS provisioning, network provisioning, and service (app) provisioning, plus SLA monitoring, security, billing, and payment, on top of the underlying services, storage, network, and OS resources)
• Infrastructure as a Service (IaaS)
– Provides raw computation infrastructure, i.e. usually a virtual server
• e.g. see hardware virtualization (VMWare & co.)
• Successor to dedicated server rental
– For the user, a virtual server is similar to a real server
• Has CPU cores, main memory, hard disc space, etc.
• Usually provided as “self-service” raw machine
• User is responsible for installing and maintaining software such as the operating system, databases, or server software
• User does not need to buy, host, or maintain the actual hardware
13.3 IaaS
• The IaaS provider can host multiple virtual servers on a single, real machine
– Often, 10-30 virtual servers per real server
– Virtualization is used to abstract server hardware for virtual servers
• Virtual systems are also often called virtual machines (a neutral term) or appliances (usually suggesting a preinstalled OS and software)
– Virtualization of hardware is usually handled by a so-called hypervisor
• e.g., Xen, KVM, VMWare, HyperV, …
13.3 IaaS
• In short, IaaS is a virtualization on multiple hardware machines
– Normal Server
• 1 machine with one OS
– Traditional virtualization
• 1 machine hosting multiple virtual servers
– Distributed Application
• 1 appliance running on multiple machines
– IaaS
• Multiple machines running multiple virtual servers
• Dynamic load balancing between machines
13.3 IaaS
(Matrix over #machines × #appliances: 1×1 “normal” server, 1×many traditional virtualization, many×1 distributed appliance, many×many IaaS)
• Hypervisor is responsible for allocating available resources to VMs
– Dispatch VMs to machines
– Relocate VMs to balance load
– Distribute resources
• Network adaptors, logical discs, RAM, CPU cores, etc…
13.3 IaaS
• Usually, virtual machines offered by IaaS
infrastructures cannot grow arbitrarily big
– Capped by the actual server size or the size of a smaller server group
• Really big applications are usually deployed in so-called Pods
– Similar to database shards
– Group of machines running one or multiple appliances
– Machines within a Pod are very tightly networked
13.3 IaaS
– i.e. each Pod is a full copy of the given virtual machines with full OS and applications installed
• Usually, there are multiple copies of a given Pod (and its VMs)
• Each Pod is responsible for a disjoint part of the whole workload
– Pods are usually scattered across availability zones (e.g. data centers or a certain rack)
• Physically separated, usually with own power / network, etc.
13.3 IaaS
• IaaS Pods
13.3 IaaS
– Simplified Pod example: GoogleMail
• Multiple Pods, each Pod running on multiple machines with a full and independent installation of Gmail software
• Load balancer decides during user log-in which Pod will handle the user session
– Users are distributed across Pods
• Pods are flexible by using shared GFS file system
13.3 IaaS
• Mission critical applications should be designed such that they run in multiple availability
zones on multiple Pods
– Cloud control system (CCS) responsible for distribution and replication
13.3 IaaS
• Pod Architectures
– Each pod consists of multiple machines with mainboards, CPUs, and main memory
– Question: where to put secondary storage?
– Usually, three options
• Storage area network (SAN)
• Direct attached storage (DAS)
• Network attached storage (NAS)
– or… a storage service! (e.g. GFS & co.)
13.3 IaaS
• SAN Pods
– Individual servers don't have their own secondary storage
– A storage area network provides shared hard-disk storage for all machines of a Pod
– Pro
• All machines have access to the same data
• Allows for dynamic load balancing or migration of appliances
– e.g. VMware vMotion
– Con
• Very, very expensive
• Higher latency than direct attached storage
13.3 IaaS
• SAN Pods
13.3 IaaS
• DAS Pods
– Each server has its own set of hard drives
– Accessing data from other servers may be difficult
– Pro
• Cheap
• Low latency for accessing local data
– Con
• Usually, no shared data access
• Usually, difficult to live-migrate appliances (due to no shared data)
– But: by using clever storage abstractions, common problems can be circumvented
• Use a distributed file system or a distributed data store!
– e.g. Amazon S3 & SimpleDB, Google GFS & BigTable, Apache HBase & HDFS, etc.
13.3 IaaS
• DAS Pods
13.3 IaaS
• IaaS example: Amazon EC2
– The Elastic Compute Cloud is one of the core services of the Amazon Cloud Infrastructure
• Public IaaS Cloud
– Customers may rent virtual servers hosted at Amazon's data centers
• Can freely install OS and applications as needed
– Virtual servers are offered in different sizes and are paid by CPU usage
• Basic storage is offered within the VM, but usually additional storage services are used by application which cost extra
– e.g. S3, SimpleDB, or Dynamo DB
13.3 Amazon EC2
• Example: t2.micro
– 1.0 GB memory
– 1 vCPU
• 1 virtual core
• 1 vCPU is roughly one 2.5 GHz Xeon core
– No dedicated storage
• Has to use AWS network storage
– Burstable performance: 6 CPU credits per hour
• 1 CPU credit = 1 minute of full CPU performance, i.e. a sustained baseline of about 10% of one core
– Costs $0.013 per hour
• About $9.30 per month
– Usually many users start with the small instance; also heavily used for testing
13.3 Amazon EC2
• Example: m3.xlarge
– 15 GB memory
– 4 vCPUs
• Total of 13 ECU (Elastic Compute Units)
• 1 ECU is roughly equal to a 1.5 GHz Xeon core
– 80 GB instance storage on SSD
• More storage via AWS
– Costs $0.28 per hour
• About $201 per month
13.3 Amazon EC2
• Example: i2.8xlarge
– 244 GB memory
– 32 vCPUs
• Total of 104 ECU
– 6,400 GB of instance storage on SSD
– Costs $6.82 per hour
• About $4,910 per month
13.3 Amazon EC2
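Assuming a 720-hour billing month (24 h × 30 days, our assumption), the monthly figures quoted for the instance examples follow directly from the hourly rates:

```python
def monthly_cost(hourly_rate, hours=720):   # 24 h x 30 days
    return hourly_rate * hours

print(int(monthly_cost(0.28)))        # m3.xlarge:  ~$201 per month
print(int(monthly_cost(6.82)))        # i2.8xlarge: ~$4910 per month
print(round(monthly_cost(0.013), 2))  # t2.micro: $9.36, close to the quoted $9.30
```

The small deviation for t2.micro suggests the slide rounded or used a slightly shorter month.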
• Rough Estimations (Oct 2009)
– Roughly 40,000 servers
– Uses standard server racks with 16 machines per rack
• Mostly packed with 2U dual-socket Quad-Core Intel Xeons
– Roughly matches the High-Mem Quad XL instance…
– Uses around 8 × 500 GB RAID-0 disks per machine
– Target cost around $2,500 per machine on average
– 75% of the machines are in the US, the remainder in Europe and Asia
– Amazon aims at a utilization rate of 75%
– Very rough guesses state that Amazon may earn
$25,264 per hour with EC2!
• http://cloudscaling.com/blog/cloud-computing/amazons-ec2-generating-220m-annually
13.3 Amazon EC2
• Platform as a Service (PaaS)
– Provides software platforms on demand
• e.g. runtime engines (JavaVM, .Net Runtime, etc.), storage systems (distributed file system, or databases), web services,
communication services, etc.
– PaaS systems are usually used to develop and host web applications or web services
• User applications run on the provided platform
– In contrast to IaaS, no installation and maintenance of the operating system and server applications is necessary
• Centrally managed and maintained
• Services or runtimes are directly usable
13.3 PaaS
• Google AppEngine provides users a managed Python or Java runtime
– Web applications can be directly hosted in AppEngine
• Just upload your WAR file and you are done…
– Users are billed by resource usage
• Some free resources provided every day
– 1 GB in- and out-traffic, 6.5 CPU hours, 500 MB storage overall
13.3 Google AppEngine
Resource Unit Unit cost
Outgoing Bandwidth GB $0.12
Incoming Bandwidth GB $0.10
CPU Time CPU hours $0.10
Stored Data GB / month $0.15
• Each application can access system resources up to a fixed maximum
– AppEngine is not fully scalable!
– AppEngine max values (2010)
• CPU: 1730 hours CPU per day; 72 minutes CPU per minute
• Data in or out: 1 TB per day; 10 GB per minute
• Request: 43M web service calls per day, 30K calls per minute
• Data storage: no limit (uses BigTable which can scale in size!!)
13.3 Google AppEngine
• Amazon SimpleDB is a data storage system roughly similar to Google BigTable
– http://aws.amazon.com/simpledb
– Simple table-centric database engine
• SimpleDB is directly ready to use
– No user configuration or administration
– Accessible via web service
• SimpleDB is highly available, uses flexible schemas, and eventual consistency
– Similar to HBase or BigTable
13.3 Amazon SimpleDB
– Any application may use SimpleDB for data storage
• A simple web service is provided to interact with SimpleDB
• Create or delete a table (called domain)
• Put and delete rows
• Query for rows
– Users pay for storage, data transfer, and computation time
• 25 hours of computation time (for querying) are free per month
– Beyond that: $0.154 per machine hour in 2009
– Beyond that: $0.140 per machine hour in 2014
• 1 GB of data transfer is free per month
– Beyond that: $0.15 per GB in 2009
– Beyond that: $0.12 per GB in 2014
• 1 GB of data storage is free per month
– Beyond that: $0.28 per GB in 2009
– Beyond that: $0.25 per GB in 2014
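The domain/put/query interface described above can be imitated with an in-memory sketch (plain Python; the class and method names are our own simplification, not the real SimpleDB web-service API):

```python
class ToyDomain:
    """In-memory stand-in for a SimpleDB domain: flexible-schema rows."""
    def __init__(self, name):
        self.name, self.items = name, {}

    def put(self, item_name, **attributes):
        # rows need not share a schema -- any attributes are accepted
        self.items.setdefault(item_name, {}).update(attributes)

    def delete(self, item_name):
        self.items.pop(item_name, None)

    def query(self, pred):
        # return the names of all items whose attributes match the predicate
        return [n for n, attrs in self.items.items() if pred(attrs)]

users = ToyDomain("users")
users.put("u1", name="alice", age=23)
users.put("u2", name="bob")            # no age attribute -- still fine
print(users.query(lambda a: a.get("age", 0) >= 18))  # ['u1']
```

The flexible schema is the point: unlike a relational table, two items in the same domain may carry entirely different attribute sets.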
13.3 Amazon SimpleDB
• Software as a Service (SaaS)
– Full applications are offered on-demand
• Users just need to consume the software; no installation or maintenance necessary
– All administrative and maintenance tasks are performed by the Cloud provider
• e.g. hosting physical hardware, maintaining platforms,
maintaining software, dealing with security, scalability, etc.
13.3 SaaS
• Salesforce.com On-Demand CRM software
– Customer-Relationship-Management
• Cooperation with Google Apps in early summer
– Provides simple online services for
• Customer database
• Lead management
• Call center
• Customer portal
• Knowledge Bases
• Collaboration environments
• Etc.