Distributed Data Management

Academic year: 2021

(1)

Prof. Dr. Wolf-Tilo Balke

Institut für Informationssysteme

Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de

Distributed Data Management

(2)

13.1 Map & Reduce
13.2 Cloud
13.3 Computing as a Service
• SaaS, PaaS, IaaS

13.0 The Cloud

(3)

Just storing massive amounts of data is not enough!

– Often, we also need to process and transform that data

Large-Scale Data Processing

Use thousands of worker nodes within a computation cluster to process large data batches

Preferably without the hassle of managing things

Map & Reduce provides

– Automatic parallelization & distribution
– Fault tolerance

– I/O scheduling

– Monitoring & status updates

13.1 Map & Reduce

(4)

Initially, implemented by Google for building the Google search index

i.e. crawling the Web, building the inverted word index, computing PageRank, etc.

• General framework for parallel high volume data processing

J. Dean, S. Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters", Symp. on Operating System Design and Implementation (OSDI), San Francisco, USA, 2004

Also available as Open Source implementation as part of Apache Hadoop

• http://hadoop.apache.org/mapreduce/

13.1 Map & Reduce

(5)

Base idea

There is a large amount of input data, identified by a key

i.e. input given as key-value pairs

e.g. all web pages of the internet identified by their URL

A map operation is a simple function which accepts one key-value pair as input

A map operation runs as an autonomous thread on a single node of the cluster

Many map jobs can run in parallel on different input keys

For a single input key-value pair, map returns a set of intermediate key-value pairs

map(key, value) → set of intermediate (key, value) pairs

After a map job is finished, the node is free to perform another map job for the next input key-value pair

A central controller distributes map jobs to free nodes

13.1 Map & Reduce

(6)

– After the input data is mapped, reduce jobs can start
– reduce(key, values) is run for each unique key emitted by map()

• Each reduce job also runs autonomously on a single node

Many reduce jobs can run in parallel on different intermediate key groups

• Reduce emits the final output of the map-reduce operation

– Each reduce job…

• Takes all map tuples with a given key as input

• Usually generates one, sometimes several, output tuples

13.1 Map & Reduce

(7)

Each reduce is executed on a set of intermediate map results which have the same key

– To efficiently select this set, the intermediate key-value pairs are usually shuffled

i.e. sorted and grouped by their respective key

– After shuffling, the input data for a reduce job can be selected by a simple range scan (see the sketch below)

13.1 Map & Reduce
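A minimal sketch of the shuffle step, assuming all intermediate pairs fit in memory (real implementations sort and merge on disk across nodes):

# Minimal sketch: sort intermediate (key, value) pairs by key and group them,
# so each reduce task can read all values for one key via a simple range scan.
from itertools import groupby
from operator import itemgetter

def shuffle(intermediate_pairs):
    # intermediate_pairs: list of (key, value) tuples emitted by map tasks
    ordered = sorted(intermediate_pairs, key=itemgetter(0))
    return {key: [v for _, v in group]
            for key, group in groupby(ordered, key=itemgetter(0))}

# Example: shuffle([("db", 1), ("and", 1), ("db", 1)])
# -> {"and": [1], "db": [1, 1]}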

(8)

Example: Counting words in documents

13.1 Map & Reduce

map(key, value):
  // key: document name
  // value: text of the document
  for each word w in value:
    emit(w, 1);

reduce(key, values):
  // key: a word
  // values: list of counts
  result = 0;
  for each v in values:
    result += v;
  emit(key, result);
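For reference, a minimal single-process Python version of the pseudocode above (an illustration only; the real framework executes map and reduce tasks on many nodes):

# Single-process word count following the map/reduce pseudocode above.
def map_fn(doc_name, text):
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    yield (word, sum(counts))

def word_count(documents):
    # documents: dict of document name -> text
    intermediate = {}
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            intermediate.setdefault(key, []).append(value)
    return dict(pair for key, values in intermediate.items()
                     for pair in reduce_fn(key, values))

print(word_count({"doc1": "distributed db and p2p"}))
# {'distributed': 1, 'db': 1, 'and': 1, 'p2p': 1}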

(9)

Example: Counting words in documents

13.1 Map & Reduce

doc1: "distributed db and p2p"
doc2: "map and reduce is a distributed processing technique for db"

map(key, value) output (excerpt):
(distributed, 1), (db, 1), (and, 1), (p2p, 1),
(map, 1), (and, 1), (reduce, 1), (is, 1), (a, 1), (distributed, 1), …

reduce(key, values) output (excerpt):
(distributed, 2), (db, 2), (and, 2), (p2p, 1), (map, 1), (reduce, 1), (is, 1), …

(10)

Improvement: Combiners

Combiners are mini-reducers that run in-memory after the map phase

– Used to group rare map keys into larger groups

• e.g. word counts: group multiple extremely rare words under one key (and mark that they are grouped…)

Used to reduce network and worker scheduling overhead (see the sketch below)

13.1 Map & Reduce
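A minimal sketch of such a combiner for the word-count example, assuming local per-map-task pre-aggregation (the function name is illustrative):

# Combiner sketch: pre-aggregate the (word, 1) pairs emitted by one map task
# locally, in memory, so fewer intermediate pairs must be shuffled over the network.
from collections import Counter

def combine(map_output):
    # map_output: iterable of (word, 1) pairs from a single map task
    local_counts = Counter()
    for word, count in map_output:
        local_counts[word] += count
    return list(local_counts.items())

# combine([("db", 1), ("db", 1), ("and", 1)]) -> [("db", 2), ("and", 1)]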

(11)

Responsibility of the map and reduce master

Often called scheduler

Assign Map and Reduce tasks to workers on nodes

Usually, map tasks are assigned to worker nodes as a batch and not one by one

Such a batch is often called a split, i.e. a subset of the whole input data

Splits are often implemented by a simple hash function with as many buckets as there are worker nodes (see the sketch below)

The full split is assigned to some worker node, which starts a map task for each input key-value pair in it

Check for node failures

Check for task completion

Route map results to reduce tasks

13.1 Map & Reduce
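A minimal sketch of such hash-based split assignment, assuming Python's built-in hash as the partitioning function (real systems use a stable hash over the serialized key):

# Hash-partition input key-value pairs into splits, one split per worker node.
def build_splits(input_pairs, num_workers):
    # input_pairs: iterable of (key, value); returns one split per worker
    splits = [[] for _ in range(num_workers)]
    for key, value in input_pairs:
        splits[hash(key) % num_workers].append((key, value))
    return splits

docs = [("doc1", "distributed db and p2p"), ("doc2", "map and reduce ...")]
for worker_id, split in enumerate(build_splits(docs, num_workers=2)):
    print(worker_id, [k for k, _ in split])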

(12)

• Map and Reduce overview

13.1 Map & Reduce

(13)

Master is responsible for worker node fault tolerance

– Handled via re-execution

• Detect failures via periodic heartbeats (see the sketch below)

• Re-execute completed and in-progress map tasks

• Re-execute in-progress reduce tasks

• Task completion is committed through the master

– Robust: in one incident 1,600 of 1,800 machines were lost, yet the job still finished correctly

• Master failures are not handled

– Unlikely due to redundant hardware…

13.1 Map & Reduce
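A toy sketch of heartbeat-based failure detection with re-execution; the 10-second timeout and the bookkeeping structures are illustrative assumptions, not from the paper:

# Master-side failure detection: workers ping periodically; if a worker goes
# silent, all of its tasks are put back into the pending queue for re-execution.
import time

HEARTBEAT_TIMEOUT = 10.0            # seconds without a heartbeat -> worker considered failed

last_heartbeat = {}                 # worker id -> time of last heartbeat
assigned_tasks = {}                 # worker id -> list of map/reduce tasks

def on_heartbeat(worker):
    last_heartbeat[worker] = time.time()

def detect_failures(pending_queue):
    now = time.time()
    for worker, seen in list(last_heartbeat.items()):
        if now - seen > HEARTBEAT_TIMEOUT:
            # re-execute everything the lost worker was responsible for
            pending_queue.extend(assigned_tasks.pop(worker, []))
            del last_heartbeat[worker]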

(14)

Showcase: machine usage during web indexing

– Fine granularity tasks: map tasks >> machines

• Minimizes time for fault recovery

• Can pipeline shuffling with map execution

• Better dynamic load balancing

– Showcase uses 200,000 map tasks & 5,000 reduce tasks
– Running on 2,000 machines

13.1 Map & Reduce

(15)

PageRank is one of the major algorithms behind Google Search

See our wonderful IRWS lecture (No 12)!!

Key Question: How important is a given website?

Importance independent of query

– Idea: other pages “vote” for a site by linking to it

also called “giving credit to”

Pages with many votes are probably important

– If an important site "votes" for another site, that vote has a higher weight than when an unimportant site votes

13.1 MR - PageRank


(16)

• Given page $x$ with in-bound links $t_1, \dots, t_n$, where

– $C(t)$ is the out-degree of $t$
– $\alpha$ is the probability of a random jump
– $N$ is the total number of nodes in the graph

– $PR(x) = \alpha \frac{1}{N} + (1 - \alpha) \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}$

13.1 MR - PageRank

(17)

• Properties of PageRank

– Can be computed iteratively
– The effect of each iteration is local

• Sketch of algorithm:

– Start with seed $PR_i$ values

– Each page distributes its $PR_i$ "credit" to all pages it links to

– Each target page adds up the "credit" from its in-bound links to compute $PR_{i+1}$

– Iterate until the values converge

13.1 MR - PageRank

(18)

13.1 MR - PageRank

Map step: distribute PageRank "credit" to link targets

Reduce step: gather up PageRank "credit" from multiple sources to compute the new PageRank value (see the sketch below)
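A single-process sketch of these two steps; the damping factor, the toy link graph, and the helper names are assumptions for illustration only:

# PageRank expressed as map (distribute credit) and reduce (sum credit) steps.
ALPHA = 0.15                                        # random-jump probability (assumed value)
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # toy link graph: page -> outgoing links
N = len(GRAPH)

def pagerank_map(page, rank):
    # map step: distribute this page's current rank ("credit") to its link targets
    out_links = GRAPH[page]
    for target in out_links:
        yield (target, rank / len(out_links))

def pagerank_reduce(page, credits):
    # reduce step: combine incoming credit into the new rank, plus the random-jump term
    yield (page, ALPHA / N + (1 - ALPHA) * sum(credits))

ranks = {page: 1.0 / N for page in GRAPH}
for _ in range(20):                                 # fixed number of iterations for the sketch
    intermediate = {}
    for page, rank in ranks.items():
        for target, credit in pagerank_map(page, rank):
            intermediate.setdefault(target, []).append(credit)
    new_ranks = {page: ALPHA / N for page in GRAPH}  # pages without in-links keep only the jump term
    for page, credits in intermediate.items():
        new_ranks.update(pagerank_reduce(page, credits))
    ranks = new_ranks
print(ranks)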

(19)

• Dryad (Microsoft)

– Relational Algebra

• Pig (Yahoo)

– Near Relational Algebra over MapReduce

• HIVE (Facebook)

– SQL over MapReduce

• Cascading

– University of Wisconsin

• HBase

– Indexing on HDFS

13.1 MapReduce Contemporaries

(20)

An engine for executing programs on top of Hadoop.

It provides a language, Pig Latin, to specify these programs.

• An Apache open source project http://pig.apache.org

13.1 Pig

(21)

• Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited sites by users aged 18-25

13.1 Pig: Motivation

Dataflow:
– Load Users, Load Pages
– Filter Users by age
– Join on name
– Group on url
– Count clicks
– Order by clicks

(22)

13.1 In MapReduce

170 lines of code, 4 hours to write

(23)

Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd   = join Fltrd by name, Pages by user;
Grpd  = group Jnd by url;
Smmd  = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd  = order Smmd by clicks desc;
Top5  = limit Srtd 5;
store Top5 into 'top5sites';

9 lines of code, 15 minutes to write

13.1 In Pig Latin

(24)

13.1 Pig System Overview

Pig Latin program:

A = LOAD 'file' AS (sid, pid, mass, px:double);
B = LOAD 'file2' AS (sid, pid, mass, px:double);
C = FILTER A BY px < 1.0;
D = JOIN C BY sid, B BY sid;
STORE D INTO 'output.txt';

The Pig parser turns this into a parsed program; the Pig compiler then produces an execution plan:
LOAD (disk A) → FILTER, LOAD (disk B), then JOIN

(25)

13.1 Comparing Performance

How fast is Pig compared to a pure Map-Reduce implementation?

(26)

• Atom:

– Integer, string, etc.

• Tuple:

– Sequence of fields

– Each field can be of any type

• Bag:

– Collection of tuples, not necessarily of the same type
– Duplicates are allowed

• Map:

– String literal keys mapped to values of any type

13.1 Data Model
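A rough Python analogy of these four types (illustration only, not Pig syntax):

# Pig data model by analogy; the concrete values are made up.
atom   = "alice"                                             # atom: a single scalar value
record = ("alice", 23, 3.7)                                  # tuple: sequence of fields, each of any type
bag    = [("alice", 23), ("bob", "unknown"), ("alice", 23)]  # bag: tuples, mixed types and duplicates allowed
fields = {"name": "alice", "grades": [1.0, 2.3]}             # map: string keys mapped to values of any type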

(27)

13.1 Pig Latin Statement

A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
X = FOREACH A GENERATE name, $2;

                                            First field   Second field   Third field
Data type                                   chararray     int            float
Positional notation (generated by system)   $0            $1             $2
Possible name (assigned by user schema)     name          age            gpa

(28)

• Map-Reduce: Iterative Jobs

– Iterative jobs involve a lot of disk I/O for each repetition

13.1 Apache Spark Motivation

(29)

13.1 Apache Spark Motivation

Using Map Reduce for complex jobs, interactive queries and online processing involves lots of disk I/O

Idea: keep more data in memory!

(30)

13.1 Use Memory instead of Disk

(31)

13.1 In-Memory Data Sharing

(32)

• Most real applications require multiple MR steps:

– Google indexing pipeline: 21 steps
– Analytics queries (e.g. count clicks & top-k): 2-5 steps
– Iterative algorithms (e.g. PageRank): tens of steps

• Multi-step jobs create spaghetti code

– 21 MR steps → 21 mapper and reducer classes

13.1 Programmability

(33)

13.1 Programmability

(34)

13.1 Performance

[Source: Daytona GraySort benchmark, sortbenchmark.org]

(35)

• Open source processing engine.

• Originally developed at UC Berkeley in 2009.

• More than 100 operators for transforming data.

• World record for large-scale on-disk sorting.

• Built-in support for many data sources (HDFS, RDBMS, S3, Cassandra)

13.1 Apache Spark

[Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., Stoica, I. (2010). "Spark: Cluster Computing with Working Sets". HotCloud'10, Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing.]

[Zaharia, M., Chowdhury, M., Das, T., Dave, A., et al. (2012). "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing". NSDI'12, Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation.]

(36)

13.1 Spark Tools

(37)

Write programs in terms of distributed datasets and operations on them

– Resilient Distributed Datasets (RDDs)

• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure

– Operations (see the sketch below)

• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)

13.1 Resilient Distributed Datasets
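A minimal PySpark sketch of building an RDD and applying transformations and an action, assuming a local Spark installation; the input file name is a placeholder:

# RDD word count: transformations are lazy, the action triggers execution.
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")

lines  = sc.textFile("docs.txt")                       # RDD built from a data source
words  = lines.flatMap(lambda line: line.split())      # transformation
pairs  = words.map(lambda w: (w, 1))                   # transformation
counts = pairs.reduceByKey(lambda a, b: a + b)         # transformation (causes a shuffle)

print(counts.take(10))                                 # action: triggers the computation
sc.stop()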

(38)

13.1 Working with RDDs

(39)

13.1 Spark vs. Map Reduce

                           Hadoop MapReduce      Spark
Storage                    Disk only             In-memory or on disk
Operations                 Map and Reduce        Map, Reduce, Join, Sample, etc.
Execution model            Batch                 Batch, interactive, streaming
Programming environments   Java                  Scala, Java, R, and Python

(40)

The term “cloud computing” is often seen as a successor of client-server architectures

– Often used as a synonym for centralized, on-demand, pay-what-you-use provisioning of general computation resources

Comparable to utility providers like electric power grids or water supply

• “Computing as a commodity”

“Cloud” is used as a metaphor for the Internet

• Users or applications “just use” computation resources provided in the Internet instead of using local hardware or software

13.2 The Cloud

(41)

“Computation resources” can mean a lot of things:

Dynamic access to "raw metal"

– Raw storage space or CPU time
– Fully operational servers are provided by the cloud

Low-level services and platforms

– e.g. runtime platforms like the Java JRE
  » Users can run applications directly on a cloud platform
  » No own servers or platform software are needed
– e.g. abstracted storage space, like space within a database or a file system
  » This is what we did in the last weeks!

13.2 The Cloud

(42)

Software services

i.e. some functionality required by user software is provided "by the cloud"

» Used via web service remote procedure calls

» e.g. delegate the rendering of a map in some user application to Google Maps

Full software functionality

e.g. rented Web applications replacing traditional server or desktop applications

» e.g. rent CRM software online from SalesForce, use Google apps instead of MS Office, etc.

13.2 The Cloud

(43)

Underlying base problem…

Successfully running IT departments and IT infrastructures can be very difficult and expensive for companies

High fixed costs

Acquiring and paying competent IT staff

“Competent” is often very hard to get…

Buying and maintaining servers

Correctly hosting hardware

Proper power and cooling facilities, network connections, server racks, etc.

Buying and maintaining software

13.2 The Cloud

(44)

Load and Utilization Issues

• How many hardware resources are required by each application and/or service?

• How to handle scaling issues?

What happens if demand increases or declines?

How to handle spike loads?

"Digg effect"

Traditional data centers are notoriously underutilized, often idle 85% of the time

– Over-provisioning for future growth or spikes
– Insufficient capacity planning and sizing
– Improper understanding of scalability requirements, etc.

13.2 The Cloud

(45)

Cloud computing centrally unifies computation resources and provides them on demand

– Degree of centralization and provision may differ

• Centralize hardware within a department? A company? A number of companies? Globally?

• Provide resources only oneself? To some partners?

To anybody?

• How to compensate providers for resource usage?

Provide resources with a rental model (e.g. a monthly fee)?

Provide resources metered on a what-is-used basis (e.g. similar to electricity or water)?

Provide resources for free?

13.2 The Cloud

(46)

• Usually, three types of clouds are distinguished

Public Clouds, Private Clouds, Hybrid Clouds

13.2 The Cloud

(47)

Public Clouds

“Traditional” cloud computing

Services and resources are offered via the Internet to anybody willing to pay for them

User just pays for services, usually no acquisition, administration or maintenance of hardware / software necessary

Services usually provided by off-site 3rd-party providers

Open for use by general public

Exist beyond firewall, fully hosted and managed by the vendor

Customers are individuals, corporations and others

e.g. Amazon's Web Services and Google AppEngine

Offers start-ups and SMBs quick setup, scalability, flexibility, and automated management; the pay-as-you-go model helps start-ups to start small and go big

Security and compliance?

Reliability and privacy concerns hinder the adoption of the cloud

Amazon S3 services were down for 6 hours in 2010

What will Amazon do with all the data?

13.2 The Cloud

(48)

Private Clouds

Cloud computing hardware is located on the premises of a company, behind the corporate firewall

Resources are only provided internally, to the various departments

Private clouds are still fully bought, built, and maintained by the company using them

But usually not exclusive to single departments

Still, costs can be prohibitive and may far exceed those of public clouds

Fine-grained control over resources

More secure, as they are internal to the organization

Schedule and reshuffle resources based on business demands

Ideal for applications with tight security and regulatory requirements

Development requires hardware investments and in-house expertise

13.2 The Cloud

(49)

Hybrid Clouds

• Both private and public cloud services, or even non-cloud services, are used or offered simultaneously

• "State of the art" for most companies relying on cloud technology

13.2 The Cloud

(50)

Properties promised by cloud computing

Agility

• Resources are quickly available when needed

i.e. servers do not have to be ordered and built, software does not need to be configured and installed, etc.

Costs

• Capital expenditure is converted to operational expenditure

Independence

• Services are available everywhere and for any device

13.2 The Cloud

(51)

Multi-tenancy

• Resources are shared by larger pool of users

Resources can be centralized which reduces the costs

Load distribution of users differs

Peak loads can usually be distributed

Overall utilization and efficiency of resources is better

Reliability

• Most cloud services promise durable and reliable resources due to distribution and replication

Scalability

• If a user needs more resources or performance, they can easily be provisioned

13.2 The Cloud

(52)

Low maintenance

Cloud services or applications are not installed on users' machines, but are maintained centrally by specialized staff

Transparency and metering

Costs for computation resources are directly visible and transparent

“Pay-what-you-use” models

• Cloud computing generally promises to be beneficial for fast growing start-ups, SMBs and enterprises alike

– Cost-effective solutions to key business demands
– Improved overall efficiency

13.2 The Cloud

(53)

The cloud encourages a self-service model

– Users can simply request the resources they need

13.2 The Cloud

(54)

Anything-as-a-Service

– XaaS=“X as a service”

– In general, cloud providers offer any computation resources “as a service”

– In the long run, all computation needs of a company should be modeled, provided and used “as a service”

• e.g. in Amazon’s private and public cloud infrastructures:

everything is a service!

13.3 XaaS

(55)

Services provide a strictly defined functionality with certain guarantees

Service description and service-level agreements (SLAs)

• The service description explains what is offered by the service

• SLAs further clarify the provisioning guarantees

Often: performance, latency, reliability, availability, etc.

13.3 XaaS

(56)

• Usually, three main resources may be offered "as a service"

Software as a Service

• SaaS

Platform as a Service

• PaaS

Infrastructure as a Service

• IaaS

13.3 XaaS

(Figure: layered stack – client, application, platform, infrastructure, server)

(57)

Application services (services on demand)

– Gmail, Google Calendar
– Payroll, HR, CRM, etc.
– SugarCRM, IBM LotusLive

Platform services (resources on demand)

– Middleware, integration, messaging, information, connectivity, etc.
– Amazon AWS, Boomi, Cast Iron, Google AppEngine

Infrastructure as a service (physical assets as services)

– IBM Blue House, VMware Cloud Edition, Amazon EC2, Microsoft Azure Platform, …

13.3 XaaS

(58)

13.3 XaaS

(Figure: cloud middleware architecture – individuals, corporations, and non-commercial users access the cloud through a middleware layer offering storage, OS, network, and service (apps) provisioning, plus SLA monitoring, security, billing, and payment, on top of the underlying services, storage, network, and OS resources)

(59)

Infrastructure as a Service (IaaS)

– Provides raw computation infrastructure, i.e. usually a virtual server

e.g. see hardware virtualization (VMWare & co.)

Successor to dedicated server rental

For the user, a virtual server is similar to a real server

Has CPU cores, main memory, hard disc space, etc.

Usually provided as “self-service” raw machine

The user is responsible for installing and maintaining software such as the operating system, databases, or server software

User does not need to buy, host, or maintain the actual hardware

13.3 IaaS

(60)

The IaaS provider can host multiple virtual servers on a single, real machine

– Often, 10-30 virtual servers per real server

– Virtualization is used to abstract the server hardware for the virtual servers

Virtual systems are also often called virtual machines (neutral term) or appliances (usually suggesting a preinstalled OS and software)

– Virtualization of the hardware is usually handled by a so-called hypervisor

• e.g., Xen, KVM, VMWare, HyperV, …

13.3 IaaS

(61)

In short, IaaS is virtualization across multiple hardware machines

– Normal Server

1 machine with one OS

– Traditional virtualization

1 machine hosting multiple virtual servers

– Distributed Application

1 appliance running on multiple machines

– IaaS

Multiple machines running multiple virtual servers

Dynamic load balancing between machines

13.3 IaaS

(Figure: #machines vs. #appliances – one machine running one appliance is a "normal" server; one machine running many appliances is "traditional" virtualization; many machines running one appliance is a distributed appliance; many machines running many appliances is IaaS)

(62)

Hypervisor is responsible for allocating available resources to VMs

– Dispatch VMs to machines
– Relocate VMs to balance load
– Distribute resources (see the sketch below)

• Network adapters, logical discs, RAM, CPU cores, etc.

13.3 IaaS
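A toy sketch of a greedy placement policy that only considers free RAM; real hypervisor and cloud schedulers also weigh CPU, network, affinity, and live migration:

# Place each VM on the machine with the most free RAM, biggest VMs first.
def place_vms(machines, vms):
    # machines: dict machine name -> free RAM in GB
    # vms:      list of (vm name, required RAM in GB)
    placement = {}
    for vm, ram in sorted(vms, key=lambda x: -x[1]):   # biggest VMs first
        target = max(machines, key=machines.get)       # machine with most free RAM
        if machines[target] < ram:
            raise RuntimeError(f"no machine can host {vm}")
        machines[target] -= ram
        placement[vm] = target
    return placement

print(place_vms({"m1": 64, "m2": 32}, [("web", 8), ("db", 24), ("cache", 16)]))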

(63)

• Usually, the virtual machines offered by IaaS infrastructures cannot grow arbitrarily big

– Capped by the actual server size or the size of a smaller server group

• Really big applications are usually deployed in so-called Pods

– Similar to database shards

– A group of machines running one or multiple appliances
– Machines within a Pod are very tightly networked

13.3 IaaS

(64)

i.e. each Pod is a full copy of the given virtual machines, with the full OS and applications installed

• Usually, there are multiple copies of a given Pod (and its VMs)

Each Pod is responsible for a disjoint part of the whole workload

Pods are usually scattered across availability zones (e.g. data centers or a certain rack)

• Physically separated, usually with own power / network, etc.

13.3 IaaS

(65)

• IaaS Pods

13.3 IaaS

(66)

Simplified Pod example: GoogleMail

• Multiple Pods, each Pod running on multiple machines with a full and independent installation of Gmail software

• Load balancer decides during user log-in which Pod will handle the user session

Users are distributed across Pods

• Pods remain flexible by using the shared GFS file system

13.3 IaaS

(67)

Mission-critical applications should be designed such that they run in multiple availability zones on multiple Pods

Cloud control system (CCS) responsible for distribution and replication

13.3 IaaS

(68)

Pod Architectures

– Each pod consists of multiple machines with mainboards, CPUs, and main memory

– Question: where to put secondary storage?

– Usually, three options

Storage area network (SAN)

Direct attached storage (DAS)

Network attached storage (NAS)

…or a storage service! (e.g. GFS & co.)

13.3 IaaS

(69)

SAN Pods

– Individual servers don't have their own secondary storage
– A storage area network provides shared hard-disk storage for all machines of a Pod

– Pro

• All machines have access to the same data

• Allows for dynamic load balancing or migration of appliances

e.g. VMware vMotion

– Con

• Very expensive

• Higher latency than direct attached storage

13.3 IaaS

(70)

SAN Pods

13.3 IaaS

(71)

DAS Pods

– Each server has its own set of hard drives
– Accessing data from other servers may be difficult

– Pro

• Cheap

• Low latency for accessing local data

– Con

• Usually, no shared data access

• Usually, difficult to live-migrate appliances (due to the lack of shared data)

– But: by using clever storage abstractions, these common problems can be circumvented

• Use a distributed file system or a distributed data store!

• e.g. Amazon S3 & SimpleDB, Google GFS & BigTable, Apache HBase & HDFS, etc.

13.3 IaaS

(72)

DAS Pods

13.3 IaaS

(73)

IaaS example: Amazon EC2

Elastic Compute Cloud is one of the core services of the Amazon cloud infrastructure

• Public IaaS Cloud

– Customers may rent virtual servers hosted in Amazon's data centers

• Can freely install OS and applications as needed

– Virtual servers are offered in different sizes and are paid by CPU usage

• Basic storage is offered within the VM, but usually additional storage services, which cost extra, are used by applications

e.g. S3, SimpleDB, or Dynamo DB

13.3 Amazon EC2

(74)

Example: t2.micro

– 1.0 GB memory
– 1 vCPU

• 1 virtual core

• 1 vCPU is roughly one 2.5 GHz Xeon core

– No dedicated storage

• Has to use AWS network storage

– Burstable performance: 6 CPU credits per hour

• 1 CPU credit = 1 minute of full CPU performance

– Costs $0.013 per hour

• ≈ $9.30 per month

– Many users usually start with this small instance; it is also heavily used for testing

13.3 Amazon EC2

(75)

Example: m3.xlarge

– 15 GB memory
– 4 vCPUs

• Total of 13 ECU (Elastic Compute Units)

• 1 ECU is roughly equal to a 1.5 GHz Xeon core

– 80 GB instance storage on SSD

• More storage via AWS

– Costs $0.28 per hour

• ≈ $201 per month

13.3 Amazon EC2

(76)

Example: i2.8xlarge

– 244 GB memory
– 32 vCPUs

• Total of 104 ECU

– 6,400 GB of instance storage on SSD

– Costs $6.82 per hour

• ≈ $4,910 per month (see the cost sketch below)

13.3 Amazon EC2
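For reference, the monthly figures follow roughly from the hourly prices, assuming about 720 hours per month (the slides round slightly differently):

# Rough monthly cost from the hourly prices above (30 days x 24 h = 720 h).
HOURLY_PRICE = {"t2.micro": 0.013, "m3.xlarge": 0.28, "i2.8xlarge": 6.82}

for instance, price in HOURLY_PRICE.items():
    print(f"{instance}: ${price * 720:,.2f} per month")
# t2.micro: $9.36, m3.xlarge: $201.60, i2.8xlarge: $4,910.40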

(77)

Rough Estimations (Oct 2009)

– Roughly 40,000 servers

– Uses standard server racks with 16 machines per rack

• Mostly packed with 2U dual-socket Quad-Core Intel Xeons

Roughly matches the High-Mem Quad XL instance…

• Uses around eight 500 GB RAID-0 disks

• Target cost around $2,500 per machine on average

– 75% of the machines are in the US, the remainder in Europe and Asia

– Amazon aims at a utilization rate of 75%

– Very rough guesses state that Amazon may earn $25,264 per hour with EC2!

http://cloudscaling.com/blog/cloud-computing/amazons-ec2-generating-220m-annually

13.3 Amazon EC2

(78)

Platform as a Service (PaaS)

– Provides software platforms on demand

e.g. runtime engines (Java VM, .NET runtime, etc.), storage systems (distributed file systems or databases), web services, communication services, etc.

– PaaS systems are usually used to develop and host web applications or web services

User applications run on the provided platform

– In contrast to IaaS, no installation and maintenance of the operating system and server applications is necessary

Centrally managed and maintained

Services or runtimes are directly usable

13.3 PaaS

(79)

Google AppEngine provides users with a managed Python or Java runtime

Web applications can be directly hosted in AppEngine

Just upload your WAR file and you are done…

Users are billed by resource usage

Some free resources provided everyday

1 GB in- and out traffic, 6.5 hours CPU, 500 MB storage overall

13.3 Google AppEngine

Resource             Unit         Unit cost
Outgoing bandwidth   GB           $0.12
Incoming bandwidth   GB           $0.10
CPU time             CPU hours    $0.10
Stored data          GB / month   $0.15
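For reference, a bill follows directly from the unit costs above; the usage numbers below are made up, and the free daily quotas are ignored for simplicity:

# Illustrative AppEngine bill from the unit costs above (hypothetical usage).
UNIT_COST = {"out_gb": 0.12, "in_gb": 0.10, "cpu_hours": 0.10, "stored_gb_month": 0.15}
usage     = {"out_gb": 50,   "in_gb": 80,   "cpu_hours": 200,  "stored_gb_month": 10}

total = sum(UNIT_COST[k] * usage[k] for k in UNIT_COST)
print(f"monthly bill: ${total:.2f}")   # 50*0.12 + 80*0.10 + 200*0.10 + 10*0.15 = $35.50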

(80)

• Each application can access system resources up to a fixed maximum

– AppEngine is not fully scalable!

AppEngine max values (2010)

• CPU: 1730 hours CPU per day; 72 minutes CPU per minute

• Data in or out: 1 TB per day; 10 GB per minute

• Request: 43M web service calls per day, 30K calls per minute

• Data storage: no limit (uses BigTable which can scale in size!!)

13.3 Google AppEngine

(81)

• Amazon SimpleDB is a data storage system roughly similar to Google BigTable

– http://aws.amazon.com/simpledb

– Simple table-centric database engine

• SimpleDB is directly ready to use

No user configuration or administration needed

Accessible via web service

• SimpleDB is highly available, uses flexible schemas, and eventual consistency

Similar to HBase or BigTable

13.3 Amazon SimpleDB

(82)

– Any application may use SimpleDB for data storage

A simple web service is provided to interact with SimpleDB

– Create or delete a table (called a domain)
– Put and delete rows
– Query for rows

Users pay for storage, data transfer, and computation time

– 25 hours of computation time (for querying) are free per month

• Beyond that: $0.154 per machine hour in 2009; $0.140 per machine hour in 2014

– 1 GB of data transfer is free per month

• Beyond that: $0.15 per GB in 2009; $0.12 per GB in 2014

– 1 GB of data storage is free per month

• Beyond that: $0.28 per GB in 2009; $0.25 per GB in 2014

13.3 Amazon SimpleDB

(83)

Software as a Service (SaaS)

Full applications are offered on-demand

• Users just need to consume the software; no installation or maintenance is necessary

– All administrative and maintenance tasks are performed by the Cloud provider

• e.g. hosting physical hardware, maintaining platforms, maintaining software, dealing with security, scalability, etc.

13.3 SaaS

(84)

• Salesforce.com On-Demand CRM software

– Customer-Relationship-Management

Cooperation with Google Apps in early summer

– Provides simple online services for

Customer database

Lead management

Call center

Customer portal

Knowledge Bases

Email

Collaboration environments

Etc.

13.3 SalesForce

(85)

13.3 SalesForce

(86)

13.3 SalesForce

(87)

• Billed per month and per user, based on the edition

13.3 SalesForce

(88)

• Data Warehousing and Data Mining Techniques

Next Semester
