(1)

Wolf-Tilo Balke, Sascha Tönnies

Institut für Informationssysteme

Technische Universität Braunschweig

Peer-to-Peer

Data Management

(2)

1. Reliability in Distributed Hash Tables

2. Storage Load Balancing in Distributed Hash Tables

   1. Power of Two Choices
   2. Virtual Servers

3. Content Distribution

   1. Swarming
   2. BitTorrent

11. Content Distribution

(3)

11.1 “Stabilize” Function

The stabilize function corrects inconsistent connections.

Remember: it is run periodically by each node n:

n asks its successor for its predecessor p

n checks whether p equals n

n also periodically refreshes a random finger x by (re)locating its successor

The successor list is used to find a new successor:

If the successor is not reachable, use the next node in the successor list and start the stabilize function (a sketch follows below).
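A minimal sketch of this loop in Python, under a simplified in-process model: nodes are plain objects, the `alive` flag and the `notify` call stand in for real remote calls, and `between` performs the ring-interval test.

```python
class ChordNode:
    RING = 2 ** 22  # identifier space, m = 22 bits as in the simulation slide

    def __init__(self, node_id):
        self.id = node_id
        self.alive = True
        self.predecessor = None
        self.successor_list = []  # first entry is the current successor

    def successor(self):
        return self.successor_list[0]

    def notify(self, n):
        # Successor side: adopt n as predecessor if n is closer than the old one.
        if self.predecessor is None or between(n.id, self.predecessor.id, self.id):
            self.predecessor = n

    def stabilize(self):
        # If the successor is not reachable, use the next node in the list.
        while len(self.successor_list) > 1 and not self.successor().alive:
            self.successor_list.pop(0)
        succ = self.successor()
        p = succ.predecessor  # n asks its successor for its predecessor p
        if p is not None and p is not self and between(p.id, self.id, succ.id):
            self.successor_list.insert(0, p)  # p joined between n and succ
        succ.notify(self)


def between(x, a, b, ring=ChordNode.RING):
    """True if identifier x lies on the ring strictly between a and b."""
    return 0 < (x - a) % ring < (b - a) % ring
```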

(4)

11.1 Reliability of Data in Chord

Original Chord:

No reliability of data; the reliability of data is an application task

Recommendation:

Use of the successor list: replicate inserted data to the next f other nodes

Chord informs the application of arriving or failing nodes (sketched below)
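A hedged sketch of this recommendation, assuming a hypothetical application layer on top of Chord (`AppNode`, `put_local`, and the `succ` list are illustrative names, not Chord API):

```python
class AppNode:
    """Hypothetical application layer above a Chord node (sketch only)."""

    def __init__(self, node_id):
        self.id = node_id
        self.store = {}      # local key -> value storage
        self.succ = []       # successor list, maintained by Chord below us

    def put_local(self, key, value):
        self.store[key] = value


def put_replicated(node, key, value, f=3):
    """Store the item on the responsible node and its next f successors."""
    node.put_local(key, value)
    for replica in node.succ[:f]:
        replica.put_local(key, value)


def on_node_change(new_responsible, key, value, f=3):
    """Callback invoked when Chord reports an arriving or failing node:
    re-replicate so that f copies exist again."""
    put_replicated(new_responsible, key, value, f)
```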

(5)

11.1 Properties

• Advantages

After the failure of a node, its successor already stores the data

• Disadvantages

Each node stores f intervals, i.e. more data load

After the breakdown of a node: find the new successor, then replicate data to the next node

More message overhead at breakdown

(6)

11.1 Multiple Nodes in One Interval

A fixed positive number f indicates how many nodes at least have to act within one interval.

Procedure:

The first node takes a random position

A new node is assigned to an arbitrary existing node

The node is announced to all other nodes in the same interval

[Figure: identifier ring with intervals 1-10 and several nodes per interval]

(7)

11.1 Multiple Nodes in One Interval

• Effects of the algorithm

Reliability of data

Better load balancing

Higher security

[Figure: ring with several nodes per interval]

(8)

11.1 Reliability of Data

• Insertion

A copy of documents is always necessary for replication

Little additional expense: nodes only have to store pointers to the nodes from the same interval

Nodes store only data of one interval

(9)

11.1 Reliability of Data

• Reliability

On failure, no copy of data is needed: the data is already stored within the same interval

Use the stabilization procedure to correct fingers, as in original Chord

[Figure: ring with several nodes per interval]

(10)

11.1 Properties

• Advantages

On failure, no copy of data is needed

Intervals are rebuilt with neighbors only if critical

Requests can be answered by f different nodes

• Disadvantages

Fewer intervals than in original Chord

Solution: Virtual Servers

(11)

11.1 Fault Tolerance

• Replication

Each data item is replicated K times

The K replicas are stored on different nodes

• Redundancy

Each data item is split into M fragments, and K redundant fragments are computed

Use of an "erasure code" (see e.g. V. Pless: Introduction to the Theory of Error-Correcting Codes. Wiley-Interscience, 1998)

Any M fragments allow reconstructing the original data

For each fragment we compute its key (a toy example follows below)
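A toy illustration in Python, assuming the simplest possible erasure code: M data fragments plus a single XOR parity fragment (so K = 1 here; real deployments use Reed-Solomon-style codes to tolerate K lost fragments):

```python
import hashlib


def split_with_parity(data: bytes, m: int):
    """Toy (m+1, m) erasure code: m data fragments plus one XOR parity
    fragment; any m of the m+1 fragments rebuild the original data."""
    frag_len = -(-len(data) // m)  # ceiling division
    frags = [bytearray(data[i * frag_len:(i + 1) * frag_len].ljust(frag_len, b"\0"))
             for i in range(m)]
    parity = bytearray(frag_len)
    for frag in frags:
        for i, b in enumerate(frag):
            parity[i] ^= b         # bytewise XOR over all data fragments
    frags.append(parity)
    # For each fragment we compute its key (here via SHA-1, as in Chord).
    return [(hashlib.sha1(bytes(f)).hexdigest(), bytes(f)) for f in frags]


def recover_missing(remaining_frags, frag_len):
    """Rebuild the single missing fragment by XORing the m remaining ones."""
    missing = bytearray(frag_len)
    for frag in remaining_frags:
        for i, b in enumerate(frag):
            missing[i] ^= b
    return bytes(missing)
```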

(12)

11.2 Storage Load Balancing in DHT

A suitable hash function is easy to compute and causes few collisions.

Standard assumption 1: uniform key distribution

Every node carries an equal load, so no load balancing is needed

Standard assumption 2: equal distribution

Nodes are spread equally across the address space, and data equally across nodes

But is this assumption justifiable?

Analysis of the distribution of data using simulation

(13)

11.2 Storage Load Balancing in DHT

• Analysis of the distribution of data

• Example

Parameters: 4,096 nodes, 500,000 documents

Optimum: ~122 documents per node (500,000 / 4,096 ≈ 122), i.e. an optimal distribution of documents across nodes

Result: no optimal distribution in Chord without load balancing

(14)

11.2 Storage Load Balancing in DHT

Number of nodes without storing any document

Parameters: 4,096 nodes, 100,000 to 1,000,000 documents

Some nodes without any load

Why is the load unbalanced?

We need load balancing to keep the complexity of DHT management low

(15)

11.2 Definitions

• Definitions

Consider a system with N nodes.

The load is optimally balanced if the load of each node is around 1/N of the total load.

A node is overloaded (heavy) if it has a significantly higher load than under the optimal distribution.

Otherwise the node is light (a classification sketch follows below).
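A small Python sketch of this classification; the threshold factor is an assumption of the sketch, since the slides only say "significantly higher":

```python
def classify(loads, threshold=2.0):
    """Label nodes heavy or light. `loads` maps node -> load; a node counts
    as heavy above `threshold` times the optimal 1/N share (the threshold
    value is an assumption of this sketch)."""
    optimal = sum(loads.values()) / len(loads)
    return {node: "heavy" if load > threshold * optimal else "light"
            for node, load in loads.items()}


# Example: classify({"n1": 500, "n2": 90, "n3": 100}) labels n1 heavy.
```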

(16)

11.2 Load Balancing Algorithms

Problem: significant differences in the load of nodes

Several techniques have been developed to ensure an equal data distribution:

Power of Two Choices (Byers et al., 2003)

Virtual Servers (Rao et al., 2003)

Thermal-Dissipation-based Approach (Rieche et al., 2004)

A Simple Address-Space and Item Balancing (Karger et al., 2004)

(17)

11.2 Overview

Algorithms:

Power of Two Choices (Byers et al., 2003)

Virtual Servers (Rao et al., 2003)

John Byers, Jeffrey Considine, and Michael Mitzenmacher: "Simple Load Balancing for Distributed Hash Tables" in Second International Workshop on Peer-to-Peer Systems (IPTPS), Berkeley, CA, USA, 2003.

(18)

11.2 Power of Two Choices

• Idea

One hash function for all nodes: h0

Multiple hash functions for data: h1, h2, h3, …, hd

• Two options

Data is stored at one node only

Data is stored at one node & the other candidate nodes store a pointer

(19)

11.2 Power of Two Choices

Inserting data:

The results of all hash functions are calculated: h1(x), h2(x), h3(x), …, hd(x)

The data is stored on the retrieved node with the lowest load

Alternative: the other candidate nodes store a pointer

The owner of the item has to re-insert the document periodically, to prevent removal of the data after a timeout (soft state); a sketch follows below
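A minimal Python sketch of the insertion path, using an in-memory toy DHT; the `ToyDHT` class and the salted SHA-1 hash family are illustrative assumptions, not part of any concrete system:

```python
import hashlib


class ToyDHT:
    """In-memory stand-in for a Chord-style DHT (illustration only)."""

    def __init__(self, node_ids, m=22):
        self.m, self.ids = m, sorted(node_ids)
        self.stores = {nid: {} for nid in self.ids}

    def lookup(self, key_id):
        # successor(key_id): first node id >= key_id, wrapping around
        return next((nid for nid in self.ids if nid >= key_id), self.ids[0])


def h(i, key, m=22):
    """Hash family h_1 .. h_d, derived from SHA-1 salted with the index i."""
    digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** m)


def insert(dht, key, value, d=4):
    """Power of Two Choices: store the item on the least-loaded candidate."""
    candidates = {dht.lookup(h(i, key, dht.m)) for i in range(1, d + 1)}
    target = min(candidates, key=lambda nid: len(dht.stores[nid]))
    dht.stores[target][key] = value
    return target
```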

(20)

11.2 Power of Two Choices

• Retrieving

Without pointers:

The results of all hash functions are calculated

Request all of the possible nodes in parallel; one node will answer

With pointers:

Request only one of the possible nodes; that node can forward the request directly to the final node

(21)

11.2 Power of Two Choices

• Advantages

Simple

• Disadvantages

Message overhead when inserting data

With pointers: the additional administration of pointers leads to even more load

Without pointers: all candidate nodes must be requested in parallel when retrieving

(22)

11.2 Overview

Algorithms:

Power of Two Choices (Byers et al., 2003)

Virtual Servers (Rao et al., 2003)

Ananth Rao, Karthik Lakshminarayanan, Sonesh Surana, Richard Karp, and Ion Stoica: "Load Balancing in Structured P2P Systems" in Second International Workshop on Peer-to-Peer Systems (IPTPS), Berkeley, CA, USA, 2003.

(23)

11.2 Virtual Server

• Each node is responsible for several intervals ("virtual servers")

• Example: Chord

[Figure: Chord ring with several intervals assigned to one node]

(24)

11.2 Rules

• Rules for transferring a virtual server from a heavy node to a light node (sketched below)

1. The transfer of a virtual server must not make the receiving node heavy

2. The virtual server chosen is the lightest one that makes the heavy node light

3. If no virtual server's transfer can make the node light, the heaviest virtual server of the node is transferred
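These three rules translated directly into a Python sketch; the `capacity` threshold separating heavy from light nodes is an assumption of the sketch:

```python
def pick_virtual_server(heavy_node_vs, heavy_load, light_load, capacity):
    """Select which virtual server to move. `heavy_node_vs` maps each
    virtual server of the heavy node to its load; a node counts as light
    while its load stays below `capacity`."""
    # Rule 1: only consider servers the light node can absorb without
    # becoming heavy itself.
    movable = {vs: load for vs, load in heavy_node_vs.items()
               if light_load + load < capacity}
    if not movable:
        return None
    # Rule 2: prefer the lightest virtual server making the heavy node light.
    sufficient = {vs: load for vs, load in movable.items()
                  if heavy_load - load < capacity}
    if sufficient:
        return min(sufficient, key=sufficient.get)
    # Rule 3: otherwise transfer the heaviest movable virtual server.
    return max(movable, key=movable.get)
```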

(25)

11.2 Virtual Server

• Each node is responsible for several intervals: log(n) virtual servers

• Load balancing

Different possibilities to exchange virtual servers:

One-to-one

One-to-many

Many-to-many

Moving an interval works like removing and re-inserting a node into the Chord ring

(26)

11.2 Scheme 1: One-to-One

• One-to-One

A light node picks a random ID

It contacts the node x responsible for that ID

It accepts load if x is heavy (sketched below)

[Figure: ring of light (L) and heavy (H) nodes]
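One probing round of this scheme as a Python sketch; for brevity, load is treated as a divisible quantity here, whereas the paper transfers whole virtual servers according to the rules above:

```python
import random


def one_to_one_round(light, nodes, loads, capacity, ring=2 ** 22):
    """One round of Scheme 1. `nodes` is a sorted list of node ids and
    `loads` maps node id -> load (both assumptions of this sketch)."""
    probe = random.randrange(ring)                        # pick a random ID
    x = next((n for n in nodes if n >= probe), nodes[0])  # responsible node
    if loads[x] > capacity:                               # x is heavy
        moved = min(loads[x] - capacity, capacity - loads[light])
        loads[x] -= moved                                 # accept excess load
        loads[light] += moved
```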

(27)

11.2 Scheme 2: One-to-Many

One-to-Many:

Light nodes report their load information to directories

A heavy node H gets this information by contacting a directory

H then contacts the light node which can accept its excess load

[Figure: heavy nodes H1, H3 and light nodes L1-L3, L5 communicating via directory D1]

(28)

11.2 Scheme 3: Many-to-Many

Many-to-Many:

Many heavy and light nodes rendezvous at each step

Directories periodically compute the transfer schedule and report it back to the nodes, which then do the actual transfers

[Figure: heavy nodes H1-H3 and light nodes L1-L5 matched via directories D1, D2]

(29)

11.2 Virtual Server

• Advantages

Easy shifting of load: whole virtual servers are shifted

• Disadvantages

Increased administrative and message overhead

Maintenance of all finger tables

Much load is shifted

[Rao et al., 2003]

(30)

11.2 Simulation

Scenario:

4,096 nodes (for comparison with other measurements)

100,000 to 1,000,000 documents

Chord with m = 22 bits; consequently an identifier space of 2^22 = 4,194,304 possible nodes and documents

Hash functions: SHA-1 (mod 2^m) and random

Analysis

(31)

11.2 Results

Without load balancing:

+ Simple, original Chord

− Bad load balancing

Power of Two Choices:

+ Simple

+ Lower load

− Still nodes without load

Virtual Servers:

+ No nodes without load

− Higher maximum load than Power of Two Choices

(32)

11.3 Content Distribution

Sometimes large amounts of data have to be distributed over networks: software updates, video on demand, etc.

Early approaches (Napster/Gnutella/FastTrack):

Download the whole file from one peer

If the download fails: repeat the search, resume the download from an alternative source

Issues:

No load distribution

Poor performance due to asymmetric uplink/downlink bandwidth (ADSL)

Low reliability (except for small files)

(33)

11.3 Swarming Approach

• Idea: chunks

Split large files into small chunks

Identify/protect chunks via hash values (sketched below)

• Parallelization

Download different chunks from different sources

Utilize the upload capacity of multiple sources

[Figure: file split into chunks with hashes 0x9A3C, 0x7C23, 0x194F, 0xDE6A, fetched from several sources]
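A small Python sketch of the chunking idea; the 256 KB chunk size anticipates the typical BitTorrent piece length mentioned later:

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # 256 KB, a typical piece size


def make_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Split a file into chunks and protect each with its SHA-1 hash,
    so a downloader can verify every chunk independently of its source."""
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    return [(hashlib.sha1(c).hexdigest(), c) for c in chunks]


def verify(chunk: bytes, expected_hash: str) -> bool:
    """A chunk fetched from any peer is accepted only if its hash matches."""
    return hashlib.sha1(chunk).hexdigest() == expected_hash
```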

(34)

11.3 Swarming Properties

Advantages:

Peer failures cause no loss of files, only of chunks

Increased throughput

Strategies:

Chunk selection: avoid scarcity; best overall availability?

Fairness: free-riding, bandwidth allocation

Systems:

BitTorrent

Microsoft Avalanche

(35)

11.3 BitTorrent Overview

• Bittorrent or BitTorrent ("torrent" = big stream)

Author: Bram Cohen, 2003

Only for file distribution, no search features

• Designed for

Content providers

Flash crowds

• Central components

Web server for search

(36)

11.3 BitTorrent

• Definitions

• Definitions

Peers

Torrent

Contains metadata about the files

Contains the address of a tracker (specifying backup trackers is possible)

Swarm

All peers sharing a torrent are called a swarm

Tracker

Keeps track of which peers are in a swarm

Coordinates the communication between the peers

(37)

11.3 BitTorrent – Joining a Torrent

Peers are divided into:

seeds: have the entire file

leechers: still downloading

A new leecher joins in four steps:

1. Obtain the torrent (from a website)

2. Contact the tracker

3. Obtain a peer list (contains seeds & leechers)

4. Request data from the peers on that list

[Figure: new leecher contacting the website, the tracker, and a seed/leecher]

(38)

11.3 BitTorrent – Exchanging Data

Download sub-pieces in parallel

Verify pieces using hashes

Advertise received pieces to the entire peer list ("I have …")

Look for the rarest pieces

[Figure: leechers A, B, C and a seed exchanging pieces]

(39)

11.3 Torrent

• A torrent file

Passive component, typically hosted on a web server

Files are typically fragmented into 256 KB pieces

Metadata file structure describes the files in the torrent:

URL of the tracker

File name

File length

Piece length

SHA-1 hashes of the pieces, which allow peers to verify integrity (see the sketch below)
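A sketch of building such a metainfo file in Python, with a minimal bencoder; the file content, file name, and tracker URL are made-up examples:

```python
import hashlib


def bencode(obj) -> bytes:
    """Minimal bencoder covering the types used in metainfo files."""
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, str):
        return bencode(obj.encode())
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(x) for x in obj) + b"e"
    if isinstance(obj, dict):  # keys must be sorted byte strings
        items = sorted((k.encode() if isinstance(k, str) else k, v)
                       for k, v in obj.items())
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(type(obj))


piece_length = 256 * 1024
data = b"hello swarming world" * 100_000     # stand-in for real file content
pieces = b"".join(hashlib.sha1(data[i:i + piece_length]).digest()
                  for i in range(0, len(data), piece_length))

metainfo = {
    "announce": "http://tracker.example.org/announce",  # URL of the tracker
    "info": {
        "name": "example.iso",                          # file name
        "length": len(data),                            # file length
        "piece length": piece_length,
        "pieces": pieces,                               # concatenated SHA-1 hashes
    },
}
open("example.torrent", "wb").write(bencode(metainfo))
```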

(40)

11.3 Tracker

Peer cache:

IP, port, peer id

State information:

completed / downloading

Clients report their status periodically to the tracker

The tracker returns a random list:

50 random leechers/seeds

The client first contacts 20-40 of them

(41)

11.3 Tracker

(42)

11.3 Tracker-less approaches

• Tracker issues

Single point of failure

Scalability: The Pirate Bay tracker was nearly overloaded (>5 million peers)

• Decentralized tracker

Replace the tracker with a DHT (Kademlia)

Does not tackle distributed search

Currently not widely used

(43)

11.3 Chunk Selection

Which chunk next? (a selection sketch follows the list)

1. Strict priority

Finish active chunks first

2. Rarest first

Improves the availability of rare chunks, delays the download of common chunks

3. Random first chunk

Get the first chunk quickly (the rarest chunk is probably slow to get)

4. Endgame mode

Send requests for the last sub-chunks to all known peers
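A Python sketch combining the first three policies (endgame mode is omitted); the data structures are assumptions of the sketch:

```python
import random
from collections import Counter


def pick_next_chunk(have, active, peers_have):
    """Chunk selection sketch. `have`: finished chunk ids; `active`:
    chunks currently being downloaded; `peers_have`: per-peer sets of
    chunk ids, used to estimate rarity."""
    if active:                                  # 1. strict priority
        return next(iter(active))
    counts = Counter()
    for chunks in peers_have.values():
        counts.update(chunks)
    candidates = [c for c in counts if c not in have]
    if not candidates:
        return None
    if not have:                                # 3. random first chunk
        return random.choice(candidates)
    return min(candidates, key=counts.get)      # 2. rarest first
```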

(44)

11.3 Game Theory

Basic ideas of game theory:

Studies situations where players choose different actions in an attempt to maximize their returns

Studies the ways in which strategic interactions among rational players produce outcomes with respect to the players' preferences; the outcomes might not have been intended by any of them

Game theory offers a general theory of strategic behavior, described in mathematical form

Plays an important role in modern economics, decision theory, and multi-agent systems

(45)

11.3 Game Theory

• Developed to explain the optimal strategy in two-person interactions

von Neumann and Morgenstern: initially zero-sum games

John Nash: works in game theory and differential geometry; nonzero-sum games; the Nash equilibrium; 1994 Nobel Prize in Economics (shared with Harsanyi and Selten)

(46)

11.3 Definitions

Games

Situations are treated as games.

Rules

The rules of the game state who can do what, and when they can do it.

Player's strategies

A plan for actions in each possible situation in the game.

Player's payoffs

The amount that the player wins or loses in a particular situation.

Dominant strategy

A player has a dominant strategy if their best strategy doesn't depend on what the other players do.

(47)

11.3 Prisoner's Dilemma

• Famous example of game theory

• A and B are arrested by the police

They are questioned in separate cells, unable to communicate with each other

They know how it works:

If they both resist interrogation and proclaim their mutual innocence, they get off with a three-year sentence for robbery.

If one of them confesses to the entire string of robberies and the other does not, the confessor is rewarded with a light one-year sentence while the other gets a severe eight-year sentence.

If both confess, each gets a four-year sentence.

(48)

11.3 Prisoner's Dilemma

Payoff matrix (years in prison):

                       B confesses                     B does not confess
A confesses            4 years each                    1 year for A, 8 years for B
A does not confess     8 years for A, 1 year for B     3 years each

(49)

11.3 A’s Decision Tree

There are two cases to consider:

If B confesses: confessing gives A 4 years in prison, not confessing gives 8 years, so confessing is A's best strategy.

If B does not confess: confessing gives A 1 year in prison, not confessing gives 3 years, so confessing is again A's best strategy.

The dominant strategy for A is to confess
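The same reasoning in a few lines of Python, checking dominance directly against the payoff matrix above:

```python
ACTIONS = ("confess", "not confess")

# Years in prison for A, indexed by (A's action, B's action); lower is better.
PAYOFF_A = {
    ("confess", "confess"): 4,     ("confess", "not confess"): 1,
    ("not confess", "confess"): 8, ("not confess", "not confess"): 3,
}


def best_reply(b_action):
    """A's best action for a fixed action of B (fewest years in prison)."""
    return min(ACTIONS, key=lambda a: PAYOFF_A[(a, b_action)])


# 'confess' is best against both of B's actions, hence dominant for A.
assert all(best_reply(b) == "confess" for b in ACTIONS)
```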

(50)

11.3 Repeated Games

• A repeated game

A game that the same players play more than once

Differs from one-shot games because a player's current actions can depend on the past behavior of the other players

Cooperation is encouraged

• Book recommendation

"Thinking Strategically" by A. Dixit and B. Nalebuff

(51)

11.3 Tit for Tat

Tit for tat:

A highly effective strategy: an agent using it will initially cooperate, and then respond in kind to the opponent's previous action

If the opponent previously was cooperative, the agent is cooperative; if not, the agent is not (sketched below)

Its success depends on four conditions:

Unless provoked, the agent will always cooperate

If provoked, the agent will retaliate

The agent is quick to forgive

The agent must have a good chance of competing against the opponent more than once
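The strategy itself fits in a few lines of Python:

```python
def tit_for_tat(opponent_history):
    """Cooperate on the first move, then mirror the opponent's previous
    move; retaliation and forgiveness each take exactly one round."""
    if not opponent_history:        # unless provoked: cooperate
        return "cooperate"
    return opponent_history[-1]     # repeat the opponent's last action
```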

(52)

11.3 Choking

Choking:

A temporary refusal to upload; downloading occurs as normal

The connection is kept open, so there are no setup costs (TCP congestion control)

Choking mechanism:

Ensures that nodes cooperate and eliminates the free-rider problem

Cooperation means uploading sub-pieces that you have to your peers

Based on game-theoretic concepts: the tit-for-tat strategy in repeated games

(53)

11.3 Unchoking

Periodically calculate the data-receiving rates

Upload to (unchoke) the fastest downloaders

Optimistic unchoking:

Each BitTorrent peer has a single "optimistic unchoke", which is uploaded to regardless of the current download rate from it

[Figure: seed and leechers A-D with choked and unchoked connections]

(54)

11.3 Choking Details

BitTorrent details (sketched below):

A peer always unchokes a fixed number of its peers (default of 4)

The choking decision is based on the current download rates, evaluated on a rolling 20-second average

The choking evaluation is performed every 10 seconds, which prevents wasting resources by rapidly choking and unchoking peers
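These defaults as a Python sketch; the rate bookkeeping and the 10-second timer are assumed to happen elsewhere:

```python
import random

UNCHOKE_SLOTS = 4  # fixed number of regular unchokes (BitTorrent default)


def choking_round(peers, rates, optimistic):
    """Run every 10 seconds. `rates` holds each peer's download rate to us,
    averaged over the last 20 seconds; `optimistic` is the current
    optimistic-unchoke peer, rotated separately."""
    fastest = sorted(peers, key=lambda p: rates.get(p, 0.0), reverse=True)
    unchoked = set(fastest[:UNCHOKE_SLOTS])
    unchoked.add(optimistic)  # uploaded to regardless of its rate
    return unchoked


def pick_optimistic(peers, unchoked):
    """Rotate the optimistic unchoke among currently choked peers."""
    choked = [p for p in peers if p not in unchoked]
    return random.choice(choked) if choked else None
```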

(55)

11.3 Anti-Snubbing

Choking policy:

When over a minute has gone by without receiving a single sub-piece from a particular peer, do not upload to it except as an optimistic unchoke

Problem:

A peer might find itself being simultaneously choked by all the peers it was just downloading from; the download will then lag until an optimistic unchoke finds better peers

Solution:

A snubbed peer uses more than one optimistic unchoke, to find better peers faster

(56)

11.3 Choking for Seeds

Open issue: upload-only choking

Once the download is complete, a peer has no download rates to use for comparison, nor any need to use them. The question is: which nodes to upload to?

Policy: upload to those with the best upload rate.

Advantages:

Ensures that pieces get replicated faster

Peers that have good upload rates are probably not being served by others

(57)

11.3 BitTorrent Summary

An optimized file transfer system

No file search, no fancy GUI, etc.

Very effective:

High throughput & scalability

Nearly perfect utilization of bandwidth

Fairness and load distribution not optimal, but good enough

Commercially successful:

Distribution of the Red Hat Linux distribution

The BBC evaluates the distribution of TV content (not in real time)

Weakness: the tracker is centralized

(58)

11.3 Swarming Summary

Solves the problem of efficient file distribution:

Scalable

Handles flash crowds

Areas for optimization:

Incentive models

Tracker-less approaches

Further endgame improvements

Next step: content streaming

Real-time constraints

Chunk order
