
Distributed Data Management


Academic year: 2021


Prof. Dr. Wolf-Tilo Balke

Institut für Informationssysteme

Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de

Distributed Data Management

7.0 Introduction

7.1 Graph Model Basics

7.2 Random Graph Models

7.3 Small-World Graph Models

7.4 Scale-Free Graph Models

7.5 Network Examples

7.6 Network Models in P2P

Distributed Data Management – Profr. Dr. Wolf-Tilo Balke – IfIS – TU Braunschweig 2

7.0 Network Models

• Basic motivation for this lecture

– Can we show that a given P2P network really has some desired properties?

– How can a P2P network be designed so that it will, with high probability, show those desired properties?

• Large P2P networks are hard to evaluate

– In the productive phase, usually no global view of the network is available

– In the design phase, no large number of peers is available

Desirable system properties for P2P

Decentralized and self-organized network

• No single point of failure or central bottleneck

• Maintaining the network (joining, leaving, publishing new content) should be performed without any central authority or global view

Scalability

• The network should scale to any (possibly large) number of nodes

The structure of the network supports searching and retrieving information efficiently

• Obvious demand in information exchange systems

Reliability despite dynamic changes

• The network should be robust with respect to network and node failures

Book: P2P Systems and applications, pp. 57-77

• To examine the properties of a P2P network, good models are needed

In this lecture, we focus on graph models for unstructured P2P networks

Allows easy statistical analysis of network properties

Peers are represented by vertices in a graph

Entries in routing tables are represented by edges of the graph

Peers are ego-centered and do not have global knowledge about all other peers and the data stored at those peers

More complex P2P network protocols require dynamic simulation of networks to evaluate properties

Outline for this lecture:

Network graph basics

– How can P2P networks be represented as graphs?

– Which properties can network graphs have?

– What are desirable properties for a P2P graph?

Network models

– Many different network models have been studied during the last 60 years

• Some of them are useful to evaluate or design P2P networks


Random Networks

• Simple network model to represent pure P2P networks like Gnutella

Small-World networks

• Naturally occurring networks showing very desirable properties which can be exploited by P2P systems

Scale-Free networks

• “Naturally” occurring networks in large infrastructures, e.g. the Internet or power grids

A directed graph $G$ is defined as $G = (V, E)$

– $V$: a set of nodes or vertices

– $E$: a set of directed edges between elements of $V$

• $E \subseteq V \times V$

For P2P networks, $V$ represents the set of peers

– $|V| = n$

• $E$ represents all directed links in the P2P overlay network

– i.e. the union of the entries in the routing tables of all peers

– If later examples use undirected links, it is assumed that directed links in both directions exist

– $|E| = m$

7.1 Graph Theory

The outdegree of a node $v$ is denoted $\deg^+(v)$

i.e., the number of vertices $w$ it is connected to by an edge $(v, w)$

$\deg^+(v) = |N(v)| = |\{ w \in V \mid (v, w) \in E \}|$

The indegree of a node $v$ is denoted $\deg^-(v)$

i.e., the number of vertices $w$ that are connected to $v$ by an edge $(w, v)$

$\deg^-(v) = |\{ w \in V \mid (w, v) \in E \}|$

The degree of a node $v$ is denoted $\deg(v)$

$\deg(v) = \deg^+(v) + \deg^-(v)$

For undirected graphs, only the node degree is defined; there is no in- or outdegree

The neighbor set of a node $v$ is denoted $N(v)$

$N(v) = \{ w \in V \mid (v, w) \in E \}$

For every neighbor $w \in N(v)$ there exists an edge $(v, w) \in E$

Example:

– $V = \{1, 2, 3, 4, 5\}$

– $E = \{(1,5), (5,4), (4,5), (2,4), (2,1), (2,3)\}$

– $N(2) = \{1, 3, 4\}$

– $N(4) = \{5\}$

– $\deg^+(2) = 3$, $\deg^-(2) = 0$

– $\deg^+(4) = 1$, $\deg^-(4) = 2$

[Figure: drawing of the example graph with nodes 1 to 5]
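The degree and neighbor definitions can be checked directly on the example edge set. The following is a minimal sketch; the function names `neighbors`, `out_degree`, and `in_degree` are chosen here for illustration and do not come from the lecture.

```python
# Example graph: directed edges as (source, target) pairs.
V = {1, 2, 3, 4, 5}
E = {(1, 5), (5, 4), (4, 5), (2, 4), (2, 1), (2, 3)}


def neighbors(v, edges):
    """N(v): all w such that (v, w) is an edge."""
    return {w for (u, w) in edges if u == v}


def out_degree(v, edges):
    """deg+(v) = |N(v)|."""
    return len(neighbors(v, edges))


def in_degree(v, edges):
    """deg-(v): number of edges (w, v) pointing at v."""
    return sum(1 for (_, w) in edges if w == v)


def degree(v, edges):
    """deg(v) = deg+(v) + deg-(v)."""
    return out_degree(v, edges) + in_degree(v, edges)
```

Evaluating `neighbors(2, E)` and `in_degree(4, E)` gives the same values as listed in the example.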

Path $P(v, w)$

A path $P(v, w)$ is a sequence of vertices $(v_0, v_1, \dots, v_k)$ with $v_0 = v$, $v_k = w$, and $(v_i, v_{i+1}) \in E$ for all $0 \le i \le k - 1$

The path length $|P(v, w)|$ is defined as the number of edges in the path $P$

The distance $d(v, w)$ is defined as the length of a shortest path between $v$ and $w$

[Figure: two paths between nodes V and W: a shortest path of length 4 and another path of length 6; thus, the distance between V and W is 4]
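For unweighted graphs, the distance $d(v, w)$ can be computed with a breadth-first search. The sketch below assumes the graph is given as an adjacency dictionary (a representation chosen here, not prescribed by the lecture) and reuses the five-node example graph.

```python
from collections import deque


def distance(adj, v, w):
    """Length of a shortest path from v to w, or None if w is unreachable."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if u == w:
            return dist[u]
        for x in adj.get(u, ()):
            if x not in dist:  # first visit is along a shortest path
                dist[x] = dist[u] + 1
                queue.append(x)
    return None


# Adjacency lists of the directed example graph with nodes 1 to 5.
adj = {1: [5], 2: [4, 1, 3], 3: [], 4: [5], 5: [4]}
```

Because the example graph is directed, `distance(adj, 2, 5)` is defined while `distance(adj, 3, 1)` returns `None`.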

• Metrics describing whole graphs:

Connectedness

– A graph is connected if there is a path from any node to any other node

k-Connectedness

– A graph is k-connected if the removal of any $k - 1$ nodes still leaves the graph connected

Bisection width $bsw(G)$

– The bisection width of a graph $G$ is the minimal number of edges which must be removed to split the graph into two equally-sized unconnected subgraphs

• Represents the minimal cohesion of the graph

Graph diameter $d(G)$

– Represents the maximum extent (path length) of a graph

– The diameter of a graph is the maximal distance between any pair of vertices

• $d(G) = \max \{ d(v, w) : v, w \in V \}$

Average path length $d_{avg}(G)$

– The sum of all distances between each pair of nodes divided by the number of all pairs of nodes in a connected graph

• $d_{avg}(G) = \frac{\sum_{(i,j) \in V \times V} d(i, j)}{n (n - 1)}$
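Both whole-graph metrics follow from running one breadth-first search per node. This is a small sketch under the assumption that the graph is connected and stored as an adjacency dictionary; the helper names are illustrative.

```python
from collections import deque


def bfs_distances(adj, source):
    """Distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for x in adj[u]:
            if x not in dist:
                dist[x] = dist[u] + 1
                queue.append(x)
    return dist


def diameter_and_avg_path_length(adj):
    """Return (d(G), d_avg(G)) for a connected graph with n >= 2 nodes."""
    n = len(adj)
    total, longest = 0, 0
    for v in adj:
        dist = bfs_distances(adj, v)
        if len(dist) != n:
            raise ValueError("graph is not connected")
        total += sum(dist.values())  # d(v, v) = 0 contributes nothing
        longest = max(longest, max(dist.values()))
    return longest, total / (n * (n - 1))


# Undirected ring with 6 nodes (each link stored in both directions).
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
```

For the 6-node ring, each node sees distances 1, 1, 2, 2, 3, so the diameter is 3 and the average path length is 9/5.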

Graph outdegree $\deg^+(G)$

– The average outdegree of all nodes of $G$

Graph indegree $\deg^-(G)$

– The average indegree of all nodes of $G$

• For undirected graphs, there is just the degree $\deg(G)$

– The average degree of all nodes

The clustering coefficient $C(v)$ of a vertex $v$ in a directed graph is given by

the number of links between the vertices within its neighborhood divided by the number of links that could possibly exist between them

• The number of neighbors of $v$ is $\deg^+(v)$

• The maximum number of connections between all neighboring nodes is $\deg^+(v) (\deg^+(v) - 1)$

i.e. each neighbor is connected with each other neighbor

Describes how densely the neighbors of a vertex are interconnected

– If $e(N(v))$ denotes the actual number of connections that the neighbors of $v$ have with each other, the clustering coefficient is

$C(v) = \frac{e(N(v))}{\deg^+(v) (\deg^+(v) - 1)}$

Wasserman, S., and Faust, K. (1994). Social Network Analysis: Methods and Applications. Cambridge: Cambridge University Press.

[Figure: three example neighborhoods of a vertex V]

– $C(v) = \frac{4 \cdot 2}{4 \cdot 3} \approx 0.67$: the numerator counts the links between neighbors of V, the denominator is the maximum number of neighbor links (4 neighbors having at most 3 links each)

– $C(v) = \frac{3 \cdot 2}{3 \cdot 2} = 1$

– $C(v) = \frac{0}{3 \cdot 2} = 0$
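The clustering coefficient formula can be evaluated on a set of directed edges as follows. This is a sketch: it assumes that an undirected link is stored as two directed edges, and the helper `undirected` is introduced here only to build such edge sets.

```python
def undirected(pairs):
    """Store every undirected link as two directed edges."""
    return {(a, b) for a, b in pairs} | {(b, a) for a, b in pairs}


def clustering_coefficient(v, edges):
    """C(v) = e(N(v)) / (deg+(v) * (deg+(v) - 1))."""
    nbrs = {w for (u, w) in edges if u == v and w != v}
    k = len(nbrs)
    if k < 2:
        return 0.0  # no pair of neighbors exists
    # Directed edges whose two endpoints are both neighbors of v.
    links = sum(1 for (a, b) in edges if a in nbrs and b in nbrs and a != b)
    return links / (k * (k - 1))


# Vertex 0 with four neighbors that share four undirected links.
example = undirected([(0, 1), (0, 2), (0, 3), (0, 4),
                      (1, 2), (2, 3), (3, 4), (4, 1)])
```

With four neighbors and four undirected links between them, `clustering_coefficient(0, example)` evaluates $\frac{4 \cdot 2}{4 \cdot 3}$.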

• Which properties should a good P2P graph have?

Connectedness

• Each node should be reachable

• If not, some information is not accessible to all peers

k-Connectedness with large k

• Removing nodes should not immediately disconnect a graph

Low diameter 𝒅(𝑮)

• Low diameters are necessary to ensure reachability and reduce message load

Low diameter → smaller TTL possible when flooding

Low average path length $d_{avg}(G)$

• Most messages should quickly reach their target

Low average node degree $\deg(G)$

• The higher the node degree is, the more node states must be stored at nodes

Increases size of routing tables

High average clustering coefficient

• Densely connected neighborhoods increase the failure resilience of networks

• Distributed routing possible

See later: Kleinberg Model

Random graphs provide the easiest model for any network

Simple underlying assumptions

Analyzable with statistical methods

• First family of network models studied (1950s)

Multiple models for generating a random graph have been developed

– Most prominent generation models are

the Erdős-Rényi random graph

the Gilbert random graph

7.2 Random Graphs

A random graph is usually denoted as $g_{n,m}$

– Random graph with $n$ nodes and $m$ edges

For simplicity, we just consider undirected graphs

Basic idea for constructing a random graph

– Graph construction starts with $n$ vertices without any connections

– $m$ edges are added one by one between the vertices using some random process

Pure peer-to-peer networks like Gnutella 0.4 can be modeled by a random graph

Peers choose their neighbors more or less randomly

• Random bootstrapping, random Ping-Pong

Unfortunately, “real” Gnutella 0.4 networks are usually not really random

– Bootstrapping is not random

• Use of special bootstrap nodes or bootstrap caches

– Ping-Pong strengthens connectedness of neighborhood and favors “strong” nodes

Nodes prefer more popular and stronger nodes

• See later: scale-free networks

The behavior of random graphs is often studied for cases where the number of vertices diverges to infinity, i.e. 𝒏 → ∞

In context of P2P, think of scalability!

– While the number 𝑚 of edges could be fixed, it is usually assumed that 𝑚 grows with 𝑛

• e.g. new nodes in a P2P network will also lead to new connections

• Fixed 𝑚 would quickly lead to mostly unconnected graphs

• Thus, usually 𝒎 is a function of 𝒏


Erdős-Rényi graphs are the most popular family of random graphs (1959)

There are two predominant models which are equivalent for large graphs

• $g_{n,m}$ models

Based on randomly selecting an instance from all graphs with $n$ nodes and $m$ edges

• $g_{n,p}$ models

Each possible edge is added to the graph with a certain probability $p$

Also known as Gilbert graphs (1959)

7.2 Erdős-Rényi Graphs

Constructing $g_{n,m}$ graphs

– Let $G_{n,m}$ be the set of all labeled graphs with $n$ nodes and $m$ edges

• Labeled graphs: nodes are identifiable

Unlabeled random graphs only consider the “shape” of graphs

• The number of all such graphs is given by the binomial coefficient $|G_{n,m}| = \binom{N}{m} = \binom{\binom{n}{2}}{m}$

The number of possible edges between $n$ nodes is $N = \binom{n}{2}$

– For generating an instance $g_{n,m}$, any instance of $G_{n,m}$ is selected with equal probability

Erdős, P.; Rényi, A. (1959). "On Random Graphs. I.". Publicationes Mathematicae 6: 290-297

Example: Constructing $g_{3,2}$ graphs

– There are 3 possible $g_{3,2}$ in $G_{3,2}$

• Each graph is selected with probability $\frac{1}{3}$

[Figure: the three labeled graphs with nodes 1, 2, 3 and two edges each]

• The $g_{n,m}$ model of random graphs is not suitable for actually generating large random graphs

– Extremely high number of possible graphs for given $n$ and $m$

Number of labeled graphs $|G_{n,m}|$ (rows: number of edges $m$; columns: number of nodes $n$)

m\n  1  2  3   4    5     6       7         8           9            10
  1  0  1  3   6   10    15      21        28          36            45
  2  0  0  3  15   45   105     210       378         630           990
  3  0  0  1  20  120   455    1330      3276        7140         14190
  4  0  0  0  15  210  1365    5985     20475       58905        148995
  5  0  0  0   6  252  3003   20349     98280      376992       1221759
  6  0  0  0   1  210  5005   54264    376740     1947792       8145060
  7  0  0  0   0  120  6435  116280   1184040     8347680      45379620
  8  0  0  0   0   45  6435  203490   3108105    30260340     215553195
  9  0  0  0   0   10  5005  293930   6906900    94143280     886163135
 10  0  0  0   0    1  3003  352716  13123110   254186856    3190187286
 11  0  0  0   0    0  1365  352716  21474180   600805296   10150595910
 12  0  0  0   0    0   455  293930  30421755  1251677700   28760021745
 13  0  0  0   0    0   105  203490  37442160  2310789600   73006209045
 14  0  0  0   0    0    15  116280  40116600  3796297200  166871334960
 15  0  0  0   0    0     1   54264  37442160  5567902560  344867425584
 16  0  0  0   0    0     0   20349  30421755  7307872110  646626422970
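The count $|G_{n,m}| = \binom{\binom{n}{2}}{m}$ and the uniform selection of one instance can be sketched in a few lines. The function names are illustrative, and `sample_gnm` materializes all $\binom{n}{2}$ possible edges, so it is only meant for small $n$.

```python
import math
import random
from itertools import combinations


def count_labeled_graphs(n, m):
    """|G_{n,m}|: number of labeled graphs with n nodes and m edges."""
    return math.comb(math.comb(n, 2), m)


def sample_gnm(n, m, rng=random):
    """Pick one graph from G_{n,m} uniformly by choosing m distinct edges."""
    possible_edges = list(combinations(range(n), 2))
    return set(rng.sample(possible_edges, m))
```

Choosing an $m$-element subset of the possible edges uniformly at random is the same as choosing one labeled graph from $G_{n,m}$ with equal probability.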

For generative models: use the probabilistic $g_{n,p}$ model of Erdős-Rényi random graphs

So-called Gilbert graphs

Gilbert, E.N. (1959). "Random Graphs". Annals of Mathematical Statistics 30: 1141-1144.

– The number of nodes $n$ is fixed

– Each possible edge in $V \times V$ is added to the graph with the fixed probability $p$

• i.e. the underlying assumption is that adding an edge is fully independent of all existing edges

• Larger $p$ will generate graphs with more edges, smaller $p$ will generate graphs with fewer edges

7.2 Gilbert Model
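A Gilbert graph can be generated by flipping one independent biased coin per possible edge. The following is a minimal sketch for undirected graphs; the function name `sample_gnp` is an assumption made here.

```python
import random
from itertools import combinations


def sample_gnp(n, p, rng=random):
    """Undirected g_{n,p}: keep each of the possible edges with probability p."""
    return {edge for edge in combinations(range(n), 2) if rng.random() < p}
```

Each call does $\binom{n}{2}$ independent draws, so the number of edges varies from sample to sample around $\binom{n}{2} p$.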

• Both models $g_{n,m}$ and $g_{n,p}$ behave asymptotically equivalently for large $n$

Expected number of edges is $m = \binom{n}{2} p$ for large $n$

• The law of large numbers will guarantee equivalence for $p n^2 \to \infty$

– Thus, for large $p n^2$, statements about properties can be made like

• “Property P holds for most graphs in $g_{n,p}$” ⇔ “Property P holds for most graphs in $g_{n,m}$ with $m = \binom{n}{2} p$”

• Randomly generated graphs can be used to approximate properties of large random P2P networks

Many basic properties of random graphs were established by Erdős & Rényi in 1960 using large $g_{n,p}$ with $n \to \infty$

Asymptotic observations

Many properties are directly dependent on the probability $p$ (or the number of edges $m$)

Graphs show several phase transitions depending on the node/edge ratio

Each phase transition has a threshold at which certain properties suddenly become extremely probable

Before or after the threshold, the probability of a property $P$ is either $\mathbb{P}(P) \to 0$ or $\mathbb{P}(P) \to 1$ for $n \to \infty$

7.2 Random Graph Properties

Predicting connected components

– For $n \cdot p < 1$, a $g_{n,p}$ graph will rarely have any connected components larger than $O(\log n)$

• The graph is mainly unconnected, each of its components is very small

• e.g. for a graph $g_{n,m}$ with 150 nodes, this threshold is roughly around 74 edges

Example Graphs

[Figure: example graphs below the threshold; the panel labels were lost in extraction]

Statistical prediction: most components will be of logarithmic size with respect to the number of nodes (i.e. will be small)

Giant Connected Component

– For $n \cdot p = 1$, a graph $g_{n,p}$ will very probably have a giant connected component of size in $O(n^{2/3})$

• e.g. for a graph $g_{n,m}$ with 150 nodes, giant components should be observable for 75 edges and more

– Surprisingly, the giant component will appear when the average node degree is 1!

– For $n \cdot p > 1$, all other components will be of size $O(\log n)$

• Example Graphs (giant component appears)

Statistical prediction: for 𝑚 = 75 (𝑛 ∗ 𝑝 = 1), there is a largest component of size ≈ 28


• Example Graphs (other components diminish)

[Figure: two example graphs, $g_{150,m=150} \equiv g_{150,p=0.0134}$ and $g_{150,m=300} \equiv g_{150,p=0.0268}$]

– Statistical prediction: no other component will be large

Connectedness

– For $p < \frac{\ln n}{n}$, the graph will almost surely contain isolated vertices and will thus be disconnected

– For $p > \frac{\ln n}{n}$, the graph will almost surely be connected

• e.g. for a graph $g_{n,m}$ with 150 nodes, this threshold is around 374 edges
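The edge counts quoted for the 150-node examples follow from converting a threshold probability $p$ into an expected number of edges $m = \binom{n}{2} p$. This is a small sketch of that arithmetic; the function names are chosen for illustration.

```python
import math


def expected_edges(n, p):
    """Expected number of edges of g_{n,p}: C(n, 2) * p."""
    return math.comb(n, 2) * p


def giant_component_threshold_edges(n):
    """Edge count corresponding to n * p = 1."""
    return expected_edges(n, 1 / n)


def connectivity_threshold_edges(n):
    """Edge count corresponding to p = ln(n) / n."""
    return expected_edges(n, math.log(n) / n)
```

For $n = 150$ this gives 74.5 edges for the giant-component threshold and roughly 373.3 edges for the connectivity threshold, in line with the rounded values of 74 to 75 and about 374 edges used in the examples.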

• Example Graphs (connectedness)

[Figure: example graph $g_{150,374} \equiv g_{150,p=0.033}$]

Statistical prediction: for $p = 0.033 \approx \frac{\ln n}{n}$, the graph is almost surely connected

Degree Distribution

The node degree of large random graphs can be modeled with a Poisson distribution

• Let $\lambda$ be a constant, $\lambda = (n - 1) \cdot p$.

Then the probability distribution of the node degrees $k = 0, 1, 2, 3, 4, \dots$ can be approximated for $n \to \infty$ by the Poisson distribution

$\mathbb{P}(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$

• This degree distribution falls off faster than an exponential distribution in the degree, hence it is not a power-law distribution

• For larger $\lambda$, it behaves approximately like a normal distribution
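The Poisson approximation can be turned into an estimated degree histogram for a given $n$ and $p$. The sketch below assumes the names `poisson_pmf` and `expected_degree_histogram`, which are not part of the lecture.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) = lam^k * e^(-lam) / k!"""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def expected_degree_histogram(n, p, max_degree):
    """Estimated number of nodes with degree 0..max_degree in g_{n,p}."""
    lam = (n - 1) * p
    return [n * poisson_pmf(k, lam) for k in range(max_degree + 1)]
```

For example, `expected_degree_histogram(150, 1 / 150, 5)` estimates how many of the 150 nodes have degree 0 to 5, which is the kind of "estimated" curve a measured sample can be compared against.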

• Degree distribution for $g_{150,p=1/150}$ and $\lambda = 1$

– 69 edges

[Figure: estimated vs. measured degree distribution]

• Degree distribution for $g_{150,p=2/150}$ and $\lambda = 2$

– 142 edges

[Figure: estimated vs. measured degree distribution]

Diameter

– If $g$ is connected, the expected diameter of $g_{n,m}$ is in $O(\log n)$ with high probability

• i.e. the diameter of a connected random graph grows only logarithmically

• $g_{n,p}$ is almost surely connected for $p \ge \frac{\ln n}{n}$ for $n \to \infty$

or: $g_{n,m}$ is almost surely connected for $m \ge \binom{n}{2} \frac{\ln n}{n}$

[Figure: example graph with $d(G) = 7$]

Clustering Coefficient

– The clustering coefficient of a random graph $g_{n,p}$ is with high probability asymptotically equal to $p$ for $n \to \infty$

This is a rather low clustering coefficient

[Figure: two example graphs with nodes colored by $C$: $g_{100,p=0.03}$ with $C_{avg} \approx 0.0273$ and $g_{100,p=0.06}$ with $C_{avg} \approx 0.0563$]

Observation: “Real and natural” networks are not random, but have some inherent structure

Many “naturally” occurring networks are very robust and efficient

• Social network among people

• Neural networks

• Power lines, the Internet, streets, etc.

• …

What properties do real-life networks have?

• Why are they stable and efficient?

7.3 Small-World Graphs


First real networks to be studied: Social Networks among people

“Six degrees of Separation”

• First mentioned in 1929 by the Hungarian star author Karinthy Frigyes in his short story “Chains”

Claim: all 1½ billion people in the world (sic) know Frigyes via at most five acquaintances

» Friend-of-a-friend connections

Motivated by two examples

Example 1: some 1929 Nobel Prize laureate…

… knows King Gustav of Sweden…

… who passionately plays tennis and knows a famous tennis champion…

… who is a friend of Frigyes

Example 2: an unknown factory worker at a Ford plant

knows his boss, who knows Ford personally, who knows the director of the media house Hearst Publications, who knows the writer Árpád Pásztor, who is a friend of Frigyes

This idea was scientifically examined in 1967:

Sociologist Stanley Milgram, Yale University

– Persons chosen at random in Kansas and Nebraska were asked to deliver a letter to a certain stock broker in Boston, MA

– This was the only information about the target person

Constraint: The letter can only be given to persons one knows on a first name basis (acquaintances)

• 1967: No internet, transportation really expensive and cumbersome, close local communities

[Figure: S. Milgram (1933-1984)]

• Letters used in the Milgram experiment


• Those letters that reached the target person were passed on via only 6 intermediaries on average

“6 degrees of separation”

– This was far less than originally assumed!

Thus, social graphs were termed “Small-World Graphs”

• The original experiment was later criticized

– Only 50 persons took part in the original experiment

– Only 5% of the letters were actually received by the target person

But…

– One letter was received within only 4 days

– The small-world effect was experimentally observed in a vast variety of other sciences

Interesting trivia “Six Degrees of Kevin Bacon“

– Kevin Bacon once claimed that he's worked with everybody in Hollywood or someone who's worked with them

– College students built a party game out of that statement based on Milgram’s ideas

• Basic idea:

– Link actors via a minimum number of movies to actor Kevin Bacon

– e.g., Val Kilmer was in “Top Gun” with Tom Cruise, and Tom Cruise was in “A Few Good Men” with Kevin Bacon

– Only approximately 12% of all actors cannot be linked to Bacon

> try: http://oracleofbacon.org/


J. Leskovec, E. Horvitz. Worldwide Buzz: Planetary-Scale Views on an Instant-Messaging Network. Proc. International WWW Conference, 2008.

Data from June 2006

– 4.5 TB of compressed data.

– 245 million users logged in.

– 180 million users engaged in conversations.

– More than 30 billion conversations.

– More than 255 billion exchanged messages.

• Communication graph

– Edge if the users exchanged at least 1 message

– 180 million people

– 1.3 billion edges

– 30 billion conversations

Average path length 6.6

90% of the people can be reached in <8 hops

• However, it took a while until such naturally occurring networks were formally understood

Erdős–Rényi random graphs are bad models for natural networks

Natural networks often show “hubs”

There are some nodes with a very high node degree

Node degree is better described by a power-law distribution than by a Poisson distribution

Natural networks often show a very high degree of local clustering

High average clustering coefficients

e.g. due to local communities, friend cliques, co-worker networks, local transportation networks, etc.

Natural networks often have a low average path length

• First models for natural graphs were proposed by Duncan Watts and Steven Strogatz in 1998

Watts, D.J.; Strogatz, S.H. (1998). "Collective dynamics of 'small-world' networks." Nature 393 (6684): 440–442. doi:10.1038/30918

They examined three real-world networks

The simple neural “brain” network of the roundworm (nematode) Caenorhabditis elegans

A natural network

Power grid networks

A man-made network

Collaboration networks between movie actors

Semi-natural network

• Watts and Strogatz mainly examined the average path length and the cluster coefficient

– Comparison with equally sized random graphs

• Similar node and edge number

– Result:

Real networks have a much higher degree of local clustering (10x to 1000x higher) than random graphs

• Average path length is more or less similar


“Definition” of small-world graphs

“A small world network is a network with a dense local structure and a diameter comparable to a random graph with same numbers of nodes and edges.”

• Additionally: “The node degree is homogeneous”

• Watts and Strogatz also proposed the first generative model for a certain class of small-world graphs

So-called Watts-Strogatz graphs

• There are other small-world classes

7.3 Watts-Strogatz Graphs

Properties of Watts-Strogatz graphs

Low average path length

High average clustering coefficients

Homogeneously distributed node degrees

• Good model for e.g. social or neural networks and most other “natural” networks

Not a good model for most man-made grid-like networks

Those show power-law distributed node degrees

By definition, these are not small-world graphs

e.g. Internet, airline routes, train lines, etc.

– Watts-Strogatz graphs are between random and scale-free networks

The generative model (Watts-Strogatz model)

– The graph is denoted as $g^{ws}_{n,k,p}$

• $n$ is the number of nodes (integer)

• $k$ is the neighborhood degree (integer)

• $p$ is the rewiring probability (float in $[0, 1]$)

Build a ring of $n$ vertices and connect each vertex with its $k$ clockwise neighbors on the ring

Draw a random number between 0 and 1 for each edge

• Rewire each edge with probability $p$: if the random number is larger than $p$, do nothing; otherwise rewire.

Rewiring: keep the source vertex of the edge fixed, and choose a new target vertex uniformly at random from all other vertices
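The construction steps above can be sketched as follows. Two details are assumptions made for this sketch and are not stated in the lecture: rewiring never creates self-loops or duplicate links, and $n > 2k$ so that the initial ring has no duplicate links.

```python
import random


def watts_strogatz(n, k, p, rng=random):
    """Return the edge set of a g^ws_{n,k,p} graph as (source, target) pairs."""
    # Step 1: ring lattice, each vertex linked to its k clockwise neighbors.
    ring = [(v, (v + step) % n) for v in range(n) for step in range(1, k + 1)]
    edges = set(ring)
    # Step 2: visit every original edge and rewire it with probability p.
    for v, w in ring:
        if rng.random() < p:
            # Assumption: skip targets that would give a self-loop
            # or duplicate an existing link in either direction.
            candidates = [u for u in range(n)
                          if u != v and (v, u) not in edges and (u, v) not in edges]
            if candidates:
                edges.remove((v, w))
                edges.add((v, rng.choice(candidates)))
    return edges
```

With $n = 50$ and $k = 3$ the graph always has $n \cdot k = 150$ links, matching the comparison examples; rewiring only moves link endpoints and never changes the link count.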

– For $p = 0$, the resulting network is totally regular, with a clustering coefficient approaching $\frac{3}{4}$ for large $k$; the diameter is in $O(n)$

– For $p = 1$, the resulting network is a kind of random graph (regular random graph) with a diameter in $O(\log n)$

[Figure: example ring with $k = 2$]

• Comparing Watts-Strogatz Graphs

– $n = 50$, $m = 150$, i.e. $k = 3$

– Coloring by clustering coefficient

[Figure panels: Erdős-Rényi graph; Watts-Strogatz with $p = 0.0$; Watts-Strogatz with $p = 0.05$; Watts-Strogatz with $p = 0.1$]

Histogram of cluster coefficients

– Single sample

Random

– Generally lower coefficient

Small World

– Homogeneous, higher coefficient

[Figure: histogram of clustering coefficients; y-axis: number of nodes (0 to 50); series: p=0.00, p=0.02, p=0.05, p=0.10, random]

Histogram of node degrees

– Same sample

Random

– Homogeneous degree

– Higher variance

Small World

– Homogeneous degree

– Low variance

[Figure: histogram of node degrees; x-axis: node degree (1.5 to 10); y-axis: number of nodes (0 to 50); series: p=0.00, p=0.02, p=0.05, p=0.10, random]

Investigating clustering coefficients and average path lengths as a function of $p$

– For a graph with 5000 nodes

– Normalized by the clustering coefficient and the path length at $p = 0$

• The clustering coefficient is still high for small $p$, but the average path length decreases extremely fast due to “shortcuts”

• The Watts-Strogatz model explains how a small-world graph can be constructed

– i.e. “How can locally densely connected graphs with shortcuts be constructed?”

But: navigating a small world can be very difficult!

Assume “Six Degrees of Separation” were true: route a message to an arbitrary person

• All people on earth would be reachable via just six acquaintances

But which ones?

• Random navigation or flooding won’t help

Exponentially many possibilities!

Solution: Use clues and heuristics to quickly route the message into the correct neighborhood!

7.3 Kleinberg Navigability Model

Challenging question: how can we find short paths in a distributed fashion in a small world?

“Why should arbitrary pairs of strangers be able to find short chains of acquaintances that link them together?”

J.M. Kleinberg, “Navigation in a Small World”, Nature, 2000

Some routing information is necessary

• Enough but not too much information!

– Nodes see local parts of the network (neighborhood)

• i.e., they route the letter in a decentralized fashion

• In social networks additional information (same profession, address, hobbies, etc.) is used to decide which neighbor is ‘closest’ to the recipient

– Milgram showed that the first steps of the letter were the geographically largest, while later steps were closing in on the target area

A decentralized routing algorithm can be modeled as follows

– Let every node $v$ have a position $Pos(v)$ on a toroidal grid in a $d$-dimensional space

• $Pos(v) = (x_1, x_2, \dots, x_d)$ with all $x_i$ being integers

$Pos(v)$ is a $d$-dimensional vector

$x_i(v)$ is the position of $v$ in dimension $i$

– Every node knows some basic information about the underlying grid structure

• i.e. its own position in the grid, its neighbors, and the target node

No global knowledge, only local information

– Each node $v$ hands the message (i.e., letter) to the one of its neighbors that is closest to the target $t$

– The distance measure $d_M(v, w)$ is given by the Manhattan distance

• i.e. by the sum over the absolute differences, $d_M(v, w) = \sum_i |x_i(v) - x_i(w)|$

• Let the routing algorithm take place on the following network model

– Start with a $d$-dimensional grid

– Add random edges between vertices $v$ and $w$ with a probability of $P(v, w) \sim d_M(v, w)^{-\alpha}$

• inverse $\alpha$-th power distribution

• Node $u$ is connected to all its neighbors ($a$, $b$, $c$, and $d$) and has a long-range link to some randomly chosen node $v$ with a probability proportional to $dist(u, v)^{-\alpha}$

– The higher the distance, the lower the link probability
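The network model and the greedy forwarding rule can be sketched for the 2-dimensional case. Assumptions in this sketch: the Manhattan distance wraps around the torus, every node gets exactly one long-range link, and all function names are hypothetical.

```python
import random


def torus_distance(a, b, size):
    """Manhattan distance on a size x size torus (wrap-around per dimension)."""
    return sum(min(abs(x - y), size - abs(x - y)) for x, y in zip(a, b))


def grid_neighbors(v, size):
    """The four direct grid neighbors of v."""
    x, y = v
    return [((x + 1) % size, y), ((x - 1) % size, y),
            (x, (y + 1) % size), (x, (y - 1) % size)]


def sample_long_range_link(u, size, alpha, rng=random):
    """Pick a target v != u with probability proportional to dist(u, v)^-alpha."""
    nodes = [(x, y) for x in range(size) for y in range(size) if (x, y) != u]
    weights = [torus_distance(u, v, size) ** -alpha for v in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]


def greedy_route(source, target, size, long_links):
    """Forward to whichever known neighbor is closest to the target."""
    path = [source]
    current = source
    while current != target:
        candidates = grid_neighbors(current, size)
        if current in long_links:
            candidates.append(long_links[current])
        current = min(candidates, key=lambda v: torus_distance(v, target, size))
        path.append(current)
    return path
```

Since one grid neighbor is always one step closer to the target, every hop reduces the remaining distance and the route terminates; long-range links can only shorten it.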

Theorem: The routing algorithm will find ‘short’ paths if and only if $\alpha = d$

‘Short’ means that the path lengths between arbitrary nodes are in $O(\log n)$

– Simulation results for the greedy routing algorithm on a 2-dimensional toroidal grid with 20,000 × 20,000 nodes (averages over 1000 runs)

• Idea behind the proof is that for any 𝛼 < 𝑑 there are too few random edges to form shortcuts

• For 𝛼 > 𝑑 there are too many random edges, and hence too many choices to which the message could be passed on

The routing will degenerate into a random walk

Kleinberg small worlds thus provide a way of building a peer-to-peer overlay network allowing for a simple, greedy, and distributed routing protocol

But: How are nodes mapped to $d$-dimensional space such that the distance measurement is meaningful?

Small-world and random graphs show homogeneous node degree distributions

For small-world graphs, the distribution looks similar to a normal distribution with $\mu = 2k$ for non-extreme $p$

• The actual model is more complicated

• $k$ is the number of neighbors in the initial ring

The node degrees of random graphs are Poisson distributed

• For larger $m$, this will also approximate a normal distribution

• But many (especially artificial) real-life networks show extreme node degree distributions

e.g. strong hub topologies

7.4 Scale-Free Networks

In 1999, Albert-László Barabási (Univ. of Notre Dame) crawled parts of the WWW to investigate its actual structure

– The node degree is power-law distributed

• i.e., the probability that a node in the network is connected to 𝑘 other nodes is $P(k) \sim k^{-\gamma}$ (usually with 2 < 𝛾 ≤ 3)

– Most nodes have a small degree of around 1 to 2

– Few nodes have an extremely high node degree

– High-degree vertices are called ‘hubs’

Albert-László Barabási. “Linked: How Everything Is Connected to Everything Else and What It Means for Business, Science, and Everyday Life”. Plume, 2003. ISBN 978-0452284395


Definition: Graphs with a power-law node degree distribution form ‘scale-free’ networks

Also called power-law networks

What kind of network model can generate this more realistic degree distribution?

– Barabási–Albert model builds a certain subset of scale-free networks

Albert-László Barabási & Réka Albert. “Emergence of scaling in random networks”. Science, 1999. doi:10.1126/science.286.5439.509


Barabási–Albert model: Basic Idea

– In its simplest form denoted as 𝒈_𝒃𝒂(𝒏, 𝒎)

• 𝑛 is the number of nodes in the graph

• 𝑚 is the number of edges added per time step

• The total number of edges is thus approximately 𝑛 ∗ 𝑚 (exactly 𝑚 ∗ (𝑛 − 𝑛0) plus the edges of the initial graph)

– Start with any initial graph of size 𝑛0 with 𝑛0 ≥ 2 and degree deg(𝑣) ≥ 1 for every node 𝑣

• Often, just 𝑚 connected nodes are used as the default initial network

• If the initial network is not connected, the resulting network cannot be guaranteed to be connected

– The Barabási–Albert graph is constructed iteratively by adding new nodes one by one until the target size 𝑛 is reached

• Each added node represents one time step in a simulated network growth, i.e., discrete time modeling

• Each new node is connected to 𝒎 existing nodes

7.4 Barabási–Albert Graphs


New edges are not added randomly, but favor higher-degree nodes

“The rich get richer“

Preferential attachment to higher-degree nodes

The higher the degree of a possible target node, the higher the probability that the new node will attach to it

Preferential attachment defines the probability $\Pi(v)$ for vertex 𝑣 to get an edge to a new node

• In general, $\Pi(v)$ is proportional to the node degree, i.e. $\Pi(v) \sim \deg(v)$

• Most common definition is $\Pi(v) = \frac{\deg(v)}{\sum_{w \in V} \deg(w)}$
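
The growth process with preferential attachment can be sketched as below. This is a minimal illustration rather than a reference implementation. It assumes that the initial network is a path of max(𝑚, 2) nodes, that 𝑛 is at least that large, and that a duplicate target is simply redrawn (which deviates slightly from the exact probabilities for 𝑚 > 1). The function name `barabasi_albert` is hypothetical.

```python
import random

def barabasi_albert(n, m, seed=None):
    """Grow a graph to n nodes; each new node attaches to m existing nodes.

    Assumption (not from the slides): the initial network is a path of
    max(m, 2) nodes, so every initial node has degree >= 1.
    Returns an adjacency dict {node: set(neighbors)}.
    """
    rng = random.Random(seed)
    n0 = max(m, 2)
    adj = {v: set() for v in range(n0)}
    for v in range(n0 - 1):
        adj[v].add(v + 1)
        adj[v + 1].add(v)

    for new in range(n0, n):  # one new node per time step
        old = list(adj)
        degrees = [len(adj[v]) for v in old]
        targets = set()
        while len(targets) < m:
            # Pi(v) = deg(v) / sum_w deg(w); choices() normalizes the weights.
            targets.add(rng.choices(old, weights=degrees, k=1)[0])
        adj[new] = set()
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
    return adj
```

For example, `barabasi_albert(50, 1)` starts from two connected nodes and adds 48 nodes with one edge each, giving 49 edges in total.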


• Example: 𝒈_𝒃𝒂(𝟓, 𝟏)

– 𝒕 = 𝟎: Initial graph with the two connected nodes 𝑣1 and 𝑣2

– 𝒕 = 𝟏 − 𝜺: Add new node 𝑣3

• Probability for connecting any old node 𝑣 to 𝑣3 is given by $\Pi(v) = \frac{\deg(v)}{\sum_{w \in V} \deg(w)}$

• $\Pi(v_1) = \frac{1}{2}$, $\Pi(v_2) = \frac{1}{2}$

• Random decision steered by preferential attachment, e.g., connect to 𝑣1

– 𝒕 = 𝟏: Graph with edges 𝑣1–𝑣2 and 𝑣1–𝑣3

– 𝒕 = 𝟐 − 𝜺: Add new node 𝑣4

• Evaluate preferential attachment: $\Pi(v_1) = \frac{1}{2}$, $\Pi(v_2) = \frac{1}{4}$, $\Pi(v_3) = \frac{1}{4}$

• e.g., connect to 𝑣1

– 𝒕 = 𝟑 − 𝜺: Add new node 𝑣5

• Evaluate preferential attachment: $\Pi(v_1) = \frac{1}{2}$, $\Pi(v_2) = \frac{1}{6}$, $\Pi(v_3) = \frac{1}{6}$, $\Pi(v_4) = \frac{1}{6}$

• e.g., connect to 𝑣1
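
The attachment probabilities in this example can be recomputed with exact fractions. The sketch below assumes, as in the example, that the initial graph is the single edge 𝑣1–𝑣2 and that every new node happens to connect to 𝑣1. The helper `attachment_probabilities` is a hypothetical name.

```python
from fractions import Fraction

def attachment_probabilities(degrees):
    """Pi(v) = deg(v) / sum_w deg(w) for every existing node v."""
    total = sum(degrees.values())
    return {v: Fraction(d, total) for v, d in degrees.items()}

# Node degrees just before each new node is added,
# assuming every new node picks v1 as in the example.
before_v3 = {"v1": 1, "v2": 1}
before_v4 = {"v1": 2, "v2": 1, "v3": 1}
before_v5 = {"v1": 3, "v2": 1, "v3": 1, "v4": 1}

for label, degrees in [("v3", before_v3), ("v4", before_v4), ("v5", before_v5)]:
    probs = attachment_probabilities(degrees)
    print(label, {v: str(p) for v, p in probs.items()})
```

Because 𝑣1 gains one edge per step while the degree sum grows by two, its attachment probability stays at 1/2 in this particular run.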

• Comparing Barabási–Albert Graphs

– 𝑛 = 50, ~50 edges

– 𝑛 = 100, ~100 edges

– 𝑛 = 100, ~150 edges

[Figures: Erdős-Rényi Graph vs. Barabási–Albert Graphs, coloring by node degree]

Histogram of node degrees

– Single sample

– 100 nodes

– 300 edges

• Random

– Generally lower degree

• Small World

– Homogeneous degree

• Scale-Free

– Power-law

– Hubs visible

[Figure: histogram, x-axis: node degree (3 to 27), y-axis: number of nodes (0 to 70); series: Barabási (pa = 0.5), Watts-Strogatz (p = 0.05), Random. pa is a dampening factor for decreasing the strength of preferential attachment]

Node degree for larger Barabási–Albert graphs

– 200k nodes

– 400k edges

– Logarithmic scale

[Figure: relative frequency over degree]
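
A brief worked equation may clarify why a logarithmic scale is used for this plot. Assuming the power-law form 𝑃(𝑘) ~ 𝑘^(−𝛾) from the definition of scale-free networks, and writing 𝐶 for a normalization constant (a symbol introduced here for illustration), taking logarithms gives a linear relation:

```latex
\begin{align*}
P(k) &= C \, k^{-\gamma} \\
\log P(k) &= \log C - \gamma \log k
\end{align*}
```

On log-log axes, a power-law degree distribution therefore appears approximately as a straight line with slope −𝛾.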

Histogram of cluster coefficients (𝐶)

– Same sample

• Random

– Low 𝐶

• Small World

– Homogeneous high 𝐶

• Scale-Free

– Also power-law

– Lower than SW

[Figure: histogram, x-axis: cluster coefficient (0.05 to 0.95), y-axis: number of nodes (0 to 70); series: Barabási (pa = 0.5), Watts-Strogatz (p = 0.05), Random]
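
How such a histogram could be computed is sketched below. It assumes the usual local cluster coefficient (the fraction of pairs of a node's neighbors that are themselves connected) and sets 𝐶 = 0 for nodes with fewer than two neighbors. Both choices are assumptions here, and the function names are hypothetical.

```python
from itertools import combinations

def local_clustering(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention assumed here for degree 0 or 1
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def clustering_histogram(adj, bins=10):
    """Count nodes per cluster-coefficient bin of width 1/bins."""
    counts = [0] * bins
    for v in adj:
        c = local_clustering(adj, v)
        # C = 1.0 is placed in the last bin.
        counts[min(int(c * bins), bins - 1)] += 1
    return counts
```

Here `adj` is an adjacency dict mapping each node to the set of its neighbors. In a triangle every node has 𝐶 = 1, whereas the hub of a star has 𝐶 = 0 because none of its neighbors are connected to each other.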

An important property of scale-free networks is robustness against random failures

– Removing a random vertex 𝑣 will likely hit a low-degree node

– Expected damage to the network is small

A failing high-degree node can severely damage a network

– Better fail-safety is necessary for high-degree nodes to ensure overall robustness

Thus, scale-free networks are very sensitive to attacks

– If a malevolent attacker explicitly targets the highest-degree nodes, the network can easily decompose

Note: random graphs are less resilient against random failures, but also not particularly prone to attacks

– Most vertices have more or less the same degree
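
The difference between a random failure and a targeted attack can be illustrated on an extreme hub topology. The sketch below uses a star graph as a stand-in for a hub-dominated network (an assumption made for a deterministic example, not a scale-free graph from the lecture) and measures the size of the largest connected component after removing nodes. The helper names are hypothetical.

```python
from collections import deque

def largest_component(adj, removed=frozenset()):
    """Size of the largest connected component after deleting `removed`."""
    seen, best = set(removed), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search over surviving nodes
            u = queue.popleft()
            size += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best

def highest_degree_nodes(adj, count):
    """The `count` nodes with the largest degree (the hubs)."""
    return set(sorted(adj, key=lambda v: len(adj[v]), reverse=True)[:count])

# Extreme hub topology: a star with hub 0 and nine leaves.
star = {0: set(range(1, 10)), **{leaf: {0} for leaf in range(1, 10)}}
print(largest_component(star, removed={1}))  # failure of a single leaf
print(largest_component(star, removed=highest_degree_nodes(star, 1)))  # attack on the hub
```

Losing a leaf leaves the rest of the star connected, while removing the hub isolates every remaining node. The same two functions could be applied to a generated Barabási–Albert graph to compare random removals with removals of the highest-degree nodes.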


7.5 Comparing Graphs

Random Graph: 50 nodes, 50 edges (color by degree)

Property             Value
Connected            No
Diameter (conn.)     9
Avg. Path Length     4.39
#Clusters            6
Largest Cluster      39
k-connectedness      0
Avg. Cluster Coeff.  0.033
Avg. Degree          2

Watts-Strogatz Graph: 50 nodes, 50 edges (𝑝 = 0.05)

Property             Value
Connected            No
Diameter (conn.)     35
Avg. Path Length     12.73
#Clusters            2
Largest Cluster      38
k-connectedness      0
Avg. Cluster Coeff.  0
Avg. Degree          2

Barabási-Albert Graph: 50 nodes, 49 edges (𝑝𝑎 = 0.8)

Property             Value
Connected            Yes
Diameter             12
Avg. Path Length     5.14
k-connectedness      1
Avg. Cluster Coeff.  0
Avg. Degree          1.96
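
The average degree values in these tables follow from the fact that every edge contributes to the degree of two nodes. As a quick check with the node and edge counts given for the three graphs (the symbols $\bar{d}$, $|E|$, and $|V|$ are notation introduced here):

```latex
\bar{d} = \frac{2\,|E|}{|V|}, \qquad
\frac{2 \cdot 50}{50} = 2, \qquad
\frac{2 \cdot 49}{50} = 1.96
```

The first value matches the random and Watts-Strogatz graphs (50 edges each), the second the Barabási-Albert graph (49 edges).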
