Peer-to-Peer Data Management

(1)

Wolf-Tilo Balke Sascha Tönnies

Institut für Informationssysteme

Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de


(2)

Overview

• Why Peer-to-Peer Databases?

– Federation

– Information integration

– Sensor networks

• P2P Databases

– Challenges

– Design Dimensions

• Existing P2P Database systems

– Edutella: focus on expressivity

– PIER: focus on scalability

– Piazza: focus on integration

(3)

1 Motivation

• Peer-to-peer data management might need some database-like functionality

– Complex queries over possibly large volumes of data

• Examples of applications include

– Federation of sources

– Information integration

– Sensor networks

– "New" internet

(4)

1.1 Federation of similar data providers

• Examples

– (Digital) Libraries

– Primary Scientific Data Providers (Gene Databases)

– News Providers

• All nodes offer the same kind of information

• Homogeneous network (fixed schema)

• Non-P2P solutions exist, but not open/scalable

(5)

1.2 Information Integration

• Examples

– Find German professors who have published at least three papers at the Conference on Very Large Databases

– Find an introductory database book in German, written by a German professor

– Find all recordings of Mozart's "Magic Flute" with conductors who have also conducted the Berliner Philharmoniker

• Very tedious to find with current search engines

• Needs database-like querying capabilities

• Heterogeneous network

– Information from several databases needs to be combined

(6)

1.3 Sensor Networks

• Examples

– Network Monitoring:

• network maps

• event detections

• ...

– Car Traffic Monitoring

• Huge number of nodes

• Small amount of data

• Homogeneous network

Screenshots from project PHI presentation, J. Hellerstein, Berkeley

(7)

Overview

1. Why Peer-to-Peer Databases?

1. Federation

2. Information integration

3. Sensor networks

2. P2P Databases

1. Challenges

2. Design Dimensions

3. Existing P2P Database systems

1. Edutella: focus on expressivity

2. PIER: focus on scalability

3. Piazza: focus on integration

4. HiSbase: focus on scalability for spatial data

(8)

2.1 Challenges of Schema-Based P2P Networks

• Multi-Dimensional Search Space

– DHTs only work for one dimension (one attribute)

• Schema Heterogeneity

– Sources use different database schemas for similar information

• Potentially large result sets

– SELECT * FROM Firewalls.BlockedPackets ...

– Range and Aggregate Queries

• And the usual P2P challenges...

– Trust

– Network Churn

(9)

2.2 Design Dimensions

• Network Properties

– Data Placement

– Topology and Routing

• Data Access

– Data Model

– Query Language

• Integration Mechanism

– Mapping Representation

– Mapping Creation

– Integration Method

(10)

2.2 Data Placement

• Placement according to ownership

– Data stays at information source

– Full control of data by owner (access policy, availability, etc.)

– More autonomy of single nodes

• Placement according to search strategy

– Data is distributed according to later access mechanism (e.g., DHT)

– No control over data access

– More freedom to optimize query routing

• Additional caching/replication possible

– Essential for load balancing

(11)

2.2 Topology and Routing (1)

• Unstructured Networks

– Flooding as routing algorithm

– Supports arbitrarily expressive queries

– Agnostic to schema heterogeneity

– Inefficient (filtered flooding can help)

• Short-cut networks

– Unstructured, but continuously optimize network connections

– Can develop into regular structures like Small-World networks

– Clustering and filtered flooding reduce query distribution traffic

– Fireworks routing

(12)

2.2 Topology and Routing (2)

• Super-peer networks

– Inherit the advantages and disadvantages of unstructured networks

– Better efficiency and scaling (but still flooding)

– Good match to distributed databases (super-peers become mediators)

• DHT Networks

– Create separate overlay for each attribute

• Or use multidimensional DHTs, e.g., Mercury

– Limited query expressivity

– Suitable for homogeneous schema

(13)

2.2 Topology and Routing - Summary

• Local indexing

– No knowledge about other peers

– Doesn't scale

• Central indexing

– One node holds complete index

– Single point of control (and failure)

• Distributed indexing

– Distributed Hash Tables

– Filtered Flooding

– Short-cut networks

– Super-peer networks

(14)

2.2 Data Model

• Fixed set of attributes

– Allows for sophisticated topologies

– Inflexible

– Applicability: custom applications

• Relational model

– Usual database model

– Not designed for distribution

• XML

– Semi-structured data

• RDF

– Semantic Web exchange format

– Very suitable for distributed data

(15)

2.2 Query Language

• None

– Fixed set of parameterized queries

• Relational query language

– Always subset of SQL

• XML query language

– XPath or XQuery

• RDF Query Language

– SPARQL or its predecessors

– Logic language

(16)

2.2 Mapping Representation

• Declarative

– Translation between schema elements

– Distributed database approaches applicable

• Procedural

– Imperative description of how to translate/transform queries and data

• Mapping characteristics

– Unidirectional or bidirectional

– Simple (one-to-one) mapping or complex mappings

• Mapping of objects

– State equality of objects in different sources

(17)

2.2 Mapping Creation

• Manual

– Users create mappings

– Network distributes mappings and uses them for translation

• Semi-automatic

– System proposes mappings, based on heuristics

• attribute name

• similar data

– User feedback used to validate created mappings

• Automatic

– E.g., probabilistic mapping

– Techniques similar to those used for semi-automatic mapping

(18)

2.2 Integration Mechanism

• Query Rewriting

– Query is translated to target schema

– Data is translated back to source schema

– Most common approach

• Data Rewriting

– Data is replicated to source schema

– Only feasible for small data sets

(19)

2.2 Existing Systems - Typology

• Focus on network scalability

– homogeneous schema

– low query expressivity

– DHT as underlying network structure

• Focus on expressivity

– super-peer or unstructured

– unlimited query complexity

• Focus on integration

– typically unstructured

– query routing driven by mappings

(20)

2.2 Existing Systems – Overview

List not complete

Focus         Name      Topology        Data Placement  Data Model     Query Language
Scalability   PIER      DHT (Bamboo)    Distributed     Relational     SQL subset
              RDFPeers  DHT (MAAN)      Distributed     RDF            -
              Mercury   DHT (Symphony)  Distributed     Tuples         -
Expressivity  SQPeer    Super-peer      Owner           RDF            RQL
              PeerDB    Unstructured    Owner           Relational     SQL subset
              Edutella  Super-peer      Owner           RDF            datalog (SQL)
Integration   Piazza    Unstructured    Owner           XML            XQuery subset
              GridVine  DHT (P-Grid)    Distributed     RDF            -
              DRAGO     Unstructured    Owner           Descr. Logics  OWL subset

(21)


(22)

3.1 Edutella: Introduction

• Initial goal: achieve interoperability between heterogeneous metadata-driven (e-learning) systems

• Provides metadata only, not the resources

– Resources are fetched via http

• Query Examples

– “Find software engineering course lecture notes for undergraduates in German language”

– “Find an introduction to Enterprise Java Beans for professionals”

– “Find a course in software requirements analysis from a Swedish university”

(23)

3.1 Query Service

• Provides standardized query/retrieval of RDF metadata stored in distributed RDF repositories

• Query Exchange Language

– Based on Datalog (allows expression of rules)

– RDF syntax

– For exchange only

• Adapters enable QEL (Query Exchange Language) query processing on diverse backends

(24)

3.1 Query processing

• Parsers/Formatters convert between query languages

• Applications and backends are shielded from communication layer

• Query messages are exchanged in RDF/XML format

• Wrappers available for SQL, RDQL, RQL, and others

[Figure labels: Providers and Consumer; Application; Edutella Consumer Interface; Query Parser (application-specific format, EQM); P2P Network (QEL); Edutella Provider Interface; Query Formatter (EQM, repository-specific format); Back-End (Repository)]

(25)

3.1 Edutella Topology

• Super-Peers

• Content Providers

• Content Consumers

• Use filtered flooding in super-peer backbone

• HyperCuP topology for backbone

(26)

3.1 Cayley Graphs

• Graph representing a permutation group G, described by a set of generators

– Regular, vertex-symmetric, recursively decomposable

– Optimal routing and broadcast algorithms exist

[Figure: example Cayley graphs with edges labeled by generator (0, 1, 2), including a graph whose vertices are the permutations of 1234]

(27)

3.1 Super-peer Topology: HyperCuP

• Super-peers are arranged as a hypercube

• Broadcast needs n-1 messages and log₂(n) hops

• High connectivity, resilient against node failures

[Figure: eight super-peers SP₁ to SP₈ arranged as a hypercube with edges labeled by dimension (0, 1, 2), shown next to the corresponding minimal spanning tree]
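
The message and hop counts above can be checked with a small simulation. This is a minimal sketch, assuming a complete hypercube with n = 2^d nodes whose IDs differ in exactly one bit per edge, and assuming the common rule that a node forwards a broadcast only along dimensions higher than the one it arrived on. The function name and the integer node IDs are illustrative and not taken from HyperCuP itself.

```python
from collections import deque


def hypercube_broadcast(dimensions, origin=0):
    """Simulate a broadcast in a complete hypercube with 2**dimensions nodes.

    Returns (number of messages sent, {node: hop distance from origin}).
    """
    hops = {origin: 0}
    messages = 0
    # Each queue entry is (node, dimension the message arrived on).
    # The origin has not received anything, so it starts at -1.
    queue = deque([(origin, -1)])
    while queue:
        node, arrived_on = queue.popleft()
        # Forward only along strictly higher dimensions to avoid duplicates.
        for dim in range(arrived_on + 1, dimensions):
            neighbor = node ^ (1 << dim)  # flip one bit = follow one edge
            messages += 1
            hops[neighbor] = hops[node] + 1
            queue.append((neighbor, dim))
    return messages, hops


messages, hops = hypercube_broadcast(3)
print(messages, len(hops), max(hops.values()))
```

For d = 3 the sketch reaches all 8 nodes with 7 messages and at most 3 hops, matching n-1 messages and log₂(n) hops.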

(28)

3.1 Super-Peer-based Query Routing

• Database fragment summaries

• Index structure and maintenance

• Query Routing

(29)

3.1 Peer Fragment Summaries

Peer1.Doc

Identifier  Title                  Date  Format  Language
521354021   Csdoi sdofi sfi sfdsf  1948  Book    de
593574021   Deor aodfi sdfwe dls   1952  Book    de
534536021   Toid sdofij cvcdova    1937  Book    de
528943021   Csdo asofdi weor       1916  Book    de
529874521   Epodsf csmieo mo       1924  Book    de
526983221   Awer fzwe xhzpwf       1959  Book    de

Peer2.Doc

Identifier  Title                Date  Language  Coverage
1861978766  Eoite odsifj woifj   1993  en        Scotland
1394875966  Oewr svonwe          2005  en        Wales
1817305606  Psadoifh sdafns dsf  1999  en        York
1809239086  Vsd sdfokj sfew      2001  en        West Midlands
1345398705  Wdfj vspo sdfp dort  1989  en        London

Peer1 Summary

Doc.Identifier

Doc.Title

Doc.Date[1916-1959]

Doc.Format[Book]

Doc.Language[de]

Peer2 Summary

Doc.Identifier

Doc.Title

Doc.Date[1989-2005]

Doc.Language[en]

Doc.Coverage[UK]
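
How a peer could derive such a summary from its local table can be sketched as follows. This is an illustration under stated assumptions, not Edutella's actual code: the function `summarize` and its parameters are hypothetical, dates are summarized as a range, selected attributes as value sets, and all other attributes only at schema level. Generalizing values such as Scotland or Wales to UK would need a taxonomy and is left out.

```python
def summarize(table, rows, range_attrs=("Date",), value_attrs=("Format", "Language")):
    """Build a fragment summary: ranges, value sets, or schema-only entries."""
    summary = {}
    for row in rows:
        for attr, value in row.items():
            key = f"{table}.{attr}"
            if attr in range_attrs:
                # Keep only the smallest and largest value seen so far.
                low, high = summary.get(key, (value, value))
                summary[key] = (min(low, value), max(high, value))
            elif attr in value_attrs:
                # Keep the set of distinct values.
                summary.setdefault(key, set()).add(value)
            else:
                # Schema level only: the attribute exists, values are not indexed.
                summary.setdefault(key, None)
    return summary


# Three of the Peer1.Doc rows from the slide.
peer1_rows = [
    {"Identifier": 521354021, "Title": "Csdoi sdofi sfi sfdsf", "Date": 1948,
     "Format": "Book", "Language": "de"},
    {"Identifier": 528943021, "Title": "Csdo asofdi weor", "Date": 1916,
     "Format": "Book", "Language": "de"},
    {"Identifier": 526983221, "Title": "Awer fzwe xhzpwf", "Date": 1959,
     "Format": "Book", "Language": "de"},
]

peer1_summary = summarize("Doc", peer1_rows)
print(peer1_summary)
```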

(30)

3.1 Super-peer / Peer Indices

• Peers forward their summary to the super-peer

[Figure: super-peers SP₁ to SP₄ connected by edges of dimension 0 and 1; peers P₁ to P₄ attached to super-peers]

Peer1 Summary

Doc.Date[1916-1959]

Doc.Format[Book]

Doc.Language[de]

Peer2 Summary

Doc.Identifier

Doc.Title

Doc.Date[1989-2005]

Doc.Language[en]

Doc.Coverage[UK]

Super-Peer1 SP/P Index

Doc.Identifier            P₁, P₂
Doc.Title                 P₁, P₂
Doc.Date      [1916-1959] P₁
              [1989-2005] P₂
Doc.Format    [Book]      P₁
Doc.Language  [de]        P₁
              [en]        P₂
Doc.Coverage  [UK]        P₂

(31)

3.1 Super-Peer Fragment Summaries

Doc

Identifier  Title                  Date  Format  Language  Coverage
521354021   Csdoi sdofi sfi sfdsf  1948  Book    de        -
593574021   Deor aodfi sdfwe dls   1952  Book    de        -
534536021   Toid sdofij cvcdova    1937  Book    de        -
528943021   Csdo asofdi weor       1916  Book    de        -
529874521   Epodsf csmieo mo       1924  Book    de        -
526983221   Awer fzwe xhzpwf       1959  Book    de        -
1861978766  Eoite odsifj woifj     1993  -       en        Scotland
1394875966  Oewr svonwe            2005  -       en        Wales
1817305606  Psadoifh sdafns dsf    1999  -       en        York
1809239086  Vsd sdfokj sfew        2001  -       en        West Midlands
1345398705  Wdfj vspo sdfp dort    1989  -       en        London

Super-Peer1 (SP1) Summary

Doc.Date[1916-2005]

Doc.Format[Book]

Doc.Language[de, en]

Doc.Coverage[UK]

(32)

3.1 Super-peer/Super-peer Indices

• Naive forwarding is not optimal

SP1 Summary

Doc.Date[1916-2005]

Doc.Format[Book]

Doc.Language[de, en]

Doc.Coverage[UK]

[Figure: super-peers SP₁ to SP₄ connected by edges of dimension 0 and 1. The SP/SP index boxes shown (one labeled Super-Peer2) list, for Doc.Language[de] and [en], the entries SP₁; SP₁; SP₂, SP₃; and SP₂.]

(33)

3.1 Super-peer/Super-peer Indices

• Take edge dimension into account

– Forward SP/SP index entries only along lower edges

SP1 Summary

…

Doc.Language[de, en]

…

[Figure: super-peers SP₁ to SP₄ connected by edges of dimension 0 and 1. The SP/SP index boxes list, for Doc.Language[de] and [en], the entries SP₁(1), SP₁(0) and SP₃(0), with the edge dimension in parentheses; one index box has no entry.]

(34)

3.1 Query Routing

• Use SP/P and SP/SP indices as filters

SELECT * FROM Doc WHERE Language = 'de' AND …

[Figure: super-peers SP₁ to SP₄ connected by edges of dimension 0 and 1, with peers P₁ to P₄. SP/SP index entries for Doc.Language[de] and [en]: SP₁(1) and SP₃(0). Super-Peer1 SP/P index: Doc.Language[de] points to P₁, Doc.Language[en] points to P₂.]
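
Using an index as a filter can be sketched as below. This is a simplified illustration, assuming an index that maps (attribute, value) pairs to the set of peers holding matching data and a query that is a conjunction of equality conditions. The dictionary layout and the function `route` are hypothetical. The same lookup would apply to an SP/SP index, with neighboring super-peers in place of peers.

```python
# SP/P index of Super-Peer1, restricted to value-level entries from the slides.
sp_p_index = {
    ("Doc.Format", "Book"): {"P1"},
    ("Doc.Language", "de"): {"P1"},
    ("Doc.Language", "en"): {"P2"},
}


def route(index, conditions):
    """Return the peers that may hold results for ALL given conditions."""
    # Start with every peer known to the index, then narrow down.
    candidates = set().union(*index.values())
    for condition in conditions:
        candidates &= index.get(condition, set())
    return candidates


# SELECT * FROM Doc WHERE Language = 'de'
targets = route(sp_p_index, [("Doc.Language", "de")])
print(targets)
```

Only P1 receives the query. A query for English-language books would be forwarded to no peer at all, because no single peer matches both conditions.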

(35)

3.1 Application: P2P Digital Library Network

• Large number of individual digital libraries (DLs)

• Autonomous institutions

• Users have to

– find relevant DLs

– search each DL they find separately

• Violates the 4th law of Library Science

– “Save the time of the reader” (Ranganathan, 1931)

(36)

3.1 DL Search Engine Solution

• Search engine approach

– "Crawl" DLs

– Copy content

– Offer unified collection

• Issues

– Search engine controls content

– Proprietary interface (or just Web crawl)

– Difficult to preserve metadata

– Single point of failure

(37)

3.1 Open Archives Initiative Solution

• Standardize the metadata "crawling" interface

– OAI-PMH (Protocol for Metadata Harvesting)

• Harvesters

– collect metadata from DLs

– offer search facilities

• Issues

– No single entry point

– Harvesters control content

– Points of failure

– Incentive for Harvester?

(38)

3.1 From OAI to P2P

• Create a "peer wrapper" for existing DLs

[Figure labels: Super-peer backbone; Digital Libraries; OAI-PMH Interface; Content Providers]

(39)

3.1 OAI-P2P – a Digital Library Network

• P2P approach:

– DLs form self-organized network

– User queries are distributed

• Advantages

– No dependency on service provider

– Each DL still controls its content

– No single point of failure

• 5th law of Library Science:

– “The library is a growing organism” (Ranganathan, 1931)

(40)

3.1 Edutella – Discussion

• Efficiently limits query distribution to relevant peers

• Very good scalability in terms of data size

– No data movement required

– Little index maintenance effort

• Flooding limits super-peer backbone scalability

– Will never scale to millions of peers

• Mainly query forwarding

– Initial extension to full query planning exists

• No load-balancing mechanisms

(41)


(42)

3.2 PIER

• P2P Relational Database

• Foundation: any DHT

• Extended hash interface

– put(namespace, key, value)

– get(namespace, key)

– namespace/key combination is used as hash value (DHT Key)

• Extended network capabilities

– Exploit DHT structure for broadcast

[Figure: DHT nodes 0 to 15 connected by a spanning tree]

(43)

3.2 Application: Phi

• Phi: Public Health for the Internet

– Monitor IP network state world-wide

– Collect statistics

• Network traffic

• Latency

• …

– Malware alerts

(44)

3.2 Storing and Indexing Tuples

• Storing

– Every tuple needs a synthetic tuple key

– Choose combination of table name and tuple key as DHT key

– Insert complete tuple into DHT using this key

• Indexing

– Additional attribute indexes are built by inserting attribute value/tuple key pairs into the DHT

– Choose combination of attribute name and attribute value as DHT key

– Insert tuple key as DHT value

(45)

3.2 Example

• Sample Database

– Doc(Id, Title, Date, Language)

– Author(DocId, PersonId)

– Person(Id, Name, Surname)

• Sample tuple: (456, 'Critique of Pure Reason', 1781, 'en')

• Storing

– put(Doc, 456, (456, 'Critique...', 1781, 'en'))

• Indexing on 'Title' and 'Date' attributes

– put(Doc.Title, 'Critique...', 456)

– put(Doc.Date, '1781', 456)
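
The storing and indexing scheme can be tried out with a toy stand-in for a DHT. This is a minimal sketch, not PIER's implementation: `ToyDHT` is a hypothetical class that hashes the namespace/key combination with SHA-1 and places the value on one of a fixed number of in-memory nodes by modulo, with no routing, churn handling, or replication.

```python
import hashlib
from collections import defaultdict


class ToyDHT:
    """In-memory stand-in for a DHT with a namespace-aware put/get interface."""

    def __init__(self, num_nodes=16):
        # Each node is a local store mapping (namespace, key) to a list of values.
        self.nodes = [defaultdict(list) for _ in range(num_nodes)]

    def _node_for(self, namespace, key):
        # The namespace/key combination determines the responsible node.
        digest = hashlib.sha1(f"{namespace}/{key}".encode()).digest()
        return self.nodes[int.from_bytes(digest, "big") % len(self.nodes)]

    def put(self, namespace, key, value):
        self._node_for(namespace, key)[(namespace, key)].append(value)

    def get(self, namespace, key):
        return list(self._node_for(namespace, key).get((namespace, key), []))


dht = ToyDHT()
doc = (456, "Critique of Pure Reason", 1781, "en")

# Storing: table name plus tuple key form the DHT key, the tuple is the value.
dht.put("Doc", doc[0], doc)

# Indexing: attribute name plus attribute value form the DHT key,
# the tuple key is the value.
dht.put("Doc.Title", doc[1], doc[0])
dht.put("Doc.Date", doc[2], doc[0])

print(dht.get("Doc.Date", 1781))  # tuple keys of documents dated 1781
print(dht.get("Doc", 456))        # the complete tuple
```

Every index entry is a separate DHT insertion, so storing one tuple with k indexed attributes costs k + 1 put operations in this scheme.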

(46)

3.2 PIER Query Plans

• DHT-Scan

– Use index to retrieve tuple key(s)

– Use key(s) to retrieve data tuple(s)

• Example

– SELECT Id, Title FROM Doc WHERE Date = '1781' AND Lang = 'en'

• Each peer can create a query plan

• One DHT lookup per result tuple

• Filter has to be done on query originator

Query plan: dht-scan_Subject(Doc, Date='1781') → filter(Lang='en') → project({Id, Title})
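
The three plan operators can be sketched over a dictionary that stands in for the DHT. This is an illustrative sketch only: the helper names `put`, `get` and `dht_scan` are hypothetical, the second and third documents are invented sample data, and a real system would spread the lookups over many peers.

```python
dht = {}  # stand-in for the DHT: (namespace, key) -> list of values


def put(namespace, key, value):
    dht.setdefault((namespace, key), []).append(value)


def get(namespace, key):
    return dht.get((namespace, key), [])


def dht_scan(table, attribute, value):
    """Use the attribute index to find tuple keys, then fetch each tuple."""
    for tuple_key in get(f"{table}.{attribute}", value):
        yield from get(table, tuple_key)  # one DHT lookup per result tuple


docs = [
    {"Id": 456, "Title": "Critique of Pure Reason", "Date": "1781", "Lang": "en"},
    {"Id": 457, "Title": "Kritik der reinen Vernunft", "Date": "1781", "Lang": "de"},
    {"Id": 458, "Title": "Critique of Judgment", "Date": "1790", "Lang": "en"},
]
for doc in docs:
    put("Doc", doc["Id"], doc)               # store the tuple
    put("Doc.Date", doc["Date"], doc["Id"])  # index on Date

# SELECT Id, Title FROM Doc WHERE Date = '1781' AND Lang = 'en'
scanned = dht_scan("Doc", "Date", "1781")
filtered = (d for d in scanned if d["Lang"] == "en")   # filter at the originator
result = [(d["Id"], d["Title"]) for d in filtered]     # project({Id, Title})
print(result)
```

Both documents dated 1781 are fetched before the language filter discards one of them, which illustrates why filtering at the query originator costs extra lookups.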

(47)

3.2 Aggregate and Range Queries

• Example

– SELECT COUNT(Id) FROM Doc WHERE Date > '1780' AND Date < '1790'

• Use spanning tree for broadcast

• Aggregate on return

[Figure: partial counts at the nodes of the spanning tree are added up on the way back to the query originator]
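
Aggregation on the return path can be sketched with a recursive walk over the spanning tree. This is a minimal sketch under stated assumptions: the tree shape, the node IDs and the per-node dates are invented for illustration, and the recursion stands in for the messages that children would send to their parents.

```python
def count_matching(tree, local_dates, node, low, high):
    """Count dates with low < date < high in the subtree rooted at node."""
    # Each peer evaluates the range predicate on its own data only.
    local = sum(1 for date in local_dates.get(node, []) if low < date < high)
    # Partial counts from the children are added on the way back up.
    return local + sum(
        count_matching(tree, local_dates, child, low, high)
        for child in tree.get(node, [])
    )


# Hypothetical spanning tree (node -> children), rooted at the query originator 0.
tree = {0: [1, 2], 1: [3, 4], 2: [5]}

# Hypothetical publication dates stored at each node.
local_dates = {0: [1781], 1: [1785, 1795], 2: [], 3: [1788, 1789], 4: [1700], 5: [1782]}

# SELECT COUNT(Id) FROM Doc WHERE Date > 1780 AND Date < 1790
total = count_matching(tree, local_dates, 0, 1780, 1790)
print(total)
```

Each node sends a single number to its parent, so the traffic on the return path does not grow with the number of matching tuples.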

(48)

3.2 Join Queries

• Example

– Assume a Person tuple (789, 'Kant', 'Immanuel')

– SELECT Id, Title FROM Doc, Author WHERE Author.DocId = Doc.Id AND Author.PersonId = 789

• Approach: Hierarchical Joins

– Use spanning tree for broadcast

– Do local select on peer table fragments

– Do local join on each peer

• Improves load balancing

– Forward table fragments and partial results to parent

(49)

3.2 Hierarchical Joins

[Figure: spanning tree of peers holding Doc fragments D¹ to D³ and Author fragments A¹ to A³; fragments and partial join results T¹¹ to T³³ are forwarded towards the root]

(50)

3.2 PIER - Discussion

• Real query planning

• Very efficient access to individual tuples and small result sets

• Very good scalability in terms of network size

• Degrades to broadcast for many types of queries

– Aggregate queries – Joins

• INSERT operation expensive (see P2P Information Retrieval)

• No load-balancing mechanisms

(51)


(52)

3.3 Piazza

• Tackles the problem of "reconciling different models of the world" (A. Halevy)

• Goal: provide a uniform interface to a set of autonomous data sources

• New abstraction layer over multiple sources

• Introduce mappings between "world views"

– Mapping rules are specified manually by experts

(53)

3.3 Example – Publication Databases

UCSD

(54)

3.3 Mapping Rules

• Datalog is used to specify mapping rules

Mapping from UW to UCSD

UCSD:Member(projName, member) :- UW:Member(_, pid, member, _),
                                 UW:Project(pid, _, projName)

Mapping from UPenn to UCSD

UCSD:Member(projName, member) :- UPenn:Student(sid, member, _),
                                 UPenn:ProjMember(pid, sid),
                                 UPenn:Project(pid, projName, _)

UCSD:Member(projName, member) :- UPenn:Faculty(sid, member, _),
                                 UPenn:ProjMember(pid, sid),
                                 UPenn:Project(pid, projName, _)
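
Read operationally, the UW to UCSD rule is a join of UW:Member and UW:Project on pid followed by a projection. The sketch below evaluates it on invented data. It only illustrates what the rule means and is not how Piazza processes mappings: the sample tuples are hypothetical, and the positions written as _ in the rule are filled with placeholder values.

```python
def uw_to_ucsd_member(uw_member, uw_project):
    """Evaluate UCSD:Member(projName, member) from UW:Member and UW:Project."""
    # UW:Project(pid, _, projName): look up the project name by pid.
    proj_name_by_pid = {pid: proj_name for (pid, _, proj_name) in uw_project}
    # UW:Member(_, pid, member, _): join on pid, project to (projName, member).
    return {
        (proj_name_by_pid[pid], member)
        for (_, pid, member, _) in uw_member
        if pid in proj_name_by_pid
    }


# Hypothetical UW data; "x" marks attributes the rule ignores.
uw_member = [
    ("x", "p1", "Alice", "x"),
    ("x", "p2", "Bob", "x"),
    ("x", "p9", "Carol", "x"),  # no matching project, so no UCSD tuple
]
uw_project = [
    ("p1", "x", "Piazza"),
    ("p2", "x", "Tukwila"),
]

ucsd_member = uw_to_ucsd_member(uw_member, uw_project)
print(sorted(ucsd_member))
```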

(55)

3.3 Storing and Indexing

• Unstructured network (Gnutella-like)

• Peer keeps its database

– No exchange of data between peers

• Indexing

– Only on schema level

– Each peer maintains a schema catalog of its neighbors

– Mappings stored in central catalog (hybrid system)

• Could be replaced by a DHT

– Replication of mappings to all relevant peers

(56)

3.3 Query Routing

• Query Flooding

– Peer translates query to schema of neighbor (if possible)

– Result tuples are converted on the way back

• Queries are answered by traversing semantic paths

[Figure labels: peers UCSD, UPenn, DBLP, CiteSeer; queries Q1, Q3, Q4; mappings M(UW, UCSD), M(UW, Stanford), M(UCSD, UPenn), M(Stanford, DBLP)]

(57)

3.3 Piazza - Discussion

• Supports a world of multiple schemas (more realistic)

• Very expressive mapping mechanism

• Not scalable

– Gnutella-like topology and flooding

• Piazza mapping technique could be applied to other network infrastructures

(58)


(59)

3.4 HiSbase

• Specialized for distributed spatial data

• Application: astronomy data

– Huge amounts of data (terabyte scale)

– Region-based queries

– Skewed data distribution

• Main ideas

– Distribute data on peers by region

– Use DHT for data access

– Use neighbor-preserving hash function (space-filling curve)

(60)

3.4 Load Distribution

• Use Quad-Tree structure to split data space into equally loaded regions
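
The splitting idea can be sketched as a recursive quad-tree partition that subdivides a square region until no region holds more than a given number of points. This is a simplified sketch, not the HiSbase algorithm: the function `quad_split`, the integer grid, the capacity threshold and the sample points are assumptions, and bounding the load per region only approximates equally loaded regions.

```python
def quad_split(points, x0, y0, size, capacity):
    """Split the square [x0, x0+size) x [y0, y0+size) until each region is light.

    Returns a list of ((x0, y0, size), points_in_region) pairs.
    size is assumed to be a power of two.
    """
    if len(points) <= capacity or size == 1:
        return [((x0, y0, size), points)]
    half = size // 2
    regions = []
    for qx, qy in ((x0, y0), (x0 + half, y0), (x0, y0 + half), (x0 + half, y0 + half)):
        inside = [(x, y) for (x, y) in points
                  if qx <= x < qx + half and qy <= y < qy + half]
        regions.extend(quad_split(inside, qx, qy, half, capacity))
    return regions


# Skewed sample data: most points cluster in the lower-left corner of an 8x8 space.
points = [(0, 0), (1, 1), (1, 0), (0, 1), (2, 1), (6, 6)]
regions = quad_split(points, 0, 0, 8, capacity=2)

for bounds, members in regions:
    print(bounds, members)
```

Dense areas end up covered by many small regions and sparse areas by a few large ones, which is how skewed data can be spread over peers more evenly.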

(61)

3.4 Data Hashing

• Use Z-Linearization for hashing coordinates
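
Z-linearization maps a two-dimensional coordinate to a single value by interleaving the bits of x and y, so that points close in space tend to get nearby keys. The sketch below assumes non-negative integer coordinates and puts the x bit in the lower position of each bit pair. Both choices, and the function name `z_value`, are conventions picked for illustration.

```python
def z_value(x, y, bits=16):
    """Interleave the lowest `bits` bits of x and y into one Z-order value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bit i of x goes to even position 2i
        z |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y goes to odd position 2i+1
    return z


# Visiting the cells of a 4x4 grid in Z-order.
cells = sorted(((x, y) for x in range(4) for y in range(4)), key=lambda c: z_value(*c))
print(cells[:4])  # the four cells of the first 2x2 quadrant come first
```

All cells of one quadrant receive a contiguous range of Z-values, which fits the quad-tree regions: each region corresponds to an interval of the linearized key space.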

(62)

3.4 Insertion into DHT

(63)

3.4 Query Processing

• Point query

– Simple DHT access

• Region query

– Route to arbitrary peer in range (e.g. using upper left region boundary)

– This peer acts as coordinator

– Forward query to the peers of neighboring regions

• Until the whole area is covered

– Collect results at coordinator

(64)

3.4 HiSbase - Discussion

• Very efficient for spatial queries

– But only spatial queries possible

• Not completely self-organizing

– Quad-Tree splitting needs central coordination

(65)

3. P2P Database Networks – Summary

• Challenges

– Multi-Dimensional Search Space

– Schema Heterogeneity

– Potentially large result sets

• Design Dimensions

– Network Properties (Data Placement, Topology and Routing)

– Data Access (Data Model, Query Language)

– Integration Mechanism (Mapping Representation/Creation/Usage)

• P2P Database Types

– Focus on high network scalability (e.g., PIER)

– Focus on high query expressivity (e.g., Edutella)

– Focus on information integration (e.g., Piazza)

– Focus on specific data structures (e.g., HiSbase)

(66)