(1)

Wolf-Tilo Balke, Sascha Tönnies

Institut für Informationssysteme

Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de

Peer-to-Peer Data Management

(2)

Overview

• Why Peer-to-Peer Databases?

– Federation

– Information integration

– Sensor networks

• P2P Databases

– Challenges

– Design Dimensions

• Existing P2P Database systems

– Edutella: focus on expressivity

– PIER: focus on scalability

– Piazza: focus on integration

(3)

1 Motivation

• Peer-to-peer data management might need some database-like functionality

– Complex queries over possibly large volumes of data

• Examples of applications include

– Federation of sources

– Information integration

– Sensor networks

– "New" internet

(4)

1.1 Federation of similar data providers

• Examples

– (Digital) Libraries

– Primary Scientific Data Providers (Gene Databases)

– News Providers

• All nodes offer the same kind of information

• Homogeneous network (fixed schema)

• Non-P2P solutions exist, but not open/scalable

(5)

1.2 Information Integration

• Examples

– Find German professors who have published at least three papers at the Conference on Very Large Databases

– Find an introductory database book in German, written by a German professor

– Find all recordings of Mozart's "Magic Flute" with conductors who have also conducted the Berliner Philharmoniker

• Very tedious to find with current search engines

• Needs database-like querying capabilities

• Heterogeneous network

– Information from several databases needs to be combined

(6)

1.3 Sensor Networks

• Examples

– Network Monitoring:

• network maps

• event detections

• ...

– Car Traffic Monitoring

• Huge number of nodes

• Small amount of data

• Homogeneous network

Screenshots from project PHI presentation, J. Hellerstein, Berkeley

(7)

Overview

1. Why Peer-to-Peer Databases?

1. Federation

2. Information integration

3. Sensor networks

2. P2P Databases

1. Challenges

2. Design Dimensions

3. Existing P2P Database systems

1. Edutella: focus on expressivity

2. PIER: focus on scalability

3. Piazza: focus on integration

4. HiSbase: focus on scalability for spatial data

(8)

2.1 Challenges of Schema-Based P2P Networks

• Multi-Dimensional Search Space

– DHTs only work for one dimension (one attribute)

• Schema Heterogeneity

– Sources use different database schemas for similar information

• Potentially large result sets

– SELECT * FROM Firewalls.BlockedPackets ...

– Range and Aggregate Queries

• And the usual P2P challenges...

– Trust

– Network Churn

(9)

2.2 Design Dimensions

• Network Properties

– Data Placement

– Topology and Routing

• Data Access

– Data Model

– Query Language

• Integration Mechanism

– Mapping Representation

– Mapping Creation

– Integration Method

(10)

2.2 Data Placement

• Placement according to ownership

– Data stays at information source

– Full control of data by owner (access policy, availability, etc.)

– More autonomy of single nodes

• Placement according to search strategy

– Data is distributed according to later access mechanism (e.g., DHT)

– No control over data access

– More freedom to optimize query routing

• Additional caching/replication possible

– Essential for load balancing

(11)

2.2 Topology and Routing (1)

• Unstructured Networks

– Flooding as routing algorithm

– Supports arbitrarily expressive queries

– Agnostic to schema heterogeneity

– Inefficient (filtered flooding can help)

• Short-cut networks

– Unstructured, but continuously optimize network connections

– Can develop into regular structures like Small-World networks

– Clustering & filtered flooding reduces query distribution traffic

– Fireworks routing

(12)

2.2 Topology and Routing (2)

• Super-peer networks

– Inherit advantages and disadvantages of unstructured networks

– Better efficiency and scaling (but still flooding)

– Good match to distributed databases (super-peers become mediators)

• DHT Networks

– Create a separate overlay for each attribute, or use multidimensional DHTs (e.g., Mercury)

– Limited query expressivity

– Suitable for homogeneous schema

(13)

2.2 Topology and Routing - Summary

• Local indexing

– No knowledge about other peers

– Doesn't scale

• Central indexing

– One node holds complete index

– Single point of control (and failure)

• Distributed indexing

– Distributed Hash Tables

– Filtered Flooding

– Short-cut networks

– Super-peer networks

(14)

2.2 Data Model

• Fixed set of attributes

– Allows for sophisticated topologies

– Inflexible

– Applicability: custom applications

• Relational model

– Usual database model

– Not designed for distribution

• XML

– Semi-structured data

• RDF

– Semantic Web exchange format

– Very suitable for distributed data

(15)

2.2 Query Language

• None

– Fixed set of parameterized queries

• Relational query language

– Always subset of SQL

• XML query language

– XPath or XQuery

• RDF Query Language

– SPARQL or its predecessors

– Logic language

(16)

2.2 Mapping Representation

• Declarative

– Translation between schema elements

– Distributed database approaches applicable

• Procedural

– Imperative description of how to translate/transform queries and data

• Mapping characteristics

– Unidirectional or bidirectional

– Simple (one-to-one) or complex mappings

• Mapping of objects

– State equality of objects in different sources

(17)

2.2 Mapping Creation

• Manual

– Users create mappings

– Network distributes mappings and uses them for translation

• Semi-automatic

– System proposes mappings, based on heuristics

• attribute name

• similar data

– User feedback used to validate created mappings

• Automatic

– E.g., probabilistic mapping

– Similar techniques as for semi-automatic mapping

(18)

2.2 Integration Mechanism

• Query Rewriting

– Query is translated to target schema

– Data is translated back to source schema

– Most common approach

• Data Rewriting

– Data is replicated to source schema

– Only feasible for small data sets

(19)

2.2 Existing Systems - Typology

• Focus on network scalability

– homogeneous schema

– low query expressivity

– DHT as underlying network structure

• Focus on expressivity

– super-peer or unstructured

– unlimited query complexity

• Focus on integration

– typically unstructured

– query routing driven by mappings

(20)

2.2 Existing Systems – Overview

List not complete

Focus         Name      Topology        Data Placement  Data Model     Query Language
Scalability   PIER      DHT (Bamboo)    Distributed     Relational     SQL subset
Scalability   RDFPeers  DHT (MAAN)      Distributed     RDF            -
Scalability   Mercury   DHT (Symphony)  Distributed     Tuples         -
Expressivity  SQPeer    Super-peer      Owner           RDF            RQL
Expressivity  PeerDB    Unstructured    Owner           Relational     SQL subset
Expressivity  Edutella  Super-peer      Owner           RDF            datalog (SQL)
Integration   Piazza    Unstructured    Owner           XML            XQuery subset
Integration   GridVine  DHT (P-Grid)    Distributed     RDF            -
Integration   DRAGO     Unstructured    Owner           Descr. Logics  OWL subset


(22)

3.1 Edutella: Introduction

• Initial goal: achieve interoperability between heterogeneous metadata-driven (e-learning) systems

• Provides metadata only, not the resources

– Resources are fetched via http

• Query Examples

– "Find software engineering course lecture notes for undergraduates in German language"

– "Find an introduction to Enterprise Java Beans for professionals"

– "Find a course in software requirements analysis from a Swedish university"

(23)

3.1 Query Service

• Provides standardized query/retrieval of RDF metadata stored in distributed RDF repositories

• Query Exchange Language

– Based on Datalog (allows expression of rules)

– RDF syntax

– For exchange only

• Adapters enable QEL (query exchange language) query processing on diverse backends

(24)

3.1 Query processing

• Parsers/Formatters convert between query languages

• Applications and backends are shielded from communication layer

• Query messages are exchanged in RDF/XML format

• Wrappers available for SQL, RDQL, RQL, and others

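[Figure: query processing between one consumer and several providers. On the consumer side, the application hands its query in an application-specific format to the Edutella Consumer Interface, whose query parser produces EQM. Queries cross the P2P network as QEL. On the provider side, the query formatter of the Edutella Provider Interface converts EQM into the repository-specific format of the back-end (repository).]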

(25)

3.1 Edutella Topology

• Super-Peers

• Content Providers

• Content Consumers

• Use filtered flooding in super-peer backbone

• HyperCuP topology for backbone

(26)

3.1 Cayley Graphs

• Graph representing a permutation group G, described by a set of generators

– Regular, vertex-symmetric, recursively decomposable

– Optimal routing and broadcast algorithms exist

[Figure: example Cayley graphs with edges labeled by generator (0, 1, 2), including hypercubes and a graph whose vertices are the permutations of 1234.]

(27)

3.1 Super-peer Topology: HyperCuP

• Super-peers are arranged as a hypercube

• Broadcast needs n-1 messages and log2(n) hops

• High connectivity, resilient against node failures

[Figure: hypercube of super-peers SP1 to SP8 with edges labeled by dimension (0, 1, 2), next to the minimal spanning tree used for broadcast.]
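
The message and hop counts above can be checked with a small simulation. This is only a sketch under assumptions that are not on the slide: nodes are numbered 0 to n-1, two nodes are neighbors along dimension d when their numbers differ in bit d, and a node forwards only along dimensions higher than the one the message arrived on. The function name `hypercube_broadcast` is hypothetical.

```python
def hypercube_broadcast(dimensions, origin=0):
    """Broadcast in a hypercube with 2**dimensions nodes.

    A node that received the message over an edge of dimension d forwards it
    only over edges of dimension greater than d, so no node is reached twice.
    Returns (number of messages sent, hop count per reached node).
    """
    hops = {origin: 0}
    messages = 0
    # Each queue entry: (node, dimension of the edge the message arrived on).
    queue = [(origin, -1)]
    while queue:
        node, arrived_on = queue.pop(0)
        for dim in range(arrived_on + 1, dimensions):
            neighbor = node ^ (1 << dim)  # flipping bit `dim` follows edge `dim`
            hops[neighbor] = hops[node] + 1
            messages += 1
            queue.append((neighbor, dim))
    return messages, hops


messages, hops = hypercube_broadcast(3)
print(messages, max(hops.values()))  # 8 nodes: 7 messages, at most 3 hops
```

With 8 super-peers the simulation sends 7 messages (n-1) and the farthest node is 3 hops away (log2(8)).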

(28)

3.1 Super-Peer-based Query Routing

• Database fragment summaries

• Index structure and maintenance

• Query Routing

(29)

3.1 Peer Fragment Summaries

Peer1.Doc

Identifier  Title                  Date  Format  Language
521354021   Csdoi sdofi sfi sfdsf  1948  Book    de
593574021   Deor aodfi sdfwe dls   1952  Book    de
534536021   Toid sdofij cvcdova    1937  Book    de
528943021   Csdo asofdi weor       1916  Book    de
529874521   Epodsf csmieo mo       1924  Book    de
526983221   Awer fzwe xhzpwf       1959  Book    de

Peer2.Doc

Identifier  Title                Date  Language  Coverage
1861978766  Eoite odsifj woifj   1993  en        Scotland
1394875966  Oewr svonwe          2005  en        Wales
1817305606  Psadoifh sdafns dsf  1999  en        York
1809239086  Vsd sdfokj sfew      2001  en        West Midlands
1345398705  Wdfj vspo sdfp dort  1989  en        London

Peer1 summary

Doc.Identifier
Doc.Title
Doc.Date [1916-1959]
Doc.Format [Book]
Doc.Language [de]

Peer2 summary

Doc.Identifier
Doc.Title
Doc.Date [1989-2005]
Doc.Language [en]
Doc.Coverage [UK]

(30)

3.1 Super-peer / Peer Indices

• Peers forward summary to super-peer

[Figure: super-peers SP1 to SP4 connected by edges of dimension 0 and 1, with peers P1 to P4 attached.]

Peer1 summary

Doc.Identifier
Doc.Title
Doc.Date [1916-1959]
Doc.Format [Book]
Doc.Language [de]

Peer2 summary

Doc.Identifier
Doc.Title
Doc.Date [1989-2005]
Doc.Language [en]
Doc.Coverage [UK]

Super-Peer1 SP/P index

Doc.Identifier             P1, P2
Doc.Title                  P1, P2
Doc.Date [1916-1959]       P1
Doc.Date [1989-2005]       P2
Doc.Format [Book]          P1
Doc.Language [de]          P1
Doc.Language [en]          P2
Doc.Coverage [UK]          P2

(31)

3.1 Super-Peer Fragment Summaries

Doc (union of the fragments known to Super-Peer1)

Identifier  Title                  Date  Format  Language  Coverage
521354021   Csdoi sdofi sfi sfdsf  1948  Book    de
593574021   Deor aodfi sdfwe dls   1952  Book    de
534536021   Toid sdofij cvcdova    1937  Book    de
528943021   Csdo asofdi weor       1916  Book    de
529874521   Epodsf csmieo mo       1924  Book    de
526983221   Awer fzwe xhzpwf       1959  Book    de
1861978766  Eoite odsifj woifj     1993          en        Scotland
1394875966  Oewr svonwe            2005          en        Wales
1817305606  Psadoifh sdafns dsf    1999          en        York
1809239086  Vsd sdfokj sfew        2001          en        West Midlands
1345398705  Wdfj vspo sdfp dort    1989          en        London

Super-Peer1 (SP1) summary

Doc.Identifier
Doc.Title
Doc.Date [1916-2005]
Doc.Format [Book]
Doc.Language [de, en]
Doc.Coverage [UK]
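
How a peer summary and the super-peer's SP/P index could be derived from the fragments is sketched below. It is a simplification under stated assumptions: summaries are kept as plain value sets per attribute (the slides use ranges for Doc.Date and a generalized value "UK" for Coverage, which this sketch does not reproduce), and the function names `summarize` and `build_sp_p_index` are hypothetical.

```python
from collections import defaultdict


def summarize(rows):
    """Collect, per attribute, the set of values that occur in a peer's rows."""
    summary = defaultdict(set)
    for row in rows:
        for attribute, value in row.items():
            summary[attribute].add(value)
    return dict(summary)


def build_sp_p_index(summaries):
    """Map each (attribute, value) pair to the peers whose summary contains it."""
    index = defaultdict(set)
    for peer, summary in summaries.items():
        for attribute, values in summary.items():
            for value in values:
                index[(attribute, value)].add(peer)
    return dict(index)


# Reduced versions of the Peer1 and Peer2 fragments (assumed sample rows).
peer_rows = {
    "P1": [{"Format": "Book", "Language": "de"}],
    "P2": [{"Language": "en", "Coverage": "UK"}],
}
summaries = {peer: summarize(rows) for peer, rows in peer_rows.items()}
sp_p_index = build_sp_p_index(summaries)
print(sorted(sp_p_index[("Language", "de")]))
```

The super-peer's own summary is then simply the union of the value sets of all attached peers, which is what it would forward into the backbone.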

(32)

3.1 Super-peer/Super-peer Indices

• Naive forwarding is not optimal

[Figure: super-peers SP1 to SP4 connected by edges of dimension 0 and 1; SP1 distributes its summary to the other super-peers.]

SP1 summary

Doc.Identifier
Doc.Title
Doc.Date [1916-2005]
Doc.Format [Book]
Doc.Language [de, en]
Doc.Coverage [UK]

Super-Peer2 SP/SP index

Doc.Language [de]  SP1
Doc.Language [en]  SP1

Super-Peer3 SP/SP index

Doc.Language [de]  SP1
Doc.Language [en]  SP1

Super-Peer4 SP/SP index

Doc.Language [de]  SP2, SP3
Doc.Language [en]  SP2, SP3

Super-Peer4 SP/SP index (second box on the slide)

Doc.Language [de]  SP2
Doc.Language [en]  SP2

(33)

3.1 Super-peer/Super-peer Indices

• Take edge dimension into account

– Forward SP/SP index entries only along lower edges

[Figure: super-peers SP1 to SP4 connected by edges of dimension 0 and 1.]

SP1 summary

Doc.Language [de, en]

Super-Peer2 SP/SP index

Doc.Language [de]  SP1 (0)
Doc.Language [en]  SP1 (0)

Super-Peer3 SP/SP index

Doc.Language [de]  SP1 (1)
Doc.Language [en]  SP1 (1)

Super-Peer4 SP/SP index

Doc.Language [de]  SP3 (0)
Doc.Language [en]  SP3 (0)

(34)

3.1 Query Routing

• Use SP/P and SP/SP indices as filters

• Example: SELECT * FROM Doc WHERE Language = 'de' AND …

[Figure: super-peers SP1 to SP4 connected by edges of dimension 0 and 1, with peers P1 to P4 attached.]

Super-Peer4 SP/SP index

Doc.Language [de]  SP3 (0)
Doc.Language [en]  SP3 (0)

Super-Peer3 SP/SP index

Doc.Language [de]  SP1 (1)
Doc.Language [en]  SP1 (1)

Super-Peer1 SP/P index

Doc.Language [de]  P1
Doc.Language [en]  P2
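
Using an index as a filter can be sketched as follows. The sketch assumes each index is a nested dictionary from attribute to value to the set of targets (peers for an SP/P index, super-peers for an SP/SP index) and that query conditions are simple equality tests; `matching_targets` is a hypothetical helper, not Edutella's API.

```python
def matching_targets(query, index):
    """Return the targets (peers or super-peers) that may hold matches.

    `query` maps attribute names to required values. Conditions on attributes
    that the index does not cover cannot be used for filtering and are skipped.
    """
    # Start from every target the index knows about, then narrow down.
    targets = set()
    for values in index.values():
        for holders in values.values():
            targets |= holders
    for attribute, value in query.items():
        if attribute in index:
            targets &= index[attribute].get(value, set())
    return targets


# SP/P index of Super-Peer1 and SP/SP index of Super-Peer4, as on the slide.
sp_p_index = {"Language": {"de": {"P1"}, "en": {"P2"}}, "Format": {"Book": {"P1"}}}
sp_sp_index = {"Language": {"de": {"SP3"}, "en": {"SP3"}}}

query = {"Language": "de"}
print(matching_targets(query, sp_p_index))   # peers to contact
print(matching_targets(query, sp_sp_index))  # super-peers to forward to
```

A query for a language that no summary mentions yields an empty set, so the super-peer neither contacts a peer nor forwards the query.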

(35)

3.1 Application: P2P Digital Library Network

• Large number of individual digital libraries (DLs)

• Autonomous institutions

• Users have to

– find relevant DLs

– search separately on every found DL

• Violates 4th law of Library Science

– "Save the time of the reader" (Ranganathan, 1931)

(36)

3.1 DL Search Engine Solution

• Search engine approach

– "Crawl" DLs

– Copy content

– Offer unified collection

• Issues

– Search engine controls content

– Proprietary interface (or just Web crawl)

– Difficult to preserve metadata

– Single point of failure

(37)

3.1 Open Archive Initiative Solution

• Standardize metadata "crawling" interface

– OAI-PMH (Protocol for Metadata Harvesting)

• Harvesters

– collect metadata from DLs

– offer search facilities

• Issues

– No single entry point

– Harvesters control content

– Points of failure

– Incentive for Harvester?

(38)

3.1 From OAI to P2P

• Create "peer wrapper" for existing DLs

[Figure: digital libraries join the super-peer backbone as content providers through their OAI-PMH interface.]

(39)

3.1 OAI-P2P – a Digital Library Network

• P2P approach:

– DLs form self-organized network

– User queries are distributed

• Advantages

– No dependency on service provider

– Each DL still controls its content

– No single point of failure

• 5th law of Library Science:

– "The library is a growing organism" (Ranganathan, 1931)

(40)

3.1 Edutella – Discussion

• Efficiently limits query distribution to relevant peers

• Very good scalability in terms of data size

– No data movement required

– Little index maintenance effort

• Flooding limits super-peer backbone scalability

– Will never scale to millions of peers

• Mainly query forwarding

– Initial extension to full query planning exists

• No load-balancing mechanisms


(42)

3.2 PIER

P2P Relational Database

• Foundation: any DHT

• Extended hash interface

– put(namespace, key, value)

– get(namespace, key)

– namespace/key combination is used as hash value (DHT Key)

• Extended network capabilities

– Exploit DHT structure for broadcast

[Figure: spanning tree over DHT nodes 0 to 15, used for broadcast.]

(43)

3.2 Application: Phi

Phi: Public Health for the Internet

– Monitor IP network state world-wide

– Collect statistics

• Network traffic

• Latency

• …

– Malware alerts

(44)

3.2 Storing and Indexing Tuples

Storing

– Every tuple needs a synthetic tuple key

– Choose combination of table name and tuple key as DHT key

– Insert complete tuple into DHT using this key

Indexing

– Additional attribute indexes are built by inserting attribute value/tuple key pairs into the DHT

– Choose combination of attribute name and attribute value as DHT key

– Insert tuple key as DHT value

(45)

3.2 Example

• Sample Database

– Doc(Id, Title, Date, Language)

– Author(DocId, PersonId)

– Person(Id, Name, Surname)

• Sample tuple: (456, 'Critique of Pure Reason', 1781, 'en')

• Storing

– put(Doc, 456, (456, 'Critique...', 1781, 'en'))

• Indexing on 'Title' and 'Date' attributes

– put(Doc.Title, 'Critique...', 456)

– put(Doc.Date, '1781', 456)
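
The storing and indexing steps can be tried out against a toy, single-process stand-in for a DHT. This is a sketch only: `ToyDHT`, the SHA-1 based node selection, and the node count are assumptions made for illustration and are not PIER's implementation; only the `put(namespace, key, value)` / `get(namespace, key)` interface is taken from the slides.

```python
import hashlib


class ToyDHT:
    """Single-process stand-in for a DHT with a namespace-aware hash interface."""

    def __init__(self, node_count=16):
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, namespace, key):
        # The namespace/key combination determines the responsible node.
        digest = hashlib.sha1(f"{namespace}/{key}".encode()).digest()
        return self.nodes[int.from_bytes(digest[:4], "big") % len(self.nodes)]

    def put(self, namespace, key, value):
        self._node_for(namespace, key).setdefault((namespace, key), []).append(value)

    def get(self, namespace, key):
        return self._node_for(namespace, key).get((namespace, key), [])


dht = ToyDHT()
doc = (456, "Critique of Pure Reason", 1781, "en")
dht.put("Doc", doc[0], doc)           # store the complete tuple under its tuple key
dht.put("Doc.Title", doc[1], doc[0])  # index entry: attribute value -> tuple key
dht.put("Doc.Date", doc[2], doc[0])
print(dht.get("Doc.Date", 1781))      # tuple keys of all documents dated 1781
```

Each `put` may land on a different node, which is why a single INSERT with several indexed attributes costs several DHT operations.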

(46)

3.2 PIER Query Plans

• DHT-Scan

– Use index to retrieve tuple key(s)

– Use key(s) to retrieve data tuple(s)

• Example

– SELECT Id, Title FROM Doc WHERE Date = '1781' AND Language = 'en'

• Each peer can create a query plan

• One DHT lookup per result tuple

• Filter has to be done on query originator

• Query plan: dht-scan(Doc, Date='1781') → filter(Language='en') → project({Id, Title})
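
The scan, filter, project plan can be sketched in a few lines. Assumptions: the DHT is modeled as a plain dictionary keyed by (namespace, key), rows are dictionaries using the column names of the sample schema, and the second document (Id 457) is invented sample data so that the filter has something to reject. The function names are hypothetical.

```python
def dht_scan(dht, table, attribute, value):
    """Index lookup for the tuple keys, then one lookup per key for the tuple."""
    for tuple_key in dht.get((f"{table}.{attribute}", value), []):
        yield from dht.get((table, tuple_key), [])


def run_plan(dht):
    # SELECT Id, Title FROM Doc WHERE Date = 1781 AND Language = 'en'
    scanned = dht_scan(dht, "Doc", "Date", 1781)
    # Only Date is indexed here, so the Language condition is applied
    # after the tuples have been fetched, on the query originator.
    filtered = (row for row in scanned if row["Language"] == "en")
    return [(row["Id"], row["Title"]) for row in filtered]


dht = {
    ("Doc", 456): [{"Id": 456, "Title": "Critique of Pure Reason",
                    "Date": 1781, "Language": "en"}],
    ("Doc", 457): [{"Id": 457, "Title": "Kritik der reinen Vernunft",
                    "Date": 1781, "Language": "de"}],
    ("Doc.Date", 1781): [456, 457],
}
print(run_plan(dht))
```

Both documents are fetched because both match the indexed Date condition; the German edition is dropped only by the filter step.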

(47)

3.2 Aggregate and Range Queries

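• Example

– SELECT COUNT(Id) FROM Doc WHERE Date > '1780' AND Date < '1790'

• Use spanning tree for broadcast

• Aggregate on return

[Figure: spanning tree in which each node reports a partial count to its parent; the partial counts add up to 16 at the root.]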
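
Aggregating on the way back up the spanning tree can be sketched recursively. The tree shape, the per-node data, and the name `count_in_subtree` are assumptions for illustration; each node is reduced to a list of Date values.

```python
def count_in_subtree(node, children, local_rows, predicate):
    """COUNT aggregated bottom-up over a spanning tree.

    Every node counts its own matching rows and adds the partial counts
    returned by its children, so only one number travels over each edge.
    """
    own = sum(1 for row in local_rows.get(node, []) if predicate(row))
    return own + sum(
        count_in_subtree(child, children, local_rows, predicate)
        for child in children.get(node, [])
    )


# Hypothetical spanning tree rooted at node 0 and per-node Date values.
children = {0: [1, 2], 1: [3]}
local_rows = {0: [1781], 1: [1785, 1800], 2: [1788], 3: [1779, 1783]}

total = count_in_subtree(0, children, local_rows, lambda date: 1780 < date < 1790)
print(total)  # 4
```

The same pattern works for SUM, MIN, and MAX, because partial results of these aggregates can be combined at each inner node.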

(48)

3.2 Join Queries

• Example

– Assume a Person tuple (789, 'Kant', 'Immanuel')

– SELECT Id, Title FROM Doc, Author WHERE Author.DocId = Doc.Id AND Author.PersonId = 789

• Approach: Hierarchical Joins

– Use spanning tree for broadcast

– Do local select on peer table fragments

– Do local join on each peer (improves load balancing)

– Forward table fragments and partial results to parent

(49)

3.2 Hierarchical Joins

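[Figure: hierarchical join over the spanning tree. Peers hold Doc fragments D1 to D3 and Author fragments A1 to A3; fragments and partial join results (T11 to T33) are forwarded toward the root.]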

(50)

3.2 PIER - Discussion

• Real query planning

• Very efficient access to individual tuples and small result sets

• Very good scalability in terms of network size

• Degrades to broadcast for many types of queries

– Aggregate queries

– Joins

• INSERT operation expensive (see P2P Information Retrieval)

• No load-balancing mechanisms


(52)

3.3 Piazza

• Tackles the problem of "reconciling different models of the world" (A. Halevy)

• Goal: provide a uniform interface to a set of autonomous data sources

• New abstraction layer over multiple sources

• Introduce mappings between "world views"

– Mapping rules are specified manually by experts

(53)

3.3 Example – Publication Databases

[Figure: schemas of the publication databases at the participating peers (e.g., UCSD).]

(54)

3.3 Mapping Rules

• Datalog is used to specify mapping rules

Mapping from UW to UCSD

UCSD:Member(projName, member) :- UW:Member(_, pid, member, _), UW:Project(pid, _, projName)

Mapping from UPenn to UCSD

UCSD:Member(projName, member) :- UPenn:Student(sid, member, _), UPenn:ProjMember(pid, sid), UPenn:Project(pid, projName, _)

UCSD:Member(projName, member) :- UPenn:Faculty(sid, member, _), UPenn:ProjMember(pid, sid), UPenn:Project(pid, projName, _)
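
To make the semantics of such a rule concrete, the UW-to-UCSD mapping can be read as a join of UW's Member and Project relations on pid. The sketch below evaluates it over in-memory tuples. The attribute positions follow the rule as reconstructed from the slide (Member with four columns, Project with three), and the sample rows and the function name `ucsd_member_from_uw` are invented for illustration.

```python
def ucsd_member_from_uw(uw_member, uw_project):
    """Evaluate the UW-to-UCSD rule: join UW:Member and UW:Project on pid.

    uw_member rows:  (anything, pid, member, anything)
    uw_project rows: (pid, anything, projName)
    Result rows:     (projName, member), i.e. the UCSD:Member view.
    """
    return {
        (proj_name, member)
        for _, member_pid, member, _ in uw_member
        for project_pid, _, proj_name in uw_project
        if member_pid == project_pid
    }


# Hypothetical UW data.
uw_member = [(1, 10, "Alice", "PhD"), (2, 11, "Bob", "MSc")]
uw_project = [(10, "2003", "Piazza")]

print(ucsd_member_from_uw(uw_member, uw_project))
```

Bob does not appear in the result because project 11 has no entry in UW:Project, which mirrors how a Datalog rule only produces tuples for which every body atom is satisfied.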

(55)

3.3 Storing and Indexing

• Unstructured network (Gnutella-like)

• Peer keeps its database

– No exchange of data between peers

• Indexing

– Only on schema level

– Each peer maintains schema catalog of its neighbors

– Mappings stored in central catalog (hybrid system); could be replaced by DHT

– Replication of mappings to all relevant peers

(56)

3.3 Query Routing

• Query Flooding

– Peer translates query to schema of neighbor (if possible)

– Result tuples are converted on way back

• Queries answered by traversing semantic paths

[Figure: peers UCSD, UPenn, DBLP, and CiteSeer with queries Q1, Q3, and Q4 and the mappings M(UW, UCSD), M(UW, Stanford), M(UCSD, UPenn), and M(Stanford, DBLP).]

(57)

3.3 Piazza - Discussion

• Supports multiple schema world (more realistic)

• Very expressive mapping mechanism

• Not scalable

– Gnutella-like topology and flooding

• Piazza mapping technique could be applied to other network infrastructures


(59)

3.4 HiSbase

• Specialized on distributed spatial data

• Application: astronomy data

– Huge amounts of data (terabyte scale)

– Region-based queries

– Skewed data distribution

• Main ideas

– Distribute data on peers by region

– Use DHT for data access

– Use neighbor-preserving hash function (space-filling curve)

(60)

3.4 Load Distribution

• Use Quad-Tree structure to split data space into equally loaded regions
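
A minimal sketch of such a split, assuming two-dimensional points, half-open square regions, and a fixed capacity per region as the load criterion (the slide does not state how HiSbase measures load). The function name `split` and the `max_depth` safeguard are assumptions of this sketch.

```python
def split(points, x0, y0, x1, y1, capacity, max_depth=16):
    """Split [x0, x1) x [y0, y1) into quadrants until no leaf exceeds `capacity`.

    Returns a list of leaf regions (x0, y0, x1, y1, points). `max_depth`
    stops the recursion for points that are too close to be separated.
    """
    if len(points) <= capacity or max_depth == 0:
        return [(x0, y0, x1, y1, points)]
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    quadrants = [(x0, y0, xm, ym), (xm, y0, x1, ym),
                 (x0, ym, xm, y1), (xm, ym, x1, y1)]
    leaves = []
    for qx0, qy0, qx1, qy1 in quadrants:
        inside = [(x, y) for x, y in points if qx0 <= x < qx1 and qy0 <= y < qy1]
        leaves.extend(split(inside, qx0, qy0, qx1, qy1, capacity, max_depth - 1))
    return leaves


# Skewed sample data: three points cluster in one corner.
points = [(1, 1), (2, 2), (3, 1), (9, 9), (12, 3)]
regions = split(points, 0, 0, 16, 16, capacity=2)
print(len(regions), max(len(region[4]) for region in regions))
```

Dense areas end up covered by many small regions and sparse areas by few large ones, which is how the split evens out the load under a skewed distribution.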

(61)

3.4 Data Hashing

• Use Z-Linearization for hashing coordinates
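
Z-linearization (the Z-order or Morton curve) maps a two-dimensional coordinate to one number by interleaving the bits of both coordinates, so that nearby cells tend to receive nearby keys. The sketch assumes non-negative integer coordinates, places the bits of x in the even positions, and uses a hypothetical function name `z_value`.

```python
def z_value(x, y, bits=16):
    """Interleave the lowest `bits` bits of x and y into one Z-order key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bit i of x goes to position 2i
        z |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y goes to position 2i + 1
    return z


# Visiting the cells of a 2x2 grid in key order traces the letter "Z".
cells = sorted(((x, y) for x in range(2) for y in range(2)), key=lambda c: z_value(*c))
print(cells)          # [(0, 0), (1, 0), (0, 1), (1, 1)]
print(z_value(3, 5))  # 39
```

The resulting key can then serve as the one-dimensional DHT key for the region that contains the coordinate.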

(62)

3.4 Insertion into DHT

(63)

3.4 Query Processing

Point query

– Simple DHT access

Region query

– Route to arbitrary peer in range (e.g. using upper left region boundary)

– This peer acts as coordinator

– Forward query to neighboring peer regions

• Until whole area is covered

– Collect results at coordinator

(64)

3.4 HiSbase - Discussion

• Very efficient for spatial queries

– But only spatial queries possible

• Not completely self-organizing

– Quad-Tree splitting needs central coordination

(65)

3. P2P Database Networks – Summary

Challenges

– Multi-Dimensional Search Space

– Schema Heterogeneity

– Potentially large result sets

Design Dimensions

– Network Properties (Data Placement, Topology and Routing)

– Data Access (Data Model, Query Language)

– Integration Mechanism (Mapping Representation/Creation/Usage)

P2P Database Types

– Focus on high network scalability (e.g., PIER)

– Focus on high query expressivity (e.g., Edutella)

– Focus on information integration (e.g., Piazza)

– Focus on specific data structures (e.g., HiSbase)

(66)

3. Conclusion

• P2P databases already work

– although immature compared to traditional database technology

• One size does not fit all

– Choose P2P database approach according to application requirements

• Open problems

– Load Balancing (Replication/Caching)

– How to combine DHT and filtered flooding advantages

– Reliability (probabilistic guarantees)

– ...
