Peer-to-Peer Data Management

(1) Peer-to-Peer Data Management
Wolf-Tilo Balke, Sascha Tönnies
Institut für Informationssysteme, Technische Universität Braunschweig
http://www.ifis.cs.tu-bs.de

(2) 14. Overview
1. Introduction
2. Content Searching in Peer-to-Peer Applications
   1. Problems in Peer-to-Peer Information Retrieval
   2. Related Work in Distributed Information Retrieval
3. Index Structures for Query Routing
   1. Distributed Hash Tables for Information Retrieval
   2. Routing Indexes for Information Retrieval
   3. Locality-Based Routing Indexes
4. Supporting Effective Information Retrieval
   1. Providing Collection-Wide Information
   2. Estimating the Document Overlap
   3. Prestructuring Collections with Taxonomies
5. Summary and Conclusion
VDMS und P2P – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig.

(3) 14.1 What is IR?
• Information retrieval (IR) is the science of searching for documents, for information within documents, and for metadata about documents
  – A user enters a query, i.e. an information need, into the system
  – Several objects may match the query with different degrees of relevance

(4) 14.1 Representing Text
• How do we represent the complexities of language?
  – Computers don't “understand” documents or queries
• Simple, yet effective approach: bag of words
  – Treat all the words in a document as index terms for that document
  – Assign a “weight” to each term based on its “importance”
  – Disregard order, structure, meaning, etc. of the words

(5) 14.1 Representing Text
Sample document:
McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.
NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. …
Bag of words:
  16 × said
  14 × McDonalds
  12 × fat
  11 × fries
  8 × new
  6 × company, french, nutrition
  5 × food, oil, percent, reduce, taste, Tuesday
  …
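The bag-of-words idea from the slide can be sketched in a few lines of Python. This is only an illustrative sketch: the tokenizer, the tiny stopword set `STOPWORDS`, and the function name `bag_of_words` are assumptions made for the example and are not part of the lecture material.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use longer lists).
STOPWORDS = {"a", "the", "of", "to", "in", "is", "its", "for", "and", "with"}

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, drop stopwords, count terms.

    Word order, structure, and meaning are deliberately discarded.
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

sample = ("McDonald's slims down spuds. Fast-food chain to reduce certain "
          "types of fat in its french fries with new cooking oil.")
bag = bag_of_words(sample)
```

Calling `bag.most_common(5)` on a longer text yields a frequency list of the kind shown on the slide.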

(6) 14.1 Retrieval
• Retrieving relevant information is hard!
  – Evolving, ambiguous user needs, context, etc.
  – Complexities of language
• To operationalize information retrieval, we must vastly simplify the picture
  – Information retrieval is all (and only) about matching words in documents with words in queries
  – Obviously, not true…
  – But it works pretty well!

(7) 14.1 Representing Documents as Vectors
Document 1: The quick brown fox jumped over the lazy dog's back.
Document 2: Now is the time for all good men to come to the aid of their party.
Stopword list: for, is, of, the, to
Term      Document 1  Document 2
aid           0           1
all           0           1
back          1           0
brown         1           0
come          0           1
dog           1           0
fox           1           0
good          0           1
jump          1           0
lazy          1           0
men           0           1
now           0           1
over          1           0
party         0           1
quick         1           0
their         0           1
time          0           1

(8) 14.1 Representing Text
How to compare documents and queries?
[Figure: text-processing pipeline from a document (text + structure) to index terms. Recoverable labels: structure recognition, structure, full text, accents and spacing, stopwords, noun groups, stemming, automatic or manual indexing, index terms]

(9) 14.1 Boolean Retrieval
• Weights assigned to terms are either “0” or “1”
  – “0” represents “absence”: term isn't in the document
  – “1” represents “presence”: term is in the document
• Build queries by combining terms with Boolean operators
  – AND, OR, NOT
• The system returns all documents that satisfy the query

(10) 14.1 Boolean View of a Document-Set (=Collection)
Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
aid       0      0      0      1      0      0      0      1
all       0      1      0      1      0      1      0      0
back      1      0      1      0      0      0      1      0
brown     1      0      1      0      1      0      1      0
come      0      1      0      1      0      1      0      1
dog       0      0      1      0      1      0      0      0
fox       0      0      1      0      1      0      1      0
good      0      1      0      1      0      1      0      1
jump      0      0      1      0      0      0      0      0
lazy      1      0      1      0      1      0      1      0
men       0      1      0      1      0      0      0      1
now       0      1      0      0      0      1      0      1
over      1      0      1      0      1      0      1      1
party     0      0      0      0      0      1      0      1
quick     1      0      1      0      0      0      0      0
their     1      0      0      0      1      0      1      0
time      0      1      0      1      0      1      0      0
• Each column represents the view of a particular document: what terms are contained in this document?
• Each row represents the view of a particular term: what documents contain this term?
• To execute a query, pick out the rows corresponding to the query terms and then apply the logic table of the corresponding Boolean operator

(11) 14.1 Sample Queries
Term              Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
dog                 0      0      1      0      1      0      0      0
fox                 0      0      1      0      1      0      1      0
dog ∧ fox           0      0      1      0      1      0      0      0     dog AND fox → Doc 3, Doc 5
dog ∨ fox           0      0      1      0      1      0      1      0     dog OR fox → Doc 3, Doc 5, Doc 7
dog ∧ ¬fox          0      0      0      0      0      0      0      0     dog NOT fox → empty
fox ∧ ¬dog          0      0      0      0      0      0      1      0     fox NOT dog → Doc 7

Term              Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
good                0      1      0      1      0      1      0      1
party               0      0      0      0      0      1      0      1
g ∧ p               0      0      0      0      0      1      0      1     good AND party → Doc 6, Doc 8
over                1      0      1      0      1      0      1      1
g ∧ p ∧ ¬o          0      0      0      0      0      1      0      0     good AND party NOT over → Doc 6
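The sample queries can be reproduced by applying bitwise logic to the term rows. The following is a minimal sketch; the row values are taken from the slide, while the helper names `and_`, `or_`, `not_`, and `hits` are assumptions made for the example.

```python
DOCS = range(1, 9)  # Doc 1 .. Doc 8

# Term rows of the term-document matrix: one 0/1 entry per document.
rows = {
    "dog":   [0, 0, 1, 0, 1, 0, 0, 0],
    "fox":   [0, 0, 1, 0, 1, 0, 1, 0],
    "good":  [0, 1, 0, 1, 0, 1, 0, 1],
    "party": [0, 0, 0, 0, 0, 1, 0, 1],
    "over":  [1, 0, 1, 0, 1, 0, 1, 1],
}

def and_(a, b):
    """Element-wise Boolean AND of two term rows."""
    return [x & y for x, y in zip(a, b)]

def or_(a, b):
    """Element-wise Boolean OR of two term rows."""
    return [x | y for x, y in zip(a, b)]

def not_(a):
    """Element-wise Boolean NOT of a term row."""
    return [1 - x for x in a]

def hits(row):
    """Document numbers whose entry in the row is 1."""
    return [doc for doc, bit in zip(DOCS, row) if bit]

# good AND party NOT over
result = hits(and_(and_(rows["good"], rows["party"]), not_(rows["over"])))
```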

(12) 14.1 The Perfect Query Paradox
• Every information need has a perfect set of documents
  – If not, there would be no sense doing retrieval
• Every document set has a perfect query
  – AND every word in a document to get a query for it
  – Repeat for each document in the set
  – OR every document query to get the set query
• But can users realistically be expected to formulate this perfect query?
  – Boolean query formulation is hard!

(13) 14.1 Why Boolean Retrieval Fails
• Natural language is way more complex
• AND “discovers” nonexistent relationships
  – Terms in different sentences, paragraphs, …
• Guessing terminology for OR is hard
  – good, nice, excellent, outstanding, awesome, …
• Guessing terms to exclude is even harder!
  – Democratic party, party to a lawsuit, …

(14) 14.1 Strengths and Weaknesses
• Strengths
  – Precise, if you have a clear idea of what you're looking for
  – Efficient for the computer
• Weaknesses
  – Users must learn Boolean logic
  – Boolean logic is insufficient to capture the richness of language
  – No control over the size of the result set: either too many documents or none
  – All documents in the result set are considered “equally good”
  – What about partial matches? Documents that “don't quite match” the query may also be useful

(15) 14.1 Ranked Retrieval
• Order documents by how likely they are to be relevant to the information need
  – Present hits one screen at a time
  – At any point, users can continue browsing through the ranked list or reformulate the query
• Attempts to retrieve relevant documents directly, not merely provide tools for doing so

(16) 14.1 Why Ranked Retrieval?
• Arranging documents by relevance is
  – Closer to how humans think: some documents are “better” than others
  – Closer to user behavior: users can decide when to stop reading
• Best (partial) match: documents need not have all query terms
  – Although documents with more query terms should be “better”

(17) 14.1 Similarity-based Retrieval?
• Let's replace relevance with “similarity”
  – Rank documents by their similarity with the query
• Treat the query as if it were a document
  – Create a query bag-of-words
• Find its similarity to each document
• Rank order the documents by similarity
• Surprisingly, this works pretty well!

(18) 14.1 Vector Space Model
[Figure: documents d1 to d5 drawn as vectors in a space spanned by the term axes t1, t2, t3, with angles θ and φ between vectors]
• Postulate: documents that are “close together” in vector space “talk about” the same things
• Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)
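One common way to measure “closeness” in the vector space model is the cosine of the angle between two vectors. The slide does not name a specific measure, so the cosine used here is an assumption, as are the function names `cosine` and `rank`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equally long term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Return document ids ordered by decreasing similarity to the query.

    docs maps a document id to its term-weight vector.
    """
    return sorted(docs, key=lambda doc_id: cosine(query, docs[doc_id]), reverse=True)

# Hypothetical weights over three terms (t1, t2, t3).
docs = {"d1": [1.0, 0.0, 0.0], "d2": [0.0, 1.0, 1.0], "d3": [1.0, 1.0, 0.0]}
ranking = rank([1.0, 0.2, 0.0], docs)
```

Because the query is treated like a document, the same vector representation serves both sides of the comparison.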

(19) 14.1 How to Weight Terms?
• Idea: Hans Peter Luhn 1958, IBM
• Here's the intuition:
  – Terms that appear often in a document should get high weights: the more often a document contains the term “dog”, the more likely it is that the document is “about” dogs
  – Terms that appear in many documents should get low weights: words like “the”, “a”, “of” appear in (nearly) all documents
• How do we capture this mathematically?
  – Term frequency
  – Inverse document frequency

(20) 14.1 TFxIDF
• TFxIDF [Gerard Salton, 1961]
• Term Frequency (TF)
  – How often a term appears in a document
  – Can be calculated locally
• Document Frequency (DF)
  – Number of documents that contain a specific term
  – Needs global (system-wide) knowledge
• Inverse Document Frequency (IDF)
  – Discriminator for the importance of a term, based on the number of documents in the whole collection that contain it
  – Needs global (system-wide) knowledge
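A minimal sketch of TFxIDF weighting follows. The slide does not fix a formula, so the variant used here (raw term count times log(N / DF)) is an assumption, one of several in common use; the function name `tf_idf` is likewise made up for the example. Note how DF requires a pass over the whole collection, while TF is computed per document.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TFxIDF weights for a list of tokenized documents.

    Returns one dict per document mapping term -> tf * log(N / df).
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # global knowledge: documents containing the term
    weights = []
    for doc in docs:
        tf = Counter(doc)            # local knowledge: occurrences in this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

collection = [["dog", "dog", "the"], ["fox", "the"], ["the"]]
w = tf_idf(collection)
```

A term such as "the" that occurs in every document gets weight 0 under this variant, matching the intuition that such terms do not discriminate.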

(21) 14.1 Working on Indices
[Term-document matrix for the terms aid to time and Doc 1 to Doc 8, identical to the matrix on slide 10]
• The term-document matrix again has “bag of words” information about the collection

(22) 14.1 Small yet Fast?
• Can we make this data structure smaller, keeping in mind the need for fast retrieval?
• Observations:
  – The nature of the search problem requires us to quickly find which documents contain a term
  – The term-document matrix is very sparse
  – Some terms are more useful than others

(23) 14.1 Posting Lists
Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8    Postings
aid       0      0      0      1      0      0      0      1      4, 8
all       0      1      0      1      0      1      0      0      2, 4, 6
back      1      0      1      0      0      0      1      0      1, 3, 7
brown     1      0      1      0      1      0      1      0      1, 3, 5, 7
come      0      1      0      1      0      1      0      1      2, 4, 6, 8
dog       0      0      1      0      1      0      0      0      3, 5
fox       0      0      1      0      1      0      1      0      3, 5, 7
good      0      1      0      1      0      1      0      1      2, 4, 6, 8
jump      0      0      1      0      0      0      0      0      3
lazy      1      0      1      0      1      0      1      0      1, 3, 5, 7
men       0      1      0      1      0      0      0      1      2, 4, 8
now       0      1      0      0      0      1      0      1      2, 6, 8
over      1      0      1      0      1      0      1      1      1, 3, 5, 7, 8
party     0      0      0      0      0      1      0      1      6, 8
quick     1      0      1      0      0      0      0      0      1, 3
their     1      0      0      0      1      0      1      0      1, 5, 7
time      0      1      0      1      0      1      0      0      2, 4, 6

(24) 14.1 Inverted Document Index
Term    Postings
aid     4, 8
all     2, 4, 6
back    1, 3, 7
brown   1, 3, 5, 7
come    2, 4, 6, 8
dog     3, 5
fox     3, 5, 7
good    2, 4, 6, 8
jump    3
lazy    1, 3, 5, 7
men     2, 4, 8
now     2, 6, 8
over    1, 3, 5, 7, 8
party   6, 8
quick   1, 3
their   1, 5, 7
time    2, 4, 6
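An inverted index of this kind can be built and queried as sketched below. The function names `build_index` and `intersect` are assumptions made for the example; the merge-style intersection relies on posting lists being sorted by document number, as they are on the slide.

```python
def build_index(docs):
    """Build term -> sorted posting list from a dict doc_id -> iterable of terms."""
    index = {}
    for doc_id in sorted(docs):
        for term in set(docs[doc_id]):
            index.setdefault(term, []).append(doc_id)
    return index

def intersect(p1, p2):
    """Intersect two sorted posting lists in a single linear pass (Boolean AND)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# Posting lists for "good" and "party" as given on the slide.
good, party = [2, 4, 6, 8], [6, 8]
both = intersect(good, party)
```

Only the documents that actually contain a term are stored, which is what makes the structure much smaller than the sparse matrix.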

(25) 14.1 What goes in the Postings?
• Boolean retrieval
  – Just the document number
• Ranked retrieval
  – Document number and term weight (tf.idf, ...)
• Proximity operators
  – Word offsets for each occurrence of the term
• …

(27) 14.2 Information Retrieval in P2P Systems
• Information retrieval deals with complex documents
  – Metadata can only capture some aspects of a document, but cannot anticipate all semantic searches
    • E.g. sports-related newspaper article, but no names, locations, etc.
  – Support for full-text searches is needed
• Find the best-matching document from the best-connected peer
  – Unlike in file sharing, the emphasis is on document quality
  – If multiple sources offer documents of similar quality, choose the best peer according to connection, etc.

(28) 14.2 Challenges in P2P IR
• Efficient query evaluation scheme
  – A central inverted index of documents is expensive to maintain
  – How to disseminate a peer's query?
    • Simple flooding of all queries is not scalable if the 'best' documents have to be found (not just some match)
• Dealing with network churn
  – A peer can always alter the set of documents offered, or significantly change individual documents
  – Peers may join and leave the network, i.e. whole document collections may disappear or be added
• Integration of collection-wide information
  – Peers cannot calculate IR-style scores from local knowledge alone, but need some knowledge of the (virtual) merged collection
  – Constant dissemination of collection-wide information needs a lot of bandwidth

(29) 14.2 Example: Problem of Collection-wide Information
• Example: different news collections, query on keyword 'basketball'
  – General news collection, e.g.
    • Many articles, only a few about basketball, therefore IDF high
    • Keyword discriminates well between articles
  – NBA news collection
    • Few articles, almost all about basketball, therefore IDF low
    • Keyword hardly discriminates between articles
  – Merged collection: IDF medium
• But how do independent collections (peers) exchange their information?

(30) 14.2 Example: Problem of Collection-wide Information
• Querying peer issues the query: A and B
• Peer 1 holds three documents: …A…, …A…, …B…
  – Local scoring: A has TF = 1, IDF = 3/2; B has TF = 1, IDF = 3/1
  – Top object: …B…
• Peer 2 holds three documents: …A…, …B…, …B…
  – Local scoring: A has TF = 1, IDF = 3/1; B has TF = 1, IDF = 3/2
  – Top object: …A…
• Global scoring: all objects are identical, TF = 1, IDF = 6/3
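The numbers in this example can be recomputed from per-peer statistics. The sketch below uses the plain ratio N / DF for the IDF, as the slide does; the names `idf_table` and `aggregate` and the `(document count, DF table)` representation of a peer's statistics are assumptions made for the example.

```python
from collections import Counter
from fractions import Fraction

def idf_table(doc_count, df):
    """IDF as the plain ratio N / DF, matching the values on the slide."""
    return {term: Fraction(doc_count, count) for term, count in df.items()}

def aggregate(stats):
    """Merge per-peer (doc_count, df) statistics into collection-wide statistics."""
    total_docs = sum(doc_count for doc_count, _ in stats)
    merged_df = Counter()
    for _, df in stats:
        merged_df.update(df)
    return total_docs, dict(merged_df)

peer1 = (3, {"A": 2, "B": 1})   # documents: A, A, B
peer2 = (3, {"A": 1, "B": 2})   # documents: A, B, B

local1 = idf_table(*peer1)
local2 = idf_table(*peer2)
global_idf = idf_table(*aggregate([peer1, peer2]))
```

With local statistics each peer prefers a different document, while the aggregated statistics give both terms the same IDF of 6/3 = 2, so all documents score identically.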

(31) 14.2 Distributed IR
• Distributed information retrieval techniques grew increasingly important for searching Web sources
  – Abstracts of information sources
    • To support distributed retrieval, sources have to register abstracts or keyword sets
    • Abstracts can either be kept in a central repository or distributed by gossiping algorithms, e.g. PlanetP [Cuenca-Acuna et al., '03]
  – Collection selection
    • Without a central index, a sophisticated way of choosing the most promising collections for querying is needed

(32) 14.2 Distributed IR
• Such abstracts can be compactly represented by Bloom filters, i.e. bit vectors that allow membership queries
  – Each term is hashed with n different functions and the position in the bit vector for each hash value is set to 1
  – Allows for false positives, but no false negatives
  – In Counting Bloom Filters objects can also be removed
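A Bloom filter as described above can be sketched as follows. The class name, the default sizes, and the way the n hash functions are derived (salting SHA-256 with the function number) are assumptions made for the example, not a prescribed construction.

```python
import hashlib

class BloomFilter:
    """Bit vector supporting insertion and approximate membership queries."""

    def __init__(self, size=256, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _positions(self, term):
        # Derive num_hashes positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, term):
        for pos in self._positions(term):
            self.bits[pos] = 1

    def __contains__(self, term):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(term))

abstract = BloomFilter()
for term in ["basketball", "nba", "playoffs"]:
    abstract.add(term)
```

A peer can publish `abstract.bits` instead of its full term list. A plain bit vector cannot support deletion, which is why counting variants keep a counter per position instead of a single bit.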

(33) 14.2 Distributed IR
• Benefit estimators for collection selection use aggregated statistics about individual collections, e.g. the CORI measure [Callan et al., '95]
• CORI calculates a collection score s_i for collection i regarding query q from the following quantities: n is the number of collections, cdf the collection document frequency, cdf_max the maximum cdf, and cf_t the collection frequency of term t
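The formulas themselves did not survive in this text. The sketch below follows the form of CORI as it is commonly stated in the P2P search literature, and should be read as an assumption rather than as the slide's exact notation: s_i is the average over the query terms of alpha + (1 - alpha) * T * I, with T = beta + (1 - beta) * log(cdf + 0.5) / log(cdf_max + 1) and I = log((n + 0.5) / cf_t) / log(n + 1). The defaults alpha = beta = 0.4 and the function name `cori_score` are also assumptions.

```python
import math

def cori_score(query_terms, cdf, cdf_max, cf, n, alpha=0.4, beta=0.4):
    """Score one collection for a query (assumed CORI variant, see lead-in).

    cdf:     term -> document frequency of the term in this collection
    cdf_max: maximum document frequency of any term in this collection
    cf:      term -> number of collections containing the term (must be >= 1)
    n:       number of collections
    """
    total = 0.0
    for t in query_terms:
        t_part = beta + (1 - beta) * math.log(cdf.get(t, 0) + 0.5) / math.log(cdf_max + 1.0)
        i_part = math.log((n + 0.5) / cf[t]) / math.log(n + 1.0)
        total += alpha + (1 - alpha) * t_part * i_part
    return total / len(query_terms)

# Hypothetical statistics: "basketball" occurs in 2 of 4 collections.
strong = cori_score(["basketball"], {"basketball": 10}, 10, {"basketball": 2}, 4)
weak = cori_score(["basketball"], {"basketball": 1}, 10, {"basketball": 2}, 4)
```

Collections are then queried in decreasing order of their score.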


(35) 14.3 Index Structures for Query Routing
• Traditional index structures cannot be readily employed in P2P systems
  – High degree of distribution
  – High degree of volatility (churn)
  – High degree of index maintenance
• Distributed paradigms needed to route queries to appropriate peers
  – Simple flooding method does not scale
  – Distributed hash table lookup
  – Using indexed routing information
  – Using shortcut overlays

(36) 14.3 Distributed Hash Tables for IR
• Distributed hash tables
  – Route queries to appropriate peers with a number of hops logarithmic in network size
  – No peer needs to maintain more than a logarithmic amount of routing information
  – But…
    • Exact match queries only
    • All new content has to be published if peers join or change
    • Old content has to be unpublished if peers leave
    • Documents added or removed will contain a lot of different terms to be published or unpublished; thus, usually many index peers have to be addressed
    • A conjunction of query terms needs to access many peers, but there is still no guarantee that a single document with the conjunction exists

(37) 14.3 Distributed Hash Tables for IR
• Improvement: hybrid P2P infrastructures [Loo et al., '04]
  – Efficiency of DHTs is worst if highly replicated items are requested
    • Experiments show worse behavior than flooding, degrading with churn
  – Querying and content allocation follow a Zipf distribution
    • Only few highly replicated and often queried items
    • 'People are looking for hay, not for needles' (S. Shenker)
  – Hybrid P2P infrastructures use DHTs only for the less replicated and rarely queried items; all other queries are flooded
  – Still, DHTs have to be maintained for the majority of query terms
[Figure: Query Frequency Distribution; x-axis: Query (0 to 90), y-axis: Occurrence Frequency (0% to 16%)]

(38) 14.3 Routing Indexes for IR
• Routing indexes are local collections of (key, peer) pairs
  – Key is either a keyword or a query
  – Peer is the address of a peer that either offers relevant results or routes the query to other peers with relevant results
• In contrast to flooding, only 'interesting' directions are queried
  – A distinction is often made between links in the default network (directions of content providers) and an overlay structure of direct links to content providers ('shortcuts')
• First introduced by [Crespo & Garcia-Molina, '02] to choose the best neighbors in the default network for query forwarding
  – Index maintenance is of local nature and index coverage is usually high due to the Zipf distribution of requests
  – Correctness of the index is influenced by network volatility/churn

(39) 14.3 Routing Indexes for IR
• Routing index policies in the face of network churn
  – With restricted index sizes, new entries are collected and always stored. If the maximum size is reached, some stale information is replaced
    • A simple strategy always replaces the currently oldest index entries
    • The 'least recently used' (LRU) strategy assigns higher usefulness to entries that have been successfully used recently
    • Optimal index size is a problematic parameter
  – Indexes with unrestricted size have to combat network churn differently
    • 'Time to live' assigns an expiry time to each new index entry
    • 'Forgetting factors' can periodically weigh down the reliability of link information
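The LRU policy for a size-restricted routing index can be sketched with an ordered dictionary. The class name `LRURoutingIndex` is hypothetical, and treating every successful lookup as a "successful use" is a simplifying assumption; a real system would refresh an entry only after the peer actually delivered results.

```python
from collections import OrderedDict

class LRURoutingIndex:
    """Size-restricted routing index of (key, peer) pairs with LRU replacement."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.entries = OrderedDict()   # least recently used entry comes first

    def put(self, key, peer):
        """Always store the new entry; evict the least recently used one if full."""
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = peer
        if len(self.entries) > self.max_size:
            self.entries.popitem(last=False)

    def lookup(self, key):
        """Return the peer for a key and mark the entry as recently used."""
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)
        return self.entries[key]

index = LRURoutingIndex(max_size=2)
index.put("basketball", "P1")
index.put("football", "P2")
index.lookup("basketball")        # refreshes the "basketball" entry
index.put("tennis", "P3")         # evicts "football", the least recently used
```

Replacing `move_to_end` in `lookup` with a no-op would turn this into the simpler "replace the oldest entry" strategy.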

(40) 14.3 An Algorithm for Correct Query Routing
• Goal: progressive distributed top-k ranking of documents
• Putting techniques together to design an efficient top-k algorithm
  – Minimal number of object transfers
  – Optimal number of object accesses
• Features of the P2P-based approach
  – Optimized query routing
  – No global index
  – Query-driven term indexing

(41) 14.3 Bird's View
1. Distribute query through the network (Routing)
2. Every peer scores documents locally (Ranking)
3. Hierarchical construction of the final result (Merging)
4. Optimized query routing (Index)

(42) 14.3 Building Blocks
[Figure: building blocks of the approach: structured network, local ranking, merging, query-driven index, result]

(43) 14.3 Network Structure
• Observation: peers strongly differ in availability, bandwidth, computing power, …
• Hierarchical network structure with super-peers
  – Query routing
  – Result merging
  – Indexes

(44) 14.3 Network Topology
• Super-peers as hypercube (HyperCuP protocol)
• Resilient against leaving peers
• Broadcast with (n-1) messages, log2(n) hops
[Figure: hypercube of super-peers SP1 to SP8 with edges labeled by dimension 0, 1, 2, next to the minimal spanning tree used for broadcast]
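The message and hop counts can be checked with a small simulation of a hypercube broadcast. This is a sketch under simplifying assumptions: the hypercube is complete, nodes are numbered 0 to n-1 so that neighbors differ in exactly one bit, and a node that received the message over dimension d forwards it only over higher dimensions. The function name `hypercube_broadcast` is made up for the example.

```python
def hypercube_broadcast(dimensions, origin=0):
    """Simulate a broadcast in a complete hypercube with 2**dimensions nodes.

    Returns a list of (sender, receiver, hops) tuples, one per message sent.
    """
    messages = []
    # Each entry: (node, lowest dimension it may still forward on, hops so far).
    frontier = [(origin, 0, 0)]
    while frontier:
        node, start_dim, hops = frontier.pop()
        for dim in range(start_dim, dimensions):
            neighbor = node ^ (1 << dim)      # flip one bit to reach the neighbor
            messages.append((node, neighbor, hops + 1))
            frontier.append((neighbor, dim + 1, hops + 1))
    return messages

msgs = hypercube_broadcast(3)                 # 8 super-peers
receivers = sorted(receiver for _, receiver, _ in msgs)
max_hops = max(hops for _, _, hops in msgs)
```

For 8 nodes this produces 7 messages, every node other than the origin receives the message exactly once, and the longest path has 3 hops.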

(45) 14.3 Local Ranking
• Super-peer asks for local rankings of peers' collections
• Top-k results (plus metric-dependent information) are returned to the SP
• Arbitrary similarity measures can be used
  – TFxIDF
  – Similarities in taxonomies

(46) 14.3 Result Merging
• Results will be merged at the super-peers
  – Unique scoring function
  – Maximum of k messages per SP-SP edge
[Figure: super-peers SPA, SPB, SPC, SPD with attached peers P1 to P7 and the querying peer PQ]
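Merging at a super-peer amounts to combining several ranked lists into one top-k list. The following sketch assumes that all peers use the same scoring function, so scores are directly comparable; the function name `merge_top_k` and the `(score, document id)` tuple layout are assumptions, and the scores are taken from the routing example later in this section.

```python
import heapq

def merge_top_k(result_lists, k):
    """Merge ranked lists of (score, doc_id) pairs into the k best overall.

    If a document is reported more than once, only its best score is kept.
    """
    best = {}
    for results in result_lists:
        for score, doc_id in results:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    return heapq.nlargest(k, ((score, doc_id) for doc_id, score in best.items()))

from_p1 = [(0.8, "d11"), (0.3, "d12"), (0.2, "d13")]
from_sp2 = [(0.7, "d21"), (0.4, "d22")]
top2 = merge_top_k([from_p1, from_sp2], k=2)
```

Since no more than k results can make it into the final answer, each super-peer needs to pass at most k results upward per edge.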

(47) 14.3 Indexing
• Super-peers keep indexes
  – IDFs (collection-wide information)
    • IDF values for query terms
  – Top peers (routing)
    • List of peers that have already contributed to a previous top-k result
  – Others possible, e.g. for taxonomies
• Index entries are query-driven

(48) 14.3 Routing Indexes Example: Top-k Query Routing
• Example for routing indexes in P2P networks with a super-peer backbone holding the routing indexes
• Progressive P2P top-k algorithm [Balke et al., '04]
  – If query q is indexed, distribute the query and collect results
  – Otherwise flood the query and
    • Compute ranks at local peers
    • Merge results at super-peers
    • Use statistics for a new entry in the routing index (routing information, collection-wide information, etc.)
  – Data structures at super-peers
    • RequestResults: peers which are queried for results (index information)
    • BestPeer: peers which delivered the recent best result
    • TopRes: current top results
    • Delivered: delivered results

(49) 14.3 Routing Indexes Example: Top-k Query Routing
• P0 issues query q1: find the top 2 documents
• The routing index at SP4 is empty
[Figure: super-peer backbone SP1 to SP8 with the querying peer P0 and the peers P1 to P4]
Documents and scores at the peers:
  P1: d11 0.8, d12 0.3, d13 0.2
  P2: d21 0.7, d22 0.4, d23 0.3
  P3: d31 0.6, d32 0.6, d33 0.1
  P4: d41 0.5, d42 0.5, d43 0.2
State at SP4:
  RequestResults: {SP8, P2, P3, P4}
  BestPeers: {}
  TopRes: {}
  Delivered: {}

(50) 14.3 Routing Indexes Example: Top-k Query Routing
[Figure: network and peer collections as on slide 49; several animation steps at SP4 are overlaid]
Values shown at SP4 across the animation steps:
  RequestResults: {SP8, P2, P3, P4}, {SP8, P3, P4}, {}
  BestPeers: {}, {P2}
  TopRes: {}, {(P2, d21, 0.7)}, {(P3, d31, 0.5), (P4, d41, 0.4)}
  Delivered: {}, {(P2, d21, 0.7)}

(51) 14.3 Routing Indexes Example: Top-k Query Routing
[Figure: network and peer collections as on slide 49; several animation steps at SP1 are overlaid]
Values shown at SP1 across the animation steps:
  RequestResults: {SP3, SP5, P1}, {}
  BestPeers: {}, {P1}
  TopRes: {(SP2, d21, 0.7)}, {(P1, d11, 0.8), (SP2, d21, 0.7)}
  Delivered: {}, {(P1, d11, 0.8)}
  A further value {SP2} appears; its field is not recoverable
• P0 receives the first result for q1: {(d11, 0.8)}

(52) 14.3 Routing Indexes Example: Top-k Query Routing
[Figure: network and peer collections as on slide 49; several animation steps at SP1 are overlaid]
Values shown at SP1 across the animation steps:
  RequestResults: {P1}, {}
  BestPeers: {}, {SP2}
  TopRes: {(SP2, d21, 0.7)}, {(SP2, d21, 0.7), (P1, d12, 0.3)}, {(P1, d12, 0.3)}
  Delivered: {(P1, d11, 0.8)}, {(P1, d11, 0.8), (SP2, d21, 0.7)}
• P0 receives for q1: {(d11, 0.8)}, then {(d11, 0.8), (d21, 0.7)}

(53) 14.3 Routing Indexes Example: Top-k Query Routing
[Figure: network and peer collections as on slide 49, with the routing indexes after the query]
Routing index entries for q1:
  SP1: q1 → {SP2, P1}
  SP2: q1 → {SP4}
  SP4: q1 → {P2, P3}
State at SP1:
  RequestResults: {}
  BestPeers: {SP2}
  TopRes: {(P1, d12, 0.3)}
  Delivered: {(P1, d11, 0.8), (SP2, d21, 0.7)}
• P0 receives the result for q1: {(d11, 0.8), (d21, 0.7)}

(54) 14.3 Query Routing
• At the first appearance of a query, peers only send out their input for the IDF computation
• Super-peers aggregate IDFs and build the index
• Whenever a query is repeated
  – SPs will send recent IDF values together with the query terms
  – Peers will use the IDFs for local score computation
• Disadvantage: at the first occurrence of a query, it has to be sent twice
  – Zipf distribution minimizes the number of queries concerned
• Advantages:
  – No effort for maintaining a global IDF index
  – Values for frequently occurring queries are kept up-to-date

(55) 14.3 Query Routing and Network Churn
• Query index strategy
  – Send queries only to peers that have recently contributed to answering a query
• Problem: the network's and each peer's volatility
  – Solution 1: send queries also to a randomly selected set of peers
  – Solution 2: “best before” timestamp
[Figure: super-peer network SP1 to SP8 with four elements marked X]

(56) 14.3 Locality-Based Routing Indexes
• Refinement of routing indexes by social metaphors
• Retrieval process similar to real life
  – Every person has only limited knowledge of the environment
  – Who knows about a certain topic?
  – Who might know other people who know about the topic?
  – Try to build (short) 'chains of acquaintances' that will bring you close to the requested information
• Aims at building 'social networks' as overlays
• Peers semantically connected by certain topics form 'small world networks', e.g. [Milgram, '67; Kleinberg, '00]
• Paradigm of interest-based locality
  – If a peer has relevant content for a user's query, it very often also has some other content that this user might be interested in

(57) 14.3 Locality-Based Routing Indexes
• For information retrieval in P2P networks this enables new routing in interest-based overlay structures
  – Route queries to peers with documents matching semantically close queries
  – Traces on practical data collections show that
    • Peers get well-connected
    • The overlay graph shows highly clustered characteristics with a small minimum distance between any two nodes
  – 'Overhearing' of communications routed through a peer can be used to enhance its local index
  – Randomly sending queries also to peers from the default network helps to extend knowledge and can remedy the effect of network churn


(59) 14.4 Supporting Effective P2P IR
• P2P information retrieval has to deal with the trade-off between
  – Efficient local maintenance of statistics / index information, where information can be stale (incorrect)
  – Expensive global maintenance of statistics / index information, where information is always accurate
• What is needed is 'just the right level' of dissemination of statistics to guarantee a 'sufficiently effective' retrieval
• Some techniques help to support efficient retrieval
  – Providing adequate collection-wide information
  – Estimating document overlap between peers
  – Pre-structuring collections by categories / taxonomies

(60) 14.4 Providing Collection-Wide Information
• Collection-wide information (CWI), e.g. IDFs, is important for retrieval quality, but cannot be calculated locally
  – Some systems, e.g. PlanetP, do not use CWI directly, but circumvent the problem by using an inverted peer frequency (IPF) computed from N and Nt, where N is the number of all peers and Nt is the number of peers offering documents on term t
  – If summarizations of peers (abstracts) are eagerly disseminated, each peer can locally determine the values for N and Nt
  – The relevance of peers in multi-keyword queries is simply the sum of the IPFs for the individual terms
  – Practical tests show an average overlap of about 70% between result sets retrieved with IDFs and those retrieved with IPFs
  – Using IPFs, the scalability is, however, still limited
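The IPF formula is not preserved in this text. The sketch below assumes the form commonly given for PlanetP, IPF_t = log(1 + N / Nt), and ranks a peer by summing the IPFs of those query terms that the peer's abstract reports. The function names and the use of plain sets as stand-ins for the abstracts are assumptions made for the example.

```python
import math

def ipf(num_peers, peers_with_term):
    """Inverted peer frequency, assumed here as log(1 + N / Nt)."""
    return math.log(1 + num_peers / peers_with_term)

def peer_relevance(query_terms, peer_terms, num_peers, peers_with_term):
    """Sum of IPFs over the query terms that the peer offers documents on."""
    return sum(ipf(num_peers, peers_with_term[t])
               for t in query_terms if t in peer_terms)

# Hypothetical network of 4 peers; Nt is derived from the disseminated abstracts.
abstracts = {
    "P1": {"basketball", "nba"},
    "P2": {"basketball", "politics"},
    "P3": {"politics"},
    "P4": {"basketball"},
}
n_t = {}
for terms in abstracts.values():
    for t in terms:
        n_t[t] = n_t.get(t, 0) + 1

query = ["basketball", "nba"]
scores = {p: peer_relevance(query, terms, len(abstracts), n_t)
          for p, terms in abstracts.items()}
```

Here P1 ranks highest because it is the only peer offering the rarer term "nba" in addition to "basketball".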

(61) 14.4 Providing Collection-Wide Information
• Tests in Web information retrieval, e.g. [Viles & French, '95], show that CWI stays relatively stable over the whole collection of Web sites even with churn
  – Only joining/leaving corpora on completely new topics result in significant change
• Indexing CWI in a similar way as the routing information for queries is possible [Balke et al., '05]
  – In structured networks, CWI can be aggregated along the backbone, and indexed CWI can be distributed together with the query
  – New queries have to be flooded/routed twice
    • The first flooding collects and aggregates CWI
    • The second one provides the correct CWI for local scorings
  – Non-expired indexed CWI can always be used when available

(62) 14.4 Estimating the Document Overlap
• Assessing the novelty of collections also supports retrieval quality
  – Pre-computed statistics about the expected result quality of each collection are often used to minimize the number of queried collections
  – Querying collections with high overlap will usually not improve the result sets enough to justify the access costs
  – Progressive searches in particular, like top-k searches, profit from focusing on collections with small overlaps, since result merging procedures will ignore identical/similar results
• The novelty of a collection can only be calculated with respect to some reference collection(s)
  – e.g. the collection(s) already in a peer's local routing index

(63) 14.4 Estimating the Document Overlap
• Definition of the novelty of a peer p's collection $C_p$ with respect to a reference collection $C_{ref}$ [Bender et al., '05]: $Novelty(C_p \mid C_{ref}) = |C_p - (C_p \cap C_{ref})|$, i.e. the number of documents in $C_p$ that are not contained in $C_{ref}$
  – Since the information about exactly which documents a peer offers is usually not disseminated, the values have to be approximated from statistics
    • E.g. if abstracts in the form of Bloom filters are given, a combined Bloom filter $b_p$ can be calculated by a bitwise logical AND of p's Bloom filters for all keywords in a query
    • Novelty can then be estimated by comparing $b_p$ to the union of the Bloom filters $b_i$ of the set $S$ of collections that have already been retrieved
    • The degree of novelty is given by counting the positions at which p's Bloom filter has bits set that are not set in this union
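
The Bloom filter estimate can be sketched as follows, representing each filter as a Python integer used as a bit set. The filter size `M`, the number of hash functions `K`, the SHA-256-based hashing, and all function names are assumptions for illustration; the estimate counts bit positions, not documents, so it is only a proxy for the true novelty.

```python
import hashlib

M = 256  # filter size in bits (assumed)
K = 3    # hash functions per item (assumed)

def positions(item):
    """Yield the K bit positions of an item in the filter."""
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % M

def bloom(items):
    """Build a Bloom filter, stored as an int, over document ids."""
    bits = 0
    for item in items:
        for pos in positions(item):
            bits |= 1 << pos
    return bits

def combined_filter(term_filters, query_terms):
    """Bitwise AND of a peer's per-term filters for all query keywords."""
    result = (1 << M) - 1
    for term in query_terms:
        result &= term_filters.get(term, 0)
    return result

def novelty(candidate, retrieved):
    """Count bits set in the candidate but not in the union of retrieved filters."""
    union = 0
    for b in retrieved:
        union |= b
    return bin(candidate & ~union).count("1")

# Illustrative data only: per-term filters over the ids of matching documents
p_filters = {"fat": bloom(["d1", "d2", "d3"]), "fries": bloom(["d2", "d3", "d4"])}
b_p = combined_filter(p_filters, ["fat", "fries"])
already_retrieved = [bloom(["d2"])]
estimate = novelty(b_p, already_retrieved)
```

A peer whose combined filter is fully covered by the already retrieved filters gets novelty 0 and can be skipped during collection selection.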

(64) 14.4 Prestructuring Collections with Taxonomies
• Retrieval in P2P systems generally considers two basic paradigms
  – Fulltext-based queries
  – Metadata-based queries
• Integrating these paradigms can support retrieval effectiveness
  – Structuring document collections
  – Disambiguation of query terms
• Peers often host collections of similar documents, e.g. similar kinds of information (newspaper articles, etc.) on similar topics
  – Scalability and the successful use of statistics are strongly improved if a common system of categories can be used to classify the documents
  – Since categories are more or less similar to each other, a taxonomy on the categories makes it easy to find semantically similar documents

(65) 14.4 Prestructuring Collections with Taxonomies
• Topical similarity within a taxonomy is defined by [Li et al., '03]
  $sim(c_1, c_2) = e^{-\alpha l} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}$
  – $l$: length of the shortest path between categories $c_1$ and $c_2$
  – $h$: level of the common subsumer
  – Common values: $\alpha = 0.2$, $\beta = 0.6$ (experimentally determined)
  – E.g. newspaper articles, with the taxonomy News → {Business, Politics, Sports, …}, Politics → {Foreign, Domestic}, Sports → {Tennis}
    sim(Politics, Sports): l = 2, h = 1, sim ≈ 0.36
    sim(Politics, Foreign): l = 1, h = 2, sim ≈ 0.68
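
The similarity measure can be checked with a small sketch. It assumes the formula of Li et al. in the form sim = e^(-αl) · tanh(βh), that the root category News is on level 1 (inferred from the example values h = 1 and h = 2), and a `PARENT` table that encodes the example taxonomy; the function names are hypothetical.

```python
import math

# Example taxonomy: child -> parent (News is the root)
PARENT = {
    "Business": "News", "Politics": "News", "Sports": "News",
    "Foreign": "Politics", "Domestic": "Politics", "Tennis": "Sports",
}

def path_to_root(category):
    """Return the list of categories from the given one up to the root."""
    path = [category]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def similarity(c1, c2, alpha=0.2, beta=0.6):
    """Topical similarity, assuming sim = exp(-alpha * l) * tanh(beta * h)."""
    p1, p2 = path_to_root(c1), path_to_root(c2)
    subsumer = next(c for c in p1 if c in p2)      # deepest common subsumer
    l = p1.index(subsumer) + p2.index(subsumer)    # shortest path length
    h = len(path_to_root(subsumer))                # level, root = 1 (assumed)
    return math.exp(-alpha * l) * math.tanh(beta * h)

sim_sports = similarity("Politics", "Sports")    # l = 2, h = 1
sim_foreign = similarity("Politics", "Foreign")  # l = 1, h = 2
```

`math.tanh(beta * h)` is the fraction (e^(βh) − e^(−βh)) / (e^(βh) + e^(−βh)), so deeper common subsumers and shorter paths both increase the similarity.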

(66) 14.4 Combination of Topics and Keywords
• Topics dominate keywords
• Cooperative filter: relax on topics until k results have been found
• Example: [<Politics>, “London Olympics”]
[Figure: topics ordered by similarity to the query topic (Politics, Foreign, Domestic, Sports, Business, Tennis), next to a text collection with documents on Politics, Foreign, and Sports]
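
The cooperative filter can be sketched as a loop over topic similarity levels. This is a centralized illustration under stated assumptions: `topic_sim` holds precomputed similarities to the query topic (the values are illustrative, with 1.0 assumed for an exact topic match), documents are hypothetical `(id, topic, keyword_score)` tuples, and a keyword score of 0 means the document does not match the keywords.

```python
def cooperative_filter(query_topic, docs, topic_sim, k):
    """Relax the topic constraint level by level until k results are found.

    Topics dominate keywords: documents on a more similar topic always
    precede documents on a less similar one, whatever their keyword score.
    """
    sims = topic_sim[query_topic]
    results = []
    for level in sorted(set(sims.values()), reverse=True):
        topics = {t for t, s in sims.items() if s == level}
        matches = [d for d in docs if d[1] in topics and d[2] > 0]
        matches.sort(key=lambda d: d[2], reverse=True)  # keywords rank within a level
        results.extend(matches)
        if len(results) >= k:
            break  # no further relaxation needed
    return results[:k]

# Illustrative data only
topic_sim = {"Politics": {"Politics": 1.0, "Foreign": 0.68, "Domestic": 0.68,
                          "Sports": 0.36, "Business": 0.36, "Tennis": 0.29}}
docs = [("a", "Politics", 0.8), ("b", "Sports", 0.9), ("c", "Foreign", 0.7),
        ("d", "Politics", 0.0), ("e", "Domestic", 0.4)]
top3 = cooperative_filter("Politics", docs, topic_sim, k=3)
```

Document "b" has the best keyword score but is not returned for k = 3, because enough results were found on topics closer to Politics.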

(67) 14.4 Combination of Topics and Keywords
[Figure: animated example of top-k query processing in a super-peer network with super-peers SP1 to SP8 and peers P0 to P4. Each document carries a topic label (P, D or S) and a score, e.g. P1 holds d11 [P, 0.8], d12 [P, 0.3] and d13 [P, 0.2]. The top results shown are {(P1, d11, [P, 0.8]), (SP2, d21, [P, 0.7])}, and P0 ends with the result set {d11, d21}.]


(69) 14.5 Summary and Conclusion
• In today's P2P systems, only exact-match keyword retrieval is prevalent (usually on metadata)
• Information retrieval in P2P scenarios is needed
  – Individual, loosely coupled document collections need fulltext retrieval and ranking techniques
  – Applications range from shared working environments, e.g. in project groups, to distributed digital libraries
• Almost all IR systems use at least some global statistics; in P2P infrastructures, the dissemination of the necessary statistics becomes a performance bottleneck
  – The trade-off between cached, but sometimes stale statistics and fresh, but expensively updated statistics needs to be managed
  – How much staleness does a 'sufficient' retrieval effectiveness allow?

(70) 14.5 Summary and Conclusion
• Choosing the 'right' collections for querying improves retrieval efficiency
  – Those containing the most promising documents with as little overlap as possible
  – Small worlds offer quick connections to semantically close collections
• Query routing indexes can handle some network churn while providing results of sufficient quality
  – Local indexes can be maintained efficiently
  – They can exploit Zipf-distributed content allocations and querying behavior
  – Only small numbers of peers need to be contacted
• Supporting techniques like efficient CWI estimation/dissemination or taxonomies of document categories further improve retrieval
