Towards a Distributed Web Search Engine

Ricardo Baeza-Yates Yahoo! Research Barcelona, Spain rbaeza@acm.org

Abstract: We present recent and ongoing research towards the design of a distributed Web search engine. The main goal is to match a centralized search engine in result quality and performance while using fewer computational resources. The main obstacle is network latency when queries must be processed by several servers. Our preliminary findings combine several techniques, such as caching, locality prediction and distributed query processing, that try to maximize the fraction of queries that can be solved locally.

1 Summary

Designing a distributed Web search engine is a challenging problem [BYCJ+07], because many external factors affect the different tasks of a search engine: crawling, indexing and query processing. On the other hand, local crawling benefits from proximity to Web servers, potentially increasing Web coverage and freshness [CPJT08].

Local content can be indexed locally, with local statistics that are useful at the global level communicated afterwards. The natural distributed index is therefore a document-partitioned index [BYRN99].
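As an illustration of the document-partitioned layout, the following minimal Python sketch lets each site index only its own documents and report local statistics (document counts and document frequencies) that are combined into global term weights. The names `SiteIndex`, `global_idf` and `search_all`, the whitespace tokenizer, and the simple TF-IDF scoring are all assumptions made for this example and are not taken from the cited work; document identifiers are assumed to be unique across sites.

```python
import math
from collections import Counter, defaultdict


class SiteIndex:
    """Inverted index over the documents held by one site (one partition)."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.num_docs = 0

    def add(self, doc_id, text):
        """Index one local document (naive whitespace tokenization)."""
        self.num_docs += 1
        for term, tf in Counter(text.lower().split()).items():
            self.postings[term][doc_id] = tf

    def local_stats(self):
        """Statistics shipped to the global level: doc count and document frequencies."""
        return self.num_docs, {t: len(p) for t, p in self.postings.items()}

    def search(self, query, idf):
        """Score local documents using globally computed term weights."""
        scores = defaultdict(float)
        for term in query.lower().split():
            for doc_id, tf in self.postings.get(term, {}).items():
                scores[doc_id] += tf * idf.get(term, 0.0)
        return dict(scores)


def global_idf(sites):
    """Combine the local statistics of all sites into global IDF weights."""
    total_docs = 0
    df = Counter()
    for site in sites:
        num_docs, local_df = site.local_stats()
        total_docs += num_docs
        df.update(local_df)
    return {t: math.log(1 + total_docs / n) for t, n in df.items()}


def search_all(sites, query, k=10):
    """Send the query to every partition and merge the partial rankings."""
    idf = global_idf(sites)
    merged = {}
    for site in sites:
        merged.update(site.search(query, idf))
    # Highest score first; ties broken by document id for determinism.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

Because every document lives in exactly one partition, merging the per-site rankings is a simple union followed by a sort; only the small statistics table has to travel between sites ahead of query time.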

Query processing is very efficient for queries that can be answered locally, but too slow if we need to request answers from remote servers. One way to improve performance is to increase the fraction of queries that behave like local queries. This can be achieved by caching results [BYGJ+08a], caching partial indexes [SJPBY08] and caching documents [BYGJ+08b], with varying degrees of effectiveness. A complementary technique is to predict whether a query will need remote results and, if so, to request local and remote results in parallel instead of sequentially [BYMH08]. Putting all these ideas together, we can build a distributed search engine that performs similarly to a centralized one but needs fewer computational resources and has a lower maintenance cost than the equivalent centralized Web search engine [BYGJ+08b].
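The interplay of result caching and remote-need prediction can be sketched as follows. This is only an illustrative Python sketch under stated assumptions: the class `SiteFrontEnd`, the callables `local_search`, `remote_search` and `needs_remote`, the `min_results` threshold that triggers the sequential fallback, and the unbounded dictionary used as a result cache are all hypothetical and do not describe the systems in the cited papers.

```python
from concurrent.futures import ThreadPoolExecutor


def merge(local, remote, k=10):
    """Merge two lists of (doc_id, score) pairs, highest score first."""
    return sorted(local + remote, key=lambda r: -r[1])[:k]


class SiteFrontEnd:
    """Query front end of one site (hypothetical names throughout)."""

    def __init__(self, local_search, remote_search, needs_remote, min_results=10):
        self.local_search = local_search    # query -> [(doc_id, score)]
        self.remote_search = remote_search  # query -> [(doc_id, score)], slow
        self.needs_remote = needs_remote    # predictor: query -> bool
        self.min_results = min_results      # assumed fallback threshold
        self.cache = {}                     # result cache (no eviction here)
        self.pool = ThreadPoolExecutor(max_workers=2)

    def answer(self, query):
        # 1. Cached results make the query behave like a local one.
        if query in self.cache:
            return self.cache[query]

        if self.needs_remote(query):
            # 2. Predicted remote: start the remote request first, then
            #    run the local search while the remote one is in flight.
            remote_future = self.pool.submit(self.remote_search, query)
            local = self.local_search(query)
            results = merge(local, remote_future.result())
        else:
            # 3. Predicted local: answer locally, and fall back to a
            #    sequential remote request only if too few results came back.
            results = self.local_search(query)
            if len(results) < self.min_results:
                results = merge(results, self.remote_search(query))

        self.cache[query] = results
        return results
```

With a correct prediction, the latency of a query that needs remote results is roughly the maximum of the local and remote times rather than their sum; a wrong "remote" prediction costs an unnecessary remote request, which is the trade-off a real predictor would have to balance.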

Future research must study how all these techniques can be integrated and optimized, as we have learned that the optimal solution changes depending on the interaction of the different subsystems. For example, caching the index behaves differently depending on whether results are also cached.

References

[BYCJ+07] Ricardo Baeza-Yates, Carlos Castillo, Flavio Junqueira, Vassilis Plachouras and Fabrizio Silvestri. Challenges on Distributed Web Retrieval. In ICDE, 6–20, 2007.

[BYGJ+08a] Ricardo Baeza-Yates, Aristides Gionis, Flavio P. Junqueira, Vanessa Murdock, Vassilis Plachouras and Fabrizio Silvestri. Design trade-offs for search engine caching. ACM Trans. Web, 2(4):1–28, 2008.

[BYGJ+08b] Ricardo Baeza-Yates, Aristides Gionis, Flavio P. Junqueira, Vassilis Plachouras and Luca Telloli. On the feasibility of multi-site Web search engines. Submitted, 2008.

[BYMH08] Ricardo Baeza-Yates, Vanessa Murdock and Claudia Hauff. Speeding-Up Two-Tier Web Search Systems. Submitted, 2008.

[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, May 1999.

[CPJT08] B. Barla Cambazoglu, Vassilis Plachouras, Flavio Junqueira and Luca Telloli. On the feasibility of geographically distributed web crawling. In InfoScale ’08: Proceedings of the 3rd international conference on Scalable information systems, 1–10, ICST, Brussels, Belgium, 2008.

[SJPBY08] Gleb Skobeltsyn, Flavio Junqueira, Vassilis Plachouras and Ricardo Baeza-Yates. ResIn: a combination of results caching and index pruning for high-performance web search engines. In SIGIR ’08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, 131–138, New York, NY, USA, 2008. ACM.
