Information Retrieval Software - Using Search Term Positions for Determining Document Relevance

<DOC>

<PROFILE>_AN-BEOA7AAIFT</PROFILE>

<HEADLINE>

FT 14 MAY 91 / International Company News:

Contigas plans DM900m east German project

</HEADLINE>

<BYLINE>By DAVID GOODHART</BYLINE>

<TEXT>

CONTIGAS, the German gas group 81 per cent owned by the utility Bayernwerk, said yesterday that it intends to invest DM900m (Dollars 522m) in the next four years to build a new gas distribution system in the east German state of Thuringia. ...

</TEXT>

<PUB>The Financial Times</PUB>

<PAGE>International Page 20</PAGE>

</DOC>

Figure 2.21: A document extract from the Financial Times (TREC-8).

Like most traditional retrieval collections, TREC-8 has three distinct parts: the documents, the topics, and the relevance judgements. The documents are tagged using SGML to allow easy parsing (see Fig. 2.21). The formatting philosophy at NIST is to leave the data as close to the original as possible. No attempt is made to correct spelling errors, sentence fragments, strange formatting around tables, or similar faults.
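Because the fields are delimited by paired tags, a few regular expressions are enough to read such a document. The following is a minimal sketch, not the parser used in this work: the function name `parse_trec` is hypothetical, the sample is an abbreviated copy of the extract in Fig. 2.21 with an opening `<HEADLINE>` tag assumed, and the approach only handles flat, non-nested fields.

```python
import re

# Abbreviated sample modelled on the extract in Fig. 2.21
# (the opening <HEADLINE> tag is an assumption).
SAMPLE = """<DOC>
<PROFILE>_AN-BEOA7AAIFT</PROFILE>
<HEADLINE>
FT 14 MAY 91 / International Company News:
Contigas plans DM900m east German project
</HEADLINE>
<BYLINE>By DAVID GOODHART</BYLINE>
<TEXT>
CONTIGAS, the German gas group 81 per cent owned by the utility Bayernwerk, said yesterday ...
</TEXT>
<PUB>The Financial Times</PUB>
<PAGE>International Page 20</PAGE>
</DOC>"""

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
FIELD_RE = re.compile(r"<([A-Z]+)>(.*?)</\1>", re.DOTALL)


def parse_trec(sgml):
    """Return one {tag: text} dictionary per <DOC> element."""
    docs = []
    for body in DOC_RE.findall(sgml):
        fields = {}
        for tag, value in FIELD_RE.findall(body):
            # Only whitespace is collapsed; spelling and wording stay untouched.
            fields[tag] = " ".join(value.split())
        docs.append(fields)
    return docs


docs = parse_trec(SAMPLE)
print(docs[0]["PUB"])  # The Financial Times
```

Leaving the field contents unmodified mirrors the NIST policy described above: any clean-up is the responsibility of the retrieval system, not of the collection.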

The title field of a topic consists of up to three words that best describe the topic, and the description field is a one-sentence description of the topic area that contains all of the words of the title field. The narrative gives a concise description of what makes a document relevant.

Each ad hoc topic was constructed by the same person, called the assessor, who performed the relevance assessments for that topic.

Figure 2.22: The interface of the Expansion Analyzer represents a virtual document containing three arbitrary sets of terms and permits a graphical analysis of the proposed term position models.

classes for the TREC evaluation are implemented.

2.11.1 The Expansion Analyzer

To understand the properties and behaviour of the different term position models, we developed the Expansion Analyzer, a Java module that emulates a “virtual document” and the positions of three arbitrary groups of terms. This software provides a graphical view of the term position models and permits us to analyze the different mathematical operators used to compare the proposed term distribution functions.

The program interface (Figure 2.22) consists of two main groups of components: the Input Controls and the Functions Viewers.

The Input Controls

As shown in Figure 2.22, the Input Controls let the user set the parameters of the test environment. They consist of three input fields (A), three check-boxes (B) and one slider (D). The input fields in (A) define the positions of three arbitrary term sets: the Reference (black), Distribution 1 (blue), and Distribution 2 (red). The selected check-box in (B) defines the mathematical model (expansion) used to calculate the term position functions, and the slider (D) sets the order of the selected expansion.
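To make the roles of the term positions and the expansion order concrete, the following sketch computes the coefficients of a truncated series for a set of term positions. This passage does not state which expansions the check-boxes offer, so the cosine series used here is purely an assumption for illustration, as are the function names `cosine_coefficients` and `evaluate` and the modelling of each term occurrence as a unit impulse.

```python
import math


def cosine_coefficients(positions, doc_length, order):
    """Coefficients a_0..a_order of a cosine series for unit impulses
    placed at the given term positions (assumed model)."""
    coeffs = []
    for k in range(order + 1):
        # x = p / doc_length is the position normalised to [0, 1].
        coeffs.append(
            sum(math.cos(k * math.pi * p / doc_length) for p in positions)
        )
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the truncated series at a normalised position x in [0, 1]."""
    total = coeffs[0]
    for k in range(1, len(coeffs)):
        total += 2.0 * coeffs[k] * math.cos(k * math.pi * x)
    return total


# A single term occurrence in the middle of a 100-token virtual document.
coeffs = cosine_coefficients([50], doc_length=100, order=2)
print(evaluate(coeffs, 0.5) > evaluate(coeffs, 0.0))  # True
```

Raising the order adds higher-frequency terms, so the reconstructed function becomes more sharply peaked around the term positions; this is the kind of effect the slider makes visible.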

Figure 2.23: Lucene Structure. [Diagram not reproduced. Labels: user, search interface, query, results, query parser, searcher, index, analyzer, and parsers for text, HTML, PDF and MS Word documents; the diagram is divided into a front-end process and a back-end process.]

Functions Viewers

The Functions Viewers are a set of windows that graphically reproduce the characteristics of the test environment defined through the Input Controls. The main viewer (C) shows the form of the term position functions in the virtual document, and the group of windows in (F) depicts four different mathematical operators applied to the input functions in the form:

• Operator(Reference, Distribution 1) (blue curve), and

• Operator(Reference, Distribution 2) (red curve).

The implemented operators are, from left to right: (a) the scalar product, (b) the cosine of the corresponding coefficient vectors, (c) the norm difference, and (d) the projection.
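The four operators can be sketched on plain coefficient vectors as follows. This is an illustrative reading rather than the Expansion Analyzer's code: the text does not define the operators formally, so taking the "norm difference" as the Euclidean norm of the difference vector and the "projection" as the scalar projection onto the reference are assumptions, and the example vectors are hypothetical.

```python
import math


def scalar_product(u, v):
    """(a) Scalar product of two coefficient vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean norm of a coefficient vector."""
    return math.sqrt(scalar_product(u, u))


def cosine(u, v):
    """(b) Cosine of the angle between the two vectors (0.0 if either is zero)."""
    denom = norm(u) * norm(v)
    return scalar_product(u, v) / denom if denom else 0.0


def norm_difference(u, v):
    """(c) Euclidean norm of u - v (assumed reading of 'norm difference')."""
    return norm([a - b for a, b in zip(u, v)])


def projection(u, v):
    """(d) Scalar projection of v onto u (assumed reading of 'projection')."""
    n = norm(u)
    return scalar_product(u, v) / n if n else 0.0


# Hypothetical coefficient vectors for the three term sets.
reference = [1.0, 0.0, -1.0]
distribution_1 = [1.0, 0.2, -0.8]
distribution_2 = [1.0, 0.0, 1.0]

for name, dist in (("Distribution 1", distribution_1),
                   ("Distribution 2", distribution_2)):
    print(name,
          scalar_product(reference, dist),
          cosine(reference, dist),
          norm_difference(reference, dist),
          projection(reference, dist))
```

Under these readings, a distribution that resembles the reference gives a large scalar product, cosine and projection but a small norm difference, which is why comparing the operators side by side is informative.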

Finally, the button (E) dumps the expansion coefficients of the current configuration so that the accuracy of the calculated coefficients can be confirmed.

With the help of this module, it was possible to determine the most suitable configuration, which was then applied in the experimental phase.

2.11.2 Apache Lucene

Apache Lucene is a popular, scalable IR library written in Java. It is suitable for nearly any application that requires full-text search, especially cross-platform applications. Lucene is a member of the popular Apache Jakarta family of projects, licensed under the liberal Apache Software License [63].

Lucene provides a core API for indexing and search tasks that can be easily integrated into existing applications. Figure 2.23 shows the indexing architecture of Lucene. Different parsers are used for different types of documents. For HTML documents, for example, an HTML parser performs pre-processing such as stripping the HTML tags and outputs the text content. The Lucene Analyzer then extracts tokens and related information, such as token frequency, from the text content, and writes them into the Lucene index files.
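The parser, analyzer and index stages described above can be illustrated with a toy pipeline. The sketch below deliberately does not use the Lucene API; it relies only on the Python standard library, and the names `TextExtractor`, `parse_html`, `analyze` and `index_documents`, as well as the sample document, are hypothetical. Token positions are stored so that the token frequency is simply the length of each position list.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Parser stage: drops the HTML tags and keeps the text content."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parse_html(html):
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


def analyze(text):
    """Analyzer stage: lower-cased word tokens in document order."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_documents(docs):
    """Index stage: term -> {doc_id: [token positions]}."""
    index = defaultdict(dict)
    for doc_id, html in docs.items():
        for pos, token in enumerate(analyze(parse_html(html))):
            index[token].setdefault(doc_id, []).append(pos)
    return index


docs = {
    "d1": "<html><body><h1>Gas project</h1>"
          "<p>New gas distribution system</p></body></html>",
}
index = index_documents(docs)
print(len(index["gas"]["d1"]))  # token frequency of "gas" in d1: 2
```

Keeping the parser separate from the analyzer means that a new document type only needs a new parser that yields plain text; the tokenisation and index-writing steps stay unchanged.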

Figure 2.24: Terrier Structure. [Diagram not reproduced. Labels: user, application, query, results, Manager, index, Indexer, parsing, corpus, Collection, Document, TermPipeline, IndexBuilders; the diagram is divided into a front-end process and a back-end process.]

2.11.3 Terabyte Retriever - Terrier

Terrier is a search engine that implements state-of-the-art indexing and retrieval functions, providing a platform for the rapid development of large-scale retrieval applications [106].

Terrier is written in Java, and can be used for Web and Enterprise search, Desktop, Intranet and Vertical search engines, as well as developing and evaluating novel large-scale text information retrieval techniques and applications.

The open source version of Terrier provides a platform for research and experimentation in text retrieval, supporting commonly used TREC research collections (e.g. TREC CDs 1-5, WT2G, WT10G, GOV, GOV2, Blogs06).

Terrier implements two main applications: (a) Trec Terrier: an application that enables indexing and querying of TREC collections, and (b) Desktop Terrier: for the indexing and retrieval of local user content.

As shown in Figure 2.24, Terrier has an indexing and retrieval structure similar to that of Lucene and is also an open source project (Mozilla Public Licence).
