
4.3 Wikipedia in an Inverted Index

4.3.1 Lucene as Indexing Framework

We use Apache’s search engine Lucene² as our framework. Lucene is a fast and memory-efficient open-source implementation that allows the creation of inverted indices and facilitates structured search in large-scale text collections. Each document in a Lucene index can have a multitude of distinct fields that store specific types of information or content. A search in an index is performed using a query consisting of one or more query terms that are matched against the fields of the indexed documents. The result of the search is a ranked list of indexed documents, where the rank of each document reflects how well it fits the query.

Fields and Queries

Each field in a Lucene index is qualified by a name and the value it stores. To indicate the difference, we use specific fonts for field names. Typically, the value of a field is a string or a collection of strings, but it can also be a number. For instance, one field may store the title of an entity as a single keyword, while another may store all words from an article text as a collection of strings. A field may also be added several times to a document with different values, for instance to store all synonyms of an entity³.
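The field model described above can be illustrated with a small, self-contained sketch. It is not Lucene's API: the class `Document` and its methods `add` and `get_values` are hypothetical names chosen for illustration, and the field names and values are made-up examples. The sketch only shows that a named field may hold a single keyword, a collection of words, or several values added one after another.

```python
from collections import defaultdict


class Document:
    """Toy stand-in for an indexed document: field name -> list of values."""

    def __init__(self):
        self.fields = defaultdict(list)

    def add(self, name, value):
        # Adding the same field name again appends a further value.
        self.fields[name].append(value)

    def get_values(self, name):
        # Return a copy so callers cannot modify the stored values.
        return list(self.fields.get(name, []))


doc = Document()
doc.add("title", "Apple Inc.")        # one keyword
doc.add("synonym", "Apple")           # same field added twice
doc.add("synonym", "Apple Computer")
for word in "apple designs consumer electronics".split():
    doc.add("text", word)             # all words of a (made-up) article text

print(doc.get_values("synonym"))      # ['Apple', 'Apple Computer']
```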

Fields are the targets of search queries, and named fields allow the placement of dedicated query terms. In Lucene, a query term q_f(x) is associated with a specific field f and some value x. In the following, we use this simplified notation when referring to query terms on specific fields.

¹ http://www.en.wikipedia.org, version from September 1st, 2011.

² http://lucene.apache.org/

³ Internally, Lucene handles this as a concatenation of all values associated with that field.

Notation (Query terms)

For a query term q_f(x) the argument x is the value that is to be matched against the field f.

For example, the query term q_title("Apple") is associated with the field title and matches any title field that contains the value "Apple". A search query q is then formulated as a conjunction of one or more query terms, i.e.

$$q = q_{f_1}(x_1) \wedge \dots \wedge q_{f_n}(x_l), \tag{4.5}$$

where each query term q_{f_i}(x_j) is associated with a field f_i, i = 1, ..., n, and some value x_j, j = 1, ..., l, that may be the same or different for each query term.

Each query term can be characterized as either optional, mandatory or excluding.

Mandatory terms must appear in an indexed document for the document to be retrieved, whereas excluding terms rule out all documents that contain them. Optional terms should appear in the document and, if they do, increase the document's query-related score; in a conjunction of optional terms, however, not all terms need to be present. Each optional term can be given a weight to emphasise its importance. The same holds for a sequence of terms and for the combination of different queries.

Example 11 (Queries and Query Terms)

Assume an index with fields A, B, C, and D, each storing some string value. An exemplary query q to search this index is

q = +q_A("a") ∧ !q_B("b") ∧ q_C("x") ∧ w·q_D("y").

The query q is a conjunction of the following query terms:

• a mandatory term q_A("a") (indicated by +)

• an excluding term q_B("b") (indicated by !)

• two optional terms q_C("x") and q_D("y"), the latter weighted by a factor w.

The query q retrieves only those documents that contain the value "a" in a field A but do not contain the value "b" in any field B. The documents fulfilling these constraints are ranked according to the number of matches on the fields A, C and D, where matches on D are weighted by the factor w. If, for instance, the term q_D("y") is three times more important than q_C("x"), we would use w = 3.
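The behaviour of the example query can be mimicked with a minimal sketch. It is a simplification: the score here is just the weighted number of matching terms, not Lucene's scoring function, and the names `Term`, `matches` and `retrieve` as well as the toy documents are hypothetical. For orientation, Lucene's classic query parser syntax would express a comparable query with w = 3 roughly as `+A:a -B:b C:x D:y^3`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    field: str
    value: str
    occur: str = "optional"   # "mandatory", "excluding" or "optional"
    weight: float = 1.0


def matches(term, doc):
    """True if the term's value is stored in the term's field of doc."""
    return term.value in doc.get(term.field, [])


def retrieve(query, docs):
    """Filter by mandatory/excluding terms, then rank by weighted matches."""
    hits = []
    for doc_id, doc in docs.items():
        if any(t.occur == "mandatory" and not matches(t, doc) for t in query):
            continue  # a mandatory term is missing
        if any(t.occur == "excluding" and matches(t, doc) for t in query):
            continue  # an excluding term is present
        score = sum(t.weight for t in query
                    if t.occur != "excluding" and matches(t, doc))
        hits.append((doc_id, score))
    # Highest score first; ties broken by document id.
    return sorted(hits, key=lambda h: (-h[1], h[0]))


docs = {
    "d1": {"A": ["a"], "C": ["x"]},
    "d2": {"A": ["a"], "D": ["y"]},
    "d3": {"A": ["a"], "B": ["b"], "C": ["x"], "D": ["y"]},  # contains "b"
    "d4": {"C": ["x"], "D": ["y"]},                          # lacks "a"
    "d5": {"A": ["a"]},
}
query = [
    Term("A", "a", "mandatory"),
    Term("B", "b", "excluding"),
    Term("C", "x"),
    Term("D", "y", weight=3.0),   # w = 3
]
ranking = retrieve(query, docs)
print(ranking)  # [('d2', 4.0), ('d1', 2.0), ('d5', 1.0)]
```

Documents d3 and d4 are not retrieved at all, while d2 outranks d1 because its match on D carries three times the weight of a match on C.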

The concept of fields allows us to encode distinct entity attributes in specific fields that may differ in importance for entity linking. As already stated, the importance of an attribute can be emphasised or boosted through the use of weights. For instance, we may formulate a query that has a higher weight for exact matches between mention name and entity name and a lower weight for partial matches. This weighting is then used in Lucene’s scoring function, which assigns each document retrieved by a search an individual ranking score.

Scoring in Lucene

According to Hatcher et al. [2010], the score s_{I_W}(q, d) of a document d for a search with a query q in an index I_W is given by:

$$s_{I_W}(q, d) = \mathrm{norm}(q) \cdot c(q, d) \cdot \sum_{q_f(x) \in q} tf_{q_f(x),d} \cdot idf_{q_f(x)}^2 \cdot w_{q_f(x)} \cdot \mathrm{norm}(q_f(x), d). \tag{4.6}$$

The quantity tf_{q_f(x),d} ∈ ℕ_{≥0} denotes the frequency of term q_f(x) in d, which is the number of times the value x appears in any field f in document d. The factor idf_{q_f(x)} ∈ ℝ_{≥0} is the inverse document frequency as in Eq. 3.2 and reflects how many documents contain the value x in any field f. The factor w_{q_f(x)} ∈ ℝ_{>0} is the weight on a specific query term q_f(x). Later, we will use this factor to emphasize matches on dedicated fields.

The normalization factor norm(q) in Eq. 4.6 is the same for all documents and is used internally by Lucene to make the scores of different queries comparable. It is given by the inverse square root of the sum of squared weights of the query's terms,

$$\mathrm{norm}(q) = \frac{1}{\sqrt{w_q^2 \cdot \sum_{q_f(x) \in q} \left( idf_{q_f(x)} \cdot w_{q_f(x)} \right)^2}}, \tag{4.7}$$

where the additional factor w_q ∈ ℝ_{>0} denotes the weight on conjunctions of terms in the query q.
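As a quick numerical check of the query normalization in Eq. 4.7, the following sketch computes norm(q) from a list of (idf, weight) pairs. The function name `query_norm` and the example values are assumptions made for illustration and are not taken from Lucene or from the source.

```python
import math


def query_norm(term_factors, query_weight=1.0):
    """norm(q) as in Eq. 4.7.

    term_factors: iterable of (idf, weight) pairs, one per query term.
    query_weight: the weight w_q on the conjunction of terms.
    """
    sum_sq = sum((idf * w) ** 2 for idf, w in term_factors)
    return 1.0 / math.sqrt(query_weight ** 2 * sum_sq)


# Two unweighted terms with (made-up) idf values 3 and 4:
# sum of squares = 3^2 + 4^2 = 25, so norm(q) = 1 / 5.
norm = query_norm([(3.0, 1.0), (4.0, 1.0)])
print(norm)  # 0.2
```

Because norm(q) depends only on the query, it rescales all scores for that query by the same amount and does not change the order of the retrieved documents.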

The coordination factor c(q, d) in Eq. 4.6 is based on the number of query terms q_f(x) that a document contains. It rewards a document containing many query terms by raising its score above those of documents containing fewer terms. The last factor in Eq. 4.6, norm(q_f(x), d), is a field-length norm computed over the values in a field f. Comparable to a length norm, it gives a match in a field with few values a higher score than a match in a field with many values. In the context of entity linking, for instance, this field-length norm ensures that entities with short article texts are treated similarly to entities with longer article texts.
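To see how the factors of Eq. 4.6 interact, the following sketch evaluates the score on a toy index. Several parts are assumptions rather than statements from the source: the idf form `1 + ln(N / (df + 1))`, the coordination factor as the fraction of matched query terms, and the field-length norm `1 / sqrt(number of values)` follow Lucene's classic similarity as commonly documented, while Eq. 3.2 of the source may define idf differently. The function names, field names and documents are hypothetical, and w_q is fixed to 1.

```python
import math


def idf(num_docs, doc_freq):
    # Assumed idf form; the source refers to its own Eq. 3.2 instead.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def score(query, doc, index):
    """Eq. 4.6 for one document.

    query: list of (field, value, weight) triples (all treated as optional).
    doc, index entries: dicts mapping field name -> list of values.
    """
    n = len(index)
    terms = []
    for field, value, weight in query:
        df = sum(1 for d in index if value in d.get(field, []))
        terms.append((field, value, weight, idf(n, df)))

    # Query norm as in Eq. 4.7 with w_q = 1.
    norm_q = 1.0 / math.sqrt(sum((i * w) ** 2 for _, _, w, i in terms))

    matched, total = 0, 0.0
    for field, value, weight, i in terms:
        values = doc.get(field, [])
        tf = values.count(value)           # raw term frequency
        if tf == 0:
            continue
        matched += 1
        field_norm = 1.0 / math.sqrt(len(values))  # assumed field-length norm
        total += tf * i ** 2 * weight * field_norm

    coord = matched / len(query)           # assumed coordination factor
    return norm_q * coord * total


index = [
    {"title": ["apple"], "text": ["apple", "fruit"]},
    {"title": ["apple"], "text": ["apple", "fruit", "tree", "orchard"]},
    {"title": ["pear"], "text": ["pear", "fruit"]},
]
# A title match is weighted three times as much as a text match.
query = [("title", "apple", 3.0), ("text", "fruit", 1.0)]
scores = [score(query, d, index) for d in index]
```

With these assumptions, the first document scores highest: it matches both terms, and its text field is shorter than that of the second document. The third document matches only one of the two terms and is penalised by the coordination factor.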

Having described the general aspects of index- or search-based entity retrieval, we now describe the underlying indices. We start with the link index I_L, since this index is also created first.

In document Entity Linking to Wikipedia (pages 124-127)