
2.9 Retrieval Utilities

2.9.4 Automatic Relevance Feedback

As mentioned before, in a relevance feedback cycle, the user examines the top ranked documents and separates them into two classes: the relevant ones and the non-relevant ones. This information is then used to select new terms for query expansion or query re-weighting. An automatic variant of this procedure usually involves identifying terms that are related to the query terms. Such terms might be synonyms, stemming variations, or terms that are close to the query terms in the text. Two basic types of strategies can be attempted: a global one and a local one.

Automatic Local Analysis

In a local strategy, the documents retrieved for a given query $q$ are automatically examined at query time to determine terms for query expansion. Two different strategies will be discussed: the first, proposed by Attar and Fraenkel [7], is known as local clustering; the second, called local context analysis, corresponds to the work of Xu and Croft [164] and is based on a combination of local and global analysis.

Local feedback strategies are based on expanding the query with terms correlated to the query terms. Such correlated terms are those present in local clusters built from the local document set. To build these cluster structures, Attar and Fraenkel proposed three basic strategies:

Association Clusters. An association cluster is based on the co-occurrence of stems² (or terms) inside documents. The idea is that stems that co-occur frequently inside documents have a synonymity association [8]. The association clusters are generated as follows:

Definition 4  The frequency of a stem $s_i$ in a document $d_j$, $d_j \in D_l$, is referred to as $f_{s_i,j}$. Let $\vec{m} = (m_{ij})$ be an association matrix with $|S_l|$ rows and $|D_l|$ columns, where $m_{ij} = f_{s_i,j}$. Let $\vec{m}^t$ be the transpose of $\vec{m}$. The matrix $\vec{s} = \vec{m}\,\vec{m}^t$ is a local stem-stem association matrix. Each element $s_{u,v}$ in $\vec{s}$ expresses a correlation $c_{u,v}$ between the stems $s_u$ and $s_v$, namely,

$$c_{u,v} = \sum_{d_j \in D_l} f_{s_u,j} \times f_{s_v,j} \qquad (2.21)$$

The correlation factor $c_{u,v}$ quantifies absolute frequencies of co-occurrence and is said to be unnormalized. Thus, if we adopt $s_{u,v} = c_{u,v}$, the association matrix $\vec{s}$ is said to be unnormalized. An alternative is to normalize the correlation factor using

$$s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}$$

in which case the association matrix $\vec{s}$ is said to be normalized.

²A stem is the part of a word that is common to all its inflected variants.

Given a query $q$, we are normally interested in finding clusters only for the $|q|$ query terms. Further, it is desirable to keep the size of such clusters small, which means that they can be computed at query time. A similar procedure can be applied in a non-stemmed version, where keywords are used instead of stems. Keyword-based local clustering is equally worth trying because there is controversy over the advantages of using a stemmed vocabulary [8].
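To make the construction concrete, here is a minimal Python sketch of association clusters built from Equation 2.21 and the normalization above; the toy document set and all function names are illustrative, not part of the original formulation.

```python
from collections import Counter

# Hypothetical local document set D_l: each document is a list of stems.
local_docs = [
    ["inform", "retriev", "system", "retriev"],
    ["queri", "expan", "retriev", "inform"],
    ["queri", "reweight", "feedback"],
]

freqs = [Counter(doc) for doc in local_docs]            # f_{s_i,j} per document
stems = sorted({s for doc in local_docs for s in doc})  # local vocabulary

def c(u, v):
    """Unnormalized correlation c_{u,v} (Equation 2.21)."""
    return sum(f[u] * f[v] for f in freqs)

def s_norm(u, v):
    """Normalized association s_{u,v} = c_{u,v} / (c_{u,u} + c_{v,v} - c_{u,v})."""
    return c(u, v) / (c(u, u) + c(v, v) - c(u, v))

def association_cluster(u, n, normalized=True):
    """S_u(n): the n stems most correlated with the stem u."""
    score = s_norm if normalized else c
    return sorted((v for v in stems if v != u),
                  key=lambda v: score(u, v), reverse=True)[:n]

print(association_cluster("retriev", 3))
```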

Metric Clusters. Association clusters are based on the frequency of co-occurrence of pairs of terms in documents and do not take into account where the terms occur in a document. Since two terms that occur in the same sentence seem more correlated than two terms that occur far apart in a document, it might be worthwhile to factor in the distance between two terms when computing their correlation factor. Metric clusters are based on the following definition:

Definition 5  Let the distance $r(k_i, k_j)$ between two keywords $k_i$ and $k_j$ be given by the number of words between them in the same document. If $k_i$ and $k_j$ are in distinct documents, we take $r(k_i, k_j) = \infty$. A local stem-stem metric correlation matrix $\vec{s}$ is defined as follows. Each element $s_{u,v}$ of $\vec{s}$ expresses a metric correlation $c_{u,v}$ between the stems $s_u$ and $s_v$, namely,

$$c_{u,v} = \sum_{k_i \in V(s_u)} \sum_{k_j \in V(s_v)} \frac{1}{r(k_i, k_j)} \qquad (2.22)$$

In this expression, as already defined, $V(s_u)$ and $V(s_v)$ indicate the sets of keywords that have $s_u$ and $s_v$ as their respective stems.

The correlation factor $c_{u,v}$ quantifies absolute distances and is said to be unnormalized. Thus, if we adopt $s_{u,v} = c_{u,v}$, the association matrix $\vec{s}$ is said to be unnormalized. An alternative is to normalize the correlation factor. For instance, adopting

$$s_{u,v} = \frac{c_{u,v}}{|V(s_u)| \times |V(s_v)|}$$

the association matrix $\vec{s}$ is said to be normalized.

Given a local matrix $\vec{s}$, we can use it to build local metric clusters as follows.

Definition 6  Consider the $u$-th row in the metric correlation matrix $\vec{s}$ (i.e., the row with all the associations for the stem $s_u$). Let $S_u(n)$ be a function that takes the $u$-th row and returns the set of $n$ largest values $s_{u,v}$, where $v$ varies over the set of local stems and $v \neq u$. Then $S_u(n)$ defines a local metric cluster around the stem $s_u$.
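As an illustration, the metric correlation of Equation 2.22 can be sketched as below. The crude `stem` stand-in and the use of the word-position difference as the distance $r(k_i, k_j)$ are simplifying assumptions.

```python
def stem(word):
    # Crude stand-in for a real stemmer (illustration only).
    return word[:6]

def metric_correlation(su, sv, docs):
    """c_{u,v} of Equation 2.22: sum of 1 / r(ki, kj) over all keyword
    pairs whose stems are su and sv; pairs in distinct documents have
    r = infinity and thus contribute nothing."""
    total = 0.0
    for doc in docs:
        pos_u = [p for p, w in enumerate(doc) if stem(w) == su]
        pos_v = [p for p, w in enumerate(doc) if stem(w) == sv]
        for pi in pos_u:
            for pj in pos_v:
                if pi != pj:
                    # Word-position difference as the distance r(ki, kj).
                    total += 1.0 / abs(pi - pj)
    return total

docs = [["query", "expansion", "improves", "query", "retrieval"]]
print(metric_correlation(stem("query"), stem("retrieval"), docs))
```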

Scalar Clusters. One additional way to obtain a synonymity relationship between two local stems (or terms) $s_u$ and $s_v$ is to compare the sets $S_u(n)$ and $S_v(n)$. The idea is that two stems with similar neighborhoods have some synonymity relationship. In this case, we say that the relationship is indirect or induced by the neighborhood. One way to calculate such neighborhood relationships is to arrange all correlation values $s_{u,i}$ in a vector $\vec{s}_u$, to arrange all correlation values $s_{v,i}$ in another vector $\vec{s}_v$, and to compare these vectors through a scalar measure. For instance, the cosine of the angle between the two vectors is a popular scalar similarity measure.

Definition 7  Let $\vec{s}_u = (s_{u,1}, s_{u,2}, \ldots, s_{u,n})$ and $\vec{s}_v = (s_{v,1}, s_{v,2}, \ldots, s_{v,n})$ be two vectors of correlation values for the stems $s_u$ and $s_v$. Further, let $\vec{s} = (s_{u,v})$ be a scalar association matrix. Then, each $s_{u,v}$ can be defined as

$$s_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \times |\vec{s}_v|} \qquad (2.23)$$

The correlation matrix $\vec{s}$ is said to be induced by the neighborhood. Using it, a scalar cluster is defined as follows.

Definition 8  Let $S_u(n)$ be a function that returns the set of $n$ largest values $s_{u,v}$, $v \neq u$, defined according to Equation 2.23. Then, $S_u(n)$ defines a scalar cluster around the stem $s_u$.

A stem $s_u$ that belongs to a cluster (of size $n$) associated with another stem $s_v$ (i.e., $s_u \in S_v(n)$) is said to be a neighbor of $s_v$. While neighbor stems are said to have a synonymity relationship, they are not necessarily synonyms in the grammatical sense. Often, neighbor stems represent distinct keywords that are correlated by the current query context [8]. The local aspect of this correlation is reflected in the fact that the documents and stems considered in the correlation matrix are all local (i.e., $d_j \in D_l$, $s_u \in V_l$).
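A scalar cluster follows directly from Equation 2.23: the scalar correlation of two stems is the cosine between their rows of correlation values. The rows in this sketch are hypothetical numbers.

```python
import math

def scalar_correlation(row_u, row_v):
    """s_{u,v} of Equation 2.23: cosine between two rows of the
    (metric or association) correlation matrix."""
    dot = sum(a * b for a, b in zip(row_u, row_v))
    norm_u = math.sqrt(sum(a * a for a in row_u))
    norm_v = math.sqrt(sum(b * b for b in row_v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical correlation rows for two stems.
s_u = [0.9, 0.1, 0.4]
s_v = [0.8, 0.2, 0.5]
print(round(scalar_correlation(s_u, s_v), 3))
```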

Figure 2.16: Stem $s_u$ as a neighbor of stem $s_v$: $s_u$ lies within the neighborhood $S_v(n)$ around $s_v$.

Figure 2.16 illustrates a stem (or term) $s_u$ that is located within a neighborhood $S_v(n)$ associated with the stem (or term) $s_v$. In general, neighbor stems are an important product of the local clustering process since they can be used for extending a search formulation in a promising unexpected direction, rather than merely complementing it with missing synonyms [8].

The qualitative interpretation of normalized and unnormalized clusters is that unnormalized clusters tend to group stems whose ties are due to their large frequencies, while normalized clusters tend to group rarer stems. Thus, the union of these two cluster types provides a better representation of the possible correlations.

Experimental results reported in the literature usually support the hypothesis of the usefulness of local clustering methods. Furthermore, metric clusters seem to perform better than pure association clusters. This strengthens the hypothesis that there is a correlation between the association of two terms and the distance between them [8].

Local Context Analysis. As discussed above, clustering techniques are based on the set of documents retrieved for the original query and use the top ranked documents for clustering neighbor terms using the term co-occurrence criterion inside document boundaries. Terms that are the best neighbors of the query terms are then used to expand the original query. A distinct approach is to search for term correlations in the whole collection (global analysis), which usually involves building a thesaurus that identifies term relationships across the whole collection. The local context analysis approach [164] combines global and local analysis. It is based on the use of noun groups (i.e., a single noun, two adjacent nouns, or three adjacent nouns in the text), instead of simple keywords, as document concepts. For query expansion, concepts are selected from the top ranked documents (as in local analysis) based on their co-occurrence with the query terms (no stemming). However, this approach uses passages (text windows of fixed size) instead of documents (as in global analysis). More specifically, local context analysis is divided into three steps.

• First, retrieve the top $n$ ranked passages using the original query. This is accomplished by breaking up the documents initially retrieved by the query into fixed-length passages (for example, of size 300 words) and ranking these passages as if they were documents.

• Second, for each concept $c$ in the top ranked passages, the similarity $sim(q, c)$ between the whole query $q$ (not individual query terms) and the concept $c$ is calculated using a variant of tf-idf ranking.

• Third, the top $m$ ranked concepts (according to $sim(q, c)$) are added to the original query $q$. To each added concept a weight is assigned, given by $1 - 0.9 \times i/m$, where $i$ is the position of the concept in the final concept ranking. The terms in the original query $q$ might be stressed by assigning a weight of 2 to each of them.

The similarity $sim(q, c)$ between each related concept $c$ and the original query (step 2 above) is computed as follows.

$$sim(q, c) = \prod_{k_i \in q} \left( \delta + \frac{\log(f(c, k_i) \times idf_c)}{\log n} \right)^{idf_i} \qquad (2.24)$$

where $n$ is the number of top ranked passages considered. The function $f(c, k_i)$ quantifies the correlation between the concept $c$ and the query term $k_i$ and is given by

$$f(c, k_i) = \sum_{j=1}^{n} pf_{i,j} \times pf_{c,j}$$

where $pf_{i,j}$ is the frequency of the term $k_i$ in the $j$-th passage and $pf_{c,j}$ is the frequency of the concept $c$ in the $j$-th passage. Notice that this is the standard correlation measure defined for association clusters (Equation 2.21), but adapted for passages. The inverse document frequency factors are computed as follows.

$$idf_i = \max\left(1, \frac{\log_{10}(N/np_i)}{5}\right) \qquad (2.25)$$

$$idf_c = \max\left(1, \frac{\log_{10}(N/np_c)}{5}\right) \qquad (2.26)$$

where $N$ is the number of passages in the collection, $np_i$ is the number of passages containing the term $k_i$, and $np_c$ is the number of passages containing the concept $c$. The factor $\delta$ is a constant parameter that avoids a value equal to zero for $sim(q, c)$. Usually, $\delta$ is a small factor with values close to 0.1 (10% of the maximum of 1). Finally, the $idf_i$ factor in the exponent is introduced to emphasize infrequent query terms.

The procedure to calculate $sim(q, c)$ is a non-trivial variant of tf-idf ranking. Furthermore, it was tuned for operation with TREC data and did not work as well with other collections. Thus, it is important to bear in mind that operation with different collections might require additional tuning.
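For concreteness, the following sketch strings Equations 2.21 and 2.24-2.26 together for a single concept. It simplifies in several respects (the retrieved passages, assumed to number more than one, stand in for the whole collection when computing $N$; a concept is treated as a single token; the log term is dropped when $f(c, k_i) = 0$), so it should be read as an illustration rather than a reference implementation.

```python
import math

def lca_similarity(query_terms, concept, passages, delta=0.1):
    """Sketch of sim(q, c) (Equation 2.24). `passages` is a list of
    token lists (the top n ranked passages, n > 1); a concept is
    treated as a single token for simplicity."""
    n = len(passages)
    N = n  # simplification: retrieved passages stand in for the collection

    def idf(np_count):
        # Equations 2.25 / 2.26, with np clamped to avoid division by zero.
        return max(1.0, math.log10(N / max(np_count, 1)) / 5)

    idf_c = idf(sum(1 for p in passages if concept in p))
    sim = 1.0
    for ki in query_terms:
        # f(c, ki): passage-level co-occurrence (Equation 2.21 adapted).
        f = sum(p.count(ki) * p.count(concept) for p in passages)
        # When f = 0 the log term is dropped; otherwise it is non-negative
        # because idf_c >= 1 and f >= 1.
        co = math.log(f * idf_c) / math.log(n) if f > 0 else 0.0
        idf_i = idf(sum(1 for p in passages if ki in p))
        sim *= (delta + co) ** idf_i
    return sim

passages = [["query", "expansion", "thesaurus"],
            ["thesaurus", "based", "expansion"],
            ["retrieval", "feedback"]]
print(lca_similarity(["expansion", "query"], "thesaurus", passages))
```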

Automatic Global Analysis

In a global strategy, all documents in the collection are used to determine a global thesaurus-like structure that defines term relationships. We discuss two variants of these strategies, one based on a similarity thesaurus and a second one based on a statistical thesaurus.

Automatic Global Analysis based on a Similarity Thesaurus. The similarity thesaurus [114] is based on term-to-term relationships, considering that terms are concepts in a concept space. In this concept space, each term is indexed by the documents in which it appears.

Thus, terms assume the original role of documents while documents are interpreted as indexing elements. The following definitions establish the proper framework.

Definition 9  Let $t$ be the number of terms in the collection, $N$ the number of documents in the collection, and $f_{i,j}$ the frequency of occurrence of the term $k_i$ in the document $d_j$. Further, let $t_j$ be the number of distinct index terms in the document $d_j$ and $itf_j$ the inverse term frequency for the document $d_j$. Then,

$$itf_j = \log \frac{t}{t_j}$$

analogously to the definition of inverse document frequency.

Within this framework, to each term $k_i$ is associated a vector $\vec{k}_i$ given by $\vec{k}_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,N})$, where $w_{i,j}$ is a weight associated with the index-document pair $[k_i, d_j]$. These weights are computed as follows.

$$w_{i,j} = \frac{\left(0.5 + 0.5\,\frac{f_{i,j}}{\max_j(f_{i,j})}\right) itf_j}{\sqrt{\sum_{l=1}^{N}\left(0.5 + 0.5\,\frac{f_{i,l}}{\max_l(f_{i,l})}\right)^2 itf_l^2}} \qquad (2.27)$$

where $\max_j(f_{i,j})$ computes the maximum of all factors $f_{i,j}$ for the $i$-th term (i.e., over all documents in the collection). We notice that the expression above is a variant of tf-idf weighting, but one that considers inverse term frequencies instead.
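As a sketch, the weight vector $\vec{k}_i$ of Equation 2.27 might be computed as below, following the printed formula literally; the frequency table `f` and the inverse term frequencies `itf` are assumed to be precomputed, and all names are illustrative.

```python
import math

def term_vector(i, f, itf):
    """Sketch of Equation 2.27: the vector ~k_i indexing term k_i by
    documents. f[i][j] is the frequency of term i in document j and
    itf[j] the inverse term frequency of document j, both precomputed."""
    N = len(itf)
    max_f = max(f[i]) or 1  # max_j(f_{i,j}) over all documents
    # Note: following Equation 2.27 literally, components are nonzero
    # even for documents in which the term does not occur.
    raw = [(0.5 + 0.5 * f[i][j] / max_f) * itf[j] for j in range(N)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

# Hypothetical 2-term x 3-document frequency table and itf values.
f = [[2, 0, 1],
     [0, 3, 1]]
itf = [0.4, 0.7, 0.3]
print([round(w, 3) for w in term_vector(0, f, itf)])
```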

The relationship between two terms $k_u$ and $k_v$ is computed as a correlation factor $c_{u,v}$ given by

$$c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{d_j} w_{u,j} \times w_{v,j} \qquad (2.28)$$

This equation is a variation of the correlation measure used for calculating scalar association matrices. The main difference is that the weights are based on interpreting documents as indexing elements instead of repositories for term occurrences.

The global similarity thesaurus is built through the computation of the correlation factor $c_{u,v}$ for each pair of indexing terms $[k_u, k_v]$ in the collection. Of course, this is computationally expensive. However, the global similarity thesaurus has to be computed only once and can be updated incrementally.
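Under the same assumptions, building the thesaurus amounts to taking the dot product of Equation 2.28 for every term pair; the sketch below expects term vectors such as those produced by the Equation 2.27 sketch above, and its names are illustrative.

```python
def build_similarity_thesaurus(term_vectors):
    """Sketch of Equation 2.28: correlation c_{u,v} as the dot product of
    the term vectors ~k_u and ~k_v, computed once for the whole collection.
    `term_vectors` maps each term to its document-indexed weight list."""
    terms = list(term_vectors)
    thesaurus = {}
    for a, ku in enumerate(terms):
        for kv in terms[a + 1:]:
            c = sum(wu * wv
                    for wu, wv in zip(term_vectors[ku], term_vectors[kv]))
            thesaurus[(ku, kv)] = thesaurus[(kv, ku)] = c
    return thesaurus
```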

Given a global similarity thesaurus, query expansion is done in three steps:

• First, represent the query in the concept space used for representation of index terms.

• Second, based on the global similarity thesaurus, compute a similarity $sim(q, k_v)$ between each term $k_v$ correlated with the query terms and the whole query $q$.

• Third, expand the query with the top $r$ ranked terms according to $sim(q, k_v)$.

For the first step, the query is represented in the concept space of index term vectors as follows.

Definition 10  To the query $q$ is associated a vector $\vec{q}$ in the term-concept space, given by

$$\vec{q} = \sum_{k_i \in q} w_{i,q} \vec{k}_i \qquad (2.29)$$

where $w_{i,q}$ is a weight associated with the index-query pair $[k_i, q]$.

For the second step, a similarity $sim(q, k_v)$ between each term $k_v$ (correlated to the query terms) and the user query $q$ is computed as

$$sim(q, k_v) = \vec{q} \cdot \vec{k}_v = \sum_{k_u \in q} w_{u,q} \times c_{u,v} \qquad (2.30)$$

where $c_{u,v}$ is the correlation factor given in Equation 2.28.

For the third step, the top $r$ ranked terms according to $sim(q, k_v)$ are added to the original query $q$ to form the expanded query $q'$. To each expansion term $k_v$ in the query $q'$ a weight $w_{v,q'}$ is assigned, given by

$$w_{v,q'} = \frac{sim(q, k_v)}{\sum_{k_u \in q} w_{u,q}} \qquad (2.31)$$

The expanded query $q'$ is then used to retrieve new documents for the user.
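The three expansion steps might be sketched as follows, assuming the correlation factors $c_{u,v}$ of Equation 2.28 have already been computed; all names and the toy numbers are hypothetical.

```python
def expand_query(query_weights, correlations, r):
    """Sketch of the expansion steps (Equations 2.30 and 2.31).
    query_weights: {term: w_{u,q}} for the original query.
    correlations: {(ku, kv): c_{u,v}} from the global similarity thesaurus."""
    # Step 2: sim(q, kv) = sum over ku of w_{u,q} * c_{u,v}.
    candidates = {kv for (ku, kv) in correlations if ku in query_weights}
    sim = {kv: sum(w * correlations.get((ku, kv), 0.0)
                   for ku, w in query_weights.items())
           for kv in candidates if kv not in query_weights}
    # Step 3: keep the top r terms, weighted by Equation 2.31.
    total = sum(query_weights.values())
    top = sorted(sim, key=sim.get, reverse=True)[:r]
    return {kv: sim[kv] / total for kv in top}

expanded = expand_query(
    {"inform": 1.0, "retriev": 0.8},
    {("inform", "document"): 0.6,
     ("retriev", "document"): 0.7,
     ("retriev", "search"): 0.9},
    r=2)
print(expanded)
```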

Automatic Global Analysis based on a Statistical Thesaurus. In this section, we discuss a quite different global analysis technique, proposed by Crouch and Yang [37], which is based on a statistical thesaurus.

The global thesaurus is composed of classes that group correlated terms in the context of the whole collection. Such correlated terms can then be used to expand the original query.

To be effective, the terms selected for expansion must have high term discrimination values [139], which implies that they must be low frequency terms. However, it is difficult to cluster low frequency terms effectively due to the small amount of information about them (they occur in few documents). To avoid this problem, documents are clustered into classes instead, and the low frequency terms in these documents are used to define the thesaurus classes.

In this situation, the document clustering algorithm must produce small and tight clusters.

A document clustering algorithm that produces clusters of this type is the complete link algorithm, which works as follows (naive formulation).

1. Initially, place each document in a distinct cluster.

2. Compute the similarity between all pairs of clusters.

3. Determine the pair of clusters $[C_u, C_v]$ with the highest inter-cluster similarity.

4. Merge the clustersCuandCv.

5. Test a stop criterion. If this criterion is not met, then go back to step 2.

6. Return a hierarchy of clusters.

The similarity between two clusters is defined as the minimum of the similarities between all pairs of inter-cluster documents (i.e., pairs of documents not in the same cluster). To compute the similarity between the documents in a pair, the cosine formula of the vector model is used. As a result of this minimality criterion, the resulting clusters tend to be small and tight, as the sketch below illustrates.
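A naive sketch of this procedure, following the six steps above, is given next; it returns a flat partition rather than the full hierarchy, and the threshold-based stop criterion with the parameter `stop_similarity` is an illustrative choice.

```python
import math

def cosine(d1, d2):
    """Cosine similarity between two documents given as term->weight dicts."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def complete_link(docs, stop_similarity=0.2):
    clusters = [[i] for i in range(len(docs))]        # step 1
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):                # steps 2 and 3:
            for b in range(a + 1, len(clusters)):     # min over cross pairs
                s = min(cosine(docs[i], docs[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        if best[0] < stop_similarity:                 # step 5: stop criterion
            break
        _, a, b = best
        clusters[a] += clusters.pop(b)                # step 4: merge
    return clusters                                   # step 6 (flat cut)

docs = [{"retriev": 1.0, "queri": 0.5},
        {"retriev": 0.9, "queri": 0.4},
        {"cluster": 1.0}]
print(complete_link(docs))
```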

Given the document cluster hierarchy for the whole collection, the terms that compose each class of the global thesaurus are selected as follows.

• Obtain from the user three parameters: threshold class (TC), number of documents in the class (NDC), and minimum inverse document frequency (MIDF).

• Use the parameter TC as a threshold value for determining the document clusters that will be used to generate thesaurus classes. The threshold has to be surpassed by $sim(C_u, C_v)$ if the documents in the clusters $C_u$ and $C_v$ are to be selected as sources of terms for a thesaurus class.

• Use the parameter NDC as a limit on the size of clusters (number of documents) to be considered.

• Consider the set of documents in each document cluster preselected above (through the parameters TC and NDC). Only the lower frequency terms are used as sources of terms for the thesaurus classes. The parameter MIDF defines the minimum value of inverse document frequency for any term which is selected to participate in a thesaurus class. By doing so, it is possible to ensure that only low frequency terms participate in the thesaurus classes generated (terms that are too generic are not good synonyms).
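Putting the three parameters together, a hypothetical selection routine might look as follows; the data layout (`clusters`, `sims`, `idf`) and the exact way NDC is applied are assumptions made for the sketch.

```python
def thesaurus_class_terms(clusters, sims, idf, TC, NDC, MIDF):
    """Sketch of the class-selection rules above. `clusters` maps a cluster
    id to its documents (each a list of terms), `sims` maps cluster-id pairs
    to sim(Cu, Cv), and `idf` maps terms to inverse document frequency."""
    classes = []
    for (cu, cv), s in sims.items():
        if s <= TC:                                  # threshold TC not surpassed
            continue
        docs = clusters[cu] + clusters[cv]
        if len(docs) > NDC:                          # size limit NDC
            continue
        # MIDF filter: keep only sufficiently rare (high-idf) terms.
        terms = {t for d in docs for t in d if idf[t] >= MIDF}
        if terms:
            classes.append(terms)
    return classes
```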

Given that the thesaurus classes have been built, they can be used for query expansion.

For this, an average term weight $\overline{wt}_C$ for each thesaurus class $C$ is computed as follows.

$$\overline{wt}_C = \frac{\sum_{i=1}^{|C|} w_{i,C}}{|C|} \qquad (2.32)$$

where $|C|$ is the number of terms in the thesaurus class $C$ and $w_{i,C}$ is a pre-computed weight associated with the term-class pair $[k_i, C]$. This average term weight can then be used to compute a thesaurus class weight $w_C$ as

$$w_C = \frac{\overline{wt}_C}{|C| \times 0.5} \qquad (2.33)$$
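These two equations reduce to a few lines; the sketch below assumes the pre-computed weights $w_{i,C}$ for one class are already available, and the example numbers are hypothetical.

```python
def thesaurus_class_weight(term_weights):
    """Sketch of Equations 2.32 and 2.33 for one thesaurus class C;
    term_weights holds the pre-computed weights w_{i,C} of its terms."""
    size = len(term_weights)                 # |C|
    avg = sum(term_weights) / size           # Equation 2.32
    return avg / (size * 0.5)                # Equation 2.33

print(thesaurus_class_weight([0.8, 0.6, 0.7]))  # hypothetical class of 3 terms
```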

Experiments with well known document collections (ADI, Medlars, CACM, and ISI) indicate that global analysis using a thesaurus built by the complete link algorithm might yield consistent improvements in retrieval performance [8].

The main problem with this approach is the initialization of the parameters TC, NDC, and MIDF. The threshold value TC depends on the collection and can be difficult to set properly. Inspection of the cluster hierarchy is almost always necessary to assist with the setting of TC. Care must be exercised, because a high value of TC might yield classes with too few terms, while a low value of TC might yield too few classes. The parameter NDC can be set more easily once TC has been fixed. However, the setting of the parameter MIDF might be difficult and also requires careful consideration.