Information Retrieval and Web Search Engines
Summer Semester 2010
Prof. Dr. Wolf-Tilo Balke and Joachim Selke

Homework Assignment 3

Due June 3, 2010 (34 points in total)

Remember: If you have any problems or questions regarding this assignment, please let us know. We are happy to help!

Note: For this assignment, please use the stemmed version of the Reuters collection, which is available for download on the lecture website.

Exercise 3.1 (Binary Independence Retrieval)

Answer the query “taxes reagan” using the binary independence retrieval model (you may estimate the term $\Pr(D_i = 1 \mid D \in R_q)$ by 0.9, as proposed by Croft and Harper). Compare the results to the ones generated by the vector space model (using TF–IDF and cosine similarity; see Exercise 2.4). Which model works better (in your opinion)? (8 points)
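A minimal Python sketch of BIR-style scoring on a toy collection may help to see what the model computes. It assumes the common formulation in which every query term $i$ that occurs in a document contributes $\ln\frac{p_i}{1 - p_i} + \ln\frac{1 - s_i}{s_i}$, with $p_i = 0.9$ (the Croft and Harper estimate) and $s_i$ approximated from the document frequency. The helper name `bir_scores`, the smoothing of $s_i$, and the toy documents are illustrative assumptions; the lecture's exact formula may differ.

```python
import math

def bir_scores(docs, query, p=0.9):
    """Score each document (a set of terms) for the query under a BIR-style model."""
    n = len(docs)
    # Document frequency of each query term.
    df = {t: sum(1 for d in docs if t in d) for t in query}
    weight = {}
    for t in query:
        # Smoothed estimate of Pr(term present | non-relevant); the smoothing
        # is an assumption made here to avoid log(0) on tiny collections.
        s = (df[t] + 0.5) / (n + 1.0)
        weight[t] = math.log(p / (1.0 - p)) + math.log((1.0 - s) / s)
    # Only query terms that are present in the document contribute.
    return [sum(weight[t] for t in query if t in d) for d in docs]

# Toy collection of stemmed terms (hypothetical, not the Reuters data).
docs = [
    {"tax", "reagan", "budget"},
    {"tax", "oil"},
    {"wheat", "export"},
    {"reagan", "trade"},
    {"oil", "price"},
]
query = ["tax", "reagan"]
scores = bir_scores(docs, query)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Because the model is binary, a document's score depends only on which query terms it contains, not on how often they occur. This is one of the differences to look for when comparing against the TF–IDF ranking.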

Exercise 3.2 (Latent Semantic Indexing)

a) LSI has been reported to work better if it is applied to a transformation of the term–document matrix (rather than to the term–document matrix itself).[1] Therefore, please take the filtered and stemmed Reuters matrix TD from Assignment 2 (this matrix is available for download on our website) and replace each entry $td_{i,j}$ by its corresponding log-entropy value

$$td'_{i,j} = \left(1 + \sum_{r=1}^{n} \frac{\frac{td_{i,r}}{f_i} \cdot \ln \frac{td_{i,r}}{f_i}}{\ln(n)}\right) \cdot \ln\left(td_{i,j} + 1\right),$$

where $n$ is the number of documents in the collection and $f_i$ is the total number of times term $i$ occurs in the whole collection.

If you did this and saved the new matrix as TDLSI, the command TDLSI(1:200, 1:200) should return the following:

ans =

   (131,12)      0.2519
   (108,39)      0.5051
   (121,73)      0.5089
   (107,192)     0.5487

(10 points)

[1] Source: http://en.wikipedia.org/wiki/Latent_semantic_indexing.

Hint: The MATLAB commands spdiags (to rescale the rows of a matrix by multiplying it with a sparse diagonal matrix) and spfun (to apply a function to each nonzero entry of a sparse matrix) might be helpful.
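The log-entropy weighting from part a) can be sketched in plain Python on a small dense matrix. This is an illustration only: the function name `log_entropy` and the toy counts are assumptions, zero counts are skipped in the sum (using the convention $0 \cdot \ln 0 = 0$), and every term is assumed to occur at least once in a collection of more than one document.

```python
import math

def log_entropy(td):
    """Apply log-entropy weighting to a dense term-document count matrix (rows = terms)."""
    n = len(td[0])  # number of documents
    weighted = []
    for row in td:
        f = sum(row)  # total occurrences of this term in the collection
        # Sum of p * ln(p) over documents; zero counts contribute nothing.
        entropy_sum = sum((c / f) * math.log(c / f) for c in row if c > 0)
        g = 1.0 + entropy_sum / math.log(n)  # global (row) weight
        # Local weight ln(count + 1), rescaled by the row's global weight.
        weighted.append([g * math.log(c + 1.0) for c in row])
    return weighted

# Toy counts (hypothetical): 3 terms, 4 documents.
td = [
    [2, 0, 0, 0],  # term concentrated in one document
    [1, 1, 1, 1],  # term spread evenly over all documents
    [3, 1, 0, 0],
]
w = log_entropy(td)
for row in w:
    print([round(x, 4) for x in row])
```

A term concentrated in a single document keeps its full local weight (global weight 1), while a term spread evenly over all documents is weighted down to 0. The row-wise rescaling by `g` plays the role of the diagonal-matrix multiplication mentioned in the hint, and the element-wise logarithm corresponds to applying a function to each nonzero entry.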

b) Perform LSI on the transformed term–document matrix you just created by computing its rank-100 approximation. Do it as shown in the lecture by creating two new matrices $U'_{100}$ and $V'_{100}$. (3 points)

Hint: The MATLAB command svds will be helpful (be careful, the matrix V returned by MATLAB is the transpose of the matrix we called V in the lecture).
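As a rough illustration of the rank-k step, the following Python sketch (it needs NumPy) truncates the singular value decomposition of a small dense matrix. The function name `truncated_svd` and the toy matrix are assumptions. How the lecture combines these factors into $U'_{100}$ and $V'_{100}$ is not restated in this assignment, so the sketch stops at the plain truncated factors.

```python
import numpy as np

def truncated_svd(a, k):
    """Return the k leading singular triplets (U_k, s_k, Vt_k) of a dense matrix."""
    # Singular values are returned in descending order.
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]

# Toy 4x3 matrix of rank 2 (hypothetical); the assignment uses k = 100
# on the weighted Reuters matrix instead.
a = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
u_k, s_k, vt_k = truncated_svd(a, 2)
a_k = u_k @ np.diag(s_k) @ vt_k  # rank-2 approximation of a
print(np.allclose(a, a_k))
```

Note that `np.linalg.svd` returns the right factor already transposed, so the product is `U @ diag(s) @ Vt`. Conventions for the right factor differ between tools, which is the kind of mismatch the hint above warns about.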

c) Take a look at the first five latent dimensions generated by LSI by inspecting which terms get the highest and lowest coordinates in each dimension. Try to assign a meaningful concept name to each dimension! (5 points)
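Inspecting a latent dimension amounts to sorting the terms by their coordinate in that dimension. A small Python sketch, with a hypothetical term list and made-up coordinates standing in for one column of the term matrix:

```python
def extreme_terms(column, terms, m=3):
    """Return the m terms with the highest and the m with the lowest coordinates."""
    order = sorted(range(len(terms)), key=lambda i: column[i])  # ascending
    lowest = [terms[i] for i in order[:m]]
    highest = [terms[i] for i in reversed(order[-m:])]
    return highest, lowest

# Hypothetical stemmed terms and coordinates for a single latent dimension.
terms = ["oil", "tax", "wheat", "bank", "trade", "gold"]
column = [0.61, -0.42, 0.05, -0.55, 0.30, 0.12]
highest, lowest = extreme_terms(column, terms, m=2)
print("highest:", highest)
print("lowest:", lowest)
```

The sign of a singular vector is arbitrary, so the "highest" and "lowest" ends of a dimension may be swapped between runs or tools. What matters for naming a concept is which terms sit at opposite ends.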

d) Answer the query “taxes reagan” using LSI (on the matrices $U'_{100}$ and $V'_{100}$ using cosine similarity). Compare the results to the ones generated by the vector space model (on the term–document matrix using TF–IDF and cosine similarity; see Exercise 2.4). Which model works better (in your opinion)? (8 points)

Hint: The MATLAB command pdist2 might be helpful.
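The ranking step in part d) can be sketched in plain Python, assuming the query and the documents have already been mapped into the same latent space (that mapping follows the lecture and is not shown here). The function names and the two-dimensional toy vectors are illustrative assumptions.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_x * norm_y)

def rank_by_cosine(query_vec, doc_vecs):
    """Return document indices sorted by decreasing similarity, plus the similarities."""
    sims = [cosine_similarity(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda j: sims[j], reverse=True)
    return order, sims

# Hypothetical latent coordinates of four documents and one query.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query_vec = [2.0, 0.0]
order, sims = rank_by_cosine(query_vec, doc_vecs)
print(order)
```

MATLAB's pdist2 with the 'cosine' option returns a distance (one minus the cosine similarity), so the best matches there are the smallest values rather than the largest.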
