Information Retrieval and Web Search Engines
Summer Semester 2010
Prof. Dr. Wolf-Tilo Balke and Joachim Selke

Homework Assignment 4

Due June 17, 2010 (35 points in total)

Remember: (1) Start working on this assignment early. (2) Let us know if you need help.

Note: Again, for all of the following exercises, please use the stemmed version of the Reuters collection, which is available for download on the lecture website.

Exercise 4.1 (Retrieval with Language Models)

Implement a retrieval system based on the simple unigram language model presented in the lecture.

Use linear smoothing with λ = 0.8. Use your system to answer the query “taxes reagan” (don’t forget stemming). What is your subjective opinion about the result quality? (5 points)
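As a rough illustration of the scoring step, the following Python sketch ranks documents by query likelihood under a linearly smoothed unigram model. It is only a sketch under stated assumptions, not the lecture's reference implementation: it assumes that λ weights the document model, that is, P(w|d) = λ · P_ml(w|d) + (1 − λ) · P(w|C), so please check this against the lecture slides. The function name `rank_by_query_likelihood` and the toy documents are hypothetical, and Python is used here in place of MATLAB.

```python
import math
from collections import Counter


def rank_by_query_likelihood(docs, query, lam=0.8):
    """Rank documents by the log-likelihood of the query.

    docs:  dict mapping a document ID to its list of (stemmed) terms
    query: list of (stemmed) query terms
    lam:   ASSUMPTION: weight of the document model in the mixture
    """
    # Collection-wide term frequencies for the background model P(w|C).
    collection_tf = Counter()
    for terms in docs.values():
        collection_tf.update(terms)
    collection_len = sum(collection_tf.values())
    if collection_len == 0:
        return []

    scores = {}
    for doc_id, terms in docs.items():
        doc_tf = Counter(terms)
        doc_len = len(terms)
        log_score = 0.0
        for term in query:
            p_doc = doc_tf[term] / doc_len if doc_len else 0.0
            p_coll = collection_tf[term] / collection_len
            p = lam * p_doc + (1.0 - lam) * p_coll
            if p == 0.0:
                # The term occurs nowhere in the collection.
                log_score = float("-inf")
                break
            log_score += math.log(p)
        scores[doc_id] = log_score

    # Highest log-likelihood first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data; the real input is the stemmed Reuters collection.
toy_docs = {
    "d1": ["tax", "reagan", "tax"],
    "d2": ["oil", "price"],
    "d3": ["reagan", "oil"],
}
toy_ranking = rank_by_query_likelihood(toy_docs, ["tax", "reagan"])
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on longer queries and does not change the ranking.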

Exercise 4.2 (Relevance Judgments: Creating Test Data)

As we have seen in the lecture, an IR system’s effectiveness is typically evaluated against some human-defined ground truth, that is, a test data set consisting of (1) a document collection, (2) a set of queries, and (3) a relevance judgment for each query–document pair.

Use the pooling method (based on the retrieval methods we have used so far in the homework, namely, Boolean retrieval, vector space retrieval with TF-IDF and cosine similarity, 100-dimensional LSI with log entropy and cosine similarity, Binary Independence Retrieval with the 0.9 heuristic, and language models as used in the previous exercise) to determine which documents are relevant with respect to the following queries:

a) taxes reagan b) oil price collapse c) toxic cargo

For each of these queries, please explain (1) which (more detailed) information need you assume to underlie the query and (2) what criteria your relevance judgments are based upon.

Hint: To simplify this task, we recommend implementing some helper functions in MATLAB: first, for each retrieval method, a function that takes a query as input and returns a ranking of the collection’s documents along with the corresponding scores; second, a function that takes a query as input, calls all retrieval methods, processes their result lists (by considering only a reasonably long prefix of each list), and creates a duplicate-free list of documents, which have to be evaluated manually for relevance. (15 points)
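The second helper function from the hint could look roughly like the following sketch, written in Python rather than MATLAB. The name `build_pool`, the default pool depth of 20, and the stub retrieval methods are assumptions made for illustration only; each retrieval method is assumed to return a list of (document ID, score) pairs sorted by descending score.

```python
def build_pool(query, retrieval_methods, depth=20):
    """Collect the top `depth` results of every method into one
    duplicate-free list of document IDs to be judged manually.

    retrieval_methods: dict mapping a method name to a function that
                       takes a query and returns [(doc_id, score), ...]
    depth:             ASSUMPTION: prefix length considered per method
    """
    pool = []
    seen = set()
    for _name, method in retrieval_methods.items():
        ranking = method(query)
        for doc_id, _score in ranking[:depth]:
            if doc_id not in seen:
                seen.add(doc_id)
                pool.append(doc_id)
    return pool


# Hypothetical stub methods standing in for the real retrieval systems.
stub_methods = {
    "vsm": lambda q: [(3, 0.9), (1, 0.7), (5, 0.2)],
    "lm": lambda q: [(1, -2.1), (4, -3.0), (3, -3.5)],
}
toy_pool = build_pool("taxes reagan", stub_methods, depth=2)
```

Keeping the order of first appearance is only a convenience; for unbiased judgments it may be preferable to shuffle the pool before assessing it, so that the assessor cannot tell which method proposed a document.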

Exercise 4.3 (Evaluating Precision and Recall)

Use the test data set created in the previous exercise to evaluate the effectiveness of the five retrieval methods we have used so far (see above). For each of the three queries, draw a picture showing the precision–recall at k curves for all five methods (you don’t need to apply interpolation).

Compare the three pictures and discuss the strengths and weaknesses of each retrieval method.

Hint: For Boolean retrieval, drawing a precision–recall at k curve might be difficult. What can you do instead?

Hint 2: Drawing precision–recall at k curves is problematic if many documents get assigned the same score in a ranking. What can be done to avoid this problem?

(15 points)
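For reference, precision at k is the fraction of the top k results that are relevant, and recall at k is the fraction of all relevant documents that appear among the top k. The following Python sketch computes both values for every k of a single ranking; the function name `precision_recall_at_k` and the toy data are hypothetical, and the set of relevant documents is assumed to come from the judgments made in Exercise 4.2.

```python
def precision_recall_at_k(ranked_doc_ids, relevant):
    """Return a list of (k, precision@k, recall@k) for k = 1..len(ranking).

    ranked_doc_ids: document IDs in ranked order, best first
    relevant:       iterable of document IDs judged relevant for the query
    """
    relevant = set(relevant)
    points = []
    hits = 0
    for k, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
        precision = hits / k
        # Recall is reported as 0.0 when no document was judged relevant.
        recall = hits / len(relevant) if relevant else 0.0
        points.append((k, precision, recall))
    return points


# Hypothetical toy ranking: "x" is relevant but never retrieved.
toy_points = precision_recall_at_k(["a", "b", "c", "d"], {"a", "c", "x"})
```

Plotting recall on the horizontal axis and precision on the vertical axis, one point per k, gives the uninterpolated curve for one method and one query.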
