Information Retrieval & Web Search Engines Lecture 2: More Retrieval Models

Please note: The exercises will not be collected, corrected, or graded.

Even though the homework assignments are optional, you are encouraged to answer them and to discuss them in small groups, as they will help you prepare for your final exam.

1. Given the following subset of documents from a document collection:

𝐷₁ = {𝑡₁, 𝑡₅, 𝑡₉}
𝐷₂ = {𝑡₁, 𝑡₂, 𝑡₄, 𝑡₅, 𝑡₉}
𝐷₃ = {𝑡₃, 𝑡₆, 𝑡₇, 𝑡₈}
𝐷₄ = {𝑡₄, 𝑡₅, 𝑡₁₀}
𝐷₅ = {𝑡₃, 𝑡₅, 𝑡₆, 𝑡₇}
𝐷₆ = {𝑡₁, 𝑡₂, 𝑡₁₀}

a. Create an inverted index for the following terms: 𝑡₁, 𝑡₂, 𝑡₃, 𝑡₅, 𝑡₈

b. According to the Boolean model we reviewed in Lecture 1: Introduction, evaluate the following queries:

i. 𝑞₁ = (𝑡₁ 𝑜𝑟 𝑡₅) 𝑏𝑢𝑡 𝑛𝑜𝑡 (𝑡₃ 𝑜𝑟 𝑡₂)

ii. 𝑞₂ = (𝑡₁ 𝑎𝑛𝑑 𝑡₅) 𝑜𝑟 (𝑡₃ 𝑎𝑛𝑑 𝑡₂)

2. Explain and discuss the idea behind TF-IDF.
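
For exercise 1, the following is a minimal Python sketch for checking a hand-worked answer, not a reference solution. It assumes each term 𝑡ₖ is stored as the integer k, that "but not" means set difference, and that postings are kept as sorted lists of document numbers. The names `build_inverted_index` and `postings` are hypothetical helpers introduced only for this illustration.

```python
from collections import defaultdict

# Document collection from exercise 1; term t_k is stored as the integer k.
docs = {
    1: {1, 5, 9},
    2: {1, 2, 4, 5, 9},
    3: {3, 6, 7, 8},
    4: {4, 5, 10},
    5: {3, 5, 6, 7},
    6: {1, 2, 10},
}


def build_inverted_index(docs, terms):
    """Map each requested term to the sorted list of document ids containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):          # visiting ids in order keeps postings sorted
        for term in docs[doc_id]:
            if term in terms:
                index[term].append(doc_id)
    return dict(index)


def postings(index, term):
    """Return the posting list of a term as a set (empty if the term is not indexed)."""
    return set(index.get(term, []))


# Part a: index only the terms the exercise asks for.
index = build_inverted_index(docs, {1, 2, 3, 5, 8})

# Part b: "or" is union, "and" is intersection, "but not" is assumed to be difference.
q1 = (postings(index, 1) | postings(index, 5)) - (postings(index, 3) | postings(index, 2))
q2 = (postings(index, 1) & postings(index, 5)) | (postings(index, 3) & postings(index, 2))

for term in sorted(index):
    print(f"t{term}: {index[term]}")
print("q1:", sorted(q1))
print("q2:", sorted(q2))
```

Python sets keep the sketch short. A real engine would typically merge the sorted posting lists directly instead of materializing sets.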

3. For TF-IDF, explain the problem posed by longer documents and how it can be handled.

4. List the different ways of representing documents in a vector space model.

5. Why do we normalize the vector representation of documents in the vector space model? Is it always a good idea?
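
For experimenting with questions 2, 3, and 5, the sketch below computes TF-IDF weights and then length-normalizes them. It is only one of many possible variants and makes several assumptions: term frequency is the raw count, inverse document frequency is log(N / df), and normalization divides by the Euclidean (L2) length. The toy documents and the function names `tf_idf_vectors`, `l2_normalize`, and `dot` are hypothetical and not taken from the lecture.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts (assumed TF variant)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def l2_normalize(vec):
    """Scale a sparse vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return dict(vec)
    return {t: w / norm for t, w in vec.items()}


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Hypothetical toy collection: long_doc is short_doc repeated three times.
short_doc = ["web", "search"]
long_doc = short_doc * 3
other_doc = ["vector", "space", "model", "web"]

raw = tf_idf_vectors([short_doc, long_doc, other_doc])
unit = [l2_normalize(v) for v in raw]

print("raw weight of 'search':", raw[0]["search"], raw[1]["search"])
print("normalized weight of 'search':", unit[0]["search"], unit[1]["search"])
print("cosine similarity short vs. long:", dot(unit[0], unit[1]))
```

In this toy setup the longer document receives larger raw weights simply because it repeats the same text, and the normalized vectors make that effect visible. Whether discarding length information is desirable is left open, as question 5 asks.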