Information Retrieval &
Web Search Engines
Lecture 2: More Retrieval Models
Please note: The exercises will not be collected, corrected, or graded.
Even though the homework assignments are optional, you are encouraged to answer them and to discuss them in small groups, as they will help you prepare for your final exam.
1. Given the following subset of documents from a document collection:
D1 = {t1, t5, t9}
D2 = {t1, t2, t4, t5, t9}
D3 = {t3, t6, t7, t8}
D4 = {t4, t5, t10}
D5 = {t3, t5, t6, t7}
D6 = {t1, t2, t10}
a. Create an inverted index for the following terms: t1, t2, t3, t5, t8
b. According to the Boolean model we reviewed in Lecture 1: Introduction, evaluate the following queries:
i. q1 = (t1 or t5) but not (t3 or t2)
ii. q2 = (t1 and t5) or (t3 and t2)
2. Explain and discuss the idea behind TF-IDF.
3. For TF-IDF, explain the problem posed by longer documents and how it can be handled.
4. List the different ways of representing documents in a vector space model.
5. Why do we normalize the vector representation of documents in the vector space model? Is it always a good idea?
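As a starting point for checking your own answers, the following is a minimal Python sketch of the techniques these exercises touch on: an inverted index, Boolean queries as set operations on posting lists, and TF-IDF vectors with optional cosine normalization. It runs on a hypothetical toy collection, not the documents from exercise 1. The function names and the weighting variant (raw term frequency times log(N / df)) are assumptions chosen for illustration; the lecture may define the weights differently.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy collection (NOT the documents from exercise 1).
docs = {
    "A": ["x", "y", "x"],
    "B": ["y", "z"],
    "C": ["x", "z", "z", "z"],
}


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def postings(index, term):
    """Return the posting list of a term as a set (empty if the term is unknown)."""
    return set(index.get(term, []))


def tf_idf_vectors(docs, normalize=True):
    """Assumed variant: weight = raw tf * log(N / df), optionally cosine-normalized."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for terms in docs.values() for term in set(terms))
    vectors = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        vec = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        if normalize:
            # Divide by the Euclidean length so that long documents do not
            # dominate purely because they contain more term occurrences.
            length = math.sqrt(sum(w * w for w in vec.values()))
            if length > 0:
                vec = {t: w / length for t, w in vec.items()}
        vectors[doc_id] = vec
    return vectors


index = build_inverted_index(docs)

# Boolean operators map onto set operations on posting lists.
x_or_y = postings(index, "x") | postings(index, "y")       # union
x_and_z = postings(index, "x") & postings(index, "z")      # intersection
x_but_not_y = postings(index, "x") - postings(index, "y")  # difference

vectors = tf_idf_vectors(docs)
```

Calling tf_idf_vectors(docs, normalize=False) on the same collection makes it easy to compare weights with and without length normalization, which is the comparison questions 3 and 5 ask about.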