Academic year: 2022


Homework 3: Representing Simple Documents

Benjamin Roth, Marina Sedinkina
Symbolische Programmiersprache
Due: Thursday, November 7, 2019, 16:00

In this exercise you will:

• Implement a simple document class.

• Get experience using the unittest framework.

You can monitor your progress by running the following command from the src directory:

python3 -m unittest hw03_documents/test_documents.py
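For readers new to unittest, the following minimal sketch shows the general shape of a test case. The function normalize_whitespace and the class NormalizeWhitespaceTest are hypothetical names invented for this illustration; they are not taken from the homework's test_documents.py.

```python
import unittest


def normalize_whitespace(text):
    """Hypothetical helper: collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


class NormalizeWhitespaceTest(unittest.TestCase):
    """Every method whose name starts with 'test' is run as one test."""

    def test_collapses_inner_whitespace(self):
        self.assertEqual(normalize_whitespace("a   b\tc"), "a b c")

    def test_strips_outer_whitespace(self):
        self.assertEqual(normalize_whitespace("  hello  "), "hello")


def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        NormalizeWhitespaceTest)
    return unittest.TextTestRunner(verbosity=2).run(suite)
```

When such a class lives in a file, python3 -m unittest followed by the file path discovers and runs it; run_tests() is only a convenience for running it from an interactive shell.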

Exercise 1: TextDocument class [10 points]

1. Implement the helper method word_tokenize that takes a string and returns a list of tokens (converted to lower case). Use nltk for tokenization.

2. Complete the constructor for TextDocument. You need to add word_to_count, a dictionary that maps every word to the number of its occurrences in this document.

3. Complete the class method from_file, which creates a document by reading a file and calls the constructor with the text read from the file (and the filename as its id).

4. Implement the __str__ method. It should return a string representation that is at most 25 characters long. If the original text is longer than 25 characters, the last 3 characters of the short string should be “...”. For example, the document text:

“Dr. Strangelove is the U.S. President’s advisor.”

should yield the str representation:

“Dr. Strangelove is the...”

5. Complete the function word_overlap, which determines the number of words that occur in both documents (self and other_doc). Every word should be counted only once, irrespective of how often it occurs in either document (i.e. we consider word types). In other words, this should return the size of the intersection of the word sets of both documents.
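The five steps above can be sketched as follows. This is one possible reading of the task, not the reference solution: the constructor signature (text, id), the attribute names text and id, the UTF-8 file encoding, and the regex fallback used when nltk or its 'punkt' data is unavailable are all assumptions made for this illustration.

```python
import re
from collections import Counter


def word_tokenize(text):
    """Return the lower-cased tokens of `text`, using nltk when available."""
    try:
        import nltk
        tokens = nltk.word_tokenize(text)
    except (ImportError, LookupError):
        # Assumption for this sketch only: fall back to a crude regex split
        # if nltk or its tokenizer data is missing. The homework asks for nltk.
        tokens = re.findall(r"\w+|[^\w\s]", text)
    return [token.lower() for token in tokens]


class TextDocument:
    def __init__(self, text, id=None):
        # Assumed signature: the task only says the constructor receives
        # the text and, when read from a file, the filename as its id.
        self.text = text
        self.id = id
        # Map every token to the number of times it occurs in this document.
        self.word_to_count = dict(Counter(word_tokenize(text)))

    @classmethod
    def from_file(cls, filename):
        """Create a document from a file, using the filename as its id."""
        # UTF-8 is an assumption; the task does not name an encoding.
        with open(filename, encoding="utf-8") as infile:
            text = infile.read()
        return cls(text, id=filename)

    def __str__(self):
        """Return at most 25 characters; longer texts end in '...'."""
        if len(self.text) <= 25:
            return self.text
        return self.text[:22] + "..."

    def word_overlap(self, other_doc):
        """Return the number of word types shared by both documents."""
        return len(set(self.word_to_count) & set(other_doc.word_to_count))
```

Taking the dictionary keys as sets makes the overlap independent of how often a word occurs, which is what counting word types requires. Slicing to 22 characters before appending the three dots keeps the result at exactly 25 characters for long texts.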


Using NLTK

If you work on the CIP pool computers, nltk should already be installed. You may still have to download the resource 'punkt':

1. Open the Python interactive shell:

python3

2. Then execute the following commands:

>>> import nltk

>>> nltk.download('punkt')

If you use your own computer:

• Unix (with Python3):

sudo apt-get install python3-pip
sudo pip3 install -U nltk

Test the installation:

python3

>>> import nltk

• Windows: http://www.nltk.org/install.html

• If you encounter difficulties, ask fellow students or the tutors.
