
This section has presented the data and annotation that serve as input to deFuser. All the data is annotated with sentence boundaries, parts of speech and syntactic dependencies.

The German biography corpora are also semantically annotated. All but one corpus (TüBa-D/Z) are annotated automatically. The annotation pipeline applied to the German data consists of off-the-shelf tools as well as the lexicon we extracted from Wikipedia with little effort. We have also described a few heuristics which enhance the annotation.

The fusion algorithm (Chapter 4) is tested on CoCoBi (Sec. 2.1.1). The compression algorithm (Chapter 8) is evaluated on the English compression corpus (Sec. 2.2.1) and on TüBa-D/Z (Sec. 2.1.3).

12 The version from October 26, 2008.

13 The label cop is excluded.

LABEL   DESCRIPTION                         LABEL   DESCRIPTION
adv     adverbial modifier                  obja2   second accusative object
app     apposition                          objc    clausal object
attr    noun attribute                      objd    dative object
aux     auxiliary verb                      objg    genitive object
avz     verb prefix                         obji    infinitive object
cj      conjunction                         objp    prepositional object
det     determiner                          par     “parenthesis” (intervening clause)
eth     dative subordination                part    subordinate particle
expl    expletive                           pn      noun object for prepositions
gmod    genitive modifier                   pp      prepositional modifier
grad    degree modifier                     pred    predicate
kom     comparative                         punct   punctuation
kon     conjuncts                           rel     relative clause
konj    subordinate conjunction             s       root
neb     subordinate clause                  subj    subject
np2     logical subject in coordination     subjc   clausal subject
obja    accusative object                   zeit    temporal expression

Table 2.3: Set of dependency relations assigned by WCDG


LABEL      DESCRIPTION                             LABEL       DESCRIPTION
dep*       dependent                               amod        adjectival modifier
aux        auxiliary                               appos       appositional modifier
auxpass    passive auxiliary                       advcl       adverbial clause modifier
arg*       argument                                purpcl      purpose clause modifier
agent      agent                                   det         determiner
comp       complement                              predet      predeterminer
acomp      adjectival complement                   preconj     preconjunct
attr       attribute                               infmod      infinitival modifier
ccomp      clausal complement with internal subj   partmod     participial modifier
xcomp      clausal complement with external subj   advmod      adverbial modifier
compl      complementizer                          neg         negation modifier
obj        object                                  rcmod       relative clause modifier
dobj       direct object                           quantmod    quantifier modifier
iobj       indirect object                         tmod        temporal modifier
pobj       prepositional object                    measure     measure phrase modifier
mark       word introducing advcl                  nn          noun compound modifier
rel        word introducing relative clause        num         numeric modifier
subj       subject                                 number      part of compound number
nsubj      nominal subject                         prep        prepositional modifier
nsubjpass  passive nominal subject                 poss        possession modifier
csubj      clausal subject                         possessive  possessive ’s
csubjpass  passive clausal subject                 prt         phrasal verb particle
cc         coordination                            parataxis   parataxis
conj       conjunct                                punct       punctuation
expl       expletive                               ref         referent
mod*       modifier                                sdep        semantic dependent
abbrev     abbreviation modifier                   xsubj       controlling subject

Table 2.4: Set of dependency relations assigned by the Stanford parser

Chapter 3

Grouping Related Sentences

Sentence fusion methods operate on similar sentences. In our work, we define these as sentences sharing a significant portion of their content, although for some applications it might be reasonable to group sentences which share only an NP. For example, two sentences sharing the agent, as in (3.1) and (3.2), can be fused so that one of the input sentences becomes a relative clause (3.3):

(3.1) Paul Krugman was awarded the 2008 Nobel Prize in economics.

(3.2) Mr. Krugman is an Op-Ed columnist for The New York Times.

(3.3) Paul Krugman, who was awarded the 2008 Nobel Prize in economics, is an Op-Ed columnist for The New York Times.

Such a fusion might be particularly useful in the context of single-document summarization because, within a single document, sentences seldom share more than one NP and highly similar sentences are very unusual. However, even a simple conversion of a main clause into a relative one requires the use of words not present in the input sentences (e.g., who in Ex. 3.3).

Apart from that, this is a relatively unchallenging transformation which is unlikely to produce an ungrammatical sentence, provided that the input is correct. In our work we want to group together or align sentences which are more tightly related.

Ideally, one should align sentences which concern the same event and overlap in propositions. Unfortunately, on the implementation side, this means that such an approach would require an analysis deeper than what can be achieved with existing semantic tools and methods. In this respect, shallow methods such as word or bigram overlap, (weighted) cosine or Jaccard similarity are appealing as they are cheap and robust. The approach we undertake relies on such a shallow similarity measure (Section 3.2) and clusters similar sentences based on their pairwise similarity. The clustering algorithm is explained in Section 3.3.
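To make the shallow measures concrete, the following is a minimal sketch of word-overlap-based Jaccard and cosine similarity between two sentences. The whitespace tokenizer is a deliberate simplification; a real pipeline would use proper tokenization and possibly term weighting:

```python
from collections import Counter
from math import sqrt

def tokens(sentence):
    """Lowercased word tokens; a stand-in for a proper tokenizer."""
    return sentence.lower().split()

def jaccard(s1, s2):
    """Jaccard similarity over word sets: |A intersect B| / |A union B|."""
    a, b = set(tokens(s1)), set(tokens(s2))
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(s1, s2):
    """Cosine similarity over term-frequency vectors."""
    a, b = Counter(tokens(s1)), Counter(tokens(s2))
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Both measures return values in [0, 1] and require no training data, which is what makes them attractive in the absence of an annotated corpus.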

3.1 Related Work

The task of identifying similar sentences in a pair of comparable documents is akin to that of aligning sentences in parallel corpora, which is necessary for training statistical machine translation (SMT) systems. However, sentence alignment for comparable corpora requires methods different from those used in SMT for parallel corpora. There, the alignment methods rely on the premise that a translation presents the information in the same order as the source language, so the alignment search is limited to a window of a few sentences. In contrast, given two biographies of a person, one of them may follow the timeline from birth to death whereas the other may group events thematically or cover only the scientific contribution of the person. Thus, one cannot assume that the sentence order or the content is the same in two biographies, and methods developed for parallel corpora are hardly applicable here.

Hatzivassiloglou et al. (1999, 2001) introduce SimFinder – a clustering tool for summarization systems which organizes similar text fragments into clusters. SimFinder utilizes a set of linguistic features (e.g., WordNet synsets, syntactic dependencies) and relies on annotated data to compute the similarity of two text pieces. Once pairwise similarities are computed, it uses a non-hierarchical clustering algorithm (Späth, 1985) to build clusters. In an evaluation with human judges, SimFinder achieved about 50% precision and 53% recall (Hatzivassiloglou et al., 1999). Since we do not have an annotated corpus at our disposal, we are looking into unsupervised methods of computing text similarity.

As a part of the DAESO project1, a number of shallow similarity measures were evaluated with regard to how well they can identify related sentences in a comparable corpus of Dutch news. According to the presentation available on the project website2, weighted cosine similarity provides good results (an F-measure of 60%). This gives us confidence that reasonable clusters can be obtained even with a shallow unsupervised method.

Nelken & Shieber (2006) address the task of aligning sentences in comparable documents:

the gospels of the New Testament (Matthew, Mark and Luke) and articles from the Encyclopedia Britannica. In particular, they demonstrate the efficacy of a sentence-based tf.idf score when applied to such very different comparable corpora. They further improve the accuracy by integrating sentence ordering into their algorithm, which is similar across the related documents in their data. For example, the encyclopedia data they use, collected and annotated by Barzilay & Elhadad (2003), consists of pairs of articles – a comprehensive and an elementary one – which all come from the Encyclopedia Britannica. Our articles come from very different sources, and the organization of biographies differs considerably even for the same person. Interestingly, their arguably simpler method outperforms the supervised SimFinder.
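The sentence-level tf.idf idea can be sketched as follows. This is our own minimal illustration, not Nelken & Shieber's implementation: each sentence is treated as its own "document" when computing idf, so words shared by many sentences contribute little, while rarer shared words drive the similarity:

```python
from collections import Counter
from math import log, sqrt

def tfidf_vectors(sentences):
    """Build a tf.idf vector per sentence, treating each sentence
    as a separate 'document' for the idf statistics."""
    token_lists = [s.lower().split() for s in sentences]
    n = len(token_lists)
    df = Counter()  # in how many sentences each word occurs
    for toks in token_lists:
        df.update(set(toks))
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)
        vectors.append({w: tf[w] * log(n / df[w]) for w in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse tf.idf vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Note that a word occurring in every sentence gets an idf of zero and thus no influence, which is exactly the behavior that makes the score robust against function words without a stop list.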

1 See http://www.daeso.nl.

2 http://daeso.uvt.nl/downloads/atila08.pdf