This section has presented the data and annotation which served as input to deFuser. All the data is annotated with sentence boundaries, parts of speech and syntactic dependencies.
The German biography corpora are also semantically annotated. All but one corpus (TüBa-D/Z) are annotated automatically. The annotation pipeline applied to the German data consists of off-the-shelf tools as well as the lexicon we extracted from Wikipedia with little effort. A few heuristics which enhance the annotation have also been described.
The fusion algorithm (Chapter 4) is tested on CoCoBi (Sec. 2.1.1). The compression algorithm (Chapter 8) is evaluated on the English compression corpus (Sec. 2.2.1) and on TüBa-D/Z (Sec. 2.1.3).
12 The version from October 26, 2008.
13 The label cop is excluded.
LABEL   DESCRIPTION                          LABEL   DESCRIPTION
adv     adverbial modifier                   obja2   second accusative object
app     apposition                           objc    clausal object
attr    noun attribute                       objd    dative object
aux     auxiliary verb                       objg    genitive object
avz     verb prefix                          obji    infinitive object
cj      conjunction                          objp    prepositional object
det     determiner                           par     “parenthesis” (intervening clause)
eth     dative subordination                 part    subordinate particle
expl    expletive                            pn      noun object for prepositions
gmod    genitive modifier                    pp      prepositional modifier
grad    degree modifier                      pred    predicate
kom     comparative                          punct   punctuation
kon     conjuncts                            rel     relative clause
konj    subordinate conjunction              s       root
neb     subordinate clause                   subj    subject
np2     logical subject in coordination      subjc   clausal subject
obja    accusative object                    zeit    temporal expression
Table 2.3: Set of dependency relations assigned by WCDG
LABEL       DESCRIPTION                                  LABEL        DESCRIPTION
dep*        dependent                                    amod         adjectival modifier
aux         auxiliary                                    appos        appositional modifier
auxpass     passive auxiliary                            advcl        adverbial clause modifier
arg*        argument                                     purpcl       purpose clause modifier
agent       agent                                        det          determiner
comp        complement                                   predet       predeterminer
acomp       adjectival complement                        preconj      preconjunct
attr        attribute                                    infmod       infinitival modifier
ccomp       clausal complement with internal subj        partmod      participial modifier
xcomp       clausal complement with external subj        advmod       adverbial modifier
compl       complementizer                               neg          negation modifier
obj         object                                       rcmod        relative clause modifier
dobj        direct object                                quantmod     quantifier modifier
iobj        indirect object                              tmod         temporal modifier
pobj        prepositional object                         measure      measure phrase modifier
mark        word introducing advcl                       nn           noun compound modifier
rel         word introducing relative clause             num          numeric modifier
subj        subject                                      number       part of compound number
nsubj       nominal subject                              prep         prepositional modifier
nsubjpass   passive nominal subject                      poss         possession modifier
csubj       clausal subject                              possessive   possessive ’s
csubjpass   passive clausal subject                      prt          phrasal verb particle
cc          coordination                                 parataxis    parataxis
conj        conjunct                                     punct        punctuation
expl        expletive                                    ref          referent
mod*        modifier                                     sdep         semantic dependent
abbrev      abbreviation modifier                        xsubj        controlling subject
Table 2.4: Set of dependency relations assigned by the Stanford parser
Chapter 3
Grouping Related Sentences
Sentence fusion methods operate on similar sentences. In our work, we define these as sentences sharing a significant portion of their content, although for some applications it might be reasonable to group sentences which share only an NP. For example, two sentences sharing the agent, as in (3.1) and (3.2), can be fused so that one of the input sentences becomes a relative clause (3.3):
(3.1) Paul Krugman was awarded the 2008 Nobel Prize in economics.
(3.2) Mr. Krugman is an Op-Ed columnist for The New York Times.
(3.3) Paul Krugman, who was awarded the 2008 Nobel Prize in economics, is an Op-Ed columnist for The New York Times.
Such a fusion might be particularly useful in the context of single-document summarization because, in a single document, sentences seldom share more than one NP and highly similar sentences are indeed very unusual. However, even a simple conversion of a main clause into a relative one requires the use of words not present in the input sentences (e.g., who in Ex. 3.3).
Apart from that, this is a relatively unchallenging transformation which is unlikely to produce an ungrammatical sentence, provided that the input is correct. In our work we want to group together, or align, sentences which are more tightly related.
Ideally, one should align sentences which concern the same event and overlap in propositions. Unfortunately, on the implementation side, such an approach would require an analysis deeper than what can be achieved with existing semantic tools and methods. In this respect, shallow methods such as word or bigram overlap, (weighted) cosine or Jaccard similarity are appealing as they are cheap and robust. The approach we take relies on such a shallow similarity measure (Section 3.2) and clusters similar sentences based on their pairwise similarity. The clustering algorithm is explained in Section 3.3.
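To make the shallow measures concrete, the following is a minimal sketch (not the implementation used in this thesis) of Jaccard similarity over word types and cosine similarity over word counts with optional per-word weights; the tokenization by whitespace and the lowercasing are simplifying assumptions:

```python
import math
from collections import Counter


def jaccard(s1: str, s2: str) -> float:
    """Jaccard similarity: shared word types / all word types."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def weighted_cosine(s1: str, s2: str, weights=None) -> float:
    """Cosine similarity over word counts; `weights` maps a word to its
    weight (e.g., idf), defaulting to 1.0 for words not listed."""
    c1, c2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    w = weights or {}
    v1 = {t: n * w.get(t, 1.0) for t, n in c1.items()}
    v2 = {t: n * w.get(t, 1.0) for t, n in c2.items()}
    dot = sum(x * v2.get(t, 0.0) for t, x in v1.items())
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Sentence pairs whose score exceeds a threshold can then serve as input to the clustering step described in Section 3.3.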
3.1 Related Work
The task of identifying similar sentences in a pair of comparable documents is akin to that of aligning sentences in parallel corpora, which is necessary for training statistical machine translation (SMT) systems. However, sentence alignment for comparable corpora requires methods different from those used in SMT for parallel corpora. There, the alignment methods rely on the premise that a translation presents the information in the same order as the source; the alignment search is thus limited to a window of a few sentences. In contrast, given two biographies of a person, one of them may follow the timeline from birth to death whereas the other may group events thematically or tell only about the scientific contribution of the person. Thus, one cannot assume that the sentence order or the content is the same in two biographies, and methods developed for parallel corpora are hardly applicable here.
Hatzivassiloglou et al. (1999, 2001) introduce SimFinder – a clustering tool for summarization systems which organizes similar text fragments into clusters. SimFinder utilizes a set of linguistic features (e.g., WordNet synsets, syntactic dependencies) and relies on annotated data to compute the similarity of two text pieces. Once pairwise similarities are computed, it uses a non-hierarchical clustering algorithm (Späth, 1985) to build clusters. In an evaluation with human judges, SimFinder achieved about 50% precision and 53% recall (Hatzivassiloglou et al., 1999). Since we do not have an annotated corpus at our disposal, we are looking into unsupervised methods of computing text similarity.
As part of the DAESO project1, a number of shallow similarity measures were evaluated with regard to how well they can identify related sentences in a comparable corpus of Dutch news. According to the presentation available on the project website2, weighted cosine similarity provides good results (an F-measure of 60%). This gives us confidence that reasonable clusters can be obtained even with a shallow unsupervised method.
Nelken & Shieber (2006) address the task of aligning sentences in comparable documents:
the gospels of the New Testament (Matthew, Mark and Luke) and articles from the Encyclopedia Britannica. In particular, they demonstrate the efficacy of a sentence-based tf.idf score when applied to such very different comparable corpora. They further improve the accuracy by integrating sentence ordering into their algorithm, as the ordering is similar across the related documents in their data. For example, the encyclopedia data they use, collected and annotated by Barzilay & Elhadad (2003), consists of pairs of articles – a comprehensive and an elementary one – which all come from the Encyclopedia Britannica. Our articles come from very different sources, and the organization of biographies differs considerably even for the same person. Interestingly, their arguably simpler method outperforms the supervised SimFinder.
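The core idea of a sentence-based tf.idf score can be sketched by treating each sentence as its own "document". The greedy best-match pairing, the threshold of 0.1 and all function names below are illustrative assumptions, not Nelken & Shieber's exact formulation (which, in particular, also models sentence ordering):

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Treat each sentence as a 'document' and build tf.idf vectors."""
    n = len(sentences)
    tokenized = [s.lower().split() for s in sentences]
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(toks).items()}
            for toks in tokenized]


def cosine(v1, v2):
    """Cosine similarity between two sparse tf.idf vectors."""
    dot = sum(x * v2.get(t, 0.0) for t, x in v1.items())
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def align(doc_a, doc_b, threshold=0.1):
    """Pair each sentence of doc_a with its most similar sentence in
    doc_b, ignoring sentence order entirely (unlike parallel-corpus
    aligners, which restrict the search to a small window)."""
    vecs = tfidf_vectors(doc_a + doc_b)
    va, vb = vecs[:len(doc_a)], vecs[len(doc_a):]
    pairs = []
    for i, v in enumerate(va):
        best, j = max((cosine(v, w), j) for j, w in enumerate(vb))
        if best >= threshold:
            pairs.append((i, j))
    return pairs
```

Because idf is computed over sentences rather than documents, words shared by nearly all sentences (e.g., function words) are automatically downweighted, which is what makes the score robust without any annotated data.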
1 See http://www.daeso.nl.
2 http://daeso.uvt.nl/downloads/atila08.pdf