2 The State of the Art

be applied to many languages for which there is no pre-existing high-precision database of negative polarity items.

All this work has been done at the lexical level, a different granularity from the one considered in this dissertation. To the best of our knowledge, in the context of RTE, no distinction has been made between directional entailment and paraphrase. We elaborate on this issue in Chapter 6.

2.5 Performance of the Existing Systems

The main evaluation metric for RTE systems is accuracy, i.e., the percentage of system judgments that match the gold standard compiled by human assessors. Currently, other measurements, such as efficiency, are not the focus of the community.
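The accuracy metric described above can be sketched in a few lines. This is a minimal illustration, not the official scorer of the RTE challenges; the function name, the label strings, and the five example judgments are assumptions made up for the example.

```python
def accuracy(system_judgments, gold_standard):
    """Fraction of T-H pairs on which the system label equals the gold label."""
    if len(system_judgments) != len(gold_standard):
        raise ValueError("label lists must have the same length")
    if not gold_standard:
        raise ValueError("cannot score an empty dataset")
    matches = sum(s == g for s, g in zip(system_judgments, gold_standard))
    return matches / len(gold_standard)


# Hypothetical two-way labels for five T-H pairs (illustrative data only).
gold = ["ENTAILMENT", "NON-ENTAILMENT", "ENTAILMENT", "NON-ENTAILMENT", "ENTAILMENT"]
system = ["ENTAILMENT", "ENTAILMENT", "ENTAILMENT", "NON-ENTAILMENT", "NON-ENTAILMENT"]

# Three of the five judgments match the gold standard.
print(accuracy(system, gold))
```

The same function applies unchanged to three-way annotated data, since it only compares labels for equality.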

Based on the intuition that entailment is related to the similarity between text and hypothesis, Mehdad and Magnini (2009) provide several RTE baselines built on the bag-of-words (BoW) representation and different similarity measures, estimated as the degree of word overlap between T and H. On the RTE-3 dataset, accuracy ranges from 0.585 to 0.625 across settings, while on the RTE-4 dataset it ranges from 0.510 to 0.587. Both sets of results are on the two-way annotated data (Entailment vs. non-entailment).
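To make the idea of a word-overlap baseline concrete, the following sketch scores a T-H pair by the fraction of hypothesis word types that also occur in the text and thresholds that score. It is not the exact configuration of Mehdad and Magnini (2009): the tokenizer, the normalization by hypothesis length, the threshold of 0.75, and the example sentences are all assumptions chosen for illustration.

```python
import re


def tokens(sentence):
    """Lowercased bag of word types (a set), ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def word_overlap(text, hypothesis):
    """Share of hypothesis word types that also appear in the text."""
    h_tokens = tokens(hypothesis)
    if not h_tokens:
        return 0.0
    return len(h_tokens & tokens(text)) / len(h_tokens)


def judge(text, hypothesis, threshold=0.75):
    """Two-way decision; the threshold is a hypothetical value, normally tuned on development data."""
    if word_overlap(text, hypothesis) >= threshold:
        return "ENTAILMENT"
    return "NON-ENTAILMENT"


# Illustrative pair, not taken from any RTE dataset.
T = "The cat sat on the mat in the kitchen."
print(judge(T, "The cat sat on the mat."))   # every hypothesis word occurs in T
print(judge(T, "A dog slept outside."))      # no hypothesis word occurs in T
```

Because the score ignores word order and negation, a baseline of this kind can be fooled by hypotheses that reuse the words of the text with a different meaning, which is consistent with the modest accuracies reported above.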

As for system performance in the yearly RTE challenges, the average accuracy of the participating systems is around 60% on the two-way annotated data, with a drop of 5-10% on the three-way annotated data (Entailment, Contradiction, and Unknown). The full results can be found in the overview papers of the challenges (Giampiccolo et al., 2007, 2009, Bentivogli et al., 2009). The results of the top five participating systems¹⁰ are listed in Table 2.2 (two-way annotation) and Table 2.3 (three-way annotation); our results are shown in bold.

| Rank | RTE-3 System | RTE-3 Accuracy | RTE-4 System | RTE-4 Accuracy | RTE-5 System | RTE-5 Accuracy |
|------|--------------|----------------|--------------|----------------|--------------|----------------|
| 1    | Hickl        | 0.800          | Hickl        | 0.746          | Iftene       | 0.735          |
| 2    | Tatu         | 0.723          | Iftene       | 0.721          | **Wang**     | **0.685**      |
| 3    | Iftene       | 0.691          | **Wang**     | **0.706**      | Li           | 0.670          |
| 4    | Adams        | 0.670          | Li           | 0.659          | Mehdad       | 0.662          |
| 5    | **Wang**     | **0.669**      | Balahur      | 0.608          | Sammons      | 0.643          |

Table 2.2: Top five participating systems in the RTE challenges (two-way annotation)

| Rank | RTE-4 System | RTE-4 Accuracy | RTE-5 System | RTE-5 Accuracy |
|------|--------------|----------------|--------------|----------------|
| 1    | Iftene       | 0.685          | Iftene       | 0.683          |
| 2    | Siblini      | 0.616          | **Wang**     | **0.637**      |
| 3    | **Wang**     | **0.614**      | Ferrández    | 0.600          |
| 4    | Li           | 0.588          | Malakasiotis | 0.575          |
| 5    | Mohammad     | 0.556          | Breck        | 0.570          |

Table 2.3: Top five participating systems in the RTE challenges (three-way annotation)

¹⁰ We use the first author's last name to identify each participating system, and we keep that identifier even if the author list changed later.

2.6 Applications

One of the original motivations for RTE is to provide a generic semantic engine that serves other NLP tasks. In practice, RTE systems have also been widely used as components of other systems.

Harabagiu and Hickl (2006) demonstrated how RTE systems can be used to enhance the accuracy of open-domain question answering (QA) systems. Their experiments showed that when textual entailment information was used to either filter or rank the answers returned by a QA system, accuracy increased by as much as 20% overall. Celikyilmaz and Thint (2008) used an RTE module to rank the retrieved passages and sentences by matching the semantic information contained in the retrieved sentences against the given questions. Highly ranked sentences are likely to contain the answer phrases.

Roth et al. (2009) defined the problem of recognizing entailed relations (given an open set of relations, find all occurrences of the relations of interest in a given document set) and posed it as a challenge for scalable information extraction and retrieval. They argued that textual entailment was necessary to address common problems: supervised methods do not scale easily, while unsupervised and semi-supervised methods are restricted to frequent, explicit, highly localized patterns. They implemented a solution showing that an RTE system can be scaled to a much larger information extraction problem than that represented by the RTE challenges.

Mirkin et al. (2009b) addressed the task of handling unknown terms in statistical machine translation (MT). They proposed using source-language monolingual models and resources to paraphrase the source text prior to translation, and they allowed translations of entailed texts rather than of paraphrases only. Their experiments showed that the proposed approach substantially increased the number of properly translated texts.

Instead of improving the MT systems, Padó et al. (2009a) proposed a metric that evaluated MT output based on a rich set of features motivated by textual entailment, such as lexical-semantic (in-)compatibility and argument structure overlap. They compared that metric against a combination metric of four state-of-the-art scores in two different settings. The combination metric outperformed the individual scores but was beaten by the entailment-based metric.

Many systems participating in the Answer Validation Exercise (AVE) at the Cross Language Evaluation Forum (CLEF) (Peñas et al., 2007, Rodrigo et al., 2008) used RTE systems as core engines. The AVE task asked the participating systems to validate the answers output by QA systems, and it can easily be transformed into an RTE problem by combining question-answer pairs into Hs and taking the documents as Ts.

We also participated in the exercises and achieved the best result for English (Wang and Neumann, 2007b) and for German (Wang and Neumann, 2008a).
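The transformation from an AVE instance to an RTE pair described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the AVE systems: the `THPair` class, the `ave_to_rte` function, the `<ANSWER>` slot convention, and the example strings are all hypothetical, and the step that turns a question into a statement pattern is assumed to have been done beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class THPair:
    """A text-hypothesis pair as consumed by an RTE system."""
    text: str
    hypothesis: str


def ave_to_rte(statement_pattern, answer, supporting_document):
    """Build H from a question pattern plus candidate answer; use the document as T.

    `statement_pattern` is a declarative rewrite of the question with an
    "<ANSWER>" slot (a hypothetical convention for this sketch).
    """
    if "<ANSWER>" not in statement_pattern:
        raise ValueError('pattern must contain an "<ANSWER>" slot')
    hypothesis = statement_pattern.replace("<ANSWER>", answer)
    return THPair(text=supporting_document, hypothesis=hypothesis)


# Illustrative instance for the question "What is the capital of France?"
pair = ave_to_rte(
    "<ANSWER> is the capital of France.",
    "Paris",
    "Paris, the capital of France, lies on the Seine.",
)
print(pair.hypothesis)
```

Under this setup, an answer is validated when the RTE system judges that the supporting document entails the constructed hypothesis.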

2.7 Summary

In this chapter, we have reviewed the related work in the field. We started with data resources and knowledge resources, including the available annotated datasets and textual inference rule collections. We then described two important procedures followed by most RTE systems: meaning representation derivation and entailment relation recognition. We went through a number of RTE approaches proposed in recent years and classified them into different categories, and we also introduced several related tasks: contradiction recognition, paraphrase acquisition, and directionality recognition. Finally, we presented the state-of-the-art RTE system performance and several successful downstream applications.

In the rest of this dissertation, we describe our approaches to RTE, as well as other related tasks.

In the document Intrinsic and Extrinsic Approaches to Recognizing Textual Entailment (pages 55-59)