DiET - Diagnostic and Evaluation Tools for Natural Language Processing Applications

Klaus Netter¹, Susan Armstrong⁵, Tibor Kiss⁴, Judith Klein¹, Sabine Lehmann⁵, David Milward⁶, Sylvie Regnier-Prost², Reinhard Schäler³, Tillmann Wegst¹

¹ DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany [klaus.netter|judith.klein|tillmann.wegst@dfki.de]

² Aerospatiale, Paris [sylvie.regnier@siege.aerospatiale.fr]

³ LRC, Dublin [reinhard.schaler@ucd.ie]

⁴ IBM Germany, Heidelberg [tibor.kiss@vnet.ibm.com]

⁵ ISSCO/ETI, Geneva [susan.armstrong|sabine.lehmann@issco.unige.ch]

⁶ SRI, Cambridge [david.milward@cam.sri.com]

Abstract

The project DiET is developing a comprehensive environment for the construction, annotation and maintenance of structured reference data for the diagnosis and evaluation of NLP applications. The target user group comprises developers, professional evaluators and consultants of language technology. The system, implemented in a configurable, open client/server architecture, offers the user the possibility to construct and annotate data, freely choosing annotation types from a given set which comes with all the necessary functions for editing, displaying and storing such annotations. The project will also result in a substantial amount of structured test data, representing linguistic phenomena on the levels of morphology, syntax and discourse and annotated with information covering different linguistic and application-specific aspects, to make the data as transparent as possible and to support optimal access and retrieval. For the application to new domains, the user is also given various means for customisation. Through a process of corpus profiling, links can be established between the structured test items in the data base and related phenomena occurring in domain-specific corpora. Lexical replacement functions allow the user to adapt the vocabulary of the test items to a specific domain and terminology. The tools and data base finally allow the user to set up evaluation scenarios and to record the results of test cycles.

Introduction

With an increasing number of products based on natural language processing systems hitting the professional and mass market, a solid and reliable testing and comparative assessment of such applications and systems becomes more and more of an issue both for developers and for professional end users. While product evaluation clearly has to address a range of general aspects, such as user friendliness, documentation and product support, one specific aspect crucial for NLP applications is the testing and evaluation of the linguistic performance. Such an evaluation quite obviously involves the usage of adequate natural language reference data serving as input.¹

¹ DiET is a project in the Language Engineering Sector of the Telematics Application Programme of the European Commission (LE3-4204). For more information see the DiET homepage at http://dylan.ucd.ie/DiET/

However, given the lack of generally available reference data and support tools, each developer and user is forced to prepare a new collection of test material. Thus, language resources employed for evaluation often consist of some sample data taken from a corpus which is assumed to be representative for a specific application. Developers sometimes build up a test suite over time, through which they try to record and keep track of changes in the performance of their systems. However, whichever approach is taken, systematic evaluations are hardly ever carried out on a broad scale.

Some of the reasons for such a lack of systematicity are quite obvious: the sample data are chosen in a more or less arbitrary way² and they are rarely fed to the application in a manner which allows full control over the input/output performance of a system. The evaluator is often faced with the dilemma of depth vs. breadth of coverage. With a small set of data, a fairly detailed analysis and diagnosis can be envisaged. On the other hand, opting for sample data that are sufficiently representative can imply that they are too large and too uncontrolled to allow any detailed, objective and systematic interpretation of the test results.

At the same time, the development of test data and the carrying out of a halfway systematic and reliable evaluation scheme are extremely costly and labour-intensive, as the examples of the large American evaluation conferences (MUC, TREC, etc.) clearly show.

One of the most expensive factors is most likely the collection and preparation of test data, i.e. the provision of sufficient amounts of annotated natural language resources which can serve as training and test data. As a consequence, it is practically outside the scope of a single enterprise or institution to develop the necessary data and to carry out such an evaluation scheme.

² One exception is the use of representative corpora in Machine Translation, cf. Rayner et al., 1995.

The sharing of reference data is often hampered not only by the lack of transparent formats for data exchange, but also by the fact that reference data are often collected or constructed for specific domains and applications. Thus, only in the case of domain independent applications, such as unrestricted full-text retrieval, can the cost of customising the systems or the data be kept within reasonable limits. The reference data collected and annotated for specific applications are obviously of little use for the assessment of other applications.

While it is clearly far outside its scope to address the evaluation process as a whole, the project DiET (Diagnostic and Evaluation Tools for NLP Applications) wants to contribute to the solution of at least some of these problems. The project, which grew out of several other projects, including TSNLP (Lehmann, Oepen et al., 1996), TEMAA (Maegaard et al., 1996) and FraCaS (Cooper et al., 1996), has as its main emphasis the provision of means for a controlled and systematic construction of test data, as well as tools to customise and validate the data for specific applications and domain-specific corpora.

• DiET will provide a flexible architecture for the integration of different tools for the construction, storage, maintenance and customisation of test data.

• The tool box will give the user the means to develop test suites consisting of manually constructed structured test items representing linguistic phenomena at different levels of linguistic abstraction. However, the tools are not constrained to such types of reference data, but will also encompass facilities for the treatment and mark-up of non-structured textual corpora.

• The system will be open to a range of annotation types assigned to test data, which will give the user the possibility to put together, construct or select those types of annotations which are most relevant for a given application. Such annotations can be either entered manually or automatically through the support of some server application.

• DiET will provide means and methods for corpus profiling, which will allow the user to systematically relate constructed data in the form of test suites to textual corpora representative of a specific domain or application. Thus, the non-redundant and controlled test suites themselves can either be used as a kind of condensed and compact representation of larger corpora, or the test items can be employed as index terms into the corpora, on the basis of which authentic examples can be retrieved.

• The system will offer the functions to customise existing test suites to specific domains and applications, allowing the user to select subsets of test items and annotations and to perform lexical replacement operations to adapt test items to the vocabulary of a specific domain.

• Last but not least, DiET will result in large amounts of test data. The coverage of the data will range from syntactic phenomena (extending the basis developed in TSNLP), through morphological data, to discourse phenomena, such as ellipsis or anaphora. The data will be constructed in three different languages (English, French and German) and will be validated on the basis of three very different types of applications, including Grammar and Controlled Language Checkers, Machine Translation Systems and Translation Memories.

User Needs

The user needs defined within DiET were elaborated from the perspective of users as developers of NLP software and users as consultants on the adequacy of NLP software for given applications. Three different types of users are involved in the DiET project: (i) industrial developers, employing test data for quality assurance and version control, (ii) large-scale industrial users, interested in the comparative evaluation of different products and in the progress evaluation of systems applied in house, and (iii) institutions providing professional consultancy services for external customers, employing evaluation tools for carrying out these services.

The usage of diagnostic and evaluation tools is defined relative to the applications to be evaluated by the users, e.g. components of machine translation systems, translation memories, grammar and controlled language checkers.

In the evaluation of a knowledge-based system, for example, the most prominent evaluation types are diagnosis and progress evaluation. The main evaluation criterion will be the compliance of the structures analysed with respect to some predetermined expectations about the structures to be analysed.

As part of a consulting service, diagnostic evaluation tools are used to measure specific properties of the system, while tools for adequacy evaluation are needed to assess the performance of systems against a set of domain- and application-specific characteristics.

Developers are more interested in progress evaluation, while consultants are more interested in adequacy evaluation. For both tasks, however, the availability of diagnostic tools is a strong prerequisite.

There is an important user need for methods and tools which make it possible for a user to adapt a generic reference test suite to domain corpora and to applications, as well as to provide assistance in the analysis and interpretation of evaluation results. Depending on the evaluation scenario, a user may want to position and/or enrich a generic test suite vis-à-vis the salient characteristics of a corpus, using corpus profiling tools and techniques, or he may concentrate on the adaptation of the test data to a specific application or set of applications, thus matching the test set to the salient characteristics of the application under evaluation. The evaluation methodology and the associated annotated data and tools used will differ according to the kind of technology employed in the application. Ideally, diagnostic and evaluation tools must be flexible enough to accommodate both linguistic representations, such as a tree structure, and non-linguistic output, such as a set of strings related to one another in a predefined manner (cf. Kiss et al. 1997). The tools should provide the basis for comparison of system behaviour on both levels. The ultimate goal is to provide relevant and reliable information and performance measurements on the actual and expected behaviour of a variety of NLP systems under evaluation.

Architecture and Design

The objective of DiET is to offer a flexible instrument and tool set which is open enough to cover the needs and requirements of very different types of users who wish to employ the system for the diagnosis and evaluation of a range of applications as broad as possible. In the design of the architecture of DiET utmost care has been taken to impose as few restrictions on the type of data, type of annotations and external modules for data construction as possible, while still providing a sufficiently useful and extendable platform. Thus, the architecture of DiET is designed as an open client/server architecture, with a graphical user interface as the central client for data construction, annotation and configuration, and various modules, among them a data base server and several automatic annotation tools, as servers feeding into the client.

The central data construction and annotation tool serves to enter new data and to annotate the items with attributes which can be freely chosen and configured. The test data themselves are either test items, (ordered) groups of test items, or segments from test items. Examples for these different types could be sentences or similar linguistic expressions (test items) or words, phrases, or even sub-lexical units (test item segments). Larger units, grouping together related items, can also be defined, including lists of sentences, paragraphs, discourse or dialogue sequences, sets of sentences illustrating paradigmatic variations over certain linguistic phenomena, and pairs of sentences in a translation relation. The data can be entered either manually or drawn from an external source, such that the system can, in principle, annotate isolated, manually constructed test items as well as larger amounts of user-defined corpora.

The user is free to choose the type of test material to be developed as well as the types of annotations associated to the test data. In a highly modular way, the user is offered a choice of basic annotation modes building on a fixed inventory of data types, such as trees, strings, numbers, boolean values, etc. These annotation modes are directly associated with display, storage and editing functions necessary for handling the data type. This means, for example, that as soon as the user declares the annotation as data type tree, all the relevant functions necessary to construct and edit tree-like annotations as well as the respective data base storage and search functions are activated and available to the user.

The user can define new annotation types, relating data types and annotation modes, which are given a name and can be sorted into a hierarchy of annotation types.
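To make the declaration mechanism concrete, the following is a minimal sketch in Python (all names and the API are hypothetical; the paper does not prescribe an implementation) of how annotation types built on the fixed inventory of data types might be declared and arranged in a hierarchy:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class DataType(Enum):
    """Fixed inventory of basic data types named in the text."""
    TREE = "tree"
    STRING = "string"
    NUMBER = "number"
    BOOLEAN = "boolean"
    ARC = "arc"          # e.g. for anaphoric relations

@dataclass
class AnnotationType:
    """A user-defined annotation type: a named pairing of a data type
    (and hence its editing/display/storage functions) with an optional
    parent in the annotation-type hierarchy."""
    name: str
    data_type: DataType
    parent: Optional["AnnotationType"] = None

# Hypothetical declarations mirroring the examples given in the text:
linguistic  = AnnotationType("linguistic", DataType.STRING)
syntax_tree = AnnotationType("constituent-structure", DataType.TREE, parent=linguistic)
wellformed  = AnnotationType("well-formedness", DataType.BOOLEAN, parent=linguistic)
n_parses    = AnnotationType("number-of-parses", DataType.NUMBER)
```

In such a scheme, declaring an annotation type with data type TREE is what would trigger the activation of the tree editing, display and storage functions described above.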

Annotation types can be applied to different types of linguistic objects (i.e., test items, groups of test items, etc.) and their values (if any) can be configured to be entered manually or, alternatively, to be provided through some server. Instances of such annotation types could be, for example, phrasal or relational structures over strings building on the data type tree, anaphoric relations making use of a simple data type arc, well-formedness judgements with a boolean value, the number of parses resulting from a diagnostic test run with a value of type number, etc.

As soon as these annotation types are defined, the user can select a subset of them, attribute them to a test item, and assign values to the slots. Although in principle every single test item could be attributed a different set or configuration of annotation types, one will of course pre-configure certain clusters of annotation types for specific application-oriented test suites. This will be supported by the appropriate ergonomic functionalities for memorising, copying and editing specific configurations and annotations.

While most of the described functions of declaration, selection, and data entry will be carried out in the central client module, there will also be a number of specialised and potentially decentralised servers supporting the tasks of data construction and annotation. Among the most important ones is the central data base server, in which the data will be stored and maintained. The usage of a full-fledged DBMS will make it possible above all to reconfigure and reuse data for different applications in a convenient and flexible way. Thus, the user will be offered the possibility to choose existing data from the data base and select those annotations entered by other users as appropriate for a given application. The search in the data base, the browsing and displaying of the data sets will be supported by the same mechanisms as used for data annotations, i.e., the user can choose certain annotation types (with or without values) and use this configuration of attributes as a query to the data base.
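The query mechanism described here can be sketched roughly as follows (a minimal sketch under the assumption of a simple dictionary-based item layout; the actual DBMS schema is not specified in the paper):

```python
# Query the test-item data base with a configuration of annotation
# types, with or without values (hypothetical data layout).

def query_items(items, required_annotations):
    """Return test items matching the query configuration.
    `required_annotations` maps annotation-type names to a required
    value, or to None if the annotation merely has to be present."""
    hits = []
    for item in items:
        ok = True
        for ann_name, value in required_annotations.items():
            if ann_name not in item["annotations"]:
                ok = False
                break
            if value is not None and item["annotations"][ann_name] != value:
                ok = False
                break
        if ok:
            hits.append(item)
    return hits

# e.g. all ill-formed items annotated for subject-verb agreement:
# query_items(db, {"well-formedness": False, "phenomenon": "S-V agreement"})
```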

Automatic annotation of data by servers is foreseen for standard annotation types such as POS tagging. These will be available for the three languages, as will be a morphology component to assign standardised morpho-syntactic classifications. The graphical tool for the construction of constituent structures and relational structures over test items will be supported by a language-independent parser with bootstrapping facilities, which acquires language models from existing annotations and after a short training phase makes suggestions to the user.

Data Construction and Annotations

Although the main objective of DiET is to provide flexible and user-friendly tools for the construction of test data, it will also produce a substantial amount of test data in three different languages. This is quite essential for the project in order to validate the adequacy of the approach and the usability of the tool box on a larger and more varied set of data than initially supplied by TSNLP.

Secondly, as long as there is no sufficient commercial basis, a continuous extension of the test data will most likely rely on the NLP community as a whole becoming interested in the enterprise and making a contribution by adding to the data and sharing them. This will only be feasible, however, if the basis is large enough to allow for the trading of existing data against new data.

Data construction in DiET will follow the test suite approach developed in TSNLP (Lehmann, Oepen et al., 1996). The data stored in the data base are centred around systematically constructed, paradigmatically varied structured test items, which are controlled, for example, with respect to redundancy and ambiguity. The test items are typically constructed in the form of minimal pairs or sets, with a controlled variation along one or more specific parameters defining a linguistic phenomenon, such that, in the ideal case, a set of items exhausts the logical space defined by these parameters. Such variations may also result in negative test items, which normally do not occur naturally in corpora but which may be indispensable for diagnostic evaluation. Depending on the application in mind, negative does not have to be identical with ungrammatical but may also mean inadmissible in the context of a controlled language checker, for example.

The annotations assigned to the test items serve a range of very different purposes and functions. They may be added for purely organisational and descriptive purposes, in order to support a better structuring or a more systematic construction and coverage of the data, as well as a better understanding of the phenomena and problems represented by the data. Many of these annotations will also prove useful, if not essential, for search and retrieval purposes, i.e., to help recover specific subsets from the data base. Some of the linguistic annotations recording the structure of test items will be used to establish a link between test items and textual corpora, e.g. to determine the occurrence and frequency in a corpus of certain phenomena illustrated by an item (cf. also the section on corpus profiling below). Such relations can be established by comparing the structural annotations (e.g., the POS pattern) of an item with the corresponding markup in a tagged corpus, using the former pattern as a query over the corpus annotation. Last but not least, for the evaluation process itself, items can be annotated with attributes and values representing the expected output or system results, which can then be compared with the actual results in a test run. Performance measures and statistics of test runs can also be added for future comparison.

The annotations will often serve multiple purposes. It therefore also makes sense to look at annotations relative to their content. Under this point of view, the annotations foreseen in DiET can be classified as describing (i) linguistic properties, (ii) application-specific features, (iii) corpus-related information based on corpus profiling, and (iv) evaluation-specific attributes.

Among the linguistic annotations, there are features which interpret and classify a test item as a whole, including language, information about (relative) well-formedness and the test item category (i.e. phrasal, sentence or text). Other annotations describe the morphological, syntactic and discourse structure of test items. The morphological annotations comprise information on part-of-speech at word level and assign a uniquely defined morpho-syntactic ambiguity class to the word forms. At the syntactic level – following the work in the projects TSNLP (Lehmann, Oepen et al., 1996) and NEGRA (Skut et al., 1997) – tree representations provide structural information, where the non-terminal nodes are assigned phrasal categories and the arcs grammatical functions (subject, object, modifier, etc.).

Annotations at discourse level contain information on the direction (e.g. antecedent) and type (e.g. co-reference) of semantic relations between test item segments.

All test items can be classified according to the linguistic phenomenon illustrated. The phenomenon annotations consist of the name and description of the phenomenon, its relation to other linguistic phenomena in a hierarchical phenomenon classification, and characteristic properties of the phenomenon, e.g. the features number and person for the syntactic phenomenon subject-verb agreement.

The application-specific annotations can quite trivially assign a test item to a specific application, such that the user is supported in searching for and retrieving suitable test material. However, under this heading one may also find the standard reference or expected output which can be used in the comparison phase of the evaluation procedure. For grammar and controlled language checkers, for example, the error type of ill-formed test items can be encoded and ungrammatical test items can be linked to their grammatical counterparts. The test material for parsers can be annotated with the expected number of alternative readings for ambiguous test items. For translation memories, the test items of the source language can be connected to the corresponding translation units.

Corpus-related annotations are those which establish a link between test items and domain and application specific corpora. This could be information about the frequency (and indirectly also relevance) in a given corpus of the phenomenon represented by an item, or it could even be an index pointing to the occurrences of the respective patterns in the corpus, which can then be used for retrieving such ‘real-life’ examples from the text.

Many of the evaluation-specific annotations will simply help the user to keep track of the set-up of an evaluation and of different test runs, i.e., they will describe the evaluation scenario, including the user-type (developer, user, customer), name and type of the system under evaluation, goal of the evaluation (diagnosis, progress, adequacy), conditions (black-box, glass-box), evaluated objects (e.g. parse tree, error-correction, etc.), evaluation criteria (e.g. correct syntactic analysis, correct error flag, etc.), evaluation measure (the analysis of the system results, for example in terms of recall and precision), and the quality predicates (e.g. the interpretation of the results as excellent, good, bad, etc.). Since the data base is meant to accommodate evaluation results, the annotations will also be used to record the actual output of the examined NLP system to individual items, which will allow for the comparison of the actual to the expected output (the reference annotation specified within the application- specific annotation types).

Besides these item-specific annotations, the data base will also contain meta-information about the test suite as a whole, such as listings of the vocabulary occurring in the test items, tag sets or generally the descriptive terminology employed in the annotations.

Customisation

Systematically constructed data may be useful for diagnosis but will not necessarily provide a sufficient basis for adequacy evaluation. The performance of a system should be evaluated on the basis of its intended application. To help the user prepare the test suites for a given evaluation scenario, a number of tools are foreseen to customise the data. The basic goal is to bridge the gap between the isolated, artificially constructed test items and real-life, empirically obtained data from corpora.


For a given evaluation, only those test items should be selected which are relevant for that domain. If a user has a specific corpus for which the NLP system is to be evaluated, the selection criteria should thus be defined on the basis of the salient characteristics exemplified through the data of the corpus. If the document does not, for example, contain any examples of coordination, the selected test items should not contain occurrences of that phenomenon either. To help in the selection process, tools are foreseen to extract relevant characteristics found in the corpus.
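This selection step can be sketched as a simple filter (a minimal sketch assuming a hypothetical data layout in which each test item carries a phenomenon annotation and the corpus profile yields a set of detected phenomena):

```python
def select_relevant_items(test_items, corpus_phenomena):
    """Drop test items illustrating phenomena (e.g. coordination) that
    the corpus profile did not detect in the user's corpus."""
    return [item for item in test_items
            if item["phenomenon"] in corpus_phenomena]

# profile = {"S-V agreement", "relative clause"}   # no coordination found
# suite   = select_relevant_items(db_items, profile)
```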

After selection of the relevant test suites according to the phenomena to be evaluated, the user may wish to adapt them to better reflect the actual examples contained in the corpus. Tools are also foreseen to help adapt the data, specifically with a lexical replacement program.

Corpus Profiling

The identification of the typical and salient properties of the texts is what we refer to as corpus profiling. The tools used to identify and classify the corpus characteristics will rely on shallow state-of-the-art corpus processing techniques. These include morphological analysis, part-of-speech tagging, standard statistical measurements (which can be calculated over the entire corpus or only for given localities defined according to a limited set of parameters) and general pattern matching techniques (which are basically used for the extraction of linguistically relevant units). The quality of the result will depend on the success of the shallow processing stage.

Accuracy will be much improved if the corpus is already annotated with compatible part-of-speech tags, either by hand or by a tagger trained to the specific corpus.

The corpus profile will present measurements of easily and reliably identifiable corpus characteristics that can be automatically calculated using state-of-the-art corpus analysis techniques. The corpus profile contains information about the following types of corpus properties: (i) string characteristics, (ii) lexical properties and (iii) syntactic information.

String characteristics include format codes, i.e. the identification of types of tags (titles, sub-titles, list items, etc.), their frequency and their distribution. Other relevant string properties include the orthographic properties of words (e.g. all upper case), the occurrence of alphanumeric sequences, and punctuation marks. Information regarding segment types, e.g. paragraphs, sentences and words, will also be summarised in the profile.

Lexical profiles on word forms, lemmas and their frequency can serve to identify to what degree the lexical data in the corpus matches the vocabulary used in the DiET test items. The information on lexical coverage is needed for lexical replacement, but it could for example also be used to classify common and less common nouns or give information on type-token frequency. The words can also be classified by part-of-speech. This tagging constitutes the basis for the identification of many other characteristics of the document, e.g. syntactic constructions.
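A lexical coverage check of this kind is straightforward to sketch (hypothetical data layout: the corpus as a token list, the test-suite vocabulary as a set of word forms):

```python
from collections import Counter

def lexical_coverage(corpus_tokens, test_suite_vocab):
    """Compare corpus word forms against the test-item vocabulary:
    returns the fraction of corpus tokens covered by the test-suite
    lexicon and the most frequent uncovered forms (candidates for
    lexical replacement)."""
    freq = Counter(w.lower() for w in corpus_tokens)
    vocab = {w.lower() for w in test_suite_vocab}
    covered = sum(n for w, n in freq.items() if w in vocab)
    uncovered = Counter({w: n for w, n in freq.items() if w not in vocab})
    coverage = covered / sum(freq.values()) if freq else 0.0
    return coverage, uncovered.most_common(10)
```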

The identification of syntactic information is perhaps the most challenging task of corpus profiling (it presupposes that the document is tagged with (at least) part-of-speech). Though sequences of POS tags are indicative of syntactic phenomena, only a subset can be identified with a reasonable degree of reliability. The frequency and distribution of closed class items can serve as very simple, but useful indicators of the occurrence of syntactic constructions. This, however, only provides coarse-grained information. For example, we can determine how many times the word ‘and’ is used in a text, but not whether the coordination is between nouns, noun phrases, verbs, etc. Without any further specification, a user would thus have to extract all test items classified under that phenomenon, even if the major part of them might not be representative for the corpus in question. Better results can be obtained by applying a more refined procedure, namely through the systematic extraction of patterns of sequences of part-of-speech tags.

The specification of a sequence such as [NP coord-conj NP] reveals for example whether the corpus contains coordination of noun phrases. While this method yields correct results for a sentence such as (1), it is unsatisfactory for (2), although both examples contain the same pattern ([NP coord-conj NP]: Sally and Peter).

(1) Harry meets [Sally and Peter].

(2) [Harry meets Sally] and [Peter meets John].

This situation is due to the different bracketing of the two sentences: (1) shows the correctly extracted coordination of nouns, but (2) contains a coordination of sentences.
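The limitation can be reproduced with a flat pattern matcher over POS-tag sequences; the sketch below (simplified, hypothetical tag set) matches the [NP coord-conj NP] pattern in both (1) and (2), even though only (1) contains noun-phrase coordination:

```python
def find_pattern(tags, pattern):
    """Return start indices where `pattern` occurs as a contiguous
    subsequence of the POS-tag sequence `tags`."""
    n, m = len(tags), len(pattern)
    return [i for i in range(n - m + 1) if tags[i:i + m] == pattern]

pattern = ["NP", "CC", "NP"]          # [NP coord-conj NP]

# (1) Harry meets [Sally and Peter].
tags1 = ["NP", "V", "NP", "CC", "NP"]
# (2) [Harry meets Sally] and [Peter meets John].
tags2 = ["NP", "V", "NP", "CC", "NP", "V", "NP"]

print(find_pattern(tags1, pattern))   # [2] -- genuine NP coordination
print(find_pattern(tags2, pattern))   # [2] -- spurious hit: sentence coordination
```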

Further refinement of the noun phrase coordination pattern might be possible; however, we will generally expect users to perform some hand filtering of the output, or to accept some inaccuracies. Even with high recall and low precision, though, this functionality will help the user to locate relevant examples in a corpus much faster than by going through the corpus manually. It has to be kept in mind that obtaining a high degree of precision would require a corpus with correctly disambiguated full syntactic parses.

The characteristics recorded in the corpus profile will not be sufficient to automatically extract the relevant subset of DiET test items for a given evaluation. Since the data developed in DiET is designed to provide basic linguistic coverage with a limited vocabulary, often simplified to isolate the different phenomena, it would be unrealistic to assume a simple mapping between the annotated test items and the sentences containing these phenomena in the user-specific corpus. A corpus profile can, however, provide indications as to which items might be relevant and what extension or modification of the test data needs to be envisaged. The properties described in the profile will also serve as a basis for assigning relevance measures to the selected test items, i.e. test suites are actually related to domain specific corpora in order to provide weighted test data. It will be up to the human evaluator to assign these relevance measures.

Lexical Replacement

Another method for customising existing test suites is the replacement of lexical material contained in the test items by other lexical material. In DiET, such a replacement is foreseen for two purposes.


Lexical replacement can be used as a repair strategy if lexical material contained in a test item is not covered by the application under evaluation. If the evaluation is not concerned with lexical coverage, such sentences should be excluded from the evaluation. Given, however, that a non-redundant test suite may contain just a single test item exemplifying a certain phenomenon, the simple exclusion of a test item from a test set might lead to an inappropriate evaluation, since pertinent phenomena could no longer be evaluated after such an exclusion. In this case, lexical replacement will replace the unknown word by a word that is contained both in the test suite lexicon and in the application lexicon.

A second application for lexical replacement is the deliberate insertion of new lexical material into a set of test items. New lexical material might consist of some specialised vocabulary, e.g., for the inclusion of a certain terminology.

As reported in Kiss & Steinbrecher (1998), DiET will concentrate on providing tools for the first purpose but intends to extend these tools to cover the second application as well; the reason behind this priority is that the tools required for the first purpose are also needed to successfully cover the second purpose of lexical replacement.

Kiss & Steinbrecher (1998) assume that equivalence classes for lexical replacement within well-formed and ill-formed items can be built on the basis of a structured lexicon, where categorial and morphological descriptions can be related to each other in terms of identity and generalisation.

In the case of well-formed examples, an unknown word can be replaced by a known word if the unknown word and the substitute share their description. In addition, a replacement is possible if the substitute receives a more general description than the replaced item.

In the case of ill-formed examples, an unknown word can be replaced by a known word under the same conditions that applied to well-formed examples. It must be considered, however, that in the case of generalisation, the ill-formedness of a test item could be neutralised. This might happen if the origin of the ill-formedness is lexical in nature and, moreover, if the offending lexical feature is the one over which the generalisation is defined. As an illustration, consider the example in Figure 1 ('We commemorated the ball/the rhythm').

*Wir gedachten des Ball. → Wir gedachten des Rhythmus.

Figure 1: Neutralisation of features leads to a bad replacement

The description of Rhythmus (rhythm) is more general than the one of Ball (ball). Since the nouns share their other specifications, the generalisation is defined over the case feature: while Rhythmus is completely neutralised with respect to case, Ball can be used for nominative, dative, and accusative, but crucially not for genitive. That Ball is not genitive is the very reason for the ungrammaticality of the example in Figure 1, and such a replacement must hence be blocked.

This is achieved by not only looking at the descriptions of the lexical material, but also by indicating that an element is the origin of the ill-formedness of an item.
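The blocking condition can be sketched as follows (a minimal sketch; the feature representation and all names are hypothetical, with a simplified case paradigm taken from Figure 1): a substitute with a more general description is accepted only if the generalised feature is not the recorded origin of the item's ill-formedness.

```python
# Case values each noun form can realise (simplified fragment of the
# German paradigms from Figure 1):
LEXICON = {
    "Ball":     {"nom", "dat", "acc"},          # crucially: no genitive
    "Rhythmus": {"nom", "gen", "dat", "acc"},   # neutralised for case
}

def replacement_allowed(old, new, error_feature=None):
    """Allow replacement if the substitute's description is identical
    to or more general than (a superset of) the old word's description,
    unless the generalised feature is the origin of the ill-formedness."""
    old_cases, new_cases = LEXICON[old], LEXICON[new]
    if not new_cases >= old_cases:
        return False              # neither identical nor more general
    generalised = new_cases - old_cases
    if error_feature is not None and error_feature in generalised:
        return False              # would neutralise the ill-formedness
    return True

# *Wir gedachten des Ball: the missing genitive is the error source, so
# replacing Ball by Rhythmus must be blocked in the ill-formed item:
print(replacement_allowed("Ball", "Rhythmus"))                       # True
print(replacement_allowed("Ball", "Rhythmus", error_feature="gen"))  # False
```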

If lexical replacement is used as a repair strategy, it seems most useful that the customised test suite is derived by eliminating those test strings that contained unknown words, while inserting the newly formed test items.

If used as an extension strategy, the newly formed test strings may form part of the customised test suite together with the initial material that was used as a pattern to form the new test items.

For a more detailed description of lexical replacement, the reader is referred to Kiss & Steinbrecher (1998).

Evaluation

The DiET tool package will mainly be employed for the evaluation of NLP-systems. Three types of evaluation are usually distinguished, and the DiET system is designed to support all of them: (i) diagnostic evaluation, which aims at localising deficiencies in NLP systems, (ii) adequacy evaluation, which determines whether and to what extent a particular system meets some pre-specified requirements, and (iii) progress evaluation, which compares successive stages of the development of a system.

The quality of the evaluation results produced with the DiET tool package will be measured along the following five criteria: thoroughness, adequacy, comparability, accessibility, and efficiency.

An evaluation is thorough if the system under evaluation is subjected to tests which force the system to exhibit its performance, ideally with respect to all kinds of constructions. DiET supports this by the provision of a test suite designed to illustrate a wide range of relevant linguistic constructions at morphological, syntactic and semantic levels.

An evaluation is adequate if it subjects the system under evaluation to just those tests which are relevant in a given situation. The situation is determined by

• the domain the NLP system is to work in (including subject matter, lexical coverage, syntactic constructions, etc.) and

• the task which the NLP system is supposed to fulfil.

DiET supports adequacy by allowing the user

• to form test suites adapted to a situation with regard to linguistic coverage on morphological, syntactic and semantic levels, choosing the appropriate test items, vocabulary and annotations;

• to form test sets as sub-sets of a test suite actually used for a test;

• to give weights to annotation types, reflecting the relative importance of certain annotations over others.

If the choice of a candidate is to be based on evaluations of several systems, the evaluation results must be comparable, i.e. they must reflect the performance of the candidates on exactly the same tasks. Comparability is also relevant for developers within progress evaluations, i.e. to keep track of their product's performance over time.


DiET will support this by

• allowing the formation and storage of named test sets which can be re-used arbitrarily often to subject several systems or different versions of the same system to exactly the same tasks over time, and of course by

• the storage of results, keeping them available for comparison.

An evaluation should not express somebody's personal impression alone, but should be grounded on a well-understood and generally accessible procedure.

DiET will support this by

• storing the (raw) results of diagnostic runs, and generally by

• presenting the details of the diagnostic and evaluation procedure to the user who is responsible for the interpretation of these results.

The resulting performance values may be expressed in terms of quantitative measurements. Alternatively, qualitative values may be assigned manually in terms of a numerical ranking, e.g. 1-10, or expressed as a quality judgement, e.g. "excellent", "mediocre", and "poor".

An evaluation is efficient if it does what it is supposed to do without wasting time or money.

DiET will support this by

• providing automated procedures where possible;

• avoiding redundant testing, i.e. making sure the same feature is not tested several times;

• allowing testing to be confined to only those features relevant to a chosen domain and the type of system.

On the basis of these criteria, the typical evaluation scenario performed by means of the DiET tool will proceed as follows:

For a given application the user has formed a test set which contains that part of the test suite particularly relevant for the system under evaluation. This test data might have been constructed from scratch or adapted from a test set already contained in the data base. In the latter case, the user can call on external programs such as corpus profiling to guide the selection process or highlight desirable modifications. The lexical replacement module can be used to facilitate and control lexical modification of the test suite items.

(i) The test set is then handed over to the NLP-system. The NLP-system processes the items in the test set and returns the results to DiET.

(ii) In DiET, a comparison will take place: Each result (as the actual value) given as output by the NLP-system is compared with or related to the corresponding values of the adequate annotation type(s) stored in the test suite (as the rated value).

Based on quality measurements provided by the users, the interpretation of the comparison values between actual and rated values tells how well the NLP-system has performed with regard to a certain aspect.

The values showing the NLP system’s performance with regard to one particular annotation type will be integrated to yield a value indicating the application's performance as a whole. A system’s raw results, the comparison values, and the interpretation of these values will be kept in the DiET data base.
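As a rough sketch of steps (i) and (ii) and of this integration (the record layout, weighting scheme and all names are hypothetical; the paper prescribes no concrete format), each actual value is compared with the stored rated value per annotation type, and the per-type accuracies are combined using user-assigned weights:

```python
def compare_results(test_set, actual_outputs, weights):
    """Compare the system's actual output with the rated (expected)
    values stored in the test suite, then integrate the per-annotation-
    type accuracies into one weighted overall performance value."""
    per_type = {}   # annotation type -> [matches, comparisons]
    for item, actual in zip(test_set, actual_outputs):
        for ann_type, rated in item["expected"].items():
            matched, total = per_type.setdefault(ann_type, [0, 0])
            per_type[ann_type] = [matched + (actual.get(ann_type) == rated),
                                  total + 1]
    scores = {t: m / n for t, (m, n) in per_type.items()}
    total_weight = sum(weights.get(t, 1.0) for t in scores)
    overall = (sum(scores[t] * weights.get(t, 1.0) for t in scores)
               / total_weight) if total_weight else 0.0
    return scores, overall
```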

This evaluation procedure will be used in the DiET project for three different types of applications: machine translation systems, translation memories, and grammar and controlled language checkers.

The following properties of the DiET tool package will be tested and validated at three industrial user sites:

• extended coverage of test items to reflect real-life combinations of phenomena

• reliability of diagnostics produced

• user-friendliness, maintainability, import/export capabilities of the testing tools

• portability to various specific environments

• modularity and flexibility of customisation tools and functions

• extendability of the toolkit.

Conclusion

One of the ultimate goals of the DiET project is to provide the professional market (i.e. developers, industrial users and consultants) with a reusable evaluation methodology and tool package which can be easily customised to other applications, other industrial sectors and corpora, and other languages. By setting up a centralised server from which data can be obtained and into which new data can be fed, we also hope to reach the objective of sharing reference data within the NLP community. However, DiET can only provide some of the technical foundations for such an enterprise, whose success will very much depend on whether there is sufficient interest in the community to exchange and share such data and to agree on some common standards.

References

Cooper et al. (1996). Using the Framework. FraCaS Deliverable D16. Edinburgh, 1996. Also available from http://www.cogsci.ed.ac.uk/~fracas/deliverables.html

Kiss, T. et al. (1997). The DiET User Requirements Analysis. Deliverable D1.1, IBM Heidelberg, 1997.

Kiss, T. and Steinbrecher, D. (1998). Lexical Replacement in Test Suites for the Evaluation of Natural Language Applications. This volume, 1998.

Lehmann, S., Oepen, S. et al. (1996). TSNLP – Test Suites for Natural Language Processing. Proceedings of Coling, 711–716, 1996.

Maegaard, B. et al. (1996). TEMAA – A Testbed Study of Evaluation Methodologies: Authoring Aids. Final Report, 1996.

Rayner, M., Bouillon, P., and Carter, D. (1995). Using Corpora to Develop Limited-Domain Speech Translation Systems. Proceedings of Translating and the Computer 17 (ASLIP), November 1995. Also available from http://www.cam.sri.com/ as SRI Technical Report CRC-059.

Skut, W. et al. (1997). An Annotation Scheme for Free Word Order Languages. Proceedings of ANLP, 88–96, 1997.
