
Evaluation of the Syntactic Analysis Component of an Information Extraction System for German

Judith Klein, Thierry Declerck, and Günter Neumann

German Research Center for Artificial Intelligence (DFKI GmbH), Stuhlsatzenhausweg 3, 66123 Saarbrücken (Germany)
lastname@dfki.de

Abstract

This paper describes two distinct evaluation initiatives on the NLP performance of smes (Saarbrücker Message Extraction System). The first study aimed at the improvement of smes used as analysis component of a German dialogue system for appointment scheduling. While the first scenario was mainly designed to test the overall analysis result, the second is concerned with a systematic evaluation of the NLP components of the core engine of the smes system. The morphological analysis, the disambiguation ability of the POS tagger as well as one of the various parsing modules are now under evaluation. Suitable reference material and supporting evaluation tools are being developed in parallel.

1. Introduction

Evaluation of NLP systems is necessary both to find out if the technology employed is suitable as part of other applications, and to assess and improve the quality of the components of the bare NLP system independently of its integration into a broader application scenario. This paper reports on two independent evaluations of the NLP performance of the smes system (Saarbrücker Message Extraction System), developed at the DFKI (Neumann et al. 1997). It is important to note that the evaluation was not concerned with the application-specific abilities of smes, i.e. its performance on the correct and efficient extraction of information, but only with its language processing functionalities.

The first part describes the evaluation of smes used as analysis component of a German dialogue system for appointment scheduling via e-mail, developed within the COSMA (Cooperative Schedule Management Agent; Busemann et al. 1997) project (described in more detail at http://www.dfki.de/lt/projects/cosma.html). The experiment, which examined the linguistic coverage of smes for the appointment domain on the basis of an annotated German e-mail corpus, is briefly summarised in section 3. The results of this evaluation study not only confirmed the satisfying performance of smes as syntactic analysis component within the COSMA system but also encouraged the employment of smes for other application domains. Since the experiment once more emphasises the fundamental importance of evaluation studies, the main advantages and disadvantages of the applied methodology are discussed.

The second part of the paper presents ongoing evaluation work on smes carried out within the PARADIME (Parameterisable Domain-adaptive Information and Message Extraction) project (more information can be found at http://www.dfki.de/lt/projects/paradime-e.html). The scenario is designed to systematically test the NLP components of the core engine of the smes system which, in the meantime, has been extended and revised. In the context of the evaluation of the parsing module, the profound examination of the morphological component MORPHIX++ and the recently integrated POS tagger (an adapted version of the Brill tagger (Brill 1995)) was necessary for two reasons: firstly, it is important to detect and eliminate processing deficiencies which could influence the subsequent syntactic analysis; secondly, the results of these NLP components must be reliable in order to support efficient semi-automatic annotation of the required reference material for smes. A comprehensive evaluation of the parser will follow after the necessary improvements of the previous processing steps have been completed. Currently the methodology for the systematic testing of the shallow parsing component is being designed. In addition to the corpus-based evaluation on the smes reference material, the various modules of the syntax component will be evaluated on the basis of linguistic competence data partly taken from the TSNLP (Lehmann et al. 1996) and DiET (Netter et al. 1998) test suites.

The improved parser output will be employed for the semi-automatic syntactic annotation of the reference corpus.

2. SMES – Saarbrücker Message Extraction System

The smes system (Neumann et al. 1997) combines a core machinery for shallow processing with a user-defined system of finite state automata defining the grammar. The core engine of the smes prototype consisted of:

- a tokenizer based on regular expressions, which scans the input to identify the fragment patterns (e.g. words, date expressions, etc.);

- MORPHIX++, an efficient and robust German morphology component which performs morphological inflection and compound processing (MORPHIX++ is based on MORPHIX; Finkler and Neumann 1988);

- a shallow parsing module consisting of a declarative specification tool for expressing finite state grammars;

- a linguistic knowledge base which includes a lexicon of about 120,000 lexical root entries;

- an output presentation component where fragment combination patterns are used for defining linguistically oriented head-modifier constructions.

After text scanning and morphological analysis of the identified fragment patterns, shallow parsing is done in two steps. Firstly, specified finite state transducers (FSTs) perform fragment recognition and extraction on the basis of the MORPHIX++ output. Fragments to be identified are user-defined and typically consist of phrasal entities (e.g. NPs, PPs) and application-specific units (e.g. complex time and date expressions). In the second phase, user-defined automata representing verb frames operate on the extracted fragments to combine complements and adjuncts into predicate-argument structures.
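To make this two-step flow concrete, the following is a minimal Python sketch of the mechanism described above. The regular expression, the toy lexicon standing in for MORPHIX++ output and the determiner-noun rule are our own illustrations under simplifying assumptions, not the actual smes grammar or interfaces.

```python
import re

# Scanner in the spirit of the smes tokenizer: regular expressions
# identify fragment patterns such as words and simple date expressions.
TOKEN_RE = re.compile(r"\d{1,2}\.\d{1,2}\.\d{2,4}|\w+|[.,;?]")

# Toy stand-in for MORPHIX++ output: word form -> (stem, POS) readings.
TOY_MORPH = {
    "die": [("d-det", "DEF")],
    "Bilanz": [("bilanz", "N")],
    "wurde": [("werd", "V")],
    "zugrunde": [("zugrunde", "VPREF")],
    "gelegt": [("leg", "V")],
}

def analyse(text):
    # Unknown words keep an UNKNOWN reading for later noun/adjective/verb guessing.
    return [(tok, TOY_MORPH.get(tok, [(tok.lower(), "UNKNOWN")]))
            for tok in TOKEN_RE.findall(text)]

def np_fragments(analysed):
    # Step 1, finite-state style: accept determiner + noun as an NP fragment.
    fragments, i = [], 0
    while i + 1 < len(analysed):
        readings1, readings2 = analysed[i][1], analysed[i + 1][1]
        if any(tag == "DEF" for _, tag in readings1) and \
           any(tag == "N" for _, tag in readings2):
            fragments.append(("NP", analysed[i][0], analysed[i + 1][0]))
            i += 2
        else:
            i += 1
    return fragments

print(np_fragments(analyse("die Bilanz wurde zugrunde gelegt .")))
# -> [('NP', 'die', 'Bilanz')]
```

The second step, in which verb-frame automata combine such fragments into predicate-argument structures, would operate on the fragment list returned here.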

As mentioned in the introduction, the current version of smes has undergone some extensions and revisions. For the purpose of lexical tagging, a new module has recently been integrated between MORPHIX++ and the shallow parser: the unsupervised Brill tagger is employed in order to disambiguate the morphologically ambiguous input texts as far as possible.

3. Evaluation of smes as Analysis Component

Within the COSMA project the first prototype of smes was integrated as the syntactic analysis component of the German dialogue system for appointment scheduling via e-mail.

For the application domain a set of automata was designed for typical verbs and various kinds of temporal expressions used in appointment scheduling negotiations.

3.1. Evaluation Experiment

The evaluation of smes in this setting aimed at the improvement of the syntactic analysis of the smes prototype for appointment scheduling (Klein et al. 1997). For the determination and extension of the linguistic coverage of smes, a database system containing 160 carefully selected e-mail messages was developed. Based on the TSNLP guidelines for the annotation of test material for NLP systems (Lehmann et al. 1996; for more information on the project see http://tsnlp.dfki.uni-sb.de/tsnlp/), 240 e-mail fragments were assigned structural information and application-specific annotations, relevant especially for temporal expressions.

In order to define the sub-language for the appointment domain, document profiling on the e-mail corpus was done by means of database queries: based on the association between text fragments and linguistic annotations, domain-specific phenomena and their occurrence in various linguistic constructions could be extracted. The resulting corpus profile provided a priority list of linguistic phenomena the parsing system must cope with for appointment scheduling, including the correct processing of specific time expressions and the analysis of copula constructions.
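As an illustration of this kind of database-driven document profiling, here is a small hypothetical sketch using an in-memory SQLite database; the schema and the example phenomena are our own assumptions, not the actual COSMA e-mail database.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical schema: e-mail fragments and their linguistic annotations.
con.executescript("""
    CREATE TABLE fragments (id INTEGER PRIMARY KEY, text TEXT);
    CREATE TABLE annotations (
        fragment_id INTEGER REFERENCES fragments(id),
        phenomenon  TEXT  -- e.g. 'time_expression', 'copula'
    );
""")
con.executemany("INSERT INTO fragments VALUES (?, ?)",
                [(1, "Treffen wir uns am 3.4. um 14 Uhr?"),
                 (2, "Der Termin ist unguenstig.")])
con.executemany("INSERT INTO annotations VALUES (?, ?)",
                [(1, "time_expression"), (1, "question"), (2, "copula")])

# Priority list: phenomena ranked by their frequency in the corpus.
for phenomenon, freq in con.execute(
        "SELECT phenomenon, COUNT(*) FROM annotations "
        "GROUP BY phenomenon ORDER BY COUNT(*) DESC"):
    print(phenomenon, freq)
```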

The evaluation was carried out under black-box conditions: two randomly selected sets of e-mail messages, (i) 50 from the corpus database and (ii) 50 new e-mail messages, were fed one by one to the smes system. The examination of the results was only concerned with the overall output of the smes system, i.e. the source of the errors was sometimes difficult to locate and could be attributed to several of the internal components. Each analysis was checked manually and, depending on how well the application-specific requirements were met, one of three quality predicates (good, medium, bad) was assigned to the output. The evaluation showed that nearly 75% of the unseen e-mail messages could be classified as good or medium (about 62% good and 13% medium), while about one fourth of the test set showed bad results.

3.2. Results of the Evaluation

Since the evaluation of the syntactic component was carried out in the context of COSMA, the applied evaluation criteria and the quality assessment were task-dependent, and hence not solely based on linguistic competence. The results confirmed that smes was well suited as a robust application-oriented shallow parsing module within the German dialogue system for appointment scheduling. A main advantage of the system is its modularity: especially the user-defined system of finite state automata allowed easy extension and customisation of the grammars for the application-specific sub-language.

The promising performance results of this first evaluation study encouraged the application of smes to other domains and therefore stimulated the profound evaluation of all smes components which is currently being done within the PARADIME project.

3.3. Conclusions of the Evaluation Study

Since the outcome of the smes evaluation had important consequences for the further development of the system, it is necessary to reflect on the applied evaluation methodology in order to improve future evaluation initiatives.

First of all, the NLP system prototype should have undergone test-suite-based diagnostic evaluations to gain a clear picture of the system's abilities and shortcomings. If those phenomena of linguistic competence irrelevant for the envisaged application could have been filtered out, it might have been easier to judge the performance of the system on real-life test data.

During the preparatory stage of the evaluation procedure, profound test material – the e-mail corpus database – was built. But even though the e-mail fragments proved very useful for testing, they were not adequate as reference material. The linguistic annotations were very well suited for document profiling, thus specifying the desired linguistic coverage of smes. Contrary to our expectations, however, they were not used as a reference basis for result comparisons because:


- the inspection of system results was done manually by the developers, who did not need to look at the reference annotations to judge the smes output;

- reference annotations and smes output differed in coverage and format, so that direct comparison was not possible;

- automatic comparison could not be envisaged since the reference annotations were stored separately from the e-mail messages whereas the smes output was directly attached to the texts.

In addition, the various syntactic structures exemplified within the test e-mails were not weighted for relevance. But information on the frequency and relative importance of linguistic phenomena is necessary to reliably judge the performance of a system with respect to a particular application.

The testing procedure was handicapped by the one-by-one feeding of the e-mail messages into smes. The manual examination of the results was difficult and quite time-consuming because of the complex bracketed output representations.

The last point concerns the measuring method. The actual smes output was immediately assigned a quality predicate. There was no intermediate step expressing the raw smes output in numeric terms, i.e. quality assessment was based on the developers' (subjective) judgement. Even though subjective interpretation is not completely avoidable, especially within the evaluation of application-specific NLP systems, subjectivity might be considerably reduced by the employment of more objective measurements such as recall and precision.

4. Evaluation of the smes Components

The PARADIME project aims at the development of smes as a parameterisable core machinery for a variety of commercially viable applications in the area of information extraction (IE). A significant amount of work in the project is therefore dedicated to the improvement of the smes components, including morphological analysis, exploration of strategies for lexical tagging, and shallow parsing. Currently, both morphological analysis and part-of-speech tagging are under evaluation on the basis of real data taken from a German business magazine. The evaluation setting for the parsing module is being designed, and reports on the evaluation results will soon be available.

4.1. A Reference Corpus for smes

An important aspect of the improvement of the IE core system is the evaluation of the performance of its components applied to large texts. For this purpose, a corpus from the German business magazine "Wirtschaftswoche" from 1992 (1.4 MB), consisting of 366 texts on different topics written in distinct styles, was selected. Augmented with the necessary morpho-syntactic annotation, these texts will serve as the reference corpus for the smes evaluation.

Since the manual building of a large, richly annotated reference corpus is cost- and time-intensive, the components of smes will be employed to support the annotation work. But before such semi-automatic annotation is possible, profound evaluation of the NLP modules must guarantee that the analysis results are reliable and need only minor manual post-editing.

In a first step the corpus texts were marked up with lexical information provided by MORPHIX++. Using the recently implemented parameterisable interface of MORPHIX++, the cardinality of possible tag sets ranges from 23 (considering only POS) to 780 (considering all possible morpho-syntactic information). For the time being, the corpus is tagged with POS only (in fact only 21 POS tags were used, since some categories were subsumed under one class); ambiguous words are associated with the corresponding alternative tags, as can be seen in figure 1. In a separate step, items marked as unknown words are tagged as being possibly a noun, an adjective or a verb.

("Fuer" ("fuer" :S "VPREF") ("fuer" :S "PREP"))
("die" ("d-det" :S "DEF"))
("Angaben" ("angeb" :S "V") ("angabe" :S "N"))
("in" ("in" :S "PREP"))
("unseren" ("unser" :S "POSSPRON"))
("Listen" ("list" :S "N") ("list" :S "V") ("liste" :S "N"))
("wurde" ("werd" :S "V"))
("grundsaetzlich" ("grundsaetzlich" :S "A") ("grundsaetzlich" :S "ADV"))
("die" ("d-det" :S "DEF"))
("weitestgehende" ("weitestgehend" :S "A"))
("Bilanz" ("bilanz" :S "N"))
("zugrunde" ("zugrunde" :S "VPREF"))
("gelegt" ("leg" :S "V"))
("." ("." :S "INTP"))

Figure 1: MORPHIX++ result for an example sentence of the reference corpus.
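Since this mark-up is the basis for both the tagger evaluation and the semi-automatic annotation, it is useful to read it programmatically. The following sketch assumes only the parenthesised line format visible in Figure 1; it is our own reconstruction, not an official MORPHIX++ interface.

```python
import re

# One (stem :S TAG) reading inside a MORPHIX++-style output line.
READING_RE = re.compile(r'\("([^"]+)"\s+:S\s+"([^"]+)"\)')
# The surface word form at the start of the line.
WORD_RE = re.compile(r'^\("([^"]+)"')

def parse_morphix_line(line):
    """Parse one output line into (word, [(stem, tag), ...])."""
    word = WORD_RE.match(line).group(1)
    return word, READING_RE.findall(line)

word, readings = parse_morphix_line('("Angaben" ("angeb" :S "V") ("angabe" :S "N"))')
print(word, readings)     # Angaben [('angeb', 'V'), ('angabe', 'N')]
print(len(readings) > 1)  # True: this token counts as ambiguous
```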

Supported by various tools, the first part of the tagged corpus (about 21,000 words) has been manually validated and disambiguated. The resulting texts constitute the reference basis for the first diagnostic evaluation of the morphological analysis and the POS tagger. In order to provide suitable reference material for the parser, the corpus is additionally annotated with structural information. Beginning with nominal constituents, the phrasal category together with agreement features was manually assigned to the NPs. But the annotation will not stop at the phrasal level: the internal structure of the constituents will be described in terms of head-modifier dependencies. As soon as the parser has reached a satisfying performance, it will also be employed as an annotation tool and provide the syntactic information for the reference corpus.

4.2. Evaluation of MORPHIX++

Morphological processing in smes follows text scanning and provides as output the word form together with all its readings. Since MORPHIX++ will be employed to provide the annotation basis for the reference corpus, the evaluation started with several cycles of diagnostic evaluation in order to determine the linguistic coverage and detect shortcomings of the morphological component. The first test run on the whole corpus showed that MORPHIX++ had a lexical coverage of 91.12%. The manual inspection of the morphological mark-up revealed some necessary extensions and modifications concerning missing lexical entries, incomplete or erroneous single-word analyses and some general aspects of the assignment of morpho-syntactic information. The current coverage of MORPHIX++ has reached about 94%, where most of the words not analysed are, in fact, proper nouns or misspelled words. On the basis of the improved morphological analysis, MORPHIX++ provides the input for the Brill tagger in the subsequent disambiguation step.

Since the improved morphological component provides quite accurate results it can be employed for the annota- tion of the remaining corpus texts, thus reducing the manual post-examination considerably.

4.3. Evaluation of the POS Tagger

For smes an unsupervised tagging strategy was adopted, with the goal of investigating to what extent the time-consuming building of a training corpus can be dispensed with. As learning material, the lexical mark-up provided by MORPHIX++ was used instead. The integrated Brill tagger consists of a learning component and an application component. In the learning phase, the tagger induces disambiguation rules on the basis of the morpho-syntactically annotated texts. In a second phase those rules are applied to the morphologically analysed input texts. After a first correction phase on MORPHIX++, the Brill tagger was run over the annotated corpus using a first version of a simple rule application strategy. The context taken into account for disambiguation consisted of just one word/category on each side of the processed item.
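The following is a minimal sketch of such a one-word-window rule application; the two example rules are our own illustration of the rule format, not the rules actually induced by the tagger.

```python
# Contextual disambiguation rules: (previous tag, candidate set) -> choice.
# Hypothetical examples in the one-word-window spirit described above.
RULES = [
    ("DEF", frozenset({"N", "V"}), "N"),   # after a definite article: noun
    ("PREP", frozenset({"N", "V"}), "N"),  # after a preposition: noun
]

def disambiguate(tagged):
    """tagged: list of (word, candidate-tag set); returns chosen tag sets."""
    result = []
    for i, (word, candidates) in enumerate(tagged):
        chosen = candidates
        if len(candidates) > 1 and i > 0:
            prev_tags = result[i - 1][1]
            for prev, cands, pick in RULES:
                if prev in prev_tags and frozenset(candidates) == cands:
                    chosen = {pick}
                    break
        result.append((word, chosen))
    return result

print(disambiguate([("die", {"DEF"}), ("Angaben", {"N", "V"})]))
# -> [('die', {'DEF'}), ('Angaben', {'N'})]
```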

While in this first evaluation round the input texts are annotated with POS information only, in the next steps additional features (e.g. case) will be considered until the best feature set for optimal disambiguation results has been determined. The evaluation of the Brill tagger is based on quantitative and qualitative measures. The recall is calculated as the ratio between the number of disambiguations performed and the number of all word ambiguities provided by MORPHIX++. The precision of the disambiguation step is measured as the ratio between the number of correct disambiguations performed and the number of all words of the manually validated reference corpus (since these are supposed to be annotated with the correct reading). A set of tools was implemented to support the automatic comparison of the Brill tagger output with the MORPHIX++ output and with the reference material, and the calculation of the measures. The following results were reported: 62% of the ambiguities in the input text were disambiguated, with an accuracy of 95%. These results are promising since they have been obtained on the basis of a basic implementation of the Brill tagger.
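Computed over toy data structures, the two ratios defined above look as follows; the list-of-tag-sets layout is our own assumption, and the precision denominator deliberately follows the definition just given (all words of the validated reference corpus).

```python
def tagger_measures(morphix, brill, reference):
    """morphix/brill: candidate-tag sets before/after tagging;
    reference: the single correct tag per word (validated corpus)."""
    ambiguities = sum(1 for cands in morphix if len(cands) > 1)
    disambiguated = sum(1 for before, after in zip(morphix, brill)
                        if len(before) > 1 and len(after) == 1)
    correct = sum(1 for before, after, gold in zip(morphix, brill, reference)
                  if len(before) > 1 and after == {gold})
    recall = disambiguated / ambiguities   # share of ambiguities resolved
    precision = correct / len(reference)   # per the definition above
    return recall, precision

morphix = [{"DEF"}, {"N", "V"}, {"PREP"}, {"A", "ADV"}]
brill = [{"DEF"}, {"N"}, {"PREP"}, {"A", "ADV"}]
reference = ["DEF", "N", "PREP", "ADV"]
print(tagger_measures(morphix, brill, reference))  # (0.5, 0.25)
```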

Even though it is not yet clear what recall and precision values are necessary to express reliable results, the tagger's current performance is certainly not sufficient for the semi-automatic annotation of the reference corpus. But another training phase on more reliable MORPHIX++ analyses, as well as the manual integration of additional disambiguation rules, will certainly improve its results considerably.

4.4. Evaluation of Shallow Parsing

Since shallow parsing in smes follows morphological analysis, the performance of the parser partly depends on the output of MORPHIX++ and the Brill tagger. But even if their results are quite reliable, diagnostic evaluations under glass-box conditions are necessary to determine the coverage of the grammars and to detect and eliminate deficiencies in the syntactic analysis. Therefore, the various finite state automata defining sub-grammars (the NP grammar, the verb-group grammar, the grammar for temporal expressions, etc.) will be evaluated not only on corpus data but also on competence data taken from systematically constructed test suites. When these evaluation parts have been completed, an application-oriented evaluation considering just the final NLP output of smes will be carried out.

The evaluation of shallow parsing is just at the beginning, so only the proposed methodology can be described. The first experiment is concerned with the NP grammar. As mentioned before, evaluation work on NLP systems under development should start with diagnostic evaluation. The existing NP grammar of the smes prototype is therefore being systematically tested on German test suite data, partly taken from the TSNLP database (test data can be retrieved from http://tsnlp.dfki.uni-sb.de/tsnlp/tsdb/tsdb.cgi), partly constructed to account for structures not found in the German TSNLP test suite. The diagnostic evaluation serves to find out which basic linguistic structures the current NP grammar is able to handle. For this purpose a well-chosen, representative test suite of reasonable size (about 100 nominal phrases exemplifying different NP structures) is being collected. The test set also contains a comprehensive list of various pronouns.

Parallel to the competence-based approach, which tests isolated phenomena, the NP automata are also evaluated for their performance on domain-specific corpus material.

On the basis of the syntactically annotated reference corpus a document profile was established. The resulting NP classification contains prototypical structures ranging from bare plurals over simply modified NPs including adjectives and genitive NPs up to complex NPs containing various pre- and post-modifications. The classification of the distinct NP types already indicates the internal structure of the nominal constructions. A more formal and detailed subdivision will be worked out in order to provide the necessary representation format for the annotation of the head-modifier dependencies for the reference corpus. The next step in document profiling will be to assign relevance values to the NP structures according to their frequency in the corpus texts.

The NP classification reflects the coverage that the NP grammar must provide for the envisaged application domain. Small test runs on selected test suite data have already been carried out. The output of the NP automata was manually inspected and gave a first impression of the grammar's performance with respect to the envisaged application domain. These basic results indicated what extensions and modifications are necessary, including various co-ordination phenomena. Grammar development and diagnostic evaluation runs on test suite data will be done in parallel until a sufficient application-oriented coverage is obtained.

In the second phase, the NP module will be tested on the NPs extracted from the reference corpus. For this corpus-based evaluation, the actual parsing results will be reported, calculated in terms of recall and precision, and interpreted according to the relevance measures assigned to the test data. Figure 2 provides a general list of possible parsing results in comparison to the reference output for NPs:

Test data:        NP    NP    NP    NP    NP
Expected output:  A     A     A     A     A
Parser output:    A     A*    A-    B     NIL
Difference:       1     0.5   0.5   0     0

Figure 2: Comparison between expected and actual output. NP is the test phrase under examination; A means correct analysis, A* partially correct analysis, A- possible analysis, B wrong analysis, and NIL no analysis. If there is a 1:1 correspondence between the expected value and the parsing result, the difference value is 1.0; if there is partial correspondence or a possible but not expected analysis, the difference must be specified by the user (a value between 0.0 and 1.0); and if there is no correspondence, the difference value is 0.0.
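Such difference values can be attributed mechanically once the outcome categories are assigned; a minimal sketch follows, in which the 0.5 values for partial and possible analyses mirror Figure 2 but would in practice be user-specified.

```python
# Outcome categories from Figure 2 mapped to difference values.
DIFFERENCE = {
    "A": 1.0,    # correct analysis
    "A*": 0.5,   # partially correct analysis (user-specified value)
    "A-": 0.5,   # possible but not expected analysis (user-specified)
    "B": 0.0,    # wrong analysis
    "NIL": 0.0,  # no analysis
}

def average_difference(outcomes):
    """Average difference value over a test run's outcome labels."""
    return sum(DIFFERENCE[o] for o in outcomes) / len(outcomes)

print(average_difference(["A", "A*", "A-", "B", "NIL"]))  # 0.4
```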

In order to provide a more fine-grained profile of the performance of the grammar module, the output of the NP automata will be examined at two levels:

- phrasal level: it is checked whether the recognition of the NP, i.e. the external bracketing, is correct;

- internal structure: it is checked whether the head-modifier dependencies are assigned correctly.

It will be a difficult task to assign adequate values to partially correct analyses and to possible but unexpected analyses. But whatever strategy is followed, the basic decisions on the values should be documented in order to understand and control the resulting performance values. Once the numerical difference values have been attributed, they can easily be calculated in terms of recall and precision.

At the phrasal level, recall describes the ratio between the number of those word sequences of a test set which the parser has correctly identified as NPs and the sum of all NPs used for the test run. For information on processing accuracy, precision measures the ratio between the number of those NPs of a test set which the parser has correctly identified as NPs and the sum of all NP recognitions given by the parser.

$$\text{Recall} = \frac{\text{NPs correctly recognised by the parser}}{\text{all NPs in the test set}}$$

$$\text{Precision} = \frac{\text{NPs correctly recognised by the parser}}{\text{all NP recognitions given by the parser}}$$

Concerning the internal structure of the nominal phrases, recall describes the ratio between the number of those NPs of a test set to which the parser has assigned a correct head-modifier structure and the sum of all NPs correctly identified by the parser. Precision measures the ratio between the number of those NPs of a test set to which the parser has assigned a correct head-modifier structure and the sum of all internal NP analyses given by the parser.

$$\text{Recall} = \frac{\text{NPs with correct head-modifier structure}}{\text{NPs correctly identified by the parser}}$$

$$\text{Precision} = \frac{\text{NPs with correct head-modifier structure}}{\text{all internal NP analyses given by the parser}}$$
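From simple counts over a test set, all four ratios can be computed at once; the boolean per-item layout below is our own toy assumption about how reference and parser output might be compared.

```python
def np_measures(items):
    """Phrasal and internal-structure recall/precision as defined above.

    items: one dict per test NP with boolean fields (toy layout):
      'recognised'    - parser produced an NP bracketing at all
      'bracket_ok'    - external bracketing matches the reference
      'structure_out' - parser assigned an internal head-modifier structure
      'structure_ok'  - that structure matches the reference
    """
    recognised = sum(i["recognised"] for i in items)
    bracket_ok = sum(i["bracket_ok"] for i in items)
    structure_out = sum(i["structure_out"] for i in items)
    structure_ok = sum(i["structure_ok"] for i in items)
    return (bracket_ok / len(items),       # phrasal recall
            bracket_ok / recognised,       # phrasal precision
            structure_ok / bracket_ok,     # structural recall
            structure_ok / structure_out)  # structural precision

toy = [
    {"recognised": True, "bracket_ok": True, "structure_out": True, "structure_ok": True},
    {"recognised": True, "bracket_ok": True, "structure_out": True, "structure_ok": False},
    {"recognised": True, "bracket_ok": False, "structure_out": True, "structure_ok": False},
    {"recognised": False, "bracket_ok": False, "structure_out": False, "structure_ok": False},
]
print(np_measures(toy))  # (0.5, 0.666..., 0.5, 0.333...)
```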

Since the parsing module of smes shall be employed to support the semi-automatic annotation of the reference corpus, the analysis performance must be increased and the output representation needs to be modified. Our work could also benefit from other sophisticated annotation tools, such as those currently developed within the DiET project (Netter et al. 1998). With the help of such means, an even richer annotation scheme might be applied to the whole reference corpus. Given the envisaged mapping between the format of the parsing results and the syntactic annotation of the reference data, we will investigate how far automatic routines can be employed for the comparison of actual and expected output.

5. Concluding Remarks and Future Work

Measuring progress during the development of the syntactic analysis component of smes provided not only statistical data about linguistic coverage but also qualitative feedback about the strengths and weaknesses of the grammar.

Two distinct kinds of evaluation have been described: an experiment on the performance of smes used as the syntactic analysis component of a larger NLP system, and the evaluation of the core engine of smes, where the testing is done at component level. The first study provided valuable information on the evaluation procedure, including the type of annotated language resources needed and the measurements suitable for NLP evaluation. These findings are taken into account in the current work within PARADIME. First of all, the systematic testing of the parser modules is done with the help of selected test suite data. In order to provide a suitable basis for the application-oriented interpretation of the parsing results, a document profile of the reference corpus is established which includes information on the frequency and relevance of selected phenomena. An important point concerns the annotated language resources: the reference corpus is augmented with exactly that type of information which is also delivered by the output of the NLP components. On the basis of the identity between annotation format and smes output, automatic comparison will be possible. Furthermore, the annotation of the corpus texts will be supported by the improved NLP components of smes. The employment of these tools will greatly speed up the building of suitable reference material.

The evaluation work within PARADIME will continue with further diagnostic cycles on the Brill tagger and on the various sub-grammars until sufficient coverage and robustness for the application domain are reached. The annotation of the reference corpus will successively be augmented for this purpose. We will investigate the use of existing annotation schemas and tools (such as those of the DiET initiative) for the annotation of phrasal constituents. The linguistic competence of the parsing modules is additionally being checked against suitable test suites. Furthermore, the measures will be made more sophisticated, i.e. we will conform to measurements in terms of recall and precision, adapted to the respective components. Finally, an overall evaluation study will be carried out under black-box conditions to measure the NLP performance of smes used as analysis component for message extraction applications.

6. Acknowledgments

The research underlying this paper was supported by grants from the German Bundesministerium für Bildung, Wissenschaft, Forschung und Technologie (BMB+F) to the DFKI project PARADIME (FKZ ITW 9704). We would like to thank Milena Valkova and Birgit Will for their patient and precise annotation work, and Markus Becker for his work on the integration of the Brill tagger. The evaluation work done within COSMA was largely influenced by ideas of Stephan Busemann, the leader of this project.

7. References

Brill, E. (1995). Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging. Proceedings of the Third Workshop on Very Large Corpora, WVLC-95, Boston, 1995.

Busemann, S.; Declerck, T.; Diagne, A.K.; Dini, L.; Klein, J.; Schmeier, S. (1997). Natural Language Dialogue Service for Appointment Scheduling Agents. Proceedings of the 5th Conference on Applied Natural Language Processing, ANLP-97, pp. 25-32, Washington DC, 1997.

Finkler, W.; Neumann, G. (1988). Morphix: A Fast Realization of a Classification-based Approach to Morphology. Proceedings der 4. Österreichischen Artificial-Intelligence-Tagung, Wiener Workshop Wissensbasierte Sprachverarbeitung, pp. 11-19, Berlin, 1988.

Klein, J.; Busemann, S.; Declerck, T. (1997). Diagnostic Evaluation of Shallow Parsing Through an Annotated Reference Corpus. Proceedings of the Speech and Language Technology Club Workshop, SALT-97, pp. 121-128, Sheffield, 1997.

Lehmann, S.; Oepen, S.; Estival, D.; Falkedal, K.; Compagnion, H.; Balkan, L.; Fouvry, F.; Arnold, D.; Dauphin, E.; Régnier-Prost, S.; Lux, V.; Klein, J.; Baur, J.; Netter, K. (1996). TSNLP – Test Suites for Natural Language Processing. Proceedings of the 16th International Conference on Computational Linguistics, COLING-96, pp. 711-716, Copenhagen, 1996.

Nerbonne, J.; Netter, K.; Diagne, A.K.; Klein, J.; Dickmann, L. (1993). A Diagnostic Tool for German Syntax. Machine Translation 8(1-2), pp. 85-109, 1993.

Netter, K.; Armstrong, S.; Kiss, T.; Klein, J.; Lehmann, S.; Milward, D.; Petitpierre, D.; Pulman, S.; Régnier-Prost, S.; Schäler, R.; Uszkoreit, H.; Wegst, T. (1998). DiET – Diagnostic and Evaluation Tools for Natural Language Applications. Proceedings of the First International Conference on Language Resources and Evaluation, Granada, Spain, 1998.

Neumann, G.; Backofen, R.; Baur, J.; Becker, M.; Braun, C. (1997). An Information Extraction Core System for Real World German Text Processing. Proceedings of the 5th Conference on Applied Natural Language Processing, ANLP-97, pp. 209-216, Washington DC, 1997.

Oepen, S.; Klein, J.; Netter, K. (1997). TSNLP – Test Suites for Natural Language Processing. In: Nerbonne, J. (ed.), Linguistic Databases, CSLI Lecture Notes, Stanford, pp. 13-37, 1997.
