Computational Morphological Analysis as an Aid for Term Extraction

5. How can a morphological analyser serve as an aid for term extraction?

Two of the basic outputs of a specialised electronic corpus that are of interest to the terminographer are term frequency counts and concordance lines. In a morphologically complex language such as Zulu, in which the only constant core element in words or word forms is the root, as illustrated above, word roots must be identified in order to (i) list the morphological variants of a word form under a single lemma or canonical form, and (ii) list selected canonical forms in a concordance that shows their immediate context to the left and the right. Frequency distributions of word forms and concordance lines are not only useful in identifying single-word term candidates, but also assist in determining parts of potential multiword terms by means of collocation patterns. Extracting accurate frequency counts and concordance lines from an electronic Zulu corpus therefore requires each word in the corpus to be morphologically analysed.

A very simple term to extract from the sample corpus would, for instance, be the noun ingculazi ‘AIDS’, should it appear in its uninflected form. Yet its frequency would be much higher if all inflected forms of this noun were also extracted.

Such inflected forms, however, appear in a ranked word list as entries separate from ingculazi, each with its own frequency, e.g. yingculazi, zengculazi, ngengculazi, abanengculazi, eyingculazi, kwingculazi, usunengculazi, lengculazi, and so forth. In fact, all the words mentioned share the same root, namely -ngculazi.
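The effect on frequency counts can be illustrated with a minimal Python sketch. It is not the finite-state analyser discussed in this paper: it assumes a hypothetical one-entry root list (`KNOWN_ROOTS`) and treats a word as containing a root if it simply ends in that root, which happens to hold for the forms cited above but is far cruder than real morphological analysis. The token list is built from the forms quoted in the text, each occurring once, and is not a real corpus sample.

```python
from collections import Counter

# Hypothetical root list; a real analyser holds a full lexicon of roots.
KNOWN_ROOTS = ["ngculazi"]

def find_root(word, roots=KNOWN_ROOTS):
    """Return the longest known root the word ends with, or None."""
    for root in sorted(roots, key=len, reverse=True):
        if word.lower().endswith(root):
            return root
    return None

def root_frequencies(tokens):
    """Count tokens per root; tokens without a known root are kept apart."""
    by_root, unanalysed = Counter(), Counter()
    for tok in tokens:
        root = find_root(tok)
        if root is None:
            unanalysed[tok.lower()] += 1
        else:
            by_root[root] += 1
    return by_root, unanalysed

# Illustrative tokens: the uninflected noun plus the eight forms cited above.
tokens = ["ingculazi", "yingculazi", "zengculazi", "ngengculazi",
          "abanengculazi", "eyingculazi", "kwingculazi",
          "usunengculazi", "lengculazi"]

surface_counts = Counter(tokens)           # plain word-list view
by_root, unanalysed = root_frequencies(tokens)

print(surface_counts["ingculazi"])  # frequency of the uninflected form only
print(by_root["ngculazi"])          # frequency once variants are grouped
```

In the plain word list each form is a separate entry, whereas grouping by root gathers all nine tokens under a single candidate term.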

From our examples it should become clear that in Zulu, scientifically reliable frequency counts and concordance lines are in general difficult to obtain without automated morphological analysis. In the eight words above taken from the sample corpus, the root -ngculazi is realised in a different word form each time. In order to identify a root for each of the forms, a process of so-called affix-stripping (Gibbon 2000: 34) is applied by means of the computational morphological analyser.
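A root-based concordance can be sketched in the same spirit. The function below is an illustrative assumption rather than the authors' tool: it stands in for the analyser with a simple "ends with the root" test and returns key-word-in-context lines with a fixed window of tokens on either side. The tokens `w1` to `w5` are placeholders for surrounding words, not Zulu text.

```python
def kwic(tokens, root, window=2):
    """Return (left context, keyword, right context) for tokens ending in `root`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower().endswith(root):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

# Placeholder context words (w1..w5) around two forms cited in the text.
sample = ["w1", "w2", "yingculazi", "w3", "w4", "lengculazi", "w5"]

concordance = kwic(sample, "ngculazi")
for left, keyword, right in concordance:
    print(f"{left:>10} | {keyword} | {right}")
```

Because the lookup is keyed on the root rather than on one surface form, both inflected forms appear in the same concordance, which is what makes collocation patterns around the term visible.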

For the morphological analyser to identify roots, those roots need to be present in the analyser. Words constructed from roots that are not present will not be analysed. Referring back to the sample corpus, should the noun root -ngculazi, for instance, not be present in the morphological analyser, it would not be possible to identify this root in any of the eight different word forms indicated above.

The root list of the morphological analyser needs to be updated and extended systematically, reflecting both the dynamic nature of the language and the well-known principle that terminology compilation is an on-going and repeated activity (Sager 2001: 763). For this purpose the Xerox finite-state tools include a useful feature, namely the option of building a so-called guesser. The guesser is a variant of the morphological analyser that contains all phonologically possible roots (XRCE 2002), which makes it a particularly useful computational tool for exploring (new) language corpora. When the guesser is applied to a corpus as a potential source of new word roots, new (as yet unlisted) word roots are detected, analysed and marked for possible inclusion in the current word root list.
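The idea behind the guesser can be sketched as a two-stage lookup. The following Python fragment is a rough illustration only: an actual guesser is compiled as a finite-state network with the Xerox tools, whereas here the set of "possible roots" is approximated by a placeholder regular expression (`ROOT_SHAPE`, one or more consonant-cluster-plus-vowel syllables) that is an assumption for demonstration and not a description of Zulu phonology. `KNOWN_ROOTS` and the function name `analyse` are likewise hypothetical.

```python
import re

# Hypothetical list of roots already in the analyser.
KNOWN_ROOTS = {"ngculazi"}

# Placeholder root shape (assumption, not Zulu phonotactics):
# one or more syllables of consonant(s) followed by a single vowel.
ROOT_SHAPE = re.compile(r"(?:[bcdfghjklmnpqrstvwxyz]+[aeiou])+")

def analyse(word, known_roots=KNOWN_ROOTS):
    """Return [(root, 'known')] if a listed root matches, else guessed candidates."""
    word = word.lower()
    for root in sorted(known_roots, key=len, reverse=True):
        if word.endswith(root):
            return [(root, "known")]
    # Guesser fallback: every word-final stretch with a possible root shape
    # is proposed as a candidate and marked as a guess.
    return [(word[i:], "guess")
            for i in range(len(word))
            if ROOT_SHAPE.fullmatch(word[i:])]

listed = analyse("yingculazi")                     # root is in the list
guessed = analyse("yingculazi", known_roots=set()) # root is missing from the list

print(listed)
print(guessed)
```

With the root listed, the word receives a single analysis. With the root removed, the fallback proposes several candidates, among them the correct -ngculazi alongside spurious ones. This over-generation is why guessed roots are only marked for possible inclusion and still have to be vetted by a human.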

Ultimately it will be the responsibility of the terminographer to determine whether such roots form part of the terminology of the particular subject field, since

‘all term-extraction and retrieval systems are still in some sense support systems that offer proposals to the terminologist who must then decide how to proceed, often in consultation with a domain expert’ (Ahmad & Rogers 2001: 740).

6. Conclusion

The complex morphological structure of a language such as Zulu necessitates morphological analysis as an aid for automated term extraction. To facilitate reliable and scientific term extraction from a special-language corpus even further, the guesser variant of the morphological analyser is a particularly useful computational tool for identifying possible new terms. Sager (2001: 765) confirms that text analysis of large corpora can be used to identify new terms, and that while this feature of terminology compilation is still in its infancy, its significance will increase greatly in the future.

Acknowledgements

The authors would like to acknowledge the financial support of the National Research Foundation (NRF) for the project ‘Computer Analysis of Zulu’; the contribution of Xerox Research Centre Europe (XRCE), whose software and training are a vital component of this project; and the continued support and expert advice of Dr. Ken Beesley (XRCE).

References

Ahmad, K. and M. Rogers. 2001. Corpus Linguistics and Terminology Extraction. In S.E. Wright and G. Budin (eds.): 725-760.

Alberts, M. 2001. Lexicography versus Terminography. Lexikos 11: 71-84.

Beesley, K. and L. Karttunen. 2003. Finite-State Morphology: Xerox Tools and Techniques. Stanford: CSLI Publications.

Gibbon, D. 2000. Computational Lexicography. In F. Van Eynde and D. Gibbon (eds.). Lexicon Development for Speech and Language Processing: 1-42. Dordrecht: Kluwer Academic Publishers.

Hurskainen, A. 1997. A language sensitive approach to information management and retrieval: the case of Swahili. In R.K. Herbert (ed.). African Linguistics at the Crossroads: Papers from Kwaluseni: 629-642. Cologne: Rüdiger Köppe.

Sager, J.C. 2001. Terminology Compilation: Consequences and Aspects of Automation. In S.E. Wright and G. Budin (eds.): 761-771.

Taljard, E. and G.-M. de Schryver. 2002. Semi-Automatic Term Extraction for the African Languages, with special reference to Northern Sotho. Lexikos 12: 44-74.

Wright, S.E. and G. Budin (eds.). 2001. Handbook of Terminology Management. Amsterdam: John Benjamins.

XRCE. 2002. References to finite-state methods in Natural Language Processing. <http://www.xrce.xerox.com/research/mltt/fst/fsrefs.html> (Last accessed on 04/11/2002)
