
In: P. Jorrand and V. Sgurev (eds.), Artificial Intelligence IV - Methodology, Systems, Applications. North-Holland, Amsterdam, 1990

Syntactic Processing of Unknown Words

Gregor Erbach

Universität des Saarlandes, Lehrstuhl für Computerlinguistik
Im Stadtwald, D-6600 Saarbrücken, FRG
erbach@sbuvax.campus.uni-sb.de

Abstract

A method is presented for processing sentences which contain unknown words, i.e. words for which no lexical entry exists. There are three stages of processing:

1. The sentence with the unknown word is parsed. No special requirements are placed on the parsing algorithm, but the lexical lookup procedure must be modified.

2. Based on the syntactic structure of the parse, information about the unknown word is extracted.

3. The information obtained in step 2 may be overspecified for a lexical entry. Therefore a filter is applied to it to create a new lexical entry.

An application of the method is illustrated with examples from Categorial Unification Grammar. The problem of using the extracted information for lexical knowledge acquisition is discussed.

Keywords: Parsing, Lexicon, Learning


1 On the need for processing unknown words

Within the past few years, the importance of comprehensive lexical resources for natural-language processing systems has been recognized. Yet it is clear that the lexicon cannot be complete because new words can be created and proper names cannot be exhaustively listed in the lexicon. Moreover, for many applications, the size of the lexicon is limited by the size of the available memory. Therefore, robust approaches for processing unknown words are needed.

Gust and Ludewig (1989) discuss several approaches for extending the lexicon when an unknown word is encountered: word formation rules for analyzing compounds, interactive user input, and access to machine-readable (conventional) dictionaries.

The method presented in this paper allows the automatic extraction of syntactic information about unknown words. A system with this capacity can be a useful tool in the automatic or machine-aided creation of lexical entries. It can provide the syntactic information of the lexical entry for the unknown word, while filling in the semantics is left to the human knowledge engineer.

2 Syntactic processing: A relational view

In this section we develop a view of syntactic processing which helps in understanding what parsing with unknown words means. Under this view, syntactic processing is based on a relation between

a) a lexicon,
b) a grammar,
c) strings of words, and
d) syntactic (and semantic) structures.

The lexicon is a relation between words and their lexical entries. The grammar specifies how phrases are built up from constituents, and thus provides a relation between strings of lexical entries and syntactic structures. When one of these four components is not (or only partially) known, there is a need for syntactic processing.

One such syntactic processing task is parsing, where the string, the lexicon and the grammar are known, but the syntactic structure is unknown. Under this view, parsing means computing syntactic structures based on a string, a lexicon and a grammar.

Another syntactic processing task is generation, where the lexicon and the grammar are known, the grammatical structure is partially known (more semantics than syntax), but the string of words is unknown. Generation is then the process of finding a string of words based on a given grammar, lexicon and (semantic) structure.

The other two conceivable syntactic processing tasks are lexicon learning (the lexicon is unknown) and grammar learning (the grammar is unknown).

Processing unknown words is a mixture of two of the above cases: parsing and lexicon learning. The string and the grammar are known, the lexicon is partially known, the structure is unknown.

Syntactic processing of unknown words proceeds in three steps:

1. Finding a syntactic structure for a given string.

2. Extracting constraints for any unknown word(s) contained in the string.

3. Refining the information obtained in step 2 in order to obtain an appropriate lexical entry.

Besides the lexicon and the grammar, there are two additional constraints which can be exploited for parsing with unknown words. One is a constraint on the syntactic structure, namely that it be a sentence. The other is a constraint on the lexical entry of the unknown word, namely that it be a member of one of the word classes of the language in question. If we assume that all words which belong to the closed word classes of the language are already listed in the lexicon, we can postulate the stronger constraint that the unknown word be a member of one of the open word classes. There are also constraints on what information should and should not be contained in a lexical entry. These latter constraints are exploited in step 3.


In order to process sentences containing unknown words, we must use all the information contained in the lexicon, the grammar, the known words in the string, and the constraints on a lexical entry and the syntactic structure.

3 A method for processing unknown words

Our approach is based on unification grammar. Unification grammars are useful because they are well suited for the representation of "partial" structures.

Previous approaches, such as the lexicon-learning program of Rayner et al. (1988) or the method presented in Erbach (1987), are based on phrase-structure or definite-clause grammars. The lexicons of these systems have a small number of categories with hardly any syntactic features. This makes it possible for the processor to make an informed guess about a category and then verify it.

The method presented here is based on rich information structures containing disjunctive information.

The entry for the unknown word is initially highly underspecified; it is a disjunction of the categories of all the open word classes in the language. Constraints from the context are exploited to choose between alternatives of the initial disjunction and accumulate further information.

We base our approach on chart parsing, which is a suitable framework because the intermediate results which are stored during the parsing process are needed for computing the lexical entries of the unknown words. Any parsing algorithm for unification-based grammars can be used for the processing of unknown words, if it stores intermediate results.

The only step which needs to be modified is the lexical lookup procedure. Traditionally, lexical lookup is a function from words to categories (or sets of categories). It is a partial function because it is not defined for those words which are not in the lexicon. For parsing with unknown words, lexical lookup must be changed into a total function, which returns the set of all word classes (or all open word classes) of the language for all unknown words.
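This modified lookup can be sketched as a total function. The following is a Python sketch, not the paper's Arity/Prolog implementation; the toy entries and category names are invented for illustration:

```python
# Toy lexicon: each word maps to a list of candidate categories.
LEXICON = {
    "der": [{"cat": "det"}],
    "beherbergt": [{"cat": "tv"}],
}

# Disjunction of the open word classes: noun, adjective,
# intransitive verb, transitive verb.
OPEN_CLASSES = [{"cat": "n"}, {"cat": "adj"}, {"cat": "iv"}, {"cat": "tv"}]

def lookup(word):
    """Total lexical lookup: known words return their own entries,
    unknown words return the open-class disjunction."""
    return LEXICON.get(word, OPEN_CLASSES)
```

A call like lookup("Schlossturm") then never fails: it returns all four open-class categories, and the parser prunes this disjunction against the context.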

After the parsing is completed, the lexical entries of the unknown words are computed by propagating information down the result tree from the root node to the leaf with the unknown word. Thus all the constraints which the context imposes on the unknown word are collected.

The extracted information must then be refined because it may be overspecified. For example, we might have found a verb which has a plural object. Since we know that verbs do not specify the number feature of their objects, this information must be eliminated.

4 An example

4.1 Introduction

The example is based on Categorial Unification Grammar (Uszkoreit 1986). We want to parse the German sentence Der Schloßturm beherbergt das Museum, where Schloßturm is the unknown word.

The only syntactic rules needed are Left Application and Right Application, written in a PATR-style notation.

Right Application:
value -> functor argument
<value> = <functor val>
<argument> = <functor arg>
<functor dir> = right

Left Application:
value -> argument functor
<value> = <functor val>
<argument> = <functor arg>
<functor dir> = left

Figure 1: Rules of CUG


We can write the feature structures of these rules as attribute-value matrices¹.

Right Application:
[ functor:  [ arg: [1], dir: right, val: [2] ]
  argument: [1]
  value:    [2] ]

Left Application:
[ functor:  [ arg: [1], dir: left, val: [2] ]
  argument: [1]
  value:    [2] ]

Figure 2: Rules of CUG represented as attribute-value matrices
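The effect of these rules can be illustrated with a minimal unification over feature structures encoded as nested dicts. This is only a Python sketch: reentrancy and disjunction are omitted, and the atom "*U*" simply fails to unify with anything else.

```python
def unify(a, b):
    """Unification of nested dicts; returns None on an atom clash."""
    if a == b:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)
        for key, bval in b.items():
            u = unify(out[key], bval) if key in out else bval
            if u is None:
                return None
            out[key] = u
        return out
    return None  # clashing atoms, including "*U*" vs. anything else

def right_apply(functor, argument):
    """Right Application: unify the argument into the functor's arg
    slot; the result is the functor's val slot."""
    f = unify(functor, {"dir": "right", "arg": argument})
    return None if f is None else f["val"]

# A functor from n to np combines with a noun but not with an np:
det = {"dir": "right", "arg": {"cat": "n"}, "val": {"cat": "np"}}
print(right_apply(det, {"cat": "n"}))    # succeeds
print(right_apply(det, {"cat": "np"}))   # fails: None
```

Because reentrancy is not modelled here, features of the argument do not flow into the value; in the grammar proper this is exactly what the boxed tags provide.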

In the following, we assume a full-form lexicon. The lexical entries for our example are the following:

der:
[ arg: [ cat: n, arg: *U*,
         [1]{ gend: mask, num: sg, case: nom
            | gend: fem,  num: sg, case: gen
            | gend: fem,  num: sg, case: dat
            | num: pl, case: gen } ]
  dir: right
  val: [ cat: np, arg: *U*, [1] ] ]

Figure 3: Lexical entry for der

das:
[ arg: [ cat: n, arg: *U*, case: [1]{ nom, akk }, gend: neut, num: sg ]
  dir: right
  val: [ cat: np, gend: neut, case: [1], num: sg, arg: *U* ] ]

Figure 4: Lexical entry for das

The determiner der (figure 3) has an ambiguous lexical entry because its argument, the noun it combines with, can either be masculine, singular and nominative case, or it can be feminine, singular and genitive or dative case, or it can be plural and genitive case. The number, gender and case of the resulting NP are the same as those of the argument.

The lexical entry of the determiner das (figure 4) contains only a disjunction between the nominative and accusative cases.

Museum:
[ cat: n, arg: *U*, case: { nom, akk, dat }, gend: neut, num: sg ]

Figure 5: Lexical entry for Museum

The noun Museum (figure 5) is ambiguous between three cases.

¹ In the following, feature graphs are represented as attribute-value matrices. Braces indicate a disjunction of values. Numbers enclosed in square brackets indicate reentrancy. The atom *U* is the value UNDEFINED, which does not unify with any other value.


beherbergt:
[ arg: [ cat: np, case: akk, arg: *U* ]
  val: [ arg: [ cat: np, case: nom, num: sg, arg: *U* ]
         val: [ cat: s, arg: *U* ] ] ]

Figure 6: Lexical entry for beherbergt

The verb beherbergt takes two arguments: a nominative NP and an accusative NP. Note that we have not specified the direction in which the verb looks for its argument to permit the two possible orderings Der Schloßturm beherbergt das Museum and Das Museum beherbergt der Schloßturm.

For the unknown word Schloßturm, we assume the following entry, which is a disjunction between nouns (n), adjectives (n/n), intransitive verbs (s/np) and transitive verbs ((s/np)/np)².

{ [ cat: n, arg: *U* ]
| [ arg: [ cat: n, arg: *U* ], dir: right, val: [ cat: n, arg: *U* ] ]
| [ arg: [ cat: np, case: nom, arg: *U* ], val: [ cat: s, arg: *U* ] ]
| [ arg: [ cat: np, case: akk, arg: *U* ]
    val: [ arg: [ cat: np, case: nom, arg: *U* ], val: [ cat: s, arg: *U* ] ] ] }

Figure 7: Lexical entry for an unknown word

In this example, the rules do not permit the combination of the unknown word Schloßturm with anything but the determiner der.

4.2 Step 1: Parsing the sentence with the unknown word

The first step³ in parsing is applying the rule of Right Application (RA) to der and Schloßturm. This means that the functor path of RA is instantiated with the lexical entry for der, and the argument path is instantiated with the lexical entry for the unknown word Schloßturm. The value for the resulting category (figure 8) is then found in the value path of the rule.

² Note that simply returning a variable or the top element in a unification formalism would not work with categorial unification grammar, because this variable could be taken as a functor which combines with everything to form a new constituent whose category is again a variable and could in turn combine with anything adjacent to it. This would only produce a lot of ambiguities, while all the resulting constituents contain no information whatsoever.

³ In this example, parsing proceeds bottom-up and only the correct rules are applied. This choice was made for expository convenience only and is irrelevant for the results presented here. Any other parsing strategy would be appropriate.

[ cat: np, arg: *U*,
  [1]{ gend: mask, num: sg, case: nom
     | gend: fem,  num: sg, case: gen
     | gend: fem,  num: sg, case: dat
     | num: pl, case: gen } ]

Figure 8: Feature structure for der Schloßturm

The combination of der Schloßturm and beherbergt is not possible, because the argument of the verb must be an accusative NP, whereas the NP is ambiguous between nominative, genitive and dative.

RA applied to das and Museum results in the following structure:

[ cat: np, arg: *U*, case: { nom, akk }, gend: neut, num: sg ]

Figure 9: Feature structure for das Museum

RA applied to beherbergt and das Museum yields the structure in figure 10.

[ arg: [ cat: np, case: nom, num: sg, arg: *U* ]
  val: [ cat: s, arg: *U* ] ]

Figure 10: Feature structure for beherbergt das Museum

LA applied to der Schloßturm and beherbergt das Museum yields figure 11, which satisfies the constraint that a well-formed parse result have category S and no unfilled arguments.

[ cat: s, arg: *U* ]

Figure 11: Feature structure of the sentence

4.3 Step 2: Extracting information from the parse result

S
├── NPnom
│   ├── der
│   └── Schloßturm
└── VP
    ├── beherbergt
    └── NPakk
        ├── das
        └── Museum

Figure 12: Parse tree for the sentence


Figure 12 shows the parse tree for the sentence⁴. For extracting the information about the unknown word, we need to consider only those nodes in the tree which dominate the unknown word (here the S and NPnom nodes), and their children. The choice of these nodes is a result of the locality of the syntactic rules. Every rule applies only to a local tree (figure 13), that is, one node and its (immediate) children. No dependencies which go beyond a local tree can be expressed in one rule; they must be handled by passing a feature through the tree. We must consider all the children of a node, because there can be dependencies between children which are not reflected in the parent node⁵.

S
├── NPnom
└── VP

NPnom
├── der
└── Schloßturm

Figure 13: The relevant local trees

For extracting information about the unknown word, we apply the rules "backwards", that is we do not extract the value of the value path, but rather we compute the value of either the functor or the argument path (whichever belongs to the node dominating the unknown word) from the value of the other two paths. This is done starting from the root node, and then going through all nodes which dominate the unknown word.

To collect whatever information is available about the unknown word Schloßturm, we propagate constraints down the tree from every node dominating the unknown word, starting from the root. In this example, we need to find values for the NPnom node and for the unknown word Schloßturm.

In the following figure, we see the Left Application rule instantiated with the feature structure of the S node (figure 11) for the value path, and the feature structure of the VP node (figure 10) for the functor path.

[ argument: [1]
  functor:  [ arg: [1][ cat: np, case: nom, num: sg, arg: *U* ]
              dir: left
              val: [2] ]
  value:    [2][ cat: s, arg: *U* ] ]

Figure 14: Instantiated Left Application rule

The feature structure of the argument path is extracted and then unified with the already existing feature structure of the NPnom node (figure 8). Note that all the ambiguity present in figure 8 is resolved in figure 15.

[ cat: np, arg: *U*, case: nom, gend: mask, num: sg ]

Figure 15: New feature structure for der Schloßturm

The same method can be applied again starting from the NPnom node. We instantiate the Right Application rule with the new feature structure of the NPnom node (figure 15) for the value path and the lexical entry for der (figure 3) for the functor path.

⁴ The labels on the nodes are introduced for future reference to the nodes in this paper.

⁵ This is not the case for LFG with its up and down metavariables. For LFG, we would not have to consider any other children.


[ argument: [1]
  functor:  [ arg: [1][ cat: n, gend: [3], num: [4], case: [5], arg: *U* ]
              dir: right
              val: [2] ]
  value:    [2][ cat: np, gend: [3] mask, num: [4] sg, case: [5] nom, arg: *U* ] ]

Figure 16: Instantiated Right Application rule

The value of the argument path is extracted from this instantiated rule. This is the information that can be extracted about the unknown word Schloßturm from the sentence in the example.

[ cat: n, gend: [3] mask, num: [4] sg, case: [5] nom, arg: *U* ]

Figure 17: Feature structure of the unknown word Schloßturm

A more formal description of the algorithm follows:

Algorithm for extracting information about an unknown word

Input: A parse tree with root node S. Associated with each node are a feature structure and the name of the rule with which the node was constructed. One of the leaves is marked as unknown⁶.

Output: The feature structure for the unknown word.

Function call: propagate-constraints(S)

Function definition: propagate-constraints(NODE)
- if NODE is the unknown word, then return its feature structure.
- else
  - unify the feature structures of NODE and of its daughters with the corresponding paths of the rule associated with NODE.
  - let NEXT-NODE be that daughter of NODE which dominates the unknown word.
  - extract from the instantiated rule the new feature structure of NEXT-NODE.
  - return propagate-constraints(NEXT-NODE).

Figure 18: Algorithm for extracting information about the unknown word

⁶ If there is more than one parse result and/or unknown word, the algorithm must be applied to each pair <parse-result, unknown-word>.
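The core step of the algorithm, running an application rule "backwards" to solve for the daughter that dominates the unknown word, can be sketched in Python over feature structures encoded as nested dicts. Reentrancy is not modelled in this sketch, so constraints are merged explicitly; the structures mirror figures 10, 11 and 15 in simplified form.

```python
def unify(a, b):
    """Unification of nested dicts; returns None on an atom clash."""
    if a == b:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)
        for key, bval in b.items():
            u = unify(out[key], bval) if key in out else bval
            if u is None:
                return None
            out[key] = u
        return out
    return None

def solve_argument(mother, functor):
    """Given a node's structure and its functor daughter, return the
    constraints the application rule places on the argument daughter."""
    f = unify(functor, {"val": mother})
    return None if f is None else f.get("arg")

# The S node and the VP "beherbergt das Museum" (simplified):
s_node = {"cat": "s"}
vp = {"arg": {"cat": "np", "case": "nom", "num": "sg"},
      "val": {"cat": "s"}}

# Step down from the root: what must the subject NP look like?
subject = solve_argument(s_node, vp)
# Merge with what the parse already knew about "der Schloßturm":
subject = unify(subject, {"cat": "np", "gend": "mask"})
print(subject)  # {'cat': 'np', 'case': 'nom', 'num': 'sg', 'gend': 'mask'}
```

As in the paper, the case ambiguity of the NP is resolved by the constraints flowing down from the verb.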


Semantic information may also be encoded in the lexicon; verbs and adjectives, for instance, can have selectional restrictions on their arguments. The verb marry, for example, requires that both its subject and object be human. When we process the sentence Sigismund marries Friederike, where both names are unknown, we can conclude that both of them must be human.

4.4 Step 3: Creation of lexical entries

The information obtained in step 2 may be overspecified for a lexical entry.

This problem arises with the extraction of information about functor categories. Imagine that the unknown word in the example sentence had been beherbergt. The entry for the unknown word (figure 7) includes the definition of a transitive verb as one disjunct:

[ arg: [ cat: np, case: akk, arg: *U* ]
  val: [ arg: [ cat: np, case: nom, arg: *U* ]
         val: [ cat: s, arg: *U* ] ] ]

Figure 19: Feature structure for a transitive verb

After processing the sentence and extracting the information about the word, we get the following structure for beherbergt.

[ arg: [ cat: np, case: akk, gend: neut, num: sg, arg: *U* ]
  val: [ arg: [ cat: np, case: nom, gend: mask, num: sg, arg: *U* ]
         val: [ cat: s, arg: *U* ] ] ]

Figure 20: Overspecified lexical entry

This structure contains too much information: it includes values for the number and gender of the verb's complements. The situation would be even worse if the NPs had a semantics feature; then beherbergt could only combine with Museum and Schloßturm. However, we have also found a relevant piece of information about beherbergt, namely that it must have a singular subject.

To overcome this problem, we need a theory about the lexicon that tells us which features can be instantiated in the lexicon and which features must not be instantiated. For example, verbs only specify the case, number and person features of their subject and the case feature for the objects.

We need a filter which extracts from the feature structure obtained for the unknown word only those paths which should be present in a lexical entry. This can be implemented by a unary rule. The right-hand side of the rule is the information extracted from parsing the unknown word, and the left-hand side is the lexical entry which we want to obtain. The following example is the rule for transitive verbs. This rule only extracts information which is not already contained in the definition of a transitive verb (figure 19).

TV-LEX:
lexentry -> info
<lexentry val arg num> = <info val arg num>
<lexentry val arg pers> = <info val arg pers>

Figure 21: Filter for transitive verbs

After applying this filter to the feature structure in figure 20 and unifying the result with the definition of a transitive verb (figure 19), we obtain the following lexical entry.

[ arg: [ cat: np, case: akk, arg: *U* ]
  val: [ arg: [ cat: np, case: nom, num: sg, arg: *U* ]
         val: [ cat: s, arg: *U* ] ] ]

Figure 22: Refined lexical entry
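Such a filter can be sketched as copying only licensed paths from the extracted structure into the word-class template. This is a Python sketch; the template and path list are simplified renderings of figures 19 and 21.

```python
import copy

# Template for transitive verbs (cf. figure 19, simplified).
TV_TEMPLATE = {
    "arg": {"cat": "np", "case": "akk"},
    "val": {"arg": {"cat": "np", "case": "nom"},
            "val": {"cat": "s"}},
}

# Paths a transitive verb may specify beyond the template (cf. figure 21).
TV_PATHS = [("val", "arg", "num"), ("val", "arg", "pers")]

def get_path(fs, path):
    """Follow a path of attribute names; None if it is absent."""
    for key in path:
        if not isinstance(fs, dict) or key not in fs:
            return None
        fs = fs[key]
    return fs

def set_path(fs, path, value):
    """Install a value at a path, creating intermediate dicts."""
    for key in path[:-1]:
        fs = fs.setdefault(key, {})
    fs[path[-1]] = value

def filter_entry(extracted, template, paths):
    """Keep only the licensed paths of the extracted structure,
    merged into a copy of the class template."""
    entry = copy.deepcopy(template)
    for path in paths:
        value = get_path(extracted, path)
        if value is not None:
            set_path(entry, path, value)
    return entry

# The overspecified structure of figure 20 (simplified):
extracted = {
    "arg": {"cat": "np", "case": "akk", "gend": "neut", "num": "sg"},
    "val": {"arg": {"cat": "np", "case": "nom", "gend": "mask", "num": "sg"},
            "val": {"cat": "s"}},
}
entry = filter_entry(extracted, TV_TEMPLATE, TV_PATHS)
print(entry["val"]["arg"])  # number kept, gender filtered out
```

The subject's number survives while its gender, and all features of the object beyond case, are discarded, matching figure 22.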

5 Problems

A crucial question is what happens to the unknown word after its lexical entry has been constructed. This lexical entry is of course not perfect and should therefore be kept in a temporary lexicon.

The next time the word is found in a sentence, this temporary entry can be used. This second sentence can provide more constraints about the word, which can be extracted using the method described above.

The new lexical entry must then be unified with the old one, because we have acquired additional information about the word.

On the other hand, the first reading which was extracted might be too narrow⁷. For example, in figure 17, the word Schloßturm is in the nominative case. However, in a different context, it can also be dative or accusative. This would call for a disjunction of values. For gender, on the other hand, disjunction is the wrong approach, since nouns have only one gender. If we had the values mask and {mask, neut} for gender, the resulting value should be mask, i.e. the information about gender must be unified⁸. Once again, this calls for taking into account the structure of lexical entries. For specific word classes, there are some features which must have unique values, while others allow a disjunction of values.

Again, this could be handled by a rule which is used whenever two entries from the temporary lexicon are combined to form one. We think that this problem can only be reasonably approached by integrating the methods presented here with a morphological component.

The next question concerns the handling of selectional restrictions. We assume that selectional restrictions are represented as sortal restrictions⁹. If we have found different selectional restrictions for a noun in different contexts, all these restrictions apply in conjunction. For example, if something is restricted to be human, female and adult, we know that it is a woman. In terms of a sort lattice, this would be the greatest common subsort.
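The conjunction of sortal restrictions can be sketched over a toy sort hierarchy. The sorts and ISA links below are invented for illustration; a real system would use a proper sort lattice in the sense of Smolka (1988).

```python
# Toy sort hierarchy: child -> set of immediate supersorts.
ISA = {
    "woman": {"human", "female", "adult"},
    "girl": {"human", "female"},
    "human": {"animate"},
    "female": {"animate"},
    "adult": {"animate"},
    "animate": set(),
}

def supersorts(sort):
    """All sorts a given sort is subsumed by, including itself."""
    seen = {sort}
    stack = [sort]
    while stack:
        for parent in ISA[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def common_subsorts(restrictions):
    """Sorts that satisfy every restriction in the conjunction."""
    return [s for s in ISA if set(restrictions) <= supersorts(s)]

print(common_subsorts({"human", "female", "adult"}))  # ['woman']
```

The maximal elements of this set are the greatest common subsorts; in the example the conjunction human, female, adult denotes exactly woman.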

⁷ This problem can also arise for words which are already present in the lexicon. A solution could be to parse a sentence several times, each time treating a different word in the sentence as unknown, in order to find out which entry might need to be extended. It is hoped that the lexical knowledge engineering is good enough that such an approach is not needed.

⁸ There are some exceptions, like the German noun "Angestellte", which is ambiguous between masculine and feminine gender.

⁹ For the use of sorts see for example Smolka (1988).


For functor categories, which impose selectional restrictions on their arguments, we can try to compute the selectional restriction based on the values that satisfy them in example sentences. If we find a new value that satisfies the restriction, our hypothesis about the restriction must be loosened accordingly. In terms of a sort lattice, the resulting restriction would be the least common supersort of the competing hypotheses.

The problem that arises in this case is finding the correct generalization. Consider the selectional restriction on the object of the verb repair. Given some examples, e.g. computer, chair, bike, there is no straightforward way to compute the correct sortal restriction, namely that the object of repair can be any artefact.

A serious problem for verbs is that of finding the correct subcategorization frame based on examples, i.e. discriminating between complements and adjuncts.

Another, more technical, problem is that a misspelt word can be taken to be an unknown word; information about it would then be extracted and added to the lexicon. To overcome this problem, the processing of unknown words should be combined with spelling correction: as soon as the information about the unknown word has been extracted, the word is matched against known words compatible with that information.

This overcomes a well-known problem with spelling correction, namely the problem of choosing among several alternatives, all of which are just as plausible, according to the distance measures used in spelling correction algorithms (Fischer 1980, Urmi 1978).
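A sketch of this idea in Python, using difflib as a stand-in for the distance measures cited above; the lexicon and category labels are invented for illustration:

```python
import difflib

# Toy lexicon mapping words to their (single) category.
LEXICON = {"beherbergt": "tv", "beherrscht": "tv", "Bergwerk": "n"}

def correct(word, required_cat):
    """Closest lexicon words whose category matches the one
    extracted for the supposed unknown word."""
    candidates = difflib.get_close_matches(word, LEXICON, n=5, cutoff=0.6)
    return [w for w in candidates if LEXICON[w] == required_cat]

# A misspelling of "beherbergt", parsed as a transitive verb:
print(correct("beherbegt", "tv"))
```

Candidates that are close in spelling but belong to the wrong word class are filtered out by the extracted category, which is exactly the disambiguation the text describes.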

6 Conclusion

We have presented a method which makes it possible to parse sentences containing unknown words, and to obtain preliminary lexical entries for them. It should be pointed out that parsing sentences with unknown words is useful in itself, even if no information about the unknown word is extracted. A parser which cannot deal with unknown words fails on every sentence which contains one: 100 percent failure for a sentence of which only a small percentage cannot be processed. A text understanding system based on such a parser cannot acquire any knowledge from a sentence which contains an unknown word. A parser which can process unknown words, by contrast, can provide a syntactic and semantic structure for such sentences, underspecified only with respect to the semantics of the unknown word.

The algorithm has been implemented in Arity/Prolog on an IBM PS/2. STUF (Bouma, König, Uszkoreit 1988) has been used as the unification formalism.

Special thanks to Judith Engelkamp, Günter Neumann and Harald Trost for useful suggestions.

Bibliography

[Bouma, König, Uszkoreit 1988] Bouma, G., E. König and H. Uszkoreit: A Flexible Graph-Unification Formalism and its Application to Natural-Language Processing. In: IBM Journal of Research and Development 32, 2, pp. 170-184.

[Erbach 1987] Erbach, Gregor: An Efficient Chart Parser Using Different Strategies. Department of Artificial Intelligence Discussion Paper Number 52, Edinburgh, 1987.

[Fischer 1980] Fischer, R. J.: Distanzmasse zwischen Zeichenreihen - Definitionen und Algorithmen. In: P. R. Wossidlo (ed.): Textverarbeitung und Informatik, Berlin (West), 1980, pp. 127-138.

[Gust, Ludewig 1989] Gust, Helmar and Petra Ludewig: Zielgerichtete Wortschatzerweiterungen in natürlichsprachlichen Systemen. In: D. Metzing (ed.): GWAI-89, 1989, pp. 224-233.

[Pereira, Warren 1980] Pereira, Fernando and David Warren: Definite Clause Grammars for Language Analysis - a Survey of the Formalism and a Comparison with Augmented Transition Networks. In: Artificial Intelligence 13, 3, 1980, pp. 231-278.

[Rayner et al. 1988] Rayner, Manny, Åsa Hugosson and Göran Hagert: Using a Logic Grammar to Learn a Lexicon. COLING, Budapest, 1988, pp. 524-529.

[Smolka 1988] Smolka, Gert: A Feature Logic with Subsorts. LILOG-Report 33, Stuttgart, 1988.

[Urmi 1978] Urmi, J.: String to String Correction. Tekniska Högskolan, Linköping, 1978.

[Uszkoreit 1986] Uszkoreit, Hans: Categorial Unification Grammars. COLING, Bonn, 1986, pp. 187-194.
