
published as: CLAUS Report number 5, Saarland University, February 1991

A Bottom-Up Algorithm for Parsing and Generation

Gregor Erbach*

Universität des Saarlandes
Computational Linguistics
Im Stadtwald, W-6600 Saarbrücken, FRG
e-mail: erbach@coli.uni-sb.de

Abstract

We present a bottom-up algorithm for parsing and generation. The algorithm is a bottom-up chart parser, whose lexical lookup phase has been modified for generation. An analysis of the algorithm offers interesting insights into the relationship between parsing and generation, summarized by the statement that parsing is a very constrained form of generation. The use of the generation algorithm as a component of a grammar development environment is discussed.

1 Parsing and Generation

In this paper, we present a chart-based algorithm for parsing and generation.

We will not consider generation from a semantic representation, for which the usual head-driven algorithms [Shieber et al. 1990] are better suited, but rather the generation of large sets of sentences. This functionality is needed in grammar development systems for exploring the coverage of a given grammar, and for checking whether a grammar overgenerates, which is particularly useful if grammars are developed not only for analysis, but also for generation.

Let us start by giving an intuitive characterization of bottom-up parsing and generation.

*This work was conducted in IBM Germany's text understanding project LILOG. I would like to thank Roman Georg Arens, Monika Becker, Günter Neumann, Karel Oliva and Hans Uszkoreit for fruitful discussion.


Bottom-up parsing starts by creating an item for every occurrence of a word in the input string. Items carry information about the portion of the string they cover, their position in the string, and their syntactic category. If adjacent items match the right-hand side of a grammar rule, they are combined into larger items until an item is found that covers the entire string and whose category is one that the grammar defines as a well-formed structure. The parsing process produces as many solutions as there are different derivations for the given string.

In bottom-up generation, one initial item is created for every entry of the lexicon. In the case of generation, items need not carry information about string position. Any sequence of items whose categories match the right-hand side of a grammar rule can be combined into a larger item whose category is the left-hand side of the grammar rule. This process continues until no more items can be built. As any interesting language contains an infinite number of sentences, such an algorithm never stops generating items unless it is somehow restricted (see section 3).

Under this view, parsing and generation differ in two respects:

1. In the initialization phase, a parser uses only the words of the string, whereas a generator uses all words of the lexicon.

2. Items used in parsing are indexed for their position in the string in order to check adjacency, whereas items used for generation carry information about the string they cover, but not about string position.

These differences are illustrated in figure 1.

These differences suggest that parsing is a very restricted form of generation, in which it is already known which sentence must be generated. A similar point on the relationship between parsing and generation is made in [Kay 1980].


Figure 1: the differences between items used for parsing and for generation (figure not reproduced).

We will now outline a generation algorithm that functions as a parser if more restrictions are imposed on the process. The only difference between parsing and generation occurs in the initialization phase.

2 The Algorithm

The algorithm presented here differs from the intuitive formulation given in section 1 in that the application of a rule to one or more items in order to form a new item is generalized in the usual way by using active items. A rule is only applied to one item whose category unifies with the first category of the rule's right-hand side, leading to the creation of an item whose remainder is the rest of the rule's right-hand side. If the remainder is not empty, the item is active, and can be combined with a passive item whose category unifies with the leftmost element of the active item's remainder.

2.1 Data Structures

The principal data structure used by the algorithm is the item. An item is a quadruple:

<Span,String,Category,Remainder>

• Span is a pair <Starting Vertex, Ending Vertex>

• String is a list of words


• Category is a non-terminal symbol of the grammar.1

• Remainder is a list of categories. If Remainder is empty, the item is passive (complete), otherwise it is active (incomplete).

Tree structures are not included in the item, as they can be built up in the syntactic category, e. g. if Definite-Clause Grammar or HPSG [Pollard and Sag 1987] are used as the grammar formalism.

When running the algorithm as a parser, the items need not carry information about the string, whereas in the generation case, it is crucial that the string, but not the span, is instantiated at the beginning of the process.
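To make the data structure concrete, the item quadruple might be rendered as follows. This is a minimal sketch in Python with atomic categories; the field names are ours, not part of the algorithm:

```python
from typing import NamedTuple, Optional, Tuple

class Item(NamedTuple):
    """Chart item <Span, String, Category, Remainder>."""
    span: Optional[Tuple[int, int]]   # <starting vertex, ending vertex>; None while generating
    string: Tuple[str, ...]           # the words the item covers
    category: str                     # non-terminal symbol (atomic for simplicity)
    remainder: Tuple[str, ...]        # categories still missing; () means passive

    def is_passive(self) -> bool:
        return not self.remainder

# a positioned passive lexical item for parsing,
# and an unpositioned one for generation
parse_item = Item((0, 1), ("the",), "det", ())
gen_item = Item(None, ("the",), "det", ())
```

With a feature-based formalism, `category` would instead hold a feature structure, and passivity would still be read off the empty remainder.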

Items may be actual or pending. The set of actual items is called the chart, and the set of pending items is called the agenda. Pending items have a priority, and are added to the chart in the order of descending priority.

Whenever a pending item is added to the chart, it becomes actual, and new pending items are created by applying a rule to the item or by combining it with an item of complementary activity (i. e. one of the two items must be active, and the other passive).

The reason that pending items are not immediately added to the chart is that this scheme allows one to control the order of exploration of the search space.

If pending items are added to the chart immediately, the result is depth-first search, which leads to the enumeration of an uninteresting segment of the search space for the generation case, as illustrated by the following example:

The cat saw the cat.

The cat saw the green cat.

The cat saw the green green cat.

The cat saw the green green green cat.

...

In order to avoid such behaviour, we have adopted a strategy that will first generate all well-formed strings2 of length 1, then all strings of length 2, all strings of length 3, length 4 and so on, until a user-defined maximum length is reached. If the user does not define a maximum length, the algorithm will run forever. In order to achieve this behaviour, we define a parsing strategy which prefers short pending items over long ones.
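The effect of this strategy, generating all strings up to a length bound, can be illustrated for the context-free case. The sketch below uses a naive fixpoint loop instead of the agenda machinery of section 2, and invents its own lexicon and rule representation:

```python
from itertools import product

def generate(lexicon, rules, max_len):
    """Generate, for every category, all derivable strings of length
    <= max_len (CFG case: categories match by identity).
    Returns a dict mapping category -> set of word tuples."""
    derived = {}                      # cat -> set of strings derived so far
    for w, cats in lexicon.items():   # seed with one item per lexical entry
        for cat in cats:
            derived.setdefault(cat, set()).add((w,))
    changed = True
    while changed:                    # keep applying rules until no string
        changed = False               # within the length bound is new
        for lhs, rhs in rules:
            parts = [derived.get(c, set()) for c in rhs]
            for combo in product(*parts):
                s = tuple(w for part in combo for w in part)
                if len(s) <= max_len and s not in derived.setdefault(lhs, set()):
                    derived[lhs].add(s)
                    changed = True
    return derived

lexicon = {"the": ["det"], "cat": ["n"], "saw": ["v"]}
rules = [("np", ("det", "n")), ("vp", ("v", "np")), ("s", ("np", "vp"))]
out = generate(lexicon, rules, max_len=5)
```

Because every rule application only combines already-derived shorter strings, the loop terminates as soon as no new string within the bound can be built, which is exactly why the length limit guarantees termination.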

1We are not committed to any particular grammar formalism. Our algorithm works for Context-Free Grammars, Definite Clause Grammars [Pereira and Warren 1980] and Feature Unification Grammars [Shieber 1986].

2For practical purposes, only a subset of the possible strings of each length is generated, see section 3.


For parsing, this scheme opens the possibility to implement different parsing strategies (see [Erbach 1987]).

2.2 The Procedures

PROCESS(String)

1. INITIALIZE(String)
2. MAIN-LOOP

The procedure PROCESS is used for both parsing and generation. For parsing, it must be called with an input string, for generation with a variable.

After this process, all the results for parsing or generation can be found in the chart.

INITIALIZE(String)3

1. remove all items from previous parsing or generation processes
2. if String is instantiated
       for every word W in the string from position n to n+1
           for every lexical entry of W with category Cat
               ADD-ITEM(<span(n,n+1),W,Cat,nil>)
   if String is not instantiated
       for every lexical entry with word W and category Cat
           ADD-ITEM(<span(?,?),W,Cat,nil>)

The purpose of the initialization procedure is to clear the chart and the agenda, to perform lexical lookup, and to create an initial agenda of pending items.

Lexical lookup is the only phase in which the algorithm differs for parsing and generation. The difference is the following:

In the case of parsing, we know which words occur in the string and what their positions are. Consequently, one item will be created for every lexical entry of every word token of the string, and each item is indexed with string positions. Such indexing allows the algorithm to check for adjacency of items.

In the case of generation, we do not know which words will be used, nor do we know their string positions. Therefore, an item is created for every word in the lexicon. The string position is left uninstantiated. This allows a word to occur more than once in the generated string, because it can be used at different positions.

3nil is the empty list, and ? is a free variable.
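Assuming a toy lexicon that maps words to lists of atomic categories (a simplification of real lexical entries), the two modes of lexical lookup can be sketched as follows:

```python
def init_items(lexicon, words=None):
    """Lexical lookup, the only phase that differs between parsing and
    generation. Items are 4-tuples (span, string, category, remainder)."""
    items = []
    if words is not None:                 # parsing: one item per word token,
        for i, w in enumerate(words):     # indexed with its string position
            for cat in lexicon.get(w, []):
                items.append(((i, i + 1), (w,), cat, ()))
    else:                                 # generation: one item per lexical
        for w, cats in lexicon.items():   # entry, position left open (None)
            for cat in cats:
                items.append((None, (w,), cat, ()))
    return items

lexicon = {"the": ["det"], "cat": ["n"], "saw": ["v"]}
parse_init = init_items(lexicon, ["the", "cat"])   # two positioned items
gen_init = init_items(lexicon)                     # three unpositioned items
```

Everything downstream of this phase is shared between the two modes, which is the sense in which the lookup phase alone turns the generator into a parser.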

ADD-ITEM(<span(Begin,End),String,Cat,Rem>)4

1. add item <span(Begin,End),String,Cat,Rem> to the chart
2. if Remainder is empty
       for every rule LHS -> RHS of the grammar
           if first(RHS) = Cat
               ADD-TO-AGENDA(<span(Begin,End),String,LHS,rest(RHS)>)
       for every active item <span(Begin0,End0),String0,Cat0,Rem0>
           if End0 = Begin and Cat unifies with first(Rem0)
               ADD-TO-AGENDA(<span(Begin0,End),String0+String,Cat0,rest(Rem0)>)
   if Remainder is not empty
       for every passive item <span(Begin1,End1),String1,Cat1,nil>
           if End = Begin1 and first(Rem) unifies with Cat1
               ADD-TO-AGENDA(<span(Begin,End1),String+String1,Cat,rest(Rem)>)

ADD-ITEM adds an item to the chart, and creates new pending items:

- in case of a passive item for every applicable rule and every applicable active item

- in case of an active item for every applicable passive item
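For the context-free case, where "unifies" reduces to equality of symbols, ADD-ITEM might be rendered roughly as follows. The sketch (our own, not the paper's implementation) uses None for an uninstantiated vertex, so that in generation the adjacency test End0 = Begin succeeds trivially, just as it does with free variables in the algorithm:

```python
def add_item(chart, agenda, item, rules):
    """ADD-ITEM for the CFG case. item is a 4-tuple
    (span, string, category, remainder); remainder == () means passive."""
    chart.append(item)
    (begin, end), string, cat, rem = item
    if not rem:                                    # passive item:
        for lhs, rhs in rules:                     # 1) apply every rule whose
            if rhs[0] == cat:                      #    RHS starts with cat
                agenda.append(((begin, end), string, lhs, rhs[1:]))
        for (b0, e0), s0, c0, r0 in list(chart):   # 2) extend adjacent actives
            if r0 and e0 == begin and r0[0] == cat:
                agenda.append(((b0, end), s0 + string, c0, r0[1:]))
    else:                                          # active item: look for
        for (b1, e1), s1, c1, r1 in list(chart):   # adjacent passive items
            if not r1 and end == b1 and rem[0] == c1:
                agenda.append(((begin, e1), string + s1, cat, rem[1:]))

rules = [("np", ("det", "n"))]
chart, agenda = [], []
add_item(chart, agenda, ((0, 1), ("the",), "det", ()), rules)
add_item(chart, agenda, agenda.pop(0), rules)      # promote the active np item
add_item(chart, agenda, ((1, 2), ("cat",), "n", ()), rules)
# agenda now holds the complete noun phrase ((0, 2), ("the", "cat"), "np", ())
```

In the real algorithm the promotion step is performed by MAIN-LOOP in order of priority; here it is done by hand to keep the example short.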

ADD-TO-AGENDA(Item)

1. EVALUATE(Item,Priority)
2. if Item is not longer than the maximal length
       add pair <Item,Priority> to the agenda

ADD-TO-AGENDA adds a new item to the agenda instead of adding it to the chart immediately like ADD-ITEM. Items which are on the agenda are assigned a priority, and the item with the highest priority is added to the chart by MAIN-LOOP. The assignment of a priority to an item on the agenda is the means for controlling the search strategy of the algorithm.

EVALUATE(Item,Priority)

1. give shorter items higher priority

4first(List) selects the first element of List, rest(List) selects the rest of the list (i. e. the list without its first element), + denotes concatenation of strings, and what unify means depends on the grammatical formalism. In the case of context-free grammar it means identity of non-terminal symbols, in the case of DCG it means Prolog unification, and for feature unification grammars, it means feature term unification.


This strategy is useful in order to achieve complete generation for strings of increasing length. Any other search strategy may be defined.

MAIN-LOOP

1. while agenda is not empty
       remove item I with the highest priority from agenda
       ADD-ITEM(I)

The procedure MAIN-LOOP makes pending items actual, thereby creating new pending items.

3 Application in a Grammar Development Environment

Traditionally, grammar development environments consist of an editor for writing grammars, and a parser for verification of the grammar. A parser is useful for checking whether a grammar can analyze a given sample of sentences, and whether it assigns sentences the correct structures.

When writing grammars for generation5, it is important to assure that the grammar is restrictive enough, and does not allow any ungrammatical sentences. For this task, a parser is not well suited. Even though one can provide the parser with a negative sample of ungrammatical sentences to check whether they are correctly rejected, one can never be sure of having anticipated all possibilities for ungrammaticality.

What is needed is a component which generates a representative sample of sentences, so that the grammar writer can check which set of sentences the grammar generates, and whether it contains ungrammatical sentences.

There is a division of labor between the parser and the generator. The parser checks the coverage of the grammar (in logical terms the completeness), and the generator checks whether the grammar overgenerates (in logical terms the correctness).

Because natural languages are infinite, it is not possible to generate the entire language. Therefore, a representative finite subset of the language must be generated. There are a number of ways to restrict the generated language:

Restrict the lexicon

5Here, we mean generation in the usual sense as generation from a semantic representation.


Restriction of the lexicon avoids uninteresting lexical variation in the generated sample sentences:

John saw the cat
John saw the dog
John saw the rat
John saw the mat
...

For a class of structurally similar lexical entries, only one representative is selected. This requires modification of the lexical lookup phase.

Restrict recursion

John saw the blue car
John saw the blue blue car
John saw the blue blue blue car
John saw the blue blue blue blue car
...

This requires storing the derivational history with the items, and putting an upper limit on the repetition that is allowed. However, one should not eliminate recursion completely, since it is an interesting property of the grammar.
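One possible realization (the representation and names are ours): store a derivation tree with each item, and refuse a rule application once that rule already occurs a fixed number of times in the tree:

```python
def rule_uses(derivation, rule):
    """Count occurrences of `rule` in a derivation tree, represented as a
    nested tuple (rule_or_word, child_subtrees...)."""
    label, *kids = derivation
    return (label == rule) + sum(rule_uses(k, rule) for k in kids)

def may_apply(rule, child_derivations, limit=2):
    """Allow a rule only while it occurs fewer than `limit` times
    in the material being combined."""
    return sum(rule_uses(d, rule) for d in child_derivations) < limit

# "the blue blue car": the adjective rule has applied twice already,
# so a third application is blocked
d = ("np_adj", ("blue",), ("np_adj", ("blue",), ("np", ("car",))))
assert not may_apply("np_adj", [d])
assert may_apply("np_adj", [("np", ("car",))])
```

A limit of 2 or 3 keeps at least one genuinely recursive example per rule in the sample without flooding it with repetitions.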

Limit the length of the sentences generated

This prevents the algorithm from running forever.

Restrict the set of rules

One may want to avoid some specialized grammar rules, if they treat phenomena of the language that one is currently not interested in.

Assure that new or modified rules or lexical entries are used

After modification of a grammar, one may want to know what effect the changes have on the coverage of the grammar. This restriction prevents generation of sentences which are based on exclusively old material.


4 Implementation and Efficiency

The algorithm has been implemented in Quintus Prolog under the AIX operating system on an IBM PS/2. Due to different restrictions on the parsing and the generation process, we use two separate programs (which share procedures), so that each can be optimized for parsing or for generation [Arens and Erbach 1991]. They are part of a grammar development environment which consists of a generator, a parser and tools for inspecting feature structures and parse results. STUF [Bouma, König, Uszkoreit 1988] is used as the grammar formalism.

The generation algorithm which was actually implemented differs from the above idealized algorithm in the following respects:

1. In order to save storage space, we do not store the feature structure with the item, but rather the derivation tree, from which the feature structure can be reconstructed. Note that reconstructing the feature structure of a string is computationally much less expensive than parsing the string because no search is involved.

2. We do, however, store a Prolog term for each rule and each item which contains a subset of the information contained in the corresponding feature structure. The motivation is to have a simple and inexpensive test, which can be applied before reconstructing and unifying feature structures. If the Prolog terms do not unify, neither will the corresponding feature structures. Only if the Prolog terms unify must the feature structures be reconstructed and unified. Since the feature structures contain more information than the Prolog terms, their unification may fail, even though the Prolog terms unify. But most of the useless computation steps are thus avoided. The same optimization is used by the parser.

3. For binary rules, no active items are constructed, but only passive items, which are produced by applying the rule to two items. This optimization saves a lot of storage space. Incidentally, we use the same optimization in the bottom-up active chart parser of our grammar development environment.

4. No active items are produced whose length is the maximal length. In parsing, no active items are produced which extend to the end of the input string.
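The quick-check of point 2 can be sketched as follows; the choice of features that go into the cheap key is an assumption of this illustration, not taken from the paper:

```python
def quick_key(fs):
    """A cheap, flat abstraction of a feature structure (here a plain dict).
    It plays the role of the Prolog term stored with each rule and item."""
    return (fs.get("cat"), fs.get("agr"))

def keys_unify(k1, k2):
    # None plays the role of an uninstantiated variable:
    # it is compatible with anything
    return all(a is None or b is None or a == b for a, b in zip(k1, k2))

def unifiable(fs1, fs2, full_unify):
    """Run the expensive unification only if the cheap keys are compatible.
    If the keys clash, the full structures cannot unify either; if the keys
    match, full unification may still fail, since keys carry less
    information than the structures they abstract."""
    if not keys_unify(quick_key(fs1), quick_key(fs2)):
        return False          # rejected cheaply, nothing reconstructed
    return full_unify(fs1, fs2)
```

The filter is sound in one direction only: it never discards a pair that would have unified, but it may let through pairs that later fail, which is exactly the trade-off described above.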

With all of these optimizations, the algorithm must be run overnight to get a sample of the sentences up to length 5, and over the weekend for sentences up to length 10, as preliminary experimentation with a comprehensive grammar of German indicates.

5 Conclusion

Chart-based algorithms are useful for both parsing and generation of large samples of sentences. They store intermediate results, and avoid recomputation of structures. This is very important for generation of samples, because larger strings are built from shorter ones. Moreover, we found that some optimizations which were developed for chart parsing also make generation more efficient.


References

[Arens and Erbach 1991]
R. G. Arens and G. Erbach. Evaluation von Grammatiken für die Analyse natürlicher Sprache durch Generierung einer repräsentativen Satzmenge. In: Th. Christaller (ed.), Proceedings of GWAI-91, Springer, Berlin.

[Bouma, König, Uszkoreit 1988]
G. Bouma, E. König and H. Uszkoreit. A Flexible Graph-Unification Formalism and its Application to Natural-Language Processing. IBM Journal of Research and Development 32(2), 170-184.

[Erbach 1987]
K. G. Erbach. An Efficient Chart Parser Using Different Strategies. Department of Artificial Intelligence Working Paper No. 52, University of Edinburgh.

[Kay 1980]
M. Kay. Algorithm Schemata and Data Structures in Syntactic Processing. Report CSL-80-12, XEROX PARC, Palo Alto, CA.

[Pereira and Warren 1980]
F. Pereira and D. H. D. Warren. Definite Clause Grammars for Natural Language Analysis. A Survey of the Formalism and a Comparison with Augmented Transition Networks. Artificial Intelligence 13, 231-278.

[Pollard and Sag 1987]
C. Pollard and I. A. Sag. Information-Based Syntax and Semantics. Volume 1: Fundamentals. CSLI Lecture Notes No. 13, Stanford, CA.

[Shieber 1986]
S. M. Shieber. An Introduction to Unification-Based Approaches to Grammar. CSLI Lecture Notes No. 4, Stanford, CA.

[Shieber et al. 1990]
S. M. Shieber, G. van Noord, F. C. N. Pereira and R. C. Moore. Semantic-Head-Driven Generation. Computational Linguistics 16(1), 30-42.


Appendix: Prolog Code of the Algorithm

The code presented here is not optimized for efficiency, but rather made as simple and clear as possible.

/***********************************************************************/
/*                                                                     */
/* process(?String,?Length,?Structure)                                 */
/*                                                                     */
/* process/3 is a procedure for parsing strings and for generating all */
/* possible strings up to a given length                               */
/*                                                                     */
/* Examples: process([the,cat,is,on,the,mat],Length,Structure)         */
/*           process(String,10,Structure)                              */
/*                                                                     */
/***********************************************************************/

process(String,Length,Structure) :-
    init_process(String,Length),
    process_loop([]),
    get_result(String,Structure).

init_process(String,Length) :-
    clear_chart(Length),
    nth(String,N,N1,Word),    % for every word of the string from N to N1
                              % (positions remain free for generation)
    lexicon(Word,Cat),        % for every lexical entry of Word
    add_item(item(span(N,N1),[Word],Cat,[])),
    fail.
init_process(_,_).

process_loop(OldAgenda) :-
    update_agenda(OldAgenda,[Item-_Priority|Agenda]),  % sort pending items
    !,                                                 % into agenda
    add_item(Item),       % add item to chart, creating new pending items
    process_loop(Agenda).
process_loop([]).

get_result(String,Structure) :-
    clause(item(_,String,Structure,[]),true).

add_item(Item) :-             % add item to chart
    assert(Item),
    fail.
add_item(Passive) :-          % apply rules to a passive item
    Passive = item(span(Begin,End),String,Cat0,[]),
    rule(LHS,[Cat1|RHS]),
    unify(Cat0,Cat1),
    add_to_agenda(item(span(Begin,End),String,LHS,RHS)),
    fail.


add_item(Passive) :-          % combine passive item with active items
    Passive = item(span(Mid,End),String1,Cat1,[]),
    Active = item(span(Begin,Mid),String0,Cat0,[Cat|Remainder]),
    clause(Active,true),
    unify(Cat,Cat1),
    append(String0,String1,String),
    add_to_agenda(item(span(Begin,End),String,Cat0,Remainder)),
    fail.
add_item(Active) :-           % combine active item with passive items
    Active = item(span(Begin,Mid),String0,Cat0,[Cat|Remainder]),
    Passive = item(span(Mid,End),String1,Cat1,[]),
    clause(Passive,true),
    unify(Cat1,Cat),
    append(String0,String1,String),
    add_to_agenda(item(span(Begin,End),String,Cat0,Remainder)),
    fail.
add_item(_).

add_to_agenda(Item) :-
    assert(agenda(Item)).

update_agenda(OldAgenda,Agenda) :-
    retract(agenda(Item)),
    evaluate(Item,Priority),
    insert_agenda(Item-Priority,OldAgenda,NewAgenda),
    !,
    update_agenda(NewAgenda,Agenda).
update_agenda(Agenda,Agenda).

insert_agenda(Item-Priority,[T0-P0|Items],[Item-Priority,T0-P0|Items]) :-
    Priority >= P0.
insert_agenda(Item-Priority,[T0-P0|Items],[T0-P0|Agenda]) :-
    Priority < P0,
    insert_agenda(Item-Priority,Items,Agenda).
insert_agenda(Job,[],[Job]).

evaluate(item(_,String,_,_),Priority) :-
    length(String,Length),
    clause(max_length(MaxLength),true),
    check_length(Length,MaxLength),
    Priority is 0 - Length.   % shorter strings are given priority

check_length(_,MaxLength) :-
    var(MaxLength),
    !.
check_length(Length,MaxLength) :-
    Length =< MaxLength.

nth(List,_,_,_) :-
    var(List),
    !.
nth([E|_],0,1,E).
nth([_|List],N,N1,E) :-
    nth(List,N0,N,E),
    N1 is N + 1.


unify(X,X).   % can be used for context-free grammar and DCG; replace this
              % by your favorite unification procedure for other formalisms

clear_chart(Length) :-
    retractall(item(_,_,_,_)),
    retractall(agenda(_)),
    retractall(max_length(_)),
    assert(max_length(Length)).

% this is the format for lexical entries
lexicon(the,det(det(the))).
lexicon(cat,n(n(cat))).
lexicon(saw,v(v(saw))).

% this is the format for grammar rules
rule(s(s(NP,VP)),[np(NP),vp(VP)]).
rule(np(np(DET,N)),[det(DET),n(N)]).
rule(vp(vp(V,NP)),[v(V),np(NP)]).
