
6.4 Constituent Order Generation

6.4.3 Implemented Methods

6.4.3.1 The RANDOM Baseline

The lower bound in the evaluation experiments is provided by the baseline which puts the constituents in a random order without taking any linguistic information into account.

6.4.3.2 The RAND IMP Baseline

We improve the trivial random baseline described above with three syntax-oriented rules, motivated by the fact that German is an SVO/SOV6 (or, in Generative Grammar, only SOV (Ouhalla, 1994)) language in which the subject by default precedes the object:

1. The VF is reserved for the subject.

2. The second position (i.e., right after the verb in the main clause) is reserved for the direct object if there is any.

3. The order of the remaining constituents is generated randomly.
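For illustration, a minimal sketch of these three rules, assuming each constituent carries a syntactic-function label; the Constituent record, the label strings "subj"/"obja" and the function name order_rand_imp are illustrative rather than the actual implementation:

```python
import random
from dataclasses import dataclass

@dataclass
class Constituent:
    text: str
    function: str  # e.g. "subj", "obja", "pp", "adv" (illustrative labels)

def order_rand_imp(constituents):
    """RAND IMP: subject first, direct object right after the verb slot,
    remaining constituents in random order."""
    subj = next((c for c in constituents if c.function == "subj"), None)
    obj = next((c for c in constituents if c.function == "obja"), None)
    rest = [c for c in constituents if c is not subj and c is not obj]
    random.shuffle(rest)                  # rule 3: random order for the rest
    order = []
    if subj is not None:
        order.append(subj)                # rule 1: VF reserved for the subject
    if obj is not None:
        order.append(obj)                 # rule 2: slot after the verb for the direct object
    return order + rest
```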

6.4.3.3 The SYNT-SEM Baseline

In searching for the correct order, similarly to Ringger et al. (2004), we select the order with the highest probability conditioned on syntactic and semantic categories. Unlike them, we use

6 S, V and O stand for subject, verb and object, respectively.

dependency parses and compute the probability of the top clause node only, which is modified by all constituents. With these adjustments, the probability of an order O given the history h, conditioned on the syntactic functions of the constituents (s_1, ..., s_n), is simply:

\[
P(O \mid h) = \prod_{i=1}^{n} P(s_i \mid s_{i-1}, h) \tag{6.11}
\]

Ringger et al. (2004) do not make explicit what their set of semantic relations consists of.

From the example in the paper, it seems that these are a mixture of lexical and syntactic information7. Our annotation does not specify semantic relations between constituents. Instead, some of the constituents are categorized as pers, loc, temp, org or undef ne if their heads bear one of these labels (see Sec. 2.1.2). By joining these with possible syntactic functions, we obtain a larger set of syntactic-semantic tags such as subj-pers, pp-loc, adv-temp. We transform each clause in the training set into a sequence of such tags, plus three tags for the verb position (v), the beginning (b) and the end (e) of the clause. Then we compute the bigram probabilities8.
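For illustration, a minimal sketch of the tag-sequence construction and a plain maximum-likelihood bigram count; the helper names are illustrative, and the actual probabilities in the thesis are computed with the CMU Toolkit:

```python
from collections import Counter

def joined_tag(function, sem_class=None):
    # e.g. ("subj", "pers") -> "subj-pers"; ("pp", None) -> "pp"
    return f"{function}-{sem_class}" if sem_class else function

def clause_to_tags(before_verb, after_verb):
    """Turn one clause into a tag sequence with boundary (b, e) and verb (v) tags."""
    return (["b"]
            + [joined_tag(f, s) for f, s in before_verb]
            + ["v"]
            + [joined_tag(f, s) for f, s in after_verb]
            + ["e"])

def bigram_counts(training_sequences):
    """Collect bigram and unigram counts over tag sequences (plain MLE sketch)."""
    bigrams, unigrams = Counter(), Counter()
    for seq in training_sequences:
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
    return bigrams, unigrams

# Example clause corresponding to the tag sequence "b subj-pers v pred pp e":
seq = clause_to_tags([("subj", "pers")], [("pred", None), ("pp", None)])
```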

Thus, our third baseline (SYNT-SEM) selects from all possible orders the one with the highest probability as calculated with the following formula:

\[
P(O \mid h) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}, h) \tag{6.12}
\]

where t_i is from the set of joined tags. For Example (6.5), possible tag sequences (i.e., orders) are 'b subj-pers v pred pp e', 'b pp v subj-pers pred e', 'b pred v subj-pers pp e', etc. Note that, given that the longest clause has ten constituents, the algorithm has to evaluate up to 10! permutations for every clause in order to find the one with the highest probability.
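For illustration, a minimal sketch of this exhaustive search, assuming the bigram probabilities are available as a lookup table (the dictionary bigram_prob is an illustrative stand-in for the CMU-Toolkit model, and the verb tag is placed after the first constituent as in a verb-second main clause):

```python
from itertools import permutations

def order_probability(sequence, bigram_prob):
    """Probability of a tag sequence as a product of bigram probabilities (Eq. 6.12)."""
    p = 1.0
    for prev, cur in zip(sequence, sequence[1:]):
        p *= bigram_prob.get((prev, cur), 1e-6)  # small floor for unseen bigrams
    return p

def best_order(constituent_tags, bigram_prob):
    """Score every permutation (up to n! candidates) and keep the most probable one."""
    best, best_p = None, -1.0
    for perm in permutations(constituent_tags):
        seq = ["b", perm[0], "v", *perm[1:], "e"]
        p = order_probability(seq, bigram_prob)
        if p > best_p:
            best, best_p = list(perm), p
    return best

# Illustrative usage with made-up probabilities:
probs = {("b", "subj-pers"): 0.4, ("subj-pers", "v"): 0.9,
         ("v", "pred"): 0.3, ("pred", "pp"): 0.2, ("pp", "e"): 0.5}
print(best_order(["subj-pers", "pred", "pp"], probs))
```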

6.4.3.4 The UCHIMOTO Baseline

As the fourth baseline, we train a maximum entropy learner (OpenNLP9) and reimplement the algorithm of Uchimoto et al. (2000) (see Sec. 6.2 for the description). For every possible permutation, its probability is estimated according to the following formula (copied from

7 For example, DefDet, Coords, Possr, werden.

8 We use the CMU Toolkit (Clarkson & Rosenfeld, 1997) to compute the probabilities.

9 http://opennlp.sourceforge.net

Eq. (6.2)):

\[
\begin{aligned}
P(1 \mid h) &= P(\{W_{i,i+j} = 1 \mid 1 \le i \le n-1,\ 1 \le j \le n-i\} \mid h) \\
&\approx \prod_{i=1}^{n-1} \prod_{j=1}^{n-i} P(W_{i,i+j} = 1 \mid h_{i,i+j}) \\
&= \prod_{i=1}^{n-1} \prod_{j=1}^{n-i} P_{ME}(1 \mid h_{i,i+j})
\end{aligned} \tag{6.13}
\]

The task of the binary classifier is to predict the probability that the order of a pair of constituents is correct, P_ME(1 | h_{i,i+j}). Figure 6.4 illustrates the training and the testing phases.
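For illustration, a minimal sketch of Eq. (6.13), assuming a trained pairwise classifier is exposed as a callable p_me(a, b) that returns P_ME(1 | h_{i,i+j}) for the pair (a, b); the function names are illustrative, not the OpenNLP API:

```python
from itertools import permutations

def order_probability(order, p_me):
    """Probability of an order as the product of the pairwise probabilities
    P_ME(1 | h_{i,i+j}) over all constituent pairs (Eq. 6.13)."""
    p = 1.0
    n = len(order)
    for i in range(n - 1):
        for j in range(i + 1, n):
            p *= p_me(order[i], order[j])
    return p

def best_order(constituents, p_me):
    """Exhaustively score every permutation and return the most probable one."""
    return max(permutations(constituents),
               key=lambda perm: order_probability(perm, p_me))
```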

The prediction is made based on the following features describing the verb or h_c, the head of a constituent c10:

vlex the lemma of the root of the clause (non-auxiliary verb);

vpass the voice of the verb;

vmod the number of constituents to order;

lex the lemma of h_c or, if h_c is a functional word, the lemma of the word which depends on it (e.g., for PPs the noun is taken, because it is the preposition which is dependent on the verb);

pos the part-of-speech tag of h_c;

sem if defined, the semantic class of c; e.g., im April 1900 (in April 1900) and mit Albert Einstein (with Albert Einstein) are classified temp and pers respectively;

syn the syntactic function of h_c;

same whether the syntactic function of the two constituents is the same;

mod the number of modifiers of h_c;

rep whether h_c appears in the preceding sentence;

pro whether c contains an (anaphoric) pronoun.

10 We exclude features which use information specific to Japanese and not applicable to German (e.g., on postpositional particles).

(a) Generating training instances from the data    (b) Estimating the probability of the order

Figure 6.4: The training and testing phases of the UCHIMOTO baseline.

6.4.3.5 The MAXENT Method

The first configuration of our system is an extended version of the UCHIMOTO baseline (MAXENT).

To the features describing c we added the following ones:

det the kind of determiner modifying h_c (def, indef, non-appl);

rel whether h_c is modified by a relative clause (yes, no, non-appl);

dep the depth of c;

len the length of c in words.

The first two features describe the discourse status of a constituent; the other two provide information about its “weight”. Since our learner treats all values as nominal, we discretized the values of dep and len with a C4.5 classifier (Kohavi & Sahami, 1996).
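For illustration, a minimal sketch of how the extended feature set could be assembled for a single constituent; the dictionary fields and the discretization boundaries are assumptions made for the sketch and stand in for the actual annotation and the C4.5-induced bins:

```python
def discretize(value, boundaries):
    """Map a numeric value to a nominal bin (illustrative boundaries,
    not the ones induced by the C4.5-based discretization)."""
    for i, b in enumerate(boundaries):
        if value <= b:
            return f"bin{i}"
    return f"bin{len(boundaries)}"

def maxent_features(c):
    """Features describing one constituent c for the MAXENT configuration
    (a subset of the UCHIMOTO features plus det, rel, dep, len)."""
    return {
        "lex": c["head_lemma"],           # lemma of the head (or its dependent for PPs)
        "pos": c["head_pos"],
        "sem": c.get("sem", "undef"),     # pers, loc, temp, org, undef
        "syn": c["function"],
        "det": c.get("det", "non-appl"),  # def, indef, non-appl
        "rel": c.get("rel", "non-appl"),  # modified by a relative clause?
        "dep": discretize(c["depth"], boundaries=[1, 2, 4]),
        "len": discretize(c["length"], boundaries=[2, 5, 10]),
    }
```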

6.4.3.6 The TWO-STEP Method

The main difference between our first algorithm MAXENT and this one (TWO-STEP) is that we generate the order in two steps:

1. For the VF, using the OpenNLP maximum entropy learner for a binary classification (VF vs. MF), we select the constituent c with the highest probability of being in the VF, i.e., arg max_c P(c|VF). Figure 6.5a illustrates the process with an example where, from a set of four constituents (c_1, c_2, c_3, c_4), one is selected (c_3).

2. For the MF, the remaining constituents are put into a random order and then sorted.

The training data for the second task is generated only from the MF of clauses. Figure 6.5b gives an example where the three constituents in the MF are sorted with only three comparisons.
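For illustration, a minimal sketch of the first step; p_vf is assumed to be a callable returning the classifier's probability that a constituent stands in the VF, and select_for_vf is an illustrative name:

```python
def select_for_vf(constituents, p_vf):
    """Step 1: pick arg max_c P(c|VF) and separate it from the remaining constituents."""
    best = max(constituents, key=p_vf)
    rest = [c for c in constituents if c is not best]
    return best, rest
```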

Another modification concerns the efficiency of the algorithm. Instead of calculating probabilities for all pairs, we obtain the right order from a random one by sorting. We compare adjacent elements by consulting the learner, as if we were sorting an array of numbers with Bubble Sort (the efficiency of the sorting algorithm is not important here). Given two adjacent constituents c_i and c_j such that c_i precedes c_j, we compute the probability of them being in the right order, i.e., P(c_i ≺ c_j). If it is less than 0.5, we transpose the two and compare c_i with the next adjacent constituent.
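For illustration, a minimal sketch of this sorting step; p_correct(a, b) is assumed to return P(a ≺ b) as estimated by the pairwise classifier:

```python
def sort_constituents(cons, p_correct):
    """Bubble-sort the constituents using the learner as a comparator:
    adjacent elements are swapped whenever P(c_i ≺ c_j) < 0.5."""
    cons = list(cons)
    n = len(cons)
    for end in range(n - 1, 0, -1):
        for i in range(end):
            if p_correct(cons[i], cons[i + 1]) < 0.5:
                cons[i], cons[i + 1] = cons[i + 1], cons[i]
    return cons
```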

Since the sorting method presupposes that the predicted relation is transitive, we checked whether this is really so on the development and test data sets. We looked for three constituents c_i, c_j, c_k from a sentence S such that P(c_i ≺ c_j) > 0.5, P(c_j ≺ c_k) > 0.5 and P(c_i ≺ c_k) < 0.5, and found none. Therefore, unlike UCHIMOTO, where one needs to make exactly N! × N(N−1)/2 comparisons to select the best order of N constituents, we need to make N(N−1)/2 comparisons at most. Figure 6.6 presents an implementation of ORDER-CONSTITUENTS(cons) from the algorithm in Figure 6.3 with TWO-STEP. The function SELECT-FOR-VF(cons) removes the best candidate for the VF from the given set of constituents (cons) and returns it. The functions RANDOMIZE-ORDER(cons) and SORT(cons) are void and do what their names suggest: the former brings the constituents in a list into a random order; the latter sorts constituents by estimating the probability P(c_i ≺ c_j). The choice of the sorting algorithm is not important for us, as the number of constituents to order hardly ever exceeds eight.
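For illustration, a minimal sketch of the transitivity check over constituent triples, again assuming the pairwise probability function p_correct(a, b) = P(a ≺ b):

```python
from itertools import permutations

def transitivity_violations(constituents, p_correct):
    """Return all triples (c_i, c_j, c_k) with P(c_i ≺ c_j) > 0.5 and
    P(c_j ≺ c_k) > 0.5 but P(c_i ≺ c_k) < 0.5."""
    violations = []
    for ci, cj, ck in permutations(constituents, 3):
        if (p_correct(ci, cj) > 0.5
                and p_correct(cj, ck) > 0.5
                and p_correct(ci, ck) < 0.5):
            violations.append((ci, cj, ck))
    return violations
```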

In the implementation of our fusion system, to make the treatment of main and hypotactic clauses similar, we use the first classifier also to find the best candidate for the first position in the subordinate clauses. There, the classifier learns to assign high probability to subordinate conjunctions (e.g., dass (that), weil (because), obwohl (although) etc.) as well as to relative pronouns (die (which/that/who), dessen (whose) etc.). Although this could be done with a few simple rules, we prefer to train an extra classifier in order not to treat hypotactic clauses as a special case.