
6.2.4 Alternative Estimation Functions

Generally speaking, the weights of the edges in the shortest path approach represent the quality of the local analyses and their likelihood of appearing in the analysis of the entire input.

This is an interesting parallel to the parse selection models for full analyses, where a goodness score is usually assigned to the full analysis. As already mentioned in earlier chapters, the parse disambiguation model described by Toutanova et al. (2002) uses a maximum entropy approach to model the conditional probability of a parse for a given input sequence, P(t|w). Similar approaches have also been reported by Abney (1997); Johnson et al. (1999); Riezler et al. (2002); Malouf and van Noord (2004).

The main difference is that we want to rank the intermediate parsing results rather than full analyses here. There are usually some well-formedness constraints given by the grammar (e.g., root conditions) which must be satisfied by the maximal projections to be licensed as full analyses. But for intermediate results, there are no such constraints. On the one hand, this allows maximal robustness, for all the local analyses are available on the parsing chart without constraints from larger contexts. On the other hand, this also raises the difficulty of fully discriminating the ambiguities, for a much larger number of possible intermediate results need to be ranked.

Formally, for a given partial parse Φ = ⟨t₁, . . . , tₖ⟩, δ = ⟨ω₁, . . . , ωₖ⟩ is a segmentation of the input sequence so that each local analysis tᵢ ∈ Φ corresponds to a sub-string ωᵢ ∈ δ of the input sequence ω. Therefore, the probability of the partial parse Φ given an input sequence ω is:

\[ P(\Phi\mid\omega) = P(\delta\mid\omega)\cdot P(\Phi\mid\delta) \tag{6.1} \]

With the assumption that the P(tᵢ|ωᵢ) are mutually independent for different i, we can derive:

\[ P(\Phi\mid\omega) \approx P(\delta\mid\omega)\cdot \prod_{i=1}^{k} P(t_i\mid\omega_i) \tag{6.2} \]

Therefore, the log-probability will be

\[ \log P(\Phi\mid\omega) \approx \log P(\delta\mid\omega) + \sum_{i=1}^{k} \log P(t_i\mid\omega_i) \tag{6.3} \]

Equation 6.3 indicates that the log-probability of a partial parse for a given input is the sum of the log-probabilities of the local analyses for the sub-strings, plus an additional component log P(δ|ω) representing the conditional log-probability of the segmentation. If we use log P(tᵢ|ωᵢ) as the weight for each local analysis, then the DAG shortest path algorithm will quickly find the partial parse that maximizes log P(Φ|ω) − log P(δ|ω).
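As a concrete illustration of this weighting scheme, the following sketch finds the best-scoring path through the chart of passive edges by dynamic programming over word positions (the standard DAG shortest-path computation, maximizing since the weights are log-probabilities). The chart representation and all names are illustrative assumptions, not the actual parser data structures.

```python
import math
from collections import defaultdict

def best_partial_parse(n, edges):
    """Sketch: find the partial parse maximizing the summed edge weights.

    n     -- number of input tokens; vertices are the positions 0..n
    edges -- (start, end, analysis, weight) tuples for the passive chart
             edges, where weight approximates log P(t_i | omega_i)
    """
    outgoing = defaultdict(list)
    for start, end, analysis, weight in edges:
        outgoing[start].append((end, analysis, weight))

    best = [-math.inf] * (n + 1)   # best score reaching each position
    back = [None] * (n + 1)        # back-pointer: (previous position, analysis)
    best[0] = 0.0

    for pos in range(n):           # positions are already in topological order
        if best[pos] == -math.inf:
            continue
        for end, analysis, weight in outgoing[pos]:
            if best[pos] + weight > best[end]:
                best[end] = best[pos] + weight
                back[end] = (pos, analysis)

    # Follow the back-pointers to recover the chosen local analyses.
    path, pos = [], n
    while pos > 0 and back[pos] is not None:
        pos, analysis = back[pos]
        path.append(analysis)
    return best[n], list(reversed(path))
```

With log-probabilities as weights, the path found maximizes the sum of the log P(tᵢ|ωᵢ), i.e. the right-hand side of Equation 6.3 without the segmentation component.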

The probability P(tᵢ|ωᵢ) can be modeled in a similar way to the maximum entropy based full parse selection models:

\[ P(t_i\mid\omega_i) = \frac{\exp \sum_{j=1}^{n} \lambda_j f_j(t_i, \omega_i)}{\sum_{t'\in T} \exp \sum_{j=1}^{n} \lambda_j f_j(t', \omega_i)} \tag{6.4} \]

where T is the set of all possible structures that can be assigned to ωᵢ, f₁ . . . fₙ are the features and λ₁ . . . λₙ are the parameters. The parameters can be efficiently estimated from a treebank, as shown by Malouf (2002). The only difference from the full parse selection model is that here intermediate results are used to generate events for training the model (i.e., the intermediate nodes are used as positive events if they occur on one of the active trees, or as negative events if not). Since there is a huge number of intermediate results available, we only randomly select a part of them as training data.
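A minimal sketch of Equation 6.4, assuming the features fⱼ and the estimated weights λⱼ are supplied from elsewhere (e.g., estimated with the methods of Malouf (2002)); the function and argument names are hypothetical.

```python
import math

def local_analysis_probs(candidates, features, weights):
    """Sketch of Equation 6.4: P(t | omega_i) for every candidate structure.

    candidates -- list of structures T that can be assigned to omega_i
    features   -- features(t) -> {feature name: count}, e.g. counts of
                  depth-1 derivation configurations
    weights    -- {feature name: lambda_j}
    """
    scores = [sum(weights.get(f, 0.0) * v for f, v in features(t).items())
              for t in candidates]
    # Normalize with log-sum-exp for numerical stability.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [math.exp(s - log_z) for s in scores]
```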

This random selection of training events is essentially similar to the approach of Osborne (2000), where there is an infeasibly large number of training events, only part of which is used in the estimation step. The exact features used in the log-linear model can significantly influence the disambiguation accuracy; however, this is beyond the scope of this discussion. In this experiment we used the same features as those used in the PCFG-S model of Toutanova et al. (2002) (i.e., depth-1 derivation trees).
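The event generation and sampling just described might look roughly as follows; the in_gold_tree predicate and the uniform sampling rate are assumptions made for illustration, not the exact scheme used in the experiments.

```python
import random

def sample_training_events(chart_nodes, in_gold_tree, sample_rate=0.1, seed=0):
    """Sketch: turn intermediate chart results into sampled training events.

    chart_nodes        -- intermediate passive edges from parsing the treebank
    in_gold_tree(node) -- True if the node occurs in one of the active trees
                          (positive event), False otherwise (negative event)
    sample_rate        -- fraction of events kept, since keeping every
                          intermediate result would be infeasible
    """
    rng = random.Random(seed)
    events = []
    for node in chart_nodes:
        if rng.random() < sample_rate:
            events.append((1 if in_gold_tree(node) else 0, node))
    return events
```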

With the n-gram language models trained on a large corpus, we can derive the probability of the input sequence P(ω), as well as of all the sub-sequences ωᵢ in the segmentations, P(ωᵢ). To estimate the conditional probabilities of the segmentations, we first use the following estimation to derive the unconditioned probabilities:

\[ \hat{P}(\delta) = \prod_{i=1}^{k} P(\omega_i) \tag{6.5} \]

The estimation of the conditional probability can be derived by normalizing the unconditioned probabilities over all the possible segmentations for the input:

\[ \hat{P}(\delta\mid\omega) = \frac{\hat{P}(\delta)}{\sum_{\delta'\in\Delta} \hat{P}(\delta')} \tag{6.6} \]

where Δ indicates the set of all possible segmentations. A closer look easily reveals that P̂(δ|ω) is solely determined by how independent the occurrences of word groups are around each segmentation point.

Considering a bi-gram language model, the segmentation probability is changed by a factor of P(wᵢ)·P(wᵢ₊₁)/P(wᵢ, wᵢ₊₁) from the language model probability of the input sequence P(ω) for each segmentation point at i. Intuitively, a good (plausible and probable) segmentation should separate the input sequence at points where the joint probabilities are lower than the products of the individual probabilities. This indicates that the words around the segmentation points are less correlated. Also, note that since P̂(δ|ω) will be normalized, the computation of the language model probability for the input sequence is not necessary.
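A small sketch of Equations 6.5 and 6.6, assuming a function that returns the n-gram log-probability of a sub-string (for instance from an external language model toolkit); it also shows why P(ω) itself never has to be computed, since only the normalized scores matter.

```python
import math

def segmentation_logprobs(segmentations, substring_logprob):
    """Sketch of Equations 6.5/6.6: normalized log P^(delta | omega).

    segmentations     -- candidate segmentations, each a list of sub-strings
    substring_logprob -- callable returning log P(omega_i) under the n-gram
                         language model (assumed to be provided externally)
    """
    # Equation 6.5: unconditioned score of each segmentation.
    raw = [sum(substring_logprob(w) for w in seg) for seg in segmentations]
    # Equation 6.6: normalize over all candidate segmentations (log-sum-exp).
    m = max(raw)
    log_z = m + math.log(sum(math.exp(s - m) for s in raw))
    return [s - log_z for s in raw]
```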

Computationally, the worst-case time complexity of computing P̂(δ|ω) for all segmentations is O(|Δ| · |ω|). |Δ| can potentially be large, for each position between words can be considered a segmentation point, leading to a total number of different segmentations of up to 2^(|ω|−1). Fortunately, in practice not all of them are licensed by the grammar.

Unfortunately, the shortest path algorithm itself is not able to directly find the Φ that maximizes P(Φ|ω), for each passive edge can occur in different segmentations, making the assignment of P(δ|ω) to edges difficult. Fully searching all the paths can be computationally expensive when there are many different segmentations. In order to achieve a balance between accuracy and efficiency, two different approximation approaches are taken.

One way is to assume that the component log P(δ|ω) in Equation 6.3 has a less significant effect on the quality of the partial parse. If this is valid, then we can simply use log P(tᵢ|ωᵢ) as edge weights, and use the shortest path algorithm to obtain the best Φ. This will be referred to as model I.

An alternative way is to first retrieve several “good” δ with relatively high P(δ|ω), and then select the best edges tᵢ that maximize P(tᵢ|ωᵢ) for each ωᵢ in δ. We call this approach model II.
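Model II could be sketched as below; the look-up functions for the chart candidates and for the local log-linear scores stand in for the actual chart and model interfaces and are purely illustrative.

```python
def model_two(segmentations, seg_logprobs, candidates_for, local_logprob, n_best=1):
    """Sketch of model II: good segmentations first, then best local analyses.

    segmentations       -- candidate segmentations (lists of sub-strings)
    seg_logprobs        -- log P^(delta | omega) for each segmentation
    candidates_for(w)   -- structures the chart offers for sub-string w
    local_logprob(t, w) -- log P(t | w) from the log-linear model
    n_best              -- how many high-scoring segmentations to keep
    """
    ranked = sorted(zip(seg_logprobs, segmentations),
                    key=lambda pair: pair[0], reverse=True)[:n_best]
    results = []
    for seg_score, seg in ranked:
        # For each sub-string, pick the local analysis with the best score.
        analyses = [max(candidates_for(w), key=lambda t: local_logprob(t, w))
                    for w in seg]
        results.append((seg_score, seg, analyses))
    return results
```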

How well these strategies work will be evaluated in Section 6.3.

Other strategies or more sophisticated search algorithms (e.g., genetic algorithms) can also be used, but we will leave that to future research. It is even possible to do a complete search for a globally optimal partial parse, though with even higher (potentially exponential) computational complexity.

6.3 Evaluation of Partial Parse Selection
