
6.3 Evaluation of Partial Parse Selection Models


The evaluation of partial parses is not as easy as the evaluation of full parses. For full parsers, there are generally two evaluation approaches.

For parsers that are trained on a treebank using an automatically extracted grammar, an unseen set of manually annotated data is used as the test set. The parser output on the test set is compared to the gold standard annotation, either with the widely used PARSEVAL measurement, or with more annotation-neutral dependency relations.

The evaluation procedure is largely automated. The annotation quality plays a dominating role in such evaluation. When the analysis of a specific language phenomenon needs to be changed, the treebank annotation needs to be updated accordingly. This is, to say the least, difficult and time-consuming for manually annotated treebanks.

For parsers based on manually compiled precision grammars, more human judgment is involved in the evaluation. Unlike treebank-induced grammars, the parsing output of a precision grammar does not conform to the existing treebank annotation. Moreover, the analysis can change dramatically with the evolution of the grammar. Therefore, the PARSEVAL metrics are not practical for manually compiled precision grammars.

The PARSEVAL metric counts the proportion of bracketings which group the same sequences of words in both the gold standard trees and the parser output. Early versions of the PARSEVAL metric ignored the question of whether matching sequences were labelled the same way in both trees (unlabelled PARSEVAL), but more refined versions have subsequently taken this into account (labelled PARSEVAL).
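To make the distinction concrete, the following sketch computes unlabelled and labelled PARSEVAL scores for a single sentence, with each tree represented simply as a set of (label, start, end) constituent spans. The representation and function names are illustrative only and do not correspond to any particular evaluation toolkit.

```python
# Minimal sketch of (un)labelled PARSEVAL bracket matching.
# A tree is represented as a set of (label, start, end) constituent spans.

def parseval_scores(gold, test):
    """Return (unlabelled_f1, labelled_f1) for two span sets."""
    unlabelled_gold = {(s, e) for _, s, e in gold}
    unlabelled_test = {(s, e) for _, s, e in test}

    def f1(matched, n_gold, n_test):
        if n_gold == 0 or n_test == 0:
            return 0.0
        p, r = matched / n_test, matched / n_gold
        return 2 * p * r / (p + r) if p + r else 0.0

    unlabelled = f1(len(unlabelled_gold & unlabelled_test),
                    len(unlabelled_gold), len(unlabelled_test))
    labelled = f1(len(gold & test), len(gold), len(test))
    return unlabelled, labelled


gold = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5)}
test = {("S", 0, 5), ("NP", 0, 2), ("PP", 2, 5)}   # same spans, one wrong label
print(parseval_scores(gold, test))                 # labelled score is lower
```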

More annotation-neutral evaluation methods (e.g., dependency relation-based evaluation) are plausible. However, extra effort is needed to convert the outputs if the precision grammar uses a different representation. For the DELPH-IN grammars, we use (Robust) Minimum Recursion Semantics ((R)MRS; Copestake et al. (1999); Copestake (2006)) as the semantic output. The conversion from MRS to dependency structures, though possible, is less well studied.

Instead of relying on pre-annotated gold standard treebanks, the evaluation of manually compiled DELPH-IN precision grammars is accomplished with so-called dynamic treebanks. With the evolution of the grammar, the treebank, being the parsing output of the grammar, changes over time (Oepen et al., 2002). The grammar writer needs to update the treebank by inspecting the parses generated by the grammar and either “accepting” or “rejecting” the new analyses. The performance change of the grammar/parser is then evaluated on the updated treebank: the coverage is the proportion of grammatical sentences which receive at least one correct analysis; the overgeneration is the proportion of ungrammatical sentences which receive at least one analysis; etc. However, the size of dynamic treebanks is usually relatively small, since the burden of updating the entire treebank after each grammar change is non-trivial. Also, the fixed set of test items makes the test non-blind: after several iterations, the grammar can be tuned to the test set, either intentionally or unintentionally. Therefore, only the first round of treebanking can be regarded as an unbiased evaluation. In the evaluation of parser accuracy, the performance of the statistical disambiguation model should also be considered together with the grammar performance.
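The coverage and overgeneration figures can be stated as a small computation over the treebanked test items. The sketch below assumes each item records its grammaticality and its parse outcome; the field names are hypothetical and not taken from any DELPH-IN tool.

```python
# Sketch: coverage and overgeneration over a treebanked test set.
# The per-item field names are hypothetical placeholders.

def coverage_and_overgeneration(items):
    grammatical = [i for i in items if i["grammatical"]]
    ungrammatical = [i for i in items if not i["grammatical"]]

    # coverage: grammatical sentences with at least one *correct* analysis
    coverage = (sum(1 for i in grammatical if i["has_correct_analysis"])
                / len(grammatical)) if grammatical else 0.0
    # overgeneration: ungrammatical sentences that receive *any* analysis
    overgeneration = (sum(1 for i in ungrammatical if i["n_analyses"] > 0)
                      / len(ungrammatical)) if ungrammatical else 0.0
    return coverage, overgeneration
```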

However, evaluation becomes more difficult for partial parsing. For manually compiled grammars, the criterion for an acceptable partial analysis is less evident, and most current treebanking tools are not designed for annotating partial analyses. Large-scale manually annotated treebanks do provide annotations for sentences that deep grammars are not able to fully analyze, but the annotation differences between language resources make the comparison less straightforward. Further complications arise from the platform and resources used in this experiment.

For instance, the transformation of incomplete RMRS fragments into other representations is largely an open question.

In this section, we use both manual and automatic evaluation methods on the partial parsing results. Different processing resources are used to support the evaluation from a syntactic as well as a semantic point of view. Some of these results have also been reported by Zhang et al. (2007a).

6.3.1 Syntactic Evaluation

In order to evaluate the quality of the syntactic structures of the partial parses, we implemented the partial parse selection models described in the previous section in the PET parser. The nov-06 version of the ERG is used for the experiment. As test set, we used a subset of sentences from Section 22 of the Wall Street Journal portion of the Penn Treebank. The subset contains 143 sentences which i) do not receive any full analysis licensed by the grammar, and ii) do not contain lexical gaps (input tokens for which the grammar cannot create any lexical edge). The first criterion allows us to investigate the potential of our partial parsing mechanism, while the second criterion avoids complications from coverage loss due to an incomplete lexicon.

Although the techniques developed in previous sections can substantially improve lexical coverage and would provide us with a larger test set, we would like to carefully separate the different aspects of robustness in this study. The average sentence length in this test set is 24 words.
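The test set construction amounts to a simple filter over the parser's per-sentence results. The sketch below assumes hypothetical per-item records of the number of full analyses and of lexical gaps; it is only meant to make the two criteria explicit.

```python
# Sketch of the test set filter: keep WSJ Section 22 sentences that
# (i) receive no full analysis from the grammar and (ii) contain no
# lexical gaps.  The record fields are hypothetical placeholders.

def select_test_items(parsed_items):
    return [item for item in parsed_items
            if item["n_full_analyses"] == 0     # criterion i)
            and item["n_lexical_gaps"] == 0]    # criterion ii)
```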

Due to the inconsistencies in tokenization, bracketing and branching between the Penn Treebank annotation and their handling in the ERG, we manually checked the partial parse derivation trees.

In the Penn Treebank, most punctuation marks are treated as separate tokens/words, while in the ERG most of them are treated as affixes. ERG analyses are strictly either unary or binary branching, while the Penn Treebank branchings are much more flexible (i.e., flat constructions for less agreed-upon analyses).

Each output is marked as one of three cases: GBL (good labelled bracketing) if both the bracketing and the labelling of the partial parse derivation trees are good (no more than two crossing brackets and no more than four false labellings); GB (good unlabelled bracketing) if the bracketing of the derivation trees is good (no more than two crossing brackets) but the labelling is bad (more than four false labellings); or E (erroneous) otherwise.
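The three-way marking can be expressed as a simple decision over two counts. The sketch below assumes the crossing-bracket and false-label counts have already been obtained from the manual comparison; the thresholds follow the definition above.

```python
# Sketch of the three-way marking used in the manual syntactic evaluation.
# crossing: number of crossing brackets w.r.t. the reference bracketing;
# false_labels: number of incorrectly labelled constituents.

def mark_partial_parse(crossing: int, false_labels: int) -> str:
    if crossing <= 2 and false_labels <= 4:
        return "GBL"   # good labelled bracketing
    if crossing <= 2:
        return "GB"    # good bracketing, bad labelling
    return "E"         # erroneous
```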

The manual evaluation results are listed in Table 6.1. The test set is processed with the two models presented in Section 6.2 (M-I for model I, M-II for model II). For comparison, we also evaluate the approach using the shortest path with heuristic weights (denoted by SP). When more than one path with the same weight is found, only the first one is recorded and evaluated.

            GBL             GB              E
          #     %         #     %         #     %
  SP     55   38.5%      64   44.8%      24   16.8%
  M-I    61   42.7%      46   32.2%      36   25.2%
  M-II   74   51.7%      50   35.0%      19   13.3%

Table 6.1: Syntactic evaluation results for different partial parse selection models

The results show that the naïve shortest path approach based on heuristic weights works reasonably well at predicting the bracketing: 83.3% of its partial parses (the GBL and GB cases combined) have no more than two crossing brackets. However, when the labelling is also evaluated, it falls behind model I and is outperformed even more clearly by model II.

6.3.2 Semantic Evaluation

Evaluation of the syntactic structure reflects partial parse quality only from certain aspects. In order to get a more thorough comparison between the different selection models, we also look at the semantic output generated from the partial parses.

The same set of 143 sentences from Section 22 of the Wall Street Journal portion of the Penn Treebank is used. The RMRS semantic representations are generated from the partial parses with the different selection models. For comparison, we used RASP 2 (Briscoe et al., 2006), a domain-independent robust parsing system for English. According to Briscoe and Carroll (2006), the parser achieves a fairly good accuracy of around 80%. The reasons why we chose RASP for the evaluation are: i) RASP has reasonable coverage and accuracy; ii) its output can be converted into the RMRS representation with the LKB system. Since there is no large-scale (R)MRS treebank with sentences not covered by the DELPH-IN precision grammars, we hope to use RASP’s RMRS output as a standalone annotation to help evaluate the different partial parse selection models. However, we do not claim that the RASP output is a “gold standard” in any respect. In fact, a shortest path algorithm similar to ours is used in the system to achieve maximal robustness. The output from RASP is used as a reference to help us determine whether there is a significant performance difference between our partial parse selection models, but none of the following results should be taken as an absolute quantitative measure.

In future research, we do see an emerging need for a platform-independent standard evaluation for deep linguistic processing systems.

For both deep and shallow parsing systems, we have seen that in recent years more and more researchers have expressed a similar opinion in different ways (e.g., Carroll, 1998; Carroll et al., 2002). More recently, we have also seen that in the shallow parsing community, dependency structure based evaluation is becoming a de facto standard (Buchholz and Marsi, 2006). Some of the deep processing systems can produce compatible dependency structures, which allows cross-platform evaluation. However, due to the limited expressive power of dependency structures, fine-grained linguistic description of subtle meanings is not always available. For the deep processing systems which adopt richer semantic representations (e.g., (R)MRS), conversion to dependency structures is a workaround rather than an optimal solution. One possible direction is to explore methods for converting dependency structures into (R)MRS, and to use (R)MRS as the basis for evaluation. It is also necessary to create a larger gold standard (R)MRS treebank which is manually corrected and independent from other specific language resources. The development of the SEM-I (semantic interface) for the DELPH-IN deep grammars is an initial step in this direction, where the clearly defined semantic output will largely facilitate platform-independent parser evaluation.

Returning to our evaluation, in order to compare the RMRSs from RASP and from the partial parse selection models, we used the similarity measure proposed by Dridan and Bond (2006). The comparison outputs a distance value between two different RMRSs. We normalized the distance value to lie between 0 and 1. For each selection model, the average RMRS distance from the RASP output is listed in Table 6.2.
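The following sketch only illustrates how a per-sentence distance, normalised to [0, 1], is averaged over the test set. The overlap-based distance used here is a simplified stand-in for the actual measure of Dridan and Bond (2006), and the predicate-set representation is an assumption for illustration.

```python
# Simplified sketch: per-sentence RMRS distance normalised to [0, 1],
# then averaged over the test set.  The overlap-based distance is only
# a stand-in for the measure of Dridan and Bond (2006).

def rmrs_distance(preds_a, preds_b):
    """Distance between two RMRSs represented as sets of predicate names."""
    if not preds_a and not preds_b:
        return 0.0
    overlap = len(preds_a & preds_b)
    return 1.0 - overlap / max(len(preds_a), len(preds_b))

def average_distance(pairs):
    """Mean normalised distance over (model_rmrs, rasp_rmrs) pairs."""
    return sum(rmrs_distance(a, b) for a, b in pairs) / len(pairs)
```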

          RMRS Dist. (φ)
  SP          0.674
  M-I         0.330
  M-II        0.296

Table 6.2: RMRS distance to RASP outputs

Again, we see that the outputs of model II achieve the highest similarity to the RASP output. Through some manual validation, we confirmed that the difference in similarity does imply a significant difference in the quality of the output RMRSs. The shortest path with heuristic weights yields very poor semantic similarity.

The main reason is that not every edge with the same span generates the same semantics. Therefore, although the SP approach achieves reasonable bracketing accuracy, it has no way of distinguishing the goodness of different edges covering the same span. By incorporating the tree probability P(t_i) in the scoring model, models I and II can produce RMRSs of much higher quality.
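The contrast can be illustrated schematically: among candidate edges covering the same input span, the span-based heuristic cannot tell the candidates apart, whereas a probability term does. This sketch is only an illustration of that contrast, not the actual models of Section 6.2; the edge records and probabilities are hypothetical.

```python
# Schematic contrast between the span-only heuristic and probability-
# weighted selection among edges covering the same input span.

def pick_edge_heuristic(edges):
    # All candidates cover the same span, so the heuristic keeps the
    # first one found (cf. the SP setting above).
    return edges[0]

def pick_edge_weighted(edges):
    # A probabilistic model prefers the edge with the higher estimated
    # tree probability P(t_i).
    return max(edges, key=lambda e: e["prob"])

edges_for_span = [
    {"rule": "np-frag", "prob": 0.07},   # hypothetical candidate edges
    {"rule": "hd-cmp",  "prob": 0.62},   # covering the same span
]
print(pick_edge_heuristic(edges_for_span)["rule"])  # np-frag (arbitrary)
print(pick_edge_weighted(edges_for_span)["rule"])   # hd-cmp
```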
