
Multiword expressions are a class of heterogeneous language phenomena which are prevalent in use, but lack systematic treatment. In this dissertation, we started from a grammar engineering perspective, aiming at the maximal robustness of deep processing. Compared to other studies of multiword expressions, our approach has the following advantages:

It is highly automatic. With various statistical measures over large corpora, as well as learning mechanisms, the entire process requires minimal human intervention.

It is not phenomenon specific. Our methods provide a practical solution to effectively handle different types of MWEs in a consistent way. Although fine-grained classification of different MWE phenomena can be helpful in improving the performance on specific types of MWEs, our approach is much more general.

It is not language specific. Our approach does not rely on any language-specific assumptions. Therefore, the methods can be easily adapted to work for different languages and grammars.

4.6 Summary

In this chapter, we have described techniques to automatically discover and handle multiword expressions from the grammar engineering perspective for maximal robustness. The error mining results provide us with MWE candidates in the form of low-parsability n-grams.

The statistical analyses on large corpora (BNC and WWW) help discriminate good candidates from noise. With good MWE candidates, we take either the “words-with-spaces” or the compositional approach in order to generate new lexical entries. Through a series of experiments we have shown that both methods are able to improve the grammar coverage significantly, while the compositional approach is also able to maintain the grammar accuracy.
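To make the statistical step concrete, the following is a minimal sketch of one simple association measure, pointwise mutual information, applied to an n-gram candidate. The counts, the corpus size, and the example n-gram are hypothetical, and the measures actually used over the BNC and the Web may differ; this is an illustration of the idea, not the implementation.

```python
from math import log2

def pmi(ngram_count, word_counts, total_tokens):
    """Pointwise mutual information of an n-gram: one simple association
    measure for separating plausible MWE candidates from noisy
    low-parsability n-grams.

    ngram_count  -- corpus frequency of the whole n-gram
    word_counts  -- corpus frequencies of its component words
    total_tokens -- corpus size in tokens
    """
    p_ngram = ngram_count / total_tokens
    p_indep = 1.0
    for count in word_counts:
        p_indep *= count / total_tokens     # independence assumption
    return log2(p_ngram / p_indep)

# Hypothetical counts for the candidate "by and large" in a 100M-token corpus:
score = pmi(ngram_count=1200,
            word_counts=[510_000, 2_600_000, 201_000],
            total_tokens=100_000_000)
print(f"PMI = {score:.2f}")   # a high score suggests a non-compositional candidate
```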

Deep Lexical Acquisition

Our deepest fear is not that we are inadequate. Our deepest fear is that we are powerful beyond measure. It is our Light, not our Darkness, that most frightens us.

— Marianne Williamson, “A Return to Love” (1992)

In the previous two chapters, we have discussed techniques for improving lexicon coverage and robustness. The ultimate goal is to help boost deep grammar coverage and robustness in deep processing. The evaluation presented in the previous chapters shows that the techniques developed so far deliver promising performance on the module level. However, it is still unclear how lexicon precision and recall may be related to grammar performance. In this chapter, a series of experiments are reported in an attempt to unveil the correlation between precision and recall, on the one hand, and deep grammar performance, on the other.

5.1 The “Goodness” of Deep Lexical Acquisition

As mentioned earlier in Section 3.3.1, there are two types of lexical errors: i) a lexical entry is missing from the lexicon; ii) an erroneous entry enters the lexicon. The first type of error normally leads to undergeneration of the grammar, while the latter usually causes overgeneration.


Without considering the interaction with grammar performance, the quality of the lexicon can be measured with standard precision and recall metrics. The precision is usually defined as the proportion of correct entries among all the entries in the lexicon. The recall is defined as the proportion of correct entries in the lexicon among all the correct entries for the language (which are not necessarily in the lexicon).

However, in evaluating a lexicon, such definitions are difficult to follow. On the one hand, the “correctness” of the existing lexical entries is difficult to judge, and it depends heavily on the grammatical analysis. Therefore the precision of the lexicon is, at best, a subjective measurement. On the other hand, the measurement of recall is even more difficult, as there is no direct way to properly estimate how many lexical entries are missing from the lexicon.

Therefore, precision and recall are usually defined relative to a gold standard lexicon (denoted by L). The gold standard lexicon is usually built manually, and therefore contains very few erroneous entries. Also, it must cover most of the lexical usage for a specific corpus, so the recall of the lexicon for this specific corpus is also high. A given lexicon L′, viewed as a set of lexical entries, can be partitioned into two subsets:

L′ = G ∪ E   (G ∩ E = ∅)    (5.1)

where G is the set of entries that also belong to the gold standard lexicon L:

G = {g ∈ L′ | g ∈ L}    (5.2)

and E is the set of entries that do not occur in L, and are considered to be errors:

E = {e ∈ L′ | e ∉ L}    (5.3)

The precision P and recall R of the lexicon L′ relative to the gold standard lexicon L are defined as:

P = |G| / |L′|    (5.4)

R = |G| / |L|    (5.5)
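As an illustration, the following is a minimal sketch of how the relative precision and recall of Equations (5.1)–(5.5) can be computed, assuming both lexicons are represented as sets of hashable entries. The (lemma, lexical-type) representation and the toy entries below are illustrative assumptions, not the actual lexicon format.

```python
def relative_precision_recall(acquired, gold):
    """Relative precision and recall of an acquired lexicon L' against a
    gold standard lexicon L (Equations 5.1-5.5). Both lexicons are sets
    of hashable lexical entries, e.g. (lemma, lexical-type) pairs."""
    G = acquired & gold               # entries confirmed by the gold standard
    E = acquired - gold               # entries counted as errors
    precision = len(G) / len(acquired) if acquired else 0.0
    recall = len(G) / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical toy lexicons with made-up lexical type names:
gold = {("bank", "n_count"), ("walk", "v_intrans"), ("walk", "n_count")}
acquired = {("bank", "n_count"), ("walk", "v_intrans"), ("bank", "v_intrans")}
p, r = relative_precision_recall(acquired, gold)
print(f"P = {p:.2f}, R = {r:.2f}")    # P = 0.67, R = 0.67
```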

The relative precision and recall more or less reflect the similarity of the lexicon to the gold standard. However, their limitations are obvious.

First, the availability of the so-called gold standard lexicon is questionable. The starting motivation of deep lexical acquisition is to help build the lexicon. The imperfection of the “gold” lexicon renders the similarity-based evaluation less reliable: any entry in the “gold” lexicon is taken for granted as correct, and any entry that is not in the “gold” lexicon is considered to be an erroneous entry.

One way to balance this bias is to restrict the evaluation to a sub-language, bounded by a corpus. A high quality “gold” sub-lexicon can be extracted from an existing larger lexicon (a sketch of this extraction step follows the list below). This sub-lexicon satisfies the following two conditions:

• All the lexical usage in the corpus is included in the sub-lexicon;

• All the entries in the sub-lexicon correspond to at least one usage (instance) in the corpus.
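The sketch below illustrates this extraction step. It assumes that the set of lexical entries actually used in the corpus is available (e.g., read off a lexically annotated treebank); this assumption, and the function name, are for illustration only rather than a description of the actual implementation.

```python
def extract_sub_lexicon(large_lexicon, corpus_entries):
    """Extract a 'gold' sub-lexicon bounded by a corpus.

    large_lexicon  -- the existing hand-built lexicon, as a set of entries
    corpus_entries -- the set of lexical entries actually used in the
                      corpus (e.g., from a lexically annotated treebank)

    The second condition is enforced by keeping only attested entries.
    Any corpus usage the large lexicon lacks violates the first condition
    and is returned separately so it can be inspected or added by hand.
    """
    sub_lexicon = large_lexicon & corpus_entries   # attested entries only
    uncovered = corpus_entries - large_lexicon     # usages with no entry
    return sub_lexicon, uncovered
```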

If the deep lexical acquisition models are built and evaluated on this corpus using the precision and recall relative to the “gold” sub-lexicon, and if the corpus is well balanced and representative of the entire language, the evaluation results can be indicative of the true precision and recall of the model.

Even if the above issue can be remedied, the following question is still more crucial. Even supposing that the relative precision and recall of the lexicon reflect its true precision and recall, it remains largely unclear how these figures are related to grammar performance. A simple conjecture would be that erroneous lexical entries lead to overgeneration of the grammar, thereby making the parser output less precise. On the other hand, missing lexical entries lead to undergeneration of the grammar, hence hurting the coverage of the parser.

Therefore, the precision of the lexicon should correspond to the accuracy of the grammar (in parsing tasks), while lexicon recall is related to the coverage. Unfortunately, in practice the interaction between the lexicon and the grammar is much more subtle. For example, besides the overgenerating effect, erroneous lexical entries might also cause parser failure (e.g., triggering recursive unary rules, exhausting parser memory through higher lexical ambiguity, etc.). On the other hand, the undergenerating effect of different missing lexical entries varies, depending on their frequency in the text corpus. A more frequent missing entry has a much larger effect on grammar coverage. This correlation is not directly reflected by the recall of the lexicon, either.
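To illustrate why plain recall misses this effect, the following hedged sketch computes a token-frequency-weighted recall, in which a frequently needed missing entry costs proportionally more than a rare one. This metric is given purely as an illustration of the point; it is not a measure proposed or used in this dissertation.

```python
from collections import Counter

def frequency_weighted_recall(lexicon, required_entries):
    """Recall weighted by how often each entry is actually needed.

    lexicon          -- the acquired lexicon, as a set of entries
    required_entries -- one lexical entry per token occurrence in the
                        corpus (a hypothetical representation), so a
                        frequently needed missing entry lowers the score
                        proportionally more than a rarely needed one.
    """
    counts = Counter(required_entries)
    total = sum(counts.values())
    covered = sum(c for entry, c in counts.items() if entry in lexicon)
    return covered / total if total else 0.0
```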

In summary, the precision and recall based evaluations of the lexicon are neither reliable by themselves, nor indicative of grammar performance.

Then what about the token accuracy based evaluation used in Chapter 3? By working with a lexically annotated corpus, the token accuracy takes the lexical entry frequency into account. Also, by assuming that frequent words are already in the lexicon, the token accuracy is only measured for infrequent words, so the risk of running into high lexical ambiguity and abnormal grammar behavior is reduced. However, there is still no direct correlation to grammar performance.
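A minimal sketch of such a token-accuracy style evaluation is given below. It assumes a lexically annotated corpus, a model that predicts one lexical type per word, and a frequency threshold that separates “known” frequent words from the infrequent words being evaluated; all names and the threshold value are illustrative assumptions rather than the setup of Chapter 3.

```python
def token_accuracy(tokens, predict_type, word_freq, max_freq=5):
    """Token accuracy measured over infrequent words only.

    tokens       -- iterable of (word, gold_lexical_type) pairs from a
                    lexically annotated corpus
    predict_type -- function word -> predicted lexical type (the DLA model)
    word_freq    -- mapping from word to its corpus frequency
    max_freq     -- words more frequent than this are assumed to be in
                    the lexicon already and are skipped
    """
    correct = total = 0
    for word, gold_type in tokens:
        if word_freq.get(word, 0) > max_freq:
            continue                      # frequent word: assumed known
        total += 1
        correct += (predict_type(word) == gold_type)
    return correct / total if total else 0.0
```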

Despite all the difficulties, a good evaluation metric is crucial for the development of deep lexical acquisition. Different measurements will eventually lead to different designs. Ultimately, we hope to maximally improve the overall average performance of the grammar. Therefore, a thorough investigation of how lexicon performance correlates with grammar performance is needed.
