3.3 Experimental Setup

3.3.3 Instantiation

The implementation of the coreference system employs the framework from Chapter 2.

Note that in addition to beam search and its array of update methods for approximate search, we can also carry out exact decoding, assuming the feature set is arc-factored, i.e., that features only involve pairs of mentions connected by an arc. Some of the required hyperparameters and functions have already been discussed above, e.g., PERMISSIBLE and ORACLE. The beam size, update methods, and number of epochs are varied during the experiments, but 20 and 25 are used as default values for beam size and number of epochs, respectively.4
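Because an arc-factored model scores each arc independently, exact decoding reduces to letting every mention pick its highest-scoring antecedent; the resulting arcs always form a tree rooted at $m_0$. A minimal sketch in Python, where `score(j, i)` is a hypothetical stand-in for the learned arc-scoring function:

```python
def exact_decode(mentions, score):
    """Exact decoding for an arc-factored antecedent model.

    Each mention i picks its best antecedent among the root (index 0)
    and all preceding mentions; because these choices are independent,
    the resulting arcs always form a tree rooted at the artificial
    root mention m0, and no search is required.

    `score(j, i)` is a placeholder for the learned scoring function
    over an arc from antecedent j to mention i.
    """
    heads = [None]  # m0, the root, has no antecedent
    for i in range(1, len(mentions)):
        # candidates: the root (0) and every mention preceding i
        best = max(range(i), key=lambda j: score(j, i))
        heads.append(best)
    return heads
```

This per-mention independence is exactly what is lost once features span more than one arc, which is why the non-local models discussed below require beam search.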

Loss Function. The loss function we use is shown in Equation 3.1. It is composed of two parts, mention mistakes and root mistakes. When evaluating a prediction $\hat{z}$, every mention that has received an erroneous incoming arc is considered a mistake. If the correct antecedent is another mention, it is classified as a MENTIONMISTAKE. If the correct antecedent is the root mention $m_0$, it is classified as a ROOTMISTAKE. The loss function in Equation 3.1 computes a linear combination of these mistakes, where root mistakes are weighted by a factor of 1.5, following Fernandes et al. (2012). We did not attempt to tune this factor and it is kept constant throughout.

\[
\textsc{Loss}(\tilde{z},\hat{z}) = \textsc{MentionMistakes}(\tilde{z},\hat{z}) + 1.5 \cdot \textsc{RootMistakes}(\tilde{z},\hat{z}) \qquad (3.1)
\]

4 As stated in Chapter 1, the implementation used for the experiments in this chapter is available on the author’s website.
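For concreteness, the following sketch computes Equation 3.1 over a hypothetical representation in which a tree is a list mapping each mention to its antecedent index, with index 0 denoting the root mention $m_0$:

```python
def loss(gold_heads, pred_heads, root_weight=1.5):
    """Weighted loss from Equation 3.1.

    `gold_heads` and `pred_heads` map each mention index to its
    antecedent index; index 0 denotes the artificial root mention m0.
    An erroneous arc counts as a root mistake if the *correct*
    antecedent is the root, and as a mention mistake otherwise.
    """
    mention_mistakes = root_mistakes = 0
    for i in range(1, len(gold_heads)):
        if pred_heads[i] != gold_heads[i]:
            if gold_heads[i] == 0:
                root_mistakes += 1
            else:
                mention_mistakes += 1
    return mention_mistakes + root_weight * root_mistakes
```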

Feature Extraction. The feature sets are language-specific and manually tuned. We abstain from a lengthy discussion of every possible template and instead briefly summarize the categories of features we use and their origin.

Traditional machine-learning approaches to coreference resolution have relied on small sets of categorical features that are typically computed by more or less complex rules over the input (Soon et al., 2001; Ng and Cardie, 2002b). This includes, e.g., binary indicators as to whether a mention is a pronoun or not; string matching under certain relaxations, such as removing determiners and quotation marks; or rules that attempt to extract aliases (for instance, International Business Machines can be referred to as IBM). Similarly, gender and number can be assigned by tables of pronouns, which we, for English, extend to also cover common nouns by looking them up in automatically mined frequency statistics (Bergsma and Lin, 2006). Moreover, we also include binary indicators for whether a mention appears within quotes, or whether two mentions along an arc are nested. Finally, every mention is assigned a particular type: pronoun, common noun phrase, or name. Pronouns and names are detected by part-of-speech tags, and all other mentions are regarded as common noun phrases.
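The following sketch illustrates a few such categorical templates; the determiner list, tag tests, and helper names are simplified assumptions for illustration rather than the tuned templates:

```python
DETERMINERS = {"the", "a", "an", "this", "that", "these", "those"}
QUOTES = {'"', "'", "``", "''"}

def relaxed_match(tokens_a, tokens_b):
    """String match after stripping determiners and quotation marks."""
    def strip(tokens):
        return [t.lower() for t in tokens
                if t.lower() not in DETERMINERS and t not in QUOTES]
    return strip(tokens_a) == strip(tokens_b)

def mention_type(head_pos):
    """Assign a coarse mention type from the head's part-of-speech tag."""
    if head_pos.startswith("PRP"):   # PRP, PRP$ in the Penn tagset
        return "pronoun"
    if head_pos.startswith("NNP"):   # NNP, NNPS
        return "name"
    return "common"                  # all other mentions
```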

A second set of features consists of lexicalized features that extract surface forms of the words in, and possibly surrounding, a mention. This can be either the full surface string of the mention, or it can target specific words, such as the surface form of the head word, the first or last word of a mention, or even a word preceding or following the mention.

While lexical features had been used before the CoNLL 2011 and 2012 Shared Tasks (Rahman and Ng, 2009), these shared tasks and the much larger OntoNotes data set increased the usage of lexicalized features for coreference resolution (Björkelund and Nugues, 2011; Durrett and Klein, 2013).
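A sketch of what such lexicalized templates might look like; the template inventory and the helper signature are illustrative, not the tuned per-language sets:

```python
def lexical_features(tokens, head_index, doc_tokens, start, end):
    """Surface-form templates for a mention spanning doc_tokens[start:end].

    `tokens` are the mention's own tokens and `head_index` points to the
    head word within the mention; the templates shown are a sample, not
    the full tuned inventory.
    """
    feats = {
        "head": tokens[head_index],
        "first": tokens[0],
        "last": tokens[-1],
        "full": "_".join(tokens),
    }
    # context words immediately outside the mention, if any
    if start > 0:
        feats["prev"] = doc_tokens[start - 1]
    if end < len(doc_tokens):
        feats["next"] = doc_tokens[end]
    return feats
```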

We also use grammatical features drawn from the parse trees in the data (Yang et al., 2006; Rahman and Ng, 2011a). This includes encoding the labels of the non-terminal corresponding to a mention or its parent non-terminal, but also the subcategorization of the parent constituent, which provides information on how the mention is situated in the parse tree. Additionally, the path through the parse tree between a pair of mentions is also used (Gildea and Jurafsky, 2002). These types of features may capture typical grammatical constraints on coreference, such as Chomsky’s (1981) binding principles. Feature templates that extract paths in the parse tree remove the work of manually constructing heuristic rules corresponding to these principles and may also capture further generalizations. In that vein, we do not only extract paths when both mentions occur in the same sentence. When two mentions are in separate sentences, the path traverses the parse trees through an auxiliary non-terminal that sits above each sentence’s parse tree. These features give a rough indication of how deeply embedded each mention is in its corresponding sentence.
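A possible implementation of the path feature, under the assumption that parse-tree nodes expose a label and a parent pointer and that the trees of separate sentences are joined under a shared auxiliary node:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)  # identity-based equality, so node lookup is by object
class Node:
    """Parse-tree node; `parent` is None at the auxiliary top node."""
    label: str
    parent: Optional["Node"] = None

def tree_path(node_a, node_b):
    """Non-terminal path between two mentions' nodes via their lowest
    common ancestor; cross-sentence paths pass through the auxiliary
    node that joins the sentence trees."""
    def ancestors(node):
        chain = []
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain

    up, down = ancestors(node_a), ancestors(node_b)
    common = next(n for n in up if n in down)  # lowest common ancestor
    upward = [n.label for n in up[:up.index(common) + 1]]
    downward = [n.label for n in reversed(down[:down.index(common)])]
    path = "^".join(upward)  # upward steps from node_a to the ancestor
    if downward:
        path += "v" + "v".join(downward)  # downward steps to node_b
    return path
```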

Numerical features of distance are also included. The numerical values are always discretized into buckets. We use two flavors of distance between two mentions: the distance in terms of mentions, and the distance in terms of sentences (Soon et al., 2001).
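A sketch of the bucketing, with illustrative bucket boundaries (the actual boundaries are a tuning choice not specified here):

```python
import bisect

# illustrative, roughly exponential bucket boundaries
BUCKETS = [1, 2, 3, 4, 8, 16, 32, 64]

def bucketize(value):
    """Map a raw numerical value to a coarse, discretized bucket index."""
    return bisect.bisect_right(BUCKETS, value)

def distance_features(mention_idx_a, mention_idx_b, sent_idx_a, sent_idx_b):
    """Two flavors of distance between two mentions, both bucketed."""
    return {
        "mention_dist": bucketize(abs(mention_idx_b - mention_idx_a)),
        "sentence_dist": bucketize(abs(sent_idx_b - sent_idx_a)),
    }
```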

Additionally, the particular composition of the data sets motivates two more features.

As OntoNotes is drawn from multiple genres, the peculiarities of each genre may cause certain typical coreference relations to be more or less frequent. We therefore include the genre as a feature. Moreover, some of the genres include transcribed speech from television, where speaker information is provided. For this reason we also use a binary indicator feature that flags whether two mentions were said by the same speaker. This is critical to get the usage of first and second person pronouns right in spoken discourse.
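A minimal sketch of these two document-level features; the parameter names are assumptions for illustration:

```python
def document_features(genre, speaker_a, speaker_b):
    """Genre indicator plus a same-speaker flag for transcribed speech.

    `speaker_a`/`speaker_b` are the speaker annotations of the two
    mentions' sentences; None stands for genres without speaker data.
    """
    feats = {"genre": genre}
    if speaker_a is not None and speaker_b is not None:
        feats["same_speaker"] = (speaker_a == speaker_b)
    return feats
```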

Finally, as we extended the arc-factored model to include features that can span more than two mentions, we also experimented with non-local features that are drawn from a greater context in the coreference tree. Of course, all the features described above are also applicable in this context. For instance, a lexicalized template that extracts the head word of the antecedent’s antecedent could be used. During development we also experimented with non-local features drawn from previous work on entity-mention models (Luo et al., 2004; Rahman and Ng, 2009); however, these did not improve performance in preliminary experiments. The one exception is the size of a cluster under consideration (Culotta et al., 2007). As with the distance features, these numerical values are discretized into buckets.

Similarly, we also use a feature that indicates the cluster start distance, which is the distance in mentions from the beginning of the document to where a potential entity is first mentioned. Again, these numerical values are bucketed.
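A sketch of these two bucketed cluster features, reusing the hypothetical `bucketize` helper from the distance sketch above:

```python
def cluster_features(cluster_mention_indices):
    """Non-local numerical features of a partial cluster, bucketed.

    `cluster_mention_indices` lists the document-level positions of the
    mentions already assigned to the entity under consideration;
    `bucketize` is the helper from the distance features above.
    """
    return {
        "cluster_size": bucketize(len(cluster_mention_indices)),
        # distance in mentions from the document start to the first mention
        "cluster_start": bucketize(min(cluster_mention_indices)),
    }
```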

More generally, theories of discourse and information structure have suggested that certain patterns occur in how an entity is introduced and subsequently referred to throughout a discourse (Prince, 1981; Grosz et al., 1983, 1995). For instance, a new entity might be introduced by a long common noun phrase or a name in object position. Subsequent references to the same entity may be shorter (common) noun phrases or pronouns occurring syntactically in subject position. We therefore compute the shape of a cluster by encoding its linear “shape” in terms of mention types as a sequence. For instance, the clusters representing Gary Wilber and Drug Emporium Inc. from the example in Chapter 1 would be represented as NPN and NCCC, respectively, where N, P, and C denote names, pronouns, and common noun phrases. Moreover, inspired by the Entity Grid (Barzilay and Lapata, 2008), which models the evolution of entities with respect to whether they are in subject or object position, we aim to capture the cluster syntactic context. We approximate the grammatical function of a mention by the path in the parse tree from the mention to the root of its parse tree, similar to the syntactic features mentioned above. The partial paths of mentions in a cluster inform the model about the local syntactic context of a potential cluster.
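The shape encoding itself is straightforward; a sketch, assuming mention types as produced by the categorical features above:

```python
TYPE_LETTERS = {"name": "N", "pronoun": "P", "common": "C"}

def cluster_shape(mention_types):
    """Encode a cluster's linear shape as a string over {N, P, C}.

    `mention_types` is the left-to-right sequence of the cluster's
    mention types; e.g. a name followed by a pronoun and another name
    yields "NPN".
    """
    return "".join(TYPE_LETTERS[t] for t in mention_types)
```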

Armed with the above selection of feature templates, we tuned the feature sets for each language. First, as a baseline, we started from the feature set of a previous coreference system we had developed (Björkelund and Farkas, 2012), which roughly encompasses features of all categories discussed above except the non-local features. We then optimized the feature sets for an arc-factored model by greedy forward/backward feature selection over a pool of the templates discussed above, as well as conjunctions among them. In order not to taint the development set, this was performed over the training set of each language, split into two parts, where 75% was used for training and 25% for testing. Feature templates were incrementally added or removed in order to optimize the CoNLL average. The idea is that this local feature set is the strongest arc-factored model that allows for exact search and can then serve as a baseline when comparing with models that use non-local features and beam search. After freezing the local feature set, the feature selection procedure was repeated to find the optimal non-local feature set.
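A sketch of the selection loop; `evaluate` stands in for the expensive step of training on the 75% split and scoring the CoNLL average on the remaining 25%:

```python
def greedy_selection(pool, base_set, evaluate):
    """Greedy forward/backward feature-template selection.

    `pool` and `base_set` are sets of template identifiers;
    `evaluate(feature_set)` is a placeholder for training a model with
    that feature set and returning the held-out CoNLL average.
    """
    current = set(base_set)
    best_score = evaluate(current)
    improved = True
    while improved:
        improved = False
        # forward pass: try adding each unused template from the pool
        for t in pool - current:
            score = evaluate(current | {t})
            if score > best_score:
                current, best_score, improved = current | {t}, score, True
        # backward pass: try removing each template currently in the set
        for t in set(current):
            score = evaluate(current - {t})
            if score > best_score:
                current, best_score, improved = current - {t}, score, True
    return current, best_score
```

Greedy selection of this kind is not guaranteed to find the globally best subset, but it keeps the number of expensive train-and-evaluate runs manageable.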