
The idea of latent structure is also closely related to other ideas building on the perceptron, such as guided learning (Shen et al., 2007; Shen and Joshi, 2008; Goldberg and Elhadad, 2010) and training with exploration (Goldberg and Nivre, 2012, 2013). In guided learning, a classifier is used to select from a large set of decisions (e.g., introducing an arc in a dependency tree). Among the many options, several may be correct. As long as the classifier selects correct decisions during training, no updates are made; when the classifier makes a mistake, the current weights are used to decide on a latent gold decision to use for the update. Training with exploration is essentially the same idea, with the addition that mistakes are sometimes accepted in order to explore erroneous parts of the search space. While these techniques have primarily (at least under this terminology, see the next paragraph) been used in the parsing community, guided learning has also been applied to the task of coreference resolution (Stoyanov and Eisner, 2012).

Training with exploration can in turn be regarded as an instance of imitation learning.

Here, the idea is that a search space is explored as learning progresses, but an expert policy can be queried for the right action at any point. Typically the search space is explored in a stochastic manner, where incorrect actions are sometimes taken. The expert policy then provides the best action at any given point. A classifier is initially trained to mimic the expert policy, but gradually the classifier is left to traverse the search space on its own (as in training with exploration), and both updates and instance selection are driven by gradually moving from the expert policy to the learned policy, guided by the reward function. Two well-known imitation learning frameworks are SEARN (Daumé III et al., 2009) and DAGGER (Ross et al., 2011). A similar approach is that of reinforcement learning (Sutton and Barto, 1998; Maes et al., 2009). The main difference between imitation and reinforcement learning lies in the requirements placed on the expert policy. In imitation learning, the assumption is that the expert knows the search space well enough to guide the search in the best possible direction. In reinforcement learning this assumption is weakened, and the expert is replaced with a reward function that gives a measure of the goodness of a given state. The reward function is again used to guide the exploration of the search space, with the caveat that it may not properly approximate an expert in a given state.
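To make the policy-mixing idea concrete, the following is a minimal DAGGER-style sketch in Python. The `expert_policy`, `train_classifier`, `beta_schedule`, and state interfaces are hypothetical placeholders for illustration and do not correspond to the API of any of the cited systems.

```python
import random

def dagger_style_training(initial_state, expert_policy, train_classifier,
                          beta_schedule, n_iterations):
    """Schematic DAGGER-style loop: states are visited by a mixture of the
    expert policy and the learned policy, while the expert always supplies
    the training label for each visited state."""
    dataset = []            # aggregated (features, expert_action) pairs
    learned_policy = None
    for it in range(n_iterations):
        beta = beta_schedule(it)          # probability of following the expert
        state = initial_state()
        while not state.is_final():
            expert_action = expert_policy(state)
            dataset.append((state.features(), expert_action))
            # Roll in with a stochastic mixture of expert and learned policy,
            # so that erroneous regions of the search space are also explored.
            if learned_policy is None or random.random() < beta:
                action = expert_action
            else:
                action = learned_policy(state)
            state = state.apply(action)
        learned_policy = train_classifier(dataset)
    return learned_policy
```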

vehicle for learning that we will rely on in subsequent chapters. In particular, we have described five different types of update strategies for the structured perceptron with approximate search: early, max-violation, and latest updates, as well as LaSO and DLaSO.
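As a rough illustration of how the beam-confined strategies differ, the sketch below picks the update point from the per-step scores of the (latent) gold prefix and the best beam item. It is a simplified, hypothetical sketch: early update is usually defined as the point where the gold prefix falls out of the beam (here approximated by the first step at which it scores below the best beam item), and LaSO and DLaSO are not selectors of a single point at all, since they update and resume search from the gold prefix and thereby use all of the training data.

```python
def select_violation_point(gold_prefix_scores, beam_best_scores, strategy):
    """Pick the time step at which to stop and compute a perceptron update,
    given the model score of the gold prefix and of the best beam item at
    every step.  A 'violation' is a step where the gold scores below the beam."""
    violations = [i for i, (g, b) in
                  enumerate(zip(gold_prefix_scores, beam_best_scores)) if g < b]
    if not violations:
        return None                       # no violation, no update needed
    if strategy == "early":
        # First step at which the gold prefix scores below the best beam item.
        return violations[0]
    if strategy == "max-violation":
        # Step with the largest margin by which the beam beats the gold prefix.
        return max(violations,
                   key=lambda i: beam_best_scores[i] - gold_prefix_scores[i])
    if strategy == "latest":
        # Last step that is still a violation.
        return violations[-1]
    raise ValueError(strategy)
```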

In addition to the update strategy, instantiation of the framework relies on the definition of a number of task-specific functions and data structures. First, the user needs to formulate the search problem and the relationship between the output space Y and the prediction space Z. This involves defining a sequential search problem, where a well-formed output structure y ∈ Y can be created by a sequence of steps z ∈ Z. Additionally, the user needs to provide the following functions and data structures:

• The mapping π : Z → Y that maps a prediction structure z to the output space Y.

• A feature extraction function φ : X × (S × D) → R^m, which extracts appropriate features in order to learn and predict the correct decisions. It can exploit all available partial structure in order to create a high-dimensional representation of the current state.

• The function PERMISSIBLE, which returns the set of all permissible decisions in a given state.

• The function ORACLE, which returns the set of correct decisions in a given state.

• The function LOSS, which computes a structured loss between a correct sequence and a prediction. It may be arbitrarily complex, but must also be applicable to a partial sequence z_{1:i}, with or without the aid of the mapping π.

• A state representation that includes the basic requirements stated earlier, i.e., a back-pointer to the previous state, the decision taken in the previous state, and the score of the state.

The state representation then needs to be extended in a task-specific manner to accommodate the other user-defined functions, including the feature extraction function φ as well as PERMISSIBLE and ORACLE.

In addition, two hyperparameters are left open to the user: the number of training epochs for the perceptron and the beam width used for beam search. A minimal sketch of how these components might fit together as an interface is given below.
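The following Python sketch illustrates one way the user-supplied pieces could be organized. The class and method names mirror the functions listed above, but the concrete API is an assumption made for illustration, not the dissertation's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

@dataclass
class State:
    """Minimal state representation required by the framework: a back-pointer
    to the previous state, the decision taken to reach this state, and the
    accumulated model score.  Task-specific instantiations extend this with
    whatever partial structure phi, PERMISSIBLE, and ORACLE need."""
    previous: Optional["State"]
    decision: Optional[object]
    score: float = 0.0

class SearchProblem:
    """Hypothetical interface a user would implement to instantiate the
    framework for a particular task."""

    def pi(self, z: Sequence[object]):
        """Map a prediction sequence z (in Z) to an output structure y (in Y)."""
        raise NotImplementedError

    def phi(self, x, state: State, decision) -> dict:
        """Extract a (sparse) feature vector for taking `decision` in `state`
        on input x, i.e. phi : X x (S x D) -> R^m."""
        raise NotImplementedError

    def permissible(self, state: State) -> List[object]:
        """All decisions that are permissible in the given state."""
        raise NotImplementedError

    def oracle(self, state: State) -> List[object]:
        """The subset of permissible decisions that are correct in this state."""
        raise NotImplementedError

    def loss(self, z_prefix: Sequence[object], gold) -> float:
        """Structured loss, also defined on a partial sequence z_{1:i}."""
        raise NotImplementedError
```

The two remaining hyperparameters (number of epochs and beam width) would then be passed to the learner when training begins.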

In the remainder of this dissertation we will carry out a careful evaluation of learning in this framework using real NLP problems. We will define the necessary functions and vary the set of hyperparameters and update methods: Chapter 3 will carry out a comprehensive analysis of the different update methods for coreference resolution with latent structure. In Chapter 4 we will introduce latent structure into transition-based dependency parsing, while keeping the update strategy fixed. Finally, in Chapter 5 we will drop the latent structure and instead focus on the update methods when sequences become very large.

Chapter 3

Coreference Resolution

3.1 Introduction

In recent years much work on coreference resolution has been devoted to increasing the expressivity of the classical mention-pair model, in which each coreference classification decision is limited to information about the two mentions that make up a pair. This shortcoming has been addressed by entity-mention models, which relate a candidate mention to a (potentially partial) cluster of mentions predicted to be coreferent so far (Ng, 2010). As discussed in the introduction, these two approaches sit on opposing sides with regard to inference: in the mention-pair model, exact search can be carried out to yield the optimal output structure, but at the expense of richer features that could consider a greater structural context. The entity-mention model, on the other hand, extends the scope of features to consider multiple mentions at the same time, although this requires abandoning exact search.

Nevertheless, models that are based on pairs of mentions have empirically proven very competitive. This has, at least in part, been driven by the introduction of latent structure to coreference: specifically, for a given mention in a document, which mention, or antecedent, to its left is the best match in terms of learning and generalizability.1 The use of latent antecedents, i.e., letting the machine-learning model derive the antecedent dynamically as part of training, has received considerable attention since it was used by the winning system (Fernandes et al., 2012) in the CoNLL 2012 Shared Task (Pradhan et al., 2012) on multilingual coreference resolution.

The system of Fernandes et al. (2012) basically implemented the model that was discussed in Chapter 1: Model the mentions of a document as a tree, let the antecedents of each mention be latent, and try to predict a tree as a structure (as opposed to learning a classifier that classifies individual pairs of mentions). There was, however, one major caveat to their implementation: they relied on graph algorithms to do inference over the possible trees. Specifically, they used the Chu-Liu-Edmonds (CLE) algorithm (Chu and Liu, 1965; Edmonds, 1967) to find the maximum-scoring directed tree, assuming scores over all potential arcs are provided. The very nature of this inference algorithm requires an arc-factored model that is difficult to extend to richer features. However, under the constrained setting discussed in Chapter 1, where the tree is only right-branching, the inference problem can equally well be solved by a left-to-right greedy pass (also discussed in Chapter 1). While these two approaches have the same asymptotic time complexity (O(n²) in the number of mentions, assuming Tarjan's (1977) improved version of CLE), an inference procedure that makes a left-to-right pass is much more suitable for extension with beam search.

1 We note in passing that, following standard practice in the coreference resolution literature, we overload the term antecedent to mean any mention to the left with which a mention is coreferent.
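As an illustration of the left-to-right alternative, the sketch below performs the greedy pass for an arc-factored model. The `score(i, j)` function is an assumed placeholder for the learned arc scores (with j = 0 denoting an artificial root that starts a new entity); the sketch is not the actual system code.

```python
def greedy_left_to_right(n_mentions, score):
    """Greedy left-to-right decoding of a right-branching latent tree:
    each mention attaches either to the highest-scoring mention on its
    left or to the artificial root (index 0), i.e. it starts a new entity.
    With an arc-factored score this takes O(n^2) time overall."""
    antecedents = {}
    for i in range(1, n_mentions + 1):
        # Candidates are the root (0) and all mentions to the left of i.
        antecedents[i] = max(range(0, i), key=lambda j: score(i, j))
    return antecedents
```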

In this chapter, we follow Fernandes et al. (2012) and develop a coreference model building on latent antecedents, or rather, latent trees, but replace the inference algorithm with such a left-to-right pass. Our starting point will be an arc-factored model in which the search problem can be solved exactly. We will then extend this model to include features that span more than two mentions. Since exact inference now becomes intractable, we will apply the framework from the previous chapter, using beam search during the left-to-right pass, and couple this with the update methods we have discussed. From the perspective of Chapter 2, this will enable us to carry out a careful analysis of the update methods. We will compare the setting where exact search is available but features are weak (i.e., an arc-factored model) with approximate search over rich features. The evaluation shows that update methods that do not commit to using all available training data (early update, max-violation, and latest) underperform compared to LaSO and DLaSO, sometimes so severely that the simpler arc-factored model cannot be beaten.

With regard to the area of coreference resolution research, the tree formulation can be considered an entity-mention model, where single mentions are compared to (partially) completed entities. However, unlike most entity-mention models, the proposed one includes the mention-pair model as a special case, and moving between the two settings can be done seamlessly, without requiring a change in machine-learning method or features.

In that regard, we demonstrate that our entity-mention model, when trained using appropriate update methods, outperforms our strong mention-pair baseline.2

2 The numbers from the experimental section of this chapter differ slightly from those presented in the paper on this topic (Björkelund and Kuhn, 2014), as we modified the system to properly implement the