• Keine Ergebnisse gefunden

7.4 Memory-based Graph Manipulation Models

7.4.2 Sequence-to-Graph Networks

As explained earlier, the key idea of our sequence-to-graph network is that it learns to map from an input sequence to a labeled graph via the structured memory that we just introduced. More formally, let x be the input sequence that should be summarized. We then formulate CM-MDS as the structured prediction problem of finding

G = arg max_{G′ ∈ 𝒒(x)} score(G′)    (7.2)

where 𝒒(x) denotes the (infinite) set of possible valid graphs that can be constructed from x and score(·) is a function determining how well each of them summarizes the input.

To make that search tractable, we factor the scoring function into several probability distributions conditioned on the input that are learned by the sequence-to-graph network.

In the following section, we describe each of its components in detail. Figure 7.2 shows an unrolled version of the model, illustrating how the encoder, memory and decoder interact.

7.4.2.1 Sequence Encoder

The task of the sequence encoder is to compose a hidden representation of the input sequence and learn to recognize mentions of concepts and relations in it. Each token x_1, …, x_ℓ in x is represented by a word embedding. These representations are then recurrently processed by a GRU (Cho et al., 2014). The input representation at timestep t is given as

h^F_t = GRU_F(h^F_{t-1}, E[x_t])    (7.3)

where E is the word embedding matrix. In addition to GRU_F, we use a second cell GRU_B to process the sequence backwards and concatenate the hidden states to h_t = h^F_t || h^B_t.
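To make the encoder concrete, the following is a minimal sketch of such a bidirectional GRU encoder, assuming a PyTorch implementation; the framework choice, layer sizes and class name are illustrative and not prescribed by the thesis.

```python
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    """Bidirectional GRU encoder: h_t = h^F_t || h^B_t (Eq. 7.3)."""

    def __init__(self, vocab_size, emb_dim=100, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)   # embedding matrix E
        # bidirectional=True runs GRU_F and GRU_B and concatenates their states
        self.gru = nn.GRU(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> embedded inputs E[x_t]
        emb = self.embedding(token_ids)
        # states: (batch, seq_len, 2 * hidden_dim), forward and backward
        # hidden states already concatenated along the last dimension
        states, _ = self.gru(emb)
        return states

# usage: encode a toy batch of two three-token sequences
encoder = SequenceEncoder(vocab_size=1000)
h = encoder(torch.randint(0, 1000, (2, 3)))   # shape (2, 3, 512)
```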

7.4.2.2 Memory Module

Similar to the sequence encoder, the memory module is also a recurrent component. At each timestep t, it receives the current encoder state h_t as input and updates the memory state. If the token x_t is part of a concept or relation mention, its task is to compare it with the current memory state to resolve coreferences, create a new node or edge in the graph if necessary and forget parts of the graph if the memory reaches its capacity.

Figure 7.2: Sequence-to-graph network unrolled for a small example. An input sequence of three tokens is encoded and the memory is updated at each step. Conditioned on the final memory state, the graph decoder then generates two concept labels and one relation label token by token.

Addressing Mechanism  To decide which cell of the memory to access at a specific timestep, the model computes α = att(M, h), an attention vector of values α[i] ∈ [0, 1] per cell summing to 1, i.e. a probability distribution over the memory cells. Following previous work on NTMs (Graves et al., 2014, Gulcehre et al., 2018), we use the idea of content-based addressing for this step.⁷³ The attention vector is computed as the cosine similarity C(· ; ·) between a transformed encoder state and every cell, followed by a softmax:

k = W_c h + b_c    (7.4)

s[i] = C(M[i] ; k)    (7.5)

α_c = softmax(β_c s)    (7.6)

Parameters W_c ∈ ℝ^{m×h}, b_c ∈ ℝ^m and β_c ∈ ℝ are learned, and h is the encoder state size.

Note that initializing the memory with zeros at t = 0 would result in a uniform addressing distribution over all cells at t = 1. To make sure the model clearly chooses a cell even at this point, we instead initialize the memory matrix M at t = 0 to M_0, a trained variable of the same size. In other words, a desirable initial state of the memory is learned by the model during training.
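The addressing step and the learned initial memory could be implemented roughly as follows; this is a sketch under the assumption of a PyTorch setup, and the module name, cell count and dimensions are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentAddressing(nn.Module):
    """Content-based addressing (Eqs. 7.4-7.6) with a learned initial memory M_0."""

    def __init__(self, num_cells, cell_dim, enc_dim):
        super().__init__()
        self.key_proj = nn.Linear(enc_dim, cell_dim)          # W_c, b_c
        self.beta = nn.Parameter(torch.ones(1))               # sharpening scalar beta_c
        # trained initial memory, so addressing at t = 1 is not uniform
        self.initial_memory = nn.Parameter(torch.randn(num_cells, cell_dim) * 0.1)

    def forward(self, memory, h):
        # memory: (num_cells, cell_dim), h: (enc_dim,)
        k = self.key_proj(h)                                            # Eq. 7.4
        s = F.cosine_similarity(memory, k.unsqueeze(0).expand_as(memory), dim=-1)  # Eq. 7.5
        return F.softmax(self.beta * s, dim=-1)                         # Eq. 7.6

# usage: address the learned initial memory with a random encoder state
addressing = ContentAddressing(num_cells=8, cell_dim=128, enc_dim=512)
alpha = addressing(addressing.initial_memory, torch.randn(512))  # sums to 1
```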

⁷³ We also experimented with additional location-based addressing, but did not observe substantial differences in the experimental setup described later in this section.


Nodes  At timestep t, we use the previously described addressing mechanism to obtain the attention vector α^N_t = att_N(M^N_{t-1}, h_t) and then update the i-th memory cell following the memory update rule proposed by Graves et al. (2014):

M^N_t[i] = (1 − e_i α^N_t[i]) ⊙ M^N_{t-1}[i] + a_i α^N_t[i]    (7.7)

The two m-dimensional gate vectors are given by

e_i = σ(W_e [M^N_{t-1}[i] || h_t] + b_e)    (7.8)

a_i = tanh(W_a [M^N_{t-1}[i] || h_t] + b_a)    (7.9)

with parameters W_e, W_a ∈ ℝ^{m×(m+h)} and corresponding bias vectors. While the first gate e_i controls what should be erased from the memory, the gate a_i defines the newly added information. This is very similar to updating the hidden state of a GRU.

Inspired by Gulcehre et al. (2018), we also add a NOP (No Operation) cell to each memory matrix. If the model does not want to update any cell at some point, it can simply attend to this additional cell, which is ignored during the later decoding step.
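A possible rendering of the gated update in code, again assuming PyTorch; the parameter names mirror Equations 7.7-7.9, while everything else (class name, shapes) is hypothetical.

```python
import torch
import torch.nn as nn

class GatedMemoryUpdate(nn.Module):
    """Gated cell update (Eqs. 7.7-7.9); one extra cell can serve as the NOP cell."""

    def __init__(self, cell_dim, enc_dim):
        super().__init__()
        self.erase = nn.Linear(cell_dim + enc_dim, cell_dim)   # W_e, b_e
        self.add = nn.Linear(cell_dim + enc_dim, cell_dim)     # W_a, b_a

    def forward(self, memory, h, alpha):
        # memory: (num_cells, cell_dim), h: (enc_dim,), alpha: (num_cells,)
        h_expanded = h.unsqueeze(0).expand(memory.size(0), -1)
        joint = torch.cat([memory, h_expanded], dim=-1)        # [M_{t-1}[i] || h_t]
        e = torch.sigmoid(self.erase(joint))                   # Eq. 7.8
        a = torch.tanh(self.add(joint))                        # Eq. 7.9
        alpha = alpha.unsqueeze(-1)                            # broadcast over cell dims
        # Eq. 7.7: erase old content proportional to the attention, then add new content
        return (1 - e * alpha) * memory + a * alpha
```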

Edges  Similar to nodes, the model also computes an attention α^E_t = att_E(M^E_{t-1}, h_t) and updates M^E_t using the same gated update mechanism, with separate sets of parameters for node and edge updates. For the edge's source and target nodes, represented by M^S_t and M^T_t, the model uses the attention mechanism common in sequence transduction models (Luong et al., 2015) over the previous timesteps t′ ∈ [1, t − 1] to compute

s_{t′} = h_t W_S h_{t′}    (7.10)

γ = softmax(s_{1:t−1})    (7.11)

with parameters W_S ∈ ℝ^{h×h} and then uses the attention weights to compute a weighted average of the previous node memory attention vectors

α^S_t = ∑_{t′} α^N_{t′} γ_{t′}    (7.12)

Correspondingly, a timestep-weighted average α^T_t of node memory attentions is computed over the following timesteps t′ ∈ [t + 1, ℓ]. The motivation for this is that whenever a relation mention occurs, the two concepts it refers to are usually mentioned right before and after it. We identify these concepts by looking at the previous and following node memory attentions. With the attention mechanism, the model can learn how far to look backward and forward to find the relevant concepts.
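The backward-looking source attention of Equations 7.10-7.12 could be sketched as follows; the target attention α^T_t would mirror it over the timesteps after t. Function and argument names are assumptions, not the thesis's implementation.

```python
import torch
import torch.nn.functional as F

def source_attention(h_t, prev_states, prev_node_alphas, W_S):
    """Weighted average of earlier node attentions (Eqs. 7.10-7.12).

    h_t:              (enc_dim,)           current encoder state
    prev_states:      (t-1, enc_dim)       encoder states h_1 .. h_{t-1}
    prev_node_alphas: (t-1, num_cells)     node attentions alpha^N_1 .. alpha^N_{t-1}
    W_S:              (enc_dim, enc_dim)   bilinear attention parameters
    """
    # bilinear score h_t W_S h_{t'} for every previous timestep (Eq. 7.10)
    scores = prev_states @ (W_S.t() @ h_t)
    gamma = F.softmax(scores, dim=0)           # Eq. 7.11
    return gamma @ prev_node_alphas            # Eq. 7.12: (num_cells,)
```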


The resulting distribution vectors are then stored for the currently modified edge, as determined by the addressing vector α^E_t over the edge memory:

M^S_t = (1 − α^E_t) M^S_{t-1} + α^E_t α^S_t    (7.13)

M^T_t = (1 − α^E_t) M^T_{t-1} + α^E_t α^T_t    (7.14)
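As an illustration, the endpoint updates of Equations 7.13-7.14 reduce to a convex combination per edge cell; the function below is a hedged sketch with assumed tensor shapes.

```python
import torch

def update_edge_endpoints(M_S, M_T, alpha_E, alpha_S, alpha_T):
    """Store endpoint distributions for the attended edge cell (Eqs. 7.13-7.14).

    M_S, M_T:         (num_edge_cells, num_node_cells)  source/target memories
    alpha_E:          (num_edge_cells,)                 edge addressing weights
    alpha_S, alpha_T: (num_node_cells,)                 endpoint distributions
    """
    a = alpha_E.unsqueeze(-1)                   # broadcast over node cells
    M_S = (1 - a) * M_S + a * alpha_S           # Eq. 7.13
    M_T = (1 - a) * M_T + a * alpha_T           # Eq. 7.14
    return M_S, M_T
```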

7.4.2.3 Graph Decoder

Once the complete input sequence has been encoded, we use the final memory state to decode a graph. For each node i, we decode a label c_i by conditioning a GRU on the i-th node memory cell to model the label probability as:

p(c_i | x) = ∏_j p(c_{i,j} | c_{i,0:j−1}, M^N_ℓ[i])    (7.15)

p(c_{i,j} | c_{i,0:j−1}, M^N_ℓ[i]) = softmax(E_D h^D_{i,j})    (7.16)

h^D_{i,j} = GRU_D(h^D_{i,j−1}, [E_D[c_{i,j−1}] || M^N_ℓ[i]])    (7.17)

Here, c_{i,j} denotes the j-th token of the label c_i, c_{i,0} is a special start symbol and E_D is a learned embedding matrix of the output vocabulary. We decode for each node until the stop symbol is produced and use the same GRU across cells.
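A greedy version of this per-cell label decoder might look as follows, assuming PyTorch, a zero initial decoder state, and an output projection tied to the embedding matrix E_D as suggested by Equation 7.16; the special-token ids and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class LabelDecoder(nn.Module):
    """Token-by-token label decoding from one memory cell (Eqs. 7.15-7.17)."""

    def __init__(self, vocab_size, emb_dim, cell_dim, go_id=1, stop_id=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)    # E_D
        self.gru_cell = nn.GRUCell(emb_dim + cell_dim, emb_dim)
        self.go_id, self.stop_id = go_id, stop_id

    @torch.no_grad()
    def greedy_decode(self, memory_cell, max_len=10):
        # memory_cell: one row of the final node (or edge) memory, shape (cell_dim,)
        h = torch.zeros(1, self.gru_cell.hidden_size)          # assumed h^D_{i,0} = 0
        token = torch.tensor([self.go_id])                     # start symbol c_{i,0}
        label = []
        for _ in range(max_len):
            # decoder input [E_D[c_{i,j-1}] || M^N_l[i]] as in Eq. 7.17
            inp = torch.cat([self.embedding(token), memory_cell.unsqueeze(0)], dim=-1)
            h = self.gru_cell(inp, h)
            logits = h @ self.embedding.weight.t()             # E_D h^D_{i,j}, Eq. 7.16
            token = logits.argmax(dim=-1)
            if token.item() == self.stop_id:
                break
            label.append(token.item())
        return label
```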

Analogously, we also decode labels r_i from the edge memory cells, reusing the same GRU decoder as for the nodes to find the most likely sequence according to:

𝑝(π‘Ÿπ‘– ∣x) = ∏

𝑗𝑝(π‘Ÿπ‘–,𝑗 ∣ π‘Ÿπ‘–,0βˆΆπ‘—βˆ’1,M𝐸ℓ[𝑖]) (7.18) Note that the decoding process until this point is similar to a sequence transduction model with the only difference that instead of generating just a single sequence, we apply the RNN repeatedly to all node and edge memory cells.

To determine the source node s(r_i) and target node t(r_i) for a decoded edge r_i, we model them as probability distributions over all nodes as follows:

p(s(r_i) | x) = M^S_ℓ[i]    (7.19)

p(t(r_i) | x) = M^T_ℓ[i]    (7.20)

Note that, by construction, the cells of M^S and M^T are already valid probability distributions and can be used exactly as present in the memory.

7.4.2.4 Training

We train the model with pairs (x, G*) of input sequences and target graphs. For the CM-MDS task, this means G* is the summary concept map, while x, as in the previous chapter, is a sequence produced by pre-summarizing the multi-document input to a length that can be processed by the neural model.

For each type of prediction the model makes, i.e. node (7.15), edge (7.18), source (7.19) and target (7.20) predictions, we compute the cross-entropy loss against the gold data G*. For labels, we normalize over the length of the sequence. These partial losses are averaged over nodes and edges and then combined into the overall training loss for an instance (G, G*) with respect to the set of all parameters θ:

ℒ(G, G*, θ) = 1/3 ℒ_N + 1/3 ℒ_E + 1/6 ℒ_S + 1/6 ℒ_T    (7.21)

The loss can be minimized using stochastic gradient descent or methods such as Adam (Kingma and Ba, 2015). Note that although we only train the model to make several independent local predictions, the shared encoding network and the joint training still enable the model to learn dependencies between these parts. Similar training paradigms have been shown to be effective in other structured prediction tasks (Goldberg, 2017).
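Assuming the four partial losses have already been computed as length-normalized cross-entropies averaged over nodes and edges, the combination in Equation 7.21 is a weighted sum; the helpers below are only an illustration, and the padding id is an assumed detail.

```python
import torch
import torch.nn.functional as F

def label_loss(logits, gold, pad_id=0):
    """Length-normalized cross-entropy for one decoded label sequence.

    logits: (label_len, vocab_size), gold: (label_len,) with pad_id marking unused positions.
    """
    return F.cross_entropy(logits, gold, ignore_index=pad_id, reduction="mean")

def combined_loss(loss_N, loss_E, loss_S, loss_T):
    """Weighted combination of the four partial losses (Eq. 7.21)."""
    return loss_N / 3 + loss_E / 3 + loss_S / 6 + loss_T / 6
```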

An important question when defining the loss is how nodes and edges, which are sets with no particular order, should be handled. We sort them according to their first mention and assign them to ascending memory cells. This ordering requires the model to also use the cells in ascending order during creation, a consistent strategy that is easy to learn (Vinyals et al., 2015a). If a target graph has fewer nodes or edges than the memory capacity, we train the decoder to immediately generate the stop symbol from the unused cells.

We train the model with minibatches of sequences of similar length and shuffle their order between epochs. The label decoder is trained with teacher forcing, i.e. we provide the previous gold token at every decoding timestep. For regularization, we apply dropout after the input embedding, the sequence encoder and the decoder input.

7.4.2.5 Inference

At inference time, we use the predictions of our model to find the best graph G given x, as defined in Equation 7.2. We define a graph's score as the sum of the probabilities of its components:

π‘ π‘π‘œπ‘Ÿπ‘’(𝐺) =

|𝐢|

βˆ‘

𝑖=1

𝑝(𝑐𝑖 ∣x) +

|𝑅|

βˆ‘

𝑖=1

(𝑝(π‘Ÿπ‘– ∣x) + 𝑝(𝑠(π‘Ÿπ‘–) ∣x) + 𝑝(𝑑(π‘Ÿπ‘–) ∣x)) (7.22) Although this function decomposes nicely along the different predictions, the maximization is non-trivial as it has to be done under the connectedness constraint of CM-MDS.

We propose an ILP to find the best graph given the predictions. Let x_{s,ij}, x_{t,ij} with i ∈ [1, e], j ∈ [1, n] be binary decision variables for the source and target assignment of each edge. We want to find values for these variables that maximize Equation 7.22 under several constraints. Every edge can only have one source and one target node,

∑_j x_{s,ij} = 1    ∀ i ∈ [1, e]    (7.23)

∑_j x_{t,ij} = 1    ∀ i ∈ [1, e]    (7.24)

and the same node cannot be selected as both the source and the target of one edge:

x_{s,ij} + x_{t,ij} ≤ 1    ∀ i ∈ [1, e], j ∈ [1, n]    (7.25)

Additional constraints define that the resulting graph should be connected, which can be expressed using flow variables as shown in Section 6.4.1.

Further, we also add binary decision variables x_{c,ij} with i ∈ [1, n], j ∈ [1, k] for the k-best concept labels, of which at most one can be selected per node. They serve two purposes: First, they allow the ILP to select between the possible labels for a node. In cases where the most probable labels for two nodes are the same, i.e. they would denote the same concept in the resulting graph, it can choose a second-best label for one of them to obtain a graph with two distinct concepts and thus a potentially higher score. Second, since the connectedness constraint might lead to a best graph that does not contain all nodes, these variables are needed to reflect that in the scoring function. All edges, on the other hand, will always be part of an optimal solution and therefore do not have to be modeled explicitly in the ILP.
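For illustration, the edge-endpoint part of this ILP could be written with an off-the-shelf solver interface such as PuLP (an assumption; the thesis does not name a solver). The sketch below covers only Equations 7.23-7.25 and the endpoint terms of the objective, omitting the connectedness (flow) constraints and the concept-label variables.

```python
import pulp

def best_edge_assignment(p_source, p_target):
    """Hedged sketch of the edge-endpoint part of the ILP (Eqs. 7.23-7.25).

    p_source, p_target: nested lists with p_source[i][j] = p(s(r_i) = node j | x).
    Returns one (source, target) node pair per edge.
    """
    n_edges, n_nodes = len(p_source), len(p_source[0])
    prob = pulp.LpProblem("concept_map_edges", pulp.LpMaximize)

    x_s = [[pulp.LpVariable(f"xs_{i}_{j}", cat="Binary") for j in range(n_nodes)]
           for i in range(n_edges)]
    x_t = [[pulp.LpVariable(f"xt_{i}_{j}", cat="Binary") for j in range(n_nodes)]
           for i in range(n_edges)]

    # objective: probability mass of the chosen endpoints (part of Eq. 7.22)
    prob += pulp.lpSum(p_source[i][j] * x_s[i][j] + p_target[i][j] * x_t[i][j]
                       for i in range(n_edges) for j in range(n_nodes))

    for i in range(n_edges):
        prob += pulp.lpSum(x_s[i]) == 1            # Eq. 7.23: exactly one source
        prob += pulp.lpSum(x_t[i]) == 1            # Eq. 7.24: exactly one target
        for j in range(n_nodes):
            prob += x_s[i][j] + x_t[i][j] <= 1     # Eq. 7.25: no self-loop

    prob.solve()
    return [(next(j for j in range(n_nodes) if x_s[i][j].value() > 0.5),
             next(j for j in range(n_nodes) if x_t[i][j].value() > 0.5))
            for i in range(n_edges)]
```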

7.4.2.6 Subtasks in End-to-End Approach

Having described all aspects of the proposed neural model, we briefly summarize how it addresses the different subtasks of CM-MDS. Naturally, since the task is modeled end-to-end, no explicit components for the different subtasks exist. Nevertheless, the architecture was designed with the task in mind, and different features are meant to enable the model to handle specific challenges of CM-MDS.

Concept Mention Extraction Detecting concept mentions in the input sequence is mainly the task of the RNN encoder as well as the addressing and update mechanism of the memory. If a processed input token is found to be part of a concept mention, the node memory can be updated accordingly, whereas for other tokens, using the NOP cell or corresponding gate outputs can keep it unchanged.

Concept Mention Grouping Using the content-based addressing mechanism, the model compares the RNN representation of the current input token with each cell of the memory, allowing it to determine when concepts are mentioned repeatedly. It can thus learn to update the corresponding cells rather than to use additional ones, effectively resolving coreferences using the memory.


Relation Mention Extraction and Grouping The extraction and grouping of relation mentions is handled similarly to concepts, using the corresponding memory matrix. By updating the source and target node matrices, a relation is associated with its concepts.

Concept and Relation Labeling Since the model is generative, labels are created during decoding and can be any sequence of tokens from the vocabulary. Thus, the model is able to generate abstractive concept maps.

Importance Estimation No explicit importance estimation is done. Instead, due to the limited capacity, the model always keeps just a summary-sized subset of the input in its memory. This can be seen as an alternative approach to summarization where the main task is to make local decisions about what to keep in the memory, what to add to it and what to override when reaching capacity. As we mentioned earlier, this is inspired by the cognitive model of Kintsch and van Dijk (1978) and summarization systems designed accordingly (Fang and Teufel, 2016, Fang et al., 2016). In addition, some part of selecting the summary-worthy subset of the multi-document input is also handled by the pre-summarizer in our setup.

Concept Map Construction The construction of the summary concept map is handled by the model's decoder and the subsequent ILP, which ensures that a connected graph is created. Due to the limited memory size, the selection of concepts and relations already happens during encoding and the size constraint is always satisfied.

While these considerations are the motivation behind the different parts of the architecture and provide some intuition as to why the model should be able to handle the task in its full scope, there is of course no guarantee that the model, when trained end-to-end, is actually able to learn the task, or that the different subtasks are handled as intended.