• Keine Ergebnisse gefunden

7.4 Memory-based Graph Manipulation Models

7.4.2 Sequence-to-Graph Networks

As explained earlier, the key idea of our sequence-to-graph network is that it learns to map from an input sequence to a labeled graph via the structured memory that we just introduced. More formally, let x be the input sequence that should be summarized. We then formulate CM-MDS as the structured prediction problem of finding

G = arg max_{G′ ∈ 𝒒(x)} score(G′)    (7.2)

where 𝒒(x) denotes the (infinite) set of possible valid graphs that can be constructed from x and score(·) is a function determining how well each of them summarizes the input.

To make that search tractable, we factor the scoring function into several probability distributions conditioned on the input that are learned by the sequence-to-graph network.

In the following section, we describe each of its components in detail. Figure 7.2 shows an unrolled version of the model, illustrating how the encoder, memory and decoder interact.

7.4.2.1 Sequence Encoder

The task of the sequence encoder is to compose a hidden representation of the input sequence and learn to recognize mentions of concepts and relations in it. Each token x_1, …, x_ℓ in x is represented by a word embedding. These representations are then recurrently processed by a GRU (Cho et al., 2014). The input representation at timestep t is given as

h^F_t = GRU_F(h^F_{t-1}, E[x_t])    (7.3)

where E is the word embedding matrix. In addition to GRU_F, we use a second cell GRU_B to process the sequence backwards and concatenate the hidden states to h_t = h^F_t || h^B_t.
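To make the encoder concrete, the following is a minimal sketch of such a bidirectional GRU encoder, assuming a PyTorch implementation; the framework choice, layer sizes and class name are illustrative and not prescribed by the thesis.

```python
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    """Bidirectional GRU encoder: h_t = h^F_t || h^B_t (Eq. 7.3)."""

    def __init__(self, vocab_size, emb_dim=100, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)   # embedding matrix E
        # bidirectional=True runs GRU_F and GRU_B and concatenates their states
        self.gru = nn.GRU(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> embedded inputs E[x_t]
        emb = self.embedding(token_ids)
        # states: (batch, seq_len, 2 * hidden_dim), forward and backward
        # hidden states already concatenated along the last dimension
        states, _ = self.gru(emb)
        return states

# usage: encode a toy batch of two three-token sequences
encoder = SequenceEncoder(vocab_size=1000)
h = encoder(torch.randint(0, 1000, (2, 3)))   # shape (2, 3, 512)
```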

7.4.2.2 Memory Module

Similar to the sequence encoder, the memory module is also a recurrent component. At each timestep t, it receives the current encoder state h_t as input and updates the memory state. If the token x_t is part of a concept or relation mention, its task is to compare it with the current memory state to resolve coreferences, create a new node or edge in the graph if necessary and forget parts of the graph if the memory reaches its capacity.

Figure 7.2: Sequence-to-graph network unrolled for a small example. An input sequence of three tokens is encoded and the memory is updated at each step. Conditioned on the final memory state, the graph decoder then generates two concept labels and one relation label token by token.

Addressing Mechanism  To decide which cell of the memory to access at a specific timestep, the model computes α = att(M, h), an attention vector of values α[i] ∈ [0, 1] per cell summing to 1, i.e. a probability distribution over the memory cells. Following previous work on NTMs (Graves et al., 2014, Gulcehre et al., 2018), we use the idea of content-based addressing for this step.⁷³ The attention vector is computed as the cosine similarity C(· ; ·) between a transformed encoder state and every cell, followed by a softmax:

k = W_c h + b_c    (7.4)

s[i] = C(M[i] ; k)    (7.5)

α_c = softmax(β_c s)    (7.6)

Parameters W_c ∈ ℝ^{m×h}, b_c ∈ ℝ^m and β_c ∈ ℝ are learned, and h is the encoder state size.

Note that initializing the memory with zeros at t = 0 would result in a uniform addressing distribution over all cells at t = 1. To make sure the model clearly chooses a cell even at this point, we instead initialize the memory matrix M at t = 0 to M_0, a trained variable of the same size. In other words, a desirable initial state of the memory is learned by the model during training.
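The addressing step and the learned initial memory could be implemented roughly as follows; this is a sketch under the assumption of a PyTorch setup, and the module name, cell count and dimensions are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentAddressing(nn.Module):
    """Content-based addressing (Eqs. 7.4-7.6) with a learned initial memory M_0."""

    def __init__(self, num_cells, cell_dim, enc_dim):
        super().__init__()
        self.key_proj = nn.Linear(enc_dim, cell_dim)          # W_c, b_c
        self.beta = nn.Parameter(torch.ones(1))               # sharpening scalar beta_c
        # trained initial memory, so addressing at t = 1 is not uniform
        self.initial_memory = nn.Parameter(torch.randn(num_cells, cell_dim) * 0.1)

    def forward(self, memory, h):
        # memory: (num_cells, cell_dim), h: (enc_dim,)
        k = self.key_proj(h)                                            # Eq. 7.4
        s = F.cosine_similarity(memory, k.unsqueeze(0).expand_as(memory), dim=-1)  # Eq. 7.5
        return F.softmax(self.beta * s, dim=-1)                         # Eq. 7.6

# usage: address the learned initial memory with a random encoder state
addressing = ContentAddressing(num_cells=8, cell_dim=128, enc_dim=512)
alpha = addressing(addressing.initial_memory, torch.randn(512))  # sums to 1
```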

⁷³ We also experimented with additional location-based addressing, but did not observe substantial differences in the experimental setup described later in this section.


Nodes  At timestep t, we use the previously described addressing mechanism to obtain the attention vector α^N_t = att_N(M^N_{t-1}, h_t) and then update the i-th memory cell following the memory update rule proposed by Graves et al. (2014):

M^N_t[i] = (1 − e_i α^N_t[i]) ⊙ M^N_{t-1}[i] + a_i α^N_t[i]    (7.7)

The two m-dimensional gate vectors are given by

e_i = σ(W_e [M^N_{t-1}[i] || h_t] + b_e)    (7.8)

a_i = tanh(W_a [M^N_{t-1}[i] || h_t] + b_a)    (7.9)

with parameters W_e, W_a ∈ ℝ^{m×(m+h)} and corresponding bias vectors. While the first gate e_i controls what should be erased from the memory, the gate a_i defines the newly added information. This is very similar to updating the hidden state of a GRU.

Inspired by Gulcehre et al. (2018), we also add a NOP (No Operation) cell to each memory matrix. If the model does not want to update any cell at some point, it can simply attend to this additional cell, which is ignored during the later decoding step.
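A possible rendering of the gated update in code, again assuming PyTorch; the parameter names mirror Equations 7.7-7.9, while everything else (class name, shapes) is hypothetical.

```python
import torch
import torch.nn as nn

class GatedMemoryUpdate(nn.Module):
    """Gated cell update (Eqs. 7.7-7.9); one extra cell can serve as the NOP cell."""

    def __init__(self, cell_dim, enc_dim):
        super().__init__()
        self.erase = nn.Linear(cell_dim + enc_dim, cell_dim)   # W_e, b_e
        self.add = nn.Linear(cell_dim + enc_dim, cell_dim)     # W_a, b_a

    def forward(self, memory, h, alpha):
        # memory: (num_cells, cell_dim), h: (enc_dim,), alpha: (num_cells,)
        h_expanded = h.unsqueeze(0).expand(memory.size(0), -1)
        joint = torch.cat([memory, h_expanded], dim=-1)        # [M_{t-1}[i] || h_t]
        e = torch.sigmoid(self.erase(joint))                   # Eq. 7.8
        a = torch.tanh(self.add(joint))                        # Eq. 7.9
        alpha = alpha.unsqueeze(-1)                            # broadcast over cell dims
        # Eq. 7.7: erase old content proportional to the attention, then add new content
        return (1 - e * alpha) * memory + a * alpha
```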

Edges  Similar to nodes, the model also computes an attention α^E_t = att_E(M^E_{t-1}, h_t) and updates M^E_t using the same gated update mechanism, with separate sets of parameters for node and edge updates. For the edge's source and target nodes, represented by M^S_t and M^T_t, the model uses the attention mechanism common in sequence transduction models (Luong et al., 2015) over the previous timesteps t′ ∈ [1, t − 1] to compute

s_{t′} = h_t W_S h_{t′}    (7.10)

γ = softmax(s_{1:t−1})    (7.11)

with parameters W_S ∈ ℝ^{h×h} and then uses the attention weights to compute a weighted average of the previous node memory attention vectors

α^S_t = ∑_{t′} α^N_{t′} γ_{t′}    (7.12)

Correspondingly, a timestep-weighted average α^T_t of node memory attentions is computed over the following timesteps t′ ∈ [t + 1, ℓ]. The motivation for this is that whenever a relation mention occurs, the two concepts it refers to are usually mentioned right before and after it. We identify these concepts by looking at the previous and following node memory attentions. With the attention mechanism, the model can learn how far to look backward and forward to find the relevant concepts.
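The backward-looking source attention of Equations 7.10-7.12 could be sketched as follows; the target attention α^T_t would mirror it over the timesteps after t. Function and argument names are assumptions, not the thesis's implementation.

```python
import torch
import torch.nn.functional as F

def source_attention(h_t, prev_states, prev_node_alphas, W_S):
    """Weighted average of earlier node attentions (Eqs. 7.10-7.12).

    h_t:              (enc_dim,)           current encoder state
    prev_states:      (t-1, enc_dim)       encoder states h_1 .. h_{t-1}
    prev_node_alphas: (t-1, num_cells)     node attentions alpha^N_1 .. alpha^N_{t-1}
    W_S:              (enc_dim, enc_dim)   bilinear attention parameters
    """
    # bilinear score h_t W_S h_{t'} for every previous timestep (Eq. 7.10)
    scores = prev_states @ (W_S.t() @ h_t)
    gamma = F.softmax(scores, dim=0)           # Eq. 7.11
    return gamma @ prev_node_alphas            # Eq. 7.12: (num_cells,)
```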


The resulting distribution vectors are then stored for the currently modified edge, as determined by the addressing vector α^E_t over the edge memory:

M^S_t = (1 − α^E_t) M^S_{t-1} + α^E_t α^S_t    (7.13)

M^T_t = (1 − α^E_t) M^T_{t-1} + α^E_t α^T_t    (7.14)
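As an illustration, the endpoint updates of Equations 7.13-7.14 reduce to a convex combination per edge cell; the function below is a hedged sketch with assumed tensor shapes.

```python
import torch

def update_edge_endpoints(M_S, M_T, alpha_E, alpha_S, alpha_T):
    """Store endpoint distributions for the attended edge cell (Eqs. 7.13-7.14).

    M_S, M_T:         (num_edge_cells, num_node_cells)  source/target memories
    alpha_E:          (num_edge_cells,)                 edge addressing weights
    alpha_S, alpha_T: (num_node_cells,)                 endpoint distributions
    """
    a = alpha_E.unsqueeze(-1)                   # broadcast over node cells
    M_S = (1 - a) * M_S + a * alpha_S           # Eq. 7.13
    M_T = (1 - a) * M_T + a * alpha_T           # Eq. 7.14
    return M_S, M_T
```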

7.4.2.3 Graph Decoder

Once the complete input sequence has been encoded, we use the final memory state to decode a graph. For each node i, we decode a label c_i by conditioning a GRU on the i-th node memory cell to model the label probability as:

p(c_i | x) = ∏_j p(c_{i,j} | c_{i,0:j−1}, M^N_ℓ[i])    (7.15)

p(c_{i,j} | c_{i,0:j−1}, M^N_ℓ[i]) = softmax(E_D h^D_{i,j})    (7.16)

h^D_{i,j} = GRU_D(h^D_{i,j−1}, [E_D[c_{i,j−1}] || M^N_ℓ[i]])    (7.17)

Here, c_{i,j} denotes the j-th token of the label c_i, c_{i,0} is a special start symbol and E_D is a learned embedding matrix of the output vocabulary. We decode for each node until the stop symbol is produced and use the same GRU across cells.
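A greedy version of this per-cell label decoder might look as follows, assuming PyTorch, a zero initial decoder state, and an output projection tied to the embedding matrix E_D as suggested by Equation 7.16; the special-token ids and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class LabelDecoder(nn.Module):
    """Token-by-token label decoding from one memory cell (Eqs. 7.15-7.17)."""

    def __init__(self, vocab_size, emb_dim, cell_dim, go_id=1, stop_id=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)    # E_D
        self.gru_cell = nn.GRUCell(emb_dim + cell_dim, emb_dim)
        self.go_id, self.stop_id = go_id, stop_id

    @torch.no_grad()
    def greedy_decode(self, memory_cell, max_len=10):
        # memory_cell: one row of the final node (or edge) memory, shape (cell_dim,)
        h = torch.zeros(1, self.gru_cell.hidden_size)          # assumed h^D_{i,0} = 0
        token = torch.tensor([self.go_id])                     # start symbol c_{i,0}
        label = []
        for _ in range(max_len):
            # decoder input [E_D[c_{i,j-1}] || M^N_l[i]] as in Eq. 7.17
            inp = torch.cat([self.embedding(token), memory_cell.unsqueeze(0)], dim=-1)
            h = self.gru_cell(inp, h)
            logits = h @ self.embedding.weight.t()             # E_D h^D_{i,j}, Eq. 7.16
            token = logits.argmax(dim=-1)
            if token.item() == self.stop_id:
                break
            label.append(token.item())
        return label
```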

Analogously, we also decode labels r_i from the edge memory cells, reusing the same GRU decoder as for the nodes to find the most likely sequence according to:

𝑝(π‘Ÿπ‘– ∣x) = ∏

𝑗𝑝(π‘Ÿπ‘–,𝑗 ∣ π‘Ÿπ‘–,0βˆΆπ‘—βˆ’1,M𝐸ℓ[𝑖]) (7.18) Note that the decoding process until this point is similar to a sequence transduction model with the only difference that instead of generating just a single sequence, we apply the RNN repeatedly to all node and edge memory cells.

To determine the source node s(r_i) and target node t(r_i) for a decoded edge r_i, we model them as probability distributions over all nodes as follows:

p(s(r_i) | x) = M^S_ℓ[i]    (7.19)

p(t(r_i) | x) = M^T_ℓ[i]    (7.20)

Note that, by construction, the cells of M^S and M^T are already valid probability distributions and can be used exactly as present in the memory.

7.4.2.4 Training

We train the model with pairs (x, G*) of input sequences and target graphs. For the CM-MDS task, this means G* is the summary concept map, while x, as in the previous chapter, is a sequence produced by pre-summarizing the multi-document input to a length that can be processed by the neural model.

For each type of prediction the model makes, i.e. node (7.15), edge (7.18), source (7.19) and target (7.20) predictions, we compute the cross-entropy loss against the gold data G*. For labels, we normalize over the length of the sequence. These partial losses are averaged over nodes and edges and then combined into the overall training loss for an instance (G, G*) with respect to the set of all parameters θ:

ℒ(G, G*, θ) = 1/3 ℒ_N + 1/3 ℒ_E + 1/6 ℒ_S + 1/6 ℒ_T    (7.21)

The loss can be minimized using stochastic gradient descent or methods such as Adam (Kingma and Ba, 2015). Note that although we only train the model to make several independent local predictions, the shared encoding network and the joint training still enable the model to learn dependencies between these parts. Similar training paradigms have been shown to be effective in other structured prediction tasks (Goldberg, 2017).
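Assuming the four partial losses have already been computed as length-normalized cross-entropies averaged over nodes and edges, the combination in Equation 7.21 is a weighted sum; the helpers below are only an illustration, and the padding id is an assumed detail.

```python
import torch
import torch.nn.functional as F

def label_loss(logits, gold, pad_id=0):
    """Length-normalized cross-entropy for one decoded label sequence.

    logits: (label_len, vocab_size), gold: (label_len,) with pad_id marking unused positions.
    """
    return F.cross_entropy(logits, gold, ignore_index=pad_id, reduction="mean")

def combined_loss(loss_N, loss_E, loss_S, loss_T):
    """Weighted combination of the four partial losses (Eq. 7.21)."""
    return loss_N / 3 + loss_E / 3 + loss_S / 6 + loss_T / 6
```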

An important question when defining the loss is how nodes and edges, which are sets with no particular order, should be handled. We sort them according to their first mention and assign them to ascending memory cells. This ordering requires the model to also use the cells in ascending order during creation, a consistent strategy that is easy to learn (Vinyals et al., 2015a). If a target graph has fewer nodes or edges than the memory capacity, we train the decoder to immediately generate the stop symbol from the unused cells.

We train the model with minibatches of sequences of similar length and shuffle their order between epochs. The label decoder is trained with teacher forcing, i.e. we provide the previous gold token at every decoding timestep. For regularization, we apply dropout after the input embedding, the sequence encoder and the decoder input.

7.4.2.5 Inference

At inference time, we use the predictions of our model to find the best graph G given x, as defined in Equation 7.2. We define a graph's score as the sum of the probabilities of its components:

π‘ π‘π‘œπ‘Ÿπ‘’(𝐺) =

|𝐢|

βˆ‘

𝑖=1

𝑝(𝑐𝑖 ∣x) +

|𝑅|

βˆ‘

𝑖=1

(𝑝(π‘Ÿπ‘– ∣x) + 𝑝(𝑠(π‘Ÿπ‘–) ∣x) + 𝑝(𝑑(π‘Ÿπ‘–) ∣x)) (7.22) Although this function decomposes nicely along the different predictions, the maximization is non-trivial as it has to be done under the connectedness constraint of CM-MDS.

We propose an ILP to find the best graph given the predictions. Let x_{s,ij}, x_{t,ij} with i ∈ [1, e], j ∈ [1, n] be binary decision variables for the source and target assignment of each edge. We want to find values for these variables that maximize Equation 7.22 under several constraints. Every edge can only have one source and one target node,

∑_j x_{s,ij} = 1    ∀ i ∈ [1, e]    (7.23)

∑_j x_{t,ij} = 1    ∀ i ∈ [1, e]    (7.24)

and the same node cannot be selected as both the source and the target of one edge:

x_{s,ij} + x_{t,ij} ≤ 1    ∀ i ∈ [1, e], j ∈ [1, n]    (7.25)

Additional constraints define that the resulting graph should be connected, which can be expressed using flow variables as shown in Section 6.4.1.

Further, we also add binary decision variables x_{c,ij} with i ∈ [1, n], j ∈ [1, k] for the k-best concept labels, of which at most one can be selected per node. They serve two purposes: First, they allow the ILP to select between the possible labels for a node. In cases where the most probable labels for two nodes are the same, i.e. they would denote the same concept in the resulting graph, it can choose a second-best label for one of them to obtain a graph with two distinct concepts and thus a potentially higher score. Second, since the connectedness constraint might lead to a best graph that does not contain all nodes, these variables are needed to reflect that in the scoring function. All edges, on the other hand, will always be part of an optimal solution and therefore do not have to be modeled explicitly in the ILP.
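For illustration, the edge-endpoint part of this ILP could be written with an off-the-shelf solver interface such as PuLP (an assumption; the thesis does not name a solver). The sketch below covers only Equations 7.23-7.25 and the endpoint terms of the objective, omitting the connectedness (flow) constraints and the concept-label variables.

```python
import pulp

def best_edge_assignment(p_source, p_target):
    """Hedged sketch of the edge-endpoint part of the ILP (Eqs. 7.23-7.25).

    p_source, p_target: nested lists with p_source[i][j] = p(s(r_i) = node j | x).
    Returns one (source, target) node pair per edge.
    """
    n_edges, n_nodes = len(p_source), len(p_source[0])
    prob = pulp.LpProblem("concept_map_edges", pulp.LpMaximize)

    x_s = [[pulp.LpVariable(f"xs_{i}_{j}", cat="Binary") for j in range(n_nodes)]
           for i in range(n_edges)]
    x_t = [[pulp.LpVariable(f"xt_{i}_{j}", cat="Binary") for j in range(n_nodes)]
           for i in range(n_edges)]

    # objective: probability mass of the chosen endpoints (part of Eq. 7.22)
    prob += pulp.lpSum(p_source[i][j] * x_s[i][j] + p_target[i][j] * x_t[i][j]
                       for i in range(n_edges) for j in range(n_nodes))

    for i in range(n_edges):
        prob += pulp.lpSum(x_s[i]) == 1            # Eq. 7.23: exactly one source
        prob += pulp.lpSum(x_t[i]) == 1            # Eq. 7.24: exactly one target
        for j in range(n_nodes):
            prob += x_s[i][j] + x_t[i][j] <= 1     # Eq. 7.25: no self-loop

    prob.solve()
    return [(next(j for j in range(n_nodes) if x_s[i][j].value() > 0.5),
             next(j for j in range(n_nodes) if x_t[i][j].value() > 0.5))
            for i in range(n_edges)]
```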

7.4.2.6 Subtasks in End-to-End Approach

Having described all aspects of the proposed neural model, we briefly summarize how it addresses the different subtasks of CM-MDS. Naturally, since the task is modeled end-to-end, no explicit components for the different subtasks exist. Nevertheless, the architecture was designed with the task in mind, and different features are meant to enable the model to handle specific challenges of CM-MDS.

Concept Mention Extraction Detecting concept mentions in the input sequence is mainly the task of the RNN encoder as well as the addressing and update mechanism of the memory. If a processed input token is found to be part of a concept mention, the node memory can be updated accordingly, whereas for other tokens, using the NOP cell or corresponding gate outputs can keep it unchanged.

Concept Mention Grouping Using the content-based addressing mechanism, the model compares the RNN representation of the current input token with each cell of the memory, allowing it to determine when concepts are mentioned repeatedly. It can thus learn to update the corresponding cells rather than to use additional ones, effectively resolving coreferences using the memory.


Relation Mention Extraction and Grouping The extraction and grouping of relation mentions is handled similarly to concepts, using the corresponding memory matrix. By updating the source and target node matrices, a relation is associated with its concepts.

Concept and Relation Labeling Since the model is generative, labels are created during decoding and can be any sequence of tokens from the vocabulary. Thus, the model is able to generate abstractive concept maps.

Importance Estimation No explicit importance estimation is done. Instead, due to the limited capacity, the model always keeps just a summary-sized subset of the input in its memory. This can be seen as an alternative approach to summarization where the main task is to make local decisions about what to keep in the memory, what to add to it and what to override when reaching capacity. As we mentioned earlier, this is inspired by the cognitive model of Kintsch and van Dijk (1978) and summarization systems designed accordingly (Fang and Teufel, 2016, Fang et al., 2016). In addition, some part of selecting the summary-worthy subset of the multi-document input is also handled by the pre-summarizer in our setup.

Concept Map Construction The construction of the summary concept map is handled by the model's decoder and the subsequent ILP, which ensures that a connected graph is created. Due to the limited memory size, the selection of concepts and relations already happens during encoding and the size constraint is always satisfied.

While these considerations are the motivation behind the different parts of the architecture and provide some intuition as to why the model should be able to handle the task in its full scope, there is of course no guarantee that the model, when trained end-to-end, is actually able to learn the task, or that the different subtasks are handled as intended.