
2.3 NLP Methods Supporting Document Exploration


calculate concept scores with the Hub Authority Root Distance (HARD) model (Leake et al., 2004), choosing concepts that are central and highly connected. With regard to relation selection, Qasim et al. (2013) choose among multiple relations for the same pair of concepts with a VF-ICF metric, preferring verbs that occur often (verb frequency) but co-occur with only a small number of concepts (“inverse co-occurrence frequency”).
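To make the metric concrete, the following sketch computes VF-ICF-style scores from a list of (concept, verb, concept) triples. It is a hypothetical reimplementation: the exact normalization and smoothing used by Qasim et al. (2013) may differ.

    from collections import defaultdict

    def vf_icf_scores(triples):
        """Score relation candidates (concept1, verb, concept2) with a
        VF-ICF-style metric: prefer verbs that occur often overall
        (verb frequency) but connect few distinct concept pairs
        (inverse co-occurrence frequency). Sketch only; the exact
        formula of Qasim et al. (2013) may differ."""
        verb_freq = defaultdict(int)    # total occurrences per verb
        verb_pairs = defaultdict(set)   # distinct concept pairs per verb
        for c1, verb, c2 in triples:
            verb_freq[verb] += 1
            verb_pairs[verb].add((c1, c2))
        return {t: verb_freq[t[1]] / len(verb_pairs[t[1]]) for t in triples}

Among the candidate relations for a concept pair, the highest-scoring one would then be kept.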

All the scoring strategies described above aim at determining the importance or relevance of a concept (or relation) in order to select a subset that is representative of the input. The map construction becomes more difficult if one also tries to optimize for other objectives, such as producing a well-connected map. Simply selecting the most important concepts can yield many unconnected ones, as there might not be any relations between them. Zubrinic et al. (2015) try to avoid this: They pre-select a subset of the 100 most important concepts according to their CF-IDF metric, build a graph from them and then iteratively remove nodes with the lowest degree until reaching a target size of 25 to 30 concepts. By choosing nodes by degree, their approach keeps the concept map as connected as possible.
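The pruning step can be stated in a few lines. The sketch below assumes the top-100 concepts have already been selected by CF-IDF and connected into a graph (here with networkx); how Zubrinic et al. (2015) break ties between equal-degree nodes is not specified.

    import networkx as nx

    def prune_to_size(graph: nx.Graph, target_size: int = 30) -> nx.Graph:
        """Iteratively remove the node with the lowest degree until at
        most `target_size` concepts remain, keeping the map as
        connected as possible (after Zubrinic et al., 2015)."""
        g = graph.copy()
        while g.number_of_nodes() > target_size:
            g.remove_node(min(g.nodes, key=lambda n: g.degree[n]))
        return g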

2.3.2 Text Summarization

In the remainder of this section, we present the main computational approaches to the summarization problem and point to seminal or exemplary papers for each direction. For a comprehensive review of all related work, we refer the reader to the surveys by Nenkova and McKeown (2011), Yao et al. (2017) and Gambhir and Gupta (2017).

2.3.2.1 Extractive Summarization

Extractive summarization systems produce summaries reusing parts — mostly complete sentences — taken from the input documents without modification. More formally, let 𝐷 be a set of documents, 𝒮(𝐷) the set of all sentences in 𝐷 and ℒ the maximal length of the desired summary. The task is then to select a subset of sentences 𝑆 ⊂ 𝒮(𝐷) with ∑_{𝑠∈𝑆} 𝑙𝑒𝑛(𝑠) ≤ ℒ, where 𝑙𝑒𝑛(𝑠) is the length of 𝑠 in words. Two subtasks, importance estimation and sentence selection, are usually modeled to create extractive summaries.

Importance Estimation In order to include the most important information in a summary, the importance 𝑖(𝑠) of each sentence 𝑠 ∈ 𝒮(𝐷) needs to be estimated. Luhn (1958), the very first work on automatic summarization, used word frequencies to derive importance estimates for sentences. Almost 60 years later, summarization systems using frequency as the only indicator of importance still yield competitive results (Boudin et al., 2015). Edmundson (1969) added the position of a sentence in the document and the presence of predefined cue words as additional indicators. Among many other metrics explored in later work, importance estimates derived from graph structures with the PageRank algorithm (Page et al., 1999) had a particularly large impact. Both TextRank (Mihalcea and Tarau, 2004), which uses a graph representing co-occurring words, and LexRank (Erkan and Radev, 2004), which uses a graph of sentence similarities, have regularly been used as benchmarks. All of these approaches have in common that they derive importance estimates from one or several hand-designed indicators, which makes them unsupervised summarization models.
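As an illustration of such unsupervised estimates, the sketch below scores sentences by the average frequency-derived probability of their words, one of several ways to operationalize Luhn's frequency idea; the stopword filtering and the use of the average rather than the sum are choices of this sketch, not details from the cited work.

    from collections import Counter

    def frequency_importance(sentences, stopwords=frozenset()):
        """Frequency-based importance in the spirit of Luhn (1958):
        sentences containing frequent content words score higher.
        `sentences` is a list of token lists."""
        counts = Counter(w.lower() for s in sentences for w in s
                         if w.lower() not in stopwords)
        total = sum(counts.values())
        prob = {w: n / total for w, n in counts.items()}
        # average word probability per sentence
        return [sum(prob.get(w.lower(), 0.0) for w in s) / max(len(s), 1)
                for s in sentences]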

Given the large number of suggested metrics that indicate importance, supervised summarization systems have been explored as well; they use annotated data to learn how to combine different indicators into the best estimate. Early work in this direction was by Kupiec et al. (1995), who combine several features in a Bayesian binary classifier trained to decide whether a sentence should be in a summary or not. Later work modeled the problem with probabilistic models such as hidden Markov models (Conroy and O’Leary, 2001) or logistic regression (Hong and Nenkova, 2014) and with support vector machines in classification (Yang et al., 2017) and regression (Li et al., 2007) setups. Typical features include term and document frequencies, sentence lengths, sentence positions, unigrams, bigrams, parts of speech, named entities, capitalization and stopwords (Berg-Kirkpatrick et al., 2011, Hong and Nenkova, 2014, Li et al., 2016a, Yang et al., 2017).
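A minimal sketch of this supervised setup, assuming binary in-summary labels are available and using logistic regression (as in Hong and Nenkova, 2014) via scikit-learn; the three features shown stand in for the much richer feature sets of the cited work.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def features(tokens, position, word_prob):
        """Tiny illustrative feature vector for one sentence."""
        avg_freq = sum(word_prob.get(w.lower(), 0.0) for w in tokens)
        avg_freq /= max(len(tokens), 1)
        return [len(tokens), position, avg_freq]

    # X: one feature vector per sentence, y: 1 iff the sentence appears
    # in a reference summary; predicted probabilities serve as
    # importance estimates for unseen sentences.
    def train_and_score(X_train, y_train, X_test):
        model = LogisticRegression().fit(np.array(X_train), np.array(y_train))
        return model.predict_proba(np.array(X_test))[:, 1]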

Recently, neural supervised models for importance estimation have been proposed by several authors. Cheng and Lapata (2016) use a combination of convolutional neural networks (CNNs), recurrent neural networks (RNNs) and attention to classify sentences for SDS. Cao et al. propose a regression model based on recursive neural networks for MDS (Cao et al., 2015) and a CNN-based model with attention and a ranking loss for query-focused summarization (Cao et al., 2016). A two-layer RNN with a set of hand-crafted features is developed by Nallapati et al. (2017). Al-Sabahi et al. (2018) propose a similar hierarchical encoder in combination with an attention mechanism. Compared to traditional supervised models, all of these approaches seem to benefit from the powerful distributed representations that neural networks can learn (Goldberg, 2017). A common trend is the use of an attention mechanism. Apart from that, a broad range of neural architectures has been proposed and none of them has so far been identified as being consistently superior.
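None of the cited architectures is reproduced exactly here, but the following schematic PyTorch sketch shows the shared idea of a hierarchical neural extractor (word-level RNN, sentence-level RNN, per-sentence score), loosely in the spirit of Nallapati et al. (2017); all sizes and details are placeholders.

    import torch
    import torch.nn as nn

    class HierarchicalExtractor(nn.Module):
        """Schematic hierarchical sentence classifier: a word-level RNN
        encodes each sentence, a sentence-level RNN contextualizes the
        sentence vectors, and a linear layer scores each sentence for
        inclusion. The cited models differ in many details."""

        def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.word_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
            self.sent_rnn = nn.GRU(hid_dim, hid_dim, batch_first=True)
            self.score = nn.Linear(hid_dim, 1)

        def forward(self, docs):
            # docs: (batch, n_sents, n_words) tensor of word ids
            b, n, w = docs.shape
            words = self.emb(docs.view(b * n, w))       # embed all words
            _, h = self.word_rnn(words)                 # last hidden state per sentence
            sents = h.squeeze(0).view(b, n, -1)         # sentence vectors
            ctx, _ = self.sent_rnn(sents)               # contextualized sentences
            return torch.sigmoid(self.score(ctx)).squeeze(-1)  # inclusion prob.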

Sentence Selection Once importance estimates for all sentences are available, the remaining task is to select the subset 𝑆 ⊂ 𝒮(𝐷) that makes the best summary. This is usually formulated as an optimization problem maximizing the importance within the size limit:

𝑆 = arg max_{𝑆 ⊂ 𝒮(𝐷)} ∑_{𝑠∈𝑆} 𝑖(𝑠)   s.t.   ∑_{𝑠∈𝑆} 𝑙𝑒𝑛(𝑠) ≤ ℒ

In other words, one tries to include as many important sentences as possible while not exceeding the size limit. This optimization is difficult, as one has to decide whether it is better to add an important but long sentence to the summary or instead a less important but shorter sentence, leaving more space for additional sentences. To make the best decision, one has to consider the full search space of all subsequent decisions, i.e. optimize globally. The optimization problem is known as the 0-1 knapsack problem and is NP-hard (McDonald, 2007). In the case of MDS, an additional challenge is that sentences from different documents might contain the same information. Thus, only one of them should be in the summary — although all of them are estimated to be equally important. This is typically handled by adding a redundancy penalty to the objective function, leading to an optimization problem that is also NP-hard (McDonald, 2007).¹¹

¹¹ For the easier version without the redundancy term, there is a pseudo-polynomial algorithm (Kellerer et al., 2010). However, it cannot solve the extended MDS problem including redundancy (McDonald, 2007).
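The pseudo-polynomial algorithm mentioned in the footnote is the textbook dynamic program for the 0-1 knapsack problem; applied to the redundancy-free selection problem it looks as follows (a sketch, with importance scores and word lengths assumed given as lists):

    def knapsack_select(importance, lengths, budget):
        """Exact dynamic program for the redundancy-free selection
        problem: maximize total importance subject to a word budget.
        Runs in O(n * budget) time, i.e. pseudo-polynomial."""
        n = len(importance)
        best = [[0.0] * (budget + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for b in range(budget + 1):
                best[i][b] = best[i - 1][b]              # skip sentence i-1
                if lengths[i - 1] <= b:                  # or take it, if it fits
                    take = best[i - 1][b - lengths[i - 1]] + importance[i - 1]
                    best[i][b] = max(best[i][b], take)
        # backtrace to recover the selected sentence indices
        chosen, b = [], budget
        for i in range(n, 0, -1):
            if best[i][b] != best[i - 1][b]:
                chosen.append(i - 1)
                b -= lengths[i - 1]
        return sorted(chosen), best[n][budget]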

Carbonell and Goldstein (1998) proposed a greedy optimization approach called maximal marginal relevance (MMR). Sentences are added iteratively until the length limit is reached, choosing them based on their importance and their redundancy with what is already in the summary. That does not necessarily yield the optimal subset, but was shown to work well in practice. Other approaches, such as Hatzivassiloglou et al. (2001), rely on sentence clustering to first group redundant sentences together and then use only one sentence per cluster in the summary. Lin and Bilmes (2011) point out that the objective functions discussed here are submodular. For submodular objective functions, greedy optimization algorithms with provable lower bounds exist that guarantee that a greedy solution is at most a constant factor worse than the optimal solution.
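A compact sketch of MMR-style greedy selection; the similarity function, the trade-off parameter lam and the policy of skipping sentences that no longer fit are assumptions of this sketch rather than details from the original paper.

    def mmr_select(importance, lengths, budget, sim, lam=0.7):
        """Greedy maximal-marginal-relevance selection (after Carbonell
        and Goldstein, 1998): repeatedly add the sentence with the best
        trade-off between importance and redundancy with respect to the
        current summary. `sim(i, j)` is any sentence similarity."""
        selected, used = [], 0
        candidates = set(range(len(importance)))
        while candidates:
            def mmr(i):
                redundancy = max((sim(i, j) for j in selected), default=0.0)
                return lam * importance[i] - (1 - lam) * redundancy
            best = max(candidates, key=mmr)
            candidates.discard(best)
            if used + lengths[best] <= budget:   # skip sentences that don't fit
                selected.append(best)
                used += lengths[best]
        return selected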

Exact solutions can be found by formulating the problem as an integer linear program (ILP), for which a broad range of off-the-shelf solver software exists. McDonald (2007) pioneered this approach, but also showed that it is much more computationally expensive than the greedy alternatives. Gillick and Favre (2009) proposed a new objective function that computes importance and redundancy in terms of included concepts rather than sentences. This has the advantage that the importance and redundancy terms simplify to a single term, yielding ILPs that are more efficient to solve.
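A sketch of the concept-based ILP of Gillick and Favre (2009), here modeled with the open-source PuLP library; the choice of PuLP and its bundled CBC solver is an assumption of this sketch, any ILP solver works.

    import pulp

    def concept_ilp(concept_weights, occurrence, lengths, budget):
        """Maximize the total weight of covered concepts under a length
        budget (after Gillick and Favre, 2009). `occurrence[i]` is the
        set of indices of sentences containing concept i."""
        prob = pulp.LpProblem("summary", pulp.LpMaximize)
        s = pulp.LpVariable.dicts("s", range(len(lengths)), cat="Binary")
        c = pulp.LpVariable.dicts("c", range(len(concept_weights)), cat="Binary")
        prob += pulp.lpSum(w * c[i] for i, w in enumerate(concept_weights))
        prob += pulp.lpSum(lengths[j] * s[j] for j in range(len(lengths))) <= budget
        for i, sents in enumerate(occurrence):
            # a concept counts as covered iff a selected sentence contains it
            prob += c[i] <= pulp.lpSum(s[j] for j in sents)
            for j in sents:
                prob += s[j] <= c[i]
        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        return [j for j in range(len(lengths)) if s[j].value() > 0.5]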

2.3.2.2 Abstractive Summarization

Extractive summarization methods have several problems. Using just the existing sentences, they might need to include unimportant details in a summary if something more important only occurs in a sentence together with these details. Moreover, extractive summaries can lack fluency and clarity, as the selected sentences might contain unresolvable pronouns or miss important context. Ordering the sentences in the most coherent way is a difficult problem on its own (Nenkova and McKeown, 2011). Abstractive summarization methods try to circumvent these problems by going beyond the set of existing sentences.

Sentence Modification Most of the early work has focused on compressing single or fusing multiple of the original sentences. By dropping unimportant parts from the sentences, the length budget of the summary can be used more efficiently. Both rule-based (Jing, 2000, Zajic et al., 2007) and learned (Knight and Marcu, 2002, Clarke and Lapata, 2007) models were proposed to compress sentences. Sentence fusion techniques (Barzilay and McKeown, 2005, Filippova and Strube, 2008) have also been explored, since compression alone can lead to unnaturally many short sentences in a summary. Rather than using these techniques as preprocessing for extractive models, joint models for selection and compression have also been proposed (Berg-Kirkpatrick et al., 2011, Chali et al., 2017).
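As a toy illustration of the rule-based direction, the sketch below drops a few dependency subtrees that are often optional, using spaCy; the label set is an invention of this sketch, whereas systems such as Jing (2000) rely on much richer linguistic and lexical resources.

    import spacy

    nlp = spacy.load("en_core_web_sm")
    DROPPABLE = {"appos", "advcl", "prep"}   # illustrative, not from the cited work

    def compress(sentence: str) -> str:
        """Remove tokens in subtrees headed by optional modifiers."""
        doc = nlp(sentence)
        drop = set()
        for token in doc:
            if token.dep_ in DROPPABLE:
                drop.update(t.i for t in token.subtree)   # drop whole subtree
        return " ".join(t.text for t in doc if t.i not in drop)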

Traditional Generation A summarization paradigm differing more radically from extraction is the generation of completely new sentences. Such models typically first parse the input documents into a symbolic meaning representation, then summarize that representation and finally generate a realization of the summary from it. While this approach gives a system more freedom to produce a good summary, it crucially requires that the intermediate representation offers enough representational capacity and that good enough parsing and generation models are available. An early attempt in this direction was the system of Vanderwende et al. (2004) in DUC 2004. Li (Li, 2015, Li et al., 2016a) proposes an entity-based graph representation well-suited for news documents, from which they successfully generate summaries. Liu et al. (2015) used abstract meaning representation (AMR) as their intermediate representation, but left the generation step for future work. A proposition-based representation was shown to work well for educational texts, covering the full pipeline of parsing, summarization and generation (Fang and Teufel, 2016, Fang et al., 2016).

Neural Generation In recent years, the use of neural network models and large-scale training data led to improved performance in various NLP tasks, including text generation tasks such as language modeling (Mikolov, 2012) or machine translation (Cho et al., 2014, Sutskever et al., 2014). The predominant approach of generating text with word-level RNNs was first applied to summarization by Rush et al. (2015). Their framework of using RNN encoder and decoder modules with attention was quickly adopted and refined (Nallapati et al., 2016, Chopra et al., 2016, Wang and Ling, 2016). These models are able to produce much more fluent summaries than previous generative models, and thereby substantially renewed the interest in abstractive summarization. Important extensions to this architecture are copy mechanisms that allow a model to include unknown words from the input in the summaries (Gu et al., 2016, See et al., 2017) and strategies to avoid repetitions in the generated sequences (Suzuki and Nagata, 2017, See et al., 2017). The greatest limitation so far is that most work focuses on SDS from a few sentences to short headlines, as training models for bigger inputs and outputs requires huge amounts of computational resources. In addition, no large-scale training corpora are available for MDS. Very recently, strategies such as pre-summarizing documents with extractive methods (Tan et al., 2017, Liu et al., 2018) or hierarchical encoders (Cohan et al., 2018, Celikyilmaz et al., 2018, Zhang et al., 2018) have been proposed to improve scalability. These neural models are able to handle SDS examples with on average 5,000 input and 220 output words (Cohan et al., 2018) and MDS examples with 10,000 input and 100 output words (Liu et al., 2018).
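At the core of these encoder-decoder models is the attention step, sketched below in NumPy; dot-product scoring is a simplification of the learned (additive or bilinear) scoring functions used in the cited systems.

    import numpy as np

    def attention(decoder_state, encoder_states):
        """One attention step as used in RNN encoder-decoder summarizers:
        score each encoder state against the current decoder state,
        normalize with a softmax, and return the weighted context vector
        that the decoder conditions on when emitting the next word."""
        scores = encoder_states @ decoder_state     # one score per input position
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                    # softmax over input positions
        context = weights @ encoder_states          # weighted sum of encoder states
        return context, weights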