
6.3 Importance Estimation

6.3.2 Features

For importance scoring, we explore a broad range of features that are commonly used for summarization. We describe the groups of features and their motivation in the following:

6.3.2.1 Common Summarization Features

Frequency Frequencies, starting with the work of Luhn (1958), are the most popular features to determine important elements. To estimate the importance of concepts, we evaluate the following frequency features:

• absolute frequency of a concept, i.e. the number of its mentions

• relative frequency of a concept with regard to the total document set

• the fraction of documents containing a concept
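Assuming per-document mention counts for a concept are available, the three frequency features can be sketched as follows (function and argument names are illustrative, not the thesis implementation):

```python
def frequency_features(mentions_per_doc, total_mentions):
    """Frequency features for one concept.

    mentions_per_doc: number of mentions of the concept in each document
    of the topic's document set.
    total_mentions: total number of concept mentions in the document set.
    """
    abs_freq = sum(mentions_per_doc)                       # absolute frequency
    rel_freq = abs_freq / total_mentions                   # relative to the document set
    doc_frac = sum(1 for n in mentions_per_doc if n > 0) / len(mentions_per_doc)
    return {"abs_freq": abs_freq, "rel_freq": rel_freq, "doc_frac": doc_frac}
```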

In addition, we use inverse document frequencies as a measure of how distinctive terms are, as done in TF-IDF. However, instead of relying only on the documents given for a topic, we use term counts computed on a much larger background corpus,⁵⁸ which were found to correlate well with document frequencies (Klein and Nelson, 2009). Using them, we obtain the following additional features:

• minimum, maximum and average of the log-scaled inverse document frequency of the terms in a concept’s label

• CF-IDF, i.e. the relative concept frequency as defined above multiplied with either the min, max or average variant of inverse document frequency defined before

Position The second most popular feature for summarization, introduced by Edmundson (1969), is the position of an element in the source documents. We compute the relative position of a concept mention 𝑚 in a document 𝑑 independent of document length as 𝑟𝑒𝑙_𝑝𝑜𝑠(𝑚) = 1 − 𝑝𝑜𝑠(𝑚)/|𝑑|, such that a mention at the beginning of a document is close to 1 and one at the end close to 0. Based on that, our position features are:

• minimum, maximum and average relative positions among all mentions

• relative position of the mention used as the label

• position spread, i.e. the difference between the first and the last position
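Given the definition 𝑟𝑒𝑙_𝑝𝑜𝑠(𝑚) = 1 − 𝑝𝑜𝑠(𝑚)/|𝑑|, the position features can be sketched as follows (names are illustrative):

```python
def position_features(mention_positions, doc_lengths, label_idx=0):
    """rel_pos(m) = 1 - pos(m)/|d|, aggregated over all mentions of a concept.

    mention_positions: token offset pos(m) of each mention,
    doc_lengths: length |d| of the document each mention occurs in,
    label_idx: index of the mention whose text was chosen as the label.
    """
    rel = [1 - p / d for p, d in zip(mention_positions, doc_lengths)]
    return {
        "min_pos": min(rel),
        "max_pos": max(rel),
        "avg_pos": sum(rel) / len(rel),
        "label_pos": rel[label_idx],          # mention used as the label
        "pos_spread": max(rel) - min(rel),    # first vs. last position
    }
```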

58 We use the Google Web 1T 5-gram corpus, available through LDC at https://catalog.ldc.upenn.edu/products/LDC2006T13, which provides n-gram counts computed on roughly one trillion tokens of web text.

Chapter 6. Pipeline-based Approaches

Length The length of a concept can also be indicative of its importance and has been used for instance by Li et al. (2016a). In addition, length can be an indicator of concept extraction quality in the case of overly long mentions or labels, which we would not want to be part of the summary. Similar to the position features, we use the minimum, maximum and average length among all mentions, the length of the selected label and the spread between minimum and maximum, all computed on both a token and a character level.
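The token- and character-level length features can be sketched as follows (whitespace tokenization is a simplification; the thesis pipeline uses proper tokenization):

```python
def length_features(mentions, label):
    """Min/max/avg/label/spread lengths on token and character level."""
    def feats(lengths, label_len, prefix):
        return {
            prefix + "_min": min(lengths),
            prefix + "_max": max(lengths),
            prefix + "_avg": sum(lengths) / len(lengths),
            prefix + "_label": label_len,
            prefix + "_spread": max(lengths) - min(lengths),
        }
    token = feats([len(m.split()) for m in mentions], len(label.split()), "tok")
    char = feats([len(m) for m in mentions], len(label), "chr")
    return {**token, **char}
```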

Part-of-Speech Similar to length, the presence of different part-of-speech categories can be indicative of importance and quality. In MDS, part-of-speech features are used regularly, for instance by Li (2015) and Hong and Nenkova (2014). We use two binary features for each part-of-speech category in the Penn Treebank tagset that indicate whether a token of that category occurs in the concept’s label and whether such a token is the label’s head (i.e. the root node in a dependency parse of the label).
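Given a POS-tagged label and the index of its head token, the two binary features per category look as follows (a tiny tagset stands in for the full Penn Treebank tagset):

```python
def pos_features(tagged_label, head_index, tagset=("NN", "JJ", "VB")):
    """Two binary features per POS tag: does the tag occur in the label,
    and does it tag the label's head token?

    tagged_label: list of (token, tag) pairs for the concept label.
    """
    tags = [tag for _, tag in tagged_label]
    feats = {}
    for tag in tagset:
        feats["has_" + tag] = tag in tags
        feats["head_" + tag] = tags[head_index] == tag
    return feats
```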

Annotations Similar to previous work on summarization, such as Berg-Kirkpatrick et al. (2011), Li et al. (2013), Hong and Nenkova (2014) and Li et al. (2016a), we include additional features based on automatic linguistic annotations of the concept label:

• whether named entities of type person, organization or location are present in the concept’s label or in its head token

• the depth of the concept label’s head token in the dependency parse of the sentence it was extracted from in the source documents

• whether all or some of the characters in the concept label are uppercase

• the absolute and relative number of stopwords in the concept label
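The casing and stopword features from this list can be sketched directly (the small stopword set is a stand-in for a real list):

```python
STOPWORDS = {"the", "of", "a", "an", "in"}  # stand-in for a real stopword list

def annotation_features(label):
    """Casing and stopword features of a concept label."""
    tokens = label.split()
    n_stop = sum(1 for t in tokens if t.lower() in STOPWORDS)
    return {
        "all_upper": label.isupper(),                    # all characters uppercase
        "some_upper": any(c.isupper() for c in label),   # some characters uppercase
        "abs_stop": n_stop,                              # absolute stopword count
        "rel_stop": n_stop / len(tokens),                # relative stopword count
    }
```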

Topic Relatedness As some concepts can be prominent in a document cluster but unrelated to its topic, we introduce additional features capturing the relatedness between a concept and the topic. Note that all corpora created for CM-MDS in Chapter 4 have short descriptions of the topic of each document set, such as “alternative ADHD treatments” or “student loans without credit history”.⁵⁹ We compute three features that measure the semantic similarity between a concept’s label and the topic description, using the Word2Vec, WordNet Rus-Resnik and Jaccard measures introduced in Section 6.2.1.
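Of the three measures, the Jaccard one is simple enough to sketch here as token-level overlap (lower-casing and whitespace tokenization are simplifying assumptions):

```python
def jaccard_relatedness(label, topic):
    """Jaccard coefficient between the token sets of a concept label
    and the topic description."""
    a = set(label.lower().split())
    b = set(topic.lower().split())
    return len(a & b) / len(a | b)
```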

6.3.2.2 Extended Features

In addition to the set of features described before, which have been commonly used in many summarization systems, we also explore a few less common features as well as features specifically designed for our concept map task.

59 Despite having a topic description, our corpora do not constitute a classic query-focused MDS scenario, since there are no documents in the document sets that are not related to the topic.


Graph TextRank (Mihalcea and Tarau, 2004) and LexRank (Erkan and Radev, 2004), which are popular MDS approaches, build a graph representing the relations between sentences or words and derive an importance metric from that structure. In our case, we already have a graph available, formed by the grouped concepts 𝐶 and the relations 𝑅 between them. Therefore, we can easily use its structure to derive additional features for concepts, such as how centrally a concept is positioned in the graph. Interestingly, the abstractive MDS systems that build graphs as intermediate representations (Liu et al., 2015, Li, 2015, Li et al., 2016a) make little or no use of such features.

Given the graph 𝐺 = (𝐶, 𝑅), we compute the following features for a concept 𝑐 ∈ 𝐶:

• in- and out-degree of the node 𝑐

• fraction of nodes to which 𝑐 is connected

• PageRank (Page et al., 1999), the centrality measure used in TextRank and LexRank

• betweenness, closeness, eigenvector and Katz centrality (Koschützki et al., 2005), alternative standard graph metrics to characterize the centrality of a node in a graph

As we mentioned in Section 2.3.1.5, Reichherzer and Leake (2006) evaluated models specifically designed for the structural analysis of concept maps. The first, HARD, is a linear combination of upper and lower node scores proposed by Cañas and Leake (2001), characterizing a concept’s distance to the main concept of a map, and hub and authority scores as provided by Hyperlink-Induced Topic Search (HITS) (Kleinberg, 1999), an alternative link analysis algorithm to PageRank proposed at the same time. The second measure, connectivity root-distance (CRD), combines a concept’s in- and out-degree with its distance to the main concept. We include HARD and CRD as well as the underlying authority, hub, lower and upper node scores in our feature set.
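The basic degree, connectivity and PageRank features can be computed directly on the concept graph. The following is a plain power-iteration sketch under simplifying assumptions (uniform handling of dangling nodes), not the exact library implementation used in the thesis:

```python
def graph_features(nodes, edges, damping=0.85, iters=50):
    """Degree, connectivity and PageRank features for each concept node.

    nodes: list of concept identifiers; edges: directed (source, target) pairs.
    """
    in_deg = {v: 0 for v in nodes}
    out_deg = {v: 0 for v in nodes}
    succ = {v: [] for v in nodes}
    nbrs = {v: set() for v in nodes}
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
        succ[u].append(v)
        nbrs[u].add(v)
        nbrs[v].add(u)
    n = len(nodes)
    # fraction of other nodes a concept is connected to (either direction)
    frac = {v: len(nbrs[v]) / (n - 1) for v in nodes}
    # PageRank by power iteration; dangling mass is spread uniformly
    pr = {v: 1 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1 - damping) / n for v in nodes}
        for u in nodes:
            targets = succ[u] if succ[u] else nodes
            share = damping * pr[u] / len(targets)
            for v in targets:
                nxt[v] += share
        pr = nxt
    return in_deg, out_deg, frac, pr
```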

Finally, we follow the work of Tixier et al. (2016), who show that graph degeneracy is a useful feature to identify keyphrases. We use the core number and core rank as proposed by them, two metrics based on the membership of a node in a k-core of the graph, i.e. the maximal subgraph in which all nodes have a degree of at least k.
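As a sketch of the degeneracy idea, the core number of every node can be obtained by iteratively peeling off the node of smallest degree; the core rank of a node then sums the core numbers of its neighbours (an undirected view of the graph is assumed):

```python
def core_numbers(adj):
    """k-core decomposition by iterative peeling.

    adj: undirected adjacency lists, e.g. {"a": ["b"], "b": ["a"]}.
    """
    deg = {v: len(ns) for v, ns in adj.items()}
    core = {}
    remaining = set(adj)
    k = 0
    while remaining:
        v = min(remaining, key=lambda u: deg[u])  # peel smallest-degree node
        k = max(k, deg[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                deg[u] -= 1
    return core

def core_rank(adj, core):
    """Core-rank-style score: sum of the core numbers of a node's neighbours."""
    return {v: sum(core[u] for u in adj[v]) for v in adj}
```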

Extraction Since the concept and relation mentions are extracted from the source documents in preceding steps, we also rely on features obtained from the extraction process. In the pipeline described in Section 6.5, the extraction is done using OIE. We include the confidence with which an extraction was made as well as the type of argument a concept mention is part of (simple, spatial or temporal) as additional features. However, intuitively, they offer little signal for importance but are rather indicative of extraction quality.

Word Categories As suggested in recent work by Yang et al. (2017), dictionary-based features that capture general properties of words, such as concreteness, familiarity or imagery, can also be helpful for summarization. In particular, whereas all other features discussed so far are mainly based on the content of the source documents, these features bring in additional external knowledge about how certain words are generally used and perceived by people. As we argued in Section 3.3.6, such information can be crucial to fully capture the human notion of importance. We use the MRC Psycholinguistic Database (Coltheart, 1981), which scores words for concreteness, familiarity, imageability, meaningfulness and age of acquisition, the LIWC dictionary, which groups words into 65 different categories, and an additional, bigger list of concreteness values for words by Brysbaert et al. (2014).
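Such dictionary features reduce to simple lookups with aggregation over the label’s tokens. A minimal sketch, where the tiny score table and the neutral default for unknown words are illustrative assumptions rather than the real MRC or Brysbaert norms:

```python
# Tiny stand-in for a concreteness dictionary; real norm scores differ.
CONCRETENESS = {"apple": 5.0, "idea": 1.6, "loan": 3.7}

def concreteness_feature(label, default=3.0):
    """Average dictionary score over the tokens of a concept label,
    falling back to a neutral default for out-of-vocabulary words."""
    scores = [CONCRETENESS.get(t.lower(), default) for t in label.split()]
    return sum(scores) / len(scores)
```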