

6.3.3 Experiments


While the features introduced so far are mainly based on the content of the source documents, the lexicon features described here bring in additional external knowledge about how certain words are generally used and perceived by people. As we argued in Section 3.3.6, such information can be crucial to fully capture the human notion of importance. We use the MRC Psycholinguistic Database (Coltheart, 1981), which scores words for concreteness, familiarity, imageability, meaningfulness and age of acquisition, the LIWC dictionary, which groups words into 65 different categories, and an additional, bigger list of concreteness values for words by Brysbaert et al. (2014).
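
To illustrate how such lexicons enter the feature set, the following sketch derives concreteness features for a concept label. The lexicon excerpt and the aggregation functions (minimum, maximum, mean) are illustrative assumptions rather than the exact feature definitions used in the experiments.

```python
# Sketch: deriving lexicon-based features for a concept label.
# The lexicon values below are placeholders; in the experiments the scores
# come from the MRC database, LIWC and the Brysbaert et al. (2014) list.

# Hypothetical excerpt of a concreteness lexicon (word -> score).
CONCRETENESS = {"student": 4.6, "school": 4.8, "motivation": 1.9, "of": 1.3}

def concreteness_features(label):
    """Aggregate per-token lexicon scores into concept-level features."""
    tokens = label.lower().split()
    scores = [CONCRETENESS[t] for t in tokens if t in CONCRETENESS]
    if not scores:  # no token covered by the lexicon
        return {"min_concreteness": 0.0, "max_concreteness": 0.0,
                "avg_concreteness": 0.0}
    return {"min_concreteness": min(scores),
            "max_concreteness": max(scores),
            "avg_concreteness": sum(scores) / len(scores)}

print(concreteness_features("motivation of students"))
```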


For the SVM-based models, we use LibLINEAR for a version with linear kernels and LibSVM (Chang and Lin, 2011) for a version with radial basis function kernels61. We further include Ranking SVMs as implemented in Dlib (King, 2009).62 We also performed preliminary experiments with structured prediction, using a structured perceptron (Collins, 2002), but did not observe competitive results, presumably due to too little training data, as we can only train on 15 (graph-level) instances in that setup instead of 54k and 16k (concept-level) instances in the other settings. For classification, we subsample negative instances, as the dataset is highly skewed, and, similar to previous work on summarization (Li et al., 2016a), discretize the mostly Zipf-distributed continuous features into bins. We use five equal-frequency bins per feature in the regression setup and rely on Weka’s implementation of the minimum description length principle to find optimal bins in the classification setup. All models use L2-regularization. During grid search, we tune the regularization constant of each model, the SVM’s kernel parameter, the 𝜖 of its regression variant and the amount of subsampling.63 All source documents are preprocessed with Stanford CoreNLP64 (Manning et al., 2014) to obtain the part-of-speech, named entity and dependency annotations required for some of the features. Graph metrics were computed with standard algorithms as implemented in networkx.65
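
The discretization and subsampling steps can be sketched as follows. This is a simplified illustration with pandas and made-up column names; the actual pipeline relies on Weka’s MDL-based discretization in the classification setup, and the subsampling ratio is one of the tuned grid-search parameters (see footnote 63).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def equal_frequency_bins(values, n_bins=5):
    """Discretize a (roughly Zipf-distributed) feature into equal-frequency bins."""
    # pd.qcut places roughly the same number of instances into each bin;
    # duplicate bin edges are merged for heavily skewed features.
    return pd.qcut(values, q=n_bins, labels=False, duplicates="drop")

def subsample_negatives(df, label_col="is_summary_concept", keep_ratio=0.1):
    """Keep all positive instances and a random fraction of the negatives."""
    positives = df[df[label_col] == 1]
    negatives = df[df[label_col] == 0].sample(frac=keep_ratio, random_state=0)
    return pd.concat([positives, negatives]).sample(frac=1.0, random_state=0)

# Toy data: one row per candidate concept.
df = pd.DataFrame({
    "document_frequency": rng.zipf(2.0, size=1000).astype(float),
    "is_summary_concept": rng.random(1000) < 0.05,
})
df["document_frequency_bin"] = equal_frequency_bins(df["document_frequency"])
train = subsample_negatives(df, keep_ratio=0.1)
print(train["is_summary_concept"].mean())  # positive rate after subsampling
```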

Features Before looking at importance estimation results, we analyze the usefulness of different features. In Table 6.5, we show the features with the highest Pearson correlation, both positive and negative, per group, measured against the crowdsourced importance scores that we use to train the regression models. The strongest features correlating with importance are, in line with previous work on summarization, from the frequency group. But the graph-based features, in particular core numbers, degrees and centrality measures, are almost as good. The HARD model and its components show low correlations (< 0.1) on our dataset. Note that position features are less useful than suggested by Table 6.5, as only position spread, which correlates strongly with the number of mentions of a concept and thus with frequency, has a high correlation in that group. Another useful feature seems to be the concreteness values obtained from external lexicons.
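
As an illustration of the graph-based features, the following sketch computes core numbers, degrees and one centrality measure with networkx over a toy concept graph. The graph and its edges are invented here for illustration; the full feature group in the experiments comprises 17 such graph measures.

```python
import networkx as nx

# Toy concept graph; in the experiments the nodes are extracted concepts
# and the edges come from the relations identified between them.
G = nx.Graph()
G.add_edges_from([
    ("children", "parents"), ("children", "ADHD medication"),
    ("parents", "ADHD medication"), ("ADHD medication", "side effects"),
    ("side effects", "long-term studies"),
])

core_numbers = nx.core_number(G)   # k-core membership per concept
degrees = dict(G.degree())         # raw degree per concept
centrality = nx.pagerank(G)        # one of several centrality measures

for concept in G.nodes():
    print(concept, core_numbers[concept], degrees[concept],
          round(centrality[concept], 3))
```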

With regard to identifying unimportant concepts, the highest negative correlation with importance can be observed for the length of a concept’s label. This is in line with our intuition, since concepts with very long labels tend to be very specific and detailed and are therefore less useful for a summary. The number of stopwords and the presence of certain parts-of-speech, e.g. determiners, also correlate with unimportance.

61 Substantially higher cross-validation runtimes did not allow us to use RBF-SVMs in the regression setup.

62 WEKA version 3.8.1, LibLINEAR version 2.20, LibSVM version 1.0.10 and Dlib version 18.17.100.

63 Ranges used for grid search: subsampling to 2%, 5%, 10%, 30%, 50% or 100% (no sampling) of full size; regularization constants of 1E-2, 1E-1, 1, 1E+1, 3E+1, 1E+2, 3E+2, 1E+3; 𝛾-values for radial basis function kernels of 1E-3, 3E-3, 1E-2; 𝜖-values for 𝜖-SVR of 1E-4 to 1E-0.

64 Version 3.6.0.

65 networkx version 1.11 (https://networkx.github.io/).


Group            #    Most Positive                    Most Negative
                      Feature                 r        Feature                 r
Frequency        9    document frequency      0.258    max. token IDF         -0.054
Graph           17    core number             0.246    lower node score       -0.088
Position         5    position spread         0.218    min. position          -0.098
Lexicon         85    min. concreteness       0.135    has function word      -0.071
Length          10    token spread            0.098    number of tokens       -0.121
Part-of-Speech  78    head has NNS tag        0.064    label has DET token    -0.085
Topic            3    jaccard similarity      0.059    word2vec similarity    -0.021
Extraction       4    is temporal argument    0.019    is simple argument     -0.016
Annotations     13    head is person NE       0.014    number of stopwords    -0.097

Table 6.5: Pearson correlation between features and true importance scores. Shown are the two features with highest positive and negative correlation per group, using the regression dataset.

We believe that this is due to the fact that there are some concepts with noisier labels in the dataset among the less important ones. Using these features, we can lower the chance that such concepts are included in the summary concept map.
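
The per-feature analysis behind Table 6.5 boils down to correlating each feature column with the crowdsourced importance scores. The following is a minimal sketch, assuming one row per concept; the feature names and data are placeholders, not the actual feature matrix.

```python
import numpy as np
from scipy.stats import pearsonr

def top_correlations(features, importance, k=2):
    """Return the k most positively and most negatively correlated features."""
    corrs = {name: pearsonr(values, importance)[0]
             for name, values in features.items()}
    ranked = sorted(corrs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k], ranked[-k:]

rng = np.random.default_rng(0)
importance = rng.random(200)                  # crowdsourced importance scores
features = {                                  # placeholder feature columns
    "document_frequency": importance * 2 + rng.normal(0, 1, 200),
    "number_of_tokens": -importance + rng.normal(0, 1, 200),
    "min_position": rng.random(200),
}
top, bottom = top_correlations(features, importance)
print("most positive:", top)
print("most negative:", bottom)
```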

Modeling Approaches Table 6.6 shows the concept selection performance using different classification, regression and ranking models. The list-wise evaluation compares, as described before, the 25 concepts with the highest estimated importance against the concepts of the reference summary concept map. Unfortunately, as the differences between the modeling approaches are marginal and not significant, we cannot draw any conclusions regarding the superiority of any approach from these results. We also performed a second, graph-wise evaluation in which the importance estimates for all concepts are used to select the best summary subgraph with 25 concepts, using the methods proposed in Section 6.4, and then compared the concepts in that map against the ones in the reference map. Such an evaluation can potentially show different results, as importance estimation functions learned with a classification or ranking loss, as opposed to regression, do not necessarily produce scores whose differences are meaningful and comparable, which can be problematic when these scores are used to calculate aggregated measures during selection. However, the results are similar and the differences between approaches are even smaller.
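
A sketch of the list-wise evaluation described above: rank the candidate concepts by their estimated importance, take the top k (k = 25 in our setting) and compute the fraction that matches a reference concept. Label matching is simplified here to lowercased string equality, and the concept labels and scores are invented.

```python
def list_wise_precision(scored_concepts, reference_concepts, k=25):
    """Precision of the k highest-scored concepts against the reference map."""
    ranked = sorted(scored_concepts, key=lambda pair: pair[1], reverse=True)
    selected = {label.lower() for label, _ in ranked[:k]}
    reference = {label.lower() for label in reference_concepts}
    return len(selected & reference) / k

# Toy example: (concept label, estimated importance score).
scored = [("parents", 0.91), ("children", 0.88), ("ADHD medication", 0.42),
          ("school district", 0.30), ("long-term studies", 0.12)]
reference = ["children", "parents", "ADHD medication", "side effects"]
print(list_wise_precision(scored, reference, k=5))
```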

Overall Performance and Errors While the primary purpose of this experiment is to compare features and modeling approaches, it also gives insights into how difficult the importance estimation task is in general.


Model                    List                     Graph
                         R-Precision  p-value     R-Precision  p-value
Classification
  Logistic Regression    10.93        .0967       11.73        .1826
  Linear SVM             11.47        .2100       11.47        .1284
  Kernel SVM             10.13        .0157       10.40        .0381
Regression
  Linear Regression      13.07                    13.33
  Linear SVM             11.73        .0703       12.27        .2422
Ranking
  Ranking SVM            12.27        .3877       12.80        .5176
Upper Bound              56.00                    46.67

Table 6.6: List- and graph-wise concept selection performance with different importance estimation models evaluated with leave-one-out cross-validation. The upper bound is determined by the recall of concept extraction. P-values are computed with a permutation test.
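
The p-values in Table 6.6 are computed with a permutation test, presumably comparing each model against the best-scoring one, which has no p-value in the table. Since the exact test variant is not spelled out here, the following is only an illustrative sketch of a paired sign-flip permutation test over per-topic R-precision scores, with synthetic values standing in for the real per-topic results.

```python
import numpy as np

def paired_permutation_test(scores_a, scores_b, n_permutations=10000, seed=0):
    """Two-sided paired permutation test on the mean score difference."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    count = 0
    for _ in range(n_permutations):
        signs = rng.choice([-1.0, 1.0], size=diffs.size)  # randomly swap pairs
        if abs((signs * diffs).mean()) >= observed:
            count += 1
    return count / n_permutations

# Toy per-topic R-precision values for two models (15 topics).
rng = np.random.default_rng(1)
best_model = rng.normal(0.13, 0.05, 15)
other_model = rng.normal(0.11, 0.05, 15)
print(paired_permutation_test(best_model, other_model))
```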

With precision scores between 10% and 13%, only 3 out of the 25 selected concepts match a concept in the reference map, which leaves a lot of room for improvement. We expect the achievable performance for this subtask to be in the range of 30% to 40%, since 47% is the hard upper bound due to extraction performance (see Table 6.6) and the human agreement on importance annotation is around 0.8 (see Section 4.3.2).66 Following that reasoning, improvements of 20 to 30 percentage points should still be possible on this subtask.

Looking at the selections made by the models, we found that a large fraction of the correct selections (which are only 3 per topic on average) consists of a small number of concepts that occur in several reference concept maps. For instance, maps for many topics have the concepts parents and children, which the model also consistently selects. Important concepts that the model fails to select are in particular those that occur in only one topic. During the manual inspection of the selections, we also observed that some selected concepts do not match any reference concept exactly but are very close, e.g. ADHD remedies might have been selected while the reference has ADHD medication. While this is not rewarded by the exact matching used to compute precision here, it is considered by the ROUGE and METEOR metrics used in task-level evaluations. In that sense, the precision and upper bound reported here characterize importance estimation performance rather pessimistically.

66 Note that the human agreement for the importance annotation was measured by correlation. In contrast to percentage agreements, which are often directly compared to F-scores to define upper bounds, there is no direct connection between correlation and the precision metric used in this experiment.


Conclusion Based on our experiments, we observe that there is no significant difference between modeling importance estimation as regression, classification or ranking on our data. While one could probably find significant differences using larger datasets, the fact that they can only be observed — if at all — with more data shows that they are presumably rather small and will thus have only a small impact on the overall task-level performance.

With regard to features, the graph-based measures that we included in addition to traditional summarization features seem to be particularly useful for our task. Nevertheless, performance on importance estimation is still limited and should be further improved. We suspect that adding more external knowledge on what people generally consider to be important can be particularly helpful. In the task-level experiments in Section 6.5, we also assess how well the supervised model with the current features performs against the unsupervised methods suggested in previous work.