
5.2 Extraction with Predicate-Argument Analysis

5.2.2 Experiments


Predicate-argument analysis also alleviates the need for mention extraction approaches to deal with different syntactic realizations when looking for concepts and relations.

In our experiments, we use OpenIE4 (Mausam, 2016). It is one of the state-of-the-art OIE systems (Stanovsky and Dagan, 2016b). Using it, the extracted propositions for the example sentence are the same as for PropS:

(1) (Caffeine - is - a mild CNS stimulant)
(2) (Caffeine - reduces - ADHD symptoms)

With regard to the motivation of avoiding laborious definitions of large sets of rules, we note in passing that PropS as well as many OIE systems do indeed make their extractions using hand-written rules. Thus, relying on them for concept and relation mention extraction does not completely remove the need for rules; instead, it shifts the responsibility for rule creation from researchers working on the specific task of concept and relation extraction to the authors of more generally applicable predicate-argument analysis tools.


5.2.2.2 Reimplementation of Previous Work

From the previous work reviewed in Section 2.3.1, we selected and reimplemented³⁹ several representative examples of concept and relation mention extraction techniques. As the descriptions of these methods in the respective papers are often missing relevant details, we had to make a few assumptions about how exactly a method should work. To ensure full reproducibility, we document the exact implementations used for the experiments here.

Valerio et al. (2006) As an example of extraction approaches that rely on constituency parse trees, we reimplement Valerio and Leake (2006). Following them, we extract as concept mentions all NP constituents that do not contain any smaller NPs and that have at least one token tagged as a noun (tag N*) or adjective (tag J*). Their proposed approach for relation extraction is unfortunately less clear: they write that they extract a relation for “all pairs of concepts that have an indirect dependency link through a verb phrase.” Our specific implementation of that idea looks for an NP followed by a VP containing another NP, where both NPs have previously been identified as concept mentions. In that case, the tokens from the beginning of the VP until the start of the inner NP form the relation mention.
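For concreteness, the following is a minimal sketch of this concept extraction rule, operating on NLTK constituency trees; the function names and the example parse are ours, not Valerio and Leake's.

    from nltk.tree import Tree

    def innermost_nps(tree):
        """Yield NP subtrees that do not contain any smaller NP."""
        for np in tree.subtrees(lambda t: t.label() == "NP"):
            if not any(sub.label() == "NP" for sub in np.subtrees() if sub is not np):
                yield np

    def concept_mentions(tree):
        """Innermost NPs with at least one N* or J* token, following Valerio and Leake (2006)."""
        for np in innermost_nps(tree):
            if any(tag.startswith(("N", "J")) for _, tag in np.pos()):
                yield " ".join(np.leaves())

    tree = Tree.fromstring(
        "(S (NP (NN Caffeine)) (VP (VBZ reduces) (NP (NN ADHD) (NNS symptoms))))"
    )
    print(list(concept_mentions(tree)))  # ['Caffeine', 'ADHD symptoms']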

Qasim et al. (2013) We reimplement the method proposed by Qasim et al. (2013) as an example of dependency-based extraction, as they provide the most comprehensive description of their work. For concept mention extraction, we implement Algorithm 1 of their paper (Qasim et al., 2013). It extracts mentions consisting of two tokens that are connected by nn (noun compound) or amod (adjectival modifier) dependencies and single tokens that are the child of an nsubj (nominal subject) dependency. In addition, for the two collapsed dependencies conj_and (conjunction) and prep_of (preposition), both tokens with the additional conjunction or preposition in between are extracted.
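A sketch of our reading of that algorithm, over dependency triples of the form (head index, relation, child index); the input format and example are ours.

    def qasim_concepts(tokens, deps):
        """Concept mentions following our reading of Qasim et al.'s Algorithm 1."""
        mentions = set()
        for head, rel, child in deps:
            first, second = sorted((head, child))
            if rel in ("nn", "amod"):          # two-token compounds / modified nouns
                mentions.add(f"{tokens[first]} {tokens[second]}")
            elif rel == "nsubj":               # single-token subjects
                mentions.add(tokens[child])
            elif rel == "conj_and":            # collapsed conjunction
                mentions.add(f"{tokens[first]} and {tokens[second]}")
            elif rel == "prep_of":             # collapsed preposition
                mentions.add(f"{tokens[first]} of {tokens[second]}")
        return mentions

    tokens = ["Caffeine", "reduces", "ADHD", "symptoms"]
    deps = [(1, "nsubj", 0), (3, "nn", 2), (1, "dobj", 3)]
    print(qasim_concepts(tokens, deps))  # {'Caffeine', 'ADHD symptoms'}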

For relation extraction, we follow Qasim et al. (2013) and extract all verbs that are parent tokens of an nsubj, nsubjpass, xcomp, or rcmod dependency, including preceding auxiliaries (aux, auxpass). Then, we determine pairs of concept mentions that co-occur in a sentence with one of these verbs. As an additional criterion, one of the concept mentions has to act as a subject (child token of a *subj* dependency) and the other as an object (child of a *obj* or parent of a cop dependency). If all these conditions are satisfied, the verb is used as a relation mention for that pair of concept mentions.⁴⁰
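The following sketch illustrates this relation criterion under the same dependency-triple format; it omits the auxiliary handling and the VF-ICF selection described in footnote 40, and the mapping from concept mentions to token indices is our own addition.

    def qasim_relations(tokens, deps, concept_spans):
        """Relation mentions following our reading of Qasim et al. (2013).
        concept_spans maps each concept mention to the set of its token indices."""
        # 1. Candidate relation verbs: parents of nsubj/nsubjpass/xcomp/rcmod.
        verbs = {h for h, rel, _ in deps
                 if rel in ("nsubj", "nsubjpass", "xcomp", "rcmod")}
        # 2. Token positions acting as subjects and objects.
        subjects = {c for _, rel, c in deps if rel.endswith("subj")}
        objects = {c for _, rel, c in deps if rel.endswith("obj")}
        objects |= {h for h, rel, _ in deps if rel == "cop"}
        relations = []
        for verb in verbs:
            for c1, span1 in concept_spans.items():
                for c2, span2 in concept_spans.items():
                    if c1 != c2 and span1 & subjects and span2 & objects:
                        relations.append((c1, tokens[verb], c2))
        return relations

    tokens = ["Caffeine", "reduces", "ADHD", "symptoms"]
    deps = [(1, "nsubj", 0), (3, "nn", 2), (1, "dobj", 3)]
    spans = {"Caffeine": {0}, "ADHD symptoms": {2, 3}}
    print(qasim_relations(tokens, deps, spans))
    # [('Caffeine', 'reduces', 'ADHD symptoms')]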

³⁹ No previous work makes its implementation available.

⁴⁰ If several verbs co-occur with a pair of concepts and all occurrences fulfill the subject-object criteria, we select among them by frequency, using a VF-ICF metric that favors verbs occurring often but not with many other concepts. We refer to Qasim et al. (2013) for the exact formula. For our experiments, this detail is less relevant, as such a selection is rarely necessary. We also note that the full approach by Qasim et al. (2013) uses an additional complex clustering step that further restricts the number of concept pairs that are considered for relations at all. It is not part of our reimplementation.


Villalon (2012) Villalon (2012) proposed a different method to extract mentions from dependency parses that we include in our experiments as well. Given the dependency graph, the method first applies the following operations that merge or remove graph nodes:

• If two nodes A and B are connected with an amod, nn, number, or num dependency and they are directly adjacent in the text, merge them into a single node A B.

• If two nodes A and B are connected with an advmod, aux, or auxpass dependency, they are directly adjacent in the text, and their part-of-speech tags are VB*, merge them into a single node A B.

• If two nodes A and B are connected with a conj_and or prep_of dependency, they have a distance of 2 in the text, and their part-of-speech tags are NN*, merge them into a single node A and B or A of B.

• If a node has a det dependency to its parent, remove it.

After this set of transformations, all nodes that have at least one NN* part-of-speech tag are extracted as concept mentions. Note that due to the merging operations, these can be multi-token mentions. While the extraction procedure is slightly different from Qasim et al. (2013)’s approach, one can easily see that both methods use a similar set of patterns. What differs more is the relation extraction approach: here, for each pair of concept mentions in a sentence, the shortest path between them in the transformed graph is calculated, and the tokens along that path are used as a relation mention. If another concept is crossed on that path, the method tries to find the shortest path excluding that concept.
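As an illustration of the shortest-path step, here is a sketch using networkx; it assumes the merge operations have already been applied (the example tokens are pre-merged) and omits the retry that excludes crossed concepts. All names are ours.

    import networkx as nx

    def villalon_relations(tokens, deps, concepts):
        """Shortest-path relation mentions over the transformed dependency graph.
        concepts maps each concept mention to its node index in the graph."""
        g = nx.Graph()
        g.add_nodes_from(range(len(tokens)))
        for head, rel, child in deps:
            if rel == "det":
                g.remove_node(child)  # the det transformation removes the node
            else:
                g.add_edge(head, child)
        relations = []
        concept_list = sorted(concepts.items())
        for i, (c1, n1) in enumerate(concept_list):
            for c2, n2 in concept_list[i + 1:]:
                try:
                    path = nx.shortest_path(g, n1, n2)
                except nx.NetworkXNoPath:
                    continue
                # tokens along the path (excluding the endpoints) form the label
                label = " ".join(tokens[n] for n in path[1:-1])
                relations.append((c1, label, c2))
        return relations

    tokens = ["Caffeine", "reduces", "ADHD symptoms"]  # nodes after merging
    deps = [(1, "nsubj", 0), (1, "dobj", 2)]
    print(villalon_relations(tokens, deps, {"Caffeine": 0, "ADHD symptoms": 2}))
    # [('ADHD symptoms', 'reduces', 'Caffeine')]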

5.2.2.3 Concept Extraction

In the first experiment, we test the performance of concept mention extraction for the three techniques from previous work described above and the three approaches based on predicate-argument analysis introduced earlier.

Experimental Setup To evaluate the coverage of concept mention extraction, we compare the extracted concepts $C$ with the concepts $C_R$ of the reference concept maps. The two main metrics, recall and yield, are defined as:

$$\text{Recall} = \frac{|C \cap C_R|}{|C_R|} \qquad \text{Concept Yield} = \frac{|C|}{|C_R|}$$

While recall measures the fraction of covered reference concepts, concept yield indicates the amount of over-generation. The more common precision metric is less useful in this case because the reference for comparison, $C_R$, contains only the concepts included in the summary concept map, not all concept mentions in the text (which would be required for a proper precision calculation).


Approach                        Educ                  Biology                   ACL
                        Yield   Recall  Len    Yield   Recall  Len    Yield   Recall  Len
Noun Tokens            167.38    25.07  1.0    51.53    75.61  1.0    48.99    42.48  1.0
Valerio et al. (2006)  406.53    55.73  2.4    69.50    69.60  2.2    77.53    62.21  2.3
Qasim et al. (2013)    467.15    48.13  2.4    74.61    61.46  2.3    78.04    74.39  2.3
Villalon (2012)        351.85    51.20  2.5    60.75    76.37  2.3    69.10    76.09  2.3
OIE                    277.70    58.00  5.9    44.53    41.80  4.9    41.99    28.67  5.3
SRL                    481.59    46.93  6.5    66.28    50.99  4.2    77.21    44.14  5.0
PropS                  451.20    46.27  4.3    76.55    73.50  3.2    55.87    58.41  3.7
Reference                   -        -  3.2        -        -  1.2        -        -  1.9

Table 5.1: Concept extraction performance by dataset. Bold indicates best recall per group. Recall is given in percentages; yield and length (number of tokens per concept) are absolute values.

Since the set $C$ is typically much larger than $C_R$, computing a precision metric against $C_R$ leads to very low scores that are hard to interpret, in particular when combined with recall in an F-score. However, proper precision scores based on $C_R$ can be computed after the selection of a summary subset from all extractions, as we will do in Section 5.2.2.5. All metrics are averaged over all pairs per dataset.

Note that we also use a simple form of mention grouping, based on exact matches of stemmed mentions, to obtain $C$. This prevents mentions that can easily be seen to refer to the same concept from being counted as separate concepts for the yield metric. As an extraction baseline, we include a strategy that extracts only single tokens tagged as nouns.
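A sketch of how these metrics can be computed for a single document-set/reference pair, including the stem-based grouping; we use NLTK's Porter stemmer as a stand-in for the exact stemmer.

    from nltk.stem import PorterStemmer

    stem = PorterStemmer().stem

    def normalize(mention):
        """Group mentions by exact match of their stemmed tokens."""
        return " ".join(stem(tok) for tok in mention.lower().split())

    def recall_and_yield(extracted, reference):
        """Coverage metrics as defined above; both arguments are mention lists."""
        c = {normalize(m) for m in extracted}
        c_r = {normalize(m) for m in reference}
        return len(c & c_r) / len(c_r), len(c) / len(c_r)

    extracted = ["caffeine", "Caffeine", "mild CNS stimulant", "ADHD symptom"]
    reference = ["caffeine", "ADHD symptoms"]
    print(recall_and_yield(extracted, reference))  # (1.0, 1.5)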

Recall and Yield Table 5.1 reports results for the Educ, ACL, and Biology datasets. Out of the different tools used to obtain predicate-argument structures, we observe the highest recall using PropS on two datasets and OIE on the other. From previous work, Villalon’s method shows the best results on two datasets, while Valerio et al.’s method is best on Educ.

Overall, we conclude that concept extraction based on predicate-argument structures is competitive, giving slightly better (Educ) or slightly worse (Biology) results, except for the performance on the ACL dataset. With regard to yield, predicate-argument structures are even more competitive, producing fewer candidate concepts in most cases. For all approaches, the yield correlates with the size of the input documents, producing the most concepts for the multi-document inputs of the Educ dataset.

One interesting observation is that on Educ the best performing method, among previous work as well as predicate-argument structures, differs from the one on the other two datasets.


[Figure: three panels (Educ, Biology, ACL) plotting recall (0.4 to 1.0) against the threshold $k$ (1 to 4) for PropS, SRL, and OIE.]

Figure 5.1: Concept extraction recall for inclusive matches at increasing thresholds of $k$.

The reason is that the concept labels in this corpus tend to be longer (3.2 tokens vs. 1.2 and 1.9), including more complex noun phrases and also verbal phrases describing activities, while concept labels are mostly single nouns in Biology or noun compounds in ACL. This explains the generally lower recall and the better performance of approaches focusing on full noun phrases rather than nouns and noun compounds. Of course, the different style of concept mentions in Educ is to some extent a result of the fact that OIE has been used to create the corpus (see Section 4.3.2), contributing to the good performance of OIE in this experiment. However, we want to emphasize that in later steps of the corpus creation process, all concept labels have been manually verified and often revised, ensuring that they are in fact good references for how humans would label concepts.

Analyzing the results on the ACL corpus, we found that the automatic extraction of text from the PDFs produced very noisy data, causing many of the dependency parses to be of low quality due to incorrect sentence segmentation. Interestingly, while this reduced the performance of all approaches using predicate-argument structures, it did not affect the methods from previous work. We hypothesize that these approaches are more robust against such parsing errors because they extract only from local dependencies, while the other approaches globally process the full parse to derive predicate-argument structures.

Added Concepts We further compared the concepts extracted by the best approach from previous work with those produced by predicate-argument structures. To assess whether the latter identify previously uncaptured concepts, we joined both sets and compared the combined recall against the method from previous work alone. In all cases, predicate-argument structures extract at least some concepts that are not covered by previous approaches, with the best approach adding 15.20 (Educ), 6.90 (Biology), and 5.81 (ACL) points of recall. This shows that predicate-argument structures are not only competitive but can also be used to extend the coverage of previous methods.

Concept Length Finally, we looked at the length of extracted concept mentions. As Table 5.1 shows, the extractions made by previous work tend to be around 2.3 tokens long, while the arguments of predicate-argument structures are up to three times as long.


In order to assess whether missing concepts might be subsequences of these longer mentions, we define a new evaluation metric: an extracted concept $c$ and a reference concept $c_R$ match inclusively at $k$ if $c_R$ is contained in $c$ and $c$ is at most $k \cdot |c_R|$ tokens long.
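A sketch of this matching criterion; interpreting "contained in" as contiguous token containment is an assumption on our side.

    def matches_inclusively(extracted, reference, k):
        """c and c_R match inclusively at k if c_R is a contiguous token
        subsequence of c and c is at most k * |c_R| tokens long."""
        c, c_r = extracted.split(), reference.split()
        if len(c) > k * len(c_r):
            return False
        return any(c[i:i + len(c_r)] == c_r for i in range(len(c) - len(c_r) + 1))

    print(matches_inclusively("a mild CNS stimulant", "CNS stimulant", 2))  # True
    print(matches_inclusively("a mild CNS stimulant", "CNS stimulant", 1))  # False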

Figure 5.1 shows the corresponding recall when increasing $k$. For all three approaches, but especially for SRL and OIE, the recall increases dramatically when considering longer arguments. This indicates that the concept extraction performance could be further improved by learning how to reduce longer arguments to the desired parts. Corresponding techniques have been studied, e.g., by Stanovsky and Dagan (2016a) and Stanovsky et al. (2016a). But as we already mentioned in Section 3.3, the compression of arguments has to ensure that propositions are still asserted by the text, making this direction non-trivial.

5.2.2.4 Relation Extraction

In the second experiment, we perform a similar comparison between the relation extraction techniques from previous work and the ones based on predicate-argument structures.

Experimental Setup To ensure a fair comparison independent of concept extraction, all strategies use the reference concepts in this experiment. We compute recall and relation yield as our metrics, counting relations that were extracted with a correct label for the correct pair of reference concepts. To account for the fact that some approaches extract complex phrases, e.g. including prepositions or auxiliaries, while others extract only single tokens, we use a lenient matching criterion, requiring that the stemmed heads of the relation labels match. For instance, is located in and located are considered a match.
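A sketch of this lenient criterion, using spaCy to find the head of a label and NLTK's Porter stemmer; both tools are stand-ins, and whether "located" comes out as the root of the isolated phrase depends on the parser.

    import spacy
    from nltk.stem import PorterStemmer

    nlp = spacy.load("en_core_web_sm")  # any English pipeline with a parser
    stem = PorterStemmer().stem

    def head_of(label):
        """Return the syntactic root token of a relation label."""
        doc = nlp(label)
        return next(tok for tok in doc if tok.head is tok)

    def lenient_match(label_a, label_b):
        """Lenient criterion: the stemmed heads of the two labels have to match."""
        return stem(head_of(label_a).text.lower()) == stem(head_of(label_b).text.lower())

    print(lenient_match("is located in", "located"))  # True, if 'located' is the root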

Results As shown in Table 5.2, the shortest-path method of Villalon is the best performing method from previous work. However, it also creates a comparatively large set of relations. Using predicate-argument structures, we see a substantial improvement on both datasets, and again, PropS performs well on Biology and OIE on Educ. Note that on both datasets even the respective other method is on par with Villalon’s approach while producing a substantially smaller number of extracted relations.

We found that the low recall of Valerio et al.’s and Qasim et al.’s methods is due to the limited coverage of their patterns, whereas Villalon’s method is very noisy and extracts many meaningless relation mentions. Predicate-argument structures, on the other hand, benefit from their main advantage in this evaluation: since all extractions are made on the level of propositions, every concept has at least one meaningful relation to another concept.

With regard to the length of relation labels, we found a similar picture: Villalon’s method produces very long labels (4.5 and 4.4 tokens), as it simply takes all tokens along a path in the dependency graph. SRL yields the shortest labels (1.0), since predicates are restricted to one token by design, while PropS finds slightly longer ones (1.5), including auxiliaries and light-verb constructions.


Approach                      Educ                  Biology
                        Yield   Recall  Len    Yield   Recall  Len
Valerio et al. (2006)    2.70     8.32  1.8     2.02    17.75  1.8
Qasim et al. (2013)      2.24     3.30  1.7     1.43     8.08  1.4
Villalon (2012)         17.28    21.53  4.5     9.64    32.34  4.4
OIE                      6.47    25.76  2.9     3.52    31.63  2.3
SRL                     11.60    17.97  1.0     4.22    17.57  1.0
PropS                   11.62    21.20  1.5     6.48    40.95  1.5
Reference                   -        -  3.2        -        -  1.9

Table 5.2: Relation extraction performance by dataset. Bold indicates best recall per group. Recall is given in percentages; yield and length (number of tokens per relation label) are absolute values.

OIE extracts the longest labels (2.9 and 2.3), which also contain prepositions, as in was made for. Overall, considering both recall and the style of labels, we consider OIE to provide the most useful relations for concept maps.

5.2.2.5 Concept Selection

Finally, we present a third experiment on concept selection. While we found predicate-argument structures to be superior for relation extraction in terms of both recall and yield, the picture is less clear for concepts. By scoring the extracted concepts and selecting a subset of summary-worthy ones, we try to shed more light on the usefulness of certain trade-offs between recall and yield for subsequent steps in the pipeline. However, we want to emphasize that the experimental setup used here is not directly a subtask of CM-MDS: for CM-MDS, importance estimates for concepts are not used to simply select a subset of the concepts, but to select a subgraph of concepts and relations.

Experimental Setup We use the concept sets $C$ obtained from the different methods studied in the first experiment (see Section 5.2.2.3), assign a score to each concept, select the $|C_R|$ concepts with the highest score, and compare them to the reference concepts $C_R$. This variant of precision is known as R-precision:

$$\text{R-Precision} = \frac{|C_{\text{top-}k} \cap C_R|}{|C_R|} \quad \text{with } k = |C_R|$$

As the score, we use the concept’s frequency, i.e., the number of its mentions in the documents, a metric that has been previously proposed to find important concepts (Valerio and Leake, 2006).
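A sketch of the selection and R-precision computation with frequency scoring; mentions are assumed to be already grouped as described in Section 5.2.2.3, and ties in frequency are broken arbitrarily here.

    from collections import Counter

    def r_precision(mentions, reference):
        """Select the |C_R| most frequent concepts and compare them to C_R."""
        k = len(reference)
        top_k = {concept for concept, _ in Counter(mentions).most_common(k)}
        return len(top_k & set(reference)) / k

    mentions = ["caffeine"] * 3 + ["adhd symptoms"] * 2 + ["coffee"]
    print(r_precision(mentions, ["caffeine", "adhd symptoms"]))  # 1.0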