6.2 Concept Mention Grouping

Partitioning                      Optimization   Runtime         Space
Transitive Closure                no explicit    𝒪(|𝑀|²)         𝒪(|𝑀|)
Greedy Search                     approximate    𝒪(|𝑀|⁴)         𝒪(|𝑀|)
Beam Search                       approximate    𝒪(𝑑𝑘𝑏|𝑀|²)      𝒪(𝑑𝑘 + |𝑀|)
ILP w/ or w/o Column Generation   exact          polynomial      –

Table 6.1: Comparison of mention partitioning algorithms with regard to optimization behavior and time and space complexity. For ILPs, runtime and space depend on the solver's implementation.

Since the beam search can still be prohibitively expensive, we propose a second, greedy search algorithm, shown in Algorithm 2. It neither checks all possible neighbors to find the best pair to remove nor keeps a beam of 𝑘-best solutions. Instead, starting from the initial solution, it iterates over all pairs in 𝑃𝑜𝑠 only once, continuing with the revised solution whenever removing a pair is beneficial. It thus explores only a single path in the search tree. Further, it computes a transitive reduction of 𝑃𝑜𝑠 at the beginning and only considers neighbors derived from these pairs, which avoids removing edges that do not change the transitive closure. However, since multiple transitive reductions of a relation exist, the arbitrary choice of one of them also influences the explored search space. Algorithm 2 has a worst-case runtime complexity of 𝒪(|𝑀|⁴) and needs 𝒪(|𝑀|) additional space for the solution.
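Since Algorithm 2 is not reproduced here, the following Python sketch illustrates the idea under a few assumptions of ours: `pos` is the set of positively classified mention pairs, `objective` is the partitioning score to be maximized, and the transitive reduction is realized as a spanning forest, which is one valid (and arbitrary) reduction of a symmetric relation. All function names are ours, not the thesis's.

```python
def components(mentions, pairs):
    """Clusters induced by the transitive closure of `pairs` (union-find)."""
    parent = {m: m for m in mentions}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving
            m = parent[m]
        return m

    for a, b in pairs:
        parent[find(a)] = find(b)
    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), set()).add(m)
    return list(clusters.values())


def spanning_forest(mentions, pairs):
    """One transitive reduction of the symmetric relation `pairs`:
    keep only edges that connect previously separate components."""
    parent = {m: m for m in mentions}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    forest = set()
    for a, b in sorted(pairs):  # fixed but arbitrary edge order
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            forest.add((a, b))
    return forest


def greedy_partition(mentions, pos, objective):
    """Single-pass greedy search: start from the transitive-closure
    solution, try removing each reduction pair once, and keep a removal
    whenever it improves the objective."""
    pairs = spanning_forest(mentions, pos)
    best = components(mentions, pairs)
    best_score = objective(best)
    for pair in sorted(pairs):
        remaining = pairs - {pair}
        candidate = components(mentions, remaining)
        score = objective(candidate)
        if score > best_score:  # continue with the revised solution
            pairs, best, best_score = remaining, candidate, score
    return best
```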

Table 6.1 compares the different partitioning algorithms discussed in the preceding sections. From top to bottom, the quality of the solution with regard to the objective function improves, while the runtime and space requirements increase. Note that the stated complexities are worst-case; in practice, the set 𝑃𝑜𝑠 tends to be much smaller than 𝑀². We present an empirical analysis of this behavior in the next section.

6.2.3 Experiments

Features               Pr     Re     F1     AUC    sec/100k
Exact Match            98.6   22.8   37.1   29.3       0.11
Lemma Match            98.0   24.5   39.2   30.6       0.82
Jaccard Coefficient    88.3   27.8   42.2   49.2       0.88
Edit Distance          83.4   32.1   46.4   53.0       0.46
Word2Vec               81.8   34.7   48.8   57.5       8.23
Rus-LSA                81.2   36.8   50.6   51.6       2.16
WN Rus-Resnik          82.1   36.9   50.9   54.2       5.40
WN Corley-Mihalcea     85.9   36.5   51.2   53.2      21.79
ADW                    84.2   42.0   56.0   59.0    2057.03

All                    81.8   45.2   58.2   63.6    2096.87
All w/o ADW + Corley   79.3   44.7   57.1   62.6      18.05

Table 6.2: Classification performance (precision, recall, F1, AUC) and feature computation time (in seconds per 100,000 pairs) for pairwise mention classification. The upper part reports results for models trained on just a single feature; the lower part for feature combinations.

Every pair of mentions that belongs to the same concept receives a positive label; all other pairs are negative examples. The resulting dataset consists of 17,500 mention pairs, of which 1,218 (7%) are coreferent and 16,283 (93%) are not.

We train the log-linear classifier using Weka's (Hall et al., 2009) implementation of logistic regression⁵⁰ and report metrics obtained with stratified 10-fold cross-validation on the training data. As features, we use the similarity measures discussed in Section 6.2.1.
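The thesis uses Weka; as a rough equivalent, the following scikit-learn sketch (ours) reproduces the evaluation protocol. The feature matrix here is a random placeholder, and C=1e8 approximates Weka's near-zero ridge of 10⁻⁸, since scikit-learn's C is the inverse regularization strength.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# X: one row per mention pair, one column per similarity feature.
# y: 1 if the pair is coreferent, 0 otherwise. Both are placeholders here.
rng = np.random.default_rng(0)
X = rng.random((17_500, 9))
y = (rng.random(17_500) < 0.07).astype(int)     # ~7% positives, as in the data

clf = LogisticRegression(C=1e8, max_iter=1000)  # large C = weak regularization
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(clf, X, y, cv=cv,
                        scoring=["precision", "recall", "f1", "average_precision"])
for name, values in scores.items():
    if name.startswith("test_"):
        print(name, round(values.mean(), 3))
```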

We use Pilehvar et al. (2013)'s original implementation⁵¹ to compute ADW and the SEMILAR library⁵² (Rus et al., 2013) to compute the other WordNet-based similarities, both relying on WordNet 3.0. We also use SEMILAR to compute the LSA-based similarity with an LSA model provided with the library. As word embeddings, we use 300-dimensional Word2Vec embeddings trained on Google News⁵³ as well as 50-, 100-, 200- and 300-dimensional GloVe embeddings trained on Wikipedia and Gigaword 5⁵⁴.
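How the embedding feature lifts word vectors to multi-word mentions is not spelled out above; one plausible instantiation (ours, using gensim and the Google News vectors from footnote 53, file name illustrative) averages token vectors and takes their cosine similarity:

```python
import numpy as np
from gensim.models import KeyedVectors

# Path is illustrative; see footnote 53 for the download location.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def mention_vector(mention):
    """Average the embeddings of all in-vocabulary tokens of a mention."""
    tokens = [t for t in mention.lower().split() if t in vectors]
    if not tokens:
        return None
    return np.mean([vectors[t] for t in tokens], axis=0)

def embedding_similarity(m1, m2):
    """Cosine similarity between averaged mention vectors; 0.0 if a
    mention has no in-vocabulary token."""
    v1, v2 = mention_vector(m1), mention_vector(m2)
    if v1 is None or v2 is None:
        return 0.0
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```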

Table 6.2 shows the classification performance observed in this experiment in terms of precision, recall and F1-score on positive classifications as well as the area under the precision-recall curve. In the upper part, we report results for models that were trained on only a single feature to understand the contribution of each similarity measure.

⁵⁰ We use the default regularization constant of 10⁻⁸, as we have not seen an effect of using more or less regularization, presumably due to the small number of features.

⁵¹ Commit 3298b6b, available at https://github.com/pilehvar/ADW.

⁵² Version 1.0, available at http://deeptutor2.memphis.edu/Semilar-Web.

⁵³ Available at https://code.google.com/archive/p/word2vec/.

⁵⁴ Available at https://nlp.stanford.edu/projects/glove/.


Dimension          Topic 1001    Topic 1042
Mentions                1,135        16,822
Unique Mentions           858        10,317
Pairs                 367,653    53,215,086
Positive Pairs            699        22,074

Table 6.3: Size of the partitioning problem on the smallest and biggest document sets of the training part of Educ. Unique mentions are the mentions remaining after applying exact and lemma matching.

Exact and lemma matches show, as one would expect, high precision but low recall. All other measures add recall by trading off some precision, but generally improve the overall performance in terms of F1-score and AUC. The WordNet-based similarities work surprisingly well on our dataset and outperform the more recent word embedding approach. The ADW measure outperforms all others by a substantial margin. Note that we do not include results for GloVe embeddings, which are outperformed by Word2Vec, nor for the alternative WordNet-based measures described in Section 6.2.1, which are outperformed by Resnik (1995)'s approach.

The last column of Table 6.2 reports the time needed to compute each similarity measure for 100,000 mention pairs. In general, better-performing features tend to be more expensive to compute. As the number of mention pairs that have to be classified grows quadratically with the length of the input documents, long computation times can severely limit the applicability of a classifier to our task. Therefore, the ADW similarity, despite showing the best performance, is of limited practical use, as its computation time is orders of magnitude larger than that of the other measures. Corley and Mihalcea (2005)'s similarity, while much cheaper than ADW, still adds a substantial amount of computation. A good trade-off between classification performance and feature computation time is a model combining all features except these two expensive ones: as the lower part of Table 6.2 shows, excluding them costs just one point of F1-score and AUC.
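To make these numbers concrete, the following back-of-envelope calculation (ours, combining the sec/100k column of Table 6.2 with the pair counts of Table 6.3) converts per-pair feature cost into wall-clock estimates for the largest topic:

```python
# Rough wall-clock estimates for feature computation on topic 1042
# (53,215,086 pairs, Table 6.3) from the sec/100k column of Table 6.2.
PAIRS = 53_215_086

def hours(sec_per_100k):
    return sec_per_100k * PAIRS / 100_000 / 3600

print(f"ADW:                {hours(2057.03):7.1f} h")  # ~304 h, i.e. ~12.7 days
print(f"All features:       {hours(2096.87):7.1f} h")  # ~310 h
print(f"All w/o ADW+Corley: {hours(18.05):7.1f} h")    # ~2.7 h
```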

Mention Partitioning  To compare the different partitioning algorithms discussed before, we apply them to the concept mention sets obtained for the document sets of the training part of Educ, extracted using an OIE-based method described in Section 6.5. For efficiency, we first group mentions by exact and lemma matching⁵⁵ and denote this reduced mention set by 𝑀. All mention pairs over 𝑀 are then classified with the log-linear model discussed in the previous section, using only five similarities as features, excluding both the two expensive measures and the two already applied matching features.

⁵⁵ Note that for exact and lemma matching, a mention can be directly mapped to its group without comparisons to other mentions. Thus, using a hash map, this partitioning can be implemented in linear time. The resulting reduced set of mentions then also decreases the number of pairs in the subsequent steps.
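The linear-time grouping from footnote 55 amounts to bucketing by a key. A minimal sketch (ours; a real lemma key would come from a lemmatizer such as spaCy rather than str.lower):

```python
from collections import defaultdict

def group_by_key(mentions, key):
    """Linear-time partitioning: mentions sharing the same key end up in
    the same group, with no pairwise comparisons (cf. footnote 55)."""
    groups = defaultdict(list)
    for m in mentions:
        groups[key(m)].append(m)
    return list(groups.values())

mentions = ["online courses", "Online Courses", "an online course"]
print(group_by_key(mentions, key=str.lower))
# [['online courses', 'Online Courses'], ['an online course']]
```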


Algorithm                        Topic 1001                 Topic 1042
                                 Time     Value    |𝐶|      Time       Value    |𝐶|
Trans. Closure                   0.2s     0.9261    495     33.7s      0.7383   4078
Greedy Search                    3.4s     0.9659    664     2h 33m     0.9612   7741
Beam Search k=1, d=10, b=100     10s      0.9367    505     4h 30m     0.7394   4085
Beam Search k=1, d=500, b=100    3m 39s   0.9660    669     9d 10h     0.7458   4130
Beam Search k=5, d=500, b=100    17m 3s   0.9660    669     44d 13h    0.7458   4130
ILP w/ CG                        1d 8h    0.9661    621     Out of Memory
ILP                              Out of Memory              Out of Memory

Table 6.4: Runtime and optimization results for mention partitioning on the smallest and biggest document sets of the training part of Educ. Objective function values are normalized by the number of pairs.

The resulting pairs and their classification probabilities are the input for the partitioning algorithms. We solve ILPs using CPLEX⁵⁶ and use our own implementations of the other approaches. All experiments are carried out on a compute server with 500 GB of memory and Intel Xeon E5-2620 2.1 GHz processors, of which CPLEX uses 24 cores and the other algorithms only a single core.
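The ILP formulation itself was introduced in the preceding section and is not repeated here. As a rough sketch of the generic shape of such a partitioning ILP (ours, using the open-source PuLP/CBC stack rather than CPLEX), binary pair variables are tied together by transitivity constraints; `weight` is an assumed function derived from the classifier probabilities:

```python
from itertools import combinations
import pulp

def partition_ilp(mentions, weight):
    """Sketch of an exact partitioning ILP: x[i, j] = 1 means mentions i
    and j share a concept; transitivity constraints force the x's to
    encode a valid partitioning."""
    prob = pulp.LpProblem("mention_partitioning", pulp.LpMaximize)
    x = {(i, j): pulp.LpVariable(f"x_{i}_{j}", cat="Binary")
         for i, j in combinations(range(len(mentions)), 2)}
    prob += pulp.lpSum(weight(i, j) * x[i, j] for i, j in x)
    for i, j, k in combinations(range(len(mentions)), 3):
        # If i~j and j~k then i~k (plus the two symmetric variants).
        prob += x[i, j] + x[j, k] - x[i, k] <= 1
        prob += x[i, j] + x[i, k] - x[j, k] <= 1
        prob += x[i, k] + x[j, k] - x[i, j] <= 1
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {pair: int(var.value()) for pair, var in x.items()}
```

The 𝒪(|𝑀|³) transitivity constraints help explain why the plain ILP exhausts memory on larger mention sets, which is precisely the issue the column generation variant is meant to mitigate.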

Table 6.3 shows the size of the concept grouping problem for the smallest and biggest topics in the training part of Educ. Note that while topic 1042 has only about ten times as many mentions as topic 1001, the resulting number of pairs is larger by a factor of about 150, illustrating the scalability problem of pairwise comparisons. Moreover, as shown before, our partitioning algorithms have theoretical runtimes that grow even faster than quadratically, which is in line with the partitioning results in Table 6.4. On the smaller document set, we could obtain an optimal solution with the column generation variant of the ILP, whereas the plain ILP was not solvable with the available memory. On the bigger document set, neither variant succeeded. The transitive closure partitioning, on the other hand, while fast, leads to the aforementioned lumping effect, as illustrated by the small number of resulting concepts and the low objective function values. Our two heuristic search algorithms strike a better trade-off between runtime and solution quality, finding partitionings close to the optimal solution on the smaller topic substantially faster than the ILP. Comparing the two, we observe that even the fastest instantiation of beam search, which searches to a depth of only 10 and thus yields only marginally better solutions than the transitive closure, takes longer than the greedy search. Overall, finding a mention partitioning with greedy search therefore appears to be the best approach on our data.

⁵⁶ Version 12.7, available at https://www.ibm.com/analytics/cplex-optimizer.