3.6 Thematic Context Distance

3.6.4 Evaluation

Particularly, for WikiPersonsG, we alter the category condition in Eq. 3.52 to

$W_c = \{\, e \in W \mid (\text{"Frau"} \in c(e)) \vee (\text{"Mann"} \in c(e)) \,\}$   (3.54)

and arrive at a candidate pool Wc containing 18024 distinct, randomly selected entities fulfilling this condition. WikiPersonsG then contains 44338 example documents, with 35367 contexts of linkable mentions, 8971 contexts of uncovered mentions and an average ambiguity of 2.91.

For WikiPersonsF we alter the category condition in Eq. 3.52 to a partial match on category tags:

$W_c = \{\, e \in W \mid \exists c \in c(e) : \text{hasSubstring}(c, \text{"Naissance"}) \,\}$   (3.55)

This means that it is sufficient for the word "Naissance" to be contained as a substring in any of the category tags. For WikiPersonsF we then have a candidate pool Wc of 7201 different entities and a reference dataset of 15159 example documents, with 12284 contexts of linkable mentions, 2875 contexts of uncovered mentions and an average ambiguity of 1.88. Again, for both datasets, the average ambiguity does not include NIL as a candidate.
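For illustration, both filter conditions reduce to simple predicates over an entity's category tags. The following sketch is ours, not the thesis's extraction code; the representation of an entity as a dict with a "categories" list is a hypothetical stand-in.

```python
def in_gender_pool(entity):
    """Condition of Eq. 3.54: an exact category match on "Frau" or "Mann"."""
    return "Frau" in entity["categories"] or "Mann" in entity["categories"]

def in_naissance_pool(entity):
    """Condition of Eq. 3.55: "Naissance" as a substring of any category tag."""
    return any("Naissance" in c for c in entity["categories"])

# Hypothetical candidate pool construction over a collection of entities W:
W = [{"title": "Edith Piaf", "categories": ["Frau", "Naissance en 1915"]}]
W_c = [e for e in W if in_naissance_pool(e)]
```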

Analogously to the observations described in Section 3.5.4, we also find problematic links in these datasets. Some links are rather conceptual and point to a thematically related article, which does not imply identity. For example, the term client can be linked to the article Lawyer.

Figure 3.8: Fmicro performance for entity linking on WikiPersonsE (all values in %). Here, uncovered entity mentions are simulated at random. We compare thematic distance representations in combination with different kernel types in 5-fold cross-validations. The mean of each sample is given above the boxes; the best performance (in bold) is obtained for DsKL with a quadratic kernel.

Thematic Distances

To find the best distance representation, we evaluate the distances described in Section 3.6.1 with different kernels, using SVMLight's standard parameters in five-fold cross-validations on the dataset WikiPersonsE. For a better comparison, we use here the results obtained with standard instance-based splitting for cross-validation instead of the figures given in Pilz and Paaß [2011], which were obtained using entity-based splitting.

Fig. 3.8 visualizes the results obtained for each distance in a linear, a quadratic and a Gaussian kernel. At first glance, there are no striking differences among the different combinations. To statistically compare the representations, we therefore employ paired t-tests on the Fmicro results over the cross-validation folds.

This showed that the best result is obtained with the symmetric Kullback-Leibler distance DsKL in a quadratic kernel.
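As an aside on methodology: since all representations are evaluated on the same folds, the per-fold Fmicro scores form paired samples. A minimal sketch of such a test using scipy; the fold scores below are made up for illustration and are not the thesis's measurements.

```python
from scipy.stats import ttest_rel

# Hypothetical per-fold Fmicro scores (in %) for two distance representations
# measured on the same 5 cross-validation folds:
f_micro_dskl = [90.1, 90.8, 90.6, 90.4, 91.3]
f_micro_djs = [88.0, 88.9, 88.2, 88.5, 88.1]

# Paired test: the folds are identical for both methods, so scores are paired.
t_stat, p_value = ttest_rel(f_micro_dskl, f_micro_djs)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # significant if p < 0.05
```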

With a linear kernel, the Hellinger distance DH and the symmetric Kullback-Leibler distance DsKL perform best. Using the more complex quadratic kernel increases performance for all distances, most notably for DsKL. The Gaussian kernel with standard parameters, however, is not superior. Also, we find the symmetric Kullback-Leibler distance representation DsKL superior to the asymmetric variant DKL (p < 0.02) in all kernels. The same is true for the comparison of DsKL with the Jensen-Shannon distance DJS (p < 0.01), the latter giving the lowest performance in all cases. Comparing DsKL with the Hellinger distance DH, we find significantly better results for DsKL (p < 0.03) in the linear and quadratic kernel, while there is no significant difference when using a Gaussian kernel. The asymmetric DKL is inferior to DH only for the linear kernel. From this evaluation we conclude that the symmetric Kullback-Leibler distance DsKL is most suitable for entity linking based on thematic distance. Since the Gaussian kernel does not provide superior results, we decide not to evaluate it further and present results in later experiments only for the simpler representations in linear and quadratic kernels.
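Since these comparisons hinge on how the four distances are computed, a compact reference implementation may help. The sketch below uses the common textbook definitions over discrete topic distributions; the exact normalizations in Section 3.6.1 may differ slightly, and the smoothing constant is our addition to keep the KL terms finite.

```python
import numpy as np

EPS = 1e-12  # smoothing to avoid log(0) and division by zero

def d_kl(p, q):
    """Asymmetric Kullback-Leibler divergence D_KL(p || q)."""
    p, q = p + EPS, q + EPS
    return float(np.sum(p * np.log(p / q)))

def d_skl(p, q):
    """Symmetric Kullback-Leibler distance: D_KL(p||q) + D_KL(q||p)."""
    return d_kl(p, q) + d_kl(q, p)

def d_js(p, q):
    """Jensen-Shannon distance via the mixture m = (p + q) / 2."""
    m = (p + q) / 2.0
    return 0.5 * d_kl(p, m) + 0.5 * d_kl(q, m)

def d_h(p, q):
    """Hellinger distance."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

# Toy topic distributions standing in for T_m and T_e:
t_m = np.array([0.7, 0.2, 0.1])
t_e = np.array([0.5, 0.3, 0.2])
for name, fn in [("DH", d_h), ("DJS", d_js), ("DKL", d_kl), ("DsKL", d_skl)]:
    print(name, round(fn(t_m, t_e), 4))
```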

Figure 3.9: Fmicro performance for entity linking on WikiPersonsE (all values in %). Here, uncovered entity mentions are simulated taking into account article text length. We compare thematic distance representations in combination with different kernel types in 5-fold cross-validations. The mean of each sample is given above the boxes; the best performance (in bold) is obtained for DsKL with a quadratic kernel.

In this evaluation, we simulated uncovered entities for WikiPersonsE by removing every 5th entity from the candidate pool. Alternatively, we can simulate uncovered entity mentions taking into account article text length, for example by removing entities with short article texts, where only little contextual information is available. In an additional experiment, we therefore simulate uncovered entity mentions by removing all ground truth entities with an article text of less than 50 words. Following the described extraction strategy, we obtain fewer examples for uncovered entity mentions, 1674 instead of 3068, since short articles naturally tend to have fewer inlinks. The results obtained on this dataset are depicted in Fig. 3.9. The performance shows similar behaviour for both simulation strategies, even though the results are, by one to two percentage points, slightly superior to those obtained with random simulation of uncovered entities. There are two possible reasons for this. First, when removing entities by article length, we can expect to find more well-described candidates in the candidate pool, and thus topic distributions are also more stable over their respective article texts. Second, since the number of mentions for uncovered entities is also lower, this dataset can be considered less difficult.
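The two simulation strategies can be contrasted in a few lines. This is an illustrative sketch, not the thesis's extraction code; the entity list and the article-text mapping are hypothetical data structures.

```python
import random

def simulate_uncovered_random(candidates, k=5, seed=0):
    """Random strategy: drop every k-th entity (after shuffling) from the pool."""
    rng = random.Random(seed)
    pool = candidates[:]
    rng.shuffle(pool)
    return [e for i, e in enumerate(pool) if i % k != 0]

def simulate_uncovered_by_length(candidates, article_text, min_words=50):
    """Length strategy: drop ground-truth entities with short article texts."""
    return [e for e in candidates
            if len(article_text[e].split()) >= min_words]
```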

Since simulating uncovered entity mentions via article length biases the data towards popular, well-described entities and furthermore renders the linking problem artificially (if slightly) easier, we decide on the random strategy for further experiments.


Context Properties

Bagga and Baldwin [1998], Gooi and Allan [2004] and Bunescu and Pasca [2006] observed that the words in a mention's close neighbourhood often contain most of the information necessary for its disambiguation. These authors therefore focus on localized context windows, usually with a width of 25 to 50 words centred around the mention, i.e. $\text{text}(m)_{-25,25}$ or $\text{text}(m)_{-50,50}$. In contrast, due to the sampling over words, topic distributions tend to be more representative when more context is available. Therefore, we evaluated different context widths for our method in initial experiments. We found that reducing the available context to local windows around the mention yields a slight decrease in predictive performance. This is in line with our assumption that we obtain higher stability in the topic probability distribution when a larger context is used. Consequently, we use the full text of the link source ls as mention context text(m) to infer the topic distribution Tm.
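To make the inference step concrete: Tm is the topic distribution an LDA model assigns to the mention context. The following sketch uses gensim's LDA implementation as a stand-in; this tooling choice and the toy corpus are our assumptions, not the thesis's setup.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus standing in for Wikipedia article texts:
texts = [["piaf", "singer", "paris", "song"],
         ["lawyer", "court", "client", "case"],
         ["singer", "album", "concert", "paris"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)

# T_m: topic distribution inferred from the full mention context text(m).
mention_context = ["piaf", "concert", "paris", "song", "album"]
bow = dictionary.doc2bow(mention_context)
T_m = lda.get_document_topics(bow, minimum_probability=0.0)
print(T_m)  # list of (topic_id, probability) pairs
```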

However, to put more emphasis on the local context, we propose local boosting. Local boosting uses a context window around the mention and repeatedly adds the terms from this window to the overall document words. We found that boosting the ten-word context window around the mention yields the best result; the terms from this window are added five times to the words of the example document, i.e.

$\text{text}(m) = \text{text}(l_s) \cup \underbrace{\text{text}(m)_{-10,10} \cup \ldots \cup \text{text}(m)_{-10,10}}_{\times 5}.$

We found that boosting the local context in this manner increases performance significantly (p < 0.05) in comparison with the standard, non-boosted version. Hence, we use local boosting on mention contexts for all experiments following in this chapter. Note that this does not affect the entity distributions Te; for those we use the full context text(e) without boosting.
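A minimal sketch of local boosting, assuming tokenized text and a known mention position. Reading the repeated union as a multiset operation (the window's terms contribute five additional counts to the word statistics used for topic inference) is our interpretation.

```python
def boosted_context(doc_tokens, mention_idx, window=10, boost=5):
    """Append the +-`window` tokens around the mention `boost` times.

    doc_tokens:  full token list of the link source text(ls)
    mention_idx: token position of the mention within doc_tokens
    """
    lo = max(0, mention_idx - window)
    hi = min(len(doc_tokens), mention_idx + window + 1)
    local = doc_tokens[lo:hi]
    # Multiset union: repeated terms raise the word counts seen by the topic model.
    return doc_tokens + local * boost
```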

Having described the parameters for distances and contexts, we now evaluate thematic context distance for entity linking against the word-category correlation method proposed by Bunescu and Pasca [2006] (WCC was described in Section 3.5) on the English datasets WikiPersonsE and WikiMiscE. In Pilz and Paaß [2009] and Pilz and Paaß [2011], we compared against the version cWCC using only common words with the feature representation as in Eq. 3.6. For a more thorough comparison, we here additionally provide the results obtained for the original formulation as in Eq. 3.5. When referring to this method, we use WCC to denote the full version and cWCC to denote the version restricted to common words. For the implementation of cWCC and WCC, we extract 5825 categories analogously to Section 3.5.4 from the English Wikipedia and furthermore always add the required candidate NIL that has no attributes apart from the NIL-feature as in Eq. 3.19.

We use DsKL to denote our proposed method, which exploits thematic context distance through the symmetric Kullback-Leibler distance over the topic distributions Tm and Te. Since we also evaluate different kernels, we index this notation with DsKL,l to indicate the usage of a linear kernel and DsKL,q to indicate the usage of a quadratic kernel. As described previously and depicted in Fig. 3.8, the Gaussian kernel did not perform better, so we omitted experiments using this kernel.

Figure 3.10: Micro and macro performance on WikiPersonsE for the methods DsKL,l, DsKL,q and the competitor methods cWCC and WCC (all values in %).

All evaluations are performed in five-fold cross-validations with instance-based splitting and compared for significant differences through paired t-tests with p < 0.05. We start with the results obtained on the dataset WikiPersonsE, for which we also evaluated our approach in a Ranking SVM instead of a standard classification SVM but obtained remarkably inferior results.

Evaluation for Person Name Mentions

To emphasize the performance for uncovered entity mentions, we also report separate accuracy values for covered and uncovered entity mentions. The accuracy for covered entity mentions, AccuracyWc, is given by the ratio of mentions that were correctly assigned to an entity in Wc to the overall number of covered entity mentions, i.e.

$\text{Accuracy}_{W_c} = \frac{|\{\hat{e}(m) = e^+(m) \in W_c\}|}{|\{e^+(m) \in W_c\}|}.$   (3.56)

Analogously, the accuracy for uncovered entity mentions, AccuracyNIL, is given by the ratio of mentions that were correctly assigned to NIL to the overall number of uncovered entity mentions, i.e.

$\text{Accuracy}_{NIL} = \frac{|\{\hat{e}(m) = e^+(m) = \text{NIL}\}|}{|\{e^+(m) = \text{NIL}\}|}.$   (3.57)
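Both accuracies reduce to simple ratios over prediction and ground-truth pairs. A sketch of Eq. 3.56 and Eq. 3.57, with NIL as a hypothetical marker for mentions whose entity is not in the candidate pool:

```python
NIL = None  # hypothetical marker for "no entity in the candidate pool"

def accuracy_wc(predictions, gold):
    """Accuracy over covered mentions (Eq. 3.56): the gold entity is in W_c."""
    covered = [(p, g) for p, g in zip(predictions, gold) if g is not NIL]
    return sum(p == g for p, g in covered) / len(covered)

def accuracy_nil(predictions, gold):
    """Accuracy over uncovered mentions (Eq. 3.57): the gold entity is NIL."""
    uncovered = [(p, g) for p, g in zip(predictions, gold) if g is NIL]
    return sum(p is NIL for p, g in uncovered) / len(uncovered)
```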

Fig. 3.10 visualizes the results obtained on the dataset WikiPersonsE; the explicit figures are given in Tab. 3.2.

Table 3.2: Results on WikiPersonsE (all values in %). The best result for each measure is in bold and marked with an asterisk if the difference towards the 2nd best method is significant (p < 0.05). As our methods DsKL,l and DsKL,q overall perform significantly better than cWCC (p < 0.05), we indicate differences only towards WCC for the sake of readability. In terms of AccuracyNIL, the overall best result is obtained with DsKL,l in a Ranking SVM (significant superiority to DsKL,l in a standard SVM is indicated by †).

                     Bunescu and Pasca      Thematic Context Distance
                                            SVM               Ranking SVM
measure              cWCC     WCC           DsKL,l   DsKL,q   DsKL,l
Fmicro               87.17    86.90         89.11    90.65    83.19
Pmicro               90.37    92.10         91.00    92.59    84.48
Rmicro               84.20    82.25         87.30    88.79    81.95
Fmacro               86.85    89.50         89.25    90.93    82.57
Pmacro               88.13    89.70         90.75    91.99    85.60
Rmacro               87.55    91.46         89.37    91.33    81.72
AccuracyWc           87.58    91.57         89.11    92.14    81.13
AccuracyNIL          69.20    40.93         79.30    78.00    85.08

In all cases, our methods using thematic context distance through a symmetric Kullback-Leibler distance in a linear (DsKL,l) or a quadratic kernel (DsKL,q) perform significantly better than cWCC (p < 0.05). Comparing the linear and quadratic variant, we find the quadratic variant significantly (p < 0.05) superior in most cases. Only regarding the accuracy for uncovered entity mentions, AccuracyNIL, is the linear variant DsKL,l superior to the quadratic variant DsKL,q. Interestingly, the full version WCC obtains, with 40.93%, a notably lower accuracy for uncovered entity mentions AccuracyNIL than the restricted version cWCC with 69.20%.

Comparing the full version WCC and the linear DsKL,l, we find that WCC achieves a significantly (p < 0.05) higher Pmicro of 92.10% and Rmacro of 91.46%, compared to the respective values of 91.00% and 89.37% for DsKL,l. However, the difference in Fmacro between these methods is not significant. The Rmacro of DsKL,q is, at 91.33%, higher than that of DsKL,l, and there is then no longer a significant difference between DsKL,q and WCC. The same holds when comparing the accuracy for covered entity mentions AccuracyWc: DsKL,q achieves 92.14% and WCC 91.57%, but the difference is not significant.

To summarize, our proposed method using thematic context distance over mention and entity contexts performs significantly (p < 0.05) better than the competitor method proposed in Bunescu and Pasca [2006] in most measures.

Figure 3.11: SVM learning time per method in CPU seconds, aggregated over cross-validation folds on WikiPersonsE.

With 79.30% resp. 78.00%, we obtain a significantly (p < 0.01) higher AccuracyNIL for uncovered entity mentions, even though we did not learn an adapted threshold and used the empirically determined τ = 0. The low Rmicro for WCC and cWCC results from their low AccuracyNIL for uncovered entity mentions, which make up about 25% of all examples. WCC and cWCC both perform well for covered entity mentions. Their Rmacro is therefore notably higher, as uncovered entity mentions are summarized by a single NIL class, which is in turn outweighed by the comparably high number of other entities. Since both DsKL,l and DsKL,q obtain a high AccuracyNIL for uncovered entity mentions, their Rmicro and Rmacro are consequently very close.

Tab. 3.2 also shows the results when we use a Ranking SVM instead of a standard SVM for our method DsKL,l. We see that using a Ranking SVM results in notably lower performance. For the Ranking SVM, we used the same feature set as for the standard SVM but enabled threshold learning through NIL candidates in the same way as for cWCC and WCC. As a result, the AccuracyNIL for uncovered entity mentions is notably higher. However, since this is the only measure for which we find an improvement, we argue that this learner is inferior to the standard SVM with this feature setting. For future work, it would be interesting to evaluate the Ranking SVM and standard SVM in a joint model, where the threshold is learned by the Ranking SVM but classification is performed with the standard SVM.

We also evaluated the average learning time for the methods cWCC, WCC, DsKL,l and DsKL,q. For this, we record the SVM's computation time per cross-validation fold for each method on WikiPersonsE and depict the results in Fig. 3.11, which shows the SVM learning time for each method in CPU seconds, aggregated over the cross-validation folds. We see that the learning time is longest for DsKL,q, i.e. the method using a quadratic kernel, with an average of about 435 CPU seconds.

The shortest learning time, an average of 38.71 CPU seconds, is observed for DsKL,l, i.e. the method using a linear kernel. This is about three times faster than the average learning time for cWCC and about seven times faster than that for WCC. Given that the full variant WCC has a far higher feature dimensionality than the restricted variant cWCC, WCC has a notably higher complexity and consequently also an increased learning time (about 2.5 times that of cWCC).

It is not surprising that the learning time of the quadratic kernel variant is notably longer than that of the linear variant, as more parameters need to be estimated from the training data and the complexity increases. Given the good performance of the linear variant DsKL,l as detailed above and shown in Tab. 3.2, we would thus recommend this variant for practical applications that have to obey time constraints.

Lastly, since we use cosine similarity as a baseline feature for all methods, we also evaluated this feature alone in preliminary experiments on WikiPersonsE. With a linear kernel, the SVM classifier could not determine an optimal value and aborted optimization. In that case, the cosine similarity consequently showed a poor performance of only 18.27% in Fmicro. We assume that a linear kernel cannot separate the feature vectors described by cosine similarity alone. In contrast, with a quadratic kernel, we obtained an Fmicro of 78.24% using only this baseline feature.
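For completeness, a sketch of this baseline feature. Plain term-frequency vectors are an assumption on our part; the thesis may weight terms differently (e.g. with tf-idf).

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity over raw term-frequency vectors of two token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```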

Evaluation for General Entity Mentions

Unfortunately, there was a mistake in the experiment reported for the dataset WikiMiscE in Pilz and Paaß [2011]. We wrongly set a parameter of SVMLight and, instead of a Ranking SVM, used a standard SVM as classifier for Bunescu and Pasca's method. At the time of publication we were not aware of this and reported the obtained results to the best of our knowledge. When re-running the experiments, this error became obvious, and we accordingly report the corrected results here in Tab. 3.3.

The high ambiguity and the more diverse entity types in WikiMiscE render this dataset more demanding for all methods. The high number of candidates also results in a high number of negative examples, which we addressed through automatic cost-ratio adaption for all methods. Nevertheless, we find notably lower performance on this dataset for all methods. While we find Pmicro to be comparable for our methods DsKL,l and DsKL,q, all other measures drop by about 3 to 6 percentage points. The decline in performance is, however, far stronger for cWCC and WCC, at about 20 percentage points. In contrast to the dataset WikiPersonsE, we also find that the restricted version cWCC is significantly (p < 0.05) superior to the full version WCC. The high value of WCC for AccuracyNIL in Tab. 3.3 must be interpreted together with the accuracy for covered entity mentions in order to avoid misleading conclusions: the method predicted NIL in most cases, so the accuracy for uncovered entity mentions AccuracyNIL is high, whereas the accuracy for covered entity mentions AccuracyWc is rather low. Also, there is no significant difference in this measure between cWCC and WCC.


Table 3.3: Results on WikiMiscE (all values in %). The best result for each measure is in bold and marked with an asterisk if the difference towards the 2nd best method is significant (p < 0.05). Our methods DsKL,l and DsKL,q are significantly (p < 0.05) superior to cWCC and WCC for all measures apart from AccuracyNIL.

                     Bunescu and Pasca      Thematic Context Distance
measure              cWCC     WCC           DsKL,l   DsKL,q
Fmicro               67.02    64.38         86.74    87.12
Pmicro               69.32    66.58         91.91    92.43
Rmicro               64.87    62.33         82.13    82.39
Fmacro               65.44    62.80         86.35    86.83
Pmacro               68.51    66.08         87.06    87.45
Rmacro               64.35    61.50         86.66    87.27
AccuracyWc           63.67    60.82         86.91    87.50
AccuracyNIL          74.17    75.13         41.53    39.04


Even though our proposed methods show a decline in performance on the dataset WikiMiscE, we see that they are more favourable for entity linking than the competitor methods. Apart from the accuracy for uncovered entity mentions, no measure drops below 86%, a figure that can be satisfactory in most use cases and applications. However, on this dataset, the threshold τ was not appropriate, since the accuracy for uncovered entity mentions dropped significantly for our methods.

Again, we assume that the earlier proposed combination with a Ranking SVM to learn the threshold may yield more satisfying results.

We conclude that the proposed thematic context distance is a very good method for the disambiguation of name phrases, but more suitable for the disambiguation of person names. Due to the often biographic nature of person descriptions, the thematic overlap with their reference contexts tends to be higher than for other entity types, which may be mentioned off-topic (e.g. locations as geographic anchors of events in news documents).

As a side effect, our mistake of using the 'wrong' learner allows for an interesting observation, namely that WCC is very sensitive to the machine learning method: using a standard SVM results in an average performance of about 16%, whereas a Ranking SVM as learner results in notably higher values of more than 60%. In contrast, our method dropped only by about 10 percentage points on WikiPersonsE when substituting the standard SVM with a Ranking SVM.


Table 3.4: WTC on WikiPersonsE and WikiMiscE using a standard SVM and a Ranking SVM as learner (all values in %). Values significantly (p < 0.05) higher than those obtained with DsKL,q in a standard SVM are marked in bold.

Word-topic correlation (WTC)

                     WikiPersonsE             WikiMiscE
measure              SVM      Ranking SVM     SVM      Ranking SVM
Fmicro               87.48    91.88           74.92    80.30
Pmicro               88.49    93.40           76.38    82.47
Rmicro               86.49    90.42           73.51    78.24
Fmacro               86.03    91.08           73.41    79.00
Pmacro               87.89    91.85           75.73    80.63
Rmacro               85.72    91.55           72.70    78.82
AccuracyWc           86.16    91.88           72.73    78.82
AccuracyNIL          87.99    83.94           80.07    73.26

Comparison with Word-Topic-Correlation

To show that thematic context distance is superior to the word-topic correlation approach WTC proposed in Section 3.5, we evaluated the latter also on the datasets WikiPersonsE and WikiMiscE. The results are given in Tab. 3.4.

Using a standard SVM classifier, the results obtained with WTC on WikiPersonsE are on average about 3 percentage points (pp) lower than those obtained using DsKL,q (cf. Tab. 3.2). More specifically, WTC achieved an Fmicro of 87.48% and an Fmacro of 86.03% on WikiPersonsE, compared to the Fmicro of 90.65% and Fmacro of 90.93% for DsKL,q, the latter also in a standard SVM.

Replacing the learner with a Ranking SVM, however, increased the performance to an Fmicro of 91.88% and an Fmacro of 91.08%. As we see in Tab. 3.4, the performance is then also superior to DsKL,q, however significantly (p < 0.05) only in the micro performance values. On WikiPersonsE, WTC is significantly superior to DsKL,q in AccuracyNIL for both learners. Considering the other performance measures on WikiPersonsE, WTC is significantly superior to DsKL,q only in micro performance and then only with the Ranking SVM as learner. Comparing WTC to WCC and cWCC, we find comparable performance on WikiPersonsE; again, only the variant using the Ranking SVM stands out.

On WikiMiscE, we again find notably lower results, with an Fmicro of 74.92% and an Fmacro of 73.41% using the standard SVM as learner, and an Fmicro of 80.30% resp. an Fmacro of 79.00% using the Ranking SVM. Compared to the Fmicro of 87.12% and the Fmacro of 86.83% obtained for DsKL,q with a standard SVM, we find a notable and significant difference of consistently more than 4 pp in any measure.
