3.6 Thematic Context Distance

3.6.4 Evaluation

Particularly, for WikiPersonsG, we alter the category condition in Eq. 3.52 to

$W_c = \{\, e \in W \mid (\text{"Frau"} \in c(e)) \vee (\text{"Mann"} \in c(e)) \,\}$   (3.54)

and arrive at a candidate pool Wc containing 18024 distinct, randomly selected entities fulfilling this condition. WikiPersonsG then contains 44338 example documents, with 35367 contexts of linkable mentions, 8971 contexts of uncovered mentions and an average ambiguity of 2.91.

For WikiPersonsF we alter the category condition in Eq. 3.52 to a partial match on category tags:

$W_c = \{\, e \in W \mid \exists c \in c(e) : \text{hasSubstring}(c, \text{"Naissance"}) \,\}$   (3.55)

This means that it is sufficient for the word "Naissance" to be contained as a substring in any of the category tags. For WikiPersonsF we then have a candidate pool Wc of 7201 different entities and a reference dataset of 15159 example documents, with 12284 contexts of linkable mentions, 2875 contexts of uncovered mentions and an average ambiguity of 1.88. Again, for both datasets, the average ambiguity does not include NIL as a candidate.
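For illustration, both filter conditions reduce to simple predicates over an entity's category tags. The following sketch is ours, not the thesis's extraction code; the representation of an entity as a dict with a "categories" list is a hypothetical stand-in.

```python
def in_gender_pool(entity):
    """Condition of Eq. 3.54: an exact category match on "Frau" or "Mann"."""
    return "Frau" in entity["categories"] or "Mann" in entity["categories"]

def in_naissance_pool(entity):
    """Condition of Eq. 3.55: "Naissance" as a substring of any category tag."""
    return any("Naissance" in c for c in entity["categories"])

# Hypothetical candidate pool construction over a collection of entities W:
W = [{"title": "Edith Piaf", "categories": ["Frau", "Naissance en 1915"]}]
W_c = [e for e in W if in_naissance_pool(e)]
```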

Analogously to the observations described in Section 3.5.4, we also find problematic links in these datasets. Some links are rather conceptual and point to a thematically related article, which does not imply identity. For example, the term client can be linked to the article Lawyer.

Figure 3.8: Fmicro performance for entity linking on WikiPersonsE (all values in %). Here, uncovered entity mentions are simulated at random. We compare thematic distance representations in combination with different kernel types in 5-fold cross-validations. The mean of each sample is given above the boxes; the best performance (in bold) is obtained for DsKL with a quadratic kernel.

Thematic Distances

To find the best distance representation, we evaluate the distances described in Section 3.6.1 with different kernels, using SVMLight's standard parameters in five-fold cross-validations on the dataset WikiPersonsE. For a better comparison, we use here the results obtained with standard instance-based splitting for cross-validation instead of the figures given in Pilz and Paaß [2011], which were obtained using entity-based splitting.

Fig. 3.8 visualizes the results obtained for each distance in a linear, a quadratic and a Gaussian kernel. At first glance, there are no striking differences among the different combinations. To statistically compare the representations, we therefore employ paired t-tests on the Fmicro results over the cross-validation folds.

This showed that the best result is obtained with the symmetric Kullback-Leibler distance DsKL in a quadratic kernel.
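As an aside on methodology: since all representations are evaluated on the same folds, the per-fold Fmicro scores form paired samples. A minimal sketch of such a test using scipy; the fold scores below are made up for illustration and are not the thesis's measurements.

```python
from scipy.stats import ttest_rel

# Hypothetical per-fold Fmicro scores (in %) for two distance representations
# measured on the same 5 cross-validation folds:
f_micro_dskl = [90.1, 90.8, 90.6, 90.4, 91.3]
f_micro_djs = [88.0, 88.9, 88.2, 88.5, 88.1]

# Paired test: the folds are identical for both methods, so scores are paired.
t_stat, p_value = ttest_rel(f_micro_dskl, f_micro_djs)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # significant if p < 0.05
```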

With a linear kernel, the Hellinger distance DH and the symmetric Kullback-Leibler distance DsKL perform best. Using the more complex quadratic kernel increases performance for all distances, most notably for DsKL. The Gaussian kernel with standard parameters, however, is not superior. Also, we find the symmetric Kullback-Leibler distance representation DsKL superior to the asymmetric variant DKL (p < 0.02) in all kernels. The same is true for the comparison of DsKL with the Jensen-Shannon distance DJS (p < 0.01), the latter giving the lowest performance in all cases. Comparing DsKL with the Hellinger distance DH, we find significantly better results for DsKL (p < 0.03) in the linear and quadratic kernel, while there is no significant difference when using a Gaussian kernel. The asymmetric DKL is inferior to DH only for the linear kernel. From this evaluation we conclude that the symmetric Kullback-Leibler distance DsKL is most suitable for entity linking based on thematic distance. Since the Gaussian kernel does not provide superior results, we decide not to evaluate it further and present results in later experiments only for the simpler representations in linear and quadratic kernels.
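Since these comparisons hinge on how the four distances are computed, a compact reference implementation may help. The sketch below uses the common textbook definitions over discrete topic distributions; the exact normalizations in Section 3.6.1 may differ slightly, and the smoothing constant is our addition to keep the KL terms finite.

```python
import numpy as np

EPS = 1e-12  # smoothing to avoid log(0) and division by zero

def d_kl(p, q):
    """Asymmetric Kullback-Leibler divergence D_KL(p || q)."""
    p, q = p + EPS, q + EPS
    return float(np.sum(p * np.log(p / q)))

def d_skl(p, q):
    """Symmetric Kullback-Leibler distance: D_KL(p||q) + D_KL(q||p)."""
    return d_kl(p, q) + d_kl(q, p)

def d_js(p, q):
    """Jensen-Shannon distance via the mixture m = (p + q) / 2."""
    m = (p + q) / 2.0
    return 0.5 * d_kl(p, m) + 0.5 * d_kl(q, m)

def d_h(p, q):
    """Hellinger distance."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

# Toy topic distributions standing in for T_m and T_e:
t_m = np.array([0.7, 0.2, 0.1])
t_e = np.array([0.5, 0.3, 0.2])
for name, fn in [("DH", d_h), ("DJS", d_js), ("DKL", d_kl), ("DsKL", d_skl)]:
    print(name, round(fn(t_m, t_e), 4))
```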

Figure 3.9: Fmicro performance for entity linking on WikiPersonsE (all values in %). Here, uncovered entity mentions are simulated taking into account article text length. We compare thematic distance representations in combination with different kernel types in 5-fold cross-validations. The mean of each sample is given above the boxes; the best performance (in bold) is obtained for DsKL with a quadratic kernel.

In this evaluation, we simulated uncovered entities for WikiPersonsE by removing every 5th entity from the candidate pool. Alternatively, we can simulate uncovered entity mentions taking into account article text length, for example by removing entities with short article texts, where only little contextual information is available. In an additional experiment, we therefore simulate uncovered entity mentions by removing all ground truth entities with an article text of less than 50 words. Following the described extraction strategy, we obtain fewer examples for uncovered entity mentions, 1674 instead of 3068, since short articles naturally tend to have fewer inlinks. The results obtained on this dataset are depicted in Fig. 3.9. The performance shows similar behaviour for both simulation strategies, even though the results are, by one to two percentage points, slightly superior to those obtained with random simulation of uncovered entities. There are two possible reasons for this. First, when removing entities by article length, we can expect to find more well-described candidates in the candidate pool, and thus topic distributions are also more stable over their respective article texts. Second, since the number of mentions for uncovered entities is also lower, this dataset can be considered less difficult.
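The two simulation strategies can be contrasted in a few lines. This is an illustrative sketch, not the thesis's extraction code; the entity list and the article-text mapping are hypothetical data structures.

```python
import random

def simulate_uncovered_random(candidates, k=5, seed=0):
    """Random strategy: drop every k-th entity (after shuffling) from the pool."""
    rng = random.Random(seed)
    pool = candidates[:]
    rng.shuffle(pool)
    return [e for i, e in enumerate(pool) if i % k != 0]

def simulate_uncovered_by_length(candidates, article_text, min_words=50):
    """Length strategy: drop ground-truth entities with short article texts."""
    return [e for e in candidates
            if len(article_text[e].split()) >= min_words]
```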

Since simulating uncovered entity mentions via article length biases the data towards popular, well-described entities and furthermore renders the linking problem artificially (if slightly) easier, we decide on the random strategy for further experiments.


Context Properties

Bagga and Baldwin [1998], Gooi and Allan [2004] and Bunescu and Pasca [2006] observed that the words in a mention's close neighbourhood often contain most of the information necessary for its disambiguation. These authors therefore focus on localized context windows, usually with a width of 25 to 50 words centred around the mention, i.e. $\text{text}(m)_{-25,25}$ or $\text{text}(m)_{-50,50}$. In contrast, due to the sampling over words, topic distributions tend to be more representative when more context is available. Therefore, we evaluated different context widths for our method in initial experiments. We found that reducing the available context to local windows around the mention yields a slight decrease in predictive performance. This is in line with our assumption that we obtain higher stability in the topic probability distribution when a larger context is used. Consequently, we use the full text of the link source ls as mention context text(m) to infer the topic distribution Tm.
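To make the inference step concrete: Tm is the topic distribution an LDA model assigns to the mention context. The following sketch uses gensim's LDA implementation as a stand-in; this tooling choice and the toy corpus are our assumptions, not the thesis's setup.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus standing in for Wikipedia article texts:
texts = [["piaf", "singer", "paris", "song"],
         ["lawyer", "court", "client", "case"],
         ["singer", "album", "concert", "paris"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)

# T_m: topic distribution inferred from the full mention context text(m).
mention_context = ["piaf", "concert", "paris", "song", "album"]
bow = dictionary.doc2bow(mention_context)
T_m = lda.get_document_topics(bow, minimum_probability=0.0)
print(T_m)  # list of (topic_id, probability) pairs
```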

However, to put more emphasis on the local context, we propose local boosting. Local boosting uses a context window around the mention and repeatedly adds the terms from this window to the overall document words. We found that boosting the ten-word context window around the mention yields the best result; the terms from this window are added five times to the words of the example document, i.e.

$\text{text}(m) = \text{text}(l_s) \cup \underbrace{\text{text}(m)_{-10,10} \cup \ldots \cup \text{text}(m)_{-10,10}}_{\times 5}.$

We found that boosting the local context in this manner increases performance significantly (p < 0.05) in comparison with the standard, non-boosted version. Hence, we use local boosting on mention contexts for all experiments following in this chapter. Note that this does not affect the entity distributions Te; for those we use the full context text(e) without boosting.
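A minimal sketch of local boosting, assuming tokenized text and a known mention position. Reading the repeated union as a multiset operation (the window's terms contribute five additional counts to the word statistics used for topic inference) is our interpretation.

```python
def boosted_context(doc_tokens, mention_idx, window=10, boost=5):
    """Append the +-`window` tokens around the mention `boost` times.

    doc_tokens:  full token list of the link source text(ls)
    mention_idx: token position of the mention within doc_tokens
    """
    lo = max(0, mention_idx - window)
    hi = min(len(doc_tokens), mention_idx + window + 1)
    local = doc_tokens[lo:hi]
    # Multiset union: repeated terms raise the word counts seen by the topic model.
    return doc_tokens + local * boost
```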

Having described the parameters for distances and contexts, we now evaluate thematic context distance for entity linking against the word-category correlation method proposed by Bunescu and Pasca [2006] (WCC was described in Section 3.5) on the English datasets WikiPersonsE and WikiMiscE. In Pilz and Paaß [2009] and Pilz and Paaß [2011], we compared against the version cWCC using only common words with the feature representation as in Eq. 3.6. For a more thorough comparison, we here additionally provide the results obtained for the original formulation as in Eq. 3.5. When referring to this method, we use WCC to denote the full version and cWCC to denote the version restricted to common words. For the implementation of cWCC and WCC, we extract 5825 categories analogously to Section 3.5.4 from the English Wikipedia and furthermore always add the required candidate NIL that has no attributes apart from the NIL-feature as in Eq. 3.19.

We use DsKL to denote our proposed method, which exploits thematic context distance through the symmetric Kullback-Leibler distance over the topic distributions Tm and Te. Since we also evaluate different kernels, we index this notation with DsKL,l to indicate the usage of a linear kernel and DsKL,q to indicate the usage of a quadratic kernel. As described previously and depicted in Fig. 3.8, the Gaussian kernel did not perform better, so we omitted experiments using this kernel.

Figure 3.10: Micro and macro performance on WikiPersonsE for the methods DsKL,l, DsKL,q and the competitor methods cWCC and WCC (all values in %).

All evaluations are performed in five-fold cross-validations with instance-based splitting and compared for significant differences through paired t-tests with p < 0.05. We start with the results obtained on the dataset WikiPersonsE, for which we also evaluated our approach in a Ranking SVM instead of a standard classification SVM but obtained remarkably inferior results.

Evaluation for Person Name Mentions

To emphasize the performance for uncovered entity mentions, we also report separate accuracy values for covered and uncovered entity mentions. The accuracy for covered entity mentions, AccuracyWc, is given by the ratio of mentions that were correctly assigned to an entity in Wc to the overall number of covered entity mentions, i.e.

$\text{Accuracy}_{W_c} = \frac{|\{\hat{e}(m) = e^+(m) \in W_c\}|}{|\{e^+(m) \in W_c\}|}.$   (3.56)

Analogously, the accuracy for uncovered entity mentions, AccuracyNIL, is given by the ratio of mentions that were correctly assigned to NIL to the overall number of uncovered entity mentions, i.e.

$\text{Accuracy}_{NIL} = \frac{|\{\hat{e}(m) = e^+(m) = \text{NIL}\}|}{|\{e^+(m) = \text{NIL}\}|}.$   (3.57)
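Both accuracies reduce to simple ratios over prediction and ground-truth pairs. A sketch of Eq. 3.56 and Eq. 3.57, with NIL as a hypothetical marker for mentions whose entity is not in the candidate pool:

```python
NIL = None  # hypothetical marker for "no entity in the candidate pool"

def accuracy_wc(predictions, gold):
    """Accuracy over covered mentions (Eq. 3.56): the gold entity is in W_c."""
    covered = [(p, g) for p, g in zip(predictions, gold) if g is not NIL]
    return sum(p == g for p, g in covered) / len(covered)

def accuracy_nil(predictions, gold):
    """Accuracy over uncovered mentions (Eq. 3.57): the gold entity is NIL."""
    uncovered = [(p, g) for p, g in zip(predictions, gold) if g is NIL]
    return sum(p is NIL for p, g in uncovered) / len(uncovered)
```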

Fig. 3.10 visualizes the results obtained on the dataset WikiPersonsE; the explicit figures are given in Tab. 3.2.

Table 3.2: Results on WikiPersonsE (all values in %). The best result for each measure is in bold and marked with an asterisk if the difference towards the 2nd best method is significant (p < 0.05). As our methods DsKL,l and DsKL,q overall perform significantly better than cWCC (p < 0.05), we indicate differences only towards WCC for the sake of readability. In terms of AccuracyNIL, the overall best result is obtained with DsKL,l in a Ranking SVM (significant superiority to DsKL,l in a standard SVM is indicated by †).

                     Bunescu and Pasca      Thematic Context Distance
                                            SVM               Ranking SVM
measure              cWCC     WCC           DsKL,l   DsKL,q   DsKL,l
Fmicro               87.17    86.90         89.11    90.65    83.19
Pmicro               90.37    92.10         91.00    92.59    84.48
Rmicro               84.20    82.25         87.30    88.79    81.95
Fmacro               86.85    89.50         89.25    90.93    82.57
Pmacro               88.13    89.70         90.75    91.99    85.60
Rmacro               87.55    91.46         89.37    91.33    81.72
AccuracyWc           87.58    91.57         89.11    92.14    81.13
AccuracyNIL          69.20    40.93         79.30    78.00    85.08

In all cases, our methods using thematic context distance through a symmetric Kullback-Leibler distance in a linear (DsKL,l) or a quadratic kernel (DsKL,q) perform significantly better than cWCC (p < 0.05). Comparing the linear and quadratic variant, we find the quadratic variant significantly (p < 0.05) superior in most cases. Only regarding the accuracy for uncovered entity mentions, AccuracyNIL, is the linear variant DsKL,l superior to the quadratic variant DsKL,q. Interestingly, the full version WCC obtains, with 40.93%, a notably lower accuracy for uncovered entity mentions AccuracyNIL than the restricted version cWCC with 69.20%.

Comparing the full version WCC and the linear DsKL,l, we find that WCC achieves a significantly (p < 0.05) higher Pmicro of 92.10% and Rmacro of 91.46%, compared to the respective values of 91.00% and 89.37% for DsKL,l. However, the difference in Fmacro between these methods is not significant. The Rmacro of DsKL,q is, at 91.33%, higher than that of DsKL,l, and there is then no longer a significant difference between DsKL,q and WCC. The same holds when comparing the accuracy for covered entity mentions AccuracyWc: DsKL,q achieves 92.14% and WCC 91.57%, but the difference is not significant.

To summarize, our proposed method using thematic context distance over mention and entity contexts performs significantly (p < 0.05) better than the competitor method proposed in Bunescu and Pasca [2006] in most measures.

Figure 3.11: SVM learning time per method in CPU seconds, aggregated over cross-validation folds on WikiPersonsE.

With 79.30% resp. 78.00%, we obtain a significantly (p < 0.01) higher AccuracyNIL for uncovered entity mentions, even though we did not learn an adapted threshold and used the empirically determined τ = 0. The low Rmicro for WCC and cWCC results from their low AccuracyNIL for uncovered entity mentions, which make up about 25% of all examples. WCC and cWCC both perform well for covered entity mentions. Their Rmacro is therefore notably higher, as uncovered entity mentions are summarized by a single NIL class, which is in turn outweighed by the comparably high number of other entities. Since both DsKL,l and DsKL,q obtain a high AccuracyNIL for uncovered entity mentions, their Rmicro and Rmacro are consequently very close.

Tab. 3.2 also shows the results when we use a Ranking SVM instead of a standard SVM for our method DsKL,l. We see that using a Ranking SVM results in notably lower performance. For the Ranking SVM, we used the same feature set as for the standard SVM but enabled threshold learning through NIL candidates in the same way as for cWCC and WCC. As a result, the AccuracyNIL for uncovered entity mentions is notably higher. However, since this is the only measure for which we find an improvement, we argue that this learner is inferior to the standard SVM with this feature setting. For future work, it would be interesting to evaluate the Ranking SVM and standard SVM in a joint model, where the threshold is learned by the Ranking SVM but classification is performed with the standard SVM.

We also evaluated the average learning time for the methods cWCC, WCC, DsKL,l and DsKL,q. For this, we record the SVM's computation time per cross-validation fold for each method on WikiPersonsE and depict the results in Fig. 3.11, which shows the SVM learning time for each method in CPU seconds, aggregated over the cross-validation folds. We see that the learning time is longest for DsKL,q, i.e. the method using a quadratic kernel, with an average of about 435 CPU seconds.

The shortest learning time, an average of 38.71 CPU seconds, is observed for DsKL,l, i.e. the method using a linear kernel. This is about three times faster than the average learning time for cWCC and about seven times faster than that for WCC. Given that the full variant WCC has a far higher feature dimensionality than the restricted variant cWCC, WCC has a notably higher complexity and consequently also an increased learning time (about 2.5 times that of cWCC).

It is not surprising that the learning time of the quadratic kernel variant is notably longer than that of the linear variant, as more parameters need to be estimated from the training data and the complexity increases. Given the good performance of the linear variant DsKL,l as detailed above and shown in Tab. 3.2, we would thus recommend this variant for practical applications that have to obey time constraints.

Lastly, since we use cosine similarity as a baseline feature for all methods, we also evaluated this feature alone in preliminary experiments on WikiPersonsE. With a linear kernel, the SVM classifier could not determine an optimal value and aborted optimization. In that case, the cosine similarity consequently showed a poor performance of only 18.27% in Fmicro. We assume that a linear kernel cannot separate the feature vectors described by cosine similarity alone. In contrast, with a quadratic kernel, we obtained an Fmicro of 78.24% using only this baseline feature.
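For completeness, a sketch of this baseline feature. Plain term-frequency vectors are an assumption on our part; the thesis may weight terms differently (e.g. with tf-idf).

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity over raw term-frequency vectors of two token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```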

Evaluation for General Entity Mentions

Unfortunately, there was a mistake in the experiment reported for the dataset WikiMiscE in Pilz and Paaß [2011]. We wrongly set a parameter of SVMLight and, instead of a Ranking SVM, used a standard SVM as classifier for Bunescu and Pasca's method. At the time of publication we were not aware of this and reported the obtained results to the best of our knowledge. When re-running the experiments, this error became obvious, and we accordingly report the corrected results here in Tab. 3.3.

The high ambiguity and the more diverse entity types in WikiMiscE render this dataset more demanding for all methods. The high number of candidates also results in a high number of negative examples, which we addressed through automatic cost-ratio adaption for all methods. Nevertheless, we find notably lower performance on this dataset for all methods. While we find Pmicro to be comparable for our methods DsKL,l and DsKL,q, all other measures drop by about 3 to 6 percentage points. The decline in performance is, however, far stronger for cWCC and WCC, at about 20 percentage points. In contrast to the dataset WikiPersonsE, we also find that the restricted version cWCC is significantly (p < 0.05) superior to the full version WCC. The high value of WCC for AccuracyNIL in Tab. 3.3 must be interpreted together with the accuracy for covered entity mentions in order to avoid misleading conclusions: the method predicted NIL in most cases, so the accuracy for uncovered entity mentions AccuracyNIL is high, whereas the accuracy for covered entity mentions AccuracyWc is rather low. Also, there is no significant difference in this measure between cWCC and WCC.


Table 3.3: Results on WikiMiscE (all values in %). The best result for each measure is in bold and marked with an asterisk if the difference towards the 2nd best method is significant (p < 0.05). Our methods DsKL,l and DsKL,q are significantly (p < 0.05) superior to cWCC and WCC for all measures apart from AccuracyNIL.

                     Bunescu and Pasca      Thematic Context Distance
measure              cWCC     WCC           DsKL,l   DsKL,q
Fmicro               67.02    64.38         86.74    87.12
Pmicro               69.32    66.58         91.91    92.43
Rmicro               64.87    62.33         82.13    82.39
Fmacro               65.44    62.80         86.35    86.83
Pmacro               68.51    66.08         87.06    87.45
Rmacro               64.35    61.50         86.66    87.27
AccuracyWc           63.67    60.82         86.91    87.50
AccuracyNIL          74.17    75.13         41.53    39.04


Even though our proposed methods show a decline in performance on the dataset WikiMiscE, we see that they are more favourable for entity linking than the competitor methods. Apart from the accuracy for uncovered entity mentions, no measure drops below 86%, a figure that can be satisfactory in most use cases and applications. However, on this dataset, the threshold τ was not appropriate, since the accuracy for uncovered entity mentions dropped significantly for our methods.

Again, we assume that the earlier proposed combination with a Ranking SVM to learn the threshold may yield more satisfying results.

We conclude that the proposed thematic context distance is a very good method for the disambiguation of name phrases, but more suitable for the disambiguation of person names. Due to the often biographic nature of person descriptions, the thematic overlap with their reference contexts tends to be higher than for other entity types, which may be mentioned off-topic (e.g. locations as geographic anchors of events in news documents).

As a side effect, our mistake of using the 'wrong' learner allows for an interesting observation, namely that WCC is very sensitive to the machine learning method: using a standard SVM results in an average performance of about 16%, whereas a Ranking SVM as learner results in notably higher values of more than 60%. In contrast, our method dropped only by about 10 percentage points on WikiPersonsE when substituting the standard SVM with a Ranking SVM.


Table 3.4: WTC on WikiPersonsE and WikiMiscE using a standard SVM and a Ranking SVM as learner (all values in %). Values significantly (p < 0.05) higher than those obtained with DsKL,q in a standard SVM are marked in bold.

Word-topic correlation (WTC)

                     WikiPersonsE             WikiMiscE
measure              SVM      Ranking SVM     SVM      Ranking SVM
Fmicro               87.48    91.88           74.92    80.30
Pmicro               88.49    93.40           76.38    82.47
Rmicro               86.49    90.42           73.51    78.24
Fmacro               86.03    91.08           73.41    79.00
Pmacro               87.89    91.85           75.73    80.63
Rmacro               85.72    91.55           72.70    78.82
AccuracyWc           86.16    91.88           72.73    78.82
AccuracyNIL          87.99    83.94           80.07    73.26

Comparison with Word-Topic-Correlation

To show that thematic context distance is superior to the word-topic correlation approach WTC proposed in Section 3.5, we evaluated the latter also on the datasets WikiPersonsE and WikiMiscE. The results are given in Tab. 3.4.

Using a standard SVM classifier, the results obtained with WTC on WikiPersonsE are on average about 3 percentage points (pp) lower than those obtained using DsKL,q (cf. Tab. 3.2). More specifically, WTC achieved an Fmicro of 87.48% and an Fmacro of 86.03% on WikiPersonsE, compared to the Fmicro of 90.65% and Fmacro of 90.93% for DsKL,q, the latter also in a standard SVM.

Replacing the learner with a Ranking SVM, however, increased the performance to an Fmicro of 91.88% and an Fmacro of 91.08%. As we see in Tab. 3.4, the performance is then also superior to DsKL,q, however significantly (p < 0.05) only in the micro performance values. On WikiPersonsE, WTC is significantly superior to DsKL,q in AccuracyNIL for both learners. Considering the other performance measures on WikiPersonsE, WTC is significantly superior to DsKL,q only in micro performance and then only with the Ranking SVM as learner. Comparing WTC to WCC and cWCC, we find comparable performance on WikiPersonsE; again, only the variant using the Ranking SVM stands out.

On WikiMiscE, we again find notably lower results, with an Fmicro of 74.92% and an Fmacro of 73.41% using the standard SVM as learner, and an Fmicro of 80.30% resp. an Fmacro of 79.00% using the Ranking SVM. Compared to the Fmicro of 87.12% and the Fmacro of 86.83% obtained for DsKL,q with a standard SVM, we find a notable and significant difference of consistently more than 4 pp in any measure.
