
3.6 Thematic Context Distance

3.6.3 Wikipedia Reference Datasets

Our aim is to build an entity linking model focused on persons that is applicable in more than one language. Although recent work has published a series of benchmark datasets, these datasets mostly consist of English documents. To the best of our knowledge, there are no publicly available benchmark datasets with persons linked to Wikipedia for other languages such as German and French. Therefore, we resort to Wikipedia to provide disambiguated examples. While we are aware that Wikipedia documents differ from edited newspaper articles in semantics, structure and topics, we assume that a model evaluated on this dataset can generalize to other corpora.

We aim at retrieving highly ambiguous datasets, i.e. datasets where mentions have many candidates in the candidate pool. To do so, we use two strategies to fill the candidate pools. This results in two different datasets, one consisting of mentions of persons, the other containing mentions of entities of diverse types. For the first strategy, we extract persons with ambiguous names by focusing on name phrases that refer to at least two distinct entities. This strategy enables a fair comparison with Bunescu and Pasca's method. For the second strategy, we widen the candidate pool by allowing partial matches for the common English surnames Jones, Taylor and Smith and by removing the constraint that a candidate must be a person. Doing so, we obtain a broad set of entities that each contain at least one of these highly ambiguous seed names in their name(e) but need not be of type person. Using this strategy, we empirically show that our method generalizes to other concepts apart

Chapter 3 Topic Models for Person Linking

Table 3.1: Wikipedia evaluation datasets for English, German and French (indicated by subscript). The table shows for each dataset the number of entities in the candidate pool Wc, the number of extracted contexts d ∈ D, the number of covered (e+(m) ∈ Wc) and uncovered entity mentions (e+(m) = NIL), and the average ambiguity per mention given by the average cardinality |e(m)| of the candidate sets.

dataset        |Wc|    |D|    e+(m)∈Wc   e+(m)=NIL   avg. |e(m)|
WikiPersonsE    6213   16661     13593        3068          2.06
WikiMiscE      10734   15481     13849        1632         26.76
WikiPersonsG   18024   44338     35367        8971          2.91
WikiPersonsF    7201   15159     12284        2875          1.88

from persons. Furthermore, the latter strategy accounts for cases where entities are referenced merely by the surname, which renders the distinction of candidates even more difficult.

Note that since all versions of Wikipedia are endowed with hyperlink structures, we may employ these strategies not only for the English version1, but also for the German2 and the French3 version. From this, we obtain the datasets summarized in Tab. 3.1. Their generation process is detailed next.

Using the first strategy, we start with the extraction of example mentions for persons with ambiguous names. We call the resulting corpus WikiPersons in the following and use an index to denote the language version of Wikipedia, i.e. WikiPersonsE for the examples from the English version of Wikipedia, WikiPersonsG for the German and WikiPersonsF for the French version.

First, we need to identify articles describing persons. For this, we use both the type system of YAGO and Wikipedia categories. YAGO's type system indicates whether an article describes a person, which we use to determine person articles in our version of Wikipedia. Even though YAGO was built over a different version of Wikipedia, we may use it to determine persons in our version, since older articles usually still exist and we may align them with our version via their unique titles. Articles in our version that did not exist previously are consequently ignored and not used in the candidate pool for WikiPersonsE.

However, since YAGO is built over the English version of Wikipedia, we cannot solely rely on it to detect all persons in the German and French versions via language links. Therefore, we use simple heuristics such as the presence of the categories Mann or Frau to detect persons in the German Wikipedia and Naissance for the French version. While there are certainly more precise ways to determine persons

1 http://www.en.wikipedia.org, retrieved on January 15, 2011.

2 http://www.de.wikipedia.org, retrieved on January 31, 2011.

3 http://www.fr.wikipedia.org, retrieved on February 1st, 2011.


in other Wikipedia language versions, for example by analysing their individual category trees, this may require deeper understanding of these languages and further investigations that are not the focus of this thesis. For future work, we note that the multilingual entity taxonomy created in de Melo and Weikum [2010] may serve as a better alternative. As Tab. 3.1 shows, we could extract comparable numbers of persons in all versions of Wikipedia using YAGO types and our heuristics. The higher number of examples for the German version results from the comparably high number of biographic entries in the German version.

From the Wikipedia articles describing persons, we select entities with ambiguous names. A person name is ambiguous in Wikipedia when at least two entities have the same name, i.e. the same article title after removing any disambiguation term. More specifically, when matching the name name(e) against the names of other persons, we obtain at least one other person:

WPer = {e ∈ W | c = "person" ∈ c(e)}

Wc = {e ∈ WPer | |e(name(e)) ⊆ WPer| ≥ 2},   (3.52)

where e(name(e)) ⊆ WPer contains all persons in WPer whose name completely matches (case-insensitive equality) the name of e. For example, Jonas Taylor does not match John Taylor, but John Taylor (jazz) does. The condition c = "person" ∈ c(e) relies on the alignment with YAGO, which provides this specific type that we use as a category. With a random selection of entities fulfilling these conditions, we arrive at a candidate pool Wc with 6213 different entities for WikiPersonsE.
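The selection in Eq. 3.52 can be illustrated with a minimal Python sketch. It assumes that the name of an entity is obtained by stripping a parenthesised disambiguation term from the article title; all function and variable names here are illustrative, not part of the thesis' implementation:

```python
import re
from collections import defaultdict

def base_name(title):
    """Name of an entity: the article title with any trailing parenthesised
    disambiguation term removed, e.g. "John Taylor (jazz)" -> "John Taylor"."""
    return re.sub(r"\s*\([^)]*\)$", "", title).strip()

def ambiguous_persons(person_titles):
    """Sketch of Eq. 3.52: keep only persons whose (case-insensitive)
    name is shared by at least two distinct person articles."""
    by_name = defaultdict(list)
    for title in person_titles:
        by_name[base_name(title).lower()].append(title)
    return {t for group in by_name.values() if len(group) >= 2 for t in group}

pool = ambiguous_persons(
    ["John Taylor", "John Taylor (jazz)", "Jonas Taylor", "Ada Lovelace"]
)
# -> {"John Taylor", "John Taylor (jazz)"}; the unique names are excluded.
```

As in the thesis, the case-insensitive equality on base names is what makes John Taylor and John Taylor (jazz) mutually ambiguous while leaving Jonas Taylor out.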

The dataset WikiMiscE is created using the second strategy. Here, we use no constraint on the entity type but focus on frequent names and alter the matching technique from exact matches to partial name matches. Given the seed names {jones, taylor, smith}, an entity is added to Wc if its name contains at least one of these names as a substring, i.e.:

Wc = {e ∈ W | hasSubstring(name(e), "smith")
            ∨ hasSubstring(name(e), "taylor")   (3.53)
            ∨ hasSubstring(name(e), "jones")}

For instance, the entity Bruce Jones would be added to the candidate pool Wc defined above, since hasSubstring("Bruce Jones", "jones") is true (matching is case-insensitive). Using again a random selection of the entities fulfilling these conditions, we arrive at a candidate pool Wc with 10734 different entities for WikiMiscE.
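The seed-name condition of Eq. 3.53 amounts to a case-insensitive substring test; a minimal sketch (the function name is an assumption of this illustration):

```python
SEEDS = ("jones", "taylor", "smith")

def in_misc_pool(name):
    """Sketch of Eq. 3.53: an entity joins the WikiMisc pool if its name
    contains one of the seed surnames as a case-insensitive substring."""
    lowered = name.lower()
    return any(seed in lowered for seed in SEEDS)

assert in_misc_pool("Bruce Jones")      # matches seed "jones"
assert in_misc_pool("Jones Soda")       # non-person entities qualify too
assert not in_misc_pool("Ada Lovelace")
```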

Aiming at high disambiguation performance not only for popular entities with many inlinks but also for less popular entities with few inlinks, we again set a boundary on the number of examples per entity. This is achieved by using at most ten randomly selected inlinks from the set Lin(e) as examples, i.e. n = 10 in Alg. 1.
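The per-entity cap on examples can be sketched as a simple random subsample of the inlink set; the function below is an illustrative assumption (the thesis' Alg. 1 interleaves this with context extraction):

```python
import random

def sample_examples(inlinks, n=10, seed=0):
    """Keep at most n randomly chosen inlinks per entity as example
    contexts (n = 10 in Alg. 1); the fixed seed is only for
    reproducibility of this sketch."""
    rng = random.Random(seed)
    if len(inlinks) <= n:
        return list(inlinks)
    return rng.sample(list(inlinks), n)
```

Entities with fewer than n inlinks keep all of their examples, so the cap only trims the long tail of highly popular entities.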


Again, we argue that this restriction allows a more balanced model over all entities in Wc and avoids over-fitting towards high-popularity entities. Here, each of the example documents d contains the complete article text text(ls) of the link source ls. Using the complete article text enables us to experimentally evaluate the influence of context width; we will discuss this in Section 3.6.4. In line with the corpus design described in Section 3.5.4, we treat only the mention of the entity from the candidate pool Wc and do not handle any other potentially appearing mention, i.e. any other link.

Given a mention m, we select its candidate entities in the same way we generate the candidate pool. In the case of WikiPersons, an entity e is considered a candidate if its name fully matches the surface form of the link anchor text la associated with the mention m, i.e. la = name(e). In the case of WikiMiscE, an entity e is considered a candidate if this surface form is contained as a substring in the candidate's name, i.e. if hasSubstring(name(e), la) is true. Using a partial match for candidate selection, the surface name Jones may then match Adam Jones or Catherine Zeta-Jones, but also Jones Soda or Jones, Oklahoma. This way, we get on average more than 27 candidates per mention and thus a highly ambiguous dataset where references are not restricted to mentions of persons.
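The two candidate selection modes can be sketched in a few lines of Python. The dictionary mapping entity titles to names and the function signature are assumptions of this illustration:

```python
def candidates(anchor, names, partial=False):
    """Sketch of candidate selection: exact case-insensitive name match
    for WikiPersons; for WikiMisc, the anchor must be a substring of the
    candidate's name. `names` maps entity titles to entity names."""
    a = anchor.lower()
    if partial:
        return [e for e, name in names.items() if a in name.lower()]
    return [e for e, name in names.items() if a == name.lower()]

names = {"Adam Jones": "Adam Jones", "Jones Soda": "Jones Soda",
         "John Taylor (jazz)": "John Taylor"}
candidates("Jones", names, partial=True)   # -> Adam Jones, Jones Soda
candidates("John Taylor", names)           # -> John Taylor (jazz)
```

The partial mode is what inflates the average candidate set of WikiMiscE to more than 27 entities per mention.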

Apart from creating a different candidate pool Wc and using a different candidate selection method for WikiMiscE, we also set a different boundary on the number of examples n. While example extraction is performed analogously to WikiPersonsE, we use at most n = 5 randomly selected inlinks per entity to arrive at datasets of comparable size. Otherwise, the number of example documents in WikiMiscE would be much higher, since the cardinality of the candidate pool here is nearly twice that of the candidate pool of WikiPersonsE.

For all datasets, we simulate mentions of uncovered entities by marking every fifth entity in the candidate pool Wc as uncovered, i.e. by setting z = 5 in Alg. 1. Tab. 3.1 shows that for WikiPersonsE, we arrive at 16661 example documents, where 13593 are contexts of linkable mentions, i.e. e+(m) ∈ Wc, and 3068 are contexts of mentions with e+(m) = NIL. The average ambiguity per mention is 2.06, which does not include the symbolic entity NIL and concerns only candidates e ∈ Wc. For WikiPersonsE, the entity pool Wc contains 6213 different candidate entities. After the simulation of uncovered entities, we have a ratio of 1242 uncovered vs. 4971 covered entities. For WikiMiscE, the entity pool Wc contains 10734 different candidate entities. After the simulation of uncovered entities, we have a ratio of 2146 uncovered vs. 8588 covered entities.
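The NIL simulation itself is a deterministic stride over the candidate pool; a minimal sketch, assuming the pool is given as a list (names and return shape are illustrative):

```python
def mark_uncovered(pool_entities, z=5):
    """Sketch of the NIL simulation: every z-th entity of the candidate
    pool is marked as uncovered (z = 5 in Alg. 1), so all mentions of it
    must be resolved to NIL."""
    uncovered = set(pool_entities[z - 1::z])
    covered = [e for e in pool_entities if e not in uncovered]
    return covered, sorted(uncovered)

covered, uncovered = mark_uncovered([f"e{i}" for i in range(6213)])
# 6213 pool entities yield 1242 uncovered and 4971 covered entities,
# matching the ratio reported for WikiPersonsE.
```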

As the proposed method is in general language-independent, we also evaluate name disambiguation on German and French datasets. To do so, we extract example contexts from both the German and the French version of Wikipedia using the same extraction technique as for WikiPersonsE, but adapt the indicative categories. Both datasets then contain references for persons with ambiguous names.


In particular, for WikiPersonsG, we alter the category condition in Eq. 3.52 to

Wc = {e ∈ W | (c = "Frau" ∈ c(e)) ∨ (c = "Mann" ∈ c(e))}   (3.54)

and arrive at a candidate pool Wc containing 18024 distinct, randomly selected entities fulfilling this condition. WikiPersonsG then contains 44338 example documents, with 35367 contexts of linkable mentions, 8971 contexts of uncovered mentions and an average ambiguity of 2.91.

For WikiPersonsF, we alter the category condition in Eq. 3.52 to a partial match on category tags:

Wc = {e ∈ W | ∃ c ∈ c(e) : hasSubstring(c, "Naissance")}.   (3.55)

This means that it is sufficient that the word "Naissance" is contained as a substring in any of the category tags. For WikiPersonsF, we then have a candidate pool Wc of 7201 different entities and a reference dataset of 15159 example documents, with 12284 contexts of linkable mentions, 2875 contexts of uncovered mentions and an average ambiguity of 1.88. Again, for both datasets, the average ambiguity does not include NIL as a candidate.
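The language-specific person heuristics of Eqs. 3.54 and 3.55 can be sketched together; the function is an illustrative assumption (English person detection uses YAGO types instead and is omitted):

```python
def is_person(categories, language):
    """Sketch of the category heuristics: exact category match for
    German (Eq. 3.54), substring match on "Naissance" for French
    (Eq. 3.55)."""
    if language == "de":
        return "Frau" in categories or "Mann" in categories
    if language == "fr":
        return any("Naissance" in c for c in categories)
    raise ValueError("heuristic defined only for German and French")

is_person(["Mann", "Komponist"], "de")   # True: exact category "Mann"
is_person(["Naissance en 1975"], "fr")   # True: substring "Naissance"
```

The French rule deliberately uses a substring test because birth-year categories ("Naissance en …") carry the keyword only as a prefix.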

Analogously to the observations described in Section 3.5.4, we also find problematic links in these datasets. Some links are rather conceptual and point to a thematically related article, which does not imply identity. For example, the term client can be linked to the article Lawyer.
