
2.1.1 Toponym resolution

Buscaldi (2011) (24) groups toponym resolution approaches into three categories: map-based methods, knowledge-based methods, and data-driven or supervised methods.

The disambiguation strategy of map-based methods relies mostly on geometric information, such as the latitude and longitude of the possible candidates and the geographic distance between them. Knowledge-based methods exploit external knowledge about the properties of the geographic entities. Finally, data-driven or supervised methods are based on machine learning techniques and often require hand-annotated training data.

Map-based methods. One method whose strategy relies heavily on geometric features is that of Smith and Crane (2001) (89). They use rule-based strategies to identify toponyms in a text and then compute a geographic centroid from all the possible interpretations for each toponym, weighted by the number of occurrences.

Candidate locations that are more than twice the standard deviation away from the centroid are discarded, and the final disambiguation is based on the distance between each location and both the calculated centroid and the rest of the toponyms in the text (of which only those that are unambiguous or already disambiguated are taken into account). A similar method, based on Smith and Crane (2001) (89), is that of Buscaldi and Magnini (2010) (25),

2.1 Toponym disambiguation

who additionally consider information about the geographical source of the text. The authors found that most of the toponyms in a local publication (76.2% according to Buscaldi (2011) (24)) are located within 400 kilometers of the place where the newspaper is published. The authors concluded that ambiguous toponyms are spatially autocorrelated.
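The centroid heuristic described above can be sketched as follows. This is a simplified sketch: it ignores the occurrence weighting and the publication-place heuristic, and the candidate data in the usage example is invented for illustration.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def resolve_by_centroid(candidates):
    """candidates: {toponym: [(lat, lon), ...]} for every mention in a text.
    Picks, per toponym, the interpretation closest to the geographic centroid
    of all interpretations, after discarding outliers."""
    points = [p for interps in candidates.values() for p in interps]
    # Crude centroid: average of latitudes and longitudes.
    centroid = (sum(p[0] for p in points) / len(points),
                sum(p[1] for p in points) / len(points))
    dists = [haversine_km(p, centroid) for p in points]
    mean = sum(dists) / len(dists)
    std = (sum((d - mean) ** 2 for d in dists) / len(dists)) ** 0.5
    resolved = {}
    for name, interps in candidates.items():
        # Discard interpretations more than two standard deviations away.
        kept = [p for p in interps
                if haversine_km(p, centroid) <= 2 * std] or interps
        resolved[name] = min(kept, key=lambda p: haversine_km(p, centroid))
    return resolved
```

For example, given "Paris" (France or Texas), "London", and "Berlin" in one text, the Texas interpretation lies far from the centroid of all candidate points and is discarded, so "Paris" resolves to the French city.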

Knowledge-based methods. Knowledge-based methods are the most common in the literature. Rauch et al. (2003) (82) use a commercial system, MetaCarta, to build a disambiguation method based on the prominence of locations. Disambiguation in their approach starts from how often a toponym refers to a given location, and this default is only overruled if strong evidence exists, such as a significant population difference between the candidates. Amitay et al. (2004) developed Web-a-where, a rule-based system that geocodes the content of web pages based on the position and co-occurrence of a location in a taxonomy (e.g., Paris/France/Europe). The system deals only with countries (and some states and provinces) and cities of more than 5,000 inhabitants. Other methods also rely on the hierarchical structure of locations: Bensalem and Kholladi (2010) (17) exploit the proximity between toponyms in a hierarchical tree structure of locations; Buscaldi and Rosso (2008) (26) make use of the conceptual density of the hierarchical paths of the toponyms' candidates in WordNet; and Volz et al. (2007) (96) build an ontology based on the GNIS and GNS gazetteers, enriched with WordNet relations, and rank the candidates based on weights attached to the different classes of the ontology.
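The hierarchy-based idea can be illustrated with a minimal sketch, assuming each candidate referent comes with a taxonomy path of the kind used by Web-a-where. The function and data are illustrative, not any system's actual implementation: a candidate gains support for every other toponym in the text that appears among its taxonomic ancestors.

```python
def score_by_taxonomy(candidates, other_toponyms):
    """candidates: taxonomy paths for one ambiguous mention, e.g.
    ["Europe/France/Paris", "North America/United States/Texas/Paris"].
    other_toponyms: the other place names mentioned in the same text.
    Returns the candidate path best supported by co-occurring toponyms."""
    others = {t.lower() for t in other_toponyms}

    def score(path):
        # Ancestors are every element of the path except the mention itself.
        ancestors = path.lower().split("/")[:-1]
        return sum(1 for a in ancestors if a in others)

    return max(candidates, key=score)
```

A mention of "France" elsewhere in the document thus pulls "Paris" toward the French reading, because "France" lies on that candidate's hierarchical path.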

The toponym resolution strategy of Lieberman et al. (2010) (60) involves constructing local lexicons of toponyms that are likely to be known to the readers in a certain location. Local newspapers address a reduced audience for which certain knowledge is presupposed. The authors give the following example: Columbus, the capital of Ohio, has in its vicinity places named ‘Dublin’, ‘Africa’, ‘Alexandria’, ‘Bremen’, ‘London’, ‘Washington’, etc. For a reader from Columbus, the first referent of these toponyms might not be their most prominent sense but the places neighboring their city. The authors automatically create lexicons of places with their most probable referent given a certain audience, which must comply with the following three characteristics: the local lexicon must be constant

across articles from the news source, the toponyms in it must be close to each other, and the lexicon must contain a sensible number of toponyms, not too few and not too many. The authors also compute a global lexicon of locations that a general audience is likely to know. The toponym resolution is then achieved by combining a number of heuristic rules.
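A rough sketch of how such a local lexicon might be filtered from already-resolved toponyms, under the three criteria above. The thresholds and the median-based closeness test are illustrative simplifications, not the authors' actual heuristics.

```python
from collections import Counter

def build_local_lexicon(article_toponyms, min_articles=3, max_spread_deg=5.0,
                        min_size=5, max_size=50):
    """article_toponyms: one dict per article from the same news source,
    mapping toponym -> resolved (lat, lon). Returns the set of toponyms
    satisfying three (illustrative) local-lexicon criteria."""
    counts = Counter()
    coords = {}
    for article in article_toponyms:
        for name, latlon in article.items():
            counts[name] += 1
            coords[name] = latlon
    # (1) Constant across articles from the source.
    recurrent = [n for n, c in counts.items() if c >= min_articles]
    if not recurrent:
        return set()
    # (2) Spatially close: keep toponyms near the median point (in degrees,
    # a crude stand-in for a proper spatial clustering test).
    mid_lat = sorted(coords[n][0] for n in recurrent)[len(recurrent) // 2]
    mid_lon = sorted(coords[n][1] for n in recurrent)[len(recurrent) // 2]
    near = {n for n in recurrent
            if abs(coords[n][0] - mid_lat) <= max_spread_deg
            and abs(coords[n][1] - mid_lon) <= max_spread_deg}
    # (3) A sensible number of entries: neither too few nor too many.
    return near if min_size <= len(near) <= max_size else set()
```

In the Columbus example, places like the Ohio 'Dublin' and 'London' recur across local articles and cluster tightly, so they survive the filter, while a one-off distant mention does not.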

Data-driven methods. Data-driven methods enjoyed little popularity at the beginning due to the lack of annotated data and to the problem of unseen classes. Nevertheless, recent years have witnessed an increase in these kinds of methods. Most approaches that have looked at the document context in order to find disambiguation cues have restricted the textual context of the document to the set of nearby toponyms.

This is the case of Overell (2009) (72), who extracts training instances from Wikipedia, using only toponyms as features. In recent years, some methods have appeared that also use non-geographic context words for disambiguation. Roberts et al. (2010) (84) propose a probabilistic model that uses the spatial relationships between locations (collected from the GeoNames gazetteer) and the people and organizations related to these locations (extracted from Wikipedia). The presence of non-geographic entities adds disambiguation power to the task, the authors argue. Qin et al. (2010) (78) represent the different location candidates in a hierarchy tree mined from the Web and base their disambiguation strategy on a score propagation algorithm.
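The toponyms-as-features idea can be sketched as a tiny per-toponym classifier, here a naive-Bayes-like scorer over co-occurring toponyms. This is an illustrative simplification, not Overell's actual model; the class name and smoothing are invented for the sketch.

```python
from collections import defaultdict
import math

class ToponymClassifier:
    """One small classifier per ambiguous toponym: the features are the
    other toponyms found in the surrounding text."""

    def __init__(self):
        # referent -> feature toponym -> co-occurrence count
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)  # referent -> training instances

    def train(self, referent, context_toponyms):
        """One training instance: a known referent with its context."""
        for t in context_toponyms:
            self.counts[referent][t] += 1
        self.totals[referent] += 1

    def predict(self, context_toponyms):
        """Score each known referent by prior plus smoothed feature counts."""
        def score(ref):
            s = math.log(self.totals[ref])  # log-prior from training frequency
            for t in context_toponyms:
                s += math.log(self.counts[ref].get(t, 0) + 1)  # add-one smoothing
            return s
        return max(self.totals, key=score)
```

Training one such model per toponym from geotagged Wikipedia articles, as Overell does with toponym-only features, then reduces resolution to ordinary text classification.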

Speriosu and Baldridge (2013) (91) propose a supervised learning method by means of indirect supervision in which non-toponym words are also used as textual context.

The authors see the task as a traditional classification task and train models from geotagged Wikipedia articles. A text classifier is learned for each toponym and is used to disambiguate toponyms both in current news articles and historical texts from the American Civil War, in both cases with good results. More recently, DeLozier et al. (2015) (35) developed a method that does not rely on knowledge from gazetteers for the disambiguation of toponyms. Instead, they model the geographic distributions of words over the surface of the Earth: for each word, a geographic profile is computed based on local spatial statistics over a set of language models annotated with coordinates (the authors use GeoWiki, a subset of articles from Wikipedia that contain latitude and longitude). The disambiguator returns the most likely set of coordinates


and the closest referent in a gazetteer. This method significantly outperforms other state-of-the-art toponym resolvers.
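The gazetteer-free idea can be sketched with a toy grid model: each word gets a profile over grid cells learned from geotagged text, and the context words vote for a cell. This is a drastic simplification of the local spatial statistics used by DeLozier et al.; the grid cells and data are invented for illustration.

```python
from collections import defaultdict

def most_likely_cell(context_words, word_cell_counts):
    """word_cell_counts: word -> {grid_cell: count}, learned from geotagged
    text (e.g. Wikipedia articles carrying coordinates). Grid cells here are
    (lat_band, lon_band) integer pairs, a toy stand-in for a real grid.
    Returns the cell whose word profiles best match the context."""
    scores = defaultdict(float)
    for w in context_words:
        cells = word_cell_counts.get(w, {})  # unknown words are skipped
        total = sum(cells.values())
        for cell, c in cells.items():
            # Each word distributes one vote across its grid cells.
            scores[cell] += c / total
    return max(scores, key=scores.get) if scores else None
```

The predicted cell's coordinates could then be matched against a gazetteer to pick the closest referent, mirroring the final step described above.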