Extraction of context features - Building a resource

3.2 Resources

3.2.3 Building a resource

3.2.3.5 Extraction of context features

Additionally, a series of semantic features is extracted for each location, complementing the information about the location in natural language. These include the title of the location's Wikipedia entry¹ and, if the title contains an apposition, both the stripped title and the apposition (e.g. for ‘Trois-Rivières, Martinique’, both ‘Trois-Rivières’ and ‘Martinique’ are stored as semantic features). The names of the country and the region of the location are also included as semantic features. Finally, I also extract context words from the parts of the body of the Wikipedia article that have historical content. This requires some text processing, as described below.

¹ The last part of the Wikipedia URL.
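The title features described above can be illustrated with a minimal sketch. The function name `title_features` is hypothetical, and the handling of parenthesised appositions (as in ‘Georgia (country)’) is an assumption; the text only gives the comma-separated case as an example.

```python
import re
from urllib.parse import unquote


def title_features(url: str) -> list:
    """Return the entry title and, if present, the stripped title and apposition."""
    # The title is the last part of the URL; underscores stand for spaces.
    title = unquote(url.rstrip("/").rsplit("/", 1)[-1]).replace("_", " ")
    features = [title]
    # Assumed apposition forms: 'Name, Apposition' or 'Name (Apposition)'.
    match = (re.fullmatch(r"(.+?),\s*(.+)", title)
             or re.fullmatch(r"(.+?)\s*\((.+)\)", title))
    if match:
        features.extend(part.strip() for part in match.groups())
    return features


print(title_features("https://en.wikipedia.org/wiki/Trois-Rivi%C3%A8res,_Martinique"))
```

A title without an apposition, such as ‘Göttingen’, yields only the title itself.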

Obtaining context words. The body of any Wikipedia article is written in Wikipedia markup syntax (wikitext), which is rendered when the article is displayed in a browser. The markup contains some information that is relevant in terms of content (such as section splitting, lists, and internal links) but also some metadata that is irrelevant for the research conducted here (such as pronunciation aids, images, or references and cited sources). Figure 3.7 shows an example of wikitext from the Wikipedia article on Göttingen.

Figure 3.7: Fragment of the source text of the Wikipedia entry for Göttingen.

To remove unnecessary metadata, some preprocessing is needed. Some steps to clean the wikitext of irrelevant information are common to all languages, such as the removal of various XML tags and other markup elements. To reduce noise, I also remove the birth and death dates of some people with a very naive strategy: these dates are usually introduced the first time a person is mentioned and, assuming the person has an entry in Wikipedia, they are usually expressed in the form [[person name]] (birthyear-deathyear), as in [[Albert Einstein]] (1879-1955).
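The naive removal of birth and death dates could look roughly like the following sketch. The pattern and the name `strip_lifespans` are assumptions for illustration; the text does not give the exact expression used.

```python
import re

# A free link ([[...]]) directly followed by (birthyear-deathyear).
# Both a hyphen and an en dash are accepted between the years (assumption).
LIFESPAN = re.compile(r"(\[\[[^\[\]]+\]\])\s*\(\d{3,4}\s*[-–]\s*\d{3,4}\)")


def strip_lifespans(wikitext: str) -> str:
    """Drop the parenthesised life dates but keep the link itself."""
    return LIFESPAN.sub(r"\1", wikitext)


print(strip_lifespans("taught by [[Albert Einstein]] (1879-1955) in Berlin"))
```

Because the pattern requires a preceding link, year ranges that are not attached to a person's entry, such as ‘the war (1914-1918)’, are left untouched.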

In any Wikipedia entry, the body of an article ideally consists of two parts: the lead section and the content. The lead section is an introduction to the entity in question and comes before the table of contents, if there is one. In that case, the content follows the table of contents and is divided into sections. The context information provided by some sections is more useful than that provided by others. The titles of all sections and subsections are marked by two or more surrounding equal signs (‘=’), as in ==History== or ===Modern history===. The names of the sections are not standardized and differ from one language to another. I removed several full sections that I considered irrelevant for the task, since they rarely contain context words of interest for the disambiguation strategy and could introduce noise. They are the following, in each language:

• English: Notes, See also, References, External links, Citations, Further reading, Bibliography, Sources, Footnotes, Gallery, Twin cities, Other sources, Web resources, Etymology, Twin towns, Sister cities.

• German: Literatur, Quellen, Einzelnachweise, Siehe auch, Bilder, Partnerschaft, Fußnoten, Namensherkunft, Etymologie, Städtepartnerschaften, Galerie, Weblinks, Allgemeine Quellen, Literatur und Quellen, Name, Filmografie, Weiterführende Informationen, Anmerkungen, Bildergalerie, Namensgebung.

• Dutch: Stedenbanden, Zie ook, Externe link, Noot, Referenties, Citaten, Literatuurlijst, Bronnen, Afbeeldingen, Naam, Foto’s, Noten, Literatuur, Galerij, Naamsgeschiedenis, Bronvermelding en referenties, De naam, Etymologie.
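Removing whole sections by title can be sketched as follows. This is only an illustration under stated assumptions: `IGNORED_SECTIONS` holds a subset of the English titles listed above, titles are compared case-insensitively, and a removed section is assumed to include its subsections.

```python
import re

# Subset of the English section titles listed above, lowercased for comparison.
IGNORED_SECTIONS = {"notes", "see also", "references", "external links",
                    "further reading", "bibliography", "etymology"}

# A heading is a title surrounded by the same number (two or more) of '=' signs.
HEADING = re.compile(r"^(={2,})\s*(.+?)\s*\1\s*$")


def drop_sections(wikitext: str, ignored=IGNORED_SECTIONS) -> str:
    kept = []
    skip_level = None  # heading level of the section currently being skipped
    for line in wikitext.splitlines():
        m = HEADING.match(line)
        if m:
            level, title = len(m.group(1)), m.group(2).lower()
            # A heading of the same or a higher rank ends the skipped section.
            if skip_level is not None and level <= skip_level:
                skip_level = None
            if skip_level is None and title in ignored:
                skip_level = level
        if skip_level is None:
            kept.append(line)
    return "\n".join(kept)
```

For German or Dutch articles, the same function would be called with the corresponding list of titles passed as `ignored`.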

My aim is to extract historical contexts for the locations, specifically from late modern and contemporary history, spanning the 19th and 20th centuries. In extracting words related to these periods from the Wikipedia entry corresponding to a location, two scenarios are possible:

• The article does not have a ‘History’ section: If the article does not have an explicit ‘History’ section, all sentences from the remaining sections that contain a number between 1800 and 1999 are set aside. For each sentence, each mention of a number between 1800 and 1999 is converted into a decade (e.g. the mention ‘1834’ is converted into 1830), and the sentence is stored as part of this decade. If more than one decade is found in a sentence, the sentence is assigned to each of these decades. If a sentence contains a time period, the sentence is stored for each decade the period spans. For example, the sentence ‘The beginning of the Napoleonic Wars (1804–15) is often chosen as a convenient point in time with which to date the end of the Enlightenment’ would be assigned to the decades 1800 and 1810, as it would be if the period were expressed as ‘between 1804 and 1815’ or in similar phrasings.

• The article has a ‘History’ section: I approach articles that have an explicit ‘History’ section (‘Geschichte’ in German, ‘Geschiedenis’ in Dutch) in a different manner. In this case, only this section and its subsections are considered, and the rest of the text of the article is disregarded. The sentences are processed one after the other. The first time a number between 1800 and 1999 appears in a paragraph, the decade to which it belongs is stored. If no date is specified in the following sentences, the stored decade is assigned to them, and so forth until the next sentence in which a year is specified. This decision rests on the assumption that the history section usually follows a coherent chronological order.
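The two scenarios above can be sketched as follows. This is a minimal illustration, not the original implementation: all names are hypothetical, sentences are assumed to be already split, only a few period phrasings are recognised, and in the ‘History’ case the stored decade is assumed to be reset at each paragraph, with all decades of the most recent dated sentence carried forward.

```python
import re
from collections import defaultdict

# Periods such as '1804–15', '1804-1815', 'between 1804 and 1815', 'from 1804 to 1815'.
PERIOD = re.compile(
    r"\b(1[89]\d{2})\s*[-–]\s*(\d{4}|\d{2})\b"
    r"|\b(?:between|from)\s+(1[89]\d{2})\s+(?:and|to|until)\s+(\d{4})\b",
    re.IGNORECASE,
)
SINGLE = re.compile(r"\b(1[89]\d{2})\b")  # any number between 1800 and 1999


def decades_in(sentence):
    """Decades (e.g. 1830) mentioned in a sentence, with periods expanded."""
    found = set()
    for m in PERIOD.finditer(sentence):
        start = int(m.group(1) or m.group(3))
        end_raw = m.group(2) or m.group(4)
        # An abbreviated end year ('15') inherits the century of the start year.
        end = int(end_raw) if len(end_raw) == 4 else start // 100 * 100 + int(end_raw)
        end = min(end, 1999)
        if start <= end:
            found.update(range(start // 10 * 10, end // 10 * 10 + 1, 10))
    for m in SINGLE.finditer(sentence):
        found.add(int(m.group(1)) // 10 * 10)
    return sorted(found)


def assign_without_history(sentences):
    """No 'History' section: store each dated sentence under every decade it mentions."""
    by_decade = defaultdict(list)
    for sentence in sentences:
        for decade in decades_in(sentence):
            by_decade[decade].append(sentence)
    return dict(by_decade)


def assign_with_history(paragraphs):
    """'History' section: undated sentences inherit the last decade seen in the paragraph."""
    by_decade = defaultdict(list)
    for sentences in paragraphs:
        current = []  # assumption: the stored decade does not cross paragraphs
        for sentence in sentences:
            current = decades_in(sentence) or current
            for decade in current:
                by_decade[decade].append(sentence)
    return dict(by_decade)
```

With this sketch, the example sentence about the Napoleonic Wars is assigned to 1800 and 1810 in both phrasings, whereas two unrelated years joined by ‘and’ (without ‘between’) are treated as two separate mentions.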

For each Wikipedia page referring to a location (i.e. for each Wikipedia page for which coordinates could be extracted), a set of (decade, sentence) tuples is obtained. For each decade, the sentences are tokenized, and function words, frequent words, and punctuation marks are removed. The remaining words are considered context words.² Below are some examples of words extracted for some decades for two locations, Göttingen and Curaçao, according to GeoSemKB_en:

Göttingen

1800: westphalia, electorate, prussia, brunswick, hanover, lüneburg, napoleon, etc.

1860: province, prussia, voted, cause, war, kingdom, declaring, neutral, hannover, etc.

1930: war, housed, prevented, albert, kristallnacht, process, etc.

1940: heavily, caused, comparatively, kassel, bombing, impact, timbered, etc.

1970: dissolved, district, enlarged, orm, münden, hannoversch, incorporating, etc.

Curaçao

1810: napoleonic, island, dutch, colony, wars, dependencies, incorporated, stable, etc.

1860: cane, owner, exchange, caribbean, harvest, sugar, plantation, master, slave, etc.

1910: government, maracaibo, traffic, panama, increase, attracted, future, dutch, etc.

1960: discontent, process, protest, uprising, islanders, rioting, antagonisms, social, etc.

1980: downturn, venezuela, independence, stagnation, atmospheric, tourism, creole, etc.

² Free links can be customized in Wikipedia articles, so the text displayed in the browser may differ from the title of the article that the link points to. For example, the string [[Intercity-Express|ICE]] produces the text ‘ICE’ but links to the Intercity-Express page. In these cases, both the element before and the element after the vertical bar are taken as context words.
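The extraction of context words from a sentence, including the treatment of customized free links, might be sketched as below. The stop list is a tiny placeholder, since the function-word and frequent-word lists actually used are not given here, and the regex-based tokenizer is an assumption.

```python
import re

# Placeholder stop list (assumption); a real list would be language-specific
# and would also cover frequent words.
STOPWORDS = {"the", "of", "in", "a", "an", "and", "to", "was", "is", "by", "it", "with"}

PIPED_LINK = re.compile(r"\[\[([^\[\]|]+)\|([^\[\]]+)\]\]")  # [[Target|Label]]
PLAIN_LINK = re.compile(r"\[\[([^\[\]|]+)\]\]")              # [[Target]]


def context_words(sentence, stopwords=STOPWORDS):
    # Keep both the link target and the displayed label of customized links.
    text = PIPED_LINK.sub(r"\1 \2", sentence)
    text = PLAIN_LINK.sub(r"\1", text)
    # \w+ keeps letters (including ö, ç) and digits and drops punctuation.
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stopwords]


print(context_words("The [[Intercity-Express|ICE]] stops in [[Göttingen]]."))
```

Note that this simple tokenizer splits hyphenated names such as ‘Intercity-Express’ into two tokens; whether the original procedure does the same is not stated.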

In addition, the context words from the lead section (i.e. the introduction of the article) are stored as context features for each entity, limited to the first three paragraphs when the lead section is longer. Some of the ‘intro’ context words for Göttingen are ‘town’, ‘lower’, ‘saxony’, ‘university’, ‘capital’, ‘leine’, and ‘germany’; and for Curaçao, ‘uninhabited’, ‘papiamentu’, ‘dissolution’, ‘sea’, ‘antilles’, and ‘caribbean’.
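Selecting the lead section and limiting it to three paragraphs could be done along these lines. The sketch assumes that the lead is everything before the first section heading and that paragraphs are separated by blank lines; `lead_paragraphs` is a hypothetical name.

```python
import re


def lead_paragraphs(wikitext, limit=3):
    # The lead is the text before the first line that starts a heading ('==').
    lead = re.split(r"^==", wikitext, maxsplit=1, flags=re.MULTILINE)[0]
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", lead) if p.strip()]
    return paragraphs[:limit]
```

The resulting paragraphs would then go through the same tokenization and filtering as the dated sentences.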

From the document Entity-Centric Text Mining for Historical Documents (pages 70-74)