
Person name disambiguation

1.1 Background

1.1.2 Person name disambiguation

Resolving and disambiguating person names across documents is an open problem in natural language processing. Its difficulty stems from the high ambiguity that is often associated with person names.¹ The following examples² illustrate the importance and difficulty of the task:

(3) UAW president Stephen Yokich then met separately for at least an hour with chief executives Robert Eaton of Chrysler Corp., Alex Trotman of Ford Motor Co. and finally with John Smith Jr. of General Motors Corp.

(4) Blair became Labour leader after the sudden death of his successor John Smith in 1994 and since then has steadily purged the party of its high-spend and high-tax policies and its commitment to national ownership of industrial assets.

¹ According to the U.S. Census Bureau, only 90,000 different names are shared by up to 100 million people, as stated in Artiles et al. (2009) (12).

² Source: John Smith Corpus, introduced in Bagga and Baldwin (1998) (14).

(5) Two years ago, Powell switched coaches from Randy Huntington to John Smith, who is renowned for his work with sprinters from 100 to 400 meters.

In the examples, person names have been marked in bold. One name, ‘John Smith’, occurs in all three texts but corresponds to a different real-world person in each of them: the CEO of General Motors, the Labour Party leader, and an athletics coach. The goal of person name disambiguation is to find the correct referent for each person name, i.e. the person that the writer of the text actually means.

The literature follows two main strategies: the task is approached either as an entity linking task or as a cross-document coreference resolution task.

Entity linking resolves the different person name mentions by linking them to their respective entries in a knowledge base. Different kinds of knowledge bases have been used in the past, from encyclopedias like Wikipedia to specific databases created for a given collection. For instance, if the knowledge base is Wikipedia, the ‘John Smith’ mention of example 3 would be resolved to http://en.wikipedia.org/wiki/John_F._Smith_Jr., the mention of example 4 to http://en.wikipedia.org/wiki/John_Smith_(Labour_Party_leader), and the mention of example 5 to http://en.wikipedia.org/wiki/John_Smith_(sprinter). In addition, the remaining person name mentions would also be linked to their corresponding Wikipedia entries, where these exist.
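As a rough illustration of the entity linking setting, the following sketch links a mention to a knowledge base entry by counting the words its context shares with a keyword set for each entry. This is a deliberately naive stand-in and not a method from the literature cited here: the keyword sets, the function names `tokenize` and `link_mention`, and the overlap scoring are all assumptions made for the example. Only the three URLs are taken from the text above.

```python
import re
from typing import Dict, Optional, Set

# Hypothetical toy knowledge base: entry URL -> descriptive keywords.
# The URLs come from the text; the keyword sets are invented for illustration.
KNOWLEDGE_BASE: Dict[str, Set[str]] = {
    "http://en.wikipedia.org/wiki/John_F._Smith_Jr.":
        {"general", "motors", "chrysler", "ford", "executive"},
    "http://en.wikipedia.org/wiki/John_Smith_(Labour_Party_leader)":
        {"labour", "party", "leader", "blair"},
    "http://en.wikipedia.org/wiki/John_Smith_(sprinter)":
        {"coach", "sprinters", "meters", "athletics"},
}


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def link_mention(context: str,
                 kb: Dict[str, Set[str]] = KNOWLEDGE_BASE) -> Optional[str]:
    """Return the entry whose keywords overlap most with the context.

    Returns None when no entry shares any word with the context, i.e.
    when the mentioned person is (as far as this toy scorer can tell)
    not covered by the knowledge base.
    """
    tokens = tokenize(context)
    best_url: Optional[str] = None
    best_overlap = 0
    for url, keywords in kb.items():
        overlap = len(tokens & keywords)
        if overlap > best_overlap:
            best_url, best_overlap = url, overlap
    return best_url
```

Under these assumptions, the context of example 5 shares ‘sprinters’ and ‘meters’ with the third entry and is linked to it, whereas a context with no matching keywords is left unlinked. The candidate set is fixed in advance: the function can only ever return an entry that already exists in the knowledge base.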

Cross-document coreference resolution takes a very different approach. Given a query consisting of a person name and a collection of documents in which this name occurs (e.g. ‘John Smith’ in the three examples), the task is to group these documents according to the different real-world entities (i.e. persons) behind the identical person name. Given the collection from which examples 3, 4, and 5 are taken, a cross-document coreference resolution system would return one cluster for each distinct entity that answers to the name ‘John Smith’ in the collection, where each cluster contains all the documents in which that particular person is referred to.
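The input and output of this setting can be sketched as follows: a query name and a collection of documents go in, and one cluster of document identifiers per presumed entity comes out. The clustering itself is a simple greedy, single-pass procedure over bag-of-words Jaccard similarity, chosen only because it is short. It is an assumption made for illustration and not the approach developed in this thesis. The function names, the stopword list, the threshold value, and the four toy documents are likewise hypothetical.

```python
import re
from typing import Dict, List, Set

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its alphabetic tokens minus stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 if either set is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_documents(documents: Dict[str, str], query_name: str,
                      threshold: float = 0.2) -> List[List[str]]:
    """Greedily group documents that mention the same query name.

    Each document joins the most similar existing cluster if the
    similarity reaches the threshold; otherwise it starts a new cluster.
    The query name itself is ignored, since it occurs in every document.
    """
    query_tokens = tokenize(query_name)
    clusters: List[dict] = []
    for doc_id, text in documents.items():
        tokens = tokenize(text) - query_tokens
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = jaccard(tokens, cluster["tokens"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"tokens": set(tokens), "docs": [doc_id]})
        else:
            best["tokens"] |= tokens
            best["docs"].append(doc_id)
    return [cluster["docs"] for cluster in clusters]


# Toy collection (invented sentences) in which 'John Smith' has two referents.
docs = {
    "d1": "John Smith of General Motors met union leaders",
    "d2": "General Motors chief John Smith announced new cars",
    "d3": "Coach John Smith trains sprinters in athletics",
    "d4": "Sprinters praise coach John Smith",
}
print(cluster_documents(docs, "John Smith"))  # [['d1', 'd2'], ['d3', 'd4']]
```

Unlike entity linking, nothing here depends on a predefined inventory of persons: the number of clusters is determined by the documents themselves, so a person who appears in no knowledge base still receives a cluster of their own.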

Both approaches have clear advantages and disadvantages. Entity linking offers faster and clearer retrieval of person entities, which are moreover linked to a knowledge base that directly provides information about each entity. However, it is a classification task in which the potential target entities are limited to the ones present in the knowledge base.¹ This is a major problem when working with historical news articles, which very often come from highly regional collections and mention people who might not be recorded in most knowledge bases. Mainly for this reason, the approach chosen in this thesis is to treat the problem as a cross-document coreference resolution task.

1.1.2.1 Terminology

Before proceeding any further, for clarity I explain the different terms associated with the person name disambiguation task that are used in this thesis, distinguishing between concepts and tasks.

Concepts. I describe here the concepts used throughout this chapter, illustrated with an example:

(6) The character of John Smith expresses some of the confusion in Alexie’s own upbringing. He was raised in Wellpinit, the only town on the Spokane Indian Reservation.

A person name is any named entity expression in a text referring to a person. The person names in example 6 have been marked in bold. An entity (or person) is the real-world referent that is referred to by a person name. In the example, ‘John Smith’ and ‘Alexie’ are person names, and the real persons behind these names are entities.

The query name is the target person name to disambiguate, in this case ‘John Smith’, which is mentioned at least once per document. I proceed on the widely held ‘one sense per discourse’ assumption, according to which all occurrences of the same person name within a document are taken to refer to the same entity. A mention name is any person name that is mentioned in a document and that is not the query name (i.e. ‘Alexie’ in the example). I call a full name any person name with at least two tokens (ideally a first and last name, even though this is not necessarily always the case), whereas a namepart is each of the tokens that form a full name. In the

¹ Most recent approaches also allow marking entities as NIL if they are not present in the knowledge base. In practice, this often means that all entities that are not present (or not found) in the knowledge base are classified together, regardless of how different they are.