Similarity of documents represented as social networks

4.3 Disambiguating person names

4.3.2 Similarity of documents represented as social networks

The core idea behind assessing the similarity of documents as social networks is that the social circle of people may be a useful indicator of who they are: it is their social context.

This similarity measure helps to decide whether two documents mentioning the same person name should be clustered together, and their network representations therefore merged. A very naive strategy would be to join all the documents and merge their networks whenever they share at least one person name (apart from the query name, which is common to all the documents). This would amount to assuming that nobody knows two different people who share the same name. Such a strategy is obviously dangerous, since person names in plain text may be introduced as unspecifically as by just a first or a last name (e.g. two networks would also be joined if a very ambiguous name, such as ‘John’, is a node in both of them).

This strategy assumes that a node that is shared by two networks (henceforth overlapping node) corresponds to the same entity. This would mean that the same person name occurring in two different documents corresponds to the same person.

This is of course not necessarily the case. Besides, as already seen, mention names can range from single tokens to multiple tokens, and they can be extremely ambiguous (such as ‘John’) or very unambiguous (such as ‘Edward Cornelis Florentius Alfonsus Schillebeeckx’). The confidence that we are talking about the very same person varies greatly from the first case to the second. The likelihood that two documents belong to the same cluster given a certain overlap of person names will therefore depend in great measure on the ‘quality’ of these overlaps. An overlapping person name that provides strong evidence that we are dealing with a single entity (i.e. a name that has a low ambiguity) is considered of higher quality than an overlapping name that provides little evidence that it corresponds to a single entity (i.e. a name that has a high ambiguity).
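The naive strategy discussed above can be made concrete with a minimal sketch. It assumes, purely for illustration, that each network is reduced to the set of person-name strings on its nodes (edges are ignored); the function names `overlapping_nodes` and `naive_should_merge` are hypothetical and do not come from the original system.

```python
def overlapping_nodes(network_a, network_b, query_name):
    """Return the names shared by two networks, excluding the query name.

    Assumption: a network is represented as a collection of person-name
    strings; matching is by exact string equality.
    """
    return (set(network_a) & set(network_b)) - {query_name}


def naive_should_merge(network_a, network_b, query_name):
    """Naive strategy: merge as soon as one non-query name is shared."""
    return len(overlapping_nodes(network_a, network_b, query_name)) >= 1


# Two documents about 'Hobbs' that share only the very ambiguous 'John'.
doc_a = {"Hobbs", "John", "Bert Hassell"}
doc_b = {"Hobbs", "John", "Mary"}
shared = overlapping_nodes(doc_a, doc_b, "Hobbs")
merge = naive_should_merge(doc_a, doc_b, "Hobbs")
```

Here `merge` is true although the only evidence is the name ‘John’, which is exactly the weakness that the quality of the overlaps is meant to address.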

In section 4.2.2, I described the process of assigning a numerical value that assesses the ambiguity of each person name that can be encountered in a text. I distinguish between three degrees of ambiguity: low, medium, and high. Low ambiguity consists of the multiple-token names and the least ambiguous two-token names. High ambiguity consists of the single-token names and the most ambiguous two-token names. Finally, medium ambiguity consists of the names that fall into the middle of the spectrum (see figure 4.3).

Each name in each document (and its respective node in the social network representation of the document) is assigned a numerical ambiguity value and given an ambiguity degree. Table 4.3 provides the resulting ambiguity values for the names from example 13.


Figure 4.3: Person name ambiguity spectrum with ambiguity degrees.

Person name      Numerical ambiguity value   Ambiguity degree
Hobbs            80.06                       High
Bert Hassell     20.36                       Low
Parker Cramer    20.58                       Low
Hubert Wilkins   21.16                       Low

Table 4.3: Numerical ambiguity values and ambiguity degree of the names from example 13.

4.3.2.1 Learning clustering probabilities

In order to understand how reliable it is to cluster documents together when their social network representations share a certain number of nodes independently from the quality of these nodes, I decided to learn clustering probabilities from the development set from the CRIPCO corpus, consisting of 105 different query names. I computed the ambiguity for each of these names according to the method described in subsection 4.2.2 and classified them into ten different ranges (i.e. [0.0-0.1], [0.1-0.2], [0.2-0.3], etc.). For each query name in each ambiguity range, I learned the probability that two documents belong to the same entity if their social network representations had no nodes in common (apart from the query name), one node in common, two nodes in common, or three nodes in common. It was observed that, in all ambiguity ranges, whenever two networks shared four or more nodes, there was a near-unity probability that their respective documents belonged together.

The clustering probabilities learned from the development set revealed the importance of assessing the ambiguity of the target person name to be disambiguated. Whereas two documents mentioning a name of the lowest ambiguity range had a very high probability (up to 0.96) of referring to the same entity even when their network representations had no nodes in common (other than the one corresponding to the query name), a highly ambiguous name had a much lower probability (0.13) of referring to the same entity in the case of no nodes in common. The data showed that the larger the number of overlapping nodes between the two networks, the higher the probability that two documents containing the same query name refer to the same person.
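The estimation described in this subsection can be sketched as a relative-frequency count over document pairs. This is only an illustration under stated assumptions, not the procedure actually used: the ambiguity score is assumed to be scaled to [0, 1], each document is assumed to carry a gold entity label plus the set of names in its network, and overlaps of four or more nodes are pooled into a single bucket. The function names and the input layout are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def ambiguity_range(ambiguity, n_ranges=10):
    """Map an ambiguity score in [0, 1] to one of n_ranges equal-width bins."""
    return min(int(ambiguity * n_ranges), n_ranges - 1)


def learn_clustering_probabilities(queries, max_overlap=3):
    """Estimate P(same entity | ambiguity range, number of shared nodes).

    queries: iterable of (query_name, ambiguity, documents), where documents
    is a list of (entity_id, names) pairs (hypothetical input layout).
    Overlaps larger than max_overlap are pooled under max_overlap + 1.
    """
    same = defaultdict(int)
    total = defaultdict(int)
    for query_name, ambiguity, documents in queries:
        rng = ambiguity_range(ambiguity)
        for (entity_a, names_a), (entity_b, names_b) in combinations(documents, 2):
            # The query name is shared by construction, so it is not counted.
            shared = (set(names_a) & set(names_b)) - {query_name}
            key = (rng, min(len(shared), max_overlap + 1))
            total[key] += 1
            same[key] += int(entity_a == entity_b)
    return {key: same[key] / total[key] for key in total}


# Toy input (invented for illustration, not taken from the CRIPCO corpus).
toy = [
    ("Hobbs", 0.85, [
        ("e1", {"Hobbs", "Bert Hassell"}),
        ("e1", {"Hobbs", "Bert Hassell"}),
        ("e2", {"Hobbs"}),
    ]),
]
probabilities = learn_clustering_probabilities(toy)
```

The result is a dictionary keyed by (ambiguity range, number of shared nodes), which is the shape of lookup that a later penalty step would need.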

4.3.2.2 Penalizing lower quality overlaps

The learned probabilities do not take into account the quality of the person names in common. Returning to example 13, supposing ‘Hobbs’ is the query name, it is evident that two social networks whose names in common have a low ambiguity (such as ‘Parker Cramer’, ‘Bert Hassell’, and ‘Hubert Wilkins’) have a higher probability of referring to the same person (i.e. the very same person behind the name ‘Hobbs’) than two social networks whose names in common have a high ambiguity (such as ‘John’, ‘Mary’, and ‘William’). A penalty function is defined to lower the learned probabilities when applied to networks with overlapping nodes of lower quality. The probability of two networks being clustered together is lowered according to the decreasing probability function (dp) in equation 4.1:

\[ dp = \frac{Pr(n[i]) - Pr(n[i-1])}{i + 1} \tag{4.1} \]

where i is the number of overlapping nodes (excluding the overlapping node corresponding to the query name) between two documents, n is the set of networks sharing a certain number i of nodes, and Pr(n[i]) is the probability that two networks belong together if they have i nodes in common. This is then applied to the penalty function as shown in equation 4.2:

\[ penalty = Pr(n[i]) + ooq \cdot dp \tag{4.2} \]

The penalty function adds to the probability that two networks belong together (given a number i of nodes in common) the decreased probability dp multiplied by the overall quality of the nodes in common, ooq, which is the sum of the quality values of the nodes that are shared by the two networks. The quality of a node is inversely related to the ambiguity of its person name: a high-ambiguity person name has a low quality, whereas a low-ambiguity person name has a high quality. A numeric value is given to each node according to its quality: a node with a low-ambiguity person name (↓) is represented by 0, a node with a medium-ambiguity person name (→) is represented by −1, and a node with a high-ambiguity person name (↑) is represented by −2. If the two networks have more than one mention node in common, the numeric values behind the ambiguity degree of each node are added up. Since ooq is therefore never positive, the penalty can only lower the learned probability or leave it unchanged. Table 4.4 shows how probabilities are recalculated, according to the penalty function in equation 4.2, depending on the number of nodes in common and the ambiguity (i.e. inverse quality) of the nodes’ person names. The idea behind the penalty function is that the probability that two documents that mention the same query name actually refer to the same entity is lower the more ambiguous the overlapping nodes in their social network representations are, and vice versa.
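Equations 4.1 and 4.2 can be sketched in a few lines of Python. The sketch makes two assumptions that the text does not state: the learned probabilities for one ambiguity range are passed as a list `probs` indexed by the number of overlapping nodes, and the case i = 0 returns Pr(n[0]) unchanged, since with no overlapping nodes ooq is an empty sum and Pr(n[i−1]) is not defined. The probability values in the example are invented for illustration and are not the learned ones.

```python
# Numeric quality per ambiguity degree, as given in the text.
QUALITY = {"low": 0, "medium": -1, "high": -2}


def decreasing_probability(probs, i):
    """Equation 4.1: dp = (Pr(n[i]) - Pr(n[i-1])) / (i + 1)."""
    if i < 1:
        raise ValueError("dp needs i >= 1, since it refers to Pr(n[i-1])")
    return (probs[i] - probs[i - 1]) / (i + 1)


def penalized_probability(probs, degrees):
    """Equation 4.2: penalty = Pr(n[i]) + ooq * dp.

    probs:   probs[i] is Pr(n[i]) for one ambiguity range (assumed layout).
    degrees: ambiguity degree of each overlapping node, query name excluded.
    """
    i = len(degrees)
    if i == 0:
        # Assumption: no overlapping nodes means ooq = 0, so nothing changes.
        return probs[0]
    ooq = sum(QUALITY[d] for d in degrees)  # overall quality, always <= 0
    return probs[i] + ooq * decreasing_probability(probs, i)


# Hypothetical probabilities for 0, 1, 2 and 3 overlapping nodes.
example_probs = [0.2, 0.5, 0.8, 0.9]
three_low = penalized_probability(example_probs, ["low", "low", "low"])
one_high = penalized_probability(example_probs, ["high"])
```

With these made-up values, three low-ambiguity overlaps leave the probability for three shared nodes untouched, while a single high-ambiguity overlap pulls the probability for one shared node back down to the no-overlap level.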
