
Related Work

3.2 Entity Linking

In the previous section, we provided an overview of Relation Linking (RL) systems over structured data.

In addition to Relation Linking, we examine the progress made on the Entity Linking task, in particular with respect to including KG context to assist the task. This section concentrates on approaches for entity linking, laying the foundation for the contributions in chapters 5 and 6.

End-to-end Entity Linking

Research has traditionally treated Named Entity Recognition and Disambiguation (NERD) (also referred to as Entity Linking (EL) or Named Entity Resolution) as a three-step process involving entity spotting, candidate selection, and disambiguation [169]. A wide range of tools and research work exists in the area of NER and NED, which can mainly be attributed to the fact that the NER/NED (jointly, NERD) task is closely similar for free text and for restricted domains such as question answering, albeit with a few differences. The tool TagMe [170] is one of the popular works in this area; it links elements in short texts to their corresponding Wikipedia pages. It uses a dictionary of entity surface forms extracted from Wikipedia to detect entity mentions in the parsed input text. These mentions are then passed through a voting scheme that computes the score for each mention-entity pair as the sum of votes given by the candidate entities of all other mentions in the text [170]. Finally, a pruning step filters out less relevant annotations. Researchers in [171] used the term Wikification to refer to this process, the difference being that Wikification is not restricted to named entities: any keyword terms in the text are identified and linked to a Wikipedia article. The process of extracting these keywords is similar to that of entities in that it involves a controlled vocabulary that acts as a look-up for n-grams, which form candidate entities when matched against the vocabulary. Candidate entities are then ranked via various algorithms and passed through a Word Sense Disambiguation (WSD) algorithm to resolve them into Wikipedia pages [171]. We can observe the constant need for background-knowledge-assisted models already in these two early works, where researchers sought to represent knowledge context through look-up tables and dictionaries.

1 http://www.okbqa.org/
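
To make the collective voting idea concrete, the following is a minimal Python sketch of scoring mention-entity pairs by votes from the candidates of the other mentions. It illustrates the general scheme rather than TagMe's exact formula, and the `relatedness` function (e.g. a Wikipedia link-overlap measure) is assumed to be given.

```python
from typing import Callable, Dict, List, Tuple

# Minimal sketch of collective voting for entity annotation (illustrative,
# not TagMe's exact scoring function). `relatedness` is an assumed black box,
# e.g. a Wikipedia link-overlap relatedness measure between two entities.

def score_candidates(
    mentions: Dict[str, List[str]],            # mention -> candidate entities
    relatedness: Callable[[str, str], float],  # entity x entity -> [0, 1]
) -> Dict[Tuple[str, str], float]:
    scores: Dict[Tuple[str, str], float] = {}
    for mention, candidates in mentions.items():
        for cand in candidates:
            votes = 0.0
            for other, other_cands in mentions.items():
                if other == mention or not other_cands:
                    continue
                # every other mention votes with the average relatedness of
                # its own candidates to `cand`
                votes += sum(relatedness(c, cand) for c in other_cands) / len(other_cands)
            scores[(mention, cand)] = votes
    return scores

def prune(scores: Dict[Tuple[str, str], float], threshold: float = 0.1):
    # final pruning step: drop low-scoring (mention, entity) annotations
    return {pair: s for pair, s in scores.items() if s >= threshold}
```

With mentions such as "Japan" and "Syria" and their respective Wikipedia candidates, the readings that are mutually related across mentions accumulate the highest scores and survive the pruning step.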

More research that performs NED on Wikipedia includes the approach proposed in [172] for collective NED through local hill-climbing, rounding integer linear programs, and pre-clustering entities. In contrast, [25] maps the surface forms of entities together with contextual information and then employs vector space models to perform the disambiguation. Likewise, researchers in [173] try to disambiguate emergent entities. Many of the approaches rely on already existing NER tools such as Stanford NER2, CoreNLP3, GATE4, or OpenNLP5, among others, for the spotting stage. Notwithstanding, there is existing work, J-NERD [174], which attempts to jointly perform both NER and NED (NER/D) using probabilistic graphical models. In these probabilistic models, both tasks are treated as a sequence labelling problem through linear-chain CRFs and dependency-parse-tree-based factor graphs.

Recently, following the popularity of knowledge graphs (KGs), scholars have shifted focus to using KGs such as DBpedia [1], Freebase [16] and Wikidata [17] for the NED task. DBpedia Spotlight [36] is one such tool that performs NED on DBpedia. After an initial step of entity spotting, DBpedia Spotlight uses contextual information to resolve an entity’s surface forms to the corresponding DBpedia resources. The tool Babelfy [175], on the other hand, employs the underlying graph structure of a lexicalised semantic network to perform word sense disambiguation of words in a given text. The next step is to generate a graph that is finally used for linking through graph walk algorithms. Another graph-based system, AGDISTIS [176], relies on the number of hops between entities for disambiguation. These tools carry out NED as a two-step process, utilising the Stanford NER6 for the initial step. These tools offer interfaces for use by other applications in the form of APIs; NERD [177], in turn, proposes a framework for unifying the results of these tools for easy usage and combination via an ontology for alignment. It is important to note that most of these approaches use state-of-the-art machine learning techniques and require a large amount of training data. However, when these tools are applied to short text in a new domain such as question answering, their performance is limited [45]. This raises the question of how extra knowledge context can be represented to enhance performance, especially in scenarios where the local context is not sufficient.
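
As an illustration of the hop-based intuition behind graph-centric systems such as AGDISTIS, the sketch below picks the pair of candidate entities that lie closest together in a toy knowledge graph. The graph, the dbr: IRIs and the selection rule are illustrative assumptions, not AGDISTIS's actual algorithm.

```python
import itertools
import networkx as nx

# Toy knowledge graph; the dbr: IRIs and edges are illustrative only.
kg = nx.Graph()
kg.add_edges_from([
    ("dbr:Japan_national_football_team", "dbr:AFC_Asian_Cup"),
    ("dbr:AFC_Asian_Cup", "dbr:Syria_national_football_team"),
    ("dbr:Japan", "dbr:Asia"),
    ("dbr:Asia", "dbr:Middle_East"),
    ("dbr:Middle_East", "dbr:Syria"),
])

def pick_jointly_closest(candidates_a, candidates_b):
    """Return the candidate pair with the fewest hops between them in the KG."""
    best, best_dist = None, float("inf")
    for a, b in itertools.product(candidates_a, candidates_b):
        try:
            dist = nx.shortest_path_length(kg, a, b)
        except nx.NetworkXNoPath:
            continue  # disconnected candidates give no evidence
        if dist < best_dist:
            best, best_dist = (a, b), dist
    return best

# "Late goal gives Japan win over Syria": the football-team readings win,
# because they are only two hops apart in this toy graph.
print(pick_jointly_closest(
    ["dbr:Japan", "dbr:Japan_national_football_team"],
    ["dbr:Syria", "dbr:Syria_national_football_team"],
))
```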

In general, relation and entity linking for short text remain open research areas for the community, and more tools for solving the RE/RL and NER/D tasks can be expected. These tools find direct applicability in areas such as biomedical information extraction, or in collaborative component-based QA frameworks such as OKBQA [178], QANARY [179], and Frankenstein [18], where the input is relatively short.

2 https://nlp.stanford.edu/ner/

3 https://stanfordnlp.github.io/CoreNLP/

4 https://gate.ac.uk/

5 https://opennlp.apache.org/

6 https://nlp.stanford.edu/ner/

Components for Named Entity Recognition and Disambiguation

Named Entity Recognition (NER) recognises the subsequence of text that refers to an entity, while Named Entity Disambiguation (NED) links these entity mentions to a referent knowledge base (e.g., DBpedia [1]). For instance, in the example “Soccer: Late goal gives Japan win over Syria”, an NER component ideally recognises Japan as an entity, while a tool for the NED task links it to its Wikidata mention wd:Q170566. Below is a list of some of the NER and NED components.

1. Entity Classifier uses rule-based grammars to extract entities in a text [180]. Its REST endpoint is available for wider use for the NER task.

2. Stanford NLP Tool: the Stanford named entity recogniser is an open-source tool that uses Gibbs sampling for information extraction to spot entities in a text [181].

3. Babelfy [175] attempts multilingual EL through a graph-based approach that uses random walks and a subgraph algorithm to identify and disambiguate entities present in the text.

4. AGDISTIS [176] is a graph-based disambiguation tool that combines a novel HITS algorithm with label expansion strategies; further, string similarity measures are used to disambiguate entities in a given text.

5. DBpedia Spotlight is a web service9 that uses a vector-space representation of entities and cosine similarity to recognise and disambiguate entities [36] (see the sketch after this list).

6. TagMe matches terms in a given text with Wikipedia, i.e., it links text to recognise named entities. Furthermore, it uses the in-link graph and the page dataset to disambiguate recognised entities to their Wikipedia URLs [182]. TagMe is open source, and its REST API endpoint10 is available for further (re-)use.

7. Other APIs: besides the available open-source components, there are many commercial APIs that also provide open access for the research community. The Aylien API11 is one such API that uses natural language processing and machine learning for text analysis; its text analysis module also covers spotting and disambiguating entities. TextRazor12, Dandelion13, Ontotext14 [5], Ambiverse15, and MeaningCloud16 are other APIs that provide open access to researchers for reuse.
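
The sketch below illustrates the vector-space flavour of disambiguation mentioned in item 5: the candidate whose vector is most cosine-similar to the context vector of the mention is selected. The toy vectors and entity IRIs are assumptions for illustration and do not reproduce DBpedia Spotlight's actual implementation.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def disambiguate(context_vec: np.ndarray, candidates: dict) -> str:
    # pick the candidate entity whose vector best matches the mention context
    return max(candidates, key=lambda e: cosine(context_vec, candidates[e]))

# Toy vectors standing in for TF-IDF / embedding representations of the
# context "Late goal gives Japan win over Syria" and of two candidates.
context = np.array([0.9, 0.1, 0.7])
candidates = {
    "dbr:Japan":                        np.array([0.1, 0.9, 0.2]),
    "dbr:Japan_national_football_team": np.array([0.8, 0.2, 0.6]),
}
print(disambiguate(context, candidates))  # -> dbr:Japan_national_football_team
```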

Several comprehensive surveys exist that detail the techniques employed in entity linking (EL) research; see, for example, [169]. An elaborate discussion on NER has been provided by Yadav and Bethard [183].

7 wd corresponds to https://www.wikidata.org/wiki/

8 Q170566 is the Wikidata ID (Q-value) for the entity “Japan National Football Team”

9 https://github.com/dbpedia-spotlight/dbpedia-spotlight

10 https://services.d4science.org/web/TagMe/documentation

11 http://docs.aylien.com/docs/introduction

12 https://www.textrazor.com/docs/rest

13 https://dandelion.eu/docs/api/datatxt/nex/getting-started/

14 http://docs.s4.ontotext.com/display/S4docs/REST+APIs

15 https://developer.ambiverse.com/

16 https://www.meaningcloud.com/developer

However, the use of knowledge graphs as background knowledge for the EL task is a relatively recent approach. Here, a knowledge graph is not only used for the reference entities but also offers additional signals to enrich both the recognition and the disambiguation processes. For entity linking, FALCON [24] introduces the concept of using knowledge graph context to improve entity linking performance over DBpedia. Falcon creates a local KG fusing information from DBpedia and Wikidata to support entity and predicate linking of questions. We reused the Falcon background knowledge base and then expanded it with all the entities present in Wikidata (primarily non-standard entities). Another work in a similar direction is by Seyler et al. [184], who utilise an extensive set of features as background knowledge to train a linear-chain CRF classifier for the NER task.

The developments in deep learning have introduced a range of models that carry out both NER and NED as a single end-to-end step using various neural network-based models [37]. Kolitsas et al. [37] enforce during testing that the gold entity is present in the list of potential candidates, whereas Arjun makes no such assumption and generates entity candidates on the fly; this is one reason why Arjun is not compared with Kolitsas et al.'s work in the evaluation section. Note that, irrespective of the model chosen for entity linking, existing EL approaches and their implementations are commonly evaluated over standard datasets (e.g. CoNLL (YAGO) [147]). These datasets contain standard forms of the entities, commonly derived from Wikipedia URI labels. Recently, researchers have explicitly targeted EL over Wikidata by proposing a new neural network-based approach [73]. Contrary to our work, the authors assume entities are already recognised (i.e. step 1 of Arjun is already done). The input to this model is a sentence, one wrong Wikidata QID, and one correct QID; an attention-based model then predicts the correct QID for the sentence, which is more of a classification problem. Hence, the 91.6 F-score in Cetoli et al.'s work [73] is for linking the correct QID to Wikidata, given these particular inputs, and their model is not adaptable to end-to-end EL due to this input restriction. OpenTapioca [185] is an end-to-end EL approach on Wikidata that relies on topic similarities and local entity context; however, it ignores the Wikidata-specific challenges (chapter 5). The works in [10, 24] are other attempts at Wikidata entity linking.

Mention Detection (MD): The first attempt to organise a named entity recognition (NER) task dates back to 1996 [186]. Since then, numerous attempts have been made, ranging from conditional random fields (CRFs) with features constructed from dictionaries [187] to feature-inferring neural networks [188]. Recently, contextual-embedding-based models achieve state-of-the-art results for the NER/MD task [32, 189]. We point to the survey by [183] for details about NER. A few early EL models performed the MD task independently, such as [190, 191].
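
For concreteness, the snippet below shows the kind of output the MD step produces on the running example sentence, using BIO tags; the tags here are hand-written for illustration rather than produced by any of the cited models.

```python
# Hand-written BIO tags over the running example; real systems predict these.
tokens = ["Soccer", ":", "Late", "goal", "gives", "Japan", "win", "over", "Syria"]
bio    = ["O",      "O", "O",    "O",    "O",     "B-ENT", "O",  "O",    "B-ENT"]

mentions = [tok for tok, tag in zip(tokens, bio) if tag.startswith("B")]
print(mentions)  # ['Japan', 'Syria']
```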

Candidate Generation (CG): There are four prominent approaches to candidate generation. The first is a direct matching of entity mentions with a pre-computed candidate set [192]. The second approach is a dictionary look-up, where a dictionary of the associated aliases of entity mentions is compiled from several knowledge base sources (e.g. Wikipedia, WordNet) [193–195]. The third approach is to generate entity candidates using an empirical probabilistic entity map p(e|m), i.e. a pre-calculated prior probability of correspondence between mentions and entities. A widely used entity map was built by [196] from Wikipedia hyperlinks, Crosswikis [197] and YAGO [147] dictionaries; end-to-end EL approaches such as [37, 198] rely on the entity map built by Ganea and Hofmann. The fourth approach to generating candidates is proposed by [24]: the authors build a local KG by expanding entity mentions using Wikidata and DBpedia entity labels and associated aliases, and the local KG can be queried using the BM25 ranking algorithm [199]. The modular architecture of Arjun (chapter 5) gives us the flexibility to experiment with several ways of generating entity candidates. Hence, we reused the candidate list proposed by [196] and built a new CG approach based on [24] for the second implementation in section 5.3.
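
The following sketch illustrates the p(e|m) entity map from the third strategy: the prior is estimated from anchor-text counts and then used to rank candidates for a mention. The counts are invented for illustration; only Q17 (Japan) and Q170566 (the national football team, cf. the footnote above) are real Wikidata IDs.

```python
from collections import defaultdict

# Toy anchor-text counts: how often the surface form "japan" links to each
# entity. The counts are invented; Q17 (Japan) and Q170566 (Japan national
# football team) are real Wikidata IDs.
anchor_counts = {
    ("japan", "Q17"):     120,
    ("japan", "Q170566"): 45,
}

def build_entity_map(counts):
    """mention -> {entity: p(e|m)} estimated from raw co-occurrence counts."""
    totals = defaultdict(int)
    for (mention, _), c in counts.items():
        totals[mention] += c
    emap = defaultdict(dict)
    for (mention, entity), c in counts.items():
        emap[mention][entity] = c / totals[mention]
    return emap

def candidates(emap, mention, k=2):
    """Top-k candidates for a mention, ranked by the prior p(e|m)."""
    ranked = sorted(emap.get(mention.lower(), {}).items(), key=lambda x: -x[1])
    return ranked[:k]

emap = build_entity_map(anchor_counts)
print(candidates(emap, "Japan"))  # [('Q17', ~0.73), ('Q170566', ~0.27)]
```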

End-to-end EL: A few EL approaches accomplish the MD and ED tasks jointly. [174] propose joint recognition and disambiguation of named-entity mentions using a graphical model and show that it improves EL. [198] combine local and global features, using knowledge about the neighbouring mentions and their respective entities, to solve the EL task. The work in [37] also proposes a joint model for MD and ED: the authors use a bi-LSTM-based model for mention detection and compute the similarity between the entity mention embedding and a set of predefined entity candidates. This approach first selects a set of entities with a high local score and then computes the similarity between the in-process entity embedding and an average of the selected entity embeddings. The work in [39] employs BERT to model the three subtasks of EL jointly; the authors use an entity vocabulary of the 700K most frequent entities to train the model. The work in [200] uses a Transformer architecture with large-scale pre-training from Wikipedia links for EL; for CG, the authors train the model to predict BIO-tagged mention boundaries to disambiguate among all entities.
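
To illustrate the similarity step shared by these joint models, the sketch below compares a mention embedding against candidate entity embeddings and mixes the cosine similarity with the p(e|m) prior. The embedding dimensionality, the mixing weight alpha and the random vectors are assumptions, not the configuration of [37] or [39].

```python
import numpy as np

rng = np.random.default_rng(0)

mention_emb = rng.normal(size=300)          # e.g. from a bi-LSTM mention encoder
candidate_embs = {                          # pre-trained entity embeddings (toy)
    "Q17":     rng.normal(size=300),
    "Q170566": rng.normal(size=300),
}
prior = {"Q17": 0.73, "Q170566": 0.27}      # p(e|m) from the entity map

def local_score(entity: str, alpha: float = 0.5) -> float:
    # mix embedding similarity with the candidate-generation prior
    cand = candidate_embs[entity]
    sim = mention_emb @ cand / (np.linalg.norm(mention_emb) * np.linalg.norm(cand))
    return alpha * sim + (1 - alpha) * prior[entity]

best = max(candidate_embs, key=local_score)
print(best, round(local_score(best), 3))
```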

For the Wikidata KG, OpenTapioca is an entity linking approach that relies on a heuristic-based model for disambiguating the mentions in a text to Wikidata entities [185]. Our Arjun [201] approach, discussed in chapter 5, offers flexibility in the candidate generation process by leveraging a modular EL system.