
6.2.3 Approach: Evaluating Knowledge Relevance

This section describes our approach for KG-enhanced contextualised entity disambiguation. Figure 6.4 shows the overall process of modelling KG context for the entity disambiguation task. In this work, we take a model-agnostic view and define an approach consisting of three major stages: input representation, sentence-pair encoding, and scoring. Only the second stage requires the design choice of a specific model for learning a combined representation of the source and the entity KG context. The rest of this section describes the steps involved in our approach, followed by a summary of the two models we selected for our experiments.

Figure 6.4: Approach: Entity-Context-Enhanced Disambiguation, portraying the input representation (the source text with the mention "Zaire" and the Wikidata context of the candidate entity Q974), the sentence-pair encoding models (ETHZ-Attention and a bidirectional Transformer), and the scoring stage (MLP ranker and MLP regressor). The abbreviation D. refers to "Democratic Republic of the Congo".

Input Representation Text inputs for deep learning-based NLP models generally take the form of a sequence of word vectors. Such vectors are either trainable embeddings learned within the first layer of the model or are fetched from static pre-trained vectors in a given vocabulary. Regardless of whether the vectors are trainable or static, a context-enhanced disambiguation model requires that the input tokens' vector representations encapsulate both the semantics expressed by the token itself and the relative meaning defined by its partition (segment). Our approach's input consists of the four partitions illustrated in Section 6.2.2. Therefore, the representation must capture both the individual token semantics and the semantics expressed by its partition. For instance, the word "Zaire" may have the same meaning when it appears in the mention segment of Figure 6.4 and when it appears in the third triple segment. Let us denote the embedding vector for a given token in the input as $\vec{w}$, the vector representation of its input partition as $\vec{\varrho}$, and the optional absolute positional embedding as $\vec{p}$. The final token representation is given by $\mathrm{Combine}(\vec{w}, \vec{\varrho}, \vec{p})$, where the function $\mathrm{Combine}()$ is a model-dependent choice between concatenation, addition, averaging, or multiplication.
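A minimal sketch of this input representation, assuming additive combination and illustrative vocabulary, partition, and dimension sizes (not the exact implementation), could look as follows:

```python
import torch
import torch.nn as nn

class InputRepresentation(nn.Module):
    """Combine() realised as element-wise addition of token, partition (segment),
    and absolute positional embeddings. All sizes below are illustrative."""

    def __init__(self, vocab_size=30000, num_partitions=4, max_len=512, dim=300):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)          # \vec{w}
        self.partition_emb = nn.Embedding(num_partitions, dim)  # \vec{varrho}
        self.position_emb = nn.Embedding(max_len, dim)          # \vec{p}

    def forward(self, token_ids, partition_ids):
        # token_ids, partition_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token_emb(token_ids)
                + self.partition_emb(partition_ids)
                + self.position_emb(positions))
```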

Sequence-pair Encoder The sequence-pair encoding aims to obtain a single unified vector representation that combines the semantics of the two sequences (source and mention sequence vs. candidate entity and entity KG-context sequence) and expresses their proximity. Following the progress made in sequence encoding through attention mechanisms, the intuition is to obtain different initial contextual word representations for the two sequences, followed by selective inter-matching of tokens across the two sequences. Depending on the granularity of the attention mechanism designed into a model, the final representation often involves aggregating the outputs of several attention functions or units. Our decision to use attention-based approaches in the sequence-pair encoding is informed by the tremendous success achieved in previous research such as [31, 32, 261]. As indicated in similar works [261, 262], attention takes two representations and defines a weighted representation that captures their compatibility.

Let us denote the hidden-layer representation of a given attention unit as $h_L$ if the unit belongs to the left sequence and as $h_R$ if it belongs to the right sequence (a unit can be a token, a token block, or the whole sequence). An attention function is then performed as follows:

$a = \alpha\big(W_L h_L \,(\mathrm{op})\, W_R h_R\big)$

where $a$ is the attention vector, $\alpha$ is a nonlinear function such as $\tanh$ or ReLU, $W_L$ and $W_R$ are the trainable weight matrices for the left and right units, and $(\mathrm{op})$ is an operation depending on the type of attention used (concatenation, addition, or multiplication). If we then take several such units, the final representation of a unit is obtained by combining the attention vectors with respect to the other units with the unit's original vector, e.g., for $h_L$:

$h'_L = h_L \, \mathrm{softmax}(w^{\top} a_i)$

For the final sentence-pair representation, a non-linear function combines the first sequence's attention-weighted representation with the last output vector encoding the two input sequences. We generate another such representation based on the second sequence and concatenate the two, yielding a two-way attention mechanism.
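A sketch of one such attention unit is given below. It assumes concatenation as $(\mathrm{op})$, $\tanh$ as $\alpha$, a single right-hand unit, and a softmax-weighted sum as the combination step; the class name and dimensions are illustrative, and the two-way aggregation is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionUnit(nn.Module):
    """a = tanh(W_L h_L (op) W_R h_R) with (op) = concatenation, followed by
    a softmax-weighted re-combination of the left-sequence units."""

    def __init__(self, dim):
        super().__init__()
        self.W_L = nn.Linear(dim, dim, bias=False)
        self.W_R = nn.Linear(dim, dim, bias=False)
        self.w = nn.Linear(2 * dim, 1, bias=False)   # scores w^T a_i

    def forward(self, h_L, h_R):
        # h_L: (n_left, dim) units of the left sequence
        # h_R: (dim,) a single unit, e.g. the whole right sequence
        a = torch.tanh(torch.cat(
            [self.W_L(h_L), self.W_R(h_R).expand_as(h_L)], dim=-1))  # (n_left, 2*dim)
        weights = F.softmax(self.w(a).squeeze(-1), dim=0)            # (n_left,)
        return (weights.unsqueeze(-1) * h_L).sum(dim=0)              # h'_L
```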

Scoring At the scoring stage, we take the vector representation from the sequence-pair encoder and fit it into an objective that produces the final prediction. Common scoring objectives include classification, regression, and ranking. This entails passing the encoding through one or two fully connected layers followed by the selected scoring objective. Given the feature vector representation for the source and a candidate entity, the ranking objective uses a max-margin loss, while the classification objective employs the cross-entropy loss function.
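A hedged sketch of this scoring stage is shown below: a small MLP head with either a max-margin ranking loss or a cross-entropy classification loss. Layer sizes, the margin value, and the helper names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoringHead(nn.Module):
    """One or two fully connected layers on top of the sequence-pair encoding."""

    def __init__(self, dim, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, pair_encoding):
        return self.mlp(pair_encoding).squeeze(-1)   # one score per (source, candidate) pair


def ranking_loss(score_gold, score_negative, margin=1.0):
    # max-margin: the gold candidate should outscore a negative candidate by `margin`
    return F.relu(margin - score_gold + score_negative).mean()


def classification_loss(scores, labels):
    # cross-entropy over candidate scores; labels index the gold candidate
    return F.cross_entropy(scores, labels)
```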

Models

In our approach, we chose two different models that explicitly influence the behaviour of the three stages described above. The first is the DCA model [28], based on ETHZ-Attention, and the second is the pre-trained transformer architecture XLNet [52]. In the following, we briefly describe each.

DCA Model ETHZ-Attn was introduced by Ganea et al. [196], in which a 2-layer feed-forward network aggregates scores obtained from three different local features: (1) a prior probability score for the mention-entity pair distribution drawn empirically from a large corpus, (2) a score that estimates the similarity between the source context and the entity context, and (3) a type similarity score that relates the type of the candidate entity to the mention. The model then seeks to accumulate information from entities that have already been linked; such information becomes a dynamic context that boosts subsequent linking decisions. To remove irrelevant entities from the dynamic context, an attention mechanism is again applied [28]. Finally, the model employs a ranking objective to select the correct candidate.
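The following is not the authors' implementation but a minimal sketch, under assumed feature names and hidden size, of how the three local feature scores could be aggregated by a 2-layer feed-forward network as described above.

```python
import torch
import torch.nn as nn

class LocalScoreAggregator(nn.Module):
    """2-layer feed-forward network over the three local features:
    prior p(e|m), context similarity, and type similarity (a sketch)."""

    def __init__(self, hidden=100):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, prior, context_sim, type_sim):
        # each input: (num_candidates,) scores for one mention's candidate set
        features = torch.stack([prior, context_sim, type_sim], dim=-1)
        return self.ffn(features).squeeze(-1)   # local score per candidate
```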

Adding KG Context to DCA: The attention mechanism employed in the DCA model allows extended-range matching of tokens and segments. The original model was therefore trained using Wikipedia paragraphs as entity context. However, Wikipedia paragraphs are not adequately structured and hence do not provide a concise entity context representation. In our experiments, we replace this context with the Wikidata entity context. The initialisation used in this model is drawn from static GloVe embeddings.

We feed the triples as word embeddings of their natural language forms. For the DCA model, the KG entity context input takes a sequence similar to the example below.

Democratic Republic of the Congo, alias DRC, alias DR Congo, alias Congo-Kinshasa, alias Zaire, alias Dem. Republic of the Congo, alias Dem. Republic of Congo, alias Dem. Rep. Congo, alias Congo (Kinshasa), alias COD, description sovereign state in Central Africa, instance of country, instance of sovereign state, part of Middle Africa, . . .
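A sketch of how such a sequence could be produced from (predicate, object) pairs is shown below; the helper function and the truncated triple list are illustrative rather than the exact preprocessing used.

```python
def linearise_triples(subject_label, triples):
    """Flatten (predicate, object) pairs into the 'label, predicate object, ...'
    word sequence fed to the DCA model (illustrative helper)."""
    parts = [subject_label]
    for predicate, obj in triples:
        parts.append(f"{predicate} {obj}")
    return ", ".join(parts)


context = linearise_triples(
    "Democratic Republic of the Congo",
    [("alias", "DRC"), ("alias", "Zaire"),
     ("description", "sovereign state in Central Africa"),
     ("instance of", "country"), ("part of", "Middle Africa")],
)
# -> "Democratic Republic of the Congo, alias DRC, alias Zaire, description ..."
```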

Fine-tuning XLNet Model Language models such as BERT [32], RoBERTa [53], and XLNet [52] have become the state of the art for several tasks. Mulang' et al. [10] experimented with RoBERTa and XLNet for context representation and determined that XLNet is the more stable model when extra context is added to the input sequence. We therefore borrow this insight and implement our approach using a fine-tuned XLNet model. We add a regression layer to obtain the scores shown in Figure 6.4.
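A minimal sketch of such a fine-tuning setup with the Hugging Face transformers library is given below, where a single-label sequence classification head (num_labels=1) stands in for the added regression layer; the input strings are shortened from the running example and do not reproduce the exact training configuration.

```python
import torch
from transformers import XLNetTokenizer, XLNetForSequenceClassification

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
# num_labels=1 yields a single regression score per (source, entity-context) pair
model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=1)

source = "... African Cup Winners Cup final at the National Stadium ... Sodigraf Zaire 0 ..."
entity_context = "Democratic Republic of the Congo alias Zaire, DRC, description sovereign state in Central Africa"

inputs = tokenizer(source, entity_context, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze(-1)   # relevance score for the candidate
```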

Adding KG Context to XLNet: The architecture of pre-trained transformer models lends itself easily to our need to represent separate segment embeddings for each input portion. We employ the segment separator token [SEP] between each triple, which allows the embedding of each triple to be distinct from the rest of the input. In Figure 6.4, each $a_i^{e_x^1} \in A^{e_x^1}$ refers to a triple derived from the entity attribute context set $A^{e_x^1}$ of the candidate entity $e_x^1$, while each $\tau_i^{e_x^1} \in T^{e_x^1}$ refers to a triple connecting the entity to other entities. For our running example, the following represents the entity context.

[SEP] Democratic Republic of the Congo alias DRC, DR Congo, Congo-Kinshasa, Zaire, Dem. Republic of the Congo, Dem. Republic of Congo, Dem. Rep. Congo, Congo (Kinshasa), COD, [SEP] description sovereign state in Central Africa [SEP] instance of country, sovereign state [SEP] part of Middle Africa, . . .
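A sketch of how this [SEP]-separated context could be assembled is given below, grouping object values per predicate as in the example above. The grouping helper is an assumption rather than the exact preprocessing, and note that XLNet's tokenizer denotes the separator token as <sep>.

```python
from collections import OrderedDict
from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
SEP = tokenizer.sep_token   # "<sep>" for XLNet; written as [SEP] in the text above


def build_entity_context(subject_label, triples):
    """Group object values per predicate and separate the groups with SEP so that
    each triple group can receive its own segment embedding (illustrative helper)."""
    grouped = OrderedDict()
    for predicate, obj in triples:
        grouped.setdefault(predicate, []).append(obj)
    segments = []
    for i, (pred, objs) in enumerate(grouped.items()):
        prefix = f"{subject_label} " if i == 0 else ""
        segments.append(f"{prefix}{pred} " + ", ".join(objs))
    return f"{SEP} " + f" {SEP} ".join(segments)


context = build_entity_context(
    "Democratic Republic of the Congo",
    [("alias", "DRC"), ("alias", "Zaire"),
     ("description", "sovereign state in Central Africa"),
     ("instance of", "country"), ("instance of", "sovereign state"),
     ("part of", "Middle Africa")],
)
```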