4.2 Approach
4.2.1 Model description
From a broad perspective, our model (see Figure 4.1) is a matching function: given a question $q$ and sets of candidate subject entities and relations, $C_s = \{s_1, \ldots, s_m\}$ and $C_r = \{r_1, \ldots, r_n\}$ respectively, it returns the subject and predicate that match the question best. To do so, it (1) maps the question $q$ to a vector representation $\mathbf{r}_q = (\mathbf{r}_q^s, \mathbf{r}_q^r)^\top$, where $\mathbf{r}_q^s$ and $\mathbf{r}_q^r$ are the subject- and relation-specific encodings of the question respectively, (2) maps each candidate subject $s_i \in C_s$ to a vector representation $\mathbf{r}_{s_i}$, (3) maps each candidate predicate $r_j \in C_r$ to a vector representation $\mathbf{r}_{r_j}$, and (4) computes scores $S_s(q, s_i)$ and $S_r(q, r_j)$ for each pair $(\mathbf{r}_q^s, \mathbf{r}_{s_i})$, $i = 1, \ldots, m$, and $(\mathbf{r}_q^r, \mathbf{r}_{r_j})$, $j = 1, \ldots, n$. Based on these scores the final prediction is $(\hat{s}, \hat{r})$, with

$\hat{s} = \arg\max_{s_i \in C_s} S_s(q, s_i)$ ,  (4.1)
$\hat{r} = \arg\max_{r_j \in C_r} S_r(q, r_j)$ .  (4.2)
Steps (1)-(3) heavily rely on RNNs with Gated Recurrent Units (GRUs) [45], which are described in Section 2.2.2. In the following subsections, the four parts of our model are described in detail.
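To make these four steps and the predictions of Equations (4.1) and (4.2) concrete, the following minimal Python sketch scores pre-computed candidate representations against the question encodings and takes the argmax; all variable names, vectors, and candidate identifiers are illustrative placeholders rather than part of our implementation.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors (the score function used later in Section 4.2.1).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict(r_q_s, r_q_r, subject_reps, predicate_reps, score):
    """Steps (2)-(4) after encoding: score each candidate against the matching
    question encoding and return the best subject and predicate (Eqs. 4.1, 4.2).

    subject_reps / predicate_reps: dicts mapping candidate id -> representation vector
    score: similarity function between two vectors
    """
    s_hat = max(subject_reps, key=lambda s: score(r_q_s, subject_reps[s]))       # Eq. (4.1)
    r_hat = max(predicate_reps, key=lambda r: score(r_q_r, predicate_reps[r]))   # Eq. (4.2)
    return s_hat, r_hat

# Toy example with random vectors standing in for the learned encodings.
rng = np.random.default_rng(0)
r_q_s, r_q_r = rng.normal(size=16), rng.normal(size=16)
subjects = {"subject_1": rng.normal(size=16), "subject_2": rng.normal(size=16)}
predicates = {"/meteorology/affected_area/cyclone": rng.normal(size=16)}
print(predict(r_q_s, r_q_r, subjects, predicates, cosine))
```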
Figure 4.2: Question encoding network ENC_Q. Each word is represented by a vector using REP_W. The sequence of word vectors is encoded using a GRU.
Representing the question
The mapping of a question $q = \{w_1, \ldots, w_n\}$ to its subject- and predicate-related vector representations $\mathbf{r}_q^s$ and $\mathbf{r}_q^r$, respectively, is done using a single-layered unidirectional GRU-based encoder network. We call this part of the model the question encoder ENC_Q:

$\mathbf{r}_q = \begin{pmatrix}\mathbf{r}_q^s \\ \mathbf{r}_q^r\end{pmatrix} = \mathrm{ENC_Q}(\{w_1, \ldots, w_n\})$ .  (4.3)

The question encoder ENC_Q first uses the word representation function $\mathrm{REP_W}(w_t)$ to generate vector representations for all words $w_t$, $t = 1, \ldots, n$ (as described in the next paragraph), which are subsequently fed to the RNN until all words have been seen. Starting with the initial hidden state $\mathbf{h}_0$, the GRU of the question encoder RNN iteratively updates its hidden state $\mathbf{h}_t$ after processing each word according to Equations (2.6) to (2.9), where the word representation vector $\mathrm{REP_W}(w_t)$ is fed as input to the GRU (i.e. $\mathbf{x}_t = \mathrm{REP_W}(w_t)$). The final hidden state $\mathbf{h}_n$ (produced after processing the last word, represented by $\mathrm{REP_W}(w_n)$) is returned by ENC_Q as the representation of question $q$. The question encoder is visualized in Figure 4.2.
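For illustration, a minimal PyTorch-style sketch of such a single-layer unidirectional GRU question encoder is given below; the module names, dimensions, and the embedding layer standing in for REP_W are assumptions made for the example, not the configuration we actually use.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Sketch of ENC_Q: runs a single-layer unidirectional GRU over the word
    representations and returns the final hidden state as the question encoding r_q."""

    def __init__(self, rep_w: nn.Module, word_dim: int, enc_dim: int):
        super().__init__()
        self.rep_w = rep_w                       # REP_W: produces one vector per word
        self.gru = nn.GRU(word_dim, enc_dim, batch_first=True)

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        x = self.rep_w(word_ids)                 # (batch, n_words, word_dim)
        _, h_n = self.gru(x)                     # h_n: (1, batch, enc_dim), the final state
        return h_n.squeeze(0)                    # r_q = (r_q^s, r_q^r) stacked into one vector

# Toy usage: a plain embedding layer stands in for REP_W here.
rep_w = nn.Embedding(num_embeddings=1000, embedding_dim=64)
enc_q = QuestionEncoder(rep_w, word_dim=64, enc_dim=128)
question = torch.randint(0, 1000, (1, 4))        # e.g. "What cyclone affected Hainan"
print(enc_q(question).shape)                     # torch.Size([1, 128])
```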
Word representation
In the following we describe how we generate the vector representations of the words $w_1, \ldots, w_n$. In order to exploit both word- and character-level information of the question, we use a "nested" word- and character-level approach, concatenating the pre-trained embedding of a word with an RNN-based encoding on character level.
As word embeddings, we use GloVe [40] vectors provided on the GloVe website1. Such pre-trained word embeddings implicitly incorporate word semantics inferred from a large text corpus, based on the distributional semantics hypothesis [186]. This hypothesis can be phrased as "words with similar meanings occur in similar contexts", which in our case translates to similar vectors for words with similar meanings. See Section 2.4.1 for more on pre-trained word embeddings. Using such pre-trained word embeddings allows us to better handle synonyms and to find better matches between words in the question and subject labels or predicate URIs. In addition, during testing, it allows the model to handle words that have not been seen during training.
1http://nlp.stanford.edu/projects/glove/
Figure 4.3: Word representation network REP_W with example. The word is considered as a sequence of characters and fed to ENC_W, where each character is embedded and the sequence of character vectors is encoded using a GRU to produce a character-level encoding of that word. This is concatenated with the word embedding (we use GloVe) to produce the complete word representation.
The word embedding of $w_t$, resulting in the $d_w$-dimensional vector representation $\mathbf{w}_t^e$, can be formally described as follows:

$\mathbf{w}_t^e = \mathbf{W}_e^\top \mathbf{v}_t$ ,  (4.4)

where $\mathbf{W}_e \in \mathbb{R}^{|V_w| \times d_w}$ is the provided pre-trained word embedding matrix for a vocabulary $V_w$ of size $|V_w|$ (GloVe covers 400k words), and $\mathbf{v}_t$ is the one-hot vector representation of $w_t$. Since the coverage of word embeddings is limited, many words appearing in the questions (especially those that are part of a reference to a particular subject entity, e.g. the last name "Golfis" in the question "What city was Alex Golfis born in") are not contained in the vocabulary of the pre-trained embeddings. In such cases (20.8% and 14.5% of unique words in the train resp. test questions), we set the word embedding to the zero vector.
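This lookup with the zero-vector fallback for out-of-vocabulary words can be sketched as follows; the file name and the 300-dimensional setting are only examples of the vectors distributed on the GloVe website.

```python
import numpy as np

def load_glove(path, dim=300):
    # Parse a GloVe text file into {word: vector}; each line is "word v1 v2 ... v_dim".
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def word_embedding(word, glove, dim=300):
    # Equation (4.4): look up the pre-trained embedding; OOV words get the zero vector.
    return glove.get(word.lower(), np.zeros(dim, dtype=np.float32))

# glove = load_glove("glove.6B.300d.txt")    # example file name from the GloVe website
# word_embedding("golfis", glove).any()      # -> False: OOV word, zero vector
```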
The encoding of the word $w_t = \{w_t^1, \ldots, w_t^K\}$ on character level is based on an RNN encoder:

$\mathbf{w}_t^c = \mathrm{ENC_W}(\{w_t^1, \ldots, w_t^K\})$ .  (4.5)

Inside ENC_W, the characters $w_t^k$, $k = 1, \ldots, K$ are first embedded by

$\mathbf{c}_t^k = \mathbf{W}_c^\top \mathbf{v}_t^k$ ,  (4.6)

with character embedding matrix $\mathbf{W}_c \in \mathbb{R}^{|V_c| \times d_c}$ (for $|V_c|$ characters) learned during training, and $\mathbf{v}_t^k$ the one-hot vector representation of the character $w_t^k$. Then we feed the sequence of character vectors $\{\mathbf{c}_t^1, \ldots, \mathbf{c}_t^K\}$ to a single-layered unidirectional GRU network and take its final state as the character-level word encoding $\mathbf{w}_t^c$.
The added character-level encoding provides information necessary for matching question words with entity labels, in addition to providing distinguishable representations for OOV words. This approach is similar to the char2word model proposed by Ling et al. [187], with the difference that we use a unidirectional GRU network.
Finally, to get the vector representation of a word $w_t$, the word embedding $\mathbf{w}_t^e$ and character-level encoding $\mathbf{w}_t^c$ are concatenated:

$\mathrm{REP_W}(w_t) = \begin{pmatrix}\mathbf{w}_t^e \\ \mathbf{w}_t^c\end{pmatrix}$ .  (4.7)

The whole word representation network is illustrated in Figure 4.3.
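A compact PyTorch-style sketch of this nested representation (Equations (4.4) to (4.7)) is shown below; the dimensions, the untrained embedding standing in for GloVe, and the character indexing via ord() are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WordRepresentation(nn.Module):
    """Sketch of REP_W for a single word: concatenates the pre-trained word
    embedding w_e with the final GRU state over its characters w_c (ENC_W)."""

    def __init__(self, word_emb: nn.Embedding, n_chars: int, char_dim: int, char_enc_dim: int):
        super().__init__()
        self.word_emb = word_emb                              # W_e, pre-trained (e.g. GloVe)
        self.char_emb = nn.Embedding(n_chars, char_dim)       # W_c, learned (Eq. 4.6)
        self.char_gru = nn.GRU(char_dim, char_enc_dim, batch_first=True)

    def forward(self, word_id: torch.Tensor, char_ids: torch.Tensor) -> torch.Tensor:
        w_e = self.word_emb(word_id)                          # (batch, d_w), Eq. (4.4)
        c = self.char_emb(char_ids)                           # (batch, n_chars, d_c)
        _, h_last = self.char_gru(c)                          # final state = w_c, Eq. (4.5)
        w_c = h_last.squeeze(0)                               # (batch, char_enc_dim)
        return torch.cat([w_e, w_c], dim=-1)                  # Eq. (4.7)

# Toy usage with a random (untrained) embedding standing in for GloVe.
rep_w = WordRepresentation(nn.Embedding(1000, 50), n_chars=128, char_dim=16, char_enc_dim=32)
word_id = torch.tensor([7])                                   # index of "hainan" in the vocabulary
char_ids = torch.tensor([[ord(ch) for ch in "hainan"]])       # character indices
print(rep_w(word_id, char_ids).shape)                         # torch.Size([1, 82])
```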
Representing the subject
We use both the entity label and the type label of the entities in the knowledge graph to build subject representations. Entity labels are encoded on the level of characters because of the high prevalence of OOV words and their importance for entity labels. On the other hand, OOV words are rather rare in type labels and thus type labels are encoded on word level.
For Freebase entities, we extract entity labels using the type.object.name properties of entities, and type labels of entities by first getting the common.topic.notable_types of entities2 and then taking the type.object.name value of the types.
The character-level entity label encoding $\mathbf{s}_l$ and the word-level type label encoding $\mathbf{s}_t$ are concatenated to produce the subject representation vector:

$\mathbf{s}_l = \mathrm{ENC_{SL}}(\{c_s^1, c_s^2, \ldots\})$ ,  (4.8a)
$\mathbf{s}_t = \mathrm{ENC_{ST}}(\{w_t^1, w_t^2, \ldots\})$ ,  (4.8b)
$\mathbf{r}_s = \begin{pmatrix}\mathbf{s}_l \\ \mathbf{s}_t\end{pmatrix}$ ,  (4.8c)

where ENC_SL is the character-level encoder of the subject entity label and ENC_ST is the word-level type label encoder. The label characters and type label words, respectively, are first embedded (following Equation 4.6 and Equation 4.4, respectively) and the embedding vectors are fed to the respective encoding RNNs. Both ENC_SL and ENC_ST correspond to single-layer unidirectional GRU-based RNNs and take their final hidden state as the entity label encoding $\mathbf{s}_l$ and type label encoding $\mathbf{s}_t$, respectively.
The subject representation network is visualized in Figure 4.4.
This method of building subject representations is similar to the method proposed by Sun et al. [188], who focus on entity linking and use CNNs for word-level entity name encoding (instead of RNN-based character-level encodings as in our model, which allow handling OOV words) and word-level entity type name encoding, followed by an additional layer that merges the two (whereas we simply concatenate both representations).
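A sketch of this subject encoder under the same illustrative assumptions as before (placeholder dimensions, an untrained word embedding instead of GloVe, characters indexed via ord()) could look as follows:

```python
import torch
import torch.nn as nn

class SubjectEncoder(nn.Module):
    """Sketch of the subject representation (Eq. 4.8): a character-level GRU over the
    entity label (ENC_SL) and a word-level GRU over the type label (ENC_ST), concatenated."""

    def __init__(self, n_chars=128, char_dim=16, n_words=1000, word_dim=50, enc_dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.word_emb = nn.Embedding(n_words, word_dim)       # would be GloVe in practice
        self.enc_sl = nn.GRU(char_dim, enc_dim, batch_first=True)   # ENC_SL (characters)
        self.enc_st = nn.GRU(word_dim, enc_dim, batch_first=True)   # ENC_ST (words)

    def forward(self, label_char_ids: torch.Tensor, type_word_ids: torch.Tensor) -> torch.Tensor:
        _, s_l = self.enc_sl(self.char_emb(label_char_ids))   # final state s_l, Eq. (4.8a)
        _, s_t = self.enc_st(self.word_emb(type_word_ids))    # final state s_t, Eq. (4.8b)
        return torch.cat([s_l.squeeze(0), s_t.squeeze(0)], dim=-1)  # r_s, Eq. (4.8c)

# Toy usage: entity label "hainan", type label "chinese province" (word ids are made up).
enc = SubjectEncoder()
label = torch.tensor([[ord(ch) for ch in "hainan"]])
type_label = torch.tensor([[11, 42]])
print(enc(label, type_label).shape)                           # torch.Size([1, 128])
```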
Representing the predicate
We use the predicate URIs provided by the KG to build latent vector representations of the predicates.
The predicate URI is first split into words $w_r^1, w_r^2, \ldots$, each word is embedded (as described by Equation 4.4), and then the word embeddings are fed into a single-layer word-level GRU-based encoder ENC_R that takes the final state of its RNN as the representation of the predicate URI, that is

$\mathbf{r}_r = \mathrm{ENC_R}(\{w_r^1, w_r^2, \ldots\})$ .  (4.9)

The relation encoding network is visualized in Figure 4.5.
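As an illustration of this predicate encoder (with placeholder names; the hash-based vocabulary lookup and the untrained embedding merely stand in for a real GloVe-initialized vocabulary), consider the following sketch:

```python
import torch
import torch.nn as nn

class PredicateEncoder(nn.Module):
    """Sketch of ENC_R: embed the words of the predicate URI and take the final
    GRU state as the predicate representation r_r (Eq. 4.9)."""

    def __init__(self, word_emb: nn.Embedding, enc_dim: int):
        super().__init__()
        self.word_emb = word_emb
        self.gru = nn.GRU(word_emb.embedding_dim, enc_dim, batch_first=True)

    def forward(self, uri_word_ids: torch.Tensor) -> torch.Tensor:
        _, h_last = self.gru(self.word_emb(uri_word_ids))
        return h_last.squeeze(0)

def split_uri(uri: str):
    # e.g. "/meteorology/affected_area/cyclone" -> ["meteorology", "affected", "area", "cyclone"]
    return [w for part in uri.strip("/").split("/") for w in part.split("_")]

enc_r = PredicateEncoder(nn.Embedding(1000, 50), enc_dim=64)
words = split_uri("/meteorology/affected_area/cyclone")
word_ids = torch.tensor([[hash(w) % 1000 for w in words]])    # stand-in for a vocabulary lookup
print(enc_r(word_ids).shape)                                  # torch.Size([1, 64])
```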
2The notable types property provides the single, most characteristic type for that entity. However, using all types of the entity (e.g. concatenating their labels) could be interesting as well, which we leave for future work.
Figure 4.4: Entity encoder with example. The entity label is encoded on character level (ENC_SL) and the subject type label is encoded on word level (ENC_ST). The two are concatenated to produce the subject vector.
Figure 4.5: Predicate encoder network ENC_R with example. The predicate URI is split into words and encoded on word level using GloVe embeddings.
Matching scores
Given the question encoding vector $\mathbf{r}_q = (\mathbf{r}_q^s, \mathbf{r}_q^r)^\top$, the latent vector representation $\mathbf{r}_r$ of the relation, and the latent representation $\mathbf{r}_s$ of the subject entity, we compute two matching scores: one between the question and the subject entity and one between the question and the predicate, as follows:

$S_s(q, s) = \cos(\mathbf{r}_q^s, \mathbf{r}_s)$ ,  (4.10a)
$S_r(q, r) = \cos(\mathbf{r}_q^r, \mathbf{r}_r)$ ,  (4.10b)

where $\cos$ is the cosine similarity given by $\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{|\mathbf{a}||\mathbf{b}|}$.
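For completeness, the two scores can be computed with PyTorch's built-in cosine similarity as in the following sketch; the dimensionalities and random vectors are placeholders for the actual encoder outputs.

```python
import torch
import torch.nn.functional as F

# Illustrative encodings; in the model these come from ENC_Q and the entity/predicate encoders.
r_q_s, r_q_r = torch.randn(128), torch.randn(64)
r_s, r_r = torch.randn(128), torch.randn(64)

S_s = F.cosine_similarity(r_q_s, r_s, dim=0)   # Eq. (4.10a)
S_r = F.cosine_similarity(r_q_r, r_r, dim=0)   # Eq. (4.10b)
print(S_s.item(), S_r.item())
```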