
We report the results of this experiment in Table 5.7. In summary, we conclude that transferring models across KGQA datasets via simple fine-tuning is a viable strategy to compensate for the lack of training samples in the target dataset.

5.8.3 Transfer Learning with pre-trained Transformers

For our transformer-based slot matching model, we use a transformer initialized with the weights of BERT-Small instead of an LSTM, as discussed in Sec. 5.7.2. The transformer has 12 layers with a hidden size of 768 and 12 attention heads per layer. Following [7], we set dropout to 0.1. We train using Adam with an initial learning rate of 0.00001. Table 5.8 shows the performance of the pre-trained transformer (BERT), used as in [7], as well as the pre-trained transformer in the slot matching configuration (Slot Matching (BERT)). For BERT, we follow the sequence-pair classification approach described by [7].

                        QALD-7   LC-QuAD
BERT                     0.23     0.67
Slot Matching (BERT)     0.18     0.68

Table 5.8: CCA of the slot matching model proposed in Sec. 5.7.2, initialized with the weights of BERT-Small, compared with a regular transformer initialized with the same weights.
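To make this setup concrete, the following is a minimal sketch of a pre-trained BERT encoder used for sequence-pair classification, roughly in the spirit of [7]. It assumes PyTorch and the HuggingFace transformers library; the checkpoint name, the candidate serialization, and the training step shown are illustrative placeholders rather than our exact implementation.

```python
# Minimal sketch (illustrative only): score a (question, candidate) pair with a
# pre-trained BERT encoder and a sequence-pair classification head, trained with
# Adam at an initial learning rate of 1e-5 and dropout 0.1, as described above.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

checkpoint = "bert-base-uncased"  # assumption: stand-in for the pre-trained checkpoint used here
tokenizer = BertTokenizer.from_pretrained(checkpoint)
model = BertForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2,              # binary match / no-match decision for a candidate
    hidden_dropout_prob=0.1,   # dropout 0.1, following [7]
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# One hypothetical training step on a single (question, candidate) pair.
batch = tokenizer(
    ["Who is the mayor of Berlin?"],   # natural language question
    ["leader of : Berlin"],            # hypothetical serialization of a candidate query pattern
    return_tensors="pt", padding=True, truncation=True,
)
labels = torch.tensor([1])             # 1 = the candidate matches the question
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```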

Through this experiment we find that using pre-trained weights immensely improves model performance, as both transformer-based models outperform the ones in Section 5.8.1. Additionally, we find that the augmentations we propose in Sec. 5.7.2 are beneficial for the task, improving CCA on LC-QuAD by 1.4% relative to the regular pre-trained transformer. However, both models exhibit poor performance on QALD-7. The cause of this degradation is unclear to us at this time. An important difference between QALD-7 and LC-QuAD that could help explain it is the extremely small number of training examples (220) in QALD-7, which, coupled with the complexity of the questions, could make training much more challenging, especially with a model as over-parameterized as BERT.

5.9 Conclusion

In this chapter, we investigated transfer learning for KGQA. We show that pre-training models on a larger task-specific dataset and fine-tuning them on a smaller target dataset leads to an increase in model performance. We thereby demonstrate the high potential of these techniques for offsetting the lack of training data in the domain. Finally, we propose mechanisms to effectively employ large-scale pre-trained state-of-the-art language models (BERT [7]) for the KGQA task, leading to an impressive performance gain on the larger dataset.

Our comparison of a fully pre-trained transformer with a BiLSTM-based model in which only the word embeddings have been pre-trained (GloVe) might not yield a fair comparison between the two architectures (transformer vs. BiLSTM). Further insights could be gained in the future by analyzing the performance of a BiLSTM that has also been pre-trained as a language model (possibly combined with other tasks). Here, instead, our aim was to provide evidence that the use of neural networks pre-trained as language models is beneficial for knowledge graph-based question answering, in particular for SimpleQuestions, a use case not included in the original BERT evaluation and, to the best of our knowledge, not yet explored in the literature.

Even though BERT improves upon our BiLSTM baseline on SimpleQuestions, the improvements in the full-data scenario might not justify the significantly longer training and inference times and the higher memory requirements. These concerns, however, could be mitigated by practical tweaks and future research. Furthermore, with fewer data the performance increases w.r.t. the baseline become more pronounced, indicating that using pre-trained networks like BERT might be essential for achieving reasonable performance in limited-data scenarios. Such scenarios are common for datasets with more complex questions. Therefore, we believe pre-trained networks like BERT can have a bigger impact for complex KGQA (even when training with all available data). However, in our experiments with the very small QALD-7 dataset, we found that BERT-based models perform worse.

While we cannot explain this finding, we believe that training difficulties due to the extremely small yet complex dataset and the over-parameterized model could (partially) be to blame.

Chapter 6

Insertion-based decoding for semantic parsing

“I always like going south. Somehow it feels like going downhill.”

– Treebeard, from The Lord of the Rings: The Two Towers

In this chapter, we investigate insertion-based decoding for semantic parsing. Insertion-based decoding may (1) improve the efficiency of decoding by requiring fewer decoding steps and (2) affect accuracy because of different independence assumptions (compared to left-to-right decoding) and other inductive biases.

Sequence generation is usually based on a left-to-right autoregressive decoder that decomposes the probability of the entire sequence $y$ conditioned on $x$ ($x$ can be empty) as the product $p(y \mid x) = \prod_{i=0}^{N} p(y_i \mid y_{<i}, x)$. At each decoding step, the decoder model predicts the next token $y_i$ based on the previously generated outputs $y_{<i}$ and the input $x$. This approach to decoding sequences is linear in sequence length: the number of decoding steps necessary to produce the sequence is equal to the length of the sequence. In addition, every generated token $y_i$ is conditioned on all the tokens $y_{<i}$ generated so far, which may encourage learning spurious patterns and thus negatively affect generalization to novel data.
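For reference, a minimal sketch of this standard left-to-right decoding loop is given below; `model` is a hypothetical callable returning next-token logits for the input and the current prefix, and the special token ids are placeholders.

```python
# Minimal sketch of left-to-right greedy autoregressive decoding: one decoding
# step per generated token, each conditioned on the full prefix y_<i and input x.
import torch

def greedy_decode(model, x, bos_id=1, eos_id=2, max_len=100):
    y = [bos_id]
    for _ in range(max_len):
        logits = model(x, torch.tensor([y]))       # scores for p(y_i | y_<i, x)
        next_token = int(logits[0, -1].argmax())   # greedy choice of the next token
        y.append(next_token)
        if next_token == eos_id:                   # stop once the sequence is complete
            break
    return y
```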

Recently, several works have proposed non-autoregressive decoders that are sub-linear. These decoders allow a sequence to be generated in fewer decoding steps than its length and can thus greatly speed up inference, especially for longer sequences [23, 96, 98–100]. In particular, the Insertion Transformer of Stern et al. [23] uses insertion operations to iteratively expand the sequence by inserting tokens between existing tokens, thus enabling the generation of multiple tokens in a single decoding step and requiring only $\mathcal{O}(\log_2 n)$ decoding steps for a sequence of $n$ tokens.
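As a rough illustration of this logarithmic behaviour (a simplified best-case simulation, not the model of Stern et al. [23] itself), the following sketch counts the decoding steps of a balanced, middle-first insertion schedule in which every gap of the partial sequence receives one insertion per step.

```python
# Best-case step count of a balanced insertion schedule: a partial sequence of
# k tokens has k + 1 gaps, and each gap can receive one parallel insertion per
# step, so the output grows roughly exponentially per decoding step.
def balanced_insertion_steps(n):
    steps, produced, slots = 0, 0, 1
    while produced < n:
        produced += min(slots, n - produced)  # fill as many gaps as still needed
        slots = produced + 1                  # update the number of available gaps
        steps += 1
    return steps

print([balanced_insertion_steps(n) for n in (1, 2, 4, 8, 16, 31)])
# -> [1, 2, 3, 4, 5, 5], i.e. roughly ceil(log2(n + 1)) steps instead of n
```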

In this chapter, we investigate how well an insertion-based sequence decoder works for semantic parsing, where we re-implement the method proposed by Stern et al. [23]. Moreover, we develop a novel insertion-based decoding method that is specifically tailored for trees and compare it against the insertion-based sequence decoder, as well as against standard sequence-to-sequence and sequence-to-tree models. The insertion-based tree decoder that we propose here does not need to explicitly decode structure tokens. Moreover, it can achieve a best-case complexity below $\mathcal{O}(\log_2 N)$ in terms of the number of decoding steps (the exact best possible speed-up heavily depends on the data) and it guarantees that all intermediate outputs are valid trees.

In addition, in the investigated approaches, the tokens that are generated in parallel are predicted independently. On the one hand, this changes the independence relations between the output variables compared to left-to-right decoding. A higher degree of independence between output variables may reduce the learning of spurious patterns over the output variables. However, it does not reduce the learning of spurious patterns from the input, since the entire input is still observed. On the other hand, the introduced independence relations may negatively affect performance if the “correct” patterns have been broken. Moreover, other design choices and the resulting inductive biases may increase or decrease task accuracy. In particular, there may be a difference between treating the trees in semantic parsing as sequences or as trees: our proposed tree-based insertion transformer decodes and models trees differently than the sequence-based Insertion Transformer [23], which may affect accuracy. These considerations further motivate the investigation of parallel insertion-based decoding and tree-based insertion decoding.
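As a concrete illustration of these independence assumptions (in generic notation, not the exact notation used later in this chapter): if two tokens $y_a$ and $y_b$ are inserted in the same decoding step, given the current partial output $\hat{y}$ and the input $x$, they are predicted independently,

$$p(y_a, y_b \mid \hat{y}, x) = p(y_a \mid \hat{y}, x)\, p(y_b \mid \hat{y}, x),$$

whereas a left-to-right decoder would additionally condition $y_b$ on $y_a$ (or vice versa).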

We run experiments on the Overnight dataset for semantic parsing. Since the formal queries in semantic parsing can often be represented as trees (e.g., as abstract syntax trees), semantic parsing is a particularly interesting task for evaluating the presented decoding approach.

The contributions of this work are:

• a transformer-based decoding algorithm that uses insertion operations specifically tailored for decoding trees,

• a novel transformer architecture that uses novel tree-based relative positions,

• and an evaluation of the proposed algorithm and model on the well-known Overnight dataset, including a comparison against a strong non-autoregressive baseline.

This chapter is based on the following publication:

9. Denis Lukovnikov, Asja Fischer. Insertion-based tree decoding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics: Findings (ACL Findings 2021), 2021. This work was also presented at the 5th Workshop on Structured Prediction for NLP at ACL 2021.

The investigation presented in this chapter corresponds to Contribution 3, which addresses the following research question:

RQ3: Does insertion-based decoding improve accuracy and how much does it decrease the number of decoding steps?

The rest of the chapter is organized as follows: We elaborate on the insertion-based sequence decoder in Section 6.1. We present the insertion-based tree decoder approach in Section 6.2. The experiments are presented in Section 6.3 and a discussion is given in Section 6.4. We conclude in Section 6.5 with an outlook on future work. Related work on this topic is discussed in Section 3.1.6 in Chapter 3.

6.1 Insertion Transformer

Below, we give a very brief overview of the Insertion Transformer, which we use here as a baseline. Due to space constraints, we refer interested readers to the work of Stern et al. [23] for a more elaborate description of their model and training procedure.

6.2 Tree-based Insertion Transformer
