
Figure 9.5: The user interface adopted in the user study.

We are planning six-month release cycles, regularly updating the dataset based on improvement suggestions and corrections. In addition, we plan to extend VQuAnDa with further verbalization examples. Ultimately, we aim to turn the dataset into an open community effort, where researchers working in the domain of verbalization can update the data and contribute their own evaluation baseline models.

9.5 User study

We design a user study to evaluate the performance of various approaches to the problem of validating the answers given by the QA system. This enables us to study users’ behavior when they are offered different representation methods.

In the following, we first describe the user interface adopted for the user study. Afterward, we discuss the evaluation setup, including the evaluation metrics and the various approaches that provide a particular representation of the formal query.

9.5.1 User Interface

We develop a web application to carry out the user study. The workflow is as follows: users first sign up in the system. They can then log in and begin the user study. An explanatory web page describes the goal of the user study, the individual steps, and the different elements of the user interface. After this brief introduction, users are shown the main page of the user interface (see Figure 9.5). At the top of the user interface (#1), the question is displayed, and the answer is presented in the next line (#2). On the left side, the interpretation of the formal query is shown (#3); the different interpretation approaches are described in Section 9.5.2. Users decide whether the interpretation is semantically equivalent to the input question by choosing the corresponding option on the right side of the user interface (#4). Users can also skip a question when the question or its interpretation is not comprehensible by clicking the skip button on the right side (#5).
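To make the page layout concrete, the sketch below models the information shown to the user and the response that is recorded. It is a minimal illustration of the workflow described above; the class and field names are hypothetical and not taken from the study's actual implementation.

```python
# Minimal sketch of the data behind the main page in Figure 9.5.
# All names are hypothetical, not the study's actual implementation.
from dataclasses import dataclass
from enum import Enum


class Judgment(Enum):
    """Options offered on the right-hand side of the interface (#4, #5)."""
    EQUIVALENT = "interpretation matches the question"
    NOT_EQUIVALENT = "interpretation does not match the question"
    SKIPPED = "question or interpretation is not comprehensible"


@dataclass
class StudyItem:
    question: str        # shown at the top of the page (#1)
    answer: str          # shown on the next line (#2)
    interpretation: str  # representation of the formal query (#3), cf. Section 9.5.2


@dataclass
class UserResponse:
    user_id: str
    item_id: int
    judgment: Judgment
    seconds_taken: float  # later aggregated into the Duration metric
```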

Finally, we ask users for their feedback on the usability of the user study on a scale of one to five, with one being hard to use and five being easy to use.


Figure 9.6: Various representation approaches of the formal query for the question “Which location city of Denver Broncos is the place of birth of Steven Clark Cunningham?”: (a) SPARQL representation, (b) graph representation, (c) controlled language representation, (d) answer verbalization representation.

9.5.2 Evaluation Setup

We recruited a user group of 33 graduate and postgraduate students (mostly in computer science) and presented them with 20 questions randomly drawn from the LC-QuAD dataset. The participants were briefly introduced to the systems and spent 20 minutes on average to complete the user study. The following rules were observed during the user study (a sketch of how the first two could be enforced is given below):

• Each user is consistently provided with a fixed representation of the formal query.

• No question is provided twice to the same user.

• Users may skip a question if they find it or its interpretation incomprehensible.

The main page of the user interface is depicted in Figure 9.5.
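The first two rules can be enforced with a small amount of bookkeeping. The following is a minimal sketch under the assumption of four representation conditions and a shared question pool; it is not the study's actual implementation. The third rule, skipping, is handled directly in the interface via button #5.

```python
# Hypothetical sketch of how the two assignment rules could be enforced;
# the actual mechanism used in the study is not described in the text.
import random

REPRESENTATIONS = ["formal", "graphical", "controlled_language", "answer_verbalized"]


def assign_representation(user_id: str, assignments: dict) -> str:
    """Rule 1: each user is consistently bound to one fixed representation."""
    if user_id not in assignments:
        # Balance new users across the four representations.
        counts = {r: sum(v == r for v in assignments.values()) for r in REPRESENTATIONS}
        assignments[user_id] = min(counts, key=counts.get)
    return assignments[user_id]


def next_question(user_id: str, seen: dict, pool: list):
    """Rule 2: never show the same question twice to the same user."""
    remaining = [q for q in pool if q not in seen.setdefault(user_id, set())]
    if not remaining:
        return None  # the user has judged (or skipped) all 20 questions
    choice = random.choice(remaining)
    seen[user_id].add(choice)
    return choice
```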

Metrics

We consider multiple evaluation metrics to capture different aspects: We define Accuracy as the percentage of the questions for which users confirm the interpretation of the question and answer. Duration specifies the average time that users take to process each question. Ease of Use is the feedback explicitly inquired from the users at the end of the user study (on a scale of 1 to 5).
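As a reading aid, the sketch below shows one plausible way to compute these metrics from logged responses; the field names and the exclusion of skipped questions from Accuracy are assumptions, not details given in the text.

```python
# Minimal sketch of the three metrics, computed from hypothetical response logs.
# Field names ("judgment", "seconds_taken") and the handling of skipped questions
# are assumptions; the text does not specify these details.
from statistics import mean


def accuracy(responses: list) -> float:
    """Percentage of (non-skipped) questions whose interpretation was confirmed."""
    judged = [r for r in responses if r["judgment"] != "skipped"]
    confirmed = [r for r in judged if r["judgment"] == "confirmed"]
    return 100.0 * len(confirmed) / len(judged) if judged else 0.0


def duration(responses: list) -> float:
    """Average time (in seconds) users take to process each question."""
    return mean(r["seconds_taken"] for r in responses)


def ease_of_use(feedback_scores: list) -> float:
    """Average of the explicit 1-to-5 feedback collected at the end of the study."""
    return mean(feedback_scores)
```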

Baselines

We consider the following approaches to provide a different representation of the formal queries to the users:


Figure 9.7: (a) Accuracy and (b) average time.

• We consider the Formal representation as the baseline representation, in which the user is simply provided with the formal query (e.g., SPARQL) of the input question, as well as the answer that is retrieved from the knowledge graph. This representation requires certain skills and knowledge about semantic technologies, including the SPARQL query language (see Figure 9.6(a)).

• We show a Graphical interpretation (similar to [161]) of the formal query that denotes a subgraph of the underlying knowledge graph, along with the answer. Figure 9.6(b) depicts an example of this interpretation. While this representation is more intuitive than the previous one, it still requires a basic understanding of knowledge representation with graphs made of nodes and edges. Furthermore, certain formal constructs such as boolean queries or aggregate functions cannot be integrated into this representation.

• A correct and sound verbalization of the formal query is very challenging; moreover, a mistake in the verbalization of the query might confuse users or change its meaning. Inspired by recent works on verbalizing formal queries [143, 162, 163], we therefore use a controlled language interpretation as yet another baseline. This interpretation tries to minimize the skill required to understand it while avoiding the complexity of a complete verbalization of the formal query. In particular, we employ the online API³ of SPARQLToUser [163] to provide the controlled language interpretation. Furthermore, we provide the answer to the user as well (see Figure 9.6(c)).

Moreover, we provide the users with the Answer verbalized representation, which is based on our hypothesis: providing a sound and correct verbalization of the answer (see Figure 9.6(d)). A hypothetical sketch contrasting this representation with the formal one is given below.
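For concreteness, a plausible SPARQL query for the example question in Figure 9.6 and a naive template-based answer verbalization might look as follows. The DBpedia URIs, property names, and the verbalization template are illustrative assumptions, not the exact representations shown to the participants.

```python
# Illustrative contrast between the formal representation (Figure 9.6(a)) and the
# answer verbalized representation (Figure 9.6(d)) for the example question.
# The DBpedia URIs and the template are plausible reconstructions, not the
# artifacts actually used in the study.
EXAMPLE_SPARQL = """
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?city WHERE {
  dbr:Denver_Broncos dbo:locationCity ?city .
  dbr:Steven_Clark_Cunningham dbo:birthPlace ?city .
}
"""


def verbalize_answer(answer: str) -> str:
    # A naive template in the spirit of the hypothesis: embed the retrieved
    # answer into a complete natural-language sentence instead of showing
    # the formal query.
    return ("The location city of Denver Broncos that is also the place of birth "
            f"of Steven Clark Cunningham is {answer}.")


print(verbalize_answer("<answer retrieved from the knowledge graph>"))
```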

9.5.3 Evaluation Results

We present the results of the user study in this section. The goal of the user study is to analyze the effect of the various interpretation approaches on the performance of the users in terms of accuracy, time, and user satisfaction.

Note that we only provide the correct interpretations. While adding incorrect interpretations might be insightful to measure whether users can reject incorrect interpretations, we refrain from doing so to avoid any further complications and focus on measuring the comprehensibility of the interpretations from the user perspective.

³ https://qanswer-sparqltouser.univ-st-etienne.fr/sparqltouser


While all of the approaches show a declining trend as the complexity increases (see Figure 9.7(a)), users who were shown the Answer verbalized representation consistently exhibit better judgments across all complexities, followed by the Graphical interpretation with a margin of 7%. Figure 9.7(a) further reveals that the Formal representation and the controlled language interpretation failed to efficiently help users confirm the interpretation, notably in the case of more complex questions.

However, Figure 9.7(b) reveals that users with the Answer verbalized representation spent slightly more time (two seconds on average) to provide their feedback, in contrast to users with the Graphical interpretation. Our detailed analysis shows that the reason for the low average time in the case of the most complex questions with the controlled language interpretation is that users either incorrectly reject the interpretation or mark it as incomprehensible. This is also reflected in the very low performance of the controlled language interpretation for the most complex questions in Figure 9.7(a).

Moreover, we computed the average for the ease of use metric: users with the Answer verbalized representation assigned the highest value of 4.3 compared to the others. It is followed by an average score of 4.0 for the Graphical interpretation, and 3.8 for both the controlled language interpretation and the Formal representation.

Our findings from the user study confirm our hypothesis that a correct and comprehensive verbalization of the answer is an effective approach to enable users to verify the answer provided for a given question.