
Model evaluation. In the last step of our machine learning pipeline, we perform a final test of the machine learning model trained in the previous step.

Therefore, the input for this step is the best-performing model of the n-fold cross-validation, including hyperparameter tuning. The goal of this step is to check how our model performs on unseen data. For that reason, we created a train and a test set in the data collection and preparation step. The test set is typically used to compare the performance of competing models. It is a curated sample that represents the real-world data (i.e., a realistic distribution of the classes) [220]. As the test set is unseen by the trained model, we can draw conclusions about the model's performance and compare the results of several models. If the evaluation of the model is unsuccessful, we might have overfitted the model and have to go back to either the second or third step of the pipeline.
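A minimal sketch of this final check, assuming scikit-learn and toy feedback data (the texts, labels, and the stand-in for the tuned model are illustrative, not our actual pipeline):

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# Toy stand-ins; in our pipeline, texts and labels come from the data
# collection and preparation step.
texts = ["app crashes on start", "please add dark mode", "love it",
         "crashes again after update", "add an export feature", "great app"] * 10
labels = ["problem", "inquiry", "irrelevant"] * 20

# The split is made once, early in the pipeline; `stratify` keeps a realistic
# class distribution in the held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

# Stand-in for the best-performing model from the n-fold cross-validation,
# including hyperparameter tuning; it is trained on the training split only.
best_model = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
best_model.fit(X_train, y_train)

# Final check on unseen data: precision, recall, and F1 per class.
print(classification_report(y_test, best_model.predict(X_test)))
```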

We use the evaluation on the test set to create our benchmarks, which are tables describing the performance of each selected machine learning algorithm and its hyperparameters. We only report the best configuration for each algorithm, as we sometimes performed hundreds of machine learning experiments to find an optimized model.
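A sketch, continuing the toy data from the previous snippet, of how the best configuration per algorithm could be selected for such a benchmark; the candidate algorithms and the (deliberately small) hyperparameter grids are illustrative assumptions:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

# Candidate algorithms with illustrative grids (the real search spaces were
# considerably larger).
candidates = {
    "LinearSVC": (LinearSVC(), {"clf__C": [0.1, 1, 10]}),
    "NaiveBayes": (MultinomialNB(), {"clf__alpha": [0.1, 0.5, 1.0]}),
}

benchmark = {}
for name, (estimator, grid) in candidates.items():
    pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", estimator)])
    search = GridSearchCV(pipeline, grid, cv=10, scoring="f1_macro")
    search.fit(X_train, y_train)  # toy training split from the previous sketch
    # Only the best configuration per algorithm enters the benchmark table.
    benchmark[name] = (search.best_params_, search.best_score_)

for name, (params, score) in benchmark.items():
    print(f"{name}: {params} -> macro F1 = {score:.3f}")
```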

The final result is a machine learning model that we can use to make predictions on new data points (i.e., newly received user feedback).

The explicit and implicit feedback analyses are the core enablers for requirements intelligence. As discussed in Chapter 3, stakeholders find user feedback valuable but rarely utilize it, for reasons such as the large amount of received feedback and the share of irrelevant feedback. We therefore propose a framework that continuously collects explicit and implicit user feedback.

It then filters the collected feedback to show only feedback that is valuable for stakeholders, i.e., requirements-relevant feedback such as problem reports. Next, the framework analyzes which features users address in explicit feedback and how they use them based on implicit feedback. Finally, requirements intelligence extracts the documented features and matches them with those mentioned in the user feedback to enable advanced analytical insights in the integrated interactive visualization.

Requirements intelligence applies rigorous machine learning for its core enablers. The core enablers for requirements intelligence are the analyses of explicit and implicit user feedback. For each feedback type, we perform three activities: data collection and preprocessing, feedback filtering, and feedback to requirements. The data collection and preprocessing activity prepares the data for the feedback filtering and feedback to requirements activities, which apply machine learning.

Core Enablers for Requirements Intelligence

Explicit User Feedback Analysis: Feedback Filtering

Everything we hear is an opinion, not a fact. Everything we see is a perspective, not the truth.

Marcus Aurelius, Meditations

Publications. This chapter is based on and extends our publication “On the automatic classification of app reviews” [151]. My contributions to this publication were co-designing the interviews, interviewing stakeholders, creating tool mockups, and helping to analyze and discuss the paper’s results. Further, this chapter is also based on “Classifying Multilingual User Feedback using Traditional Machine Learning and Deep Learning” [235]. My contributions to this publication were conducting the research, developing the traditional machine learning part, and leading the analysis and writing.

Contribution. The approach we introduce in this study concerns the first activity of the explicit user feedback analysis of the requirements intelligence framework (see Figure 4.1): feedback filtering. The overall idea of the feedback filtering activity is to automatically identify the user feedback that contains requirements-relevant information. For example, if stakeholders are interested in using this activity to find problems and resolve them more quickly, the approach implemented for this activity must extract all user feedback that reports problems or hints toward them. Our approach can filter irrelevant feedback and categorize the remaining feedback into problem reports and inquiries. We look into user feedback in different languages coming from two different platforms.

Addressed stakeholder needs. In Chapter 3, we found that stakeholders value user feedback but rarely use it in their decision-making process. The reasons are that user feedback comes in large amounts, from diverse platforms and channels, and often contains irrelevant and noisy information. As stakeholders perform the analysis manually, they need an automated approach that filters irrelevant user feedback, allowing them to focus on requirements-relevant feedback.

5.1 Motivation

Research has shown the importance of extracting requirements-related information from explicit user feedback to improve software products and user satisfaction [140, 190]. Apps with better ratings and frequent feedback get a higher rank in the app distribution platforms, leading to more visibility and, therefore, higher download numbers [73]. But as user feedback on social media or in app stores can arrive by the thousands daily, a manual analysis of that feedback is cumbersome [187]. However, analyzing this feedback brings opportunities to better understand user opinions because it contains valuable information such as problems users encounter or features they miss [78, 99, 112, 187, 191]. Although user feedback contains valuable information for requirements engineers, a large share of it is rather uninformative and of low quality [180]. Such uninformative feedback is often spam, praise for the app, insulting comments, or a repetition of the star rating in words [109, 187]. When analyzing explicit user feedback, we typically face the following challenges:

Uninformative Feedback: Pagano and Maalej [187] as well as Guzman et al. [99] show that more than 70% of the user feedback is not related to requirements engineering and is therefore considered noise.

Feedback Channels: User feedback is provided on many different channels, such as social media and app distribution platforms. Development teams have to identify the channels by which users provide feedback, then aggregate that feedback, and analyze it to extract the desired information.

As these channels have different purposes, the language of the users may differ. Tweets, for example, have a fixed and limited length and are usually used to state a general opinion or to request support. The purpose of app reviews, on the other hand, is to provide a review and rating of the app itself.

Language Diversity: Apps are usually not restricted to a single country or a single language. In particular, popular apps have users all over the world and provide their interface in many languages. When users give feedback, they often do so in their native language. Organizations such as Spotify are aware of that situation and have dedicated support profiles on Twitter to help their users in their native language. When automatically classifying user feedback, machine learning algorithms usually use language-dependent models for text representations. Therefore, when designing automated approaches and trying to be inclusive toward non-English-speaking users, it is important to consider the implications of those languages on the machine learning models.
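To illustrate the last point, a small sketch of language-aware routing follows; the langdetect call comes from a third-party language identification library, and the per-language classifiers are placeholder stand-ins for language-dependent models:

```python
from langdetect import detect  # third-party language identification library

# Placeholder stand-ins for classifiers trained with language-dependent
# text representations (e.g., separate vocabularies per language).
classifiers = {
    "en": lambda texts: ["problem_report" for _ in texts],  # English model stand-in
    "it": lambda texts: ["inquiry" for _ in texts],         # Italian model stand-in
}

def classify_feedback(text):
    """Route a feedback message to the classifier of its detected language."""
    language = detect(text)            # e.g., "en" or "it"
    predict = classifiers.get(language)
    if predict is None:
        return "unsupported_language"  # fall back instead of misclassifying
    return predict([text])[0]

print(classify_feedback("The app keeps crashing after the update"))
```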

Researchers have applied supervised machine learning to filter noisy, irrelevant feedback and to extract requirements-related information [69, 100, 151]. Most related works rely on traditional machine learning approaches that require domain experts to represent the data with hand-crafted features. In contrast, end-to-end deep learning approaches automatically learn high-level feature representations from raw data without domain knowledge, achieving remarkable results in different classification tasks [87, 230, 262].
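The following sketch contrasts the two families of approaches: a traditional pipeline whose text representation (TF-IDF over word n-grams) is chosen by hand versus a small end-to-end Keras model that learns its representation from raw token ids; layer sizes and hyperparameters are illustrative assumptions, not the configurations used in this work.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from tensorflow.keras import layers, models

# Traditional ML: the feature representation (TF-IDF over word n-grams)
# is hand-crafted by a domain expert.
traditional = Pipeline([
    ("features", TfidfVectorizer(ngram_range=(1, 2))),
    ("classifier", LinearSVC()),
])

# End-to-end deep learning: the representation is learned from raw token ids.
deep = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=128),
    layers.GlobalAveragePooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),  # problem report / inquiry / irrelevant
])
deep.compile(optimizer="adam",
             loss="sparse_categorical_crossentropy",
             metrics=["accuracy"])
```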

This study’s overall objective is to develop approaches that automatically extract requirements-related information from explicit user feedback and filter uninformative feedback. We develop these approaches by carefully addressing the described challenges. In this work, we aim to understand whether and to what extent deep learning can improve state-of-the-art results for classifying user feedback into problem reports, inquiries, and irrelevant feedback. We focus on these three categories because practitioners seek automated solutions to filter noisy feedback (irrelevant), to identify and fix bugs (problem reports), and to find feature requests as inspiration for future releases (inquiries) [151]. We consider as problem reports all user feedback that states a concrete problem related to a software product or service (e.g., “Since the last update the app crashes upon start”). We define inquiries as user feedback that asks for new functionality or an improvement, or requests information for support (e.g., “It would be great if I could invite multiple friends at once”). We consider user feedback as irrelevant if it belongs to neither problem reports nor inquiries (e.g., “I love this app”).

To fulfill our objective, we employ supervised machine learning fed with crowd-sourced annotations of 10,000 English and 15,000 Italian tweets from telecommunication Twitter support accounts, and 6,000 annotations of English app reviews.

We apply best practices for both machine learning approaches (traditional and deep learning) and report on a benchmark.

The remainder of this chapter is structured as follows. Section 5.2 reports on our study design by stating our research questions, our study process, and the dataset the study relies on. In Section 5.3, we detail how we applied traditional machine learning. Section 5.4 shows our approach to using deep learning. In Section 5.5, we report our classification results. Section 5.6 discusses the implications of the study, while Section 5.7 summarizes the overall work.