
5.6.1 Implications of the Results

In this work, we classified user feedback for two languages from two different feedback channels. We found that, when considering the F1 score as a measure, traditional machine learning performs slightly better in most of the examined cases. We expect that our approaches can also be applied to further feedback channels and languages, although some features are language-dependent and need to be updated. For example, our deep learning model requires a pre-trained word embedding model on top of a training set for each language, such as the English and Italian fastText models. Word embeddings capture the similarity between words depending on the domain and language. They are highly adaptable to language evolution, as the embedding model can be retrained regularly on current app reviews and tweets, and they can capture the meaning of transitory terms like Twitter hashtags or emoticons. In the traditional approaches, the language-dependent features are keywords, sentiment, POS tags, and the tf-idf vocabulary, which makes creating models for multiple languages more laborious. The remaining features are language- and domain-independent.
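The sketch below illustrates these language-dependent resources: one pre-trained fastText embedding model per language for the deep learning approach and a per-language tf-idf vocabulary for the traditional approach. The file names, example texts, and parameter choices are illustrative assumptions, not the exact setup of our study.

```python
# Sketch of the language-dependent resources, assuming the official pre-trained
# fastText .bin files have been downloaded; file names and example reviews are
# illustrative only.
import fasttext                                   # pre-trained word embeddings
from sklearn.feature_extraction.text import TfidfVectorizer

# Deep learning side: one embedding model per language.
en_model = fasttext.load_model("cc.en.300.bin")   # English fastText vectors
it_model = fasttext.load_model("cc.it.300.bin")   # Italian fastText vectors

# Embeddings capture domain similarity, including transitory terms.
print(en_model.get_nearest_neighbors("crash"))    # e.g. bug-, freeze-related terms

# Traditional side: the tf-idf vocabulary must be rebuilt per language/corpus.
reviews_en = ["the app crashes on startup", "please add a dark mode"]
tfidf = TfidfVectorizer(lowercase=True, ngram_range=(1, 2))
features = tfidf.fit_transform(reviews_en)        # language-dependent vocabulary
```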

Traditional approaches often perform better on small training sets as domain experts implicitly incorporate significant information through hand-crafted features [41]. We assume that for these experiments, the hand-crafted features derived from the domain experts lead to considerably better classification results.

Deep neural networks derive high-level features automatically by utilizing large amounts of training data. We presume that, with more training data, a deeper neural network would outperform the traditional approach.
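To make the contrast concrete, the following sketch shows how a traditional pipeline combines a hand-crafted feature with generic text features; the keyword list and the choice of classifier are illustrative assumptions rather than the study's exact configuration.

```python
# Sketch of a traditional pipeline with a hand-crafted feature; keyword list
# and classifier are illustrative assumptions.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

PROBLEM_KEYWORDS = {"crash", "bug", "error", "freeze"}   # hand-crafted, per language


class KeywordFeature(BaseEstimator, TransformerMixin):
    """Binary feature: does the text contain a problem-related keyword?"""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[any(k in text.lower() for k in PROBLEM_KEYWORDS)]
                         for text in X], dtype=float)


clf = Pipeline([
    ("features", FeatureUnion([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),   # generic text features
        ("keywords", KeywordFeature()),                   # expert knowledge
    ])),
    ("model", RandomForestClassifier(n_estimators=200, random_state=42)),
])

# Usage: clf.fit(train_texts, train_labels); clf.predict(test_texts)
```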

5.6.2 Field of Application

Classifying user feedback is an ongoing field of research because of the high amount of feedback organizations receive daily. In particular, Pagano and Maalej [187] show that popular apps such as Facebook receive about 4,000 reviews each day. When considering Twitter as a data source for user feedback on apps, Guzman et al. [99] show that popular app development companies receive, on average, about 31,000 user feedback messages per day. Such numbers make it difficult for stakeholders to employ a manual analysis of user feedback [98]. Recent advances in technology and scientific work enable new ways to tackle these challenges. Our results show that we achieve the best classification results for identifying irrelevant user feedback. For that feedback category, we achieved F1 scores of .89 for English app reviews, .74 for English tweets, and .83 for Italian tweets. Therefore, we can reduce the effort for stakeholders by applying our approach to filter the user feedback they receive. However, the results are not yet good enough to ensure that stakeholders do not miss important feedback. As a consequence, we suggest including a human control mechanism that allows stakeholders to correct the output of the approach and thereby continuously improve it over time.
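A minimal sketch of such a human control mechanism is shown below, assuming a trained classifier with scikit-learn's predict() interface; the function names, labels, and retraining hook are illustrative assumptions, not our implementation.

```python
# Sketch: filter feedback automatically, let stakeholders override wrong
# predictions, and collect the corrections for the next retraining run.
corrections = []  # (text, corrected_label) pairs provided by stakeholders


def filter_feedback(clf, feedback_items):
    """Keep only feedback the classifier does not consider irrelevant."""
    kept = []
    for text in feedback_items:
        label = clf.predict([text])[0]   # problem report / inquiry / irrelevant
        if label != "irrelevant":
            kept.append((text, label))
    return kept


def record_correction(text, predicted, corrected):
    """Stakeholder overrides a wrong prediction; store it for retraining."""
    if corrected != predicted:
        corrections.append((text, corrected))
```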

Stakeholders can use the results of our approach to investigate the feedback categories further. For example, if stakeholders want to understand whether several users report the same problems, they can apply approaches like topic modeling. If they apply topic modeling to the feedback our approach classified as problem reports, stakeholders already know that all resulting topics are related to problems. If they apply it to feedback classified as inquiries, they can find common feature requests.
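The following sketch illustrates this idea with LDA applied only to feedback already classified as problem reports, so every resulting topic describes a problem; the corpus and parameter values are illustrative assumptions.

```python
# Sketch: topic modeling restricted to classified problem reports.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

problem_reports = [
    "app crashes when I open the camera",
    "login fails after the latest update",
    "camera freezes and the app closes",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(problem_reports)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Top words per topic hint at recurring problems (e.g. camera crashes, login issues).
terms = vectorizer.get_feature_names_out()
for topic in lda.components_:
    print([terms[i] for i in topic.argsort()[-5:]])
```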

5.6.3 Alternative Implementations from Related Work

We presented a study giving one example of how to implement the requirements intelligence framework’s feedback filtering activity. However, several alternative implementations are possible. Here we describe some of them found in the literature.

In our study, we focused on the three categories problem report, inquiry, and others. Research shows that we can find many more categories in explicit user feedback. Pagano and Maalej [187] found 17 topics in app reviews, which they further summarized into the four categories rating, user experience, requirements, and community. Depending on our goals in the requirements engineering process, we could consider including other categories than those identified in our study.

For example, if we consider the topic helpfulness, which is about scenarios where the app has proven helpful, we could train a machine learning model to identify this topic. This topic may be particularly interesting for understanding how users perceive our use cases, whether we cover all intended use cases, and whether users found unintended ones. When we understand which of our use cases are perceived as the most helpful, we can leverage this information in our decision-making, e.g., by giving them more attention in the bug fixing or improvement process.

In our study, we considered explicit user feedback as static entities. However, depending on the platform, user feedback may be updated or evolve in lengthy discussions [107, 124, 157]. App stores, for example, allow users to write feedback and developers to respond to it; both the feedback and the response can be updated. In forums or on Twitter, there may be a full conversation. In our study, we did not consider these dynamics, although they can include useful information. For example, Martens and Maalej [157] show that users who give feedback on Twitter often do not include much information in case of a problem report. The support team on Twitter then replies to those messages asking, for example, for context information such as the software version and the steps to reproduce the problem. This information is particularly crucial if we want to use the feedback to create an issue tracker entry for developers [22, 28, 170]. Therefore, a possible alternative for the feedback filtering activity is to bundle the messages of a conversation into a single one that contains all stated information.
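A possible way to bundle such conversations is sketched below; the message structure (conversation id, author, timestamp, text) is an assumed, simplified representation rather than the actual Twitter data model.

```python
# Sketch: bundle all messages of one support conversation into a single
# feedback item before filtering.
from collections import defaultdict
from typing import Dict, List


def bundle_conversations(messages: List[Dict]) -> Dict[str, str]:
    """Concatenate the messages of each conversation in chronological order."""
    threads: Dict[str, List[Dict]] = defaultdict(list)
    for msg in messages:
        threads[msg["conversation_id"]].append(msg)

    bundled = {}
    for conv_id, msgs in threads.items():
        msgs.sort(key=lambda m: m["timestamp"])
        bundled[conv_id] = " ".join(f'{m["author"]}: {m["text"]}' for m in msgs)
    return bundled
```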

5.6.4 Limitations and Threats to Validity

We discuss the internal and external validity of our results as well as the limitations of the approach.

Internal threats to validity. Concerning the internal validity, one threat of conducting crowdsourcing studies to label data is that human coders can make mistakes, resulting in unreliable classifications. We took several measures to mitigate this threat. First, we wrote a coding guide that described our understanding of the feedback categories problem reports, inquiries, and irrelevant. Second, we ran a pilot study to test the quality of the coding guide and the annotations received. Third, both coding guides were either written or proof-read by at least two native speakers. Fourth, we further required that the annotators are native speakers of the respective language. Fifth, we employed peer-coding, which involved at least two persons and a third one in case of a disagreement.
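The peer-coding rule can be summarized by the small sketch below: two coders per item, a third coder only on disagreement, and the majority label wins. The data layout is an illustrative assumption.

```python
# Sketch of the peer-coding resolution rule described above.
from collections import Counter
from typing import List, Optional


def resolve_label(labels: List[str]) -> Optional[str]:
    """Return the agreed category for one feedback item, or None if unresolved."""
    first, second = labels[0], labels[1]
    if first == second:                      # the two coders agree
        return first
    if len(labels) < 3:                      # disagreement: a third coder is needed
        return None
    majority, count = Counter(labels).most_common(1)[0]
    return majority if count >= 2 else None  # still no majority: leave unresolved
```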

Additionally, there are many more machine learning approaches and, in particular, classifiers we could have included. Specifically, the deep learning part discusses only one approach. To mitigate these threats, we selected several classifiers from related work that had objectives similar to ours. Future research might want to extend the benchmarks by including other deep learning approaches such as BERT [57].
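Such a benchmark extension could, for instance, build on the Hugging Face transformers library as sketched below; the model name, label count, and example input are assumptions and not part of the presented study.

```python
# Sketch: setting up BERT for three-way feedback classification.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)   # problem report, inquiry, irrelevant

inputs = tokenizer(["the app crashes on startup"],
                   padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs)                # logits over the three feedback categories
```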

External threats to validity. Concerning the external validity, we are confident that our results have a high level of generalizability. We mitigated the threat regarding generalizability by performing rigorous machine learning experiments that report their results on an unseen test set. Therefore, our approach is not prone to overfitting.

Further, we addressed the issue of classifying feedback in different languages. Our approach only considers two languages and, therefore, might not generalize to others. We mitigated this threat by describing in detail the effort stakeholders have to invest in creating machine learning approaches for further languages.