
classifiers from the related work that had similar objectives to ours. Future research might want to extend the benchmarks by including other deep learning approaches such as BERT [57].

External threats to validity. Concerning external validity, we are confident that our results generalize well. We mitigated the threat to generalizability by performing rigorous machine learning experiments and reporting the results on an unseen test set, which reduces the risk of overfitting.

Further, we addressed the issue of classifying feedback in different languages.

Our approach considers only two languages and might therefore not generalize to other languages. We mitigated this threat by describing in detail the effort stakeholders need to invest to create machine learning approaches for other languages.

of traditional machine learning, we can leverage the domain experts’ knowledge for extracting and selecting machine learning features. We further assume that selecting other deep learning approaches could make a difference, too.

The manual effort for applying traditional machine learning is higher than for deep learning. Global organizations have users across the world, meaning that they receive feedback in diverse languages. We selected English and Italian to analyze how much effort stakeholders need to configure and create a multi-language approach. We further wanted to understand how well the classification performs in different languages. The feature engineering step in traditional machine learning requires domain experts because knowledge about the domain guides the extraction and selection of machine learning features. Our study shows that stakeholders must invest additional feature engineering effort for each language, as many machine learning features are language-dependent. For example, the terms users choose to describe a problem may differ between languages, and if stakeholders want to include keywords as a machine learning feature, they have to understand the specifics of each language.

Deep learning, on the other hand, does not require a domain expert as it extracts machine learning features automatically. Therefore, the effort for domain experts to apply deep learning is lower.
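To make this contrast concrete, the following minimal sketch shows how a language-dependent keyword feature could be configured for a traditional classifier. The keyword lists and the function name are hypothetical and only illustrate the kind of per-language configuration a domain expert has to provide, which deep learning avoids.

```python
# Hypothetical sketch of a language-dependent keyword feature for a
# traditional classifier; keyword lists and names are illustrative only,
# not the feature configuration used in our study.
PROBLEM_KEYWORDS = {
    "en": {"crash", "bug", "error", "freeze"},
    "it": {"crash", "errore", "bug", "problema"},
}


def keyword_features(feedback_text, language):
    """Binary feature: does the feedback contain a problem-report keyword?"""
    tokens = set(feedback_text.lower().split())
    return {"has_problem_keyword": bool(tokens & PROBLEM_KEYWORDS[language])}


print(keyword_features("The app shows an error after every update", "en"))
print(keyword_features("L'app si chiude con un errore dopo il login", "it"))
```

Every additional language requires curating such lists anew, which is the effort we refer to above.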

Irrelevant user feedback identification achieves the best results. One of the stakeholders’ primary concern with analyzing user feedback is the large amount of irrelevant feedback (see Chapter 3). To reduce the amount of user feed-back stakeholders analyze manually, stakeholders require automated approaches that extract requirements-relevant feedback. Our approach works best for iden-tifying irrelevant user feedback by achieving F1 scores of .89 for English app re-views, .74 for English tweets, and .83 for Italian tweets. Therefore, our approach can reduce the effort for stakeholders. Further categorizing requirements-relevant user feedback to problem reports and inquiries is possible with our approach. We achieve an average F1 score of .68 for identifying problem reports and an average F1 score of .59 for inquiries (both are averages across platforms and languages).
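The reported averages are plain means over the per-dataset F1 scores. The following small sketch only illustrates this cross-platform, cross-language averaging, using the irrelevant-feedback scores listed above as input; the per-dataset scores for the other classes are not repeated here.

```python
# Illustration of averaging per-dataset F1 scores across platforms and
# languages; the input values are the irrelevant-feedback scores above.
from statistics import mean

irrelevant_f1 = {"en_app_reviews": 0.89, "en_tweets": 0.74, "it_tweets": 0.83}
print(round(mean(irrelevant_f1.values()), 2))
```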

Explicit User Feedback Analysis:

Feedback to Requirements

Look deep into nature, and then you will understand everything better.

Albert Einstein

Publication. This chapter is based on our publication “SAFE: A Simple Approach for Feature Extraction from App Descriptions and App Reviews” [118]. My contributions to this publication were creating and evaluating the app feature extraction approach, including all steps, performing the analyses, and supporting the writing.

Contribution. This chapter concerns the second activity of the explicit user feedback analysis of the requirements intelligence framework (see Figure 4.1), feedback to requirements. We introduce an approach that extracts the app features users address in their feedback and the app features stakeholders document on app pages. Further, the approach can match the app features addressed in the feedback with the features documented on the app pages.

Addressed stakeholder needs. In Chapter 3, we found that stakeholders need to know the features users discuss and the features similar apps provide. In particular, they want to foster innovation and improve the app’s quality by understanding the features users address and the features similar apps document on their app pages.

6.1 Motivation

App stores are particularly interesting to the software and requirements engineering community for two main reasons. First, they include millions of apps with the potential to reach hundreds of millions of users [187]. Second, they aggregate information from the customer, business, and technical perspectives [73].

Each app in the store is presented by its own app page. This typically includes the app name, icons, screenshots, previews, and the app description as written and maintained by the stakeholders. Users rely on this information to find and download the app they are looking for. Later, some users review the app and comment on its features and limitations. Some apps might receive even more than a thousand reviews per day [187], some of which include hints and insights for the stakeholders.

Over the past years, we observed a “boom” of research papers, tools, and projects on app store analytics [162]. Most of these works focus on mining large amounts of app store data to derive advice for analysts, developers, and users. One popular analytics scenario is to mine app reviews to identify popular user complaints or requests [78, 112, 152] and to assist release planning [39, 101, 249]. If we can identify the specific features users request, or the features affected by a problem, we can analyze the impact features have on the success of an app [48, 218]. Another scenario is to mine the app pages [93, 106] and their evolution over time [217] to derive recommendations for users or to identify combinations that increase download numbers [105, 178].

A common and elementary step that these approaches share is the following: how can the app features mentioned in natural language text be identified automatically and accurately, in particular since the text is unstructured, potentially noisy, and uses varying vocabulary? Previous approaches focus on mining the features from a particular artifact in the store, typically either from the app pages or from the reviews. However, app vendors are likely interested in a holistic approach that works for multiple types of artifacts, so that they can combine the customer, business, and technical information.

In this chapter, we introduce SAFE, a simple, uniform approach to extract and match app features from app descriptions (written by stakeholders) and app reviews (written by users). Our approach requires neither a-priori training with a large dataset nor the configuration of machine learning features and parameters, and it works on single text elements such as a single app review.

Based on an in-depth manual analysis, we identify a set of general textual patterns that are frequently used in app stores to denote app features. These include a) 18 part-of-speech patterns (such as Noun_Noun_Noun to represent a feature like “Email chat history”), b) five sentence patterns (indicating enumerations and conjunctions of features), and c) several noise filtering patterns (such as filtering contact information). After preprocessing the text of a single app description or review, we apply these patterns and extract a list of features, each consisting of two to four keywords. Finally, we match the features extracted from the reviews to those extracted from the descriptions.
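To illustrate the idea behind these patterns, the following sketch applies a small, simplified subset of part-of-speech patterns and a naive word-overlap matching step. It is not the original SAFE implementation: the pattern list, the noise terms, and the example texts are chosen for illustration only.

```python
# Simplified sketch of pattern-based feature extraction and matching.
# Requires nltk plus the 'punkt' and 'averaged_perceptron_tagger' resources.
import nltk

# Illustrative subset of part-of-speech patterns (Penn Treebank tags);
# the full SAFE approach uses 18 such patterns plus sentence patterns.
POS_PATTERNS = [
    ("NN", "NN", "NN"),   # e.g., "email chat history"
    ("NN", "NN"),         # e.g., "chat history"
    ("VB", "NN"),         # e.g., "send message" / "sending photos"
    ("JJ", "NN"),         # e.g., "offline mode"
]

NOISE_TERMS = {"http", "www", "com"}  # stand-in for SAFE's noise filtering


def extract_features(text):
    """Return candidate app features (tuples of two to four keywords)."""
    features = []
    for sentence in nltk.sent_tokenize(text.lower()):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        for pattern in POS_PATTERNS:
            n = len(pattern)
            for i in range(len(tagged) - n + 1):
                window = tagged[i:i + n]
                # Match on tag prefixes so NN also covers NNS, VB covers VBG, ...
                if all(tag.startswith(p) for (_, tag), p in zip(window, pattern)):
                    words = tuple(word for word, _ in window)
                    if not NOISE_TERMS.intersection(words):
                        features.append(words)
    return features


def match_features(review_features, description_features):
    """Naively match review features to description features by word overlap."""
    return [(rf, df) for rf in review_features for df in description_features
            if set(rf) & set(df)]


description = "Browse your email chat history and share photos offline."
review = "I love the chat history, but sending photos often fails."
desc_features = extract_features(description)
rev_features = extract_features(review)
print(desc_features)
print(rev_features)
print(match_features(rev_features, desc_features))
```

The sketch already shows why such patterns work on single, short texts: no training data is needed, only the pattern definitions and a lightweight preprocessing step.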

We implemented our approach and applied it to the descriptions and the reviews of 10 popular apps from the Apple App Store. We also reimplemented two frequently cited state-of-the-art approaches: one focuses on app pages [106], the other on reviews [101]. We evaluated and compared the feature extraction accuracy as well as the matching between the features in the reviews and the descriptions.