
reviews with feature requests (e.g., by Iacob and Harrison [112] or by Maalej et al. [152]), feature extraction also allows filtering the review parts (e.g., sentences) that include the relevant information – something particularly useful as the quality and structure of the reviews are heterogeneous. Relevant, insightful information about a feature request might be in the first or in the last sentence of a review. This will also enable the clustering and aggregation of artifacts (such as comments) based on features, as well as the identification of feature duplicates in these artifacts.

Finally, feature extraction is the preliminary underlying step in many, if not all, feature recommendation and optimization scenarios [105, 178, 217].

Feature recommendations might target both users and stakeholders. For users, feature extraction can help build recommender models that learn dependencies between reported and used features. From the app usage, such models can also derive which features users are interested in. When combining features mentioned in the pages of many apps (e.g., apps within the same category), a predictor model could derive the optimal combination of features and answer questions such as “which dependencies of features get the best ratings”. This might also help to identify which features are missing in a particular app release from the “feature universe”.

6.5.2 Field of Application

SAFE has one major advantage: it is uniform, that is, it can be applied to multiple artifacts, in particular to app descriptions and app reviews. SAFE automatically identifies app features in the descriptions and maps them to the app features in the reviews. It identifies five feature sets (see the sketch after this list):

• App features mentioned in the description.

• App features mentioned in the reviews.

• App features mentioned in the description and in the reviews (intersection).

• App features mentioned only in the description (probably unpopular features).

• App features mentioned only in the reviews (probably feature requests).
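
To make the five sets concrete, the following minimal sketch expresses them as plain set operations over per-text extraction results. The `extract` callable is a hypothetical stand-in for SAFE's pattern-based extraction, not its actual interface.

```python
from typing import Callable, Iterable


def feature_sets(
    description: str,
    reviews: Iterable[str],
    extract: Callable[[str], set],
) -> dict:
    """Combine per-text feature extraction results into the five sets."""
    desc = set(extract(description))
    rev: set = set()
    for review in reviews:
        rev |= set(extract(review))
    return {
        "description": desc,                 # mentioned in the description
        "reviews": rev,                      # mentioned in the reviews
        "both": desc & rev,                  # intersection
        "description_only": desc - rev,      # probably unpopular features
        "reviews_only": rev - desc,          # probably feature requests
    }
```

A call such as `feature_sets(app_description, app_reviews, extract=safe_extract)` (with `safe_extract` being whatever extraction routine is available) would then yield all five sets in one pass.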

Another major advantage of SAFE is that it does not require training or a statistical model and works “online” on individual app pages and individual reviews. This enables processing the pages and the reviews immediately upon their creation. This also reduces the risk of overfitting a particular statistical model to certain apps, domains, or users.

Nevertheless, we think that the achieved accuracy of SAFE, even if it outperforms other research approaches, is not good enough to be applied in practice. We think that a hybrid approach (a simple, pattern- and similarity-based approach such as SAFE combined with a machine learning approach) is probably the most appropriate. For instance, machine learning can be used to pre-filter and classify reviews before applying SAFE on them (see the sketch below). Indeed, we think that this is the main reason why our accuracy values were rather moderate: we were very restrictive in the evaluation and applied SAFE on “wild, random” reviews.
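
As one illustration of this hybrid idea, the sketch below trains a simple binary classifier that pre-filters reviews before the pattern-based extraction runs on them. The labels, the training data, and the `extract` callable are assumptions for illustration, and scikit-learn is only one possible choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_prefilter(labeled_reviews, labels):
    """Train a binary classifier: does this review address a feature?

    labels: 1 = feature-related, 0 = other (ratings, praise, bug reports, ...).
    """
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(labeled_reviews, labels)
    return clf


def hybrid_extract(reviews, prefilter, extract):
    """Apply the pattern-based extraction only to reviews the classifier keeps."""
    keep = prefilter.predict(reviews)
    return {r: extract(r) for r, y in zip(reviews, keep) if y == 1}
```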

SAFE can also be extended by a model that is trained on multiple pages and multiple users. This might, e.g., be used to help users improve their vocabulary by auto-completing feature terms.

Also, stakeholders should be able to correct the extracted features, e.g., following a critique-based recommender model, or simply building a supervised classifier based on SAFE and continuously improving the classifier model based on stakeholders’ feedback (identifying false positives and true negatives). This can either be used to persist the list of actual app features or to train a classifier to learn or adjust patterns per app.
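
A minimal sketch of such a feedback loop might look as follows, assuming an incremental learner that is updated whenever stakeholders confirm or reject an extracted feature term; the vectorizer, the model, and the label scheme are illustrative choices, not part of SAFE.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stateless vectorizer, so new feedback can be folded in without refitting it.
vectorizer = HashingVectorizer(n_features=2**18)
classifier = SGDClassifier()  # linear model that supports incremental partial_fit


def update_from_feedback(candidate_terms, verdicts):
    """Fold stakeholder corrections into the model.

    candidate_terms: extracted feature terms shown to stakeholders.
    verdicts: 1 = confirmed as a real feature, 0 = rejected (false positive).
    """
    X = vectorizer.transform(candidate_terms)
    classifier.partial_fit(X, verdicts, classes=[0, 1])
```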

6.5.3 Alternative Implementations from Related Work

Our study focused on identifying the app features that either organizations describe or users address in their feedback. The feedback to requirements activity, however, covers the full spectrum of analyzing requirements-relevant user feedback. The overall idea of this activity is to take the feedback that is relevant for requirements engineering and to perform advanced analysis of this text to gain deeper insights tailored to the needs of the company using requirements intelligence. We therefore suggest the following examples of alternative implementations of the feedback to requirements activity.

Once we identified the relevant feedback for requirements engineering, we could consider performing aggregation and summarization tasks. For example, a company wants to analyze all feature requests they received since the release. That company received 1,000 app reviews that the feedback filtering activity identified as feature requests. A person needs to read through all of the feedback to get an understanding of the data, and afterwards may have to apply qualitative methods, such as a thematic analysis, to understand topics and clusters within the data. If done correctly, the analysis takes several iterations and quality measures such as peer coding. Automated approaches can support the company by automatically identifying frequently discussed topics, which is an improvement over the previous situation, as we are able to split the 1,000 reviews into smaller chunks based on semantic similarity. If we continue the example, the just-described procedure may lead to three semantically different chunks of app reviews, each having a size of about 333. To further reduce the company’s effort in understanding what their users discuss, we could apply summarization techniques [95, 100, 257].

With summarization, we could, for example, take each of the three chunks and select one representative review. Other techniques may return common keywords or try to compose a synthetic text from the whole chunk of reviews.

These techniques allow companies to get a quick overview of what their users are saying, and if they find that interesting, they can dig deeper by reading the reviews related to the topic.
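
The following sketch illustrates one possible implementation of this pipeline: cluster the filtered feature requests by textual similarity and return the review closest to each cluster centroid as its representative. TF-IDF and k-means are assumptions made for illustration; the chapter does not prescribe a particular technique.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def summarize_requests(feature_requests, n_clusters=3):
    """Group feature requests into topics and pick one representative each."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(feature_requests)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)

    representatives = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # the review closest to the cluster centroid serves as its "summary"
        sims = cosine_similarity(X[members], km.cluster_centers_[c].reshape(1, -1))
        representatives.append(feature_requests[members[int(np.argmax(sims))]])
    return representatives
```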

In a second example, we consider a company that introduced a new feature in its software and collects user feedback from Twitter. The company aims to understand the users’ opinion about that feature and how it is generally perceived.

For that, the feedback to requirements activity uses the results of this study as a first step and extracts the features addressed in the tweets. Then, we take the tweets addressing the newly integrated feature and perform sentiment analysis [1, 189, 214]. Guzman and Maalej [101] performed a study in 2014 that is closely related to this idea. By analyzing app reviews, they were able to extract fine- and coarse-grained app features and assign a sentiment score to each feature in the reviews. They suggest that the extraction and the assignment of sentiment could help get an overview of the addressed features and decide whether changes to the requirements are necessary.
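
A small sketch of this second example, assuming the feature term has already been extracted from the tweets; it uses NLTK’s VADER analyzer as one off-the-shelf option and a naive substring match to find tweets addressing the feature.

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
analyzer = SentimentIntensityAnalyzer()


def feature_sentiment(tweets, feature_term):
    """Return (tweet, compound sentiment in [-1, 1]) for tweets mentioning the feature."""
    mentioning = [t for t in tweets if feature_term.lower() in t.lower()]
    return [(t, analyzer.polarity_scores(t)["compound"]) for t in mentioning]
```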

6.5.4 Limitations and Threats to Validity

As for every study that includes manual coding, the coders in our evaluation might have made mistakes by (a) indicating wrong features in the descriptions and reviews or (b) wrongly assessing the extractions of SAFE or the two reference approaches. To mitigate this risk, we conducted careful peer coding based on a coding guide (e.g., stating what a feature is) and a uniform coding tool. Therefore, we think that the reliability and validity of the app description and feature matching evaluations – both involving peer coding – are rather high.

For the reviews, we decided to have a single coder for two reasons. First, the large number of reviews would have required more resources. Second, the coders were well trained from the first phase of coding app descriptions. In the future, manually pre-extracting review features, as we did for the features from app descriptions, would improve the reliability. As we share the data, we encourage fellow researchers to replicate and extend the evaluation.

Concerning the sampling bias [161], our evaluation relies only on Apple App Store data and apps from one category. While this helped to concentrate the effort and to learn more about the evaluation apps, it certainly limits our results’ external validity. However, it is important to mention that the SAFE patterns, the core “idea” of the approach, were identified from Google Play Store data of 100 apps from multiple categories. Since SAFE also works on Apple App Store data, we feel comfortable claiming that our approach generalizes to other apps, except for games, which, as described, we excluded on purpose.

As for the accuracy and benchmark values, we refrain from claiming that these are exact values; we rather consider them indicative. We think that the order of magnitude of the calculated precision and recall values and the differences between the approaches are significant. We carefully selected our measures to conduct a fair and unbiased comparison. For instance, during the evaluation of the extraction, the tool showed the extracted results in random order and in the same layout, without mentioning which approach extracted which features.

Finally, when replicating the reference approaches, we might have made mistakes. We tried to conduct the comparison as fairly as possible and implemented all steps described in the papers to the best of our knowledge. In some cases (e.g., the list of stop words), we rather tried to be generous and fine-tuned the reference approaches to obtain results comparable to those reported in the papers.