
We discuss the impact of our findings on the software engineering community, as well as the strengths and limitations of our approach and study.

8.5.1 Implications of the Results

Based on a survey with 607 responses from software engineers, Begel and Zimmermann [17] reported that the question “How do users typically use my application?” is the most essential and worthwhile. Our results show that we can learn app feature usage from high-level, app-independent interaction data: both by training classifiers with the data of other users of an app and by using the usage data of different apps providing similar app features.

[Figure 8.9 (violin plots): each panel shows the label distribution (0.00–1.00) for one machine learning feature and app feature pair: selectview / listen to music, clickview / share, selecttext / delete, scrollview / search, selectview / play game, clickview / manage items, clickview / earn money, notify / call, edittext / pay money, edittext / write message.]
Figure 8.9: Between-apps analysis: violin plot for the app features with the most significant machine learning features.

In the following, we focus the discussion on how this can be used to support stakeholders, including developers, managers, and analysts. We summarize three main use cases for the discussion.

Objective app usage analytics. App analytics tools do a good job in reporting the app’s “health” by analyzing written reviews on app stores or social media [151, 176, 256, 257]. These analyses usually classify the written reviews into bug reports, feature requests, or irrelevant content [39, 81, 249]. They further show the opinion of users about certain aspects of the app, such as the app features listed on app pages [79, 101]. However, these analyses rely only on the subjective experience of users, are non-representative (only those who submit feedback are represented), and often contain emotional and uninformative text [82, 187].

Based on our approach (analyzing the usage data), stakeholders can understand the actual, objective usage of an app, independently of users’ emotions. Usage data analytics (interaction data + app feature usage) enables answering questions such as: How long is my app, or particular features of it, being used? By whom? Which features are used together, and in what order? Depending on the collected usage data, more questions can be answered. If we focus only on interaction data + labeling, we can perform feature usage analytics. When adding additional context data such as location, device/hardware usage, and connectivity, we can answer questions such as: Where are our features being used? How much do they stress the hardware? When also collecting data like view names and their content, we can answer advanced questions like: On which views do our users stay longest and shortest? Where and at which point do users close the app?
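To illustrate, the following sketch shows how some of these questions could be answered on labeled usage data, assuming a simple event table with user, app_feature, start, and end columns; the column names and sample values are illustrative and not our actual data schema.

```python
# A minimal sketch of feature-usage analytics on labeled interaction data.
# The schema (user, app_feature, start, end) is assumed for illustration.
import pandas as pd

events = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2"],
    "app_feature": ["search", "share", "search", "write message"],
    "start": pd.to_datetime(["2021-01-01 10:00", "2021-01-01 10:05",
                             "2021-01-01 11:00", "2021-01-01 11:02"]),
    "end":   pd.to_datetime(["2021-01-01 10:03", "2021-01-01 10:06",
                             "2021-01-01 11:01", "2021-01-01 11:10"]),
})

# How long is each feature being used, and by how many distinct users?
events["duration_min"] = (events["end"] - events["start"]).dt.total_seconds() / 60
usage_per_feature = events.groupby("app_feature")["duration_min"].sum()
users_per_feature = events.groupby("app_feature")["user"].nunique()

# In what order are features used by each user?
order_per_user = events.sort_values("start").groupby("user")["app_feature"].apply(list)

print(usage_per_feature, users_per_feature, order_per_user, sep="\n")
```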

The consequence of performing the suggested advanced analyses is that very pervasive and privacy-intrusive data must be collected.

8.5.2 Field of Application

Data-driven release planning. Stakeholders usually keep track of open bugs, enhancements, and feature requests in issue tracking systems. For release planning, stakeholders have to decide what to develop for which release – often a complex decision [36, 215]. Understanding popular, as well as unpopular, app features can support making this decision. For instance, bugs affecting popular app features might have a higher impact on users’ satisfaction with the app [153]. Therefore, one recommendation to stakeholders could be to focus on bugs that affect popular app features, as these would negatively impact users more frequently. Similarly, feature popularity analysis can help prioritize testing, quality assurance, and documentation work.
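As a rough illustration, the sketch below ranks open issues so that bugs affecting popular app features surface first; the popularity scores, issue structure, and field names are assumed for the example.

```python
# A minimal sketch of popularity-aware issue ranking for release planning.
# Popularity shares could come from the usage analytics sketched above.
feature_popularity = {"search": 0.42, "share": 0.31, "write message": 0.27}  # assumed values

issues = [
    {"id": 101, "type": "bug", "feature": "search"},
    {"id": 102, "type": "feature request", "feature": "share"},
    {"id": 103, "type": "bug", "feature": "write message"},
]

# Rank issues: bugs first, then by popularity of the affected app feature.
ranked = sorted(
    issues,
    key=lambda i: (i["type"] == "bug", feature_popularity.get(i["feature"], 0.0)),
    reverse=True,
)
for issue in ranked:
    print(issue["id"], issue["type"], issue["feature"])
```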

Combining subjective with objective feedback. Software engineers describe their app features on the stores’ app pages, and these features are also addressed by users in their feedback and reviews [107, 118, 144]. A promising future area is the combination of objective (what users do) and subjective (what users say) feedback. Written user reviews often report bugs but rarely state important information such as the affected device, software version, and steps to reproduce [22, 187]. Our work enables mapping written reviews to usage data to provide the missing information. For instance, spontaneous feedback mentioning a certain app feature can be used to label the interaction data and create the training set. Conversely, if we precisely detect the feature the user has been using, we can, for instance, associate the star ratings (at least in part) with that feature.
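A minimal sketch of the second direction follows, assuming that each review has already been linked to the app features its author used shortly before submitting it; the linking step itself is outside the sketch.

```python
# A minimal sketch: attributing star ratings (at least in part) to app features.
# The review structure and the review-to-usage linking are assumptions.
from collections import defaultdict

reviews = [
    {"stars": 1, "features_used": ["pay money"]},
    {"stars": 5, "features_used": ["write message", "share"]},
    {"stars": 2, "features_used": ["pay money", "search"]},
]

ratings_per_feature = defaultdict(list)
for review in reviews:
    for feature in review["features_used"]:
        ratings_per_feature[feature].append(review["stars"])

# Average star rating associated with each app feature.
for feature, stars in ratings_per_feature.items():
    print(feature, sum(stars) / len(stars))
```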

This combination of information can help resolve bugs quickly and better understand the meaning of the written reviews. It may also give insights into rather uninformative reviews like “this app is trash!!”, as the usage data could reveal pain points such as non-crashing bugs, views on which the user stayed much longer than on others, or a view on which a user clicked a button several times out of frustration.
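For example, a simple heuristic over the interaction data could flag such frustration clicks; the event fields, time window, and threshold below are assumptions for illustration.

```python
# A minimal sketch: flag possible frustration, i.e., several clicks on the same
# button of the same view within a short time window. Event fields are assumed.
from datetime import datetime, timedelta

clicks = [
    {"view": "checkout", "widget": "btn_pay", "time": datetime(2021, 1, 1, 10, 0, 0)},
    {"view": "checkout", "widget": "btn_pay", "time": datetime(2021, 1, 1, 10, 0, 2)},
    {"view": "checkout", "widget": "btn_pay", "time": datetime(2021, 1, 1, 10, 0, 3)},
]

WINDOW = timedelta(seconds=5)
MIN_CLICKS = 3

def frustration_clicks(events):
    """Return (view, widget) pairs clicked at least MIN_CLICKS times within WINDOW."""
    flagged = set()
    events = sorted(events, key=lambda e: e["time"])
    for i, first in enumerate(events):
        same = [e for e in events[i:]
                if e["view"] == first["view"]
                and e["widget"] == first["widget"]
                and e["time"] - first["time"] <= WINDOW]
        if len(same) >= MIN_CLICKS:
            flagged.add((first["view"], first["widget"]))
    return flagged

print(frustration_clicks(clicks))
```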

8.5.3 Alternative Implementations from Related Work

Our study aimed at identifying app feature usage, which helps, e.g., in understanding how users use app features. However, the feedback-to-requirements activity can also be implemented for different purposes. In the following, we summarize potential alternative implementations.

In the domain of mobile apps, one of the most important success factors is the user interface of the app. User interfaces designed in an intuitive fashion reduce the risk of frustrated users, who, in the end, might uninstall the app based on their negative experience. If we want to understand how users are using the UI in terms of navigation paths, time spent on a view, or the number of performed actions, we can leverage user interaction data [55]. Deka, Huang, and Kumar [55] developed ERICA, an approach that mines user interaction data of mobile devices and, e.g., automatically detects UI changes and records screenshots to create interaction traces. With these interaction traces, we can identify weaknesses in the UI, like unintended navigation paths, or gain insights into how to optimize the existing and intended navigation paths. Using interaction data for UI performance testing is helpful if we want to scale the tests. In particular, A/B testing, which aims at testing different versions with users, is time- and cost-intensive, as a substantial number of users must be found to gain meaningful insights from the tests. With user interaction mining, we can scale this process to potentially all users of an app [54]. One of the next steps could be a data-driven design approach [53].
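As a small illustration, navigation paths can be summarized by counting view-to-view transitions in an interaction trace; the trace format below is assumed and not the one used by ERICA.

```python
# A minimal sketch of navigation-path mining: count view-to-view transitions
# in a single interaction trace (a plain list of visited view names, assumed).
from collections import Counter

trace = ["Home", "Search", "Results", "Details", "Search", "Results", "Home"]

transitions = Counter(zip(trace, trace[1:]))
for (src, dst), count in transitions.most_common():
    print(f"{src} -> {dst}: {count}")
```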

Data-driven design provides a database with example designs that helps decision-makers find a design that can be used in their product. We can use the database either to identify best practices or to get inspiration. With data-driven models, we can understand whether a design may be successful in achieving the specified goals [53]. In their paper, Deka et al. introduce an approach that leverages user interaction data and UI databases to enable applications like design search, UI layout generation, UI code generation, user interaction modeling, and user perception prediction. However, as Liu et al. [146] point out, these approaches are rather black boxes that do not expose the design semantics of the UI. They therefore performed interactive coding to create a lexical database for 25 different types of UIs that allows us to understand what each element on the UI means. In combination with user interaction data, we can not only understand the semantics of the UI elements but also know how they are used. That kind of analysis provides a rich source of information for the requirements engineering process, which we can use to test different UIs, optimize navigation flows, identify navigation flaws, and understand what the user is doing in the app.

8.5.4 Limitations and Threats to Validity

We employed supervised machine learning for identifying app feature usage, which depends on labeled data for training. Machine learning can only be as good as these labels: wrong labels, e.g., fake or accidentally wrongly selected app features in the label dialog, introduce bias and noise into the algorithm. We mitigated this threat by 1) carefully selecting the study participants in a pre-study and 2) limiting the number of labels participants can assign during the field study (see Section 8.2.3). These measures, together with the fact that our data includes many popular apps with reasonable labels, give us confidence in the collected data.
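The following sketch shows the general shape of such a supervised setup, with interaction events aggregated into simple event-type counts per labeled window; the feature encoding, classifier, and sample data are illustrative and not the exact configuration used in our study.

```python
# A minimal sketch of supervised app-feature identification, assuming each
# labeled window is encoded as counts of interaction event types.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Rows: counts of [clickview, selectview, scrollview, edittext, notify] per window.
X = [
    [5, 1, 0, 0, 0],   # labeled "share"
    [0, 4, 1, 0, 0],   # labeled "listen to music"
    [1, 0, 0, 6, 0],   # labeled "write message"
    [4, 2, 0, 0, 0],
    [0, 5, 2, 0, 0],
    [0, 1, 0, 5, 0],
]
y = ["share", "listen to music", "write message",
     "share", "listen to music", "write message"]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=2)  # noisy labels directly lower these scores
print(scores.mean())
```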

The generalizability of this approach is another threat to the validity of our results, as we rely on data from 55 participants. However, we chose to perform crowdsourcing instead of, e.g., student-based experiments to ensure real mobile phone usage, and we selected participants from 12 countries to increase participants’ diversity.

Since app features are named in diverse ways, even though they are semantically or even technically the same, we rely on unifying these app features manually. This manual process is prone to bias, as its correctness depends on our understanding of the app features. We address this potential bias with two steps. First, one of the authors, with more than eight years of experience in app development, performed and documented the first iteration of the unification. Second, two more authors checked the documented list of the unified app features and discussed the disagreements, leading to the final list.
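Conceptually, this unification step boils down to a manually curated mapping from diverse feature names to canonical ones, as sketched below; the mapping entries are made up for illustration and are not our actual unified list.

```python
# A minimal sketch of manual feature-name unification via a curated mapping.
# The entries below are illustrative examples, not the study's final list.
UNIFICATION = {
    "send message": "write message",
    "compose message": "write message",
    "play music": "listen to music",
    "play song": "listen to music",
}

def unify(feature_name: str) -> str:
    """Map a raw feature name to its canonical name; unknown names pass through."""
    key = feature_name.lower().strip()
    return UNIFICATION.get(key, key)

print(unify("Compose Message"))  # -> "write message"
```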