
We report on the within-apps analysis, which automatically identifies app feature usage for a particular app. We run our experiments on the top 30 apps (the most labeled) and report the results in Table 8.2. Table 8.2 shows only the results of the nine apps for which we have enough labeled data to perform classification benchmarks for at least three app features. We require at least three app features because otherwise we could not show whether we can distinguish between different features of an app. The table shows, for each app, the total number of app features it encompasses (n_features), the app features for which we have enough labels to perform 10-fold cross-validation, the best working machine learning algorithm (estimator), the evaluation metrics, as well as the number of labels used for training and evaluating the classifier. For example, Facebook counts a total of 13 app features in our dataset, out of which we can identify the usage of seven. For all seven app features, balancing the dataset (n_true vs. n_false) leads to the best performing results. One additional observation is that having many labels does not necessarily lead to better results; compare, for example, Facebook "browse newsfeed" with Facebook "get notification".
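As an illustration of this setup, the following sketch (not our exact experimental pipeline) benchmarks a few of the estimators from Table 8.2 with 10-fold cross-validation on a balanced dataset. The feature matrix X of per-session interaction-event counts and the binary labels y for one app feature are assumed inputs.

```python
# Illustrative sketch of the within-apps benchmark: balance the labels for one
# app feature and evaluate several classifiers with 10-fold cross-validation.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC


def balance(X, y, seed=42):
    """Randomly undersample the majority class so that n_true == n_false."""
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    n = min(len(pos), len(neg))
    idx = np.concatenate([rng.choice(pos, n, replace=False),
                          rng.choice(neg, n, replace=False)])
    return X[idx], y[idx]


def benchmark(X, y):
    """Report mean precision, recall, F1, and ROC AUC per estimator."""
    estimators = {
        "Naive Bayes": GaussianNB(),
        "Linear SVC": LinearSVC(),
        "Random Forest": RandomForestClassifier(),
        "Gradient Boosting": GradientBoostingClassifier(),
    }
    Xb, yb = balance(np.asarray(X), np.asarray(y))
    for name, clf in estimators.items():
        scores = cross_validate(clf, Xb, yb, cv=10,
                                scoring=("precision", "recall", "f1", "roc_auc"))
        print(name, {m: round(scores[f"test_{m}"].mean(), 2)
                     for m in ("precision", "recall", "f1", "roc_auc")})
```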

Table 8.2: Within-apps analysis: machine learning results for app feature usage identification.

| App Name | n_features | App Feature | Estimator | Precision | Recall | F1 | ROC AUC | n_true | n_false |
|---|---|---|---|---|---|---|---|---|---|
| Facebook | 13 | browse newsfeed | Gradient Boosting | 0.85 | 0.92 | 0.88 | 0.85 | 12 | 12 |
| | | read post | Voting Classifier | 0.63 | 0.88 | 0.74 | 0.68 | 49 | 49 |
| | | view event | Naive Bayes | 0.55 | 0.83 | 0.66 | 0.53 | 109 | 109 |
| | | share media | Ada Boost | 0.62 | 0.70 | 0.65 | 0.60 | 53 | 53 |
| | | get notification | Linear SVC | 0.51 | 0.90 | 0.65 | 0.47 | 143 | 143 |
| | | like post | Linear SVC | 0.49 | 0.89 | 0.63 | 0.46 | 94 | 94 |
| | | write comment | Random Forest | 0.61 | 0.55 | 0.58 | 0.61 | 40 | 40 |
| Fb Messenger | 11 | send message | Voting Classifier | 0.79 | 1.00 | 0.88 | 0.61 | 252 | 69 |
| | | play game | Voting Classifier | 0.75 | 0.86 | 0.80 | 0.71 | 14 | 14 |
| | | read message | Voting Classifier | 0.62 | 0.88 | 0.73 | 0.55 | 17 | 17 |
| | | view media | Voting Classifier | 0.64 | 0.69 | 0.67 | 0.65 | 36 | 36 |
| Gmail | 8 | read mail | Linear SVC | 0.68 | 1.00 | 0.81 | 0.51 | 354 | 170 |
| | | send mail | Ada Boost | 0.65 | 0.65 | 0.65 | 0.66 | 69 | 69 |
| | | delete mail | Linear SVC | 0.50 | 0.92 | 0.65 | 0.58 | 12 | 12 |
| | | search mail | Linear SVC | 0.60 | 0.67 | 0.63 | 0.43 | 48 | 48 |
| | | get mails | Naive Bayes | 0.50 | 0.81 | 0.62 | 0.54 | 160 | 160 |
| Google Drive | 4 | edit file | Linear SVC | 0.56 | 1.00 | 0.71 | 0.51 | 20 | 16 |
| | | view file | Linear SVC | 0.47 | 0.90 | 0.62 | 0.54 | 10 | 10 |
| | | save file | Naive Bayes | 0.39 | 1.00 | 0.56 | 0.83 | 7 | 29 |
| Instagram | 11 | like photo | Voting Classifier | 0.85 | 0.81 | 0.83 | 0.85 | 21 | 21 |
| | | watch story | Random Forest | 0.65 | 0.68 | 0.67 | 0.71 | 19 | 19 |
| | | browse timeline | Linear SVC | 0.51 | 0.94 | 0.66 | 0.45 | 33 | 33 |
| Snapchat | 8 | send message | Naive Bayes | 0.58 | 0.95 | 0.72 | 0.70 | 65 | 65 |
| | | read message | Naive Bayes | 0.56 | 0.98 | 0.71 | 0.70 | 41 | 41 |
| | | save message | Ada Boost | 0.67 | 0.73 | 0.70 | 0.65 | 11 | 11 |
| | | watch story | Linear SVC | 0.53 | 0.75 | 0.62 | 0.37 | 12 | 12 |
| | | capture photo | Voting Classifier | 0.73 | 0.50 | 0.59 | 0.64 | 22 | 22 |
| Twitter | 6 | view feed | Voting Classifier | 0.55 | 0.85 | 0.67 | 0.46 | 13 | 13 |
| | | read message/tweet | Decision Tree | 0.65 | 0.66 | 0.65 | 0.65 | 73 | 73 |
| | | get live | Linear SVC | 0.53 | 0.64 | 0.58 | 0.30 | 50 | 50 |
| Yahoo Mail | 4 | send mail | Random Forest | 0.94 | 1.00 | 0.97 | 0.97 | 16 | 31 |
| | | read mail | Decision Tree | 0.86 | 0.90 | 0.88 | 0.88 | 20 | 20 |
| | | get notifications | Naive Bayes | 0.65 | 1.00 | 0.79 | 0.60 | 11 | 11 |
| YouTube | 12 | watch video | Linear SVC | 0.55 | 1.00 | 0.71 | 0.47 | 131 | 108 |
| | | view latest videos | Naive Bayes | 0.58 | 0.83 | 0.68 | 0.57 | 18 | 18 |
| | | browse recommendations | Naive Bayes | 0.57 | 0.80 | 0.67 | 0.60 | 20 | 20 |
| | | rate video | Decision Tree | 0.57 | 0.62 | 0.59 | 0.57 | 21 | 21 |
| Mean | | | | 0.62 | 0.83 | 0.70 | 0.61 | 56.65 | 47.00 |


For the machine learning algorithms, it matters more that the usage of two app features is as diverse as possible. If a user is browsing through their Facebook newsfeed, the interaction event scroll view may be more frequent and therefore more important than the interaction event edit text or any other event.

One finding of our experiments is that using an ensemble classifier, such as Random Forest, which combines several Decision Trees to reach a decision, does not necessarily lead to better results. For the app Gmail, the best performing (F1) app feature is "read mail". That app feature achieves an F1 score of 0.81 and is classified with Linear SVC. Generally speaking, ensemble classifiers performed best in 15 out of 37 experiments. On average, we can identify app feature usage for 53% of all app features in Table 8.2.

However, the interpretation of the F1 score must be made with caution, as it might be misleading. Therefore, we also include the ROC AUC, a measure that compares our results to a baseline and helps to interpret how much better the models are than random models. In the following, we discuss three distinct cases that foster the interpretation of our results when considering F1 and ROC AUC.

Case 1: One example of a model with good classification results is Facebook with the app feature "browse newsfeed". It has a promising F1 score of 0.88, which is an indicator of a well-performing model. The ROC AUC score of 0.85 is very similar and confirms that the model achieves much better results than a random classifier (ROC AUC = 0.5).

Case 2: Gmail, on the other hand, has a misleading F1 score of 0.81 for the app feature "read mail". Its ROC AUC value of 0.51 reveals that the model performs similarly to a random model. When taking precision and recall into account, we find that the recall of 1.0 is the indicator for the misleading F1 score. A recall of 1.0 means that the model simply classifies all instances as "read mail", even though the n_false column shows that there are 170 labels for the negative case ("not reading mail").
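This effect can be reproduced with a short sanity check (a sketch on the label counts from Table 8.2, not our trained model): predicting "read mail" for every instance on a 354/170 split already yields an F1 score of about 0.81, while the ROC AUC stays at the random baseline of 0.5.

```python
# Degenerate "always read mail" classifier on Gmail's 354/170 label split:
# high recall inflates F1, but ROC AUC exposes the model as random.
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

y_true = np.array([1] * 354 + [0] * 170)  # n_true / n_false from Table 8.2
y_pred = np.ones_like(y_true)             # predict the positive class for everything

print(precision_score(y_true, y_pred))    # ~0.68
print(recall_score(y_true, y_pred))       # 1.0
print(f1_score(y_true, y_pred))           # ~0.81
print(roc_auc_score(y_true, y_pred))      # 0.5, i.e. no better than random
```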

Case 3: Twitter's app feature "get live" has a rather low F1 score of 0.58 (precision 0.53, recall 0.64). The ROC AUC value of 0.30 indicates that the model performs worse than a random classifier, which has a ROC AUC of 0.50.

However, since we employed binary classifiers, a ROC AUC value below the random baseline indicates that the classification predictions could simply be inverted, resulting in a ROC AUC of 0.70.
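A minimal sketch of this inversion argument, using synthetic scores rather than our data: negating the decision scores of a binary classifier mirrors its ROC curve, so an AUC of a becomes 1 - a.

```python
# For binary classification, AUC(scores) + AUC(-scores) = 1 (absent ties),
# so a model with ROC AUC 0.30 can be inverted into one with ROC AUC 0.70.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
scores = rng.random(200) - 0.4 * y_true        # deliberately worse-than-random scores

auc = roc_auc_score(y_true, scores)            # below 0.5
auc_inverted = roc_auc_score(y_true, -scores)  # mirrored curve
assert abs(auc + auc_inverted - 1.0) < 1e-9
```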

As a consequence, we conclude that interpreting classification results should be done carefully by considering not only the F1 score but also the ROC AUC value.

Figure 8.6 shows how many of the nine apps in our experiment reach a certain threshold of the mean F1 and ROC AUC score across the app features we are able to classify. For example, it shows that all nine apps reach the threshold of a mean F1 score of at least 0.60. In contrast, the mean ROC AUC values show that three apps do not achieve a mean value of 0.60. For the three apps Instagram, Fb Messenger, and Yahoo Mail, the mean F1 scores of their app features are at least 0.70, while for Yahoo Mail, the mean F1 score of all its app features is at least 0.80. The colors on the heat map show, for each app, how many of its app features we can successfully classify.

Figure 8.6: Within-apps analysis: the ratio of app features for which we reach a certain mean F1/ROC AUC threshold.

For example, our models can successfully classify half of Twitter's app features with at least a mean F1 score of 0.60 (Table 8.2 shows that Twitter has six app features in our dataset, out of which we can classify three). Figure 8.7 focuses on the max F1 and ROC AUC score we reach for classifying app features.

Figure 8.7: Within-apps analysis: ratio of app features for which we reach a certain max F1/ROC AUC threshold.

The figure provides a better overview of the best performing cases of Table 8.2. It reveals that for Yahoo Mail, we can achieve a max F1 score above 0.90. Further, we can report that for five out of the nine apps, we have at least one app feature for which we achieve a high F1 score of at least 0.80. Similarly to the F1 score of Yahoo Mail, we achieve a ROC AUC score of at least 0.90 for one app feature.
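The per-app aggregates behind Figures 8.6 and 8.7 can be derived directly from per-feature results such as those in Table 8.2. The following sketch shows the idea with a few illustrative rows; the DataFrame contents and the 0.60 threshold check are examples, not the full dataset or the exact plotting code.

```python
# Sketch: aggregate per-feature scores into the per-app mean/max values used
# for the F1/ROC AUC threshold heat maps.
import pandas as pd

results = pd.DataFrame(
    [("Yahoo Mail", "send mail", 0.97, 0.97),
     ("Yahoo Mail", "read mail", 0.88, 0.88),
     ("Twitter", "view feed", 0.67, 0.46),
     ("Twitter", "read message/tweet", 0.65, 0.65),
     ("Twitter", "get live", 0.58, 0.30)],
    columns=["app", "app_feature", "f1", "roc_auc"])

per_app = results.groupby("app").agg(
    mean_f1=("f1", "mean"), max_f1=("f1", "max"),
    mean_roc_auc=("roc_auc", "mean"), max_roc_auc=("roc_auc", "max"))

# One heat-map cell: does an app's mean F1 reach the 0.60 threshold?
print(per_app["mean_f1"] >= 0.60)
```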

Table 8.3: Within-apps analysis: machine learning feature importance reporting χ2 (chi-squared test).

| App Name | App Feature | Click View | Notific. | Edit Text | Sel. Text | Scroll View | Ch. Content | Foc. View | Ch. Wind. | Sel. View |
|---|---|---|---|---|---|---|---|---|---|---|
| Facebook | share media | 0.16 | 0.29 | 2.17 | 0.92 | 0.06 | 0.07 | 0.01 | 0.38 | 0.10 |
| Gmail | send mail | 0.38 | 10.19 | 0.01 | 8.49 | 0.00 | 0.27 | 0.24 | 0.48 | 0.14 |
| Fb Messenger | send message | 0.05 | 0.06 | 2.26 | 2.49 | 1.26 | 1.05 | 0.01 | 0.83 | 0.01 |
| Snapchat | capture photo | 0.20 | 0.12 | 0.28 | 0.22 | 0.17 | 0.00 | 0.49 | 9.55 | 0.05 |
| Google Drive | edit file | 0.03 | 1.06 | 0.02 | 0.02 | 0.02 | 0.02 | 0.01 | 0.23 | 0.00 |
| YouTube | rate video | 0.65 | 0.08 | 0.00 | 0.00 | 0.11 | 0.02 | 0.03 | 0.59 | 0.02 |
| Instagram | send post | 0.00 | 0.01 | 1.29 | 1.32 | 0.16 | 0.13 | 0.00 | 0.31 | 0.00 |
| Twitter | read message | 0.01 | 0.00 | 0.20 | 0.00 | 0.09 | 0.00 | 0.01 | 1.35 | 0.08 |
| Yahoo Mail | send mail | 0.15 | 0.14 | 0.14 | 1.35 | 0.13 | 0.10 | 0.05 | 3.02 | 0.02 |

Machine Learning Feature Significance

We analyze the machine learning feature significance quantitatively and qualitatively to gain insights into our trained machine learning models. Table 8.3 shows the significance (impact) of the machine learning features in our classification experiments based on the χ2 scores [193]. The table lists the nine apps which we also used for our classification experiments. For each app, the table shows the app feature that yields the highest χ2 score. It is important to note that a high significance does not mean that this machine learning feature had the highest occurrence. Instead, it means that it carries crucial information for classifying the app feature. A high χ2 score could also mean that a small value of a certain feature is informative.
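As a sketch of how such scores can be computed (assuming, as above, a matrix X of min-max scaled interaction-event counts per usage session and binary labels y; the event names follow the table header):

```python
# Chi-squared significance of each interaction-event feature for one app feature,
# analogous to the scores reported in Table 8.3.
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

EVENTS = ["click view", "notification", "edit text", "select text", "scroll view",
          "change content", "focus view", "change window", "select view"]


def chi2_significance(X, y):
    """Return a {event name: chi2 score} mapping; chi2 requires non-negative input."""
    X_scaled = MinMaxScaler().fit_transform(X)
    scores, _p_values = chi2(X_scaled, y)
    return {name: round(score, 2) for name, score in zip(EVENTS, scores)}
```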

Figure 8.8 shows the feature distribution of the most significant features. For example, the table reveals that the feature edit text is the most significant for classifying the app feature "share media" in the Facebook app (2.17). We can see in the violin plot that users trigger the edit text event more often when they "share media". In the Gmail app, the feature notification reaches the highest χ2 score. It is highly significant for classifying the app feature "send mail", with a χ2 score of 10.19. This could be the case because the notification event is triggered more frequently through the toast message that indicates that the mail has been sent successfully. For the app Snapchat, the change window event is highly significant for classifying the "capture photo" feature. According to the violin plot, the change window event occurs more frequently when users capture a photo. We think that the switch between Snapchat and the camera app could cause this imbalanced number of change window events. We conclude that the significance scores and the distribution of the machine learning features can be aligned with an intuitive human understanding.
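A single panel of Figure 8.8 can be reproduced with a violin plot of the scaled event counts grouped by label. This is a sketch with assumed column names (a 0/1 label column and one column per interaction event), not the original plotting code.

```python
# Sketch of one Figure 8.8 panel: distribution of a scaled interaction-event
# count, split by the binary app-feature label.
import matplotlib.pyplot as plt
import seaborn as sns


def plot_panel(sessions, event_column, title):
    """sessions: DataFrame with a 0/1 'label' column and scaled event counts."""
    ax = sns.violinplot(data=sessions, x="label", y=event_column, cut=0)
    ax.set_title(title)
    plt.show()

# e.g. plot_panel(facebook_sessions, "edit_text", "Facebook, share media")
```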

[Figure 8.8: nine violin plots, one per app, showing the scaled distribution (0.00 to 1.00) of the most significant machine learning feature split by label (0/1): Facebook, share media (edit text); Gmail, send mail (notification); Fb Messenger, send message (select text); Snapchat, capture photo (change window); Google Drive, edit file (notification); YouTube, rate video (click view); Instagram, send post (select text); Twitter, read message (change window); Yahoo Mail, send mail (change window).]

Figure 8.8: Within-apps analysis: violin plot for the apps with the most significant machine learning features.