
Table 7.1: Overview of the study participants.

| # | Gender | Age | Country | Position | Affiliation | Experience |
|-----|--------|-------|-------------|--------------------------|------------------------------|----------|
| P1 | M | 25-34 | Germany | Researcher | University | 2 years |
| P2 | F | 25-34 | Germany | Researcher | University | 3 years |
| P3 | M | 25-34 | Germany | Researcher | University | 2 years |
| P4 | M | 25-34 | Germany | App Developer | Health App Company | 4 years |
| P5 | F | 35-44 | Germany | Lead Engineer | Telecommunication Company | 10 years |
| P6 | F | 25-34 | Austria | Researcher | Research Institute | 1 year |
| P7 | M | 25-34 | Austria | Researcher | Research Institute | 4 years |
| P8 | M | 25-34 | Austria | Researcher | Research Institute | 5 years |
| P9 | F | 25-34 | Spain | Researcher | University | 3 years |
| P10 | M | 35-44 | Spain | Associate Professor | University | 1 year |
| P11 | M | 35-44 | Spain | Software Engineer | Information Security Company | 12 years |
| P12 | M | 35-44 | Spain | Chief Product Officer | Information Security Company | 12 years |
| P13 | M | 65-74 | Spain | Project Manager | Information Security Company | 24 years |
| P14 | M | 35-44 | Spain | Chief Operating Officer | Information Security Company | 5 years |
| P15 | F | 55-64 | Italy | Project Manager | Telecommunication Company | 20 years |
| P16 | M | 55-64 | Italy | Project Manager | Telecommunication Company | 20 years |
| P17 | F | 25-34 | Tunisia | Software Engineer Intern | ERP & Health App Company | 2 years |
| P18 | M | 25-34 | Switzerland | Student Assistant | University | 1 year |

Of the collected data, 80% belongs to private sessions and 20% to professional sessions. On average, each participant had 540 sessions, with a standard deviation of 475.

A classifier with high precision but low recall, or vice versa, could harm an application that relies on our approach. Therefore, we also report the F1-score, which takes both values into account.
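For reference, the F1-score is the harmonic mean of the two:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$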

Table 7.2: Mean accuracy of the evaluated classifiers for all participants using the full feature set with 10-fold cross-validation.

| Classifier | Usage Type | Precision | Recall | F1 |
|----------------|--------------|-----------|--------|-----------|
| Decision Tree | Private | 0.978 | 0.987 | **0.983** |
| Decision Tree | Professional | 0.951 | 0.873 | **0.908** |
| Decision Table | Private | 0.952 | 0.985 | 0.968 |
| Decision Table | Professional | 0.902 | 0.734 | 0.797 |
| Naive Bayes | Private | 0.919 | 0.880 | 0.898 |
| Naive Bayes | Professional | 0.581 | 0.735 | 0.626 |
| LibSVM | Private | 0.946 | 0.949 | 0.946 |
| LibSVM | Professional | 0.752 | 0.591 | 0.639 |

7.3.1 Classification Benchmark

The first research question, RQ7.1, aims to study the extent to which different classifiers can determine whether the device usage is private or professional. Table 7.2 shows the results of the classification using 10-fold cross-validation with all context data described in Table 2.3 as features. The values shown are means across all participants; the detailed per-participant benchmarks are attached in the appendix. The table highlights the highest F1-score across the classifiers in bold font. We report the precision, recall, and F1-score of a binary classifier deciding whether the current device usage is private or professional. We present these metrics for both classes to avoid reporting an accuracy paradox [248] caused by the previously discussed data imbalance. In the full feature set, we used all the collected context data to see how accurately private and professional usage can potentially be classified.
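The chapter does not list the benchmark code itself; the following is a minimal Java sketch of such a run with the Weka API. The ARFF file name is a placeholder, the class label is assumed to be the last attribute, and J48 is assumed to be the Decision Tree implementation behind Table 7.2:

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class UsageTypeBenchmark {

    public static void main(String[] args) throws Exception {
        // Load one participant's labelled sessions (file name is a placeholder).
        Instances data = new DataSource("participant_sessions.arff").getDataSet();
        // Assumes the private/professional label is the last attribute.
        data.setClassIndex(data.numAttributes() - 1);

        // J48 is Weka's C4.5 decision tree; swap in DecisionTable, NaiveBayes,
        // or LibSVM to reproduce the other rows of Table 7.2.
        Classifier classifier = new J48();

        // Stratified 10-fold cross-validation on the full feature set.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(classifier, data, 10, new Random(1));

        // Report per-class metrics to avoid the accuracy paradox caused by
        // the imbalance between private and professional sessions.
        for (int c = 0; c < data.numClasses(); c++) {
            System.out.printf("%s: precision=%.3f recall=%.3f F1=%.3f%n",
                    data.classAttribute().value(c),
                    eval.precision(c), eval.recall(c), eval.fMeasure(c));
        }
    }
}
```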

When comparing the classifiers in Table 7.2, we found that all classifiers perform better than random classification. The Decision Tree classifier achieves the best results for both usage types and across all participants, with mean F1-scores above 0.980 for private and above 0.900 for professional usage.

Table 7.2 reveals that the Decision Table classification yields similar results to the Decision Tree for classifying private usage, but is about 0.11 worse at detecting professional usage. Appendix A.1 lists the complete benchmark results for each participant. It shows that, e.g., the Decision Table classifier's minimum and maximum F1-scores are 0.197-0.954 for professional and 0.895-0.997 for private usage, while the Decision Tree classifier's F1-scores range from 0.738-0.995 for professional and 0.944-0.999 for private usage. The Naive Bayes and SVM classifiers seem to have difficulties classifying professional sessions, with a comparably low and wide range of F1-scores. We were able to increase the accuracy of the Naive Bayes classifier during our experiments by using the supervised discretization of Weka [103]. As shown by Dougherty et al. [62], the discretization of continuous features can improve the overall accuracy. In our case, we could improve the accuracy of the Naive Bayes classifier by up to 15% for professional usage. Nevertheless, as shown in Table 7.2, the accuracy of Naive Bayes still falls behind the other classifiers. The mean values show that all classifiers classify private sessions with high precision, recall, and F1-score, with values greater than 0.890. The mean F1-scores of all classifiers range from 0.898-0.983 for private usage and from 0.626-0.908 for professional usage. When looking at individual participants, we found that balanced datasets achieved the best classification results.
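A minimal sketch of this discretization step, assuming Weka's FilteredClassifier is used so that the supervised Discretize filter is fitted only on the training folds:

```java
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.filters.supervised.attribute.Discretize;

public class DiscretizedNaiveBayes {

    // Wraps Naive Bayes with Weka's supervised discretization. Inside a
    // FilteredClassifier the bin boundaries are learned per training fold,
    // so no label information leaks from the test folds during
    // cross-validation.
    public static Classifier build() {
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new Discretize()); // supervised, MDL-based binning by default
        fc.setClassifier(new NaiveBayes());
        return fc;
    }
}
```

The returned classifier can replace the plain NaiveBayes instance in the benchmark sketch above.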

7.3.2 Feature Set Minimization

The second research question, RQ7.2, aims to investigate which and how many features we need for an accurate classification. The reason for reducing the number of features is twofold. First, we want to minimize the collection of sensitive and private data, as this can increase the trust in such a system [145]. Sheth and Maalej [223] found that users' privacy concerns center on data breaches and data sharing. Van der Sype and Maalej argue from a legal perspective that collected data should always be proportionate to the original purpose of a system [243].

Second, we want to reduce resource usage, as this approach should run solely on the device to avoid potential data leakage. In the following, we first explain the process of minimizing the number of features and then report the results of the second classification benchmark.

First, we executed the default feature selection algorithm of Weka (the attribute evaluator CfsSubsetEval [102] with the search method BestFirst). We ran the feature selection separately for each participant and looked at the most informative features that the algorithm calculated for at least three participants. We then made different combinations of these features, using as few features as possible, to check which combination achieved the best result. Table 7.3 illustrates the resulting benchmark.
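The chapter does not include the selection code; the following is a minimal Java sketch of this step with Weka's attribute selection API, assuming one ARFF file per participant with the usage-type label as the last attribute:

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FeatureSelectionStep {

    public static void main(String[] args) throws Exception {
        // One participant's dataset (file name is a placeholder).
        Instances data = new DataSource("participant_sessions.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Weka's default setup: CFS subset evaluator with best-first search.
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());
        selector.setSearch(new BestFirst());
        selector.SelectAttributes(data);

        // Prints the features CFS deems most informative for this
        // participant (the class attribute is included in the indices);
        // results are then aggregated across participants.
        for (int index : selector.selectedAttributes()) {
            System.out.println(data.attribute(index).name());
        }
    }
}
```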
Table 7.3: Mean accuracy of the evaluated classifiers for all participants using the minimized feature set (app, day of the week, time of the day, Wi-Fi encryption, and number of available Wi-Fi networks) with 10-fold cross-validation.

| Classifier | Usage Type | Precision | Recall | F1 |
|----------------|--------------|-----------|--------|-----------|
| Decision Tree | Private | 0.977 | 0.987 | **0.982** |
| Decision Tree | Professional | 0.950 | 0.864 | **0.901** |
| Decision Table | Private | 0.952 | 0.982 | 0.967 |
| Decision Table | Professional | 0.890 | 0.735 | 0.792 |
| Naive Bayes | Private | 0.920 | 0.920 | 0.920 |
| Naive Bayes | Professional | 0.685 | 0.655 | 0.661 |
| LibSVM | Private | 0.957 | 0.959 | 0.957 |
| LibSVM | Professional | 0.920 | 0.681 | 0.739 |

The final features of this process are the app, the day of the week, the time of the day, the number of available Wi-Fi networks, and the Wi-Fi encryption. The most informative feature is the app (for 9 participants). The second most informative feature is the hour of the day, which the feature selection calculated for 8 participants. This process eliminated the privacy-intrusive features location and interaction data. We assume that the location clusters have no high impact because we found that private sessions in particular occur independently of the location. On the other hand, information about the Wi-Fi connection such as the BSSID, the number of available Wi-Fi networks, and the Wi-Fi encryption could also indicate a rough location, because these features do not frequently change in common places like home and the office. Touch interactions also do not improve the classification accuracy; this information might be too detailed because, e.g., touch events on specific elements like buttons depend on the app used. The table in Appendix A.1 shows the complete benchmark results for every participant.
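For illustration, a minimal sketch of how such a minimized set can be materialized with Weka's Remove filter; the attribute positions are placeholders, as the real indices depend on the layout of the collected dataset:

```java
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class MinimizedFeatureSet {

    // Keeps only the five features of the minimized set plus the class
    // label and drops everything else, including location and touch
    // interaction data.
    public static Instances reduce(Instances data) throws Exception {
        Remove keepOnly = new Remove();
        // 0-based positions of app, day of week, time of day, number of
        // available Wi-Fi networks, and Wi-Fi encryption (placeholders).
        keepOnly.setAttributeIndicesArray(new int[] {0, 1, 2, 3, 4, data.classIndex()});
        keepOnly.setInvertSelection(true); // keep the listed attributes, drop the rest
        keepOnly.setInputFormat(data);
        return Filter.useFilter(data, keepOnly);
    }
}
```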

In the following, we compare the classifiers and their results for the minimized feature set described in Table 7.3. With the minimized feature set, we were able to get close to the results of the full feature set. Again, the Decision Tree classifier performed best. For the Decision Tree and the Decision Table classifiers, the mean F1-score differs by less than 1% compared to the full feature set. For professional usage, the Decision Tree outperforms the other classifiers, with a mean F1-score of 0.901 while the second-highest mean is 0.792.

Looking at the mean values of Naive Bayes and LibSVM, we can see that, compared to the full feature set, the accuracy increased with the minimized feature set. In this case, we assume that the variety of features negatively influenced the classification. The Naive Bayes classifier's performance improved for private usage by 0.022 in F1-score, which was due to the 0.040 rise in recall.

For professional usage, the Naive Bayes classifier improved its mean F1-score by 0.035; the precision increased by 0.104 while the recall decreased by 0.080. The LibSVM classifier improved its mean F1-score by 0.011 for private usage and by 0.100 for professional usage.

From these results, we conclude that it is possible to focus on the most informative features and remove the privacy-intrusive ones while keeping a high overall accuracy.

Concluding the Within-User Analysis section, we found that the classification of private and professional usage yields promising results. Further, our feature selection highlighted that reducing the features for the classification can still achieve an accuracy above 90%. We were able to remove the two privacy-intrusive features, location and interaction data, because other features like the package name or the time of the day have a high information gain.