
stay informed and understand how their app generally performs. Depending on the role of the stakeholders, they either need a general overview like a dashboard or more details like the actual user feedback and the features users address.

The term integration in the integrated interactive visualization stands for integrating and combining both types of user feedback in one single place. It presents a dashboard with descriptive analytics showcasing, for example, when and how often users provide feedback. Besides dedicated views for the feedback types, the visualization combines both by, e.g., enriching explicit user feedback with context data like the app and operating system version. The underlying machine learning models of the other activities and the combination of both feedback types allow visualizations for all four types of analytics. We address the stakeholders’ need for a single point of access to different feedback types from different platforms, available on the web.

The term interactive in the integrated interactive visualization stands for allowing stakeholders to interact with the presented information. The visualization allows stakeholders to, e.g., select filter mechanisms, define time frames of interest, and correct the underlying algorithms of the feedback analysis activities. Here, we address the stakeholders’ need for being in control of the machine learning models and the need for selecting and filtering information appropriate to the role of the stakeholder.

in this section. Each chapter applying a machine learning-based approach explains each step of the machine learning procedure in detail, including a brief introduction to the techniques applied. This way, the reader learns about the techniques at the point where they are applied.

[Figure 4.2 depicts the pipeline stages Data Collection and Preparation (crowd labeling, quality check, cleaning), Feature Engineering (selection, extraction, normalization), Model Selection and Training (n-fold cross validation, configuration), and Model Evaluation (against test set, benchmarking), resulting in a model for predictions, with an optional return for improvements.]

Figure 4.2: The requirements intelligence machine learning pipeline.

Figure 4.2 shows the overview of the machine learning pipeline we utilized for the requirements intelligence approaches. The general idea of the pipeline is to have a standard procedure to produce robust machine learning models for our approaches. In every approach presenting machine learning-based solutions, we performed a benchmark to find an optimal model. For the benchmark, we applied several feature extraction techniques and tested different feature combinations. We further selected several machine learning models, tuned their hyperparameters, and tested them with an unseen test set.

The machine learning pipeline contains typical steps suggested by the literature [23, 32, 83, 203]. Yufeng Guo, developer and advocate at Google Cloud, summarizes the typical steps for machine learning in a way similar to our approach [96].

In the following, we briefly introduce the pipeline steps of Figure 4.2.

Data collection and preparation. To create supervised machine learning models, we have to collect and prepare the data as a first step. Supervised machine learning models learn to make predictions based on labeled input. For example, for a supervised machine learning approach that classifies user feedback as either a problem report or not a problem report (a binary classification problem), we have to feed the algorithm with labeled examples. Typically, humans look at a sample of user feedback and assign it to one of the two categories.
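As a minimal illustration of such labeled input, the following sketch shows how a handful of feedback texts and their labels could be represented; the texts and labels are invented examples, not taken from our datasets.

```python
# Illustrative, hand-labeled user feedback for the binary classification task
# (invented examples, not actual study data).
labeled_feedback = [
    ("The app crashes every time I open the camera", "problem report"),
    ("Love the new dark mode, great update!", "not a problem report"),
    ("Login fails after the latest update", "problem report"),
    ("Would be nice to have an export to PDF", "not a problem report"),
]

texts = [text for text, _ in labeled_feedback]
labels = [label for _, label in labeled_feedback]
```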

The quality of the data is one of the most crucial parts of machine learning because models fed with, e.g., wrongly labeled data cannot make accurate predictions [250]. Either domain experts or the crowd (e.g., users, but not necessarily domain experts) can label the collected data (e.g., explicit user feedback). No matter whether experts or the crowd perform the labeling task, both need training to understand the problem the machine learning algorithm has to solve. We can achieve a common understanding of the problem by performing training or by writing a coding guide. In either approach, we clarify what the exact problem is, what its boundaries are, and give positive and negative examples. By running a prior pilot labeling, we can check whether the labeling task is well understood. Our studies rely on crowdsourcing studies for labeling the datasets.

Then, we perform cleaning techniques on the labeled data, such as handling missing values, and further split the data into a training and a test set. The training set is part of the model selection and training step, which includes the tuning of the model and performance checking. The test set is only part of the last step of the pipeline, the model evaluation, to ensure that our optimized model is not, e.g., over-fitted and performs well on unseen data.
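A minimal sketch of this splitting step, assuming Python and Scikit-learn (which we use in several approaches), could look as follows; the feedback texts and labels are invented.

```python
from sklearn.model_selection import train_test_split

# Hypothetical labeled feedback (1 = problem report, 0 = not a problem report).
texts = [
    "App crashes on startup", "Great update, thanks!", "Login button does nothing",
    "Please add a dark mode", "Sync fails on mobile data", "Works fine for me",
    "Battery drain since the last version", "Nice and clean interface",
    "Photos are not uploaded anymore", "Five stars, keep it up",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Hold out a test set that is only touched again in the model evaluation step;
# stratification keeps the class distribution comparable in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
```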

Feature engineeringis the step that uses domain knowledge to create meaning-ful data representations (features) for machine learning models [222]. Typically, a feature can either be numerical (continuous or discrete) and categorical (ordi-nal or nomi(ordi-nal). Depending on the machine learning algorithm, we can create features of both types or only one of the two types. However, we can transform features from one type to the other. For example, the context data “internet connection” can be any of the categories “Wi-Fi”, “mobile”, “not connected” (cat-egorical feature). However, most machine learning algorithms can work with numerical features. For that reason, we can either one-hot encode features or assign a number to each categorical value [208]. We can prepare the usage con-text data and interaction events into a set of features (feature vector). In natural language processing, we also have to transform the text into feature vectors us-ing techniques like bag-of-words [42], tf-idf [232], or fastText [123]. The chapters employing those techniques describe them in more detail.
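The following sketch illustrates both cases with Scikit-learn, one-hot encoding a categorical context feature and turning feedback texts into tf-idf vectors; all values are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Categorical context data such as the internet connection type,
# one-hot encoded into numerical features (invented values).
connection = [["Wi-Fi"], ["mobile"], ["not connected"], ["Wi-Fi"]]
connection_features = OneHotEncoder().fit_transform(connection).toarray()  # shape (4, 3)

# Free-text user feedback transformed into feature vectors with tf-idf.
feedback = [
    "the app crashes when I open the camera",
    "great update, love the new design",
    "login fails after the update",
]
text_features = TfidfVectorizer().fit_transform(feedback)  # sparse tf-idf matrix
```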

Besides transforming the data into feature vectors (feature extraction), feature engineering also concerns the selection of features. Sometimes, we do not want to include all possible features (e.g., all context data) but select some of them. The reasons for limiting the number of features are, among others, privacy concerns regarding the collected data [243], a decreasing performance of the model, and longer training times for the algorithms. Therefore, we can perform feature selection approaches like the attribute evaluator CfsSubsetEval and the search method BestFirst [102]. Feature selection techniques return the features that have the highest impact on the machine learning model’s accuracy.
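CfsSubsetEval and BestFirst are part of Weka; as a rough, simplified analogue in Scikit-learn, a univariate selection such as SelectKBest can illustrate the idea of keeping only the most informative features. The data below is synthetic and only stands in for a context-data feature matrix.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data standing in for a feature matrix built from context data.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=42)

# Keep only the k features with the highest estimated relevance for the label.
selector = SelectKBest(score_func=mutual_info_classif, k=3)
X_selected = selector.fit_transform(X, y)
print(selector.get_support(indices=True))  # indices of the retained features
```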

One final step is the normalization of features, sometimes called feature scaling [2]. Data often comes in different scales or different units, like counts and percentages. We include feature normalization in our benchmark experiments because normalizing features can improve the machine learning model [30].
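A small sketch of such scaling with Scikit-learn, using invented values on different scales:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales, e.g., an event count and a percentage.
X = np.array([[120.0, 0.83], [15.0, 0.10], [460.0, 0.55]])

# Min-max scaling maps each feature to the range [0, 1] ...
X_minmax = MinMaxScaler().fit_transform(X)
# ... while standardization centers each feature at zero with unit variance.
X_standard = StandardScaler().fit_transform(X)

# In the pipeline, the scaler is fit on the training set only and then
# applied unchanged to the validation and test sets.
```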

Model selection and training. The model selection and training step of the machine learning pipeline is about selecting and optimizing the machine learning models on the training set, which we created in the first step (data collection and preparation).

Our goal is to find the best machine learning algorithm and configuration for each approach of the requirements intelligence framework activities. Therefore, we selected several algorithms available in either the Python Scikit-learn library [194] or the Java Weka data mining software [103]. Most machine learning algorithms allow hyperparameter tuning, which can improve performance [19, 20].

For each selected machine learning algorithm, we run n-fold cross-validation [33, 130, 188] to validate the performance of our models on the training set. Cross-validation splits the training set into a train and a validation set for each of the n folds. This technique then trains a model using the train split and checks its performance using the validation split. After the nth fold, all data points have been part of the validation set exactly once. We selected the n-fold parameter based on the number of available training samples. We give more details on this decision in each chapter using n-fold cross-validation. We chose grid search for hyperparameter tuning to find the optimal values for various parameters. Grid search exhaustively evaluates the hyperparameter combinations of a defined grid of configurations [20]. Performing this method gives us full control over the hyperparameters.
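A condensed sketch of this step with Scikit-learn's GridSearchCV, using a toy hyperparameter grid and tiny invented data (so the resulting scores are not meaningful); the actual algorithms and grids differ per chapter.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Tiny invented training data (placeholders for the labeled feedback).
X_train = ["app crashes on startup", "great update", "login fails", "nice design"]
y_train = [1, 0, 1, 0]

# Feature extraction and classifier combined in one pipeline, tuned via
# grid search with n-fold cross-validation (n=2 here because of the toy data).
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", SVC())])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1, 10],
    "clf__kernel": ["linear", "rbf"],
}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```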

If we thought that the performance of the classification model could still be improved, we either changed the hyperparameter grid or went back to the feature engineering step.

Model evaluation. In the last step of our machine learning pipeline, we perform a final test of the machine learning model trained in the previous step. Therefore, the input for this step is the best performing model of the n-fold cross-validation, including hyperparameter tuning. The goal of this step is to check how our model performs on unseen data. For that reason, we created a train and a test set in the data collection and preparation step. The test set is usually used to compare the performance of competing models. It is a curated sample, which includes a real-world representation of the data (i.e., a realistic distribution of the classes) [220]. As it is unseen by the trained model, we can make assumptions about the model’s performance and compare the results of several models. If the evaluation of the model is unsuccessful, we might have overfitted the model and have to go back to either the second or third step of the pipeline.
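A minimal sketch of this final check, assuming a held-out test set from the data preparation step and a simple stand-in for the best model returned by the grid search (all texts and labels are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

# Invented train/test split from the data collection and preparation step.
X_train = ["app crashes on startup", "great update", "login fails after update", "nice clean design"]
y_train = [1, 0, 1, 0]
X_test = ["camera crashes constantly", "love the new icons"]
y_test = [1, 0]

# Stand-in for the best model returned by the cross-validated grid search.
model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])
model.fit(X_train, y_train)

# One-off evaluation on the unseen test set: per-class precision, recall, and
# F1 are the numbers that later appear in the benchmark tables.
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred,
                            target_names=["not a problem report", "problem report"]))
```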

We use the evaluation on the test set to create our benchmarks, which are tables describing the performance of each selected machine learning algorithm and its hyperparameters. We only report the best configuration for each algorithm, as we sometimes performed hundreds of machine learning experiments to find an optimized model.

The final result is a machine learning model that we can use to make predictions on new data points (i.e., newly received user feedback).