3.2 Generalization, Model Design, and Hyperparameters

The goal of a deep learning algorithm is to produce a model that performs well on inputs that were unobserved during training. This concept is usually referred to as generalization [324]. In the previous section, we referred to a single training dataset $X_{train}$, which was experienced by the model through optimization. To assess generalization, we also need to consider a validation set $X_{val}$ and a test set $X_{test}$ with $X_{val} \cap X_{train} = \emptyset$, $X_{test} \cap X_{train} = \emptyset$, and $X_{val} \cap X_{test} = \emptyset$. The validation set $X_{val}$ is used to fit hyperparameters $h_M$. We assess the model's performance with respect to the test set $X_{test}$, as it indicates how well the model learned the task. The performance on the training set $X_{train}$ reflects how well the algorithm fits the parameters $w_M$ to the training dataset. While providing no indication of generalization, the training performance is relevant for observing the training process itself and ensuring that model optimization leads to the desired goal.
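To make the disjointness requirement concrete, the following is a minimal NumPy sketch of such a three-way split. The split fractions, the seed, and the function name are illustrative choices, not taken from the thesis.

```python
import numpy as np

def split_dataset(X, y, val_frac=0.1, test_frac=0.1, seed=0):
    """Split (X, y) into disjoint training, validation, and test sets.

    The fractions and seed are illustrative; any split must only
    ensure that the three index sets do not overlap.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # shuffle indices once
    n_val = int(len(X) * val_frac)
    n_test = int(len(X) * test_frac)
    val_idx = idx[:n_val]                  # indices of X_val
    test_idx = idx[n_val:n_val + n_test]   # indices of X_test
    train_idx = idx[n_val + n_test:]       # indices of X_train
    return (X[train_idx], y[train_idx],
            X[val_idx], y[val_idx],
            X[test_idx], y[test_idx])
```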

In the medical image analysis domain, datasets are often limited in size. Splitting off a validation and a test set, for example based on random sampling, could induce a bias: by chance, the test set might be chosen such that generalization performance is over- or underestimated. To overcome this problem, cross-validation (CV) has been proposed. In its most common form, $k_{CV}$-fold CV, the entire dataset is partitioned into $k_{CV}$ subsets. Then, $k_{CV}$ different deep learning models are trained, where each uses $k_{CV} - 2$ subsets for training, and the remaining two subsets are used for validation and testing. If $k_{CV}$ equals the number of examples in the dataset, the method is referred to as leave-one-out CV. A minimal sketch of this splitting scheme is given below.
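The sketch follows the scheme described above; the function name and the fold assignment (the fold after the test fold serving as validation fold) are illustrative choices.

```python
import numpy as np

def kfold_splits(n_examples, k_cv, seed=0):
    """Yield (train_idx, val_idx, test_idx) for k_cv-fold CV.

    In each of the k_cv runs, one fold is the test set, another fold
    is the validation set, and the remaining k_cv - 2 folds form the
    training set, as described in the text.
    """
    assert k_cv >= 3, "need at least 3 folds for train/val/test"
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_examples), k_cv)
    for i in range(k_cv):
        test_idx = folds[i]
        val_idx = folds[(i + 1) % k_cv]    # next fold as validation
        train_idx = np.concatenate(
            [folds[j] for j in range(k_cv) if j not in (i, (i + 1) % k_cv)])
        yield train_idx, val_idx, test_idx
```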

The reason why generalization can be achieved is still part of current research [507] and not part of this thesis. In this thesis, we are concerned with deep learning model design for finding models that empirically generalize well for deep learning tasks. Designing deep learning models is the problem of finding a suitable set of hyperparameters $h_M$. The hyperparameters define the hypothesis space of the deep learning model. The hypothesis space is the set of all possible functions that are considered for solving the optimization problem given in Equation 3.3. The selection of $h_M$, and thus the hypothesis space, determines the capacity of a deep learning model. Capacity describes a deep learning model's ability to fit complex functions [176]. A model with low capacity is only able to fit simple functions, such as linear regression models. A model with high capacity is also able to fit complex nonlinear functions.

Formally, finding a suitable set of hyperparameters extends the optimization problem given in Equation 3.3 to the bilevel optimization problem:

$$
h_M = \arg\min_{h_M} \frac{1}{N_{val}} \sum_{i=1}^{N_{val}} J_{P1}\bigl(y_i, f_M(x_i^{v}, w_M, h_M)\bigr) \tag{3.4a}
$$
$$
\text{subject to} \quad w_M \in \arg\min_{w_M} \frac{1}{N_{train}} \sum_{i=1}^{N_{train}} J_{P2}\bigl(y_i, f_M(x_i^{tr}, w_M, h_M)\bigr) \tag{3.4b}
$$

The upper and lower optimization problems in Equations 3.4a and 3.4b, respectively, can have different performance measures $J_{P1}$ and $J_{P2}$. Also, the upper optimization problem aims to minimize the error for the validation data, while the lower optimization problem minimizes the error for the training data. The test dataset is not part of the optimization problem so that it can provide a performance estimate on unseen data.

Solving the upper optimization problem in Equation 3.4a is often infeasible due to high computational requirements. Nevertheless, there are several optimization methods that deal with the problem efficiently. A simple solution is to perform a grid search over a predefined subset of possible hyperparameter configurations $\hat{h}_M \subset h_M$. A downside of this approach is the high computational effort if $\hat{h}_M$ is not chosen to be very small. However, prior knowledge about the parameters' behavior can be used to set up a reasonable subset $\hat{h}_M$ for the search.

To reduce the necessary number of iterations for optimization, other techniques such as random search [46] and Bayesian optimization [456] have been introduced. In general, grid search is the most common method for hyperparameter optimization, as a lot of prior knowledge can be integrated when setting up the grid. As a result, choosing potential hyperparameters is the major engineering effort when employing deep learning methods. Such a grid search could be sketched as follows.
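In this illustration, the grid values and the `train_and_validate` helper (training a model with hyperparameters `h` on $X_{train}$ and returning its validation error $J_{P1}$) are hypothetical stand-ins for a real training pipeline.

```python
import itertools

# Hypothetical search grid; the ranges would come from prior
# knowledge about reasonable values for each hyperparameter.
grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "num_layers": [2, 4, 8],
    "batch_size": [16, 32],
}

def grid_search(train_and_validate):
    """Exhaustively evaluate all grid points and return the best h_M.

    `train_and_validate(h)` is a hypothetical helper that trains a
    model with hyperparameters h and returns its validation error.
    """
    best_h, best_err = None, float("inf")
    for values in itertools.product(*grid.values()):
        h = dict(zip(grid.keys(), values))
        err = train_and_validate(h)
        if err < best_err:
            best_h, best_err = h, err
    return best_h, best_err
```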

If the choice of hyperparameters, and thus the model's capacity, is not optimal, the model often exhibits behavior referred to as underfitting or overfitting. Underfitting describes the issue of a model not being capable of fitting the data, which results in a high training error, for example, when trying to use a linear model for a nonlinear task. Overfitting occurs when a model fits the training data perfectly, usually including noise, leading to a large difference between training and test error and, overall, a large test error. Underfitting can occur for models with low capacity, while models with high capacity tend to overfit the data. Both regimes can be illustrated with a simple polynomial fitting experiment, sketched below.
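The following illustrative NumPy experiment fits polynomials of increasing degree (a proxy for capacity) to noisy samples of a nonlinear function. The exact errors depend on the noise and the seed, but a degree-1 fit typically shows high training and test error (underfitting), while a degree-15 fit drives the training error towards zero at the cost of a higher test error (overfitting).

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * np.pi * x)            # true nonlinear target

x_train = np.linspace(0, 1, 20)
x_test = np.linspace(0, 1, 200)
y_train = f(x_train) + rng.normal(0, 0.2, x_train.shape)  # noisy samples
y_test = f(x_test)

for degree in (1, 4, 15):                    # low, moderate, high capacity
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```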

Often, the terms bias and variance are used in conjunction with under- and overfitting. Bias is an error made by a model due to incorrect assumptions in the learning algorithm [254]. An example is the assumption of a linear relationship for a nonlinear problem. Variance is an error made by a model that also captures small variations, such as random noise, instead of the target relationship [254]. An example is a high fluctuation of predictions from a nonlinear model that is used for a simple linear problem. Therefore, underfitting with low capacity relates to a high-bias problem with large deviations from the true values. Overfitting comes with a high-variance problem, as predictions for the test set are likely to deviate a lot. The relationship between training and test error, overfitting and underfitting, capacity, and bias and variance is shown in Figure 3.1.

Fig. 3.1: The relationship between training and test (generalization) error, overfitting and underfitting, capacity, and bias and variance. With increased capacity, bias tends to decrease while variance increases. Up to the point of optimal capacity, both training and test error decrease. Here, the model’s capacity is still in the underfitting zone. Once the optimal capacity is passed, the model enters the overfitting zone, and the test error begins to increase while the training error decreases further. Based on Goodfellow et al. [176].

Adjusting a model's capacity by the choice of hyperparameters alone is often difficult in practice. Therefore, other methods, such as regularization, can be used to control a model's capacity. Here, the model's capacity is chosen to be large, with a large hypothesis space, and regularization techniques are used to limit the model's tendency to overfit.

In a general model, a regularizer $\Omega(w_M)$ adds a penalty on the model's weights such that the training algorithm's parameter selection is constrained. The optimization problem given in Equation 3.4 is therefore extended to

$$
h_M = \arg\min_{h_M} \frac{1}{N_{val}} \sum_{i=1}^{N_{val}} J_{P1}\bigl(y_i, f_M(x_i^{v}, w_M, h_M)\bigr) \tag{3.5a}
$$
$$
\text{subject to} \quad w_M \in \arg\min_{w_M} \frac{1}{N_{train}} \sum_{i=1}^{N_{train}} J_{P2}\bigl(y_i, f_M(x_i^{tr}, w_M, h_M)\bigr) + \lambda_w \Omega(w_M) \tag{3.5b}
$$

where $\lambda_w$ is a weighting parameter that trades off the regularization and the model's ability to fit the training data. Thus, instead of including or excluding functions from the hypothesis space, regularization more gradually enforces a tendency towards a certain type of function. We introduce typical regularization methods in Section 3.3.5.
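As a concrete example, an L2 regularizer $\Omega(w_M) = \|w_M\|_2^2$ (weight decay) for a linear model could be implemented as follows. This is an illustrative NumPy sketch with freely chosen function names, learning rate, and data, not an implementation from the thesis.

```python
import numpy as np

def ridge_gradient_step(w, X, y, lr, lam_w):
    """One gradient step on the regularized objective
    J(w) = 1/N * ||X w - y||^2 + lam_w * ||w||^2,
    i.e. Omega(w) = ||w||_2^2; lam_w plays the role of the weighting
    parameter lambda_w in Equation 3.5b.
    """
    n = len(y)
    residual = X @ w - y
    grad = (2.0 / n) * X.T @ residual + 2.0 * lam_w * w
    return w - lr * grad

# Example usage with random data (illustrative only):
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.1, 100)
w = np.zeros(5)
for _ in range(500):
    w = ridge_gradient_step(w, X, y, lr=0.05, lam_w=0.01)
```

Larger values of `lam_w` pull the weights more strongly towards zero, trading training fit for a smoother, lower-variance model.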

In summary, a deep learning model's generalization capability is measured on an independent test set. To achieve good generalization performance, the model needs to be designed appropriately by selecting suitable hyperparameters $h_M$. Hyperparameters are chosen based on performance on a validation set that has no overlap with the training and test sets. The selection of hyperparameters influences a model's capacity. If the capacity is not suitable for the task, underfitting or overfitting can occur, which can be counteracted by regularization.