
3.3 Neural Networks

3.3.5 Regularization

Regularization describes methods that try to prevent overfitting in a model, as explained in the context of generalization in Section 3.2. There is a vast number of regularization strategies. In many cases, techniques related to deep learning have an additional regularizing effect, although their primary purpose is focused on something else. Therefore, regularization is a loose term that can refer to many different techniques. Goodfellow et al. define regularization as "any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error" [176].

Conventional regularization techniques usually enforce explicit regularization. A common method is to impose a penalty on the neural network's weights, which is added to the loss function

$\hat{J}(w_M, x, y) = J_P(w_M, x, y) + \lambda_w \Omega(w'_M)$ (3.33)

where $\lambda_w$ trades off parameter fitting to the data and regularization. Note that $w_M$ is the set of all parameters and $w'_M$ the set of regularized parameters with $w'_M \subseteq w_M$. One of the most popular penalty choices is the $L^2$ norm, also referred to as weight decay, which is given by

$\Omega(w'_M) = \frac{1}{2} \lVert w'_M \rVert_2^2$. (3.34)

Similarly, the $L^1$ norm penalty is given by

$\Omega(w'_M) = \lVert w'_M \rVert_1$. (3.35)

Both approaches behave differently. The $L^2$ norm keeps weights at smaller magnitudes by adding a term to the gradient that scales with the weight's value. Therefore, larger weights that deviate a lot from zero are penalized more. The $L^1$ norm, on the other hand, always contributes a constant term to the gradient. This leads to the $L^1$ norm enforcing sparsity in the weights and driving some of them to zero. This property can be used as a feature selection mechanism [356].
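The following sketch illustrates how such a penalty enters the loss in practice; the model, the data, and the value of $\lambda_w$ are hypothetical placeholders, and the penalty is computed over all parameters for simplicity.

```python
import torch
import torch.nn as nn

# A minimal sketch of Equations 3.33-3.35 with a small linear model and random data.
torch.manual_seed(0)
model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
lambda_w = 1e-3                                                       # illustrative trade-off weight

data_loss = nn.functional.mse_loss(model(x), y)                       # data term J_P(w_M, x, y)
l2_penalty = 0.5 * sum(p.pow(2).sum() for p in model.parameters())    # Equation 3.34
l1_penalty = sum(p.abs().sum() for p in model.parameters())           # Equation 3.35

loss = data_loss + lambda_w * l2_penalty                              # Equation 3.33
loss.backward()                                                       # gradients now include the penalty term
```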

Another approach that is typically employed is the use of early stopping. When training a model, the validation and training error will initially both go down as the model is fitted to the data. At some point, if the model capacity is large enough, the model will start overfitting, and the validation error starts to increase [387]. Thus, a simple algorithm can be used to save the current best model and stop training when the validation error keeps increasing for several iterations. The regularizing effect is achieved by limiting the parameter space to which the model can extend [52]. Assuming a bounded gradient, a fixed learning rate, and a certain number of iterations before we stop, the volume of parameter space the model can reach around the initial parameter set $w_M^0$ is limited. In fact, early stopping limits a model's capacity similarly to the $L^2$ norm [176].
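A minimal sketch of such an early-stopping loop is shown below; the training and validation routines as well as the patience value are hypothetical placeholders.

```python
import copy
import random

def train_one_epoch(model):
    pass                                          # placeholder: one pass over the training set

def evaluate(model):
    return random.random()                        # placeholder: returns the current validation error

model = {"weights": None}                         # stand-in for a real model
best_model, best_val_error = None, float("inf")
patience, epochs_without_improvement = 5, 0       # illustrative patience value

for epoch in range(100):
    train_one_epoch(model)
    val_error = evaluate(model)
    if val_error < best_val_error:                # keep the currently best model
        best_model = copy.deepcopy(model)
        best_val_error, epochs_without_improvement = val_error, 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                                 # stop once the validation error keeps increasing
```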

A technique already introduced in the context of convolutional and recurrent neural networks is parameter sharing, which also imposes a regularizing effect. While a norm constraint forces parameters to be close to zero, another approach is to force parameters to be similar to each other, where the extreme case is to force them to be equal. This is motivated by the idea that we often search for the same kind of features in different regions of an input. Therefore, CNNs and RNNs already have an in-built regularization effect.

Another popular method is the use of model averaging or ensemble methods. For this approach, multiple models are trained, and their outputs are aggregated, for example, by calculating a weighted average of the models' outputs [334]. Given a set of trained models $\mathcal{F}_M = \{f_M^1, \ldots, f_M^{N_{en}}\}$ with $N_{en}$ trained models, we obtain the new ensemble prediction with

$\hat{y}_{en} = E_{en}(\mathcal{F}_M(x))$ (3.36)

where $E_{en}$ aggregates individual model predictions, for example, by averaging. This usually leads to better performance compared to a single model; however, the performance is traded off for significantly higher computational costs, as several models have to be trained and evaluated.
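As a sketch of Equation 3.36 with averaging as the aggregation $E_{en}$, the following snippet averages the outputs of a few placeholder models standing in for $N_{en}$ trained networks.

```python
import numpy as np

# Three hypothetical "trained models" f_M^i, here simple linear functions.
models = [lambda x, b=b: 0.9 * x + b for b in (0.0, 0.1, -0.05)]

x = np.array([1.0, 2.0, 3.0])
predictions = [f_m(x) for f_m in models]       # individual model outputs f_M^i(x)
y_en = np.mean(predictions, axis=0)            # averaged ensemble prediction (Equation 3.36)
```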

The methods presented above were introduced before the emergence of deep learning, but they are still relevant and effective in today's models. More recently, additional methods have been introduced that were designed specifically for modern deep learning architectures.

Dropout was introduced by Hinton et al. [198] as a computationally efficient method to regularize FC-NNs. The basic idea is to randomly drop hidden units by setting layer outputs to zero during training with a probability of $p_d$ to keep weights from co-adapting too much [464]. Thus, when setting a hidden unit to zero, its connections to the following layer all contribute a value of zero. This corresponds to sampling a network out of a set of potential networks in each training iteration, where all networks share the same set of parameters. For a single-layer network with $d_{NN}$ neurons, a new network is sampled from $2^{d_{NN}}$ possible networks at each training iteration. For evaluation after training, this would lead to an ensemble of $2^{d_{NN}}$ networks that need to be evaluated. Following Equation 3.36 with averaged aggregation, we need to calculate

$\hat{y}_{en} = \frac{1}{2^{d_{NN}}} \sum_{i=1}^{2^{d_{NN}}} f_M^i(x)$. (3.37)

As this is computationally infeasible, the weight scaling inference rule can be used instead. Here, all neuron outputs are scaled with the keep probability $1 - p_d$, which approximates evaluating the entire ensemble. Thus, only a single model is required, which makes the method very efficient while also showing considerable performance improvements across different tasks [464].
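The following sketch contrasts the training-time masking with the weight scaling inference rule; the activations and the drop probability $p_d$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p_d = 0.5                                      # illustrative drop probability
h = rng.normal(size=(4, 8))                    # hidden activations of one layer

# Training: each unit is dropped independently, sampling one of the 2^d_NN subnetworks.
mask = rng.random(h.shape) > p_d
h_train = h * mask

# Inference (weight scaling rule): keep all units and scale by the keep probability 1 - p_d.
h_test = h * (1.0 - p_d)
```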

Dropout inspired other methods that follow the idea of stochastic model averaging. For example, DropConnect does not set entire units to zero but drops only some connections between input and output units [517]. Typically, these methods are used for FC-NNs, but they have also found applications with CNNs and RNNs. For CNNs, we can set pixels or voxels within a feature map to zero instead of setting hidden units to zero.

Alternatively, Tompson et al. proposed that dropping entire feature maps instead of individual pixels is more beneficial for CNNs [490]. As neighboring pixels are usually strongly correlated, single pixel dropout does not enforce independence but only scales the gradient by the dropout probability. Thus, preventing co-adaptation between entire feature maps is deemed preferable for CNNs [490].
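A sketch of this feature-map-wise dropout, assuming an illustrative tensor shape and drop probability, could look as follows.

```python
import numpy as np

rng = np.random.default_rng(0)
p_d = 0.3                                      # illustrative drop probability
features = rng.normal(size=(8, 32, 16, 16))    # (batch, channels, height, width)

# One keep/drop decision per example and channel, broadcast over all spatial positions.
mask = rng.random((8, 32, 1, 1)) > p_d
features_train = features * mask               # entire feature maps are set to zero
```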


Similarly, vanilla dropout is not optimal for RNNs. Introducing dropout within recurrent cells has been shown to be ineffective, as long-term information appears to be lost [578]. Therefore, most approaches apply dropout only to the input and output of the RNN, not to recurrent connections within cells [378]. Another popular approach found that dropout can be applied effectively if the same dropout mask is used for all recurrent steps within a training step [144].
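The core of that idea can be sketched as follows, with an illustrative sequence length, unit count, and drop probability: one mask is sampled per training step and reused at every time step instead of being resampled.

```python
import numpy as np

rng = np.random.default_rng(0)
p_d, num_steps, num_units = 0.3, 10, 16            # illustrative values
inputs = rng.normal(size=(num_steps, num_units))   # one unrolled input sequence

mask = rng.random(num_units) > p_d                 # sampled once per training step
dropped_inputs = inputs * mask                     # identical mask at every time step
```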

Batch normalization [214] is a technique that accelerates training while also providing a regularizing effect. The main problem addressed by batch normalization is called internal covariate shift. When training deep networks, intermediate layer outputs and gradients tend to vary significantly. As gradients depend on surrounding layer outputs and weights within the computational graph, the differing scales of the updates for different weights make the choice of a single learning rate difficult. Therefore, it would be desirable to keep all intermediate layer outputs and gradients in a similar range.

For an FC-NN, Ioffe et al. suggest normalizing the output of each neuron in a layer independently such that

$\hat{x}_l^j = \frac{x_l^j - \mathrm{E}[x_l^j]}{\sqrt{\mathrm{Var}[x_l^j]}}$ (3.38)

where $x_l^j$ is the output of the $j$th neuron in layer $l$. The mean and variance estimates are calculated over the mini-batch $\mathcal{B}_{train} \subset \mathcal{X}_{train}$ in each training step. As a result, layer outputs depend on the current batch composition and the other examples in the batch. This would lead to inconsistent results during inference. Therefore, the calculated means and variances are saved as moving averages during training and applied as a fixed normalization during inference.

After a batch normalization layer, the network's capacity can be reduced. For example, for a sigmoid activation function, the normalized values would mostly fall into the function's approximately linear region around zero. To avoid this problem, a trainable linear transformation layer is added after normalization. The transformation is defined as

$\tilde{x}_l = \gamma_M^l \hat{x}_l + \beta_M^l$ (3.39)

where $\gamma_M^l$ and $\beta_M^l$ are learnable parameters in layer $l$. Ioffe et al. suggest including the batch normalization layer directly in front of activation functions within an FC-NN.

Note that the mean subtraction eliminates the bias variable that is usually present. The bias is implicitly included again with the transformation in Equation 3.39.
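A minimal sketch of Equations 3.38 and 3.39 for one fully connected layer, including the moving averages used at inference, is given below; the momentum and the small constant eps are illustrative additions to keep the computation stable.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 16))        # mini-batch of layer outputs
gamma, beta = np.ones(16), np.zeros(16)                  # learnable scale and shift
running_mean, running_var = np.zeros(16), np.ones(16)
momentum, eps = 0.1, 1e-5                                # illustrative choices

# Training: normalize with batch statistics and update the moving averages.
mu, var = x.mean(axis=0), x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)                    # Equation 3.38
x_tilde = gamma * x_hat + beta                           # Equation 3.39
running_mean = (1 - momentum) * running_mean + momentum * mu
running_var = (1 - momentum) * running_var + momentum * var

# Inference: use the stored moving averages instead of batch statistics.
x_hat_test = (x - running_mean) / np.sqrt(running_var + eps)
x_tilde_test = gamma * x_hat_test + beta
```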

In practice, batch normalization has become a core component of almost all deep learning models. The normalization significantly enhances the training process with well-conditioned gradients, leading to both faster convergence and better performance.

Usually, networks using batch normalization can be trained with a larger learning rate while still converging correctly. In addition, the technique comes with an implicit regularizing effect. The mean subtraction and division by the standard deviation are a source of noise added to the data, which varies with each batch. Thus, in each training epoch, training samples are transformed differently, given that batches are assembled randomly. This forces the model to be more robust towards distorted inputs. Therefore, Ioffe et al. suggest toning down other regularization measures such as $L^2$ regularization or dropout, as batch normalization already provides sufficient regularization.

When applying batch normalization with CNNs and RNNs, modifications are required. For CNNs, the typical approach is to apply batch normalization across the batch dimension and all spatial image dimensions. Thus, each feature map within the network is normalized independently [214]. For RNNs, recurrent batch normalization has been introduced, where batch normalization layers are applied both to the input transformation and the hidden state transition [94]. There are no batch normalization layers for the cell state update in order to keep the long-term gradient flow untouched.

Over the years, several improvements for batch normalization have been proposed. For example, batch re-normalization addresses the problem of batch size dependence [213]. Here, additional parameters are introduced to obtain better results for small batches.

Furthermore, instance normalization has been proposed, where each example in a batch is normalized independently [498]. Thus, the method also works well for a batch size of 1, where batch normalization cannot be applied. Group normalization is another extension of batch normalization for small batch sizes [548]. Here, feature maps within the network are split into different groups, and each group is normalized independently. While batch normalization is still the standard normalization method for most applications, instance normalization and group normalization are particularly well suited for higher-dimensional data, where batch sizes need to be small due to the exponential increase in memory requirements.
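The differences between these variants come down to the axes over which the statistics are computed, as the following sketch for an illustrative CNN feature tensor shows; the group count and eps are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 32, 16, 16))       # (batch, channels, height, width)
eps, num_groups = 1e-5, 8

# Batch normalization: statistics over batch and spatial dimensions, one pair per channel.
bn = (x - x.mean(axis=(0, 2, 3), keepdims=True)) / np.sqrt(x.var(axis=(0, 2, 3), keepdims=True) + eps)

# Instance normalization: statistics per example and per channel (works for batch size 1).
inorm = (x - x.mean(axis=(2, 3), keepdims=True)) / np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)

# Group normalization: channels are split into groups, statistics per example and group.
xg = x.reshape(8, num_groups, 32 // num_groups, 16, 16)
gn = (xg - xg.mean(axis=(2, 3, 4), keepdims=True)) / np.sqrt(xg.var(axis=(2, 3, 4), keepdims=True) + eps)
gn = gn.reshape(x.shape)
```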

Data Augmentation is a regularization method that addresses overfitting by changing the network inputs instead of the network itself. Typical data augmentation strategies apply a transformation $\phi(x)$ to an image $x$, for example, a rotation, which allows for generating new, artificial examples. While this can be interpreted as a way to generate additional data, its actual purpose is to force a deep learning model to become invariant towards the transformation $\phi(x)$. If we show the same image with different rotations to the network while the image's label stays the same, the network should learn to ignore rotations and achieve robustness towards that transformation. In terms of regularization, this means that, for the example of rotations, we can keep the network from overfitting to a particular pose and orientation.

As a result, we can design data augmentation strategies that transform inputs in a way that is not relevant to the learning task at hand. For example, if we want to classify a disease based on an image of a lesion, the pose is likely not relevant, and data augmentation by rotation could be beneficial. However, if we want to determine an object’s pose within the image, rotations are likely detrimental as the object’s pose is changed. Therefore, data augmentation strategies need to be carefully designed for each application.

Besides rotation, flipping, scaling, and shear are popular data augmentation strategies.

In the natural image domain, brightness and contrast, as well as color adjustments, are popular. Random cropping is another technique related to data augmentation. Here, the idea is to crop a small part out of a larger image for model training. In this way, the model only observes a part of the input image at a time. This can prevent overfitting to image background or clutter that is not relevant for the learning task. When applying random cropping, we need to ensure that the most relevant part of the image is still contained in the crop.
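A minimal training-time augmentation pipeline along these lines could be sketched as follows; the crop size and transformation probabilities are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64))                           # stand-in for a training image
crop_size = 48                                         # illustrative crop size

def augment(img):
    if rng.random() < 0.5:
        img = np.fliplr(img)                           # random horizontal flip
    img = np.rot90(img, k=int(rng.integers(0, 4)))     # random 90-degree rotation
    top = int(rng.integers(0, img.shape[0] - crop_size + 1))
    left = int(rng.integers(0, img.shape[1] - crop_size + 1))
    return img[top:top + crop_size, left:left + crop_size]   # random crop

augmented_image = augment(image)                       # shape (48, 48)
```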

In general, data augmentation strategies are only applied during training and not during evaluation, where performance is judged based on real images. As model evaluation should be deterministic, random cropping needs to be replaced, for example, by taking a single center crop from the image. As an alternative, multi-crop evaluation can be used, where we take $N_{crops}$ crops $\tilde{x}_j$ from the image $x$, evaluate them individually with the trained model, and then aggregate the result. The calculation is performed similarly to Equation 3.36, except that the same model is evaluated on different image patches. Thus, for multi-crop evaluation, we calculate

$\hat{y}_{mc} = \frac{1}{N_{crops}} \sum_{j=1}^{N_{crops}} f_M(\tilde{x}_j)$. (3.40)

Similar to ensembling, this approach usually improves robustness and performance; however, it comes at the cost of increased computational effort.
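A sketch of Equation 3.40 with five fixed crops (four corners and the center, an illustrative choice) and a placeholder model is given below.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64))                       # stand-in for a test image
crop_size = 48

def model(crop):                                   # placeholder for the trained f_M
    return crop.mean()

corners = [(0, 0), (0, 16), (16, 0), (16, 16), (8, 8)]   # four corners plus center
crops = [image[t:t + crop_size, l:l + crop_size] for t, l in corners]
y_mc = np.mean([model(c) for c in crops])          # averaged multi-crop prediction (Equation 3.40)
```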