
5 CNN-based Segmentation

5.4 CNN Architecture

A CNN is composed of an input and an output layer and multiple hidden layers in between, similar to the concept of a multilayer perceptron (MLP). The data is passed from the input layer to the hidden layers, where the non-linear transformations of the data are computed, and finally one or more outputs are predicted in the output layer (Fig. 5.3). Typically, the hidden layers of a CNN include convolutional layers, activation functions, pooling layers, regularization layers, and fully connected layers. Choosing the right number, order, and combination of layers requires experimentation and depends on the task and the input data.

Figure 5.3: A multilayer perceptron with an input layer, two hidden layers, and an output layer
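
To illustrate this layer structure, the following sketch builds a small MLP in PyTorch; the framework choice and the layer sizes are assumptions made for demonstration and are not taken from this work.

import torch
import torch.nn as nn

# Minimal MLP sketch: input layer -> two hidden layers -> output layer.
# The layer sizes are illustrative placeholders.
mlp = nn.Sequential(
    nn.Linear(784, 128),  # input layer to hidden layer 1
    nn.ReLU(),            # non-linear transformation
    nn.Linear(128, 64),   # hidden layer 1 to hidden layer 2
    nn.ReLU(),
    nn.Linear(64, 10),    # hidden layer 2 to output layer
)

x = torch.randn(1, 784)   # one sample with 784 input features
y = mlp(x)                # forward pass through all layers
print(y.shape)            # torch.Size([1, 10])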

The starting point for establishing the architecture for this work was U-Net [157], which has been successfully applied to segment biomedical images and has set state-of-the-art results (Fig. 5.4). Encoder-decoder models such as U-Net are the de facto standard in the field of image segmentation [4, 19, 20, 25, 120, 132, 139, 157, 229]. These models process the data along two paths: the contracting path (encoder) spatially compresses the images, while the expanding path (decoder) upsamples them to restore the original size. This allows the network to capture features at different spatial resolutions while also increasing the processing speed due to downsampling.
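
To make the two paths concrete, the following is a minimal encoder-decoder sketch in PyTorch (an assumed framework); the depth and channel counts are illustrative placeholders and correspond neither to U-Net nor to the final model of this work.

import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    # Toy encoder-decoder: the contracting path downsamples the input,
    # the expanding path upsamples it back to the original size.
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # halves height and width
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # halves them again
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=2, stride=2),  # restores input size
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

x = torch.randn(1, 1, 64, 64)         # one single-channel 64x64 image
print(TinyEncoderDecoder()(x).shape)  # torch.Size([1, 1, 64, 64])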

Next, an overview of the layers relevant for this work is given, followed by the architecture designed for the segmentation task.


Figure 5.4: U-Net, a popular CNN architecture for segmentation, first compresses the input data along the contracting path and then expands it back to its original size along the expanding path (adapted from [157])

Convolutional layer. The primary building block to construct the architecture of a CNN is the convolutional layer [73], hence the term convolutional in “CNN”.

Convolution multiplies the value of each image pixel and its neighbouring ones with a kernel and then sums them up to calculate a new value at the pixel’s location (Fig. 5.5). For CNNs, the kernel is generally a matrix of size 3×3, 5×5, or 7×7. The kernel is applied across the entire image in a sliding-window fashion to generate a feature map. Each convolutional layer is composed of multiple kernels, and their number typically becomes larger in deeper layers. The goal of CNNs is to learn the values of all kernels, i.e. the weights, that produce the best results for the given task. Furthermore, by using two or more convolutional layers in succession, a CNN is able to learn complex and hierarchical features [23, 73]. Recently, alterations of the regular convolution operation have emerged, such as depthwise separable and dilated convolutions. Depthwise separable convolutions have been used to create smaller and faster models with comparable performance [23, 24], which is especially useful for mobile devices and real-time applications [227]. Dilated convolutions, on the other hand, have gained popularity since they make it possible to capture features at multiple scales without the need for scaling operations [134, 146, 221, 223].

Figure 5.5: Convolution slides a kernel across the input image to generate a feature map that is used as input for subsequent CNN layers
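
The sliding-window operation of Fig. 5.5 can be sketched in a few lines of NumPy; the function below performs a plain, unpadded convolution with stride 1 for illustration only, and the kernel is fixed here, whereas a CNN would learn its values.

import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over the image: each output value is the sum of the
    # element-wise product of the kernel with the pixel and its neighbours.
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    output = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            output[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return output

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 input
kernel = np.ones((3, 3)) / 9.0                    # 3x3 averaging kernel
print(convolve2d(image, kernel).shape)            # (3, 3) feature map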

Activation function. Activation functions are placed between the layers of a CNN, more precisely after convolutional layers, to introduce a non-linearity. The purpose is to generate features that are non-linear transformations of the input [73]. Otherwise, only a simple linear transformation could be calculated and the benefit of building a deep network would vanish. The most popular activation function in modern neural networks is the rectified linear unit (ReLU) [137]. Other variants of this activation function are PReLU [74], LReLU [124], and ELU [26]. They all return the identity for positive arguments but handle negative ones differently (Fig. 5.6). The advantage of ELU is that its exponential part for negative arguments pushes the mean activation towards zero, which accelerates the learning process and reduces the bias shift passed from the activation function to the next layer [26].

Figure 5.6: Common activation functions of neural networks
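
For reference, the sketch below writes out the activation functions of Fig. 5.6 in NumPy; the negative-side parameters use the common default values, and PReLU is omitted since its slope is learned during training.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)                           # identity for x > 0, else 0

def lrelu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)                # small linear slope for x <= 0

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))  # exponential part pushes the mean towards 0

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), lrelu(x), elu(x), sep="\n")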


Pooling layer. Pooling is an operation that reduces the size of the output feature maps of convolutional layers. The benefit of this operation is that it introduces spatial invariance [73, 147, 166], reduces the number of calculations, and reduces the possibility of overfitting [147]. One can choose between max pooling and average pooling, of which the first is more common and generally performs better [166].

Max pooling computes the maximum value in each subregion of the feature maps and sets the result in the output (Fig. 5.7). If the max pooling filter size is set to 2×2 and the stride¹ parameter is set to 2, the height and width of the feature maps are effectively halved. By combining convolutional and max pooling layers, one can reduce the input to a dense representation of the most important features.

Figure 5.7: Max pooling computes the maximum value in each subregion of the input and halves the input size
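
A minimal PyTorch sketch of the 2×2, stride-2 max pooling described above; the tensor shape is an arbitrary example.

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # 2x2 subregions, stride 2

x = torch.randn(1, 16, 64, 64)  # one sample, 16 feature maps of size 64x64
print(pool(x).shape)            # torch.Size([1, 16, 32, 32]) - height and width halved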

Skip connections. Each max pooling layer reduces the size of the image and causes a loss of spatial information. To overcome this issue, long skip connections can be introduced to pass spatial information from the contracting to the expanding path of the network [44, 88]. Another type are short skip connections, also known as residual connections, which can be added around one or more layers. They are used to address the issue of the vanishing gradient² when building very deep networks [44]. Moreover, their integration in segmentation models has shown that the model converges faster during training, its optimization is easier, and its segmentation accuracy improves [75, 88].
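
Both kinds of skip connections can be sketched as follows in PyTorch; the channel counts are illustrative assumptions and do not correspond to the final architecture of this work.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Short skip (residual) connection around two convolutional layers.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # add the input back: the residual connection

# Long skip connection: concatenate an encoder feature map with the decoder
# feature map of the same spatial size, as done along the two paths of U-Net.
encoder_features = torch.randn(1, 16, 64, 64)
decoder_features = torch.randn(1, 16, 64, 64)
merged = torch.cat([encoder_features, decoder_features], dim=1)  # 32 channels

print(ResidualBlock(16)(encoder_features).shape)  # torch.Size([1, 16, 64, 64])
print(merged.shape)                               # torch.Size([1, 32, 64, 64])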

Regularization. Many different approaches can be used to address the issue of overfitting mentioned in section 5.1. A popular and commonly used technique is Dropout [194] (DO), which randomly drops kernels and their connections to other layers during training. As a consequence, the model is encouraged to learn independent features since one kernel cannot rely on the effect of other ones [73, 194]. Other types of regularization are L2 and L1 regularization, which are applied to the kernel weights to reduce overfitting [73]. Batch Normalization [87] (BN) is yet another technique, which normalizes the inputs of a layer and can induce faster training and a higher robustness to the initialization of the network weights [26, 83, 87].

¹ Step of the convolution operation in pixels.

² The kernel weights of the convolutional layers are modified during training by the partial derivative of the error function. Especially in very deep models, the derivative can become vanishingly small, preventing the weights from being effectively updated [69, 81, 82].
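
The regularization techniques above can be sketched in PyTorch as follows; the dropout rate, weight decay, and layer sizes are illustrative assumptions and not the settings used in this work.

import torch
import torch.nn as nn

# Convolutional block with Batch Normalization and Dropout.
block = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),   # BN: normalizes the inputs passed to the next layer
    nn.ReLU(),
    nn.Dropout2d(p=0.5),  # DO: randomly drops entire feature maps during training
)

# L2 regularization on the kernel weights is commonly applied through the
# optimizer's weight decay parameter.
optimizer = torch.optim.Adam(block.parameters(), lr=1e-3, weight_decay=1e-4)

x = torch.randn(1, 16, 64, 64)
print(block(x).shape)  # torch.Size([1, 32, 64, 64])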