

3.3.2 Convolutional Neural Networks

CNNs are a modified version of FC-NNs that are specifically designed for processing image data $x \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_n}$ instead of feature vectors $x \in \mathbb{R}^{d_1}$. CNNs gained initial attention for digit recognition in 1998 [279]. With the rapid increase in computing power over the last decade, training large-scale CNNs became feasible, and, in 2012, Alex Krizhevsky popularized deep learning with outstanding results at the ImageNet classification competition [262]. Here, we explain the characteristics of convolutional layers that make deep learning models very effective for image processing.

Moving from fully-connected layers, as given in Equation 3.8, to convolutional layers is motivated by the concepts of sparse connectivity and parameter sharing [176]. Classic NNs contain several neurons per layer, where each neuron is connected to every input with individual weights. For a 2D image or 3D volume input, this leads to millions of weights per layer and thus an increased risk of overfitting, as described in the previous section. To overcome this issue, neurons are reduced to a smaller receptive field $v \in \mathbb{R}^{k_1 \times k_2 \times \cdots \times k_n}$ with $k_i < d_i$. Hence, they are only connected to a small portion of the input.

A next step is to enforce parameter sharing among all neurons in a layer. Thus, the same parameters $K \in \mathbb{R}^{k_1 \times k_2 \times \cdots \times k_n}$ are used for every receptive field $v$. This reduces the number of parameters even further. The motivation for this is our goal to learn similar features in every region of the input, for example, the detection of edges in an image.
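To make the effect of sparse connectivity and parameter sharing concrete, the following minimal sketch compares the weight count of a fully-connected layer with that of a single shared $3 \times 3$ kernel; the image size and the one-output-per-location assumption are illustrative choices, not taken from the text.

```python
# Illustrative parameter count: fully-connected layer vs. a shared 3x3 kernel.
d1, d2 = 64, 64          # input image size (assumed for the example)
n_out = d1 * d2          # one output unit per input location

fc_weights = (d1 * d2) * n_out   # every output connected to every input pixel
conv_weights = 3 * 3             # one shared 3x3 kernel reused at all locations

print(f"fully-connected: {fc_weights:,} weights")    # 16,777,216
print(f"shared 3x3 kernel: {conv_weights} weights")  # 9
```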

As a result, we obtain a kernel $K$ instead of a full weight matrix. Before, the layer's operation was a matrix multiplication of the inputs with the weight matrix.

Now, we can picture the operation as the kernel being slid over the input, which resembles the mathematical operation of a discrete convolution, introduced in Section 2.2.2. As a result, the convolutional layer is closely related to classic image processing with local operators. There, we introduced several different kernels for tasks such as smoothing or edge detection. Here, we learn the kernels from data instead. Thus, there is no need to handcraft kernels based on potentially flawed assumptions about which properties in an image are important for a learning problem.

In the general case, a convolutional layer $l$ receives a tensor $x^{l-1} \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_n \times n_c^{l-1}}$ from a previous convolutional layer. Here, $n_c$ is the number of input channels. In the case of the first convolutional layer of a CNN and an RGB image, $n_c^1 = 3$. The kernel $K \in \mathbb{R}^{k_1 \times k_2 \times \cdots \times k_n \times k_c^l}$ is used to produce $n_c^l = k_c^l$ new feature maps. This results in an output tensor $x^l \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_n \times n_c^l}$. To produce $n_c^l$ new feature maps, $k_c^l \, n_c^{l-1}$ individual convolutions are performed within one convolutional layer. For the example of an RGB image $x^{l-1} \in \mathbb{R}^{d_1 \times d_2 \times n_c^{l-1}}$, the convolution operation is defined as

$$(K \ast x^{l-1})(q_1, q_2) := \sum_{i_1} \sum_{i_2} \sum_{i_c} K(i_1, i_2, i_c)\, x^{l-1}(q_1 + i_1, q_2 + i_2, i_c) \qquad (3.9)$$

for $k_c^l = 1$. Note that Equation 3.9 actually describes a cross-correlation. A cross-correlation is equivalent to a convolution with the same, flipped kernel. Also note that, in the following, we refer to the size of a kernel $K$ simply as $k_1 \times k_2 \times k_c^l$, following conventions in the deep learning literature.

Fig. 3.3: Receptive fields of kernels over three layers. The deeper layer is indirectly connected to a larger receptive field, allowing for more abstract feature learning.

Each location $(q_1, q_2, q_c)$ corresponds to a receptive field $v_{q_1, q_2, q_c}$. By visiting only a subset of receptive fields $v_{r_1 q_1, r_2 q_2, q_c}$ with $r_j \in \mathbb{N}$, we obtain a strided convolutional layer. Again, this corresponds to the strided convolution we introduced for image processing, see Section 2.2.2. Thus, we downsample the layer's output $x^l$ by a factor of $r_j$ in each dimension $j$.
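As a concrete illustration, the following NumPy sketch implements the cross-correlation form of Equation 3.9 for a single output feature map ($k_c^l = 1$), with an optional stride $r_j$ applied to both spatial dimensions as described above; the function and variable names are our own and only serve the example.

```python
import numpy as np

def cross_correlate_2d(x, K, stride=1):
    """Single-feature-map cross-correlation (Equation 3.9, k_c^l = 1).

    x: input of shape (d1, d2, n_c), e.g. an RGB image with n_c = 3.
    K: kernel of shape (k1, k2, n_c).
    stride: sampling factor r_j applied to both spatial dimensions.
    """
    d1, d2, n_c = x.shape
    k1, k2, k_c = K.shape
    assert k_c == n_c, "kernel channels must match input channels"

    out1 = (d1 - k1) // stride + 1   # number of receptive fields per axis
    out2 = (d2 - k2) // stride + 1
    out = np.zeros((out1, out2))

    for q1 in range(out1):
        for q2 in range(out2):
            # receptive field v_{q1, q2} of size k1 x k2 x n_c
            v = x[q1 * stride: q1 * stride + k1,
                  q2 * stride: q2 * stride + k2, :]
            out[q1, q2] = np.sum(K * v)  # sum over i1, i2, and i_c
    return out

# Example: a 3x3 kernel slid over a 32x32 RGB image with stride 2.
x = np.random.rand(32, 32, 3)
K = np.random.rand(3, 3, 3)
print(cross_correlate_2d(x, K, stride=2).shape)  # (15, 15)
```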

The convolutional layer can be extended to the general case with $N_d$ dimensions. The discrete convolution operation for $N_d$ dimensions is defined as

$$(K \ast x)(q_1, \ldots, q_n) := \sum_{i_1} \cdots \sum_{i_n} K(i_1, \ldots, i_n)\, x(q_1 + i_1, \ldots, q_n + i_n) \qquad (3.10)$$

which can be employed for a general convolutional layer, as given in Equation 3.9.

While the mathematical extension to arbitrary dimensions is straightforward, high-dimensional CNNs are difficult to design, which we address in Section 4.

Another important property of CNNs is the stacking of local receptive fields. Sliding the same filter over an input implies that only local properties can be captured. However, deep CNN architectures also allow for learning more abstract features, as the receptive field implicitly grows when stacking more layers. This is visualized in Figure 3.3.

In Section 2.2.2, we also addressed the problem of border effects for convolutions. A receptive field's location close to an image border is restricted by the kernel size. Thus, the maximum number of receptive fields $u_{v_j}$ for dimension $j$ is given by

$$u_{v_j} = \frac{d_j - k_j}{r_j} + 1. \qquad (3.11)$$

Thus, the size of the output tensor $x^l$ for each axis $j$ is at most $d_j - k_j + 1$ for $r_j = 1$, which leads to shrinking tensor dimensions when stacking convolutional layers. In some cases, this behavior is tolerated or desired. This mode of operation is referred to as VALID convolutional layers [176]. If we aim to keep the size $d_j$ of all dimensions constant across layers, we can employ zero padding with $p_{d_j}$ zeros added around the tensor's border in all tensor dimensions $d_j$. We obtain

$$u_{v_j} = \frac{d_j + 2 p_{d_j} - k_j}{r_j} + 1 \qquad (3.12)$$

such that $p_{d_j}$ can be chosen to adjust the new tensor's size. For $r_j = 1$ we can keep the image size constant with

$$p_{d_j} = \left\lfloor \frac{k_j}{2} \right\rfloor. \qquad (3.13)$$

This is referred to as SAME convolutional layers [176].
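The output-size relations in Equations 3.11 to 3.13 can be checked numerically. The following sketch computes the output extent per axis and cross-checks it against a framework convolution; PyTorch is used here purely as an assumed example framework.

```python
import torch
import torch.nn as nn

def output_size(d, k, p=0, r=1):
    """Output extent per axis (Equations 3.11/3.12): (d + 2p - k) // r + 1."""
    return (d + 2 * p - k) // r + 1

d, k, r = 32, 3, 1
print(output_size(d, k, p=0, r=r))        # VALID: 30, the tensor shrinks
print(output_size(d, k, p=k // 2, r=r))   # SAME: 32, padding p = floor(k/2)

# Cross-check with a 2D convolutional layer.
x = torch.randn(1, 3, d, d)                        # (batch, channels, d1, d2)
valid = nn.Conv2d(3, 8, kernel_size=k, padding=0)
same = nn.Conv2d(3, 8, kernel_size=k, padding=k // 2)
print(valid(x).shape, same(x).shape)               # 30x30 vs. 32x32 spatially
```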

After the convolution operation in a layer, a nonlinear activation function is applied to the output $x^l$, similar to standard FC-NNs. Afterward, another operation called pooling is often applied. This operation reduces the spatial dimensions of a layer's feature map by grouping a region of the feature map with a summary statistic [176]. Usually, max pooling [594] is used, which selects the largest value from a receptive field. Thus, we obtain

$$v^l = \max(v^{l-1}). \qquad (3.14)$$

Other choices are average pooling or more recent approaches such as stochastic pooling [582]. Still, the standard approach in state-of-the-art architectures is to use max pooling over a $2 \times 2$ region, or a $2 \times 2 \times 2$ region for the 3D case [475]. Max pooling comes with the advantage of introducing small-scale invariance towards minor translations.

If pixels shift within the pooling region, the max pooling operation will still output the same maximum value. While this is a desirable property for some applications, pooling operations can also be effectively replaced by convolutional layers with a stride of $r_j = 2$. This has been studied extensively, with the result that CNNs without pooling operations can achieve similar or better performance when using convolutional layers with a stride of $r_j = 2$ instead [463]. Thus, both approaches have been frequently used in practice.
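The following sketch contrasts the two downsampling options discussed above, a $2 \times 2$ max pooling operation (Equation 3.14) and a convolution with stride $r_j = 2$; the concrete layer sizes are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)   # feature maps from a previous layer (assumed size)

# Option 1: convolution followed by 2x2 max pooling.
pooled = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),          # v^l = max(v^{l-1}) over 2x2 regions
)

# Option 2: strided convolution, replacing the pooling operation [463].
strided = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)

print(pooled(x).shape)   # torch.Size([1, 32, 32, 32])
print(strided(x).shape)  # torch.Size([1, 32, 32, 32])
```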

Convolutional layers, including kernels, an activation function, and pooling, form the basis of a CNN. Usually, there are several consecutive convolutional layers that can be seen as a feature extraction stage, followed by an output layer where the standard FC-NN structure with matrix multiplications is used once again to produce a prediction vector. When moving from convolution-based processing to matrix multiplications, the feature tensor needs to be reshaped into a vector. Thus, before entering the output layer, the feature tensor $x^l \in \mathbb{R}^{d_1 \times \cdots \times d_n \times n_c^l}$ is flattened to $\hat{x}^l \in \mathbb{R}^{d_1 \cdots d_n n_c^l}$. This can be problematic if the dimensions $d_j$ are large, as the output layer will have a large number of parameters.

A typical approach to overcome this problem is global average pooling (GAP). Here, we apply an average pooling operation with a receptive field of size $v^{l-1} \in \mathbb{R}^{d_1 \times \cdots \times d_n}$ to the final feature tensor $x^l$ [299]. In this way, the output layer's number of parameters is substantially reduced.
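To illustrate why global average pooling reduces the output layer's parameter count, the following sketch compares a flattened feature tensor with a GAP head; the feature map size and the class count are assumptions, and PyTorch is again used as an assumed example framework.

```python
import torch
import torch.nn as nn

n_classes, n_c = 10, 256
x = torch.randn(1, n_c, 16, 16)   # final feature tensor x^l (assumed size)

# Flattening: the output layer sees d1 * d2 * n_c inputs.
flat_head = nn.Linear(16 * 16 * n_c, n_classes)

# Global average pooling: one value per feature map, so only n_c inputs remain.
gap = nn.AdaptiveAvgPool2d(1)
gap_head = nn.Linear(n_c, n_classes)

print(sum(p.numel() for p in flat_head.parameters()))  # 655,370 parameters
print(sum(p.numel() for p in gap_head.parameters()))   # 2,570 parameters
print(gap_head(gap(x).flatten(1)).shape)               # torch.Size([1, 10])
```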

The concepts we introduced so far allow for the construction of a CNN for image-based learning problems. However, from a practical perspective, a lot of hyperparameter choices are still unclear.


Fig. 3.4: Example of an Inception layer, based on [475].

As outlined in Section 3.2, hyperparameter optimization is extensively influenced by the search space design of the deep learning engineer. For example, the number of layers, the choice of $k_c^l$ at each layer, and the size of $K$ appear to be arbitrary. Therefore, Simonyan et al. [450] introduced several fundamental design principles in their VGG CNN model which are still relevant and applied today. Simonyan et al. suggest that using kernel sizes $k_j = 3$ for all dimensions is the optimal choice for CNNs. Larger kernels have the advantage of covering a larger local region, allowing for more context to be taken into account. However, considering Figure 3.3, we can observe that stacking multiple layers also leads to a larger implicit receptive field. As a result, a convolutional layer with kernel sizes $k_j = 5$ can be represented by two layers with kernel sizes $k_j = 3$. While reducing the number of parameters, for example, from 25 to 18 for the 2D case, this also allows for more nonlinearities being introduced in between layers, which should allow the CNN to learn more abstract features [450].

Furthermore, Simonyan et al. suggest using an initially small number of output feature maps $k_c^l$ in the first convolutional layer. Then, $k_c^l$ is doubled every time the spatial dimensions are reduced. In this way, architecture design is simplified, as only the initial feature map size, the number of layers, and the spatial reduction stages need to be defined.
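A minimal sketch of these VGG-style design rules, assuming PyTorch: each stage stacks $3 \times 3$ convolutions, and the number of feature maps is doubled whenever the spatial dimensions are halved. The stage widths are illustrative and not those of the original VGG models.

```python
import torch
import torch.nn as nn

def vgg_stage(in_c, out_c, n_convs=2):
    """A VGG-style stage: stacked 3x3 convolutions followed by 2x2 max pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_c if i == 0 else out_c, out_c,
                             kernel_size=3, padding=1),
                   nn.ReLU()]
    layers.append(nn.MaxPool2d(2))   # halve the spatial dimensions
    return nn.Sequential(*layers)

# Feature maps double (32 -> 64 -> 128) each time the resolution is halved.
features = nn.Sequential(
    vgg_stage(3, 32),
    vgg_stage(32, 64),
    vgg_stage(64, 128),
)
print(features(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 128, 8, 8])
```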

Additional architecture design principles have been introduced with the Inception (IN) architecture [476]. Here, the idea is to split a single convolutional layer into parts.

Consider the output feature map $x^l$ of a layer in a CNN. In an Inception layer, this output is fed into several different convolutional layers at the same time. The outputs of these layers are then concatenated into a single output. This is motivated by the idea of extracting features at different scales from the same input. The different layers within an Inception layer can be convolutions with different kernel sizes or pooling operations with stride $r_j = 1$. An example of an Inception layer is shown in Figure 3.4. This structure was introduced in an improved iteration of the Inception architecture [475].

Note that a convolutional layer with $k_j = 1$ for the spatial dimensions is placed in front of the convolutional units. This serves the purpose of dimensionality reduction in the number of feature maps of the original input and ensures that the computational effort does not become too large. Moreover, one branch contains two consecutive layers with spatial kernel sizes $k_j = 3$, following the idea of VGG [450].
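The Inception layer of Figure 3.4 can be sketched as follows, assuming PyTorch; the branch widths are arbitrary assumptions, but the structure (1×1 bottlenecks, parallel 3×3 convolutions, a pooling branch, and channel-wise concatenation) follows the description above.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Inception-style layer: parallel branches concatenated along the channel axis."""
    def __init__(self, in_c, c1, c3, c33, cp):
        super().__init__()
        self.branch1 = nn.Conv2d(in_c, c1, kernel_size=1)          # 1x1 only
        self.branch3 = nn.Sequential(                               # 1x1 -> 3x3
            nn.Conv2d(in_c, c3, kernel_size=1),
            nn.Conv2d(c3, c3, kernel_size=3, padding=1))
        self.branch33 = nn.Sequential(                              # 1x1 -> 3x3 -> 3x3
            nn.Conv2d(in_c, c33, kernel_size=1),
            nn.Conv2d(c33, c33, kernel_size=3, padding=1),
            nn.Conv2d(c33, c33, kernel_size=3, padding=1))
        self.branch_pool = nn.Sequential(                           # pool -> 1x1
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_c, cp, kernel_size=1))

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch33(x), self.branch_pool(x)], dim=1)

block = InceptionBlock(64, c1=32, c3=32, c33=32, cp=16)
print(block(torch.randn(1, 64, 28, 28)).shape)  # torch.Size([1, 112, 28, 28])
```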



Fig. 3.5: Illustration of the ResNet principle. Left, two normal, consecutive convolutional layers are shown. Right, a typical ResNet block with two convolutional layers and a residual connection is shown.

Several Inception-like layers have been introduced so far. While working well in practice, a downside of this concept is the extensive manual architecture engineering that is required.

ResNets (RN) are another concept that is crucial for architecture design. He et al. [193] observed that there was a general trend towards deeper networks with more layers. However, there is a limit to the number of layers that is effective in a CNN model, and for deeper versions, test performance goes down. At some point, the degradation problem occurs: very deep networks show both a higher test and training error than shallower versions. Thus, the deep networks do not fit the data well, indicating that the optimization process does not work well. He et al. proposed residual connections inside the network as a solution. We assume that several convolutional layers should learn a target mapping $H$ at layer $l$. Instead, we can also learn a mapping

$$H(x^l) = F(x^l) + x^l. \qquad (3.15)$$

Here, the key assumption is that the mapping $F$ is easier to learn, as the network is only required to learn a residual, i.e., a small deviation from $x^l$, instead of a full mapping. The implementation of this idea can be realized using a skip connection, see Figure 3.5. In practice, residual connections have proven to be very effective, enabling very deep and higher-performing CNNs. Almost all modern CNN architectures utilize residual connections in some way.
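A minimal residual block in the sense of Equation 3.15 and Figure 3.5, assuming PyTorch; batch normalization and the channel projections used in actual ResNets are omitted to keep the sketch close to the figure.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions learning F(x^l); the skip connection adds x^l back."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.conv2(self.relu(self.conv1(x)))  # F(x^l)
        return self.relu(residual + x)                   # H(x^l) = F(x^l) + x^l

block = ResidualBlock(64)
x = torch.randn(1, 64, 32, 32)
print(block(x).shape)  # torch.Size([1, 64, 32, 32])
```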

In summary, CNNs generally consist of multiple convolutional layers, activation functions, pooling operations, and an output layer. A generic architecture overview is shown in Figure 3.6. Most modern architectures follow this concept and largely focus on improving the structure of the convolutional blocks.
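Putting the pieces together, the following sketch assembles a small generic CNN in the spirit of the architecture just described: a convolutional feature extraction stage, global average pooling, and a fully-connected output layer. All layer sizes are assumptions chosen only for illustration, again assuming PyTorch.

```python
import torch
import torch.nn as nn

def conv_block(in_c, out_c):
    """Convolution, nonlinearity, and 2x2 max pooling."""
    return nn.Sequential(
        nn.Conv2d(in_c, out_c, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2))

generic_cnn = nn.Sequential(
    conv_block(3, 32),            # feature extraction stage
    conv_block(32, 64),
    conv_block(64, 128),
    nn.AdaptiveAvgPool2d(1),      # global average pooling
    nn.Flatten(),
    nn.Linear(128, 10))           # output layer producing a prediction vector

print(generic_cnn(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 10])
```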