
4 Deep neural networks

4.2 Convolutional layer

The convolutional layer is motivated by the fact that, in an image, the information of each pixel has a strong local correlation to neighboring pixels (e.g., edges are an important feature formed by local correlations). Since features can be present in several areas of an image, a filter needs to slide over the complete input data to extract them. The local correlations are utilized by convolving a small filter K with the input data. The filter often has a symmetric kernel size of k × k. Although the layer is called a convolutional layer, the cross-correlation is typically calculated because this omits kernel flipping. For a two-dimensional input matrix I and filter K, the two-dimensional cross-correlation is calculated as follows:

C(i, j) = (I ⋆ K)(i, j) = Σ_{m=−h}^{h} Σ_{n=−h}^{h} I(i + m, j + n) · K(m, n) .   (4.1)

Notably, we calculate a valid cross-correlation. This means that the calculation area is constrained to pixels (i, j) where the filter K ∈ ℝ^{k×k} lies fully within the input matrix I ∈ ℝ^{p×q}. Let h = ⌊k/2⌋, where ⌊·⌋ denotes integer division. Thus, we can define the calculation area with i ∈ {h, h+1, . . . , p−h} and j ∈ {h, h+1, . . . , q−h}. The parameters of the filters are learned during training of the neural network.
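The valid cross-correlation can be sketched in NumPy as follows. This is a minimal reference implementation under our own naming; it assumes a square k × k filter, and a production system would typically use an optimized routine such as scipy.signal.correlate2d with mode='valid'.

```python
import numpy as np

def cross_correlate_valid(I, K):
    """Valid 2-D cross-correlation: the filter K stays fully inside I."""
    p, q = I.shape
    k = K.shape[0]  # assumes a square k x k filter
    C = np.empty((p - k + 1, q - k + 1))
    for i in range(p - k + 1):
        for j in range(q - k + 1):
            # element-wise product of the k x k window with K, then sum
            C[i, j] = np.sum(I[i:i + k, j:j + k] * K)
    return C
```

Note that, unlike a true convolution, no kernel flipping takes place: the window is multiplied element-wise with K as-is.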

In the following section, we explain a so-called two-dimensional convolutional layer and provide an illustration of this layer in Figure 4.2. The feature map F_l ∈ ℝ^{w_l × h_l × d_l} is the output of the l-th convolutional layer with width w_l, height h_l, and depth d_l. While the width w_l and height h_l depend on the size of the input map F_{l−1}, the depth d_l is the number of filters a convolutional layer can learn during optimization. Moreover, the depth d_l is a hyperparameter that is often defined before training. Let v and a be the running indexes over the depths d_l and d_{l−1}, respectively. Thus, we can extend Equation (4.1) to the three-dimensional case:

F_l(i, j, v) = Σ_{a=1}^{d_{l−1}} Σ_{m=−h}^{h} Σ_{n=−h}^{h} F_{l−1}(i + m, j + n, a) · K_{l,v}(m, n, a) .   (4.2)
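The triple sum over the k × k window and the input depth can be sketched as follows. This is an illustrative, unoptimized loop under our own naming and shape conventions (one kernel of size k × k × d_{l−1} per output slice v, valid region, stride 1, one bias per filter); real frameworks fuse and vectorize this computation.

```python
import numpy as np

def conv_layer_valid(F_prev, filters, biases):
    """Feature map of one convolutional layer (valid, stride 1).

    F_prev:  input map,            shape (w, h, d_prev)
    filters: learned kernels,      shape (d_l, k, k, d_prev)
    biases:  one bias per filter,  shape (d_l,)
    """
    w, h, d_prev = F_prev.shape
    d_l, k = filters.shape[0], filters.shape[1]
    F = np.empty((w - k + 1, h - k + 1, d_l))
    for v in range(d_l):  # slice/filter index v
        for i in range(w - k + 1):
            for j in range(h - k + 1):
                # sum over the k x k window and all input channels a
                F[i, j, v] = np.sum(F_prev[i:i + k, j:j + k, :] * filters[v]) + biases[v]
    return F
```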

Figure 4.2: Illustration of a convolutional layer with stride s_l = 2 and padding p_l = 1. The color emphasizes the difference between each tensor.

In comparison to the fully-connected layers, it is easier to consider the neurons as structured in a matrix rather than a vector. The total number of neurons N in a convolutional layer equals the size of the feature map; therefore, N = w_l · h_l · d_l. Two key components are required to realize the convolution in a neural network and to reduce the number of parameters: the local receptive field and weight sharing.

Local receptive field: Each neuron of the l-th convolutional layer is only connected to a local area R_{l,i,j,v} in the (l−1)-th layer with the size k_l × k_l × d_{l−1}, where d_{l−1} is the depth of the input to the convolutional layer. This local area, or local receptive field, describes the size of the region in the input that contributes to the feature calculation. As such, each local receptive field can learn its own filter K_{l,i,j,v} with the same size as R_{l,i,j,v}. The displacement of each local receptive field in a convolutional layer is defined by the stride s_l ∈ ℕ.

Without weight sharing (which is explained next), each of the N neurons would have k_l · k_l · d_{l−1} + 1 parameters, and the convolutional layer would have N · (k_l · k_l · d_{l−1} + 1) parameters in total. Notably, one parameter is added due to the bias b of each neuron.

Weight sharing: Since the same feature can appear at multiple locations, the concept of weight sharing was proposed. This makes it unnecessary to learn the same feature extractor multiple times and reduces the number of parameters significantly. Weight sharing implies that all neurons belonging to the same slice v share the same filter K_{l,v}. Therefore, the depth d_l controls how many filters can be learned. This reduces the total number of parameters of the convolutional layer by a factor of w_l · h_l; hence, the layer only has d_l · (k_l · k_l · d_{l−1} + 1) parameters. In Figure 4.3, we provide a simple example of a convolutional layer with stride s_l = 2 and kernel size k_l = 2, K_l ∈ ℝ^{2×2×1}. To calculate the final results, we use the cross-correlation in Equation (4.1) and add the bias b. For example, in the top row
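The parameter savings from weight sharing can be checked with a short calculation. The concrete numbers below (5 × 5 kernels, input depth 3, 16 filters, a 28 × 28 output map) are illustrative choices of ours, not taken from the text:

```python
# Parameter counts with and without weight sharing.
# Illustrative numbers: 5x5 kernels, input depth 3, 16 filters, 28x28 output map.
w_l, h_l, d_l = 28, 28, 16
k_l, d_prev = 5, 3

per_neuron = k_l * k_l * d_prev + 1                 # +1 for the bias b
without_sharing = (w_l * h_l * d_l) * per_neuron    # N * (k*k*d_prev + 1)
with_sharing = d_l * per_neuron                     # d_l * (k*k*d_prev + 1)

# with_sharing is smaller by exactly the factor w_l * h_l
assert without_sharing == with_sharing * w_l * h_l
```

With these numbers, sharing shrinks the layer from 953,344 parameters to 1,216.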


[Figure 4.3 data: the (l−1)-th layer is the 4×4 matrix
  3   6  −5   7
 −4  −7  −2   5
  5   5   4  −5
 −3   1   0  −2
the filter K_l is the 2×2 matrix (2, 4; 1, 1), the bias b_l is 8, and the first two entries of the l-th layer are 27 and 29.]

Figure 4.3: Example of a valid cross-correlation calculation with stride s_l = 2 and without zero-padding. Only the first two steps are shown. First, the filter K_l (size: 2×2) is applied to the top left area of the (l−1)-th layer (i.e., the light red area). Thereafter, the bias b_l is added, and the result is the top left pixel of the l-th layer (i.e., the light red pixel). Then, the filter is shifted by the stride s_l to the right, and the same calculation is performed again. This calculation is shown as the light green area and pixels.

of Figure 4.3, we calculate the result for the first cell as follows:

F_l(1, 1) = (F_{l−1} ⋆ K_l)(1, 1) + b_l = (3 · 2) + (6 · 4) + (−4 · 1) + (−7 · 1) + 8 = 27 .

The local receptive field must be fully connected to the input. Thus, the size of the feature map F_l can be calculated by:
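The remaining cells follow the same pattern: shift the window by the stride, multiply element-wise with K_l, sum, and add the bias. The full stride-2 pass over the Figure 4.3 example can be checked with a short NumPy sketch (the variable names are ours):

```python
import numpy as np

# The example from Figure 4.3: 4x4 input, 2x2 filter, bias 8, stride 2, no padding.
F_prev = np.array([[ 3,  6, -5,  7],
                   [-4, -7, -2,  5],
                   [ 5,  5,  4, -5],
                   [-3,  1,  0, -2]])
K = np.array([[2, 4],
              [1, 1]])
b, s, k = 8, 2, 2

out = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        # window shifted by the stride s in each direction
        win = F_prev[i * s:i * s + k, j * s:j * s + k]
        out[i, j] = np.sum(win * K) + b
```

The top row of `out` reproduces the two values shown in Figure 4.3: 27 and 29.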

h_l = (h_{l−1} − k_l)/s_l + 1   (4.3)
w_l = (w_{l−1} − k_l)/s_l + 1 .   (4.4)

This would always reduce the size of the input tensor by at least k_l − 1. Therefore, padding was introduced. Padding artificially increases the size of the (l−1)-th layer by adding a border around the input tensor. The size of the border is defined by p_l ∈ ℕ, and the added border typically contains only zeros. Hence, padding is also known as zero-padding. In Figure 4.4, we illustrate zero-padding with padding p_l = 1 and stride s_l = 2 for an example matrix. The width and height are then calculated as follows:

h_l = (h_{l−1} + 2p_l − k_l)/s_l + 1   (4.5)
w_l = (w_{l−1} + 2p_l − k_l)/s_l + 1 .   (4.6)

Figure 4.4: Example of a valid cross-correlation calculation with zero-padding p_l = 1 and stride s_l = 2. The zeros with a light green background are added because of the zero-padding.
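For p_l = 0, the padded size formulas reduce to the unpadded ones. A small helper (our naming; integer division assumes the stride divides the padded extent evenly) makes the calculation explicit:

```python
def conv_output_size(n, k, s, p=0):
    """Output width/height of a valid cross-correlation with
    kernel size k, stride s, and zero-padding p."""
    return (n + 2 * p - k) // s + 1

# Figure 4.4 setting: 4x4 input, k=2, s=2, p=1 -> output size 3
# Figure 4.3 setting: 4x4 input, k=2, s=2, p=0 -> output size 2
```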
