

4.4 Convolutional Neural Network

A Convolutional Neural Network (CNN) is composed of several layers of convolution and pooling operations stacked together. These two operations simulate the responses of the simple and complex cell layers discovered in visual cortex area V1 by Hubel and Wiesel [139]. In a CNN, the abstraction of the simple cells is represented by the convolution operations, which use local filters to compute high-level features from the input stimuli. The pooling operation creates a higher level of abstraction, analogous to the complex cells, and increases the spatial invariance of the stimuli by pooling simple cell units of the same receptive field in previous layers.

Every layer of a CNN applies a different set of filters, which increases the capability of the simple cells to extract features. Each filter is trained to extract a different representation of the same receptive field, generating different outputs, also known as feature maps, for each layer. The complex cells pool units of receptive fields in each feature map. These feature maps are passed to the next layer of the network, and because of the complex cells' pooling mechanism, each layer applies a filter to a receptive field which represents a larger region of the initial stimuli. This means that the first layer outputs feature maps which represent one region of the initial stimuli, while deeper layers represent larger regions. At the end, the output feature map contains the representation of the entire stimuli.

Each set of filters in the simple cell layers acts on a receptive field in the input stimuli. The activation of each unit $u_{x,y}^{n,c}$ at position $(x,y)$ of the $n$th feature map in the $c$th layer is given by

$$u_{x,y}^{n,c} = \max(b_n^c + S,\, 0), \qquad (4.11)$$

where $\max(\cdot, 0)$ represents the rectified linear function, which was shown to be more suitable than saturating non-linear functions for training deep neural architectures, as discussed by [108]. $b_n^c$ is the bias for the $n$th feature map of the $c$th layer, and $S$ is defined as

$$S = \sum_{m=1}^{M} \sum_{h=1}^{H} \sum_{w=1}^{W} w_{hw}^{(c-1)m}\, u_{(x+h)(y+w)}^{(c-1)m}, \qquad (4.12)$$

where $m$ indexes over the set of filters $M$ in the current layer $c$, which is connected to the input stimuli on the previous layer $(c-1)$. The weight of the connection between the unit $u_{x,y}^{n,c}$ and the receptive field with height $H$ and width $W$ of the previous layer $c-1$ is $w_{hw}^{(c-1)m}$. Figure 4.4 illustrates this operation.
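As a concrete reading of Eqs. (4.11)-(4.12), the following sketch computes one feature map by sliding the filters over every receptive field and applying the rectified linear function to the biased sum. This is a minimal NumPy illustration, not the implementation used in this work; the function name and array shapes are assumptions.

```python
import numpy as np

def conv_relu(stimuli, weights, bias):
    """One feature map of Eqs. (4.11)-(4.12): ReLU over a biased sum of
    local filter responses. `stimuli` has shape (M, X, Y) -- M input maps;
    `weights` has shape (M, H, W); `bias` is the scalar b_n^c."""
    M, H, W = weights.shape
    _, X, Y = stimuli.shape
    out = np.zeros((X - H + 1, Y - W + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            # S: sum over input maps m and the H x W receptive field at (x, y)
            S = np.sum(weights * stimuli[:, x:x + H, y:y + W])
            out[x, y] = max(bias + S, 0.0)  # rectified linear activation
    return out

rng = np.random.default_rng(0)
fmap = conv_relu(rng.standard_normal((3, 8, 8)), rng.standard_normal((3, 3, 3)), 0.1)
print(fmap.shape)  # (6, 6)
```

With an 8x8 input and a 3x3 receptive field, the resulting feature map is 6x6, and every activation is non-negative because of the rectification.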

A complex cell is connected to a receptive field in the previous simple cell layer, reducing the dimension of the feature maps. Each complex cell outputs the maximum activation of its receptive field $u(x, y)$ and is defined as:


Figure 4.4: Illustration of the convolution process. Each neuron $u$ is connected to a receptive field in the input stimuli by a set of weights $w$, which represent the filters, and is affected by a bias $b$, which is the same for all the filters in the same layer. Each filter produces a feature map, composed of several neurons, which is then passed to the next layer.

$$a_j = \max_{n \times n}\big(u^{n,c}(x, y)\big), \qquad (4.13)$$

where $u^{n,c}$ is the output of the simple cell. In this function, the complex cell computes the maximum activation over an $n \times n$ receptive field at $(x, y)$. The maximum operation down-samples the feature map while maintaining the simple cell structure. Figure 4.5 illustrates this operation.

Figure 4.5: Illustration of the pooling process. Each unit of the complex cell is connected to a receptive field of the feature map, and applies a maximum operation, resulting in one activation per receptive field.
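The pooling operation of Eq. (4.13) can be sketched as follows; this is a minimal NumPy illustration under the assumption of non-overlapping $n \times n$ receptive fields, with the function name chosen for this example.

```python
import numpy as np

def max_pool(fmap, n):
    """Eq. (4.13): each complex cell outputs the maximum activation of an
    n x n receptive field, down-sampling the feature map by a factor of n."""
    X, Y = fmap.shape
    out = np.zeros((X // n, Y // n))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i * n:(i + 1) * n, j * n:(j + 1) * n].max()
    return out

fmap = np.arange(16.0).reshape(4, 4)
pooled = max_pool(fmap, 2)
print(pooled)  # [[ 5.  7.]
               #  [13. 15.]]
```

The 4x4 feature map is reduced to 2x2, each output unit keeping only the strongest activation in its receptive field.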

4.4.1 Cubic Receptive Fields

In a CNN, each filter is applied to a single instance of the stimuli and extracts features of a certain region. This works well for an individual stimulus, but not when a sequence dependency is necessary, as in the cases of gestures, speech or even emotion expressions. In these cases, if the filter extracts the same features in each snapshot of the sequence, it will not take into consideration that a hand is moving in one direction, or that a smile is being displayed.

To introduce sequence dependency, cubic receptive fields are used [146]. In a cubic receptive field, the value of each unit $(x, y, z)$ at the $n$th filter map in the $c$th layer is defined as:

$$u_{x,y,z}^{n,c} = \max(b_n^c + S_3,\, 0), \qquad (4.14)$$

where $\max(\cdot, 0)$ represents the rectified linear function, $b_n^c$ is the bias for the $n$th filter map of the $c$th layer, and $S_3$ is defined as

$$S_3 = \sum_{m} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{r=1}^{R} w_{hwr}^{(c-1)m}\, u_{(x+h)(y+w)(z+r)}^{(c-1)m}, \qquad (4.15)$$

where $m$ indexes over the set of feature maps in the $(c-1)$ layer connected to the current layer $c$. The weight of the connection between the unit $(h, w, r)$ and a receptive field connected to the previous layer $(c-1)$ and the filter map $m$ is $w_{hwr}^{(c-1)m}$. $H$ and $W$ are the height and width of the receptive field, $z$ indexes each stimulus, and $R$ is the number of stimuli stacked together, representing the new dimension of the receptive field.
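Extending the earlier convolution sketch, Eqs. (4.14)-(4.15) add a third summation over $R$ stacked stimuli; a minimal NumPy reading (function name and shapes assumed for illustration) is:

```python
import numpy as np

def cubic_conv_relu(stimuli, weights, bias):
    """Eqs. (4.14)-(4.15): the receptive field also spans R consecutive
    stimuli. `stimuli`: (M, X, Y, Z) -- M maps over Z stacked frames;
    `weights`: (M, H, W, R); `bias`: the scalar b_n^c."""
    M, H, W, R = weights.shape
    _, X, Y, Z = stimuli.shape
    out = np.zeros((X - H + 1, Y - W + 1, Z - R + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                # S3: sum over maps m, spatial offsets (h, w) and frames r
                S3 = np.sum(weights * stimuli[:, x:x + H, y:y + W, z:z + R])
                out[x, y, z] = max(bias + S3, 0.0)
    return out

rng = np.random.default_rng(1)
out = cubic_conv_relu(rng.standard_normal((2, 6, 6, 5)),
                      rng.standard_normal((2, 3, 3, 3)), 0.0)
print(out.shape)  # (4, 4, 3)
```

Because each filter spans $R = 3$ of the 5 stacked stimuli, the output retains a reduced temporal dimension of 3, so motion across consecutive snapshots can influence the activation.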

Figure 4.6: Illustration of the cubic convolution process. Different from the common convolution, each neuron $u$ is connected to a receptive field in all of the stimuli at the same time. This way, each neuron has $R$ filters represented by the weights $w$, where $R$ is the number of input stimuli.

4.4.2 Shunting Inhibition

To learn general features, several layers of simple and complex cells are necessary, which leads to a large number of parameters to be trained. Combined with the large amount of data usually required for the filters to learn general representations, this is a major problem shared among deep neural architectures. To reduce the need for a deeper network, we introduce the use of shunting inhibitory fields [99], which improve the efficiency of the filters in learning complex patterns.

Shunting inhibitory neurons are neurophysiologically plausible mechanisms that are present in several visual and cognitive functions [113]. When applied to complex cells, shunting neurons can result in filters which are more robust to geometric distortions, meaning that the filters learn more high-level features. Each shunting neuron $S_{xy}^{nc}$ at the position $(x, y)$ of the $n$th receptive field in the $c$th layer is activated as:

$$S_{xy}^{nc} = \frac{u_{xy}^{nc}}{a_{nc} + I_{xy}^{nc}}, \qquad (4.16)$$

where $u_{xy}^{nc}$ is the activation of the common unit in the same position and $I_{xy}^{nc}$ is the activation of the inhibitory neuron. The weights of each inhibitory neuron are trained with backpropagation. The passive decay term $a_{nc}$ is a defined parameter and is the same for the whole shunting inhibitory field. Figure 4.7 illustrates shunting neurons applied to a complex cell layer.

Figure 4.7: Illustration of the shunting inhibitory neuron in complex cells. Each neuron $u$ has an inhibitory neuron $I$ attached to it. Each inhibitory neuron has its own set of weights, which connect the inhibitory neuron to the common neuron, and a passive decay $a$, which is the same for all the neurons in a layer.
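The element-wise division of Eq. (4.16) can be sketched as follows. This is an illustrative NumPy fragment only: in the actual model the inhibitory activations $I$ are produced by trained weights, whereas here they are passed in directly, and the function name is an assumption.

```python
import numpy as np

def shunting(u, inhibitory, a):
    """Eq. (4.16): divide each common activation u by the passive decay a
    plus the matching inhibitory activation I (element-wise)."""
    return u / (a + inhibitory)

u = np.array([[2.0, 4.0],
              [1.0, 0.0]])
I = np.array([[1.0, 1.0],
              [3.0, 1.0]])
print(shunting(u, I, 1.0))  # [[1.   2.  ]
                            #  [0.25 0.  ]]
```

Units with strong inhibitory input are suppressed relative to their common activation, which is how the shunting field sharpens the filter responses.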

The concept behind the shunting neurons is that they specialize the filters of a layer. This creates a problem when they are applied to filters which extract low-level features, such as edges and contours: the shunting neurons specialize these filters, causing a loss in the generalization of the low-level features. However, when applied to deeper layers, the shunting neurons can enhance the capability of the filters to extract strong high-level representations, which could otherwise only be achieved by the use of a deeper network.

4.4.3 Inner Representation Visualization in CNNs

CNNs have been successfully used in several domains. However, most of the work with CNNs does not explain why the model is so successful. As CNNs are neural networks that learn a representation of the input data, knowledge about what the network learns can help us to understand why these models perform so well in different tasks. The usual method to evaluate the learned representations of neural networks is the observation of the weight matrices, which is not suited for CNNs. Each filter in the convolution layers learns to detect certain patterns in regions of the input stimuli, and because of the pooling operations, the deeper layers learn patterns which represent a far larger region of the input. This means that the observation of the filters does not give us a reliable way to evaluate the knowledge of the network.

Zeiler and Fergus [306] proposed the deconvolutional process, which helps to visualize the knowledge of a CNN. In their method, they backpropagate the activation of each neuron to an input, which helps to visualize which part of the input the neurons of the network are activated for. This way, we can determine regions of neurons that activated for similar patterns, for example, neurons that activate for the mouth and others for the eyes.

In the deconvolution process, to visualize the activation of a neuron $a$ in layer $l$ ($a_l$), an input is fed to the network and the signal is forwarded. Afterward, the activation of every neuron in layer $l$, except for $a$, is set to zero. Then, each convolution and pooling operation of each layer is reversed. The reverse of the convolution, named filtering, is done by flipping the filters horizontally and vertically. The filtering process can be defined as

$$S_f = \sum_{m=1}^{M} \sum_{h=1}^{H} \sum_{w=1}^{W} wf_{hw}^{(c-1)m}\, u_{(x+h)(y+w)}^{(c-1)m}, \qquad (4.17)$$

where $wf_{hw}^{(c-1)m}$ represents the flipped filters.
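The only difference between Eq. (4.17) and the forward convolution of Eq. (4.12) is the flipping of each filter; a minimal NumPy sketch of that step (the function name is an assumption) is:

```python
import numpy as np

def flip_filters(weights):
    """Filtering step of Eq. (4.17): reverse each H x W filter horizontally
    and vertically; the flipped filters are then used in place of w in
    the forward convolution. `weights`: (M, H, W)."""
    return weights[:, ::-1, ::-1]

w = np.arange(4.0).reshape(1, 2, 2)  # [[0, 1], [2, 3]]
print(flip_filters(w)[0])  # [[3. 2.]
                           #  [1. 0.]]
```

Flipping both axes turns the convolution into its transpose, which is what routes a neuron's activation back toward the input.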

The reverse of the pooling operation is known as unpooling. Zeiler and Fergus [306] show that it is necessary to consider the positions of the maximum values in order to improve the quality of the visualizations, so these positions are stored during the forward pass. During the unpooling, the values which are not the maximum are set to zero.
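The stored-position mechanism above can be sketched as a pooling pass that records each argmax, paired with an unpooling pass that restores the maxima and zeroes everything else. This is an illustrative NumPy fragment; the function names and the dictionary used to store positions are assumptions for this example.

```python
import numpy as np

def max_pool_with_indices(fmap, n):
    """Forward pooling that also records the argmax position of each n x n
    receptive field, as required for the unpooling step."""
    out = np.zeros((fmap.shape[0] // n, fmap.shape[1] // n))
    idx = {}
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = fmap[i * n:(i + 1) * n, j * n:(j + 1) * n]
            r, c = np.unravel_index(patch.argmax(), patch.shape)
            idx[(i, j)] = (i * n + r, j * n + c)  # position in the full map
            out[i, j] = patch[r, c]
    return out, idx

def unpool(pooled, idx, shape):
    """Reverse of pooling: each maximum returns to its stored position;
    every other unit is set to zero."""
    full = np.zeros(shape)
    for (i, j), (r, c) in idx.items():
        full[r, c] = pooled[i, j]
    return full

fmap = np.array([[1.0, 3.0],
                 [2.0, 0.0]])
pooled, idx = max_pool_with_indices(fmap, 2)
print(unpool(pooled, idx, fmap.shape))  # [[0. 3.]
                                        #  [0. 0.]]
```

The reconstruction keeps only the unit that actually produced each pooled activation, which is what makes the resulting visualizations spatially precise.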

Backpropagating the activation of a neuron will cause the reconstruction of the input in a way that only the region which activates this neuron is visualized. Our CNNs use rectified linear units, which means that the neurons which are activated output positive values, and zero represents no activation. That means that in our reconstructed inputs, bright pixels indicate the importance of that specific region for the activation of the neuron.

In a CNN, each filter tends to learn similar patterns, which indicates that neurons in the same filter will be activated for resembling patterns. Also, each neuron can be activated for very specific patterns, which are not high-level enough for subjective analysis. To improve the quality of our analysis, we create visualizations for all neurons in one filter by averaging the activation of each neuron in that filter. This allows us to cluster the knowledge of the network in filters, meaning that we can identify whether the network has specialized filters rather than specialized neurons. Also, visualizing filters on all layers helps us to understand how the network builds the representation and helps us to demonstrate