

Fig. 3.5: Illustration of the ResNet principle. Left, two normal, consecutive convolutional layers are shown. Right, a typical ResNet block with two convolutional layers and a residual connection is shown.

layers have been introduced so far. While working well in practice, a downside of this concept is the extensive manual architecture engineering that is required.

ResNets (RN) are another concept that is crucial for architecture design. He et al. [193] observed that there was a general trend towards deeper networks with more layers. However, there is a limit to the number of layers that is effective in a CNN model, and for deeper versions, test performance goes down. At some point, the degradation problem occurs: very deep networks show both a higher test and training error than shallower versions. Thus, the deep networks do not fit the data well, indicating that the optimization process does not work well. He et al. proposed residual connections inside the network as a solution. We assume that several convolutional layers should learn a target mapping H at layer l. Instead, we can also learn a mapping:

H(x_l) = F(x_l) + x_l.    (3.15)

Here, the key assumption is that the mapping F is easier to learn, as the network only needs to learn a residual, or some small deviation from x_l, instead of a full mapping. The implementation of this idea can be realized using a skip connection, see Figure 3.5. In practice, residual connections have proven to be very effective, enabling very deep and higher-performing CNNs. Almost all modern CNN architectures utilize residual connections in some way.
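As an illustration of Equation 3.15, the following is a minimal sketch of a residual block in PyTorch; the choice of two 3×3 convolutions with batch normalization and ReLU activations is an assumption for illustration, not the exact configuration used in [193].

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block computing H(x) = F(x) + x (cf. Equation 3.15)."""

    def __init__(self, channels: int):
        super().__init__()
        # F(x): two convolutional layers, here with batch normalization (an assumption).
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + x                  # the residual (skip) connection
        return self.relu(out)
```

Because the skip connection simply adds the input back, the layers inside the block only have to model a small deviation from x_l, which is what makes deep stacks of such blocks trainable.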

In summary, CNNs generally consist of multiple convolutional layers, activation functions, pooling operations, and an output layer. A generic architecture overview is shown in Figure 3.6. Most modern architectures follow this concept and largely focus on improving the structure of the convolutional blocks.

3.3.3 Recurrent Neural Networks

As discussed in the previous section, CNNs can also be used to process temporal dimensions using convolution operations. As an alternative, recurrent neural networks (RNNs) have been introduced, which are specifically designed for processing sequential data. Processing sequences of examples is necessary for tasks where the learning target cannot be derived from a single example, for example, when trying to forecast the movement of an object.

Fig. 3.6: An example of a full CNN architecture for a 2D input image of size 224×224. The input image (224×224×3) is processed by an initial convolutional block producing a 56×56×64 feature map, followed by further convolutional blocks with feature maps of size 28×28×128, 14×14×256, and 7×7×512, global average pooling (GAP), and the output layer. The initial convolutional block typically consists of several convolutional layers. The other convolutional blocks can be ResNet blocks, Inception blocks, or other variants.
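To make the generic architecture of Figure 3.6 concrete, the following PyTorch sketch stacks an initial convolutional block, three further convolutional blocks, global average pooling, and a linear output layer. The specific layer configuration and the number of output classes are assumptions for illustration, and the plain blocks could be swapped for ResNet or Inception blocks.

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """A plain convolutional block; could be replaced by a ResNet or Inception block."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),  # halves spatial size
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

class GenericCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Initial conv block: 224x224x3 -> 56x56x64 (assumed stride-2 convolution plus pooling).
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        # Further conv blocks: 56x56x64 -> 28x28x128 -> 14x14x256 -> 7x7x512.
        self.blocks = nn.Sequential(
            conv_block(64, 128),
            conv_block(128, 256),
            conv_block(256, 512),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)        # global average pooling (GAP)
        self.head = nn.Linear(512, num_classes)   # output layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.blocks(self.stem(x))
        x = self.gap(x).flatten(1)
        return self.head(x)

# Example: a 224x224 RGB image produces class logits.
logits = GenericCNN()(torch.randn(1, 3, 224, 224))
```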

One approach could be to consider all previous examples as the input for a neural network. However, this approach can be very inefficient due to the large number of model parameters required, similar to processing an image with an FC-NN. Therefore, RNNs have been developed with the core idea of sharing parameters for several computations within a sequence. Often, sequences represent temporal data where each example within the sequence corresponds to a time point t_i. While most applications involve temporal sequences, the only necessary assumption for RNN applications is that the sequence follows some order [176].

In the following, we will assume that some type of time series is being processed. In general, an RNN processes a sequence x_i = {x_{t_i-n_t}, ..., x_{t_i}} of length n_t. We refer to t_i as the current time step, which comes with a history of n_t − 1 predecessors. At a time step t_i, we compute

h_{t_i} = f_M(h_{t_i-1}, x_{t_i}, w_M)    (3.16)

where h_{t_i} is the RNN's internal state at time step t_i and w_M are the RNN's parameters.

The RNN computes the next state h_{t_i} using the state of the previous time step h_{t_i-1}, the current sequence example x_{t_i}, its weights w_M, and some function f_M. In a typical RNN application, we try to predict future values from a history of data x_i. Here, h_{t_i} can be interpreted as a compressed representation of the previous information contained in x_i. As x_i is often a long sequence with many examples, h_{t_i} is chosen to be limited in size such that only the most relevant information for the task at hand can be stored.

The transforming function f_M can be chosen in different ways. In a typical RNN, feature vectors x_{t_i} and states h_{t_i-1} are transformed using matrix multiplications, similar to FC-NNs. To produce usable outputs o_{t_i}, an additional transformation of h_{t_i} is performed at each time step. Thus, for an example RNN we compute


Fig. 3.7: An example of an RNN, both in its single-node representation (left) and its finite unrolled representation (right). For the single-node representation, the black box at the recurrent connection indicates a delay of one time step. Note that for the unrolled representation, the computations at all time steps share parameters.

a_{t_i} = W_h h_{t_i-1} + W_x x_{t_i}    (3.17)
h_{t_i} = g_a(a_{t_i})    (3.18)
o_{t_i} = W_o h_{t_i}    (3.19)

where g_a is an activation function and the matrices W_h, W_x, and W_o are the learnable model parameters w_M. These computations can be visualized in two different ways, see Figure 3.7.
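As a small illustration, Equations 3.17 to 3.19 can be written out for a single sequence as follows; the dimensionalities and the choice of tanh for the activation g_a are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_t, d_x, d_h, d_o = 10, 4, 8, 2          # sequence length, input, state, and output sizes

# Learnable parameters w_M, shared across all time steps.
W_h = rng.normal(scale=0.1, size=(d_h, d_h))
W_x = rng.normal(scale=0.1, size=(d_h, d_x))
W_o = rng.normal(scale=0.1, size=(d_o, d_h))

x = rng.normal(size=(n_t, d_x))           # sequence x_i = {x_{t_i-n_t}, ..., x_{t_i}}
h = np.zeros(d_h)                          # initial state

outputs = []
for t in range(n_t):
    a = W_h @ h + W_x @ x[t]              # Equation 3.17
    h = np.tanh(a)                         # Equation 3.18, with g_a = tanh
    outputs.append(W_o @ h)                # Equation 3.19
```

Note that the same three weight matrices are reused at every time step, which is exactly the parameter sharing described above.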

One way is to represent the computation by a single node with a recurrent connection.

Another way is to unfold the RNN into a computation graph where the computations for a finite set of time steps are depicted. While the single node is closer to a physical representation of an RNN, for example, in a biological neuron, the unfolded version explicitly shows all the computations that are involved. The unfolded version is limited to a finite number of time steps, which reflects realistic computations. In practice, an RNN is trained on truncated sequences using the unfolded graph representation.

Note that Equation 3.19 is just one option for RNN design. State, input example, and output can be connected in arbitrary ways, which allows for designing specialized RNNs for different applications. Furthermore, matrix multiplications are not necessarily required. Some RNN extensions are designed to deal with image-like data by performing convolution operations instead of matrix multiplications [556].

One important aspect of RNNs is the design of their output. Following Equation 3.19, we can produce a prediction at every time step. However, some tasks do not require an output at every time step, for example, when predicting only a current or future estimate at the final time step t_i of an unfolded RNN. Here, previous examples only serve as a source of information for the current output, and we are not interested in the outputs of previous time steps. This is important for training RNNs, as we then only create a loss signal for learning at the RNN's final time step.
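As a sketch of such a many-to-one setup, the following PyTorch snippet keeps only the output of the final time step and computes the loss there; the use of nn.RNN, the tensor sizes, and the mean squared error loss are assumptions for illustration.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)

x = torch.randn(16, 10, 4)          # batch of 16 sequences with 10 time steps each
target = torch.randn(16, 1)         # one target per sequence

out, _ = rnn(x)                     # hidden states for all time steps: (16, 10, 8)
pred = head(out[:, -1, :])          # keep only the final time step t_i
loss = nn.functional.mse_loss(pred, target)   # loss signal only at the final step
loss.backward()
```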

A major shortcoming of RNNs is their ability to deal with very long sequences and long-term dependencies. Assuming that a prediction at the final time point t_i depends on an input example from the past, transferring this information through multiple time steps in the state h_{t_i} can be difficult.


Fig. 3.8: An example of an LSTM cell [200]. Blue blocks indicate a neural network layer with learnable weights. White blocks indicate element-wise operations. σ(x) refers to a sigmoid activation function.

The state h_{t_i} undergoes a transformation in which its values are multiplied by the same weight matrix multiple times. If the values in h_{t_i} are smaller than 1, the chained multiplications will eventually pull them towards zero; if they are larger than 1, they will explode. While exploding values can be avoided with sigmoid-like activation functions, the values will saturate instead. In the case of saturation, the vanishing gradient problem occurs, which prevents RNNs from training effectively.

The problem of long-term dependencies in long sequences has been extensively studied, for example, by Bengio et al. [41] and Hochreiter et al. [199].
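The effect can be illustrated numerically: repeatedly multiplying a state by the same weight matrix drives its values towards zero or lets them explode, depending on the scale of the matrix. The matrix sizes and scale factors below are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=8)                       # some initial state

for scale in (0.5, 1.5):
    W = scale * np.eye(8)                    # the same weight matrix applied at every step
    state = h.copy()
    for step in range(50):                   # 50 time steps of repeated multiplication
        state = W @ state
    print(scale, np.abs(state).max())        # ~1e-16 for scale 0.5, ~1e8 for scale 1.5
```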

One approach for overcoming this problem is gated RNNs. Here, the idea is to create connections between time steps that do not result in vanishing or exploding gradients.

Hochreiter and Schmidhuber [200] implemented this idea using self-loops within the RNN block, which allows the gradient to flow within the network for a long time. This RNN model is termed the long short-term memory (LSTM) cell, see Figure 3.8. A core part of the LSTM is its cell state c_{t_i}, representing the LSTM's self-loop. The cell state can remain the same for a long time, as it is only manipulated by element-wise multiplications or additions. Changes to the cell state, for example, for adding or removing information, can be introduced through gates.

Gates are neural network layers with a sigmoid-like activation function g_a. The first gate is the forget gate, which controls what kind of information is kept in the cell state and what is removed. If the gate's activations are zero, the element-wise multiplication will remove the corresponding values from the cell state. If the activations are one, the values pass through unchanged. The forget gate computes

fg_{t_i} = σ(W_{fh} h_{t_i-1} + W_{fx} x_{t_i})    (3.20)

where we consider a similar input and state transformation as in Equation 3.19. Next, new information can be added to the cell state with the input gate. First, a new candidate cell state ĉ_{t_i} is computed using another neural network layer. Typically, this layer uses a tanh activation function, although any activation function can be used here, as the output values do not directly affect the cell state c_{t_i}. Next, another gating layer determines which parts of the candidate state are kept for addition to the previous cell state c_{t_i-1}. Thus, for the input gate, we compute



Fig. 3.9: An example of a GRU cell [87]. Blue blocks indicate a neural network layer with learnable weights. White blocks indicate element-wise operations. σ(x) refers to a sigmoid activation function.

in_{t_i} = σ(W_{ih} h_{t_i-1} + W_{ix} x_{t_i})    (3.21)
ĉ_{t_i} = tanh(W_{ch} h_{t_i-1} + W_{cx} x_{t_i})    (3.22)

Thus, the new cell state c_{t_i} is computed by

c_{t_i} = fg_{t_i} c_{t_i-1} + in_{t_i} ĉ_{t_i}    (3.23)

where the multiplications are element-wise. Finally, the LSTM's output needs to be computed. Here, the output is the new cell state c_{t_i} transformed with a tanh activation. An output gate then determines which values of the transformed cell state are kept:

o_{t_i} = σ(W_{oh} h_{t_i-1} + W_{ox} x_{t_i})    (3.24)
h_{t_i} = o_{t_i} tanh(c_{t_i})    (3.25)
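Putting Equations 3.20 to 3.25 together, a single LSTM step can be sketched as follows; the dimensionalities are assumptions for illustration, and bias terms are omitted as in the equations above.

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following Equations 3.20-3.25 (biases omitted)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    fg = sigmoid(p["W_fh"] @ h_prev + p["W_fx"] @ x_t)        # forget gate (3.20)
    in_g = sigmoid(p["W_ih"] @ h_prev + p["W_ix"] @ x_t)      # input gate (3.21)
    c_hat = np.tanh(p["W_ch"] @ h_prev + p["W_cx"] @ x_t)     # candidate cell state (3.22)
    c = fg * c_prev + in_g * c_hat                            # new cell state (3.23)
    o = sigmoid(p["W_oh"] @ h_prev + p["W_ox"] @ x_t)         # output gate (3.24)
    h = o * np.tanh(c)                                        # new hidden state (3.25)
    return h, c

# Example usage with assumed sizes (input dimension 4, state dimension 8).
rng = np.random.default_rng(0)
d_x, d_h = 4, 8
params = {name: rng.normal(scale=0.1, size=(d_h, d_h if name.endswith("h") else d_x))
          for name in ("W_fh", "W_fx", "W_ih", "W_ix", "W_ch", "W_cx", "W_oh", "W_ox")}
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(10, d_x)):                        # a sequence of 10 time steps
    h, c = lstm_step(x_t, h, c, params)
```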

Many different LSTM variants have been introduced in which the gating process is slightly modified. A very popular modification is the gated recurrent unit (GRU) [87]. GRUs are slightly more efficient than standard LSTMs, as they combine the input and forget gate into a single update gate. Furthermore, the cell state c_{t_i} and the hidden state h_{t_i} are fused into a single representation. The GRU performs the following computations:

z_{t_i} = σ(W_{zh} h_{t_i-1} + W_{zx} x_{t_i})    (3.26)
r_{t_i} = σ(W_{rh} h_{t_i-1} + W_{rx} x_{t_i})    (3.27)
ĥ_{t_i} = tanh(W_{hh}(r_{t_i} h_{t_i-1}) + W_{hx} x_{t_i})    (3.28)
h_{t_i} = (1 − z_{t_i}) h_{t_i-1} + z_{t_i} ĥ_{t_i}    (3.29)

Here, z_{t_i} is the update gate, r_{t_i} is the reset gate, and the multiplications between gate activations and states are element-wise. A visual interpretation of a GRU cell is shown in Figure 3.9. Note that the GRU only uses three instead of four neural network layers. Also, the fused hidden representation reduces the model's memory footprint. These advantages have made GRUs a popular alternative to LSTMs in many tasks.
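Analogously, a single GRU step following Equations 3.26 to 3.29 could be sketched as below; as before, the dimensionalities are assumed and bias terms are omitted.

```python
import numpy as np

def gru_step(x_t, h_prev, p):
    """One GRU step following Equations 3.26-3.29 (biases omitted)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(p["W_zh"] @ h_prev + p["W_zx"] @ x_t)             # update gate (3.26)
    r = sigmoid(p["W_rh"] @ h_prev + p["W_rx"] @ x_t)             # reset gate (3.27)
    h_hat = np.tanh(p["W_hh"] @ (r * h_prev) + p["W_hx"] @ x_t)   # candidate state (3.28)
    return (1.0 - z) * h_prev + z * h_hat                          # new state (3.29)

# Example usage: only a single state vector h is carried between time steps.
rng = np.random.default_rng(0)
d_x, d_h = 4, 8
params = {name: rng.normal(scale=0.1, size=(d_h, d_h if name.endswith("h") else d_x))
          for name in ("W_zh", "W_zx", "W_rh", "W_rx", "W_hh", "W_hx")}
h = np.zeros(d_h)
for x_t in rng.normal(size=(10, d_x)):                             # a sequence of 10 time steps
    h = gru_step(x_t, h, params)
```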