

Fig. 3.5: Illustration of the ResNet principle. Left, two normal, consecutive convolutional layers are shown. Right, a typical ResNet block with two convolutional layers and a residual connection is shown.

layers have been introduced so far. While working well in practice, a downside of this concept is the extensive manual architecture engineering that is required.

ResNets (RN) are another concept that is crucial for architecture design. He et al. [193] observed that there was a general trend towards deeper networks with more layers. However, there is a limit to the number of layers that is effective in a CNN model, and for deeper versions, test performance goes down. At some point, the degradation problem occurs: very deep networks show both a higher test and training error than shallower versions. Thus, the deep networks do not fit the data well, indicating that the optimization process does not work well. He et al. proposed residual connections inside the network as a solution. We assume that several convolutional layers should learn a target mapping H at layer l. Instead, we can also learn a mapping:

H(x_l) = F(x_l) + x_l.    (3.15)

Here, the key assumption is that the mapping F is easier to learn, as the network only needs to learn a residual, or some small deviation from x_l, instead of a full mapping. The implementation of this idea can be realized using a skip connection, see Figure 3.5. In practice, residual connections have proven to be very effective, enabling very deep and higher-performing CNNs. Almost all modern CNN architectures utilize residual connections in some way.
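As an illustration of Equation 3.15, the following is a minimal sketch of a residual block in PyTorch; the choice of two 3×3 convolutions with batch normalization and ReLU activations is an assumption for illustration, not the exact configuration used in [193].

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block computing H(x) = F(x) + x (cf. Equation 3.15)."""

    def __init__(self, channels: int):
        super().__init__()
        # F(x): two convolutional layers, here with batch normalization (an assumption).
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + x                  # the residual (skip) connection
        return self.relu(out)
```

Because the skip connection simply adds the input back, the layers inside the block only have to model a small deviation from x_l, which is what makes deep stacks of such blocks trainable.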

In summary, CNNs generally consist of multiple convolutional layers, activation functions, pooling operations, and an output layer. A generic architecture overview is shown in Figure 3.6. Most modern architectures follow this concept and largely focus on improving the structure of the convolutional blocks.

3.3.3 Recurrent Neural Networks

As discussed in the previous section, CNNs can also be used to process temporal dimensions using convolution operations. As an alternative, recurrent neural networks (RNNs) have been introduced, which are specifically designed for processing sequential data. Processing sequences of examples is necessary for tasks where the learning target cannot be derived from a single example, for example, when trying to forecast the movement of an object.

Fig. 3.6: An example of a full CNN architecture for a 2D input image of size 224×224. The input image (224×224×3) is processed by an initial convolutional block producing a 56×56×64 feature map, followed by further convolutional blocks with feature maps of size 28×28×128, 14×14×256, and 7×7×512, global average pooling (GAP), and the output layer. The initial convolutional block typically consists of several convolutional layers. The other convolutional blocks can be ResNet blocks, Inception blocks, or other variants.
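To make the generic architecture of Figure 3.6 concrete, the following PyTorch sketch stacks an initial convolutional block, three further convolutional blocks, global average pooling, and a linear output layer. The specific layer configuration and the number of output classes are assumptions for illustration, and the plain blocks could be swapped for ResNet or Inception blocks.

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """A plain convolutional block; could be replaced by a ResNet or Inception block."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),  # halves spatial size
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

class GenericCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Initial conv block: 224x224x3 -> 56x56x64 (assumed stride-2 convolution plus pooling).
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        # Further conv blocks: 56x56x64 -> 28x28x128 -> 14x14x256 -> 7x7x512.
        self.blocks = nn.Sequential(
            conv_block(64, 128),
            conv_block(128, 256),
            conv_block(256, 512),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)        # global average pooling (GAP)
        self.head = nn.Linear(512, num_classes)   # output layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.blocks(self.stem(x))
        x = self.gap(x).flatten(1)
        return self.head(x)

# Example: a 224x224 RGB image produces class logits.
logits = GenericCNN()(torch.randn(1, 3, 224, 224))
```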

One approach could be to consider all previous examples as the input for a neural network. However, this approach can be very inefficient due to the large number of model parameters required, similar to processing an image with an FC-NN. Therefore, RNNs have been developed with the core idea of sharing parameters for several computations within a sequence. Often, sequences represent temporal data where each example within the sequence corresponds to a time point t_i. While most applications involve temporal sequences, the only necessary assumption for RNN applications is that the sequence follows some order [176].

In the following, we will assume that some type of time series is being processed. In general, an RNN processes a sequence x_i = {x_{t_i-n_t}, ..., x_{t_i}} of length n_t. We refer to t_i as the current time step, which comes with a history of n_t − 1 predecessors. At a time step t_i, we compute

h_{t_i} = f_M(h_{t_i-1}, x_{t_i}, w_M)    (3.16)

where h_{t_i} is the RNN's internal state at time step t_i and w_M are the RNN's parameters.

The RNN computes the next state h_{t_i} using the state of the previous time step h_{t_i-1}, the current sequence example x_{t_i}, its weights w_M, and some function f_M. In a typical RNN application, we try to predict future values from a history of data x_i. Here, h_{t_i} can be interpreted as a compressed representation of the previous information contained in x_i. As x_i is often a long sequence with many examples, h_{t_i} is chosen to be limited in size such that only the most relevant information for the task at hand can be stored.

The transforming function f_M can be chosen in different ways. In a typical RNN, feature vectors x_{t_i} and states h_{t_i-1} are transformed using matrix multiplications, similar to FC-NNs. To produce usable outputs o_{t_i}, an additional transformation of h_{t_i} is performed at each time step. Thus, for an example RNN we compute


Fig. 3.7: An example of an RNN, both in its single-node representation (left) and its finite unrolled representation (right). For the single-node representation, the black box at the recurrent connection indicates a delay of one time step. Note that for the unrolled representation, the computations at all time steps share parameters.

a_{t_i} = W_h h_{t_i-1} + W_x x_{t_i}    (3.17)
h_{t_i} = g_a(a_{t_i})    (3.18)
o_{t_i} = W_o h_{t_i}    (3.19)

where g_a is an activation function and the matrices W_h, W_x, and W_o are the learnable model parameters w_M. These computations can be visualized in two different ways, see Figure 3.7.
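As a small illustration, Equations 3.17 to 3.19 can be written out for a single sequence as follows; the dimensionalities and the choice of tanh for the activation g_a are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_t, d_x, d_h, d_o = 10, 4, 8, 2          # sequence length, input, state, and output sizes

# Learnable parameters w_M, shared across all time steps.
W_h = rng.normal(scale=0.1, size=(d_h, d_h))
W_x = rng.normal(scale=0.1, size=(d_h, d_x))
W_o = rng.normal(scale=0.1, size=(d_o, d_h))

x = rng.normal(size=(n_t, d_x))           # sequence x_i = {x_{t_i-n_t}, ..., x_{t_i}}
h = np.zeros(d_h)                          # initial state

outputs = []
for t in range(n_t):
    a = W_h @ h + W_x @ x[t]              # Equation 3.17
    h = np.tanh(a)                         # Equation 3.18, with g_a = tanh
    outputs.append(W_o @ h)                # Equation 3.19
```

Note that the same three weight matrices are reused at every time step, which is exactly the parameter sharing described above.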

One way is to represent the computation by a single node with a recurrent connection.

Another way is to unfold the RNN into a computation graph where the computations for a finite set of time steps are depicted. While the single node is closer to a physical representation of an RNN, for example, in a biological neuron, the unfolded version explicitly shows all the computations that are involved. The unfolded version is limited to a finite number of time steps, which reflects realistic computations. In practice, an RNN is trained on truncated sequences using the unfolded graph representation.

Note that Equation 3.19 is just one option for RNN design. State, input example, and output can be connected in arbitrary ways, which allows for designing specialized RNNs for different applications. Furthermore, matrix multiplications are not necessarily required. Some RNN extensions are designed to deal with image-like data by performing convolution operations instead of matrix multiplications [556].

One important aspect of RNNs is the design of their output. Following Equation 3.19, we can produce a prediction at every time step. However, some tasks do not require an output at every time step, for example, when predicting only a current or future estimate at the final time step t_i of an unfolded RNN. Here, previous examples only serve as a source of information for the current output, and we are not interested in the outputs of previous time steps. This is important for training RNNs, as we then only create a loss signal for learning at the RNN's final time step.
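As a sketch of such a many-to-one setup, the following PyTorch snippet keeps only the output of the final time step and computes the loss there; the use of nn.RNN, the tensor sizes, and the mean squared error loss are assumptions for illustration.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)

x = torch.randn(16, 10, 4)          # batch of 16 sequences with 10 time steps each
target = torch.randn(16, 1)         # one target per sequence

out, _ = rnn(x)                     # hidden states for all time steps: (16, 10, 8)
pred = head(out[:, -1, :])          # keep only the final time step t_i
loss = nn.functional.mse_loss(pred, target)   # loss signal only at the final step
loss.backward()
```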

A major shortcoming of RNNs is their ability to deal with very long sequences and long-term dependencies. Assuming that a prediction at the final time point t_i depends on an input example from the past, transferring this information through multiple time steps in the state h_{t_i} can be difficult.


Fig. 3.8: An example of an LSTM cell [200]. Blue blocks indicate a neural network layer with learnable weights. White blocks indicate element-wise operations. σ(x) refers to a sigmoid activation function.

The state h_{t_i} undergoes a transformation in which its values are multiplied by the same weight matrix multiple times. If the values in h_{t_i} are smaller than 1, the chained multiplications will eventually pull them towards zero; if they are larger than 1, they will explode. While exploding values can be avoided with sigmoid-like activation functions, the values will saturate instead. In the case of saturation, the vanishing gradient problem occurs, which prevents RNNs from training effectively.

The problem of long-term dependencies in long sequences has been extensively studied, for example, by Bengio et al. [41] and Hochreiter et al. [199].
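The effect can be illustrated numerically: repeatedly multiplying a state by the same weight matrix drives its values towards zero or lets them explode, depending on the scale of the matrix. The matrix sizes and scale factors below are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=8)                       # some initial state

for scale in (0.5, 1.5):
    W = scale * np.eye(8)                    # the same weight matrix applied at every step
    state = h.copy()
    for step in range(50):                   # 50 time steps of repeated multiplication
        state = W @ state
    print(scale, np.abs(state).max())        # ~1e-16 for scale 0.5, ~1e8 for scale 1.5
```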

One approach for overcoming this problem is gated RNNs. Here, the idea is to create connections between time steps that do not result in vanishing or exploding gradients.

Hochreiter and Schmidhuber [200] implemented this idea using self-loops within the RNN block, which allows the gradient to flow within the network for a long time. This RNN model is termed the long short-term memory (LSTM) cell, see Figure 3.8. A core part of the LSTM is its cell state c_{t_i}, representing the LSTM's self-loop. The cell state can remain the same for a long time, as it is only manipulated by element-wise multiplications or additions. Changes to the cell state, for example, for adding or removing information, can be introduced through gates.

Gates are neural network layers with a sigmoid-like activation function g_a. The first gate is the forget gate, which controls what kind of information is kept in the cell state and what is removed. If the gate's activations are zero, the element-wise multiplication will remove the corresponding values from the cell state. If the activations are one, the values pass through unchanged. The forget gate computes

fg_{t_i} = σ(W_{fh} h_{t_i-1} + W_{fx} x_{t_i})    (3.20)

where we consider a similar input and state transformation as in Equation 3.19. Next, new information can be added to the cell state with the input gate. First, a new candidate cell state ĉ_{t_i} is computed using another neural network layer. Typically, this layer uses a tanh activation function, although any activation function can be used here, as the output values do not directly affect the cell state c_{t_i}. Next, another gating layer determines which parts of the candidate state are kept for addition to the previous cell state c_{t_i-1}. Thus, for the input gate, we compute



Fig. 3.9: An example of a GRU cell [87]. Blue blocks indicate a neural network layer with learnable weights. White blocks indicate element-wise operations. σ(x) refers to a sigmoid activation function.

in_{t_i} = σ(W_{ih} h_{t_i-1} + W_{ix} x_{t_i})    (3.21)
ĉ_{t_i} = tanh(W_{ch} h_{t_i-1} + W_{cx} x_{t_i})    (3.22)

Thus, the new cell state c_{t_i} is computed by

c_{t_i} = fg_{t_i} c_{t_i-1} + in_{t_i} ĉ_{t_i}    (3.23)

where the multiplications are element-wise. Finally, the LSTM's output needs to be computed. Here, the output is the new cell state c_{t_i} transformed with a tanh activation. An output gate then determines which values of the transformed cell state are kept:

o_{t_i} = σ(W_{oh} h_{t_i-1} + W_{ox} x_{t_i})    (3.24)
h_{t_i} = o_{t_i} tanh(c_{t_i})    (3.25)
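Putting Equations 3.20 to 3.25 together, a single LSTM step can be sketched as follows; the dimensionalities are assumptions for illustration, and bias terms are omitted as in the equations above.

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following Equations 3.20-3.25 (biases omitted)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    fg = sigmoid(p["W_fh"] @ h_prev + p["W_fx"] @ x_t)        # forget gate (3.20)
    in_g = sigmoid(p["W_ih"] @ h_prev + p["W_ix"] @ x_t)      # input gate (3.21)
    c_hat = np.tanh(p["W_ch"] @ h_prev + p["W_cx"] @ x_t)     # candidate cell state (3.22)
    c = fg * c_prev + in_g * c_hat                            # new cell state (3.23)
    o = sigmoid(p["W_oh"] @ h_prev + p["W_ox"] @ x_t)         # output gate (3.24)
    h = o * np.tanh(c)                                        # new hidden state (3.25)
    return h, c

# Example usage with assumed sizes (input dimension 4, state dimension 8).
rng = np.random.default_rng(0)
d_x, d_h = 4, 8
params = {name: rng.normal(scale=0.1, size=(d_h, d_h if name.endswith("h") else d_x))
          for name in ("W_fh", "W_fx", "W_ih", "W_ix", "W_ch", "W_cx", "W_oh", "W_ox")}
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(10, d_x)):                        # a sequence of 10 time steps
    h, c = lstm_step(x_t, h, c, params)
```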

Many different LSTM variants have been introduced in which the gating process is slightly modified. A very popular modification is the gated recurrent unit (GRU) [87]. GRUs are slightly more efficient than standard LSTMs, as they combine the input and forget gate into a single update gate. Furthermore, the cell state c_{t_i} and the hidden state h_{t_i} are fused into a single representation. The GRU performs the following computations:

z_{t_i} = σ(W_{zh} h_{t_i-1} + W_{zx} x_{t_i})    (3.26)
r_{t_i} = σ(W_{rh} h_{t_i-1} + W_{rx} x_{t_i})    (3.27)
ĥ_{t_i} = tanh(W_{hh}(r_{t_i} h_{t_i-1}) + W_{hx} x_{t_i})    (3.28)
h_{t_i} = (1 − z_{t_i}) h_{t_i-1} + z_{t_i} ĥ_{t_i}    (3.29)

Here, z_{t_i} is the update gate, r_{t_i} is the reset gate, and the multiplications between gate activations and states are element-wise. A visual interpretation of a GRU cell is shown in Figure 3.9. Note that the GRU only uses three instead of four neural network layers. Also, the fused hidden representation reduces the model's memory footprint. These advantages have made GRUs a popular alternative to LSTMs in many tasks.
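Analogously, a single GRU step following Equations 3.26 to 3.29 could be sketched as below; as before, the dimensionalities are assumed and bias terms are omitted.

```python
import numpy as np

def gru_step(x_t, h_prev, p):
    """One GRU step following Equations 3.26-3.29 (biases omitted)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(p["W_zh"] @ h_prev + p["W_zx"] @ x_t)             # update gate (3.26)
    r = sigmoid(p["W_rh"] @ h_prev + p["W_rx"] @ x_t)             # reset gate (3.27)
    h_hat = np.tanh(p["W_hh"] @ (r * h_prev) + p["W_hx"] @ x_t)   # candidate state (3.28)
    return (1.0 - z) * h_prev + z * h_hat                          # new state (3.29)

# Example usage: only a single state vector h is carried between time steps.
rng = np.random.default_rng(0)
d_x, d_h = 4, 8
params = {name: rng.normal(scale=0.1, size=(d_h, d_h if name.endswith("h") else d_x))
          for name in ("W_zh", "W_zx", "W_rh", "W_rx", "W_hh", "W_hx")}
h = np.zeros(d_h)
for x_t in rng.normal(size=(10, d_x)):                             # a sequence of 10 time steps
    h = gru_step(x_t, h, params)
```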