Conditional Generative Adversarial Recurrent Neural Networks

In this thesis, we aim to investigate the CGAN framework's ability to generate sequential data. Since the generator and the discriminator can be chosen arbitrarily as long as the generator is a generative and the discriminator is a classifying model, we will use LSTMs for both. As mentioned in section 2.2, LSTMs are a common model for processing sequential data and have often been used with great success. We refer to this model as Conditional Generative Adversarial Recurrent Neural Networks (CGARNNs). We expect CGARNNs to combine the ability to generate context-sensitive, high-quality samples with the ability to learn and utilize the distribution of the sequential data. Unlike LSTMs, CGARNNs might give us the possibility to generate data described by context vectors that were not available during training, because the LSTM's loss always depends on a specific sample sequence per training step, while the CGARNN's loss only depends on the discriminator's feedback. We expect the model to interpolate the behavior for unknown context vectors, since a context vector contains multiple random variables and therefore parts of the unknown vectors have already been processed during training. This is particularly beneficial if the number of possible context vectors is very large and it is impossible to provide training data for all of them. In the following section we will explore the potential of the CGARNN model empirically in multiple experiments.

3 Experiments

In this section we describe the Conditional Generative Adversarial Recurrent Neural Network (CGARNN) experiments, which are divided into two sections.

The first section is a proof-of-concept to show that the model is able to generate data depending on a given context. For this we use the MNIST data set [14] and compare our results to those of three other models (Tab. 1): a Generative Adversarial Network (GAN), a Conditional GAN (CGAN) and a Generative Adversarial Recurrent Neural Network (GARNN). We chose these three rather similar models in order to determine the advantages of the individual properties of the CGARNN model, namely the use of Recurrent Neural Networks (RNNs) to obtain some kind of memory unit and the dependency on a context vector. The experiments should show that both the use of RNNs and the additional context input improve the quality of the generated data.

             no memory    memory
no context   GAN          GARNN
context      CGAN         CGARNN

Table 1: MNIST - Comparison of model properties

The second section describes the experiments on the Football Events data set (REF), with which we want to discover the advantages of the CGARNN model for the text generation task. For comparison, we additionally train a Long Short-Term Memory (LSTM) model that also depends on a context vector. The main difference between these two models is therefore the way the feedback is given to them. While the CGARNN's generator gets its feedback from the discriminator, the LSTM model needs an original text from the data set in order to compare it to the generated one. We further discuss the impact of the way the feedback is given to the different models in section 3.2.

We implemented all experiments in Python using Google’s TensorFlow1 library for machine learning, published the code under an open source license2 and wrote a code documentation (see appendix).

1https://www.tensorflow.org/

2https://bitbucket.org/ROYALBEFF/conditional_generative_adversarial_rnns (also listed in the appendix)

3.1 MNIST

The first data set we used for our experiments is the well-known MNIST data set [14], consisting of 70000 28×28 images showing handwritten digits. The experiments on the MNIST data set are a proof-of-concept, whereas the experiments in section 3.2 show the model's capabilities. As mentioned above, we trained three models in addition to the CGARNN on the MNIST data in order to compare their results. The three models are: a GAN that is independent of the context and does not make use of RNNs, a CGAN that depends on the context and does not make use of RNNs either, and a GARNN that is independent of the context but uses RNNs instead of common Artificial Neural Networks (ANNs) (Tab. 1).

               Generator               Discriminator
N (per layer)  noise size, 392, 784    784, 392, 1
loss           Sigmoid cross entropy with logits
optimizer      ADAM

Table 2: GAN network settings. N describes the number of neurons per layer (input, hidden, output) for generator and discriminator. The second and third line are the loss and optimizer functions. k is the number of the discriminator's training steps per each training step of the generator. Batch size is the number of training examples per epoch. Noise size describes the size of the generator's input vector that is then reshaped into a 28×28 image. α is the learning rate. The initializer describes the way the variables are initialized. Epochs is the number of training epochs.

First we trained a GAN with parameters as seen in Tab. 2. In this model both the generator and the discriminator are artificial neural networks sharing most of their network parameters.

The generator consists of three layers: input layer (noise size neurons), hidden layer (392 neurons) and output layer (784 neurons). The number of neurons per layer depends on the noise size which describes the size of the input vector, and on the size of the MNIST data. The input layer contains one neuron per entry in the input vector. The hidden layer contains exactly half of the neurons the output layer contains. This way the size of the output vectors across the layers increases regularly, such that the network’s output does not depend on a specific part of the network too much. The output layer then contains exactly 784 (28×28) neurons in order to produce an output vector similar to those in the MNIST data set.

The discriminator also consists of three layers: input layer (784 neurons), hidden layer (392 neurons) and output layer (1 neuron). The number of neurons per layer decreases regularly as well, such that the discriminator profits from this in the same way the generator does. The discriminator's job is to distinguish real MNIST images from those that are produced by the generator. The single scalar output describes the probability that the input vector came from the MNIST data set rather than having been generated. k describes the number of training steps of the discriminator per training step of the generator. We set k = 1, which is a very common choice.
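The following is a minimal sketch of such a generator/discriminator pair using TensorFlow's Keras API. The layer sizes follow Tab. 2, while the activation functions, the value of noise_size and the helper names build_generator and build_discriminator are illustrative assumptions rather than the exact settings of our implementation.

import tensorflow as tf

noise_size = 100  # illustrative value; the concrete noise size is a hyper-parameter

def build_generator():
    # noise vector (noise_size entries) -> hidden layer (392) -> 784 outputs (a flattened 28x28 image)
    return tf.keras.Sequential([
        tf.keras.layers.Dense(392, activation="relu"),
        tf.keras.layers.Dense(784, activation="sigmoid"),
    ])

def build_discriminator():
    # 784 inputs -> hidden layer (392) -> a single logit (real vs. generated)
    return tf.keras.Sequential([
        tf.keras.layers.Dense(392, activation="relu"),
        tf.keras.layers.Dense(1),  # no sigmoid here; it is applied inside the loss
    ])

Keeping the discriminator's output as a logit and applying the sigmoid inside the loss function anticipates the numerical issue discussed next.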

As mentioned in chapter 2.3, the loss function of the generator and the discriminator can be described as a minimax game. During our experiments we often observed the discriminator's loss being NaN (not a number). A quick look at the discriminator's loss function (Eq. 13) reveals the reason for this behavior. The problem occurs when the discriminator is so good at classifying the input vectors that one of its sigmoid outputs is rounded to exactly 0 or 1. At this point we try to calculate the logarithm of 0, which is not defined.

L_D = −(log D(x) + log(1 − D(G(z))))    (13)

The first idea that came to mind was adding a small value ε = 0.001 to the network's output in order to prevent the logarithm's argument from becoming 0 (Eq. 14). Even though this fixed the problem and resulted in better loss progression and generated samples, it did not feel satisfactory.

L_D = −(log(D(x) + ε) + log(1 − D(G(z)) + ε))    (14)
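As an illustration of the problem (not our training code), the following sketch evaluates the naive discriminator loss of Eq. 13 for sigmoid outputs that have been rounded to exactly 1, and shows the effect of the ε-shift of Eq. 14:

import numpy as np

eps = 0.001  # the small value added in Eq. 14

def naive_d_loss(d_real, d_fake):
    # Eq. 13: breaks down once d_real or 1 - d_fake is rounded to 0
    return -(np.log(d_real) + np.log(1.0 - d_fake))

def shifted_d_loss(d_real, d_fake):
    # Eq. 14: the epsilon keeps both logarithms finite
    return -(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))

d_real, d_fake = 1.0, 1.0              # sigmoid outputs saturated and rounded to exactly 1
print(naive_d_loss(d_real, d_fake))    # inf, caused by log(0); the gradients then become NaN
print(shifted_d_loss(d_real, d_fake))  # a finite value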

Another idea was to try a different and established loss function that does not have the problem described above: the concept of cross entropy. We noticed some similarities between these two loss functions and, while having a closer look at them, we realized that they are exactly the same. This means we can still use the minimax game described in the paper by just rearranging the formula and without adding the aforementioned value ε. In the following we explain cross entropy and then prove that it is the same as the minimax game.

Cross entropy (Eq. 16) can be used as a measure of similarity between two probability distributions. Before we can understand what cross entropy is, we must have an understanding of how entropy works in general. Entropy measures the average information content of a random variable X over a discrete probability distribution p and is described by Eq. 15.

H = Σ_{i=1}^{n} p_i · log(1/p_i)    (15)

The number of possible values X can take is n, p_i is the probability for X = i and log(1/p_i) is the information content of X = i. A very demonstrative way of understanding entropy is the task of assigning each value a bit sequence. The goal is to assign the bit sequences in a way that the expected length of the bit sequence of a random sequence of values is minimized. To achieve this goal we assign values with high probabilities shorter bit sequences than values with smaller probabilities. We can use this property to compare the similarity of two probability distributions that are defined over the same set of values. Let's say we want to approximate a given probability distribution p over a certain set. Furthermore, we assume that H is the entropy defined over p such that the expected number of bits per sequence is minimal. Now we replace p with our approximated distribution p̂ for the assignment of the bit sequences. This results in another entropy with another expected number of bits per sequence. The difference between these two expected numbers of bits can be used to determine the similarity of two probability distributions over the same set. This method is called cross entropy (Eq. 16).

The two distributions we compare are the real distribution p that describes our data set and the distribution p̂ that represents the trained model. The goal is to change the network's weights in a way that the represented distribution converges to the real distribution, or, equivalently, minimizes the loss.

H(p, p̂) = Σ_{i=1}^{n} p_i · log(1/p̂_i)    (16)

A closer look at Eq. 16 will help us understand how it works. First of all, we can rearrange Eq. 16, because we know that there are just two classes in total (n = 2), and this way obtain Eq. 17. Either the input vector will be classified as real or as fake. We can describe these two outcomes with the probability y (probability that the input vector is real) and its complementary probability 1 − y (probability that the input vector is fake) for the real probability distribution, and ŷ and 1 − ŷ for the learned distribution respectively, with y ∈ {0, 1} and ŷ ∈ [0, 1].

H = y · log(1/ŷ) + (1 − y) · log(1/(1 − ŷ))    (17)

y · log(1/ŷ) describes the error that occurs in case the input vector came from the real data distribution, and (1 − y) · log(1/(1 − ŷ)) describes the error for generated input vectors. When calculating the error for an input vector, only the corresponding part of the equation is used; the other part will be 0 due to the actual label y. If y is 0 then the first part of the equation is 0, otherwise the second part of the equation is 0. The error then describes the difference between the optimal entropy given by the real data distribution and the entropy given by the learned distribution. Obviously, when the learned and the real distribution are the same, the entropies of the different outcomes are the same and therefore the cross entropy is 0.
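To make Eq. 17 concrete, the following small sketch (with illustrative values, not taken from our experiments) evaluates the binary cross entropy for a real (y = 1) and a generated (y = 0) input vector:

import numpy as np

def binary_cross_entropy(y, y_hat):
    # Eq. 17: only the term matching the true label y contributes; the other term is multiplied by 0
    return y * np.log(1.0 / y_hat) + (1.0 - y) * np.log(1.0 / (1.0 - y_hat))

print(binary_cross_entropy(1, 0.9))   # real input classified as real with 0.9 -> small error
print(binary_cross_entropy(0, 0.9))   # generated input classified as real with 0.9 -> large error
print(binary_cross_entropy(1, 0.99))  # almost perfect classification -> error close to 0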

Now we prove that the minimax game and the cross entropy are the same. The main idea behind this proof is that we want to make the cross entropy independent of the actual output of the network. Therefore we rearrange the cross entropy in a way that it does not take the activated output of the network, but its logits. We can describe the activated output of the discriminator D(x) as σ(x̂), where x̂ is the discriminator's last layer's output before we apply the activation function to it.

This rearrangement works because the activation function of the output layer is the sigmoid function σ. As we can see in equation 18, we can make use of this property and rearrange the cross entropy such that it uses the logits instead of the activated output and is therefore defined everywhere. We call this rearranged variant H_σ.

Next, we show that we can express the minimax game as the sigmoid cross entropy with logits. As seen in equation 19, this procedure is very straightforward.

L_D = −(log D(x) + log(1 − D(G(z))))    (19)

Now that we have shown that the minimax game describes the same optimization problem as the sigmoid cross entropy with logits, we can use it without running into the problem that the loss value can be NaN.
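A quick numerical check of this equivalence (a sketch, not part of our experiments): for real inputs (label 1) and generated inputs (label 0), the minimax loss evaluated on the activated outputs σ(x̂) matches TensorFlow's tf.nn.sigmoid_cross_entropy_with_logits evaluated directly on the logits x̂. The example logits and labels below are arbitrary illustrative values.

import numpy as np
import tensorflow as tf

logits = np.array([3.2, -1.5, 0.7])    # discriminator outputs before the sigmoid (x_hat)
labels = np.array([1.0, 0.0, 1.0])     # 1 = real MNIST image, 0 = generated image

probs = 1.0 / (1.0 + np.exp(-logits))  # sigma(x_hat) = D(x)

# Minimax game / cross entropy on the activated outputs (the numerically fragile form)
minimax = -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))

# Sigmoid cross entropy with logits (the numerically stable form)
stable = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits).numpy()

print(np.allclose(minimax, stable))    # True: both describe the same loss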

To minimize the loss value during training we use Adaptive Moment Estimation, an optimization method presented by Kingma and Ba in 2015 [13]. Adaptive Moment Estimation (ADAM) is a stochastic gradient-based optimization method, which means that it optimizes stochastic functions, such as the sigmoid cross entropy with logits, by using partial derivatives of that function. The idea of using gradient descent (or ascent) to optimize a loss function is standard practice and used in all established optimization methods in the domain of machine learning. What is special about ADAM is the fact that the learning rate is adaptive in each iteration.

The advantage of using an adaptive learning rate rather than a constant learning rate is that it can be adjusted to the current learning progression. E.g., when the gradients of the last few iterations indicate a step size that is too large, the learning rate will be decreased for the next iterations in order to converge to the global minimum rather than overshooting it. To determine the learning rate in each iteration we use the first (Eq. 20a) and second moment (Eq. 20b) of the gradients, which are the mean value and the variance, where t is the current iteration, g_t are the gradients in iteration t, g_t² is the element-wise multiplication of g_t with itself, and β_1 and β_2 are hyper-parameters describing the exponentially decreasing influence of the previous gradients. Kingma and Ba provide default values for the hyper-parameters, which are β_1 = 0.9 and β_2 = 0.999.

m_t = β_1 · m_{t−1} + (1 − β_1) · g_t    (20a)
v_t = β_2 · v_{t−1} + (1 − β_2) · g_t²    (20b)

m_0 and v_0 are initialized with zeros, which leads to biased moments in later iterations.

Fortunately, one can simply correct these biased values by dividing by (1 − β_1^t) or (1 − β_2^t) respectively, which leads to the moments as seen in Eq. 21. We reformulated the corrected moments to make the influence of the single gradients easier to see and to make it easier to understand why (1 − β_1^t) or (1 − β_2^t) respectively is used to correct the biased moments.

m̂_t = m_t / (1 − β_1^t) = (1 − β_1)/(1 − β_1^t) · Σ_{i=1}^{t} β_1^{t−i} · g_i
v̂_t = v_t / (1 − β_2^t) = (1 − β_2)/(1 − β_2^t) · Σ_{i=1}^{t} β_2^{t−i} · g_i²    (21)

The actual update step is then described in Eq. 22, where θ_t are the parameters at the t-th iteration and α is the upper bound of the learning rate (default: α = 0.001).

θ_t = θ_{t−1} − α · m̂_t / (√v̂_t + ε)    (22)

m̂_t is the mean value of the last t gradients, √v̂_t is the standard deviation of the last t gradients and ε is a small value (default: 10^−8) that is added to √v̂_t to avoid division by zero. The fraction m̂_t / √v̂_t is also called signal-to-noise ratio (SNR). One can imagine the learning rate adjusting as follows: When the standard deviation is rather large, the fraction will lead to a small value and so does the adjusted learning rate. A large standard deviation means that the direction the gradients have to move in, in order to reach the global minimum, is vague. In this case the SNR is rather small and the decreased learning rate prevents the gradient steps from being too large, and smaller steps are taken instead. A small standard deviation means that the last few steps reveal a clear direction for the gradients, such that the gradient steps can be increased in order to speed up the optimization process. Nonetheless, the step size is always bounded by α.
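The following sketch implements a single ADAM parameter update exactly along Eqs. 20-22 in plain NumPy with the default hyper-parameters; it is an illustration, not our TensorFlow training code, and the toy loss sum(θ²) is chosen only to have a simple gradient.

import numpy as np

beta1, beta2, alpha, eps = 0.9, 0.999, 0.001, 1e-8  # defaults suggested by Kingma and Ba

def adam_step(theta, grad, m, v, t):
    # One ADAM update for parameters theta given the gradient at iteration t (t >= 1)
    m = beta1 * m + (1.0 - beta1) * grad             # Eq. 20a: first moment (mean)
    v = beta2 * v + (1.0 - beta2) * grad ** 2        # Eq. 20b: second moment
    m_hat = m / (1.0 - beta1 ** t)                   # Eq. 21: bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # Eq. 22: update step
    return theta, m, v

theta = np.array([0.5, -0.3])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 4):
    grad = 2 * theta                                 # gradient of the toy loss sum(theta^2)
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)                                         # parameters move towards the minimum at 0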

The next interesting part is the variable initialization, which is typically based on a probability distribution from which the initialization values are drawn. For this we used Xavier initialization, described by Xavier Glorot and Yoshua Bengio in 2010 [3]. The idea behind Xavier initialization is to solve the problem of too large or too small variances of the weights and therefore of the values that are propagated through a network. In the following we assume a uniform probability distribution with mean 0. When the variance of the weights is too small, then all weights are near the mean value. Calculating a layer's output then results in values that are also near 0. Looking at the sigmoid function for instance, such values fall into its nearly linear region, which leads to an almost linear behavior of the network (Fig. 11, blue area).

But when the variance of the weights is too large, the output values of a layer are far away from the mean value, too. This results in gradients near 0, which in turn means that the weights will stay as they have been in the previous iterations (Fig. 11, red area). Thus, the goal is to find a variance that we can use to initialize the weights in a way such that the variances of the inputs and outputs of all layers are the same. Looking at a single layer, we can describe the variance of its output y as:

Figure 11: Problem of too small or too large variance of the weights, illustrated by the sigmoid function. A too large variance results in output values that are far away from the mean value and leads to gradients near 0 (red areas). A too small variance results in output values that are very close to the mean value and therefore leads to linear behavior (blue area).

Var(y) = Var(xW + b) = Var((Σ_{i=1}^{n} w_i · x_i) + b)    (23)

Since b can be seen as another weight that is always multiplied with the input 1, we can drop it to simplify things. The summands are variances of products of independent variables and can therefore be described as:

Var(w_i · x_i) = E(w_i)² · Var(x_i) + E(x_i)² · Var(w_i) + Var(w_i) · Var(x_i)
              = Var(w_i) · Var(x_i)    (24)

The expected values are 0, such that the variance of w_i · x_i is just the product of the variances of w_i and x_i. Assuming that the variances of the weights and the inputs in a single layer are all the same, equation 23 can be expressed as

Var(y) = n · Var(w) · Var(x)    (25)

where n is the number of neurons in the corresponding layer. Now we want the variances of the output y and the input x to be the same. Therefore we want to know the variance of the weights.

Var(x) = Var(y)
⇔ Var(x) = n · Var(w) · Var(x)
⇔ 1 = n · Var(w)
⇔ Var(w) = 1/n    (26)

In case the size of y is not equal to the size of x, we have to average the input and the output size:

Var(w) = 1 / ((n_i + n_{i+1}) / 2) = 2 / (n_i + n_{i+1})    (27)

To adjust the variance of the uniform distribution to the variance described in equation 27, we need to choose the boundaries of the interval the random values are drawn from, since the variance of a uniform distribution is given by:

Var(x) = (1/12) · (b − a)²    (28)

where a is the lower bound and b the upper bound of the interval. We assume that a = −b. Now we just have to solve the following equation in order to obtain interval boundaries that lead to a uniform distribution with the desired properties.

2 / (n_i + n_{i+1}) = (1/12) · (a − b)²
⇔ 2 / (n_i + n_{i+1}) = (2b)² / 12
⇔ 2 / (n_i + n_{i+1}) = 4b² / 12
⇔ 24 / (n_i + n_{i+1}) = 4b²
⇔ 6 / (n_i + n_{i+1}) = b²
⇔ √6 / √(n_i + n_{i+1}) = b    (29)

In summary this means that we can use a uniform distribution with mean 0 to initialize our weights without running into the problem of too large or too small variances, when we set the interval boundaries as seen in equation 29 to adjust the variance of our distribution to 2/(n_i + n_{i+1}) for each layer, where n_i is the input size and n_{i+1} the output size of layer i.
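A minimal sketch of this initialization (illustrative code, not our implementation): draw the weights of a layer with n_i inputs and n_{i+1} outputs uniformly from [−b, b] with b as in equation 29, and check that their empirical variance matches the target variance of equation 27.

import numpy as np

def xavier_uniform(n_in, n_out, seed=0):
    # Eq. 29: interval boundary b = sqrt(6) / sqrt(n_in + n_out)
    rng = np.random.default_rng(seed)
    b = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-b, b, size=(n_in, n_out))

W = xavier_uniform(784, 392)       # e.g. the discriminator's first layer
print(W.var())                     # empirical variance of the drawn weights
print(2.0 / (784 + 392))           # target variance 2/(n_i + n_{i+1}) from Eq. 27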

Now we will have a look at the experimental results. We trained the GAN model 8 times for 50000 epochs with a batch size of 64 and a learning rate α of 0.001 and calculated the means of the loss values (Fig. 12).

Figure 12: Progress of the GAN model's loss values over 50000 epochs with a batch size of 64 and a learning rate of 0.001. The pink line shows the generator's loss, the green line shows the discriminator's loss.

Figure 13: Failed GAN experiment.

Analyzing the plot, we noticed that at the very beginning the discriminator's loss value goes to 0, while the generator's loss reaches a maximum of about 8. The reason for this is the strong discriminator in the first epochs. Here, most of the
