
Learning Multivariate Normal Distribution

Covariance Criteria

In order to define the two upcoming covariance criteria, the estimated covariance matrix of a generated batch $\tilde{X} \in \mathbb{R}^{b \times d}$ needs to be computed as

$$\hat{\Sigma}_{\tilde{X}} = \frac{1}{b-1}\,\tilde{X}_c^{T}\tilde{X}_c, \qquad (60)$$

where $\tilde{X}_c = \tilde{X} - \mathbf{1}_b\,\bar{x}^{T}$ denotes the centered batch matrix, $\mathbf{1}_b$ the $b$-dimensional unit (column) vector and $\bar{x}$ the row-mean vector from equation (58).

The correlation matrix is obtained from the estimated covariance matrix as

$$\hat{R}_{\tilde{X}} = D^{-1}\,\hat{\Sigma}_{\tilde{X}}\,D^{-1}, \qquad (61)$$

where $D = \sqrt{\operatorname{diag}(\hat{\Sigma}_{\tilde{X}})}$ is the matrix of square-rooted diagonal elements of the estimated covariance matrix.
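A minimal NumPy sketch of equations (60) and (61); the function name and the use of NumPy are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def correlation_matrix(X):
    """Estimate the covariance (eq. 60) and correlation (eq. 61) matrices
    of a generated batch X of shape (b, d)."""
    b = X.shape[0]
    x_bar = X.mean(axis=0)                         # row-mean vector (eq. 58)
    Xc = X - x_bar                                 # centered batch: X - 1_b x_bar^T
    cov = Xc.T @ Xc / (b - 1)                      # estimated covariance matrix (eq. 60)
    D_inv = np.diag(1.0 / np.sqrt(np.diag(cov)))   # D^{-1}
    corr = D_inv @ cov @ D_inv                     # estimated correlation matrix (eq. 61)
    return cov, corr
```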

The first covariance criterion $c_{l_1}$ is based on the $l_1$ norm and computes the sum of absolute differences between the sample correlation matrix $\hat{R}_{\tilde{X}}$ and the unit correlation matrix $I_d$:

$$c_{l_1}(\tilde{X}) := \frac{1}{d^2}\sum_{i=1}^{d}\sum_{j=1}^{d}\left|\hat{R}_{\tilde{X}}(i,j) - I_d(i,j)\right|, \qquad (62)$$

where $(i,j)$ denotes the element in the $i$-th row and $j$-th column of the respective matrix.

The second covariance criterion $c_{fb}$ is based on the Frobenius norm. The Frobenius norm of a matrix $\Sigma \in \mathbb{R}^{n \times m}$ is defined as

$$\|\Sigma\|_F = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{m}\Sigma(i,j)^2}. \qquad (63)$$

In order to compare the estimated correlation matrix with the unit correlation matrix, the second covariance criterion is defined as

$$c_{fb}(\tilde{X}) := \frac{1}{d}\left[\|\hat{R}_{\tilde{X}}\|_F - \|I_d\|_F\right]. \qquad (64)$$

Similar to the mean criterion metric, the two covariance criteria are expected to decrease with increasing training epoch.

In this proof-of-concept showcase, several network architectures were tested and their evaluation metrics compared. The final results, together with the training settings and network structures, are displayed in the next Section. For all experiments, either the ADAM optimizer [Kingma & Ba (2014)] or the RMSprop optimizer [Hinton (2012)] was chosen to update the network parameters. In the first experiment, the three GAN variants were compared to each other. The intention was to confirm whether the Wasserstein GAN with gradient penalty (Algorithm 3) is superior to the vanilla GAN and the Wasserstein GAN with weight clipping. For that reason, the same network architectures with different optimizers were selected.

4.2.2 Results

In this proof-of-concept experiment, both batch normalization [Ioffe & Szegedy (2015)] and layer normalization [Ba et al. (2016)] were tested for the generator network. It turns out that adding batch normalization layers in the generator network is crucial for generating good samples. In batch normalization, an activated batch B ∈ R^(b×d) is normalized by subtracting the batch mean µ ∈ R^d and dividing by the batch standard deviation σ ∈ R^d, both computed along the batch dimension b. Layer normalization instead computes the layer mean and standard deviation along the feature dimension d, yielding µ ∈ R^b and σ ∈ R^b; the normalization is then carried out in the same way, but along the feature dimension. The architectures of the selected generator and discriminator networks are shown below.
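To make the difference in normalization axes concrete, here is a small NumPy sketch; the manual implementation and the eps constant are illustrative only, as in practice the framework's built-in normalization layers with learnable scale and shift would be used.

```python
import numpy as np

def batch_norm(B, eps=1e-5):
    """Normalize an activated batch B of shape (b, d) along the batch
    dimension b, so that mu and sigma lie in R^d."""
    mu = B.mean(axis=0)
    sigma = B.std(axis=0)
    return (B - mu) / (sigma + eps)

def layer_norm(B, eps=1e-5):
    """Normalize along the feature dimension d, so that mu and sigma lie in R^b."""
    mu = B.mean(axis=1, keepdims=True)
    sigma = B.std(axis=1, keepdims=True)
    return (B - mu) / (sigma + eps)
```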

Table 3: Illustration of the generator network architecture. It consists of three fully connected hidden layers with batch normalization and leaky ReLU activation [Xu et al. (2015)].

Name     Type                    Input size   Output size
input    input: z ∼ U(−1, 1)     100          –
FC1      linear                  100          256
         batch normalization     256          256
         leaky ReLU              256          256
FC2      linear                  256          512
         batch normalization     512          512
         leaky ReLU              512          512
FC3      linear                  512          256
         batch normalization     256          256
         leaky ReLU              256          256
output   linear                  256          50
         batch normalization     50           50
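A possible PyTorch realization of the generator in Table 3; the framework, the leaky-ReLU slope and the class name are assumptions here, not stated in the thesis.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=100, out_dim=50, slope=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256), nn.BatchNorm1d(256), nn.LeakyReLU(slope),  # FC1
            nn.Linear(256, 512), nn.BatchNorm1d(512), nn.LeakyReLU(slope),    # FC2
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.LeakyReLU(slope),    # FC3
            nn.Linear(256, out_dim), nn.BatchNorm1d(out_dim),                 # output
        )

    def forward(self, z):
        return self.net(z)

# sampling the latent input as specified in Table 3: z ~ U(-1, 1)
# z = 2 * torch.rand(batch_size, 100) - 1
```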

Table 4: Illustration of the discriminator/critic network architecture. The sigmoid activation function (equation (3)) is deployed in the output layer only for the vanilla GAN (Algorithm 1).

Name     Type                    Input size   Output size
input    input: x ∼ N(4, I)      50           –
FC1      linear                  50           128
         leaky ReLU              128          128
FC2      linear                  128          256
         leaky ReLU              256          256
FC3      linear                  256          512
         leaky ReLU              512          512
output   linear                  512          1
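Correspondingly, a sketch of the discriminator/critic from Table 4; the sigmoid is appended only in the vanilla-GAN case, and the slope and class name are again assumptions.

```python
import torch.nn as nn

class Critic(nn.Module):
    def __init__(self, in_dim=50, slope=0.2, use_sigmoid=False):
        super().__init__()
        layers = [
            nn.Linear(in_dim, 128), nn.LeakyReLU(slope),   # FC1
            nn.Linear(128, 256), nn.LeakyReLU(slope),      # FC2
            nn.Linear(256, 512), nn.LeakyReLU(slope),      # FC3
            nn.Linear(512, 1),                             # output
        ]
        if use_sigmoid:          # only for the vanilla GAN (Algorithm 1)
            layers.append(nn.Sigmoid())
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```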

The learning rates for the generator and discriminator/critic networks were set to α_g = 0.0002 and α_d = 0.0004 for both the RMSprop and ADAM optimizers.

All GAN variants were trained for n_epochs = 150 epochs. At the beginning of every epoch, b = 5000 samples were generated and the metrics from equations (59), (62) and (64) were computed for evaluation. In this experiment, the same generator and discriminator architectures (see Tables 3 and 4) were used for all three GAN variants, with the only difference being the choice of optimizer. All GAN variants are able to generate data with mean value µ = 4, as demonstrated in Figure 40.
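A rough sketch of this per-epoch evaluation loop, reusing the hypothetical networks and helper functions from the sketches above; the actual training step of the respective GAN variant is omitted.

```python
import torch

G = Generator()
D = Critic()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)   # alpha_g = 0.0002 (or RMSprop)
opt_d = torch.optim.Adam(D.parameters(), lr=4e-4)   # alpha_d = 0.0004 (or RMSprop)

for epoch in range(150):                             # n_epochs = 150
    # evaluation at the beginning of every epoch
    G.eval()
    with torch.no_grad():
        z = 2 * torch.rand(5000, 100) - 1            # b = 5000 evaluation samples
        X_gen = G(z).numpy()
    cov, corr = correlation_matrix(X_gen)            # eq. (60), (61)
    c_l1, c_fb = covariance_criteria(corr)           # eq. (62), (64)
    G.train()
    # ... mean criterion from eq. (59) and one epoch of GAN training follow here
```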

The Wasserstein GAN with weight clipping and RMSprop optimizer performs best regarding the mean value evaluation criterion, followed by its improved version with gradient penalty (and ADAM optimizer), and lastly the vanilla GAN (with either ADAM or RMSprop).

Figure 40: Mean evaluation criterion. Every GAN variant is able to generate samples with a mean value of approximately 4 after only one epoch of training.

When analyzing the capability to model the second moment, the corresponding evaluation curves for the Wasserstein GAN with weight clipping are unstable and fluctuate strongly, as shown in Figure 41. The Wasserstein GAN with gradient penalty and ADAM optimizer (as suggested by default in Algorithm 3) seems to be most robust regarding the two covariance criteria, generating samples that come from a normal distribution N(µ = 4, Σ = I_50).


Figure 41: The two covariance evaluation criteria suggest that the Improved WGAN with ADAM optimizer (Algorithm 3) is the best method to choose. The generator is able to produce samples with a mean value of approximately 4 whose column features exhibit (very) low pairwise correlation.

When analyzing the l1 criterion, the generator of the Improved WGAN with ADAM optimizer produces samples whose feature columns have low correlation, as indicated in Figure 41a. Ideally, the estimated correlation matrix of the evaluation batch data matrix is approximately the identity matrix.

For the Frobenius norm criterion, the Improved WGAN with ADAM optimizer performs best as well. Considering the results from this experiment, the Improved WGAN with ADAM optimizer is chosen as the best algorithm for learning multivariate normal data. Of course, an extensive hyperparameter search could be conducted. Since the goal was to try out different settings and (empirically) show that WGAN with gradient penalty is superior to the vanilla GAN and WGAN with weight clipping, no further hyperparameter tuning was performed.

Another interesting evaluation step is to select the generator network at epochs i = 0, 1, 50, 150, sample 5000 observations and extract an arbitrary column of the generated batch data matrix, e.g. the first column. Knowing the data generating process for the multivariate normal distribution with the identity matrix as covariance matrix, we conclude that the joint probability can be factorized into a product of independent univariate Gaussians [Do (2008)]. Hence, when training the GAN, a distribution shift of the univariate marginals towards N(µ = 4, σ² = 1) with increasing epoch is expected (consistent with Figure 40). To verify this expectation, a kernel density estimation (KDE) of the generated samples was computed. Since the Improved WGAN with ADAM optimizer learns the true data very fast²⁰ even after one epoch, we observe that the univariate Gaussian of the generator model shifts towards the true Gaussian with a mean value of four, as illustrated in Figure 42.
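A sketch of this per-column inspection using a Gaussian KDE; SciPy's gaussian_kde is an assumption (any KDE implementation would do), and G refers to the hypothetical generator checkpoint from the sketches above.

```python
import numpy as np
import torch
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

# G: generator checkpoint of a given epoch (e.g. 0, 1, 50 or 150), see sketch after Table 3
G.eval()
with torch.no_grad():
    X_gen = G(2 * torch.rand(5000, 100) - 1).numpy()   # 5000 generated observations

col = X_gen[:, 0]                                       # arbitrary column, e.g. the first
kde = gaussian_kde(col)
grid = np.linspace(col.min() - 1.0, col.max() + 1.0, 500)
plt.plot(grid, kde(grid), label="generated, first column")
# under independence, each column should approach N(mu = 4, sigma^2 = 1)
plt.legend()
plt.show()
```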

Figure 42: The generator network learns to produce samples that follow a univariate N(µ = 4, σ² = 1) even after one epoch of training. For this plot, the first column of the training and generated batch data matrices was selected. The KDE of the generated samples at epoch zero already looks Gaussian because the weights of the generator network are initialized with Gaussian random numbers with zero mean and a variance depending on the hidden layer size, following the Xavier initialization rule [Glorot & Bengio (2010)] as stated at the end of Section 2.2.2.1.

²⁰ The results after one epoch are already strong since the dataset is large with 1 million samples and the training was performed with a batch size of m = 256. In this case, approximately 1,000,000/256 ≈ 3906 generator updates are executed within one epoch.