4.1.2 Deep Generative Models

Neural generative models aim to learn a neural network that maps vectors sampled from a simple predefined source distribution $Q$ (usually a Gaussian or uniform distribution) to the actual input distribution $\mathbb{P}^+$. That is, their objective is to train a network $\phi_\omega$ with weights $\omega$ such that $\phi_\omega(Q) \approx \mathbb{P}^+$, where $\phi_\omega(Q)$ denotes the distribution that results from pushing the source distribution $Q$ through the neural network $\phi_\omega$.
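To illustrate the pushforward idea, the following toy sketch (in PyTorch, chosen here only for illustration) samples from a Gaussian source distribution and maps the samples through a network; the architecture and dimensions are arbitrary placeholders, not taken from any cited work.

```python
# Toy sketch of the pushforward phi_omega(Q): samples from a simple source
# distribution Q are mapped through a network phi_omega; training would try to
# make the distribution of the mapped samples match P+.
import torch
import torch.nn as nn

z_dim, x_dim = 64, 784                                        # illustrative dimensions
phi = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))

z = torch.randn(1000, z_dim)   # z ~ Q (standard Gaussian source distribution)
x_gen = phi(z)                 # samples distributed according to phi_omega(Q)
```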

The two most established neural generative models are Variational Autoencoders (VAEs) [277, 454, 278] and Generative Adversarial Networks (GANs) [186].

VAEs

Variational Autoencoders are deep latent variable models in which the distribution of the input $x$ is parameterized via a neural network conditioned on latent samples $z \sim Q$, with the goal of learning a distribution $p_\theta(x|z)$ such that $p_\theta(x) \approx p^+(x)$. A common choice is to let $Q$ be an isotropic multivariate Gaussian distribution and let the neural network $\phi_{d,\omega} = (\mu_\omega, \sigma_\omega)$ (the decoder with weights $\omega$) parameterize the mean and variance of an isotropic Gaussian, so that $p_\theta(x|z) = \mathcal{N}(x;\, \mu_\omega(z), \sigma_\omega^2(z) I)$. Performing maximum likelihood estimation on $\theta$ is typically intractable. To address this, an additional network $\phi_{e,\omega'}$ (the encoder with weights $\omega'$) is introduced to parameterize a variational distribution $q_{\theta'}(z|x)$, with parameters $\theta'$ given by the output of the encoder $\phi_{e,\omega'}$, to approximate the latent posterior $p(z|x)$. The full model is then optimized in a variational Bayes manner via the evidence lower bound (ELBO):

$$\max_{\theta,\theta'} \; -D_{\mathrm{KL}}\big(q_{\theta'}(z|x) \,\|\, p(z)\big) + \mathbb{E}_{q_{\theta'}(z|x)}\big[\log p_\theta(x|z)\big]. \qquad (4.1)$$

Optimization proceeds using Stochastic Gradient Variational Bayes [277]. Given a trained VAE, $p_\theta(x)$ can be estimated via Monte Carlo sampling, drawing from the prior $p(z)$ and computing $\mathbb{E}_{z \sim p(z)}[p_\theta(x|z)]$. Using this estimated likelihood as an anomaly score has a nice theoretical interpretation, but experiments have shown that it tends to perform worse [595, 387] than the alternative of using the reconstruction probability [21], which conditions on $x$ to estimate $\mathbb{E}_{q_{\theta'}(z|x)}[\log p_\theta(x|z)]$. The latter can also be seen as a probabilistic reconstruction model that uses a stochastic encoding and decoding process, which connects VAEs to general reconstruction-based autoencoders (see Section 4.2.3).
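To make this scoring concrete, below is a minimal sketch (in PyTorch, which the cited works do not prescribe) of a Gaussian VAE together with the reconstruction probability used as an anomaly score. The layer sizes, the single-sample ELBO estimate, and the number of Monte Carlo samples are illustrative assumptions rather than settings from the references above.

```python
# Minimal sketch: a Gaussian VAE and the reconstruction probability
# E_{q(z|x)}[log p(x|z)] used as an anomaly score (higher = more anomalous).
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)       # mean of q(z|x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)   # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU())
        self.dec_mu = nn.Linear(h_dim, x_dim)       # mean of p(x|z)
        self.dec_logvar = nn.Linear(h_dim, x_dim)   # log-variance of p(x|z)

    def encode(self, x):
        h = self.enc(x)
        return self.enc_mu(h), self.enc_logvar(h)

    def decode(self, z):
        h = self.dec(z)
        return self.dec_mu(h), self.dec_logvar(h)

def elbo(model, x):
    # Single-sample Monte Carlo estimate of the ELBO (Eq. 4.1), via reparameterization.
    mu, logvar = model.encode(x)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
    dec_mu, dec_logvar = model.decode(z)
    log_px_z = torch.distributions.Normal(dec_mu, torch.exp(0.5 * dec_logvar)).log_prob(x).sum(-1)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    return log_px_z - kl

def reconstruction_probability_score(model, x, n_samples=16):
    # Anomaly score: negative Monte Carlo estimate of E_{q(z|x)}[log p(x|z)].
    mu, logvar = model.encode(x)
    scores = []
    for _ in range(n_samples):
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        dec_mu, dec_logvar = model.decode(z)
        scores.append(torch.distributions.Normal(dec_mu, torch.exp(0.5 * dec_logvar)).log_prob(x).sum(-1))
    return -torch.stack(scores).mean(dim=0)
```

Training would maximize `elbo` over the normal data; at test time, larger values of `reconstruction_probability_score` indicate more anomalous inputs.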

GANs

Generative Adversarial Networks approximate a data distribution by posing a zero-sum game [186]. A GAN consists of two neural networks, a generator $\phi_\omega : \mathcal{Z} \to \mathcal{X}$ and a discriminator $\psi_{\omega'} : \mathcal{X} \to (0,1)$, which are pitted against each other. The discriminator is trained to discriminate between $\phi_\omega(z)$ and $x \sim \mathbb{P}^+$, where $z \sim Q$. The generator is trained to fool the discriminator and thereby learns to produce samples that are similar to the data distribution. This is achieved by using the adversarial objective:

$$\min_\omega \max_{\omega'} \; \mathbb{E}_{x \sim \mathbb{P}^+}\big[\log \psi_{\omega'}(x)\big] + \mathbb{E}_{z \sim Q}\big[\log\big(1 - \psi_{\omega'}(\phi_\omega(z))\big)\big]. \qquad (4.2)$$

Training is typically performed in an alternating optimization scheme, which can be a notoriously delicate procedure [479]. There exist many GAN variants, for example the Wasserstein GAN [26, 201], which is frequently used in anomaly detection methods based on GANs, or StyleGAN, which has produced impressive high-resolution photorealistic images [266].
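The alternating scheme for objective (4.2) can be sketched as follows (in PyTorch); the architectures, optimizers, and the common non-saturating generator loss are illustrative assumptions and not taken from the cited works.

```python
# Minimal sketch of alternating GAN training for Eq. (4.2).
import torch
import torch.nn as nn

z_dim, x_dim = 64, 784                                         # illustrative dimensions
generator = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))
discriminator = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(x_real):
    batch = x_real.size(0)
    # Discriminator step: maximize log psi(x) + log(1 - psi(phi(z))).
    z = torch.randn(batch, z_dim)
    x_fake = generator(z).detach()
    d_loss = bce(discriminator(x_real), torch.ones(batch, 1)) + \
             bce(discriminator(x_fake), torch.zeros(batch, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # Generator step: fool the discriminator (non-saturating variant of the loss).
    z = torch.randn(batch, z_dim)
    g_loss = bce(discriminator(generator(z)), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```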

By construction, GAN models offer no direct way to assign a likelihood to points in the input space. Using the discriminator directly has been suggested as one approach to use GANs for anomaly detection [476], which is conceptually close to one-class classification (see Chapter 2). Other approaches apply optimization in the latent space $\mathcal{Z}$ to find a point $\tilde{z}$ such that $\tilde{x} \approx \phi_\omega(\tilde{z})$ for a given test point $\tilde{x}$. In AnoGAN [488], the authors recommend using an intermediate layer $l$ of the discriminator, $\psi^l_{\omega'}$, and setting the anomaly score to be a convex combination of the reconstruction loss $\|\tilde{x} - \phi_\omega(\tilde{z})\|$ and the discrimination loss $\|\psi^l_{\omega'}(\tilde{x}) - \psi^l_{\omega'}(\phi_\omega(\tilde{z}))\|$. In AD-GAN [130], we recommend initializing the search for a latent point multiple times to find a collection of $M$ latent points $\tilde{z}_1, \ldots, \tilde{z}_M$, while also adapting the generator parameters $\omega_i$ individually for each point $\tilde{z}_i$ to improve the reconstruction. We then propose using the mean reconstruction loss as an anomaly score:

$$\frac{1}{M} \sum_{i=1}^{M} \big\|\tilde{x} - \phi_{\omega_i}(\tilde{z}_i)\big\|.$$

Viewing the generator as a stochastic decoder and the optimization for an optimal latent point $\tilde{z}$ as an (implicit) encoding of a test point $\tilde{x}$, this way of utilizing a GAN, with the reconstruction error as an anomaly score, is similar to autoencoders (see Section 4.2.3). Later adaptations of GANs for anomaly detection have added explicit encoding networks that are trained to find the latent point $\tilde{z}$. This idea has been used in a variety of ways, usually again incorporating the reconstruction error into the anomaly score [604, 13, 489].
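The following is a simplified sketch (in PyTorch) of such a latent-space search for scoring a test point with a trained generator. The number of restarts, step counts, optimizer, and the joint adaptation of a copied generator are illustrative assumptions; it omits, for instance, the discriminator-feature term of AnoGAN and the specific procedures of the cited works.

```python
# Score a test point by searching for latent points that reconstruct it,
# returning the mean reconstruction loss over several restarts.
import copy
import torch

def latent_search_score(generator, x_test, z_dim=64, n_restarts=5, n_steps=200, lr=1e-2):
    losses = []
    for _ in range(n_restarts):
        g_i = copy.deepcopy(generator)                # individually adapted generator copy
        z_i = torch.randn(1, z_dim, requires_grad=True)
        opt = torch.optim.Adam([z_i] + list(g_i.parameters()), lr=lr)
        for _ in range(n_steps):
            opt.zero_grad()
            loss = torch.norm(x_test - g_i(z_i))      # reconstruction loss ||x - phi(z)||
            loss.backward()
            opt.step()
        with torch.no_grad():
            losses.append(torch.norm(x_test - g_i(z_i)))
    return torch.stack(losses).mean()                 # mean reconstruction loss as anomaly score
```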

Normalizing Flows

Like neural generative models, normalizing flows [140, 413, 284] also attempt to map data points sampled from a source distribution $z \sim Q$ (termed the base distribution for flows) so that $x = \phi_\omega(z)$ is distributed according to $p^+$. However, a distinguishing characteristic of normalizing flows is that their latent space $\mathcal{Z} \subseteq \mathbb{R}^D$, where $Q$ lives, has the same dimensionality $D$ as the input space $\mathcal{X} \subseteq \mathbb{R}^D$. A normalizing flow consists of $L$ neural network layers $\phi_{i,\omega_i} : \mathbb{R}^D \to \mathbb{R}^D$, so $\phi_\omega = \phi_{L,\omega_L} \circ \cdots \circ \phi_{1,\omega_1}$, where each $\phi_{i,\omega_i}$ is designed to be invertible for all $\omega_i$, thereby making the entire network invertible. The advantage of preserving the dimensionality and of the invertible formulation is that the probability density of $x$ can be calculated exactly via the change of variables formula,

$$p_x(x) = p_z\big(\phi_\omega^{-1}(x)\big)\,\left|\det \frac{\partial \phi_\omega^{-1}(x)}{\partial x}\right|,$$

so that the network can be optimized to maximize the likelihood of the training data. Evaluating the Jacobian and its determinant for each layer can be very expensive. For this reason, the layers of normalizing flows are usually designed so that the Jacobian has some convenient structure, for example being upper (or lower) triangular, so that it is not necessary to compute the full Jacobian to evaluate its determinant [140, 141, 240]. One benefit of normalizing flows over other neural generative models (e.g., VAEs or GANs) is that the likelihood of a point can be computed directly and without any approximation, while sampling also remains reasonably efficient. Since the density $p_x(x)$ can be computed exactly, normalizing flow models can be directly applied to anomaly detection [386, 577], using the negative log-likelihood as an anomaly score. Maziarka et al. [365] have recently proposed another flow-based anomaly detection model that optimizes the normalizing flow to learn a data-enclosing hypersphere of minimum volume in latent space, which connects their method to deep one-class classification (see Chapter 2.2).
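As an illustration of the triangular-Jacobian idea and of the negative log-likelihood score, the following sketch (in PyTorch) implements a small RealNVP-style flow with affine coupling layers; the depth, widths, masking scheme, and the assumption of an even input dimension are illustrative choices, not the architectures of the cited works.

```python
# Minimal coupling-layer flow: the Jacobian of each layer is triangular, so
# log|det J| is just the sum of the log-scales; negative log-likelihood is the anomaly score.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=128, flip=False):
        super().__init__()
        assert dim % 2 == 0, "this sketch assumes an even input dimension"
        self.d = dim // 2
        self.flip = flip
        self.net = nn.Sequential(nn.Linear(self.d, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * self.d))

    def forward(self, x):
        # One half conditions, the other half is transformed; `flip` alternates the roles.
        x1, x2 = (x[:, self.d:], x[:, :self.d]) if self.flip else (x[:, :self.d], x[:, self.d:])
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                      # bounded log-scale for numerical stability
        y2 = x2 * torch.exp(s) + t             # elementwise affine transform
        y = torch.cat([y2, x1], dim=-1) if self.flip else torch.cat([x1, y2], dim=-1)
        return y, s.sum(-1)                    # triangular Jacobian: log|det| = sum of log-scales

class Flow(nn.Module):
    def __init__(self, dim, n_layers=6):
        super().__init__()
        self.layers = nn.ModuleList([AffineCoupling(dim, flip=(i % 2 == 1)) for i in range(n_layers)])
        self.base = torch.distributions.Normal(0.0, 1.0)   # standard Gaussian base distribution

    def log_prob(self, x):
        # Evaluate the flow in the normalizing (x -> z) direction and accumulate log-determinants.
        z, log_det = x, torch.zeros(x.size(0))
        for layer in self.layers:
            z, ld = layer(z)
            log_det = log_det + ld
        return self.base.log_prob(z).sum(-1) + log_det

def anomaly_score(flow, x):
    # Negative log-likelihood under the flow: higher = more anomalous.
    return -flow.log_prob(x)
```

The layers here are written in the normalizing ($x \to z$) direction, which is the direction needed for density evaluation; sampling would invert each affine coupling, which is cheap for this layer type.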

One limitation of normalizing flows is that, by construction, they do not perform any dimensionality reduction, which argues against their use on data whose true (effective) dimensionality is much smaller than $D$ (e.g., images that live on a lower-dimensional manifold in pixel space). For image data, it has been observed that these models often assign high likelihood to anomalous instances [387]. Recent work suggests that one reason for this phenomenon is that the likelihood in current flow models is dominated by low-level features, owing to their specific network architectures and inductive biases [487, 281].