
2.2 Deep One-Class Classification

2.2.2 Theoretical Properties of Deep SVDD

We here examine Deep SVDD theoretically. First, we analyze three properties (Propositions 1–3) that (in theory) can lead to trivial, uninformative solutions of Deep SVDD and thus should be accounted for (e.g., by means of regularization, see Section 2.2.3). Afterwards, we prove the ν-property for soft-boundary Deep SVDD.

In the following, let Jsoft(R, ω) and Joc(ω) be the soft-boundary and One-Class Deep SVDD objective functions, as defined in (2.10) and (2.11), respectively. We first show that including the hypersphere center c ∈ Z as a free optimization variable can produce a trivial solution for both (non-regularized) Deep SVDD objectives.
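For readers who prefer code, the empirical terms of the two objectives can be sketched as follows in PyTorch; this is a minimal illustration assuming embeddings z = φω(x) of shape (batch, d) and omitting the weight decay regularizer of (2.10) and (2.11), not a reference implementation.

```python
import torch

def one_class_deep_svdd_loss(z, c):
    # empirical term of (2.11): mean squared distance of embeddings z = phi_w(x) to center c
    return ((z - c) ** 2).sum(dim=1).mean()

def soft_boundary_deep_svdd_loss(z, c, R, nu):
    # empirical term of (2.10): R^2 plus the average hinge penalty (scaled by 1/nu)
    # for points lying outside the hypersphere of radius R around c
    dist = ((z - c) ** 2).sum(dim=1)          # squared distances d_i
    return R ** 2 + (1.0 / nu) * torch.clamp(dist - R ** 2, min=0).mean()
```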

Proposition 1 (Zero weights solution). Let ω0 be the set of zero network weights, i.e., ωl = 0 for every ωl ∈ ω0. For constant zero weights, the network maps any input to the same output, i.e., φω0(x) =: c0 ∈ Z is constant for every x ∈ X. Then, if c = c0, the optimal solution of Deep SVDD is given by ω = ω0 and R = 0.

Proof. For every parameter configuration (R, ω) we have that both Jsoft(R, ω) ≥ 0 and Joc(ω) ≥ 0. As the output of the zero weights network φω0(x) is constant and the center of the hypersphere is given by c = φω0(x), all errors in the empirical term of the objectives become zero. Hence, R = 0 and ω = ω0 are optimal solutions since we have Jsoft(R, ω) = 0 and Joc(ω) = 0 in this case.
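To see Proposition 1 in action, the following hypothetical snippet (a toy architecture, not the networks used later in the thesis) zeroes all parameters of a small network and verifies that every input is mapped to the same constant output c0, so that choosing c = c0 and R = 0 drives both objectives to zero.

```python
import torch
import torch.nn as nn

# toy network phi_w : R^10 -> R^2 (illustrative architecture)
phi = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

# set all weights (and biases) to zero, i.e., w = w_0
with torch.no_grad():
    for p in phi.parameters():
        p.zero_()

x = torch.randn(8, 10)              # arbitrary inputs
z = phi(x)                          # every row equals the constant output c_0
c0 = z[0]

print(torch.allclose(z, c0.expand_as(z)))   # True: constant map
print(((z - c0) ** 2).sum(dim=1).max())     # 0: with c = c_0 the empirical terms vanish
```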

Proposition 1 implies that if we include the hypersphere center c as an optimization variable, optimizing the (non-regularized) Deep SVDD objectives via SGD may converge to the trivial solution (R, c, ω) = (0, c0, ω0). We call such a solution, where the network learns a constant map to some fixed output, “hypersphere collapse”, since the hypersphere collapses to zero volume. Next, we identify two network architecture properties which can also encourage such trivial hypersphere collapse solutions.

Proposition 2 (Bias terms). Let c ∈ Z be any fixed hypersphere center. If there is a hidden layer in the network φω : X → Z having a bias term, there exists an optimal solution (R, ω) of the Deep SVDD objectives (2.10) and (2.11) with R = 0 and φω(x) = c for every x ∈ X.

Proof. Assume layer l ∈ {1, . . . , L} with weights ωl has a bias term bl. For any input x ∈ X, the output of layer l is then given by

zl(x) = σl(ωl · zl−1(x) + bl),

where “·” denotes a linear operator (e.g., matrix multiplication or convolution), σl(·) is the activation function of layer l, and zl−1(x) is the output of the previous layer, which depends on input x. Then, for ωl = 0, we have that zl(x) = σl(bl), i.e., the output of layer l is constant for every input x ∈ X. Therefore, the bias term bl (and the weights of the subsequent layers) can be chosen such that φω(x) = c for every x ∈ X (assuming c is in the image of the network as a function of bl and the weights ωl+1, . . . , ωL of the subsequent layers). Hence, selecting ω in this way results in an empirical term of zero in (2.10) and (2.11), and choosing R = 0 gives the optimal solution (ignoring the weight decay regularization term for simplicity).
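As a hypothetical illustration of Proposition 2 (again with an arbitrary toy architecture), zeroing the weights of the first hidden layer while keeping its bias makes that layer's output σl(bl) constant, and the rest of the network then propagates a constant regardless of the input:

```python
import torch
import torch.nn as nn

phi = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 2))

with torch.no_grad():
    phi[0].weight.zero_()       # omega_l = 0 in the first layer ...
    phi[0].bias.fill_(0.5)      # ... but keep a non-zero bias b_l

x1 = torch.randn(4, 10)
x2 = torch.randn(4, 10) * 100.0             # very different inputs
print(torch.allclose(phi(x1), phi(x2)))     # True: the output ignores x
```

By further adjusting bl and the weights of the subsequent layers, this constant can be steered toward any target c in the image of the network, which is exactly the degenerate solution described in the proof.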

In other words, Proposition 2 implies that networks with bias terms can easily learn any constant function that is independent of the input x ∈ X.²

²Proposition 2 also explains why autoencoders with bias terms are prone to converging to a constant mapping onto the mean, which is the optimal constant solution of the mean squared error.

Proposition 3 (Bounded activation functions). Consider a network unit having a monotonic activation function σ(·) that has an upper (or lower) bound with sup_h σ(h) ≠ 0 (or inf_h σ(h) ≠ 0). Then, for a set of unit inputs {h1, . . . , hn} that have at least one feature that is positive or negative for all inputs, the non-zero supremum (or infimum) can be uniformly approximated on the set of inputs.

Proof. W.l.o.g. consider the case that σ is upper bounded by B := sup_h σ(h) ≠ 0 and feature j being positive for all inputs, i.e., hij > 0 for all i = 1, . . . , n. Then, for every ε > 0, one can always choose the jth element wj of the network unit weights sufficiently large (setting all other network unit weights to zero) such that sup_i |σ(wj hij) − B| < ε.

Proposition 3 simply states that a network unit with a monotonic, bounded activation function (e.g., sigmoid or tanh) can be saturated for all inputs, if these inputs share at least one feature with the same sign. Such a saturated unit, by effectively being constant over the inputs, emulates a bias term in the subsequent layer, which again enables the network to more easily learn a constant map (see Proposition 2).
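The saturation argument of Proposition 3 is easy to reproduce numerically; in the hypothetical snippet below, a sigmoid unit receives inputs whose third feature is positive for all samples, and a single large weight on that feature pushes the unit's output arbitrarily close to sup σ = 1 for every input:

```python
import torch

torch.manual_seed(0)
h = torch.randn(100, 5)
h[:, 2] = h[:, 2].abs() + 0.1    # feature j = 2 is positive for all inputs

w = torch.zeros(5)
w[2] = 1000.0                    # large weight on the shared-sign feature, all others zero

out = torch.sigmoid(h @ w)       # unit outputs for all 100 inputs
print(out.min().item())          # ~1.0: the unit is saturated at sup sigma = 1
```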

To summarize the above analysis: optimizing the hypersphere center c (due to the zero weights solution) as well as using bias terms and bounded activation functions in the network can foster a trivial hypersphere collapse solution for standard (non-regularized) Deep SVDD. Empirically, we found that using Batch Normalization [248] in the network φω (with the top layer excluded, of course), which prevents a collapse in the lower layers due to the normalization, and fixing the hypersphere center c as the mean of the data embeddings obtained from an initial forward pass on the training data is a reasonable strategy for standard (non-regularized) Deep SVDD in practice. We found this strategy, together with the added stochasticity of mini-batch SGD optimization, to produce fairly stable and consistent results that did not suffer from a hypersphere collapse. In the next Section 2.2.3, we will discuss further regularization techniques that actively regularize against a trivial collapse solution. Before that, we prove that the ν-property also holds for soft-boundary Deep SVDD, which allows us to model some target false alarm rate or account for training data contamination.
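A minimal sketch of the center-fixing part of this strategy is given below; it assumes a generic PyTorch encoder phi and a standard DataLoader yielding (input, label) batches, and the helper name init_center is our own placeholder, not code from the thesis.

```python
import torch

@torch.no_grad()
def init_center(phi, train_loader, device="cpu"):
    """Fix the hypersphere center c as the mean of the embeddings obtained
    from an initial forward pass over the training data (sketch)."""
    phi.eval()
    c, n = None, 0
    for x, _ in train_loader:
        z = phi(x.to(device))
        c = z.sum(dim=0) if c is None else c + z.sum(dim=0)
        n += z.shape[0]
    return c / n
```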

Proposition 4 (ν-property). Hyperparameter ν ∈ (0, 1] in the soft-boundary Deep SVDD objective (2.10) is an upper bound on the fraction of points being outside and a lower bound on the fraction of points being outside or on the boundary of the hypersphere.

Proof. Define di = ‖φω(xi) − c‖² for i = 1, . . . , n. W.l.o.g. assume d1 ≥ · · · ≥ dn. The number of points being outside the hypersphere is given by nout = |{i | di > R²}| and we can write the soft-boundary objective Jsoft (in radius R) as

Jsoft(R) = R² − (nout/(νn)) R² + (1/(νn)) Σ_{i=1}^{nout} di = (1 − nout/(νn)) R² + (1/(νn)) Σ_{i=1}^{nout} di.

That is, radius R should be decreased as long as nout ≤ νn holds, and decreasing R gradually increases nout. Thus, nout/n ≤ ν must hold in the optimum, i.e., ν is an upper bound on the fraction of outliers, and the optimal radius R is given for the largest nout for which this inequality still holds. Finally, we have that R∗² = di for i = nout + 1 since radius R is minimal in this case and points on the boundary do not increase the objective. Hence, we also have |{i | di ≥ R∗²}| ≥ nout + 1 ≥ νn.
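As a quick numerical sanity check of Proposition 4 (a hypothetical script, not part of the analysis above), one can minimize Jsoft over R for fixed distances di and verify both bounds of the ν-property:

```python
import numpy as np

rng = np.random.default_rng(0)
d = rng.exponential(size=1000)        # fixed squared distances d_i of the embeddings
nu, n = 0.1, len(d)

def J_soft(R2):
    # soft-boundary objective as a function of R^2 (weight decay omitted; it does not depend on R)
    return R2 + np.maximum(0.0, d - R2).sum() / (nu * n)

# J_soft is piecewise linear and convex in R^2, so a minimum lies at a breakpoint R^2 = d_i (or at 0)
candidates = np.concatenate(([0.0], d))
R2_opt = candidates[np.argmin([J_soft(r2) for r2 in candidates])]

frac_outside = (d > R2_opt).mean()
frac_outside_or_on = (d >= R2_opt).mean()
print(frac_outside <= nu <= frac_outside_or_on)   # True: the nu-property holds
```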