
2.2 Deep One-Class Classification

2.2.2 Theoretical Properties of Deep SVDD

We here examine Deep SVDD theoretically. First, we analyze three properties (Propositions 1–3) that (in theory) can lead to trivial, uninformative solutions of Deep SVDD and thus should be accounted for (e.g., by means of regularization, see Section 2.2.3). Afterwards, we prove the ν-property for soft-boundary Deep SVDD.

In the following, let Jsoft(R, ω) and Joc(ω) be the soft-boundary and One-Class Deep SVDD objective functions, as defined in (2.10) and (2.11), respectively. We first show that including the hypersphere center c ∈ Z as a free optimization variable can produce a trivial solution for both (non-regularized) Deep SVDD objectives.
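For readers who prefer code, the empirical terms of the two objectives can be sketched as follows in PyTorch; this is a minimal illustration assuming embeddings z = φω(x) of shape (batch, d) and omitting the weight decay regularizer of (2.10) and (2.11), not a reference implementation.

```python
import torch

def one_class_deep_svdd_loss(z, c):
    # empirical term of (2.11): mean squared distance of embeddings z = phi_w(x) to center c
    return ((z - c) ** 2).sum(dim=1).mean()

def soft_boundary_deep_svdd_loss(z, c, R, nu):
    # empirical term of (2.10): R^2 plus the average hinge penalty (scaled by 1/nu)
    # for points lying outside the hypersphere of radius R around c
    dist = ((z - c) ** 2).sum(dim=1)          # squared distances d_i
    return R ** 2 + (1.0 / nu) * torch.clamp(dist - R ** 2, min=0).mean()
```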

Proposition 1 (Zero weights solution). Let ω0 be the set of zero network weights, i.e., ωl = 0 for every ωl ∈ ω0. For constant zero weights, the network maps any input to the same output, i.e., φω0(x) =: c0 ∈ Z is constant for every x ∈ X. Then, if c = c0, the optimal solution of Deep SVDD is given by ω = ω0 and R = 0.

Proof. For every parameter configuration (R, ω) we have that both Jsoft(R, ω) ≥ 0 and Joc(ω) ≥ 0. As the output of the zero weights network φω0(x) is constant and the center of the hypersphere is given by c = φω0(x), all errors in the empirical term of the objectives become zero. Hence, R = 0 and ω = ω0 are optimal solutions since we have Jsoft(R, ω) = 0 and Joc(ω) = 0 in this case.
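To see Proposition 1 in action, the following hypothetical snippet (a toy architecture, not the networks used later in the thesis) zeroes all parameters of a small network and verifies that every input is mapped to the same constant output c0, so that choosing c = c0 and R = 0 drives both objectives to zero.

```python
import torch
import torch.nn as nn

# toy network phi_w : R^10 -> R^2 (illustrative architecture)
phi = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

# set all weights (and biases) to zero, i.e., w = w_0
with torch.no_grad():
    for p in phi.parameters():
        p.zero_()

x = torch.randn(8, 10)              # arbitrary inputs
z = phi(x)                          # every row equals the constant output c_0
c0 = z[0]

print(torch.allclose(z, c0.expand_as(z)))   # True: constant map
print(((z - c0) ** 2).sum(dim=1).max())     # 0: with c = c_0 the empirical terms vanish
```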

Proposition 1 implies that if we include the hypersphere center c as an optimization variable, optimizing the (non-regularized) Deep SVDD objectives via SGD may converge to the trivial solution (R, c, ω) = (0, c0, ω0). We call such a solution, where the network learns a constant map to some fixed output, “hypersphere collapse”, since the hypersphere collapses to zero volume. Next, we identify two network architecture properties which can also encourage such trivial hypersphere collapse solutions.

Proposition 2 (Bias terms). Let c ∈ Z be any fixed hypersphere center. If there is a hidden layer in the network φω : X → Z having a bias term, there exists an optimal solution (R, ω) of the Deep SVDD objectives (2.10) and (2.11) with R = 0 and φω(x) = c for every x ∈ X.

Proof. Assume layer l ∈ {1, . . . , L} with weights ωl has a bias term bl. For any input x ∈ X, the output of layer l is then given by

zl(x) = σl(ωl · zl−1(x) + bl),

where “·” denotes a linear operator (e.g., matrix multiplication or convolution), σl(·) is the activation function of layer l, and zl−1(x) is the output of the previous layer, which depends on input x. Then, for ωl = 0, we have that zl(x) = σl(bl), i.e., the output of layer l is constant for every input x ∈ X. Therefore, the bias term bl (and the weights of the subsequent layers) can be chosen such that φω(x) = c for every x ∈ X (assuming c is in the image of the network as a function of bl and the weights ωl+1, . . . , ωL of the subsequent layers). Hence, selecting ω in this way results in an empirical term of zero in (2.10) and (2.11), and choosing R = 0 gives the optimal solution (ignoring the weight decay regularization term for simplicity).
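As a hypothetical illustration of Proposition 2 (again with an arbitrary toy architecture), zeroing the weights of the first hidden layer while keeping its bias makes that layer's output σl(bl) constant, and the rest of the network then propagates a constant regardless of the input:

```python
import torch
import torch.nn as nn

phi = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 2))

with torch.no_grad():
    phi[0].weight.zero_()       # omega_l = 0 in the first layer ...
    phi[0].bias.fill_(0.5)      # ... but keep a non-zero bias b_l

x1 = torch.randn(4, 10)
x2 = torch.randn(4, 10) * 100.0             # very different inputs
print(torch.allclose(phi(x1), phi(x2)))     # True: the output ignores x
```

By further adjusting bl and the weights of the subsequent layers, this constant can be steered toward any target c in the image of the network, which is exactly the degenerate solution described in the proof.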

In other words, Proposition 2 implies that networks with bias terms can easily learn any constant function that is independent of the input x ∈ X.²

²Proposition 2 also explains why autoencoders with bias terms are prone to converging to a constant mapping onto the mean, which is the optimal constant solution of the mean squared error.

Proposition 3 (Bounded activation functions). Consider a network unit having a monotonic activation function σ(·) that has an upper (or lower) bound with sup_h σ(h) ≠ 0 (or inf_h σ(h) ≠ 0). Then, for a set of unit inputs {h1, . . . , hn} that have at least one feature that is positive or negative for all inputs, the non-zero supremum (or infimum) can be uniformly approximated on the set of inputs.

Proof. W.l.o.g. consider the case that σ is upper bounded by B := sup_h σ(h) ≠ 0 and feature j being positive for all inputs, i.e., hij > 0 for all i = 1, . . . , n. Then, for every ε > 0, one can always choose the jth element wj of the network unit weights sufficiently large (setting all other network unit weights to zero) such that sup_i |σ(wj hij) − B| < ε.

Proposition 3 simply states that a network unit with a monotonic, bounded activation function (e.g., sigmoid or tanh) can be saturated for all inputs, if these inputs share at least one feature with the same sign. Such a saturated unit, by effectively being constant over the inputs, emulates a bias term in the subsequent layer, which again enables the network to more easily learn a constant map (see Proposition 2).
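The saturation argument of Proposition 3 is easy to reproduce numerically; in the hypothetical snippet below, a sigmoid unit receives inputs whose third feature is positive for all samples, and a single large weight on that feature pushes the unit's output arbitrarily close to sup σ = 1 for every input:

```python
import torch

torch.manual_seed(0)
h = torch.randn(100, 5)
h[:, 2] = h[:, 2].abs() + 0.1    # feature j = 2 is positive for all inputs

w = torch.zeros(5)
w[2] = 1000.0                    # large weight on the shared-sign feature, all others zero

out = torch.sigmoid(h @ w)       # unit outputs for all 100 inputs
print(out.min().item())          # ~1.0: the unit is saturated at sup sigma = 1
```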

To summarize the above analysis: optimizing the hypersphere center c (due to the zero weights solution) as well as using bias terms and bounded activation functions in the network can foster a trivial hypersphere collapse solution for standard (non-regularized) Deep SVDD. Empirically, we found that using Batch Normalization [248] in the network φω (with the top layer excluded, of course), which prevents a collapse in the lower layers due to the normalization, and fixing the hypersphere center c as the mean of the data embeddings obtained from an initial forward pass on the training data is a reasonable strategy for standard (non-regularized) Deep SVDD in practice. We found this strategy, together with the added stochasticity of mini-batch SGD optimization, to produce fairly stable and consistent results that did not suffer from a hypersphere collapse. In the next Section 2.2.3, we will discuss further regularization techniques that actively regularize against a trivial collapse solution. Before that, we prove that the ν-property also holds for soft-boundary Deep SVDD, which allows us to model some target false alarm rate or account for training data contamination.
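A minimal sketch of the center-fixing part of this strategy is given below; it assumes a generic PyTorch encoder phi and a standard DataLoader yielding (input, label) batches, and the helper name init_center is our own placeholder, not code from the thesis.

```python
import torch

@torch.no_grad()
def init_center(phi, train_loader, device="cpu"):
    """Fix the hypersphere center c as the mean of the embeddings obtained
    from an initial forward pass over the training data (sketch)."""
    phi.eval()
    c, n = None, 0
    for x, _ in train_loader:
        z = phi(x.to(device))
        c = z.sum(dim=0) if c is None else c + z.sum(dim=0)
        n += z.shape[0]
    return c / n
```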

Proposition 4 (ν-property). Hyperparameter ν ∈ (0, 1] in the soft-boundary Deep SVDD objective (2.10) is an upper bound on the fraction of points being outside and a lower bound on the fraction of points being outside or on the boundary of the hypersphere.

Proof. Define di = ‖φω(xi) − c‖² for i = 1, . . . , n. W.l.o.g. assume d1 ≥ · · · ≥ dn. The number of points being outside the hypersphere is given by nout = |{i | di > R²}| and we can write the soft-boundary objective Jsoft (in radius R) as

Jsoft(R) = R² − (nout/(νn)) R² + (1/(νn)) Σ_{i=1}^{nout} di = (1 − nout/(νn)) R² + (1/(νn)) Σ_{i=1}^{nout} di.

That is, radius R should be decreased as long as nout ≤ νn holds, and decreasing R gradually increases nout. Thus, nout/n ≤ ν must hold in the optimum, i.e., ν is an upper bound on the fraction of outliers, and the optimal radius R is given for the largest nout for which this inequality still holds. Finally, we have that R∗² = di for i = nout + 1 since radius R is minimal in this case and points on the boundary do not increase the objective. Hence, we also have |{i | di ≥ R∗²}| ≥ nout + 1 ≥ νn.
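As a quick numerical sanity check of Proposition 4 (a hypothetical script, not part of the analysis above), one can minimize Jsoft over R for fixed distances di and verify both bounds of the ν-property:

```python
import numpy as np

rng = np.random.default_rng(0)
d = rng.exponential(size=1000)        # fixed squared distances d_i of the embeddings
nu, n = 0.1, len(d)

def J_soft(R2):
    # soft-boundary objective as a function of R^2 (weight decay omitted; it does not depend on R)
    return R2 + np.maximum(0.0, d - R2).sum() / (nu * n)

# J_soft is piecewise linear and convex in R^2, so a minimum lies at a breakpoint R^2 = d_i (or at 0)
candidates = np.concatenate(([0.0], d))
R2_opt = candidates[np.argmin([J_soft(r2) for r2 in candidates])]

frac_outside = (d > R2_opt).mean()
frac_outside_or_on = (d >= R2_opt).mean()
print(frac_outside <= nu <= frac_outside_or_on)   # True: the nu-property holds
```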