

2.3.6 Experimental Evaluation on Using Many Auxiliary Anomalies

Much of the recent success of deep learning builds on the large amounts of unorganized data that are easily accessible online. In NLP, word embedding models, such as word2vec [370], and language models, such as BERT [135] or GPT-3 [79], which are trained on huge unlabeled text corpora from the web in a self-supervised manner, are the current state of the art and responsible for significant improvements on various NLP tasks. In computer vision, supervised pre-training on large auxiliary datasets [603] such as ImageNet [133], as well as self-supervised pre-training [105], have been found to be effective. Using such pre-trained models as a starting point is standard in many downstream computer vision tasks.

Hendrycks et al. [221] introduced the idea of utilizing such large unstructured data for the task of anomaly detection by treating the auxiliary data as anomalous, an approach they call Outlier Exposure (OE), as mentioned previously. OE rests on the assumption that the unstructured auxiliary data is very unlikely to correspond to what is normal in a given application and thus is anomalous in most cases.

Although the auxiliary anomalies may not be representative of the anomalies encountered at testing time (i.e., they do not follow the anomaly distribution P⁻), the underlying hypothesis of OE is that this auxiliary data is nevertheless informative for the respective domain (e.g., natural images or the English language in general) and useful for learning an improved representation of the normal data. Exposing a model of normal cat images to random natural images (possibly including images of other animals), for instance, is most likely informative for learning an improved semantic representation of cats.

In the following, we test the above hypothesis and evaluate the value of having many auxiliary anomalies available for training for the two deep semi-supervised one-class classification methods introduced above, Deep SAD and HSC.

Setup We consider the CIFAR-10 and ImageNet one vs. rest benchmarks following Hendrycks et al. [222]. That is, in each setup one of the dataset classes is considered normal and the remaining classes are considered anomalous. In every setup, we train a model using only the training set of the respective normal class as well as random samples from a large OE dataset that is disjoint from the ground-truth anomaly classes at testing time. We use the same auxiliary OE datasets as used in recent literature [221, 222]. For the CIFAR-10 benchmark, which comprises 10 classes, we use the 80 Million Tiny Images dataset [548] as OE (with CIFAR-10 and CIFAR-100 images removed). The ImageNet benchmark contains 30 classes from the ImageNet-1K dataset [133], for which we use the ImageNet-22K dataset as OE (with the ImageNet-1K classes removed). Experiments are iterated over all classes and repeated for multiple random seeds.
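To make this setup concrete, the following is a minimal PyTorch-style sketch of how the one vs. rest training data with OE can be assembled. The helper name one_vs_rest_loaders and the use of torchvision are illustrative assumptions, not the original experimental code.

    import torch
    from torch.utils.data import DataLoader, Subset
    from torchvision import datasets, transforms

    def one_vs_rest_loaders(normal_class, oe_dataset, batch_size=128):
        """Loaders for one vs. rest training with Outlier Exposure (OE).

        oe_dataset: an auxiliary image dataset disjoint from the test-time
        anomaly classes (e.g., 80 Million Tiny Images with CIFAR removed).
        """
        transform = transforms.Compose([
            transforms.ColorJitter(0.1, 0.1, 0.1),  # color jitter
            transforms.RandomCrop(32, padding=4),   # random cropping
            transforms.RandomHorizontalFlip(),      # horizontal flipping
            transforms.ToTensor(),
            # Gaussian pixel noise would be added here as a custom transform.
        ])
        train = datasets.CIFAR10("data", train=True, download=True,
                                 transform=transform)
        # Keep only the training images of the designated normal class.
        normal_idx = [i for i, y in enumerate(train.targets)
                      if y == normal_class]
        normal_loader = DataLoader(Subset(train, normal_idx), batch_size,
                                   shuffle=True, drop_last=True)
        # OE samples are drawn uniformly at random from the auxiliary corpus.
        oe_loader = DataLoader(oe_dataset, batch_size, shuffle=True,
                               drop_last=True)
        return normal_loader, oe_loader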

Competitors We compare Deep SAD and HSC to recent deep anomaly detection methods that have shown state-of-the-art results on the two benchmarks. These include a self-supervised method based on predicting Geometric Transformations (GT) [181], which subsequently has been improved in [222] (GT+). GT+ has been evaluated both with and without OE. We further include the results of a Focal loss classifier [327] trained with OE [222], a binary classifier that specifically addresses class imbalance. Finally, we add the results of an autoencoder (AE), and for CIFAR-10 also those of shallow SVDD and Deep SVDD, as unsupervised baselines.

Network Architectures and Optimization We use the same network φω in each experimental setup for Deep SAD and HSC. On CIFAR-10, we use a LeNet-type network having three convolutional layers with max-pooling, followed by two fully connected layers; we use (leaky) ReLU activations and apply batch normalization [248] in this network. On ImageNet, we use the same WideResNet [602] as [222], which has ResNet-18 as its architectural backbone. We use Adam [276] for optimization and balance every batch to contain 128 normal and 128 OE samples during training. We apply standard data augmentation using color jitter, random cropping, horizontal flipping, and Gaussian pixel noise.
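As a sketch of the optimization loop under these settings (Adam, batches balanced between normal and OE data; function and argument names here are illustrative, and loss_fn stands for a semi-supervised one-class objective such as the HSC loss sketched after the results discussion below):

    import torch

    def train(model, normal_loader, oe_loader, loss_fn, epochs=100, lr=1e-4):
        # Adam optimization with batches balanced to contain equally many
        # normal and OE samples (here 128 + 128, set via the two loaders).
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        model.train()
        for _ in range(epochs):
            # Both loaders are assumed to yield (image, label) tuples.
            for (x_norm, _), (x_oe, _) in zip(normal_loader, oe_loader):
                x = torch.cat([x_norm, x_oe])
                # y = 0 marks normal samples, y = 1 marks auxiliary anomalies.
                y = torch.cat([torch.zeros(len(x_norm)),
                               torch.ones(len(x_oe))])
                loss = loss_fn(model(x), y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()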

Results and Discussion The results on the CIFAR-10 and ImageNet one vs. rest benchmarks are shown in Tables 2.4 and 2.5, respectively. For ImageNet, we report the mean AUC over the 30 classes here and provide the results on all individual classes in Appendix C.2. First, we observe that using OE results in a markedly improved detection performance on both one vs. rest benchmarks. On CIFAR-10, Deep SAD and HSC achieve a detection performance of 94.5 and 95.9 AUC, respectively, whereas unsupervised Deep SVDD only reaches 64.8 AUC. GT+ with OE performs similarly to Deep SAD and HSC on CIFAR-10. On ImageNet, Deep SAD and HSC show an improved detection performance over GT+. Comparing Deep SAD to HSC, we see that HSC slightly outperforms Deep SAD on both benchmarks.

However, using the squared L2-norm with HSC yields results similar to Deep SAD (see the ablation in Appendix A.2), so the advantage of HSC seems to be mainly due to the robust pseudo-Huber loss, which arguably is reasonable in the OE setting, where the auxiliary OE corpus exhibits a lot of variation. Moreover, we remark that the self-supervised methods, GT and GT+, show a marked improvement over the other unsupervised methods on the CIFAR-10 benchmark even without OE. This indicates the potential of self-supervised methods for introducing inductive biases towards learning semantic representations (see also the discussion in Section 5.2.5), which proves advantageous on these object-level one vs. rest image benchmarks.
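To make this comparison concrete, the following sketches the HSC objective as we formulate it in [471], together with the squared L2-norm ablation variant. This is a minimal sketch, assuming features φω(x) from the network above and labels y = 1 for (auxiliary) anomalies; the eps term for numerical stability is an implementation assumption.

    import torch

    def pseudo_huber(z):
        # h(z) = sqrt(||z||^2 + 1) - 1: quadratic near the origin, linear
        # for large ||z||, hence robust to the variation in the OE corpus.
        return torch.sqrt((z ** 2).sum(dim=1) + 1.0) - 1.0

    def hsc_loss(z, y, eps=1e-6):
        # Binary cross-entropy with exp(-h(z)) as the pseudo-probability of
        # being normal: normal samples (y = 0) are pulled toward the origin,
        # OE anomalies (y = 1) are pushed away.
        h = pseudo_huber(z)
        return torch.where(y == 1,
                           -torch.log1p(-torch.exp(-h) + eps), h).mean()

    def hsc_loss_squared(z, y, eps=1e-6):
        # Ablation: squared L2-norm in place of the pseudo-Huber loss, which
        # yields results similar to Deep SAD (cf. Appendix A.2).
        h = (z ** 2).sum(dim=1)
        return torch.where(y == 1,
                           -torch.log1p(-torch.exp(-h) + eps), h).mean()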

Table 2.4: Detection performance in mean AUC in % (over 10 seeds) for various methods on the CIFAR-10 one vs. rest benchmark using 80 Million Tiny Images as OE. Results taken from the literature are marked with an asterisk [181, 222].

                       without OE                     with OE
             SVDD    AE   DSVDD   GT*   GT+*    GT+*  Focal*  DSAD    HSC
airplane     65.6   59.1   61.7   74.7   77.5   90.4   87.6   94.2   96.3
automobile   40.9   57.4   65.9   95.7   96.9   99.3   93.9   98.1   98.7
bird         65.3   48.9   50.8   78.1   87.3   93.7   78.6   89.8   92.7
cat          50.1   58.4   59.1   72.4   80.9   88.1   79.9   87.4   89.8
deer         75.2   54.0   60.9   87.8   92.7   97.4   81.7   95.0   96.6
dog          51.2   62.2   65.7   87.8   90.2   94.3   85.6   93.0   94.2
frog         71.8   51.2   67.7   83.4   90.9   97.1   93.3   96.9   97.9
horse        51.2   58.6   67.3   95.5   96.5   98.8   87.9   96.8   97.6
ship         67.9   76.8   75.9   93.3   95.2   98.7   92.6   97.1   98.2
truck        48.5   67.3   73.1   91.3   93.3   98.5   92.1   96.2   97.4

mean         58.8   59.4   64.8   86.0   90.1   95.6   87.3   94.5   95.9

Table 2.5: Detection performance in mean AUC in % (over 30 classes and 10 seeds) on the ImageNet-1K one vs. rest benchmark using ImageNet-22K (with the 1K classes removed) as OE. Results taken from the literature are marked with an asterisk [222].

         without OE                with OE
               AE      Focal*   GT+*   DSAD    HSC

mean          56.0      56.1    85.7   96.7   97.3

Lastly, we note that in [471], we present further results showing that standard BCE classification (and a re-implementation of the Focal loss) with OE surprisingly yields competitive detection results on the CIFAR-10 and ImageNet one vs. rest image benchmarks as well (a plain BCE-with-OE baseline is sketched below). Interestingly, the detection performance is moreover already competitive when relatively few (~128) OE examples are used.
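For comparison, the plain BCE-with-OE classifier examined in [471] amounts to no more than the following. This is a minimal sketch; the single logit output and function name are illustrative assumptions.

    import torch.nn.functional as F

    def bce_oe_loss(logits, y):
        # Standard binary cross-entropy with OE samples labeled anomalous
        # (y = 1); at test time, sigmoid(logit) directly serves as the
        # anomaly score.
        return F.binary_cross_entropy_with_logits(logits.squeeze(1), y)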

One possible hypothesis for this phenomenon, which we explore in [471], is that the multiscale structure of images makes even few example anomalies exceptionally informative.

Understanding these counter-intuitive findings presents an interesting question for future work. However, note that this observation may be limited to the object classes considered in the typical one vs. rest benchmarks, where the anomalies at testing time are fairly structured and distinct (objects from different classes), which is why more diverse and challenging anomaly detection benchmarks are needed in the community (see the discussion in Section 5.2.4). In an application to detecting more subtle manufacturing defects in the next Chapter 3, we found the use of OE from general natural images to be of limited benefit (see the experiments in Section 3.1.2).

Conclusions from this chapter:

• Deep SVDD introduces a deep one-class classification method for unsupervised anomaly detection, which extends the one-class classification approach from fixed features towards learning data representations.

• Deep one-class classification can significantly improve anomaly detection performance over shallow methods on complex data (e.g., images).

• A key challenge in deep one-class classification is avoiding a trivial feature map collapse, which can be addressed through introducing constraints and regularization.

• The Deep SAD and HSC methods present generalizations of Deep SVDD to the semi-supervised anomaly detection setting.

• Including few true anomalies or many auxiliary anomalies can both significantly improve anomaly detection performance.

Parts of this chapter are mainly based on:

[466] L. Ruff*, R. A. Vandermeulen*, N. Görnitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft. Deep One-Class Classification. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 4390–4399, 2018.

[469] L. Ruff, R. A. Vandermeulen, N. Görnitz, A. Binder, E. Müller, K.-R. Müller, and M. Kloft. Deep Semi-Supervised Anomaly Detection. In International Conference on Learning Representations, 2020.

With added content from:

[111] P. Chong, L. Ruff, M. Kloft, and A. Binder. Simple and Effective Prevention of Mode Collapse in Deep One-Class Classification. In International Joint Conference on Neural Networks, pages 1–9, 2020.

[467] L. Ruff, R. A. Vandermeulen, N. Görnitz, A. Binder, E. Müller, and M. Kloft. Deep Support Vector Data Description for Unsupervised and Semi-Supervised Anomaly Detection. In ICML 2019 Workshop on Uncertainty & Robustness in Deep Learning, 2019.

[471] L. Ruff, R. A. Vandermeulen, B. J. Franks, K.-R. Müller, and M. Kloft. Rethinking Assumptions in Deep Anomaly Detection. In ICML 2021 Workshop on Uncertainty & Robustness in Deep Learning, 2021.

3 Deep One-Class Classification for Computer Vision and NLP

The deep one-class classification methods we have introduced in Chapter 2 can be applied to any data type in any domain. In this chapter, we introduce two deep one-class classification variants that are also based on the basic principle of learning a concentrated feature space for the normal data, but additionally integrate domain-specific particularities. Fully Convolutional Data Description incorporates the property of spatial coherence that is important in computer vision, i.e., that neighboring pixels are correlated, by using a fully convolutional network architecture [342, 398], which produces an explanation heatmap as its output. Context Vector Data Description incorporates the multi-context nature of text found in NLP, i.e., that text samples can be interpreted in and assigned to different semantic contexts, by using a self-attention mechanism, which also enables model interpretability.
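To illustrate the fully convolutional principle behind Fully Convolutional Data Description, the following minimal sketch (an illustrative toy network, not the actual FCDD architecture) shows how omitting fully connected layers preserves spatial layout, so that per-location scores directly form a low-resolution explanation heatmap:

    import torch
    import torch.nn as nn

    class TinyFCN(nn.Module):
        # No fully connected layers: each output unit has a limited
        # receptive field, so the output retains the spatial arrangement
        # of the input image.
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.LeakyReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.LeakyReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(64, 1, 1),  # 1x1 conv: one score per location
            )

        def forward(self, x):
            return self.net(x)

    x = torch.randn(8, 3, 224, 224)  # a batch of images
    heatmap = TinyFCN()(x)           # shape (8, 1, 56, 56)
    # Upsampling the per-location scores to the input resolution yields an
    # explanation heatmap that localizes anomalous regions.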

3.1 Explainable One-Class Classification for Computer Vision