Abstract
In this project, we introduce Augmented RealNet (ARNet), an enhanced version of the RealNet framework for advanced feature reconstruction-based anomaly detection. ARNet integrates an improved training strategy and a foreground prediction module to achieve robust anomaly detection in real-world environments. By applying a variety of augmentations on top of synthetic anomalies, ARNet simulates a broad spectrum of perturbations that might occur in practical scenarios. The foreground prediction module precisely isolates objects from the background and uses this information to refine the reconstruction residuals. This approach effectively mitigates false alarms caused by background variations and changes in object positioning within the image. Our empirical evaluations demonstrate the superior performance of ARNet across diverse categories, achieving mean image-level and pixel-level F1 scores of 0.962 and 0.672, respectively.
Visual Anomaly Detection
Visual anomaly detection (VAD) [1] is essential in the industrial sector for identifying defects during the manufacturing process. Accurate anomaly detection not only guarantees product quality and reliability but also minimizes costs by reducing waste and recalls. Recent advancements in this field have been largely driven by machine learning, specifically through supervised [4], unsupervised [3, 6], and semi-supervised learning approaches [5]. Supervised methods, though powerful, often require extensive labeled datasets, which are impractical to obtain in many industrial situations. Therefore, unsupervised learning, particularly one-class classification models, has become popular. These models typically learn from data representing the system's normal operation, aiming to identify deviations from this norm as potential anomalies. However, they face significant challenges in terms of robustness, especially when exposed to real-world variations not present during training. Factors such as changes in camera specifications, lighting conditions, or gradual wear of mechanical components can cause shifts in data distribution, known as domain shifts. These shifts can drastically impact the performance of anomaly detection systems, leading to false positives or missed detections. Thus, the development of robust VAD systems that can adapt over time and handle these real-world variations is crucial. Enhancing anomaly detection models' adaptability to cope with domain shifts and maintain consistent performance despite external changes is a key requirement for their successful deployment in industrial settings.
The Anomaly Detection Challenge
The Visual Anomaly and Novelty Detection (VAND) challenge addresses the need for anomaly detection systems that can handle real-world conditions not usually represented in training datasets. The main goal is to build models that can withstand unpredictable shifts in domains [2], reflecting real-world situations where data capture conditions may change over time. The dataset utilized for the challenge is MVTec AD [12], comprising 15 categories of diverse industrial objects and material textures. Participants are encouraged to adopt a one-class training paradigm, which involves creating models using images from normal conditions only. This approach is crucial to ensure that the models can generalize to unseen anomalies without previous exposure to specific defects or abnormalities. The test set for model robustness evaluation includes an undisclosed set of perturbations, simulating real-world changes and noises such as lighting variations, camera noise induced by equipment quality, and camera angle shifts. The challenge employs stringent evaluation criteria, based on image-level and pixel-level F1-max scores, to thoroughly assess the models' anomaly detection capabilities under various altered conditions. This ensures that the developed solutions are applicable in real-world industrial scenarios and can adapt to different environments, a critical aspect for industrial applications.
Model Selection Approach
To address robust anomaly detection, we start our model selection process by examining the most promising architectures. We compare their strengths and weaknesses in the context of robust anomaly detection, which leads us to select our initial architecture for model development. Recently, various strategies aiming to tackle industrial anomaly detection have emerged, with three main approaches leading the way: student-teacher methods [7], patch-matching methods [8, 9], and reconstruction methods [10]. All three have shown impressive results, especially on the MVTec AD dataset [12]. For the student-teacher approach, we looked at EfficientAD [7], while PatchCore [8] represents the patch-matching method, and RealNet exemplifies the reconstruction-based method [10]. We compared their performance on an augmented image set, reflecting real-world scenarios. We focused on the more challenging categories where anomaly detection models usually underperform, such as cable, screw, pill, and capsule. The comparison of these three models' performance is shown in Figure 1.
Our preliminary findings show that RealNet [10], the reconstruction-based method, generally outperforms the others, except in the cable category. This isn't surprising, since the supervised training in reconstruction methods helps the model identify anomalies even with augmentations. Patch-based methods, on the other hand, require a larger memory bank to store diverse features, which can slow down training and inference. The student-teacher approach can be more complex, especially in the design of the auto-encoder to handle a variety of augmentations. Based on these findings, we decided to adopt the reconstruction-based approach, with RealNet as our base model, to address this challenge. We also added a foreground prediction block to this network and implemented an augmented training strategy. Therefore, we named our model Augmented-RealNet (ARNet).
Model Architecture
The architecture of our proposed method, ARNet, is presented in Figure 2. The blue blocks depict the original RealNet blocks, while the orange ones represent our enhancements. The detection flow of the model is as follows. Pretrained features of the input image, generated by a WideResNet50 backbone, first pass through the Anomaly-aware Feature Selection (AFS) block. This block selects feature channels based on their ability to identify anomalous regions in an image. Essentially, it chooses the top-K feature indices that maximize the distance between normal and anomalous images in the anomaly regions, averaged over all training samples. This selection process reduces training and computation cost. The AFS block's output features then enter the feature reconstruction block, composed of several UNets, one per layer of the pretrained features. Each UNet's objective is to reconstruct the original features from the anomalous ones. Subtracting the features before and after reconstruction yields residuals that reveal the anomalous regions.
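As a concrete illustration of the selection criterion, the following minimal sketch ranks channels by the squared residual between normal and synthetic-anomalous features inside the anomaly regions; the function name and this particular distance are our assumptions for illustration, not necessarily RealNet's exact formulation.

import torch

def select_afs_channels(normal_feats, anomalous_feats, anomaly_masks, k):
    """Pick the top-k feature channels whose normal-vs-anomalous distance,
    measured inside the synthetic anomaly regions, is largest on average
    (illustrative sketch of the AFS idea, not the exact criterion).

    normal_feats, anomalous_feats: (N, C, H, W) pretrained backbone features
    anomaly_masks:                 (N, 1, H, W) binary synthetic-anomaly masks
    k:                             number of channels to keep
    """
    residual = (normal_feats - anomalous_feats) ** 2   # per-pixel distance
    masked = residual * anomaly_masks                  # anomaly regions only
    per_channel = masked.sum(dim=(0, 2, 3)) / anomaly_masks.sum().clamp(min=1)
    return torch.topk(per_channel, k).indices          # selected channel indices

The selected indices would then be used to slice the backbone features, e.g. feats[:, indices], for all subsequent blocks.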
We also included a foreground prediction block, which uses foreground information to re-weight the residuals of each layer, helping the model focus on meaningful foreground regions, thereby eliminating false triggers and improving detection accuracy for object categories. This block takes the first three layers of the pretrained features and predicts the foreground area, supervised by ground truth foreground maps. These ground truth maps are generated by an automated algorithm built on the Segment Anything Model (SAM) [11]. The foreground prediction block, designed like a UNet, upscales lower-resolution features, concatenates them with the next higher-resolution layer, and passes the result through a convolutional block. This process repeats until all layers are combined and finally interpolated to recover the full-resolution foreground map. This foreground prediction is particularly useful when images are taken from various camera angles and the object is not always in the image center.

The final block of the model is the Reconstruction Residuals Selection (RRS) module. This module upscales the lower-resolution features so that all residuals have the same resolution. It then performs global average and global max pooling on the features to select the top-K residuals, discarding residuals with insufficient anomaly information. Finally, the selected residuals go to the discriminator, which is trained with a cross-entropy loss to predict anomalous pixels based on the synthetic anomalies' ground truth. Originally, RealNet [10] has two losses: the feature reconstruction loss ($L_{recon}$) and the discriminator's segmentation loss ($L_{seg}$). We also added a cross-entropy loss for foreground prediction ($L_{fg}$). The total loss is as shown below:

$$L_{total} = L_{recon} + L_{seg} + L_{fg}$$
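To make the residual re-weighting and selection steps concrete, the sketch below upsamples per-layer residuals to a common resolution, multiplies them by the predicted foreground map, and keeps the channels with the strongest pooled anomaly evidence. The function name, the choice of target resolution, and the exact ranking rule are illustrative assumptions, not a verbatim reproduction of the RealNet or ARNet implementation.

import torch
import torch.nn.functional as F

def fuse_residuals(residuals, foreground, k):
    """Upsample per-layer reconstruction residuals, re-weight them by the
    predicted foreground map, and keep the top-k channels ranked by pooled
    anomaly evidence (illustrative sketch of the RRS idea).

    residuals:  list of (N, C_i, H_i, W_i) tensors, one per backbone layer
    foreground: (N, 1, H, W) predicted foreground probability map
    k:          number of residual channels passed to the discriminator
    """
    target = residuals[0].shape[-2:]  # assume the first layer is the largest
    upsampled = [F.interpolate(r, size=target, mode="bilinear",
                               align_corners=False) for r in residuals]
    fg = F.interpolate(foreground, size=target, mode="bilinear",
                       align_corners=False)
    stacked = torch.cat(upsampled, dim=1) * fg  # foreground re-weighting
    # Rank channels by global average plus global max pooled magnitude.
    score = stacked.mean(dim=(2, 3)) + stacked.amax(dim=(2, 3))  # (N, sum C_i)
    idx = torch.topk(score, k, dim=1).indices                    # (N, k)
    selected = torch.gather(
        stacked, 1, idx[..., None, None].expand(-1, -1, *target))
    return selected  # (N, k, H, W), input to the discriminator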
Data Augmentations
The main goal of incorporating augmentations is to help the model adjust to potential real-world domain shifts during inference. The augmentations used in training ARNet include Gaussian noise, blur, RGB shift, brightness and contrast, and rotations and translations. Gaussian noise and blur imitate camera sensor noise and defocus, respectively. RGB shift is employed to account for color reproduction variations across camera brands. Brightness and contrast adjustments simulate different times of day and lighting conditions, which can change based on the deployment scenario. To simulate various camera angles, we use a variety of rotations and translations. It is worth noting that for texture categories we only use fixed rotations of 90, 180, or 270 degrees, as intermediate angles would require filling the texture borders with black pixels. Conversely, for object categories, our model can perform background subtraction, so we use shifting, scaling, and rotation, occasionally with border filling. To induce camera zoom-in and translation effects, we use random resized crop augmentation. Finally, we use horizontal and vertical flip augmentations to further add to the variation of the training images. Figure 3 illustrates some examples of augmented images.
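A minimal albumentations-style sketch of such a pipeline is shown below. The library choice and all probabilities and limits are illustrative placeholders rather than our exact training configuration, and the RandomResizedCrop signature varies across albumentations versions.

import albumentations as A

# Object categories: geometric augmentations may use border filling, since
# the foreground prediction suppresses background artifacts.
object_transforms = A.Compose([
    A.GaussNoise(p=0.3),                                 # sensor noise
    A.GaussianBlur(blur_limit=(3, 7), p=0.3),            # defocus
    A.RGBShift(r_shift_limit=20, g_shift_limit=20,
               b_shift_limit=20, p=0.3),                 # color reproduction
    A.RandomBrightnessContrast(p=0.3),                   # lighting conditions
    A.ShiftScaleRotate(shift_limit=0.0625, scale_limit=0.1,
                       rotate_limit=30, p=0.5),          # camera angle/position
    A.RandomResizedCrop(height=256, width=256,
                        scale=(0.8, 1.0), p=0.3),        # zoom-in/translation
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
])

# Texture categories: rotations restricted to multiples of 90 degrees so
# no border filling is needed.
texture_transforms = A.Compose([
    A.RandomRotate90(p=0.5),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
])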
Training and Evaluation
We mainly used the MVTec AD [12] dataset for this challenge, including synthetic anomalous images generated with the SDAS method from RealNet [10]. We combined the synthesized images with the normal images for training and reserved all test images for evaluation. As the challenge aims to develop a system robust enough to handle data capture variations, we applied augmentations to both the training and test sets. To ensure our evaluation reflects our model's ability to handle diverse domain shifts, we tripled the test set: we generated three versions of each test image, each with a different augmentation, to provide the most reliable evaluation of our model.
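A small sketch of this test-set expansion, assuming an albumentations-style transform as above (the helper name is ours):

def triple_test_set(test_images, transform):
    """Return three independently augmented copies of every test image so
    the evaluation covers a wider range of simulated domain shifts."""
    return [transform(image=img)["image"]
            for img in test_images
            for _ in range(3)]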
We use a one-class training paradigm, training each class individually. Each category is trained on a single RTX 3090 GPU with a batch size of 16 and a learning rate of 0.0001 using the Adam optimizer. We train for 1500 epochs, 500 more than RealNet, to account for the longer convergence time due to extensive augmentations in the training dataset.
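A skeleton of the per-category training loop under these settings; build_arnet, train_loader, and training_step are hypothetical placeholders for the actual model, data pipeline, and loss computation:

import torch

model = build_arnet(category)              # hypothetical constructor
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
for epoch in range(1500):                  # 500 epochs more than RealNet
    for batch in train_loader:             # batch size 16
        optimizer.zero_grad()
        loss = model.training_step(batch)  # L_recon + L_seg + L_fg
        loss.backward()
        optimizer.step()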
For evaluation, we use the well-known image-level Area Under the Receiver Operating Characteristic curve (AUROC) metric and the F1-max metric at both the image and pixel levels. These are also the target metrics for the challenge evaluation.
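F1-max is the best F1 score over all decision thresholds. A compact way to compute it from continuous anomaly scores, using the scikit-learn precision-recall curve (the helper name is ours):

import numpy as np
from sklearn.metrics import precision_recall_curve

def f1_max(labels, scores):
    """Best F1 over all thresholds. For the image-level metric, labels and
    scores have one entry per image; for the pixel-level metric they are
    the flattened per-pixel ground truth and anomaly maps."""
    precision, recall, _ = precision_recall_curve(labels, scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return float(f1.max())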
Results and Discussions
The performance evaluation of our proposed model, ARNet, is summarized in Table 1, which compares the model's performance on the test set with and without augmentations. In spite of heavy augmentations, a mean image-level F1-max score of 0.962 and a pixel-level F1-max score of 0.672 on the augmented test set reveal the robustness of our model. It is evident from the results that the model's overall performance remains close to that on the original test set, demonstrating its adaptability.
Conclusion and Future Work
In conclusion, we propose ARNet, a feature reconstruction model for robust anomaly detection. It uses synthetic data and image augmentations, and is trained in a supervised manner. It displays strong image- and pixel-level detection performance on the MVTec AD dataset. Its foreground prediction helps to remove unintended disturbances in the background and makes the model more robust to changes in object position. Our carefully designed training data augmentations result in strong image- and pixel-level F1-max scores of 0.962 and 0.672, respectively. This showcases the potential robustness of our model in addressing real-world anomaly detection challenges. As future work, we plan to further study category-specific limitations of our model and make the necessary adjustments to our detection framework.
References
[1] Pang, Guansong, Chunhua Shen, Longbing Cao, and Anton Van Den Hengel. "Deep learning for anomaly detection: A review." ACM computing surveys (CSUR) 54, no. 2 (2021).
[2] Jeong, Jongheon, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, and Onkar Dabeer. "WinCLIP: Zero-/few-shot anomaly classification and segmentation." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19606-19616. 2023.
[3] Bergmann, Paul, Kilian Batzner, Michael Fauser, David Sattlegger, and Carsten Steger. "Beyond dents and scratches: Logical constraints in unsupervised anomaly detection and localization." International Journal of Computer Vision 130, no. 4 (2022): 947-969.
[4] Hojjati, Hadi, Thi Kieu Khanh Ho, and Narges Armanfard. "Self-supervised anomaly detection in computer vision and beyond: A survey and outlook." Neural Networks (2024): 106106.
[5] Villa-Pรฉrez, Miryam Elizabeth, et al. "Semi-supervised anomaly detection algorithms: A comparative summary and future research directions." Knowledge-Based Systems 218 (2021): 106878.
[6] Cui, Yajie, Zhaoxiang Liu, and Shiguo Lian. "A survey on unsupervised anomaly detection algorithms for industrial images." IEEE Access (2023).
[7] Batzner, Kilian, Lars Heckler, and Rebecca König. "EfficientAD: Accurate visual anomaly detection at millisecond-level latencies." In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2024.
[8] Roth, Karsten, et al. "Towards total recall in industrial anomaly detection." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[9] Li, Hanxi, et al. "Target before shooting: Accurate anomaly detection and localization under one millisecond via cascade patch retrieval." arXiv preprint arXiv:2308.06748 (2023).
[10] Zhang, Ximiao, Min Xu, and Xiuzhuang Zhou. "RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection." arXiv preprint arXiv:2403.05897 (2024).
[11] Kirillov, Alexander, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao et al. "Segment anything." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4015-4026. 2023.
[12] Bergmann, Paul, Michael Fauser, David Sattlegger, and Carsten Steger. "MVTec AD: A comprehensive real-world dataset for unsupervised anomaly detection." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9592-9600. 2019.