Visual anomaly detection in dynamic environments plays a crucial role in various applications. This work explores the challenges associated with this task, namely domain shift and the need to localize diverse anomaly types. We propose a solution that leverages data augmentation techniques to address domain shift and enhance model generalizability. We evaluate the robustness of our approach by applying strong data augmentation to the test data.
Introduction

Visual anomaly detection automatically identifies and locates unexpected changes or deviations from the expected norm in images. This technology has a wide range of applications, including industrial inspections where it can detect defects in manufactured products without human intervention. This not only increases productivity but also enhances quality control by catching anomalies early in the production line.
However, a major challenge in visual anomaly detection is the difficulty of collecting data that encompasses every possible anomaly. To overcome this limitation, researchers are actively working on developing methods that can generalize well to unseen scenarios, allowing them to effectively detect different types of anomalies.
Current evaluation methods often rely on test sets with ideal conditions, such as good lighting, high image quality, and consistently centered objects. This gap between training data and real-world complexities can hinder performance.
The VANDv2.0 challenge has emerged to address this limitation. Participants use the MVTec AD dataset [1] to develop models that can detect and localize defects across 15 object categories. To simulate real-world challenges, the test set is randomly perturbed before evaluation.
Methodology

Model design
The scarcity of abnormal image data is a major challenge in visual anomaly detection. To tackle this issue, we propose a method based on DeSTSeg [2], a cutting-edge method that leverages the power of synthetically anomalous images for training. This approach allows the model to excel at identifying anomalies despite being trained primarily on normal data.
DeSTSeg [2] utilizes the student-teacher paradigm (knowledge distillation). It's comprised of three key components:
- Denoising Student Network: This network processes images with synthetically added anomalies, aiming to remove these distortions and reconstruct anomaly-free features.
- Teacher Network: This network is fixed. It serves as a mentor, receiving clean, original images as input. By learning from these pristine examples, the teacher network provides guidance to the student network.
- Segmentation Network: The segmentation network takes the features extracted by the denoising student network as input to pinpoint the exact location of the anomaly within the image.
The DeSTSeg [2] training process occurs in two stages:
Stage 1: Guiding the Student Network: The student network takes images with random synthetic anomalies as input, while the teacher network receives clean, normal images.
The teacher guides the student by minimizing the cosine distance between reconstructed features from the student network and features of the teacher network.
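The cosine-distance objective of Stage 1 can be sketched as follows. This is a minimal NumPy illustration, not the actual DeSTSeg implementation, which computes this loss over multi-scale network features in a deep-learning framework:

```python
import numpy as np

def cosine_distance_loss(student_feats, teacher_feats):
    """Mean per-location cosine distance between two feature maps.

    Both inputs have shape (C, H, W): a C-dimensional feature vector at
    each spatial location. Distance = 1 - cosine similarity, so
    perfectly reconstructed features give a loss near 0.
    """
    c, _, _ = student_feats.shape
    s = student_feats.reshape(c, -1)  # (C, H*W)
    t = teacher_feats.reshape(c, -1)
    eps = 1e-8  # avoid division by zero for all-zero feature vectors
    cos_sim = (s * t).sum(axis=0) / (
        np.linalg.norm(s, axis=0) * np.linalg.norm(t, axis=0) + eps
    )
    return float(np.mean(1.0 - cos_sim))

feats = np.random.rand(64, 8, 8)
print(cosine_distance_loss(feats, feats))  # ~0 for identical features
```

Minimizing this distance pushes the student's reconstructed features toward the teacher's features on clean images, even when the student's input was synthetically corrupted.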
Stage 2: Anomaly Localization with Segmentation: In a supervised learning manner, the segmentation network is trained using features extracted by both the student and teacher networks. This allows the segmentation network to pinpoint the exact location of anomalies within the image.
It's important to note that DeSTSeg [2] utilizes minimal data augmentation during training, primarily focusing on slight rotations. To enhance the model's robustness against real-world variations, we propose incorporating additional geometric and spatial data augmentation techniques. These techniques include translation, scaling, rotation (with a wider range), brightness adjustments, contrast variations, and the introduction of Gaussian noise (see Table 1).
The MVTec AD dataset [1] consists of 15 objects, with each object having only normal images in the training set. The test set, however, incorporates various types of defects, such as cracks, scratches, contamination, and breakage.
Due to the diverse object types, a single data augmentation strategy would not be effective. For instance, flipping an image of a cable swaps the wire colors, which the test set treats as an anomaly; the model would then learn anomalous features as if they were normal. Flipping has no such effect on objects like wood or hazelnuts. We therefore use the category-specific settings for geometric transformations listed in Table 1.
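One way to organize such category-specific settings is a simple lookup table. The values below are purely illustrative placeholders, not the actual settings from Table 1:

```python
# Hypothetical per-category augmentation settings (illustrative only;
# see Table 1 for the settings actually used in this work).
AUG_CONFIG = {
    "cable":    {"hflip": False, "rotate_deg": 5,  "scale": (0.95, 1.05)},
    "wood":     {"hflip": True,  "rotate_deg": 30, "scale": (0.90, 1.10)},
    "hazelnut": {"hflip": True,  "rotate_deg": 45, "scale": (0.90, 1.10)},
}

def get_aug_settings(category):
    # Fall back to conservative defaults for unlisted categories.
    return AUG_CONFIG.get(
        category, {"hflip": False, "rotate_deg": 10, "scale": (0.95, 1.05)}
    )

print(get_aug_settings("cable")["hflip"])  # False: flipping would swap wire colors
```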
We implemented a two-step augmentation process:
Spatial Data Augmentation: This is applied first and includes translation, scaling, and rotation within a category-specific range.
Pixel-level Data Augmentation: This includes random brightness, contrast, and Gaussian noise.
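The two-step ordering can be sketched as below. This is a minimal NumPy stand-in (translation only for the spatial step); in practice an augmentation library would supply the full set of translate/scale/rotate transforms:

```python
import numpy as np

rng = np.random.default_rng(0)

def spatial_augment(img, max_shift=10):
    # Random translation, standing in for the full translate/scale/rotate step.
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(img, shift=(dy, dx), axis=(0, 1))

def pixel_augment(img, brightness=0.1, contrast=0.1, noise_std=0.02):
    # Random contrast/brightness jitter, then additive Gaussian noise.
    b = rng.uniform(-brightness, brightness)
    c = rng.uniform(1.0 - contrast, 1.0 + contrast)
    out = np.clip(img * c + b, 0.0, 1.0)
    out = out + rng.normal(0.0, noise_std, size=out.shape)
    return np.clip(out, 0.0, 1.0)

def augment(img):
    # Step 1: spatial transforms; step 2: pixel-level transforms.
    return pixel_augment(spatial_augment(img))

img = rng.random((64, 64, 3))  # float image in [0, 1]
aug = augment(img)
print(aug.shape)
```

Applying the spatial transforms first keeps the pixel-level perturbations independent of object position, so the same photometric noise statistics hold regardless of where the object lands in the frame.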
Our model's performance was evaluated using the harmonic mean of F1 scores (both pixel-level and image-level) to provide a comprehensive assessment.
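The harmonic mean used as the evaluation score can be computed as follows; it penalizes imbalance, so a model must score well on both image-level detection and pixel-level localization:

```python
def harmonic_mean(image_f1, pixel_f1):
    """Harmonic mean of image-level and pixel-level F1 scores.

    Returns 0 if either score is 0; otherwise it is pulled toward
    the weaker of the two scores.
    """
    if image_f1 == 0 or pixel_f1 == 0:
        return 0.0
    return 2 * image_f1 * pixel_f1 / (image_f1 + pixel_f1)

print(harmonic_mean(0.9, 0.9))  # 0.9 when both scores agree
print(harmonic_mean(0.9, 0.5))  # ~0.643, closer to the weaker score
```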
Results

Table 2 presents the harmonic mean scores (image-level F1 and pixel-level F1 metrics), comparing our model to DeSTSeg [2]. We observe an improvement in the harmonic mean for our model, suggesting that the data augmentation techniques were beneficial.
In visual anomaly detection, our goal is to identify anomalous pixels within images in dynamic environments. This task faces two main challenges:
1. Domain Shift: This occurs when the training data and real-world data come from different distributions. To address this, we implemented various data augmentation techniques (as detailed in the methodology section) to increase the model's ability to generalize to unseen scenarios.
2. Localizing Diverse Anomalies: Different anomaly types have distinct characteristics. We categorize anomalies into two types:
- Object-Level Anomalies: These occur directly on the object (e.g., cracks in hazelnuts, scratches on wood). Synthetic anomalies like Perlin noise during training can be effective for this type.
- Scene-Level Anomalies: These involve the absence of an entire object or part of one (e.g., a manipulated front on a screw, or a misplaced object in the transistor category). Here, the model needs exposure to a wider variety of anomalies to learn scene reconstruction and identify missing objects or parts, which Perlin noise cannot simulate.
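For object-level anomalies, synthetic anomaly masks are typically generated by thresholding noise. A simplified stand-in for the Perlin-noise masks is sketched below (nearest-neighbour upsampling of a coarse random grid rather than true Perlin noise):

```python
import numpy as np

def synthetic_anomaly_mask(size=256, grid=8, threshold=0.6, seed=0):
    """Binary anomaly mask from thresholded low-frequency noise.

    Simplified stand-in for Perlin-noise masks: sample a coarse random
    grid, upsample it to image resolution, and threshold. The mask marks
    regions where synthetic anomalies are pasted onto normal images.
    """
    rng = np.random.default_rng(seed)
    coarse = rng.random((grid, grid))
    # Nearest-neighbour upsampling of the coarse grid to full resolution.
    mask = np.kron(coarse, np.ones((size // grid, size // grid)))
    return (mask > threshold).astype(np.uint8)

mask = synthetic_anomaly_mask()
print(mask.shape, mask.dtype)
```

Such masks produce plausible local defects, but, as noted above, they cannot imitate scene-level anomalies such as missing or misplaced parts.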
To evaluate the model's robustness, we apply strong data augmentation techniques to the test data. This creates a more challenging benchmark for the model and allows us to assess its performance under more diverse conditions. The results of this evaluation, including the harmonic mean for each anomaly set, can be found in Table 3.
For future work, we plan to explore advanced data augmentation techniques and address scene-level anomalies by applying synthetic anomaly generation using diffusion models. Diffusion models could be used to inpaint objects to simulate their absence.
In this work, we addressed the challenges of visual anomaly detection in dynamic environments, namely domain shift and the localization of diverse anomaly types. To mitigate domain shift, we employed various data augmentation techniques, enhancing the model's ability to generalize to unseen scenarios. We evaluated the model's robustness using strong data augmentation on test data. The results, presented in Table 2 and Table 3, demonstrate the effectiveness of our approach.
References

1. Bergmann, Paul, et al. "MVTec AD--A comprehensive real-world dataset for unsupervised anomaly detection." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
2. Zhang, Xuan, et al. "DeSTSeg: Segmentation guided denoising student-teacher for anomaly detection." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.