We present an unsupervised method for industrial anomaly detection that targets both structural and logical anomalies in few-shot settings. The method first segments each image with SAM (Kirillov et al., 2023). It then aligns the segments across images, using DeAOT (Yang and Yang, 2022) as a Multi-Object Tracking (MOT) algorithm to assign each segment to a consistent category. Finally, DINOv2 (Oquab et al., 2023) extracts discriminative features, from which class-aware anomaly scores are derived.
Once the categories are aligned, our method departs from traditional approaches by computing 7 feature similarity measures: Class-agnostic Patch Representations, Class-aware Patch Representations, Foreground Patch Representations, Class Embeddings, Class Histograms, Class Colors, and Class Locations. Each measures how abnormal a test image is by comparing it against a set of normal reference images.
The final anomaly score aggregates these feature similarities. The computation requires no additional training data, and it detects both structural and logical anomalies. On the MVTec LOCO dataset, our approach achieves the highest F1-scores among state-of-the-art few-shot methods, including WinCLIP (Jeong et al., 2023), ComAD (Liu et al., 2023), PSAD (Kim et al., 2024) and AnomalyDINO (Damm et al., 2024). These results indicate the potential of our method for advancing industrial anomaly detection.
Introduction
Background
Anomaly detection is a challenge frequently encountered in the visual quality inspection of products on industrial production lines. Within the MVTec and MVTec LOCO datasets, anomalies are generally categorized into two types:
- Structural anomalies: These typically refer to discernible differences from normal images in appearance, encompassing both subtle variations and more significant structural defects. Specific examples include scratches, dents, or contamination in manufactured products.
- Logical anomalies: These usually denote violations of underlying constraints, such as objects appearing in invalid locations or the complete absence of required objects.
Conventional methods have primarily addressed structural anomalies; however, issues such as missing components or misalignments in industrial production lines often necessitate methods capable of detecting logical anomalies.
Furthermore, the scarcity of clean training data on industrial lines has created a need for few-shot anomaly detection methods that require only a handful of samples. With the advancement of multimodal large models, some approaches have begun to leverage these models for few-shot anomaly detection. Such methods typically define prompts describing normal and anomalous images, then detect anomalies from the similarity between text embeddings and image embeddings; WinCLIP (Jeong et al., 2023) is one example. Other approaches, such as AnomalyDINO (Damm et al., 2024), forgo prompts and instead rely on a powerful pre-trained image encoder. Lastly, a subset of methods, exemplified by ComAD (Liu et al., 2023) and PSAD (Kim et al., 2024), employ semantic segmentation and alignment to extract features specific to logical anomalies.
Challenge Description
In Challenge 2, our focus lies primarily on the MVTec LOCO dataset, which is characterized by logical anomalies. The challenge demands the detection of both structural and logical anomalies under few-shot conditions, that is, with training sets of 1, 2, 4, or 8 images. The PUAD paper (Sugawara and Imamura, 2024) further divides the anomaly types in MVTec LOCO into picturable and unpicturable anomalies. Picturable anomalies are those that can be represented by anomaly maps. Unpicturable anomalies are those that cannot be represented by anomaly maps, such as the absence of screws. Structural anomalies generally fall under picturable anomalies, whereas logical anomalies may be either picturable or unpicturable.
We posit that the principal challenge resides in addressing unpicturable anomalies. Consequently, we have devised a novel methodology specifically tailored to tackle the issues encountered in the detection of unpicturable anomalies.
Methodology
Model Design
- Approach
To address the issue of unpicturable anomalies, we have developed a method grounded in semantic segmentation. Semantic segmentation allows us to partition the original image into distinct segments for analysis, enabling us not only to discern mere patch dissimilarities but also to comprehend the compositional makeup of the image. We leverage the variations in this compositional information to gauge the presence of logical anomalies between two images. Through the analysis of the aforementioned anomaly categories, we have ascertained that certain missing categories can be identified by the pixel area of the semantic categories, while issues such as misplacement can be discerned through location analysis.
In addition to this, we have encoded semantic color information and semantic feature information, which serve as supplements to the other compositional data. By amalgamating semantic information with patch dissimilarity data, we have formulated an algorithm capable of detecting not only logical anomalies but also structural anomalies.
Our architecture is depicted in the figure, and the way each pre-trained model is used is described in detail in the Training section. Note that our approach does not train any model weights.
- Training
First, we process the training data with SAM for semantic segmentation and apply post-processing steps to discard the background. Background anomalies are not our detection target, and restricting anomaly detection to the foreground improves precision, as shown by the experimental results in the AnomalyDINO paper. We then store the semantic segmentation map of one image (typically the first one) as the reference image for MOT tracking with the DeAOT model during subsequent inference. While obtaining the semantic segmentation maps of the training images, we also run DINOv2 for feature extraction, yielding Patch Representations that retain position information along the H and W dimensions. Combined with the semantic segmentation maps, these representations give us the deep features of the pixels in each semantic class. Based on these class-aware deep features, we designed 7 features for structural and logical anomalies, listed in two groups:
Structural anomaly:
- Class-agnostic Patch Representation: This is the direct output feature map from DINOv2. Its dimension is (H*W, D), where D = 1536 is the feature dimension of DINOv2.
- Class-aware Patch Representation: For each category i, we have a feature matrix of dimension (X_i, D), where X_i is the number of pixels within that semantic category.
- Foreground Patch Representation: We aggregate all semantic categories except the background into a single feature matrix of dimension (X, D), where X is the total number of foreground pixels. The background is defined as a mask that touches two edges of the image.
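The grouping of patch features by class can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the label map is assumed to be already downsampled to the DINOv2 patch grid, and "touches two edges" is interpreted as "touches at least two of the four image borders".

```python
import numpy as np

def touches_two_edges(mask):
    """True if a boolean (H, W) mask touches at least two image borders.

    Assumption: "two edges" is read as "at least two of top/bottom/left/right".
    """
    edges = [mask[0, :].any(), mask[-1, :].any(),
             mask[:, 0].any(), mask[:, -1].any()]
    return sum(bool(e) for e in edges) >= 2

def split_patches(features, labels, background_ids):
    """Group patch features (H*W, D) by their class label (H*W,).

    Returns a dict class_id -> (X_i, D) array for every non-background class,
    and the (X, D) array of all foreground patches.
    """
    per_class = {int(c): features[labels == c]
                 for c in np.unique(labels) if int(c) not in background_ids}
    foreground = features[~np.isin(labels, list(background_ids))]
    return per_class, foreground
```

A segment whose mask passes `touches_two_edges` would be added to `background_ids` before calling `split_patches`.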
Logical anomaly:
- Class Embedding: This involves averaging all features of each category, subsequently obtaining a one-dimensional feature vector of dimension (C,), used to describe the image's category embedding.
- Class Histogram: This is a straightforward feature. We tally the pixel count for each semantic category, resulting in a one-dimensional histogram of dimension (C,), serving as the histogram feature of the image.
- Class Color: We convert the RGB images of each category into CIELAB images, omitting the L dimension to indicate our disregard for brightness variations. Then, for each category, we divide A by B for every pixel and take the average pixel value, obtaining the color information for each category. Aggregating all categories, we obtain a color feature of dimension (C,).
- Class Location: We average the location of each category, then calculate the distance of this average to (0,0), which represents the relative location of the category to (0,0). Aggregating every category, we obtain a location feature of dimension (C,).
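The histogram, color, and location features above can be sketched in NumPy as follows. This is a hedged illustration under stated assumptions: the function names are hypothetical, the image is assumed to be already converted to CIELAB, classes are integer labels in `[0, C)`, and the small `eps` guard against division by zero is our addition rather than something the text specifies.

```python
import numpy as np

def class_histogram(labels, num_classes):
    """Pixel count per semantic class -> (C,)."""
    return np.bincount(labels.ravel(), minlength=num_classes).astype(np.float64)

def class_color(lab_image, labels, num_classes, eps=1e-6):
    """Mean a/b ratio per class -> (C,). lab_image is (H, W, 3) in CIELAB; L is ignored."""
    a = lab_image[..., 1].astype(np.float64)
    b = lab_image[..., 2].astype(np.float64)
    safe_b = np.where(np.abs(b) < eps, eps, b)  # assumption: avoid division by zero
    ratio = a / safe_b
    out = np.zeros(num_classes)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            out[c] = ratio[mask].mean()
    return out

def class_location(labels, num_classes):
    """Distance from (0, 0) to the mean pixel position of each class -> (C,)."""
    out = np.zeros(num_classes)
    for c in range(num_classes):
        ys, xs = np.nonzero(labels == c)
        if ys.size:
            out[c] = np.hypot(ys.mean(), xs.mean())
    return out
```

Classes that are absent from an image receive a value of 0 in this sketch, so a missing component shows up as a change in the corresponding histogram entry.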
For both training and inference images, we obtain the aforementioned 7 features through a pre-trained large model. Subsequently, for each feature, we compare the test data with the training data to calculate a score, serving as a measure of anomaly severity. For these 7 features, the preparation of the memory bank for training data and the score calculation for test data will be explained separately:
- Class-agnostic Patch Representation: We aggregate the feature maps (H*W, D) obtained from all training images to form a memory bank of (H*W*k, D), where k is the number of images in the k-shot training data. Then, for each of the H*W patches in a test feature map (H*W, D), we find its nearest neighbor (D,) in the memory bank and compute the cosine similarity to it. Finally, we average the 2% most dissimilar values of the resulting (H*W,) similarity vector to obtain an anomaly score.
- Class-aware Patch Representation: Similarly, we obtain an anomaly score for each category, and then the score of the category with the highest score is taken as the anomaly score for this feature.
- Foreground Patch Representation: Similarly, we obtain an anomaly score for the foreground.
- Class Embedding: With only one one-dimensional feature vector per image, calculating the anomaly score is much simpler. We use the feature vectors of k training images as a memory bank, then the feature vector of the test image finds its nearest neighbor in the memory bank, and the dissimilarity of this nearest neighbor is the anomaly score.
- Class Histogram: Same as Class Embedding.
- Class Color: Same as Class Embedding.
- Class Location: Same as Class Embedding.
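The scoring rules listed above can be sketched as follows. This is a minimal sketch under assumptions, not the authors' code: the function names are hypothetical, dissimilarity for patches is taken as one minus cosine similarity, and Euclidean distance is assumed for the image-level vectors because the text does not name a metric.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def patch_anomaly_score(test_patches, memory_bank, top_fraction=0.02):
    """Mean of the largest nearest-neighbor cosine distances.

    test_patches: (N, D); memory_bank: (M, D) built from the k reference images.
    """
    sims = l2_normalize(test_patches) @ l2_normalize(memory_bank).T  # (N, M)
    nn_dist = 1.0 - sims.max(axis=1)                                  # (N,)
    k = max(1, int(np.ceil(top_fraction * nn_dist.size)))
    return float(np.sort(nn_dist)[-k:].mean())

def class_aware_score(test_by_class, bank_by_class, top_fraction=0.02):
    """Highest per-class patch score over classes present in both inputs."""
    scores = [patch_anomaly_score(test_by_class[c], bank_by_class[c], top_fraction)
              for c in test_by_class
              if c in bank_by_class and len(test_by_class[c]) and len(bank_by_class[c])]
    return max(scores) if scores else 0.0

def vector_anomaly_score(test_vec, bank):
    """Distance from a (C,) test vector to its nearest neighbor in a (k, C) bank.

    Assumption: Euclidean distance stands in for the unspecified dissimilarity.
    """
    return float(np.linalg.norm(bank - test_vec, axis=1).min())
```

The same `patch_anomaly_score` serves the class-agnostic and foreground features; only the set of patches passed in changes.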
Finally, we combine the per-feature anomaly scores into a final prediction score that represents the anomaly degree of the image. Each feature's score is scaled by a weight, so the final score is a weighted sum. The weights are obtained during training by leave-one-out evaluation: we take one image from the k-shot set as a test image, use the remaining k-1 images as training images, and calculate a score with the method above. Each of the k images serves as the test image in turn, and the highest resulting score is stored as the weight for that feature. At test time, each feature's score is divided by its weight to obtain a normalized score, and the normalized scores are summed. In the 1-shot case this procedure is not possible, so we initialize the weights with simple default values.
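The leave-one-out calibration described above can be sketched as follows. The names `leave_one_out_weight`, `combined_score`, and `score_fn` are hypothetical, and the default weight of 1.0 for the 1-shot case is a placeholder assumption, since the text does not give the actual initial values.

```python
def leave_one_out_weight(train_items, score_fn, default=1.0):
    """Calibration weight for one feature.

    train_items: list of per-image features for the k reference images.
    score_fn(item, rest): anomaly score of `item` against the list `rest`.
    Returns the highest score seen when each image is held out in turn.
    """
    if len(train_items) < 2:
        return default  # 1-shot: no held-out comparison is possible (placeholder value)
    scores = []
    for i, item in enumerate(train_items):
        rest = train_items[:i] + train_items[i + 1:]
        scores.append(score_fn(item, rest))
    return max(scores)

def combined_score(test_scores, weights, eps=1e-12):
    """Sum of per-feature scores, each divided by its calibration weight."""
    return sum(test_scores[name] / max(weights[name], eps) for name in test_scores)
```

Dividing by the largest score observed among normal images puts the features on a comparable scale, so a normalized score above 1 means the test image is further from the references than any reference is from the others.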
Note that we used neither data augmentation (rotation, brightness variation, etc.) nor text prompting. The AnomalyDINO paper has demonstrated that the feature extraction capability of DINOv2 is robust, even outperforming methods that use text prompting.
Dataset & Evaluation
Dataset Utilization
The patch size of the DINOv2 model is 14. In order to obtain enough features for later processing, we resize the input images of size (256, 256) to (518, 518). Other than this, we did not use any augmentation techniques.
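As a quick check of the arithmetic behind this choice, the short sketch below (with a hypothetical helper name) computes the patch grid for a square input. With a patch size of 14, an input of 518 pixels divides evenly into 37 patches per side, whereas 256 is not a multiple of 14.

```python
def patch_grid(image_size, patch_size=14):
    """Return (patches per side, total patches) for a square input image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    side = image_size // patch_size
    return side, side * side

side, total = patch_grid(518)  # 37 patches per side, 1369 patches in total
```

So each (518, 518) image yields H*W = 37*37 = 1369 patch features.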
Evaluation Criteria
We use the f1_max as the sole evaluation metric. It is calculated as follows:
Precision: (Predicted anomalies and ground truth (GT) are anomalies) / Total number of images predicted as anomalies
Recall: (Predicted anomalies and GT are anomalies) / Total number of GT images that are anomalies
F1-score: (2 * precision * recall) / (precision + recall)
F1-max: The maximum value of F1-score obtained by iterating through thresholds
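The metric defined above can be sketched in plain Python as follows. This is an illustrative implementation with a hypothetical function name; it assumes labels of 1 for anomalous and 0 for normal images, and treats an image as predicted anomalous when its score is at or above the threshold.

```python
def f1_max(scores, labels):
    """Best F1-score over all thresholds taken from the observed scores."""
    best = 0.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        if tp == 0:
            continue  # precision and recall are both zero at this threshold
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

Only the observed scores need to be tried as thresholds, because the set of predicted anomalies changes only at those values.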
Results
We conducted comparative analyses with WinCLIP, ComAD, PSAD, and AnomalyDINO, all of which propose solutions akin to ours. The implementation of WinCLIP was based on the anomalib library. ComAD, PSAD, and AnomalyDINO were self-implemented without the use of open-source projects. Moreover, the semantic segmentation and semantic alignment strategies employed by the segmentation-based methods ComAD, PSAD, and AnomalyDINO (masking) were developed in-house, utilizing our proprietary SAM+DeAOT-based solution.
We observed that our proposed method, which was intended to fit the 'all' category, did not perform as well as expected. Analysis revealed that this was largely due to the limited accuracy of segment-level MOT. We tested other semantic alignment methods based on Hungarian matching and nearest neighbor matching, but none performed as well as the DeAOT-based approach. However, the DeAOT-based method still fell short of the segmentation algorithm trained directly on the MVTec LOCO dataset, as demonstrated in the PSAD paper. Since our training could not use any data from MVTec LOCO, we kept DeAOT as our final alignment method. For the competition, we ultimately chose the empirically most effective approach: a weighted summation of the scores based on class-agnostic patch representation, foreground patch representation and class histogram. This approach achieved a higher mean f1_max score than WinCLIP, ComAD, PSAD, and AnomalyDINO.
Discussion
Challenges & Solutions
First, we found that the performance of our proposed method was suboptimal, and after extensive experimentation we traced the issue to the inadequate segment-level alignment of DeAOT. We also tried other methods based on Hungarian matching or nearest neighbor matching, but none yielded satisfactory results. Our ultimate goal is to propose a superior segment-level MOT algorithm. Second, model loading and execution were slow, which made testing and debugging difficult. We removed many unnecessary processing steps, replacing large model frameworks with direct use of the models connected by our own preprocessing and postprocessing code. As a result, we achieved inference speeds faster than the anomalib WinCLIP implementation, despite our model being larger.
Model Robustness & Adaptability
As can be seen from the experimental results, our proposed method performs well on objects like breakfast boxes, juice bottles, and splicing connectors, but it performs poorly on pushpins and screw bags. We suspect that this is due to the small size of these objects and their random orientations. Rotational data augmentation would likely improve our method's performance on them. However, experiments showed that applying rotation significantly decreases the effectiveness of DeAOT, so we kept the images unrotated. Moreover, the AnomalyDINO-based method that we chose for the competition performs slightly worse on logical anomalies. With a better segment alignment model that could support the features for logical anomalies, our model would perform better in this area, as indicated by the results of AnomalyDINO Combined + histogram.
Future Work
Our primary goal at present is to develop a robust and precise segment-level semantic alignment model, which will significantly enhance the effectiveness of our proposed solution. Next, we aim to develop a smarter adaptive summation scheme, such as the single linear layer trained in APRIL-GAN (Chen, Han, and Zhang, 2023), which is trained on non-test datasets, thereby circumventing issues not related to few-shot settings.
Conclusion
We have proposed a method based on the combination of 7 different feature groups to effectively extract and integrate the compositional information of images. Guided by a high-precision segment-level semantic alignment method, our model has considerable potential. Furthermore, a powerful image encoder can obtain robust features for anomaly detection in a few-shot scenario without the need for text prompting. Our approach has confirmed that anomaly features can be effectively detected under few-shot conditions.
References
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., ... & Girshick, R. (2023). Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 4015-4026).
Yang, Z., & Yang, Y. (2022). Decoupling features in hierarchical propagation for video object segmentation. Advances in Neural Information Processing Systems, 35, 36324-36336.
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., ... & Bojanowski, P. (2023). DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.
Jeong, J., Zou, Y., Kim, T., Zhang, D., Ravichandran, A., & Dabeer, O. (2023). WinCLIP: Zero-/few-shot anomaly classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 19606-19616).
Liu, T., Li, B., Du, X., Jiang, B., Jin, X., Jin, L., & Zhao, Z. (2023). Component-aware anomaly detection framework for adjustable and logical industrial visual inspection. Advanced Engineering Informatics, 58, 102161.
Kim, S., An, S., Chikontwe, P., Kang, M., Adeli, E., Pohl, K. M., & Park, S. H. (2024, March). Few shot part segmentation reveals compositional logic for industrial anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, No. 8, pp. 8591-8599).
Damm, S., Laszkiewicz, M., Lederer, J., & Fischer, A. (2024). AnomalyDINO: Boosting Patch-based Few-shot Anomaly Detection with DINOv2. arXiv preprint arXiv:2405.14529.
Sugawara, S., & Imamura, R. (2024). PUAD: Frustratingly Simple Method for Robust Anomaly Detection. arXiv preprint arXiv:2402.15143.
Chen, X., Han, Y., & Zhang, J. (2023). A zero-/few-shot anomaly classification and segmentation method for CVPR 2023 VAND workshop challenge tracks 1&2: 1st place on zero-shot AD and 4th place on few-shot AD. arXiv preprint arXiv:2305.17382.