Abstract
Building on the memory-bank mechanism of PatchCore [2] for few-shot anomaly detection, we redesign how the features are constructed. Because this challenge focuses on logical anomalies, we use GroundingDINO [1] to detect objects and CLIP to extract their semantic features, which form an object-level memory bank. Positional encoding is added to the features to make them position-sensitive. We further build a graph of logical relationships among the object features, which forms a relationship-level memory bank. Together with the original patch-level memory bank, this yields three detectors that examine an image from complementary angles. A constructed validation set is used to align their score scales, so that the method applies across categories.
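As a rough illustration of the scale alignment mentioned above, the sketch below standardizes each detector's scores with statistics from a validation set and then averages them. The z-score standardization, the averaging rule, and the names `fit_scaler`, `align`, and `fuse` are assumptions made for illustration; the report does not specify how scales are aligned or how the three detectors are combined.

```python
from statistics import mean, pstdev

def fit_scaler(val_scores):
    """Return (mean, std) of one detector's scores on validation images."""
    mu = mean(val_scores)
    sigma = pstdev(val_scores)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for constant scores
    return (mu, sigma)

def align(score, scaler):
    """Map a raw score onto a common scale (assumed: z-score)."""
    mu, sigma = scaler
    return (score - mu) / sigma

def fuse(scores, scalers):
    """Average the aligned scores of several detectors (assumed fusion rule)."""
    return mean([align(s, sc) for s, sc in zip(scores, scalers)])

# Hypothetical validation scores for two detectors with very different ranges.
patch_val = [0.2, 0.4, 0.6]
object_val = [10.0, 20.0, 30.0]
scalers = [fit_scaler(patch_val), fit_scaler(object_val)]
fused = fuse([0.6, 30.0], scalers)
```

After alignment, a patch-level score of 0.6 and an object-level score of 30.0 contribute equally, because each lies the same number of standard deviations above its validation mean.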
Introduction
• Background: Logical anomaly detection usually entails detecting anomalies in the state and relationships of objects. Anomalous states of an object usually include damage, deviation, misalignment, etc. Anomalous relationships usually refer to anomalous objects, missing objects, displaced objects, quantity changes, etc.
• Challenge Description: The challenge aims to detect logical and structural anomalies with one, two, or four normal images.
Methodology
Model Design
• Approach:
• Architecture:
- GroundingDINO [1] with a Swin Transformer-Base backbone.
- CLIP with ViT-B/32.
- Feature extractors: wide_resnet101_2 [3], convnext_large, eva02_large_patch14_448.
• Training:
GroundingDINO, CLIP, and the feature extractors are all pre-trained on natural scene datasets.
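The patch-level memory bank with positional encoding can be sketched as follows. This is a minimal illustration, not the submitted implementation: it assumes the positional encoding is a pair of normalized grid coordinates appended to each patch feature, scores an image by the largest nearest-neighbor distance over its patches, and omits PatchCore's coreset subsampling and score re-weighting. The function names and the `weight` parameter are hypothetical.

```python
import math

def add_position(feature, row, col, grid_h, grid_w, weight=1.0):
    """Append normalized (row, col) coordinates to a patch feature."""
    pos = [
        weight * row / max(grid_h - 1, 1),
        weight * col / max(grid_w - 1, 1),
    ]
    return list(feature) + pos

def build_memory_bank(feature_maps, weight=1.0):
    """Collect position-augmented patch features from the normal images.

    feature_maps: list of grids, where grid[r][c] is a feature vector.
    """
    bank = []
    for grid in feature_maps:
        h, w = len(grid), len(grid[0])
        for r in range(h):
            for c in range(w):
                bank.append(add_position(grid[r][c], r, c, h, w, weight))
    return bank

def anomaly_score(grid, bank, weight=1.0):
    """Image-level score: largest nearest-neighbor distance over all patches."""
    h, w = len(grid), len(grid[0])
    return max(
        min(math.dist(add_position(grid[r][c], r, c, h, w, weight), m)
            for m in bank)
        for r in range(h)
        for c in range(w)
    )

# Toy 2x2 feature maps with 1-dimensional features.
normal = [[[0.0], [1.0]], [[1.0], [0.0]]]
swapped = [[[1.0], [0.0]], [[0.0], [1.0]]]  # same patches, wrong positions
bank = build_memory_bank([normal])
```

In this toy case the swapped image contains only patches that also occur in the normal image, so a bank without positional information (`weight=0.0`) gives it a score of zero, while the position-augmented bank gives it a positive score.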
Dataset & Evaluation
• Dataset Utilization:
Natural scene datasets such as ImageNet
• Evaluation Criteria:
We evaluate the model on the training and validation splits of the MVTec LOCO AD dataset [4], reporting the F1Max score.
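For reference, F1Max is the highest F1 score obtained over all decision thresholds. The sketch below assumes image-level anomaly scores and binary labels (1 for anomalous, 0 for normal) and sweeps the threshold over the observed scores; the function name `f1_max` is illustrative, and the official challenge evaluation code may differ in detail.

```python
def f1_max(scores, labels):
    """Return the maximum F1 over all thresholds taken from the scores.

    scores: anomaly scores, higher means more anomalous.
    labels: ground truth, 1 for anomalous and 0 for normal.
    """
    best = 0.0
    for t in sorted(set(scores)):
        # An image is predicted anomalous when its score is >= t.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        if tp == 0:
            continue  # F1 is zero (or undefined) without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

# Toy example: one normal image (0.4) scores above one anomalous image (0.35).
example = f1_max([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

Because the threshold is chosen in hindsight, F1Max measures how well the scores separate the two classes rather than the quality of any fixed operating point.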
Results
• Performance Metrics:
F1Max
• Comparison:
We take the original PatchCore [2] as the baseline and report the 4-shot results below.
F1Max    | Breakfast box | Juice bottle | Pushpins | Screw bag | Connector
Baseline | 0.8070        | 0.8560       | 0.7137   | 0.7910    | 0.7644
Ours     | 0.8254        | 0.8844       | 0.7832   | 0.7861    | 0.8206
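The per-category differences and the mean F1Max can be derived from the table with a few lines of arithmetic. The snippet only re-uses the reported values; the variable names are illustrative.

```python
categories = ["Breakfast box", "Juice bottle", "Pushpins", "Screw bag", "Connector"]
baseline = [0.8070, 0.8560, 0.7137, 0.7910, 0.7644]
ours = [0.8254, 0.8844, 0.7832, 0.7861, 0.8206]

# Per-category change in F1Max (ours minus baseline).
deltas = {c: round(o - b, 4) for c, b, o in zip(categories, baseline, ours)}

# Unweighted mean over the five categories.
mean_baseline = sum(baseline) / len(baseline)
mean_ours = sum(ours) / len(ours)
mean_gain = mean_ours - mean_baseline

for name, delta in deltas.items():
    print(f"{name:14s} {delta:+.4f}")
print(f"{'Mean':14s} {mean_gain:+.4f}")
```

The method improves on the baseline in four of the five categories, with the largest gain on Pushpins, and is slightly below the baseline on Screw bag.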
Discussion
• Challenges & Solutions: Local anomalous states are extremely hard for PatchCore to detect. We therefore crop each object detected by GroundingDINO and extract its features in order to detect missing objects. Finally, we compute a similarity map for every object to model the logical relationships among them.
• Model Robustness & Adaptability: Logical anomalies can be identified once GroundingDINO detects all normal objects, so their detection usually does not depend on the number of normal samples. Structural anomalies, by contrast, require fine-grained comparison with normal samples, so detection capability improves as the number of normal samples increases.
• Future Work: We plan to boost performance by first pre-training the model and then fine-tuning it on a few normal samples. We also believe that synthesizing normal and abnormal data is crucial. If feasible, we will combine few-shot logical anomaly detection with a large vision-language model through in-context learning.
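The object-level comparison described under Challenges & Solutions can be illustrated with a small sketch. It assumes each detected object is already represented by a feature vector (for example a CLIP embedding of its crop), flags a missing object when a normal object has no similar counterpart in the test image, and summarizes the relationship graph by the sorted pairwise cosine similarities. All function names and both scoring rules are assumptions for illustration; the report does not give the exact formulation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def missing_object_score(normal_objects, test_objects):
    """High when some normal object has no similar object in the test image."""
    if not test_objects:
        return 1.0
    return max(
        1.0 - max(cosine(n, t) for t in test_objects)
        for n in normal_objects
    )

def relation_signature(objects):
    """Order-invariant summary of the graph: sorted pairwise similarities."""
    return sorted(
        cosine(objects[i], objects[j])
        for i in range(len(objects))
        for j in range(i + 1, len(objects))
    )

def relation_score(normal_objects, test_objects):
    """Largest deviation between the normal and test relation signatures."""
    ref = relation_signature(normal_objects)
    sig = relation_signature(test_objects)
    if len(ref) != len(sig):
        return 1.0  # the number of objects changed
    if not ref:
        return 0.0
    return max(abs(r - s) for r, s in zip(ref, sig))

# Toy features: a normal image with two distinct objects.
normal_objs = [[1.0, 0.0], [0.0, 1.0]]
test_missing = [[1.0, 0.0]]  # the second object is absent
```

With these toy features, an image that contains both objects scores zero on both measures, while `test_missing` receives the maximum score of 1.0 on both.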
Conclusion
We find that a vision-language object detection model can localize all objects in an image according to text prompts. By comparing features of the whole image and of local objects, and by modeling the relationships among objects, it is possible to achieve surprisingly strong few-shot logical anomaly detection.
Participant Information
• Name(s): Xi Jiang; Hanqiu Deng
• Affiliation(s): Southern University of Science and Technology; Tencent Youtu Lab; University of Alberta
• Contact Information: jiangx2020@mail.sustech.edu.cn
• Track: Track II - VLM Anomaly Track
References
[1] Liu S, Zeng Z, Ren T, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection[J]. arXiv preprint arXiv:2303.05499, 2023.
[2] Roth K, Pemula L, Zepeda J, et al. Towards total recall in industrial anomaly detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 14318-14328.
[3] Zagoruyko S, Komodakis N. Wide residual networks[J]. arXiv preprint arXiv:1605.07146, 2016.
[4] Bergmann P, Batzner K, Fauser M, et al. Beyond dents and scratches: Logical constraints in unsupervised anomaly detection and localization[J]. International Journal of Computer Vision, 2022, 130(4): 947-969.
Comments