We propose a few-shot anomaly detection method based on Mixture of Experts (MoE), leveraging CLIP[3] and DINO[6] encoders. Specifically, we detect both structural and logical anomalies in test images by comparing them with textual features and normal reference images at different levels of granularity, ranging from global features to sub-region features and patch-level features.
Methodology

Model Design

Approach

We calculate four anomaly scores from the following aspects: (1) comparison of image and text features, (2) comparison of global features between the test image and few-shot normal images, (3) comparison of part features between the test image and few-shot normal images, and (4) comparison of patch-level features between the test image and few-shot normal images. These four anomaly scores are then combined to obtain the final anomaly score. The detailed calculation methods for these scores are introduced below.
(1) Comparison of Image and Text Features
Inspired by WinCLIP[7], many anomaly detection methods determine whether an image is normal or anomalous by measuring its similarity to texts representing "normal" and "abnormal" semantics, respectively. However, the textual prompts in WinCLIP[7] must be manually crafted by experts, and the texts representing "abnormal" mostly describe structural anomalies, which makes them ill-suited to logical anomalies. To address this issue, we use learnable text vectors to represent "normal" and "abnormal" semantics. Specifically, we adopt the CoOp[8] method to learn text templates and use the words "normal" and "abnormal" to denote normal and abnormal semantics. The resulting text prompts are:
T_{normal} = [V_1][V_2]...[V_n]\ normal\ [CLS\_NAME],
T_{abnormal} = [W_1][W_2]...[W_n]\ abnormal\ [CLS\_NAME],
where [V_i] and [W_i] are learnable vectors, and [CLS\_NAME] is the item category name.
After passing T_{normal} and T_{abnormal} through the CLIP[3] text encoder, we obtain text features representing normal and abnormal semantics F = [F_{normal}, F_{abnormal}]\in \mathbb{R}^{2\times C}, where C denotes the dimension of the text features.
Next, we input the query image to be tested into the CLIP[3] image encoder to obtain the global image feature I_g \in \mathbb{R}^{C}.
Finally, the text anomaly score is computed as the softmax probability assigned to the abnormal text feature, as in Eq. (1):
$$
Score_{text} = \text{softmax}(I_g\cdot F^T)_{abnormal}.
$$
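As a concrete illustration, the following minimal sketch computes Score_{text} as the softmax probability assigned to the abnormal prompt. The L2 normalization and the temperature of 100 follow CLIP's usual zero-shot recipe and are our assumptions rather than details specified above.

```python
import torch
import torch.nn.functional as F

def text_anomaly_score(image_feat: torch.Tensor, text_feats: torch.Tensor,
                       temperature: float = 100.0) -> torch.Tensor:
    """Score_text from a global CLIP image feature and the two learned prompts.

    image_feat: (C,)   global image feature I_g of the test image
    text_feats: (2, C) text features F = [F_normal, F_abnormal]
    """
    image_feat = F.normalize(image_feat, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = temperature * image_feat @ text_feats.T   # (2,) image-text similarities
    probs = logits.softmax(dim=-1)                     # [p_normal, p_abnormal]
    return probs[1]                                    # probability of "abnormal"
```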
(2) Comparison of global features
After extracting the global features of few-shot normal images using the CLIP[3] encoder, we store these normal features in a repository called Global Memory. During the testing phase, we utilize the same CLIP[3] encoder to extract the global features I_g of the test image.
Then, we compute the cosine distances between I_g and all normal features N_g in the Global Memory, and use the minimum of these cosine distances as the global anomaly score. The global anomaly score indicates the cosine distance between the test sample and the most similar normal sample. The calculation formula is as follows:
$$
Score_{global} = \min(\text{distance}(I_g, N_g)).
$$
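A minimal sketch of this step is given below, assuming the Global Memory is simply a (K, C) tensor stacking the k-shot normal global features and that cosine distance is one minus cosine similarity.

```python
import torch
import torch.nn.functional as F

def global_anomaly_score(test_feat: torch.Tensor, global_memory: torch.Tensor) -> torch.Tensor:
    """Score_global: minimum cosine distance between I_g and the Global Memory.

    test_feat:     (C,)   global feature I_g of the test image
    global_memory: (K, C) global features N_g of the k-shot normal images
    """
    test_feat = F.normalize(test_feat, dim=-1)
    global_memory = F.normalize(global_memory, dim=-1)
    cos_dist = 1.0 - global_memory @ test_feat   # (K,) distances to each normal image
    return cos_dist.min()
```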
(3) Comparison of part features
In the MVTec LOCO[5] dataset, samples often comprise multiple parts. For instance, a "screw bag" may contain two screws, two nuts, and two washers, while a "breakfast box" might include two oranges and one peach on the left side, and cereal and slices of dates on the right side. Logical anomalies in these samples often manifest in one or several parts. For example, in the "screw bag," there might be an extra or missing nut, while in the "breakfast box," a peach could be replaced by an orange, or slices of dates could be missing from the right side. Therefore, separating each part of the sample and analyzing them individually is an effective measure for detecting logical anomalies.
To segment the various parts within an image, we utilize the DINO[6] encoder to extract feature maps from the images. Subsequently, we employ clustering[9] on these feature maps to partition the image into several parts, saving the mask for each part for future use.
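The sketch below illustrates one way this clustering step could look, assuming the DINO patch tokens have been reshaped into an (H, W, C) feature map and that K-means[9] with a fixed number of clusters is used; the cluster count here is a hyperparameter chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def part_masks_from_features(feat_map: np.ndarray, n_parts: int = 5) -> np.ndarray:
    """Cluster a patch-feature map into parts and return per-part masks.

    feat_map: (H, W, C) DINO patch features of one image
    Returns:  (n_parts, H, W) boolean masks, one per part
    """
    h, w, c = feat_map.shape
    kmeans = KMeans(n_clusters=n_parts, n_init=10, random_state=0)
    labels = kmeans.fit_predict(feat_map.reshape(-1, c)).reshape(h, w)
    return np.stack([labels == k for k in range(n_parts)])
```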
After partitioning the image into several parts, we gather information about each part from the aspects of deep features, area, color, and quantity. By comparing this information, we determine whether each part of the test sample is normal.
We utilize feature maps extracted by two different backbones, CLIP[3] and DINO[6]. Using the obtained part masks, we filter out the deep features belonging to each part and compute the average pooling value of these deep features as the deep feature for each part. We store the part deep features of few-shot normal samples N_p \in \mathbb{R}^{N\times C} in a repository called part-deep Memory, where N represents the number of normal deep features. Then, we calculate the cosine distance between the part deep features of the test sample I_p \in \mathbb{R}^C and the deep features in the part-deep Memory. The anomaly score for each part is computed based on the minimum cosine distance, as shown in the following equation:
$$
Score_{part\_deep\_feat}= \min(\text{distance}(I_p, N_p)).
$$
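Under the same assumptions as above (boolean part masks on an (H, W, C) feature map, cosine distance as one minus cosine similarity), the per-part deep-feature score could be computed as follows.

```python
import torch
import torch.nn.functional as F

def part_deep_score(feat_map: torch.Tensor, part_mask: torch.Tensor,
                    part_memory: torch.Tensor) -> torch.Tensor:
    """Score_part_deep_feat for one part of the test image.

    feat_map:    (H, W, C) deep feature map of the test image
    part_mask:   (H, W)    boolean mask of the part
    part_memory: (N, C)    part deep features stored in the part-deep Memory
    """
    part_feat = feat_map[part_mask].mean(dim=0)   # masked average pooling -> I_p
    part_feat = F.normalize(part_feat, dim=-1)
    part_memory = F.normalize(part_memory, dim=-1)
    return (1.0 - part_memory @ part_feat).min()  # distance to closest normal part
```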
In addition to deep features, simply calculating the area and quantity often allows for the detection of added or missing components within a part. By comparing colors, component replacements can be detected. Therefore, we construct geometric features for each part by considering three simple attributes: area, quantity, and color. By comparing these geometric features, we determine whether the test image is abnormal. The geometric feature can be constructed by this equation:
$$
F_{geo} = [area,quantity,color].
$$
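How the three attributes are measured is not fully specified above; the sketch below is one plausible construction, where area is the normalized mask area, quantity is the number of connected components in the mask, and color is the mean RGB value of the part.

```python
import numpy as np
from scipy import ndimage

def geometric_feature(image: np.ndarray, part_mask: np.ndarray) -> np.ndarray:
    """Build F_geo = [area, quantity, color] for one part.

    image:     (H, W, 3) RGB image scaled to [0, 1]
    part_mask: (H, W)    boolean mask of the part
    """
    area = part_mask.mean()                 # fraction of the image covered by the part
    _, quantity = ndimage.label(part_mask)  # number of connected components
    color = image[part_mask].mean(axis=0)   # (3,) mean RGB color of the part
    return np.concatenate([[area, quantity], color])
```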
Similarly to the approach mentioned earlier, we store the geometric features from few-shot normal samples N_{geo} in a repository named part-geo Memory. Then, we compute the minimum cosine distance between the geometric features of the test sample I_{geo} and those stored in the part-geo Memory to obtain anomaly scores, as shown in the following equation:
$$
Score_{part\_geo\_feat}= \min(\text{distance}(I_{geo}, N_{geo})).
$$
Combining the two scores above with weighting hyperparameters \alpha and \beta, we obtain the part score:
$$
Score_{part} = \alpha\cdot Score_{part\_deep\_feat} + \beta\cdot Score_{part\_geo\_feat}.
$$
(4) Comparison of patch features
By comparing the patch-level features of the test sample with those of few-shot normal samples, we can identify patches that never appear in any normal sample, thus detecting structural anomalies in the test sample. Specifically, we extract patch features from the normal samples using the CLIP[3] and DINO[6] encoders and store these normal patch features N_{patch} in a repository called patch Memory. During the testing phase, for each patch feature I_{patch} of the test sample, we compute its minimum cosine distance to the features stored in the patch Memory, which serves as the anomaly score for that patch.
$$
Score_{one\_patch} = \min(\text{distance}(I_{patch}, N_{patch})).
$$
Then, we take the maximum over all patch scores as the patch-level anomaly score of the test image:
$$
Score_{patch} = \max(Score_{one\_patch}).
$$
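Both steps can be vectorized; the sketch below computes the per-patch scores and the image-level patch score in one pass, again assuming cosine distance is one minus cosine similarity.

```python
import torch
import torch.nn.functional as F

def patch_anomaly_score(test_patches: torch.Tensor, patch_memory: torch.Tensor):
    """Per-patch minimum distance to the patch Memory, then the max over patches.

    test_patches: (P, C) patch features of the test image
    patch_memory: (M, C) patch features of the k-shot normal images
    Returns (Score_one_patch for every patch, Score_patch).
    """
    test_patches = F.normalize(test_patches, dim=-1)
    patch_memory = F.normalize(patch_memory, dim=-1)
    dist = 1.0 - test_patches @ patch_memory.T   # (P, M) pairwise cosine distances
    per_patch = dist.min(dim=1).values           # closest normal patch for each test patch
    return per_patch, per_patch.max()            # per-patch scores, image-level score
```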
Finally, by combining the aforementioned scores, we can obtain the final anomaly score for the image:
$$
Score = x_1\cdot Score_{text}+x_2\cdot Score_{global}+x_3\cdot Score_{part}+x_4\cdot Score_{patch},
$$
where x_i are hyperparameters controlling the weights of each score.
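The combination itself is a plain weighted sum; the weights in the sketch below are placeholders, not the tuned values.

```python
def final_score(score_text, score_global, score_part, score_patch,
                weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted combination of the four anomaly scores (x_1..x_4 are hyperparameters)."""
    x1, x2, x3, x4 = weights
    return x1 * score_text + x2 * score_global + x3 * score_part + x4 * score_patch
```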
Architecture

Training
Our method is trained on the MVTec[1], VisA[2], and CADSD[4] datasets, focusing on training the pixel decoder, image adapter, and learnable text vector components of the model.
Specifically, training the pixel decoder is aimed at optimizing the Score_{patch}. We utilize L2 loss for optimization and train the model with a learning rate of 1e-6:
$$
L_{decoder} = L2\_loss(Score_{patch}, label).
$$
Training the image adapter is primarily aimed at optimizing the Score_{global}. We continue to utilize L2 loss for optimization and train the model with a learning rate of 1e-6:
$$
L_{adapter} = L2\_loss(Score_{global}, label).
$$
Training the learnable text vectors mainly optimizes Score_{text}. We employ cross-entropy loss for optimization and train the model with a learning rate of 1e-5:
$$
L_{text\_vec} = CE\_loss(Score_{text}, label).
$$
We sequentially train the three modules—pixel decoder, image adapter, and text vector—in the order mentioned. Each training session is conducted on the aforementioned three datasets.
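Since the architectures of the pixel decoder and image adapter are not detailed above, the sketch below only illustrates one of the three stages: training a hypothetical linear image adapter with an L2 loss on Score_{global} at a learning rate of 1e-6; the adapter shape and feature dimension are assumptions.

```python
import torch
from torch import nn
import torch.nn.functional as F

adapter = nn.Linear(512, 512)   # hypothetical image adapter; architecture is an assumption
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-6)
mse = nn.MSELoss()

def adapter_train_step(image_feat: torch.Tensor, global_memory: torch.Tensor, label: float) -> float:
    """One optimization step on L_adapter = L2_loss(Score_global, label).

    image_feat:    (C,)   CLIP global feature of a training image
    global_memory: (K, C) normal global features
    label:         1.0 if the training image is anomalous, else 0.0
    """
    adapted = F.normalize(adapter(image_feat), dim=-1)
    memory = F.normalize(global_memory, dim=-1)
    score_global = (1.0 - memory @ adapted).min()   # same form as Score_global above
    loss = mse(score_global, torch.tensor(label))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```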
Dataset & Evaluation

Dataset Utilization

As described above, our method is trained on the MVTec[1], VisA[2], and CADSD[4] datasets.
MVTec[1] is a widely used anomaly detection dataset comprising 5354 normal and abnormal images across 15 categories, with resolutions ranging from 700×700 to 1024×1024 pixels. VisA[2] is a more recent anomaly detection dataset consisting of 10821 images from 12 categories, with resolutions around 1500×1000 pixels. CADSD[4] is a logical anomaly detection dataset containing 774 assembly images of screws and nuts. In the normal training images, each screw should be paired with exactly one nut, so violations of this constraint (such as a missing nut or a double-nut assembly) constitute logical anomalies. The dataset also includes structural anomalies such as scratches and paint.
Evaluation Criteria

Following the competition requirements, we adopt image-level F1-max score as the evaluation metric. In addition, similar to previous anomaly detection methods, we also calculate the image-level AUROC to evaluate the performance of our method.
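For reference, both metrics can be computed directly from image-level labels and anomaly scores; the sketch below obtains F1-max by sweeping all thresholds via the precision-recall curve.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score

def image_level_metrics(labels: np.ndarray, scores: np.ndarray):
    """Image-level F1-max and AUROC from binary labels (1 = anomalous) and anomaly scores."""
    precision, recall, _ = precision_recall_curve(labels, scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-8, None)
    return f1.max(), roc_auc_score(labels, scores)
```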
Results

We conduct experiments on the MVTec LOCO[5] dataset under 1/2/4/8-shot settings. For each value of k in k-shot, we randomly select k normal samples from the training set three times and average the results of these three trials as the experimental outcome for that shot. The experimental results are as follows:
We propose a comprehensive approach to anomaly detection that integrates comparisons of image and text features, global features, part features, and patch-level features. Our method identifies both logical and structural anomalies, leveraging the strengths of the CLIP[3] and DINO[6] encoders. Experiments on the MVTec LOCO[5] dataset under few-shot settings demonstrate the efficacy of our approach. This work provides a robust solution for few-shot anomaly detection and sets the stage for future research in this domain.
References

[1] Bergmann, P., Fauser, M., Sattlegger, D., & Steger, C. (2019). MVTec AD--A comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9592-9600).
[2] Zou, Y., Jeong, J., Pemula, L., Zhang, D., & Dabeer, O. (2022, October). Spot-the-difference self-supervised pre-training for anomaly detection and segmentation. In European Conference on Computer Vision (pp. 392-408). Cham: Springer Nature Switzerland.
[3] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.
[4] Ishida, K., Takena, Y., Nota, Y., Mochizuki, R., Matsumura, I., & Ohashi, G. (2023). Sa-patchcore: Anomaly detection in dataset with co-occurrence relationships using self-attention. IEEE Access, 11, 3232-3240.
[5] Bergmann, P., Batzner, K., Fauser, M., Sattlegger, D., & Steger, C. (2022). Beyond dents and scratches: Logical constraints in unsupervised anomaly detection and localization. International Journal of Computer Vision, 130(4), 947-969.
[6] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9650-9660).
[7] Jeong, J., Zou, Y., Kim, T., Zhang, D., Ravichandran, A., & Dabeer, O. (2023). Winclip: Zero-/few-shot anomaly classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 19606-19616).
[8] Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9), 2337-2348.
[9] Liu, T., Li, B., Du, X., Jiang, B., Jin, X., Jin, L., & Zhao, Z. (2023). Component-aware anomaly detection framework for adjustable and logical industrial visual inspection. Advanced Engineering Informatics, 58, 102161.