Problem Statement
Assume that we have data $x \in \mathcal{X}$ (images) with or without explicit knowledge of their labels $y$. We want to learn a model that creates a representation of $x$ such that it can assign an image that was never seen before to either the nominal category ($y = 0$) or the anomalous category ($y = 1$), correctly with high probability. This setting is known as Anomaly Detection (AD).

In the AD literature we encounter a category of problems associated with the localization of the anomaly, where localization is achieved by producing an image with a bounding box around the anomalous area. This problem is out of scope for this project, which aims to classify the whole image as nominal or anomalous. Having said that, some of the models we have evaluated have inherent localization capabilities that can be used in the future.
In our pipeline, Qdrant, a vector database, is used for storing such representations (embeddings); a minimal storage sketch follows.
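Below is a minimal sketch of storing image embeddings in Qdrant using the qdrant-client Python package. The collection name, embedding dimensionality, and the in-memory client are illustrative assumptions, not the deployed configuration.

```python
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

# In-memory instance for experimentation; production would point at a server URL.
client = QdrantClient(":memory:")

EMBED_DIM = 2048  # assumed dimensionality, e.g. ResNet-50 penultimate features

client.create_collection(
    collection_name="nominal_embeddings",  # hypothetical collection name
    vectors_config=VectorParams(size=EMBED_DIM, distance=Distance.EUCLID),
)

def store_embeddings(embeddings: np.ndarray) -> None:
    """Upsert one point per image embedding, tagged as training (nominal) data."""
    points = [
        PointStruct(id=i, vector=vec.tolist(), payload={"split": "train"})
        for i, vec in enumerate(embeddings)
    ]
    client.upsert(collection_name="nominal_embeddings", points=points)
```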
Unsupervised Anomaly Detection
When we have a fully or partially annotated dataset we have a supervised or semi-supervised classification problem, respectively. The Seamagine dataset is a collection of images taken at different times from a production machine, grouped by machine setting. Because of this grouped collection, it may appear at first glance to be a supervised dataset, but it is not. The main reason is that the multiple machine settings are used only to vary the data distribution such that some products will fail when a physical stress test is performed. We have no direct mapping from a machine setting to a stress-test outcome for all the images in our dataset, which would be needed to claim supervision for our AD problem; we have, however, seen evidence that some machine settings lead to a higher number of failures than others. This is the only label-like information we have, and we use it only to benchmark our models using industry-standard AD evaluation protocols.

Since we do not use labels during training, we have an unsupervised problem, also known as one-class (OC) classification or out-of-distribution (OOD) detection. Let $\mathcal{X}_{\text{train}}$ denote the set of all nominal images ($y = 0$) available at training time, with $y \in \{0, 1\}$ denoting whether an image is nominal ($y = 0$) or anomalous ($y = 1$). Accordingly, we define $\mathcal{X}_{\text{test}}$ to be the set of samples provided at test time, with $y \in \{0, 1\}$. Since we need the labels only for evaluation / benchmarking purposes, we can use the labels in the validation dataset to evaluate the performance of our models.

Most unsupervised AD methods attempt to create a model of the underlying nominal data distribution, a direct consequence of having a single class present in our dataset. We call such a model $\hat{p}(x)$ and aim to make it as close as possible to the true nominal data distribution $p(x)$. We distinguish two parts in the construction of the model:
- The embedding / feature extractor $\phi$ that produces the representation $z = \phi(x)$ from the image $x$.
- The distribution estimator that learns the distribution $p(z)$ of the nominal embeddings.
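The following is a minimal sketch of this two-stage decomposition. The ResNet-50 backbone for $\phi$ and the kNN-based proxy for $p(z)$ are illustrative choices under stated assumptions, not the project's fixed design.

```python
import torch
import torchvision.models as models
from sklearn.neighbors import NearestNeighbors

# Stage 1: the extractor phi(x) -> z, here a pretrained ResNet-50 with the
# classification head removed so the forward pass returns 2048-d embeddings.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def phi(images: torch.Tensor) -> torch.Tensor:
    """Map a batch of (N, 3, H, W) images to (N, 2048) embeddings."""
    return backbone(images)

# Stage 2: a simple stand-in for the p(z) estimator, fitted on nominal embeddings only.
def fit_estimator(nominal_z: torch.Tensor, k: int = 5) -> NearestNeighbors:
    return NearestNeighbors(n_neighbors=k).fit(nominal_z.numpy())

def anomaly_score(estimator: NearestNeighbors, z: torch.Tensor) -> float:
    """Mean distance to the k nearest nominal embeddings; larger = more anomalous."""
    dists, _ = estimator.kneighbors(z.numpy())
    return float(dists.mean())
```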
Embeddings
With respect to the embeddings, we can broadly divide the ways we obtain them into two categories: generative and discriminative. In the following we outline the relative merits of each and identify the most promising one for the real-time unsupervised AD problem we are facing.
- Reconstruction-based: One of the main generative approaches is the set of reconstruction-based models. Here we learn to reconstruct the domain-specific nominal input images and use the reconstruction error as a score to determine whether an image is anomalous. During training only nominal samples are seen by the network, so when an anomalous image is presented at test time, the network is unable to reconstruct it with high quality, lifting the reconstruction error above a threshold (a minimal autoencoder sketch follows this list). Reconstruction-based methods learn embeddings from the domain-specific dataset itself and require an explicit training stage before the encoder can be used in production. Two main representatives in this category are Convolutional AutoEncoders (CAE) and Generative Adversarial Networks (GANs). Advances in the form of Masked Autoencoders (MAE) offer foundation-model capabilities but are computationally heavy, especially during inference.
- Representation-based: One of the main discriminative approaches is the set of models that produce the embeddings using a pretrained model, such as a vision transformer or a (residual) CNN. Such models are pretrained on a discriminative dataset such as ImageNet. Based on excellent evaluation results on the MVTec dataset in combination with the distribution estimation methods below, we decided to focus on a couple of promising approaches. The main representatives in this category include OpenCLIP and ResNet-50 (a short embedding-extraction sketch also follows this list).
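As referenced in the reconstruction-based item above, here is a minimal convolutional autoencoder sketch. The architecture, the assumed 3x128x128 input size, and the mean-squared-error score are illustrative simplifications.

```python
import torch
import torch.nn as nn

class CAE(nn.Module):
    """Tiny convolutional autoencoder; trained with MSE on nominal images only."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 128 -> 64
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

def reconstruction_score(model: CAE, x: torch.Tensor) -> torch.Tensor:
    """Per-image mean squared reconstruction error; a high value suggests an anomaly."""
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=(1, 2, 3))
```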
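For the representation-based route, a short sketch of extracting embeddings with OpenCLIP follows; the model name and pretraining tag are plausible examples rather than the project's chosen configuration.

```python
import torch
import open_clip
from PIL import Image

# Pretrained CLIP image encoder; no domain-specific training stage is required.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"  # example checkpoint (assumed)
)
model.eval()

@torch.no_grad()
def clip_embedding(path: str) -> torch.Tensor:
    """Return a single L2-normalized image embedding."""
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    z = model.encode_image(image)
    return z / z.norm(dim=-1, keepdim=True)
```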
Distribution Estimation
- PatchCore: A method that uses a patch-based approach to learn the distribution of the dataset. It is based on the idea that the distribution of patches from nominal images differs from the distribution of patches from anomalous images. The method uses a pretrained model to extract features from the images and then applies a nearest-neighbor algorithm to detect anomalies at the patch level. It relies on representative features from the intermediate layers of a pretrained backbone CNN and uses coreset sampling to reduce memory requirements (a simplified sketch follows this list).
- EfficientAD: Introduces a fast patch descriptor and trains a student network to predict the features computed by a pretrained teacher network on nominal training images. Because the student never sees anomalous images during training, it generally fails to mimic the teacher on them; a large distance between the outputs of the teacher and the student thus enables the detection of anomalies at test time (see the student-teacher sketch below).
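A simplified PatchCore-style sketch follows: patch features from an intermediate backbone layer, a subsampled memory bank (random subsampling stands in here for the greedy coreset selection of the original method), and nearest-neighbor patch scoring. The layer choice and sampling ratio are assumptions.

```python
import torch
import torchvision.models as models
from sklearn.neighbors import NearestNeighbors

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()

@torch.no_grad()
def patch_features(x: torch.Tensor) -> torch.Tensor:
    """(N, 3, H, W) -> (N * H' * W', C) patch embeddings from a mid-level layer."""
    z = backbone.conv1(x)
    z = backbone.bn1(z); z = backbone.relu(z); z = backbone.maxpool(z)
    z = backbone.layer1(z); z = backbone.layer2(z)  # stop before the final layers
    n, c, h, w = z.shape
    return z.permute(0, 2, 3, 1).reshape(n * h * w, c)

def build_memory_bank(nominal_images: torch.Tensor, ratio: float = 0.1) -> NearestNeighbors:
    """Keep a random fraction of nominal patches as the memory bank."""
    feats = patch_features(nominal_images)
    keep = torch.randperm(feats.shape[0])[: int(ratio * feats.shape[0])]
    return NearestNeighbors(n_neighbors=1).fit(feats[keep].numpy())

def image_score(bank: NearestNeighbors, image: torch.Tensor) -> float:
    """Image-level score = worst (largest) patch distance to the memory bank."""
    dists, _ = bank.kneighbors(patch_features(image.unsqueeze(0)).numpy())
    return float(dists.max())
```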
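And a minimal student-teacher discrepancy sketch in the spirit of EfficientAD; the ResNet-18 teacher/student pair is an illustrative simplification of the paper's patch-descriptor networks.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Frozen pretrained teacher; both networks return feature vectors (head removed).
teacher = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
teacher.fc = nn.Identity()
teacher.eval()

# Student trained from scratch on nominal images only,
# e.g. with optimizer = torch.optim.Adam(student.parameters(), lr=1e-4).
student = models.resnet18(weights=None)
student.fc = nn.Identity()

def train_step(optimizer: torch.optim.Optimizer, x: torch.Tensor) -> torch.Tensor:
    """One regression step: the student mimics the teacher on nominal images."""
    with torch.no_grad():
        target = teacher(x)
    loss = ((student(x) - target) ** 2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss

@torch.no_grad()
def st_score(x: torch.Tensor) -> torch.Tensor:
    """Per-image teacher-student distance; large on unseen (anomalous) inputs."""
    return ((teacher(x) - student(x)) ** 2).mean(dim=1)
```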
Anomaly Detection Scoring
Irrespective of the approach used, and unless the method inherently produces an anomaly score, a k-nearest-neighbors (kNN) algorithm is employed to assign an anomaly score to each validation image, and a thresholding operation determines whether the image is anomalous.
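A hedged sketch of this scoring and thresholding step; the value of $k$ and the percentile-based calibration rule are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_scores(train_z: np.ndarray, test_z: np.ndarray, k: int = 5) -> np.ndarray:
    """Score each test embedding by its mean distance to k nominal neighbors."""
    nn = NearestNeighbors(n_neighbors=k).fit(train_z)
    dists, _ = nn.kneighbors(test_z)
    return dists.mean(axis=1)

def calibrate_threshold(nominal_scores: np.ndarray, quantile: float = 0.99) -> float:
    """Pick a threshold so that ~1% of held-out nominal images are flagged."""
    return float(np.quantile(nominal_scores, quantile))

def is_anomalous(scores: np.ndarray, threshold: float) -> np.ndarray:
    return scores > threshold
```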
Evaluation Protocol
For the anomaly detection task we have adopted an industry-standard evaluation protocol and metrics. During training we use only the PASS class images, while during validation we use both PASS and FAIL class images and report key performance metrics: the Precision-Recall curve, the associated Area Under the Curve (AUC) score, and the validation confusion matrix.
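A minimal sketch of computing these metrics with scikit-learn, assuming anomaly scores where larger means more anomalous and labels encoded as 0 = PASS, 1 = FAIL.

```python
import numpy as np
from sklearn.metrics import auc, confusion_matrix, precision_recall_curve

def evaluate(y_true: np.ndarray, scores: np.ndarray, threshold: float):
    """Return the PR-AUC and the confusion matrix at the chosen threshold."""
    precision, recall, _ = precision_recall_curve(y_true, scores)
    pr_auc = auc(recall, precision)
    cm = confusion_matrix(y_true, scores > threshold)
    return pr_auc, cm
```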

