
Problem Statement

Assume that we have data (images) $\mathbf{x}$ with or without explicit knowledge of their labels $y$. We want to learn a model that can create a representation of $\mathbf{x}$ such that it can assign an image that was never seen before to either the nominal category or the anomalous category ($k=1$), correctly with high probability. This setting is known as Anomaly Detection (AD).
In the AD literature we also meet a category of problems associated with the localization of the anomaly. Localization is achieved by producing an image with a bounding box around the anomalous area. This problem is out of scope for this project, which aims to classify each image as nominal or anomalous. That said, some of the models we have evaluated have inherent localization capabilities that can be used in the future.
We start by establishing some definitions and mathematical notation so that the developed modeling approach can be understood. We need to extract a vector $\mathbf{z}$ from the image $\mathbf{x}$ that can subsequently be used to assign a score that flags the AD event. The vector $\mathbf{z}$ is called the representation of the input image $\mathbf{x}$. A synonymous term is embedding, where the image is embedded into a vector space with fewer dimensions (e.g., $< 2048$) than the raw image space. In the Seamagine data, the raw image vector space for the grayscale cropped images has $1 \times 224 \times 224 = 50{,}176$ dimensions for one of the channels. We also refer to representations as features, since they represent the features of the image needed for the task at hand.
Qdrant is used for storing such embeddings.
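As a minimal sketch of how such embeddings could be stored and queried, assuming the qdrant-client Python package and a locally running Qdrant instance; the collection name, vector size, and payload fields below are illustrative, not part of the actual setup:

```python
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")  # assumed local Qdrant instance

# Illustrative collection for 2048-dimensional image embeddings.
client.create_collection(
    collection_name="seamagine_embeddings",
    vectors_config=VectorParams(size=2048, distance=Distance.COSINE),
)

# Store one embedding z (a NumPy vector) together with some metadata.
z = np.random.rand(2048)  # placeholder for a real embedding
client.upsert(
    collection_name="seamagine_embeddings",
    points=[PointStruct(id=0, vector=z.tolist(), payload={"machine_setting": 3})],
)

# Retrieve the nearest stored embeddings to a query vector.
hits = client.search(
    collection_name="seamagine_embeddings",
    query_vector=z.tolist(),
    limit=5,
)
```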

Unsupervised Anomaly Detection

When we have a fully or partially annotated dataset we have a supervised or semi-supervised classification problem, respectively. The Seamagine dataset is a collection of images taken at different times from a production machine, grouped into $M=12$ machine settings. Because of this grouped collection it may appear at first glance to be a supervised dataset, but it is not. The main reason is that the multiple machine settings are used only to vary the data distribution $p_{data}(\mathbf{x})$ such that some products fail when a physical stress test is performed. We have no direct mapping from a machine setting to a stress-test outcome for all the images in our dataset, which would be needed to claim supervision for our AD problem; we have only seen evidence that some machine settings lead to a higher number of failures than others. This is the only information we can use, and we use it only to benchmark our models using industry-standard AD evaluation protocols. Since we do not use labels during training, we have an unsupervised problem, also known as one-class (OC) classification or out-of-distribution (OOD) detection. Let $\mathcal{X}_T$ denote the set of all nominal images ($\forall x_i \in \mathcal{X}_T : y_i = 0$) available at training time, with $y_i \in \{0, 1\}$ denoting whether an image $x_i$ is nominal ($y_i=0$) or anomalous ($y_i=1$). Accordingly, we define $\mathcal{X}_V$ to be the set of samples provided at test time, with $\forall x_i \in \mathcal{X}_V : y_i \in \{0, 1\}$. Since we need the labels only for evaluation / benchmarking purposes, we can use the labels in the validation dataset to evaluate the performance of our models.
It is important to emphasize the unsupervised problem setting: to develop a model we do not use any anomaly detection labels (PASS or FAIL) associated with the images, in direct contrast to a supervised classification setting. Images with an anomaly detection label of FAIL are present only in the validation dataset during testing and must not be present in the training dataset. This means that our training dataset contains 11 machine configurations while the validation dataset contains 12 machine configurations.
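The split can be summarized with a small sketch; the record structure, setting identifiers, and label field below are hypothetical and only illustrate that FAIL labels never enter training:

```python
def split_dataset(records, train_settings, all_settings):
    """records: hypothetical per-image dicts with a 'machine_setting' id and a
    stress-test 'label' (0 = PASS, 1 = FAIL) used for benchmarking only."""
    train_set = [r for r in records
                 if r["machine_setting"] in train_settings and r["label"] == 0]
    val_set = [r for r in records if r["machine_setting"] in all_settings]
    return train_set, val_set

# Example: 11 configurations for training, all 12 for validation (ids illustrative).
# train_set, val_set = split_dataset(records, train_settings=set(range(11)),
#                                    all_settings=set(range(12)))
```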
Most unsupervised AD methods attempt to create a model of the underlying nominal data distribution $p_{nom}(\mathbf{x})$, a direct consequence of having a single class present in our dataset. We call such a model $p_{model}(\mathbf{x})$ and we aim to make it as close as possible to the true nominal data distribution. We distinguish two parts in the construction of the model:
  1. The embeddings / features extractor that produces the representation $\mathbf{z}$ from the image $\mathbf{x}$.
  2. The distribution estimator that learns the distribution $p_{model}(\mathbf{x})$.
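To make the two-part structure concrete, the sketch below pairs an arbitrary feature extractor with a deliberately simple distribution estimator (a multivariate Gaussian scored by Mahalanobis distance). This estimator is only an illustrative stand-in, not one of the methods evaluated below:

```python
import numpy as np

class GaussianEstimator:
    """Illustrative distribution estimator: fit a multivariate Gaussian to the
    nominal embeddings and score new embeddings by Mahalanobis distance."""

    def fit(self, Z: np.ndarray) -> "GaussianEstimator":
        # Z has shape (n_nominal_images, d) and contains nominal embeddings only.
        self.mu = Z.mean(axis=0)
        cov = np.cov(Z, rowvar=False) + 1e-6 * np.eye(Z.shape[1])  # regularized
        self.cov_inv = np.linalg.inv(cov)
        return self

    def score(self, z: np.ndarray) -> float:
        # Higher score = further from the nominal distribution = more anomalous.
        d = z - self.mu
        return float(np.sqrt(d @ self.cov_inv @ d))

# Usage, with extract_embedding standing in for any feature extractor:
# estimator = GaussianEstimator().fit(np.stack([extract_embedding(x) for x in train_images]))
# anomaly_score = estimator.score(extract_embedding(test_image))
```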

Embeddings

With respect to the embeddings, we can broadly divide the ways we obtain them into two categories: generative and discriminative. In the following we outline the relative merits of the methods and select the most promising one for the real-time unsupervised AD problem we are facing.
  1. Reconstruction-based: One of the main generative approaches is the set of models that are reconstruction based. Here we learn to reconstruct the domain-specific nominal input images and use the reconstruction error as a score to determine whether an image is anomalous. During training only nominal samples are seen by the network, which means that when an anomalous image is seen during testing, the network is unable to reconstruct it with high quality, lifting the reconstruction error above a threshold. Reconstruction-based methods learn embeddings from the domain-specific dataset itself and require an explicit training stage before the encoder can be used in production. The two main representatives in this category are Convolutional AutoEncoders (CAE) and Generative Adversarial Networks (GANs). Advances in the form of Masked Autoencoders (MAE) have been found to offer foundation-model capabilities but are computationally heavy, especially during inference.
  2. Representation-based: One of the main discriminative approaches is the set of models that produce the embeddings using a pretrained model, either a vision transformer or a (residual) CNN. Such models are pretrained on a discriminative dataset such as ImageNet. There are a couple of promising approaches that we decided to focus on, based on their excellent evaluation results on the MVTec dataset in combination with the distribution estimation methods. The main representatives in this category include OpenCLIP and ResNet-50.
We focus on the representation learning category since it is the most widely used in the industry and it is the most interpretable.
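As a minimal sketch of the representation-based approach, assuming PyTorch / torchvision with ImageNet-pretrained weights; the preprocessing choices below, such as replicating the grayscale channel, are illustrative rather than prescribed:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# ImageNet-pretrained ResNet-50 with the classification head removed,
# leaving the 2048-dimensional global-average-pooled feature vector.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # replicate the single channel
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_embedding(image_path: str) -> torch.Tensor:
    x = preprocess(Image.open(image_path)).unsqueeze(0)  # shape (1, 3, 224, 224)
    return backbone(x).squeeze(0)                        # shape (2048,)
```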

Distribution Estimation

  1. Patchcore: A method that uses a patch-based approach to learn the distribution of the dataset, based on the idea that the distribution of patches from nominal images differs from the distribution of patches from anomalous images. It uses a pretrained model to extract features from the images and then applies a nearest-neighbor algorithm to detect anomalies at the patch level. It relies on representative features taken from intermediate layers of a pretrained backbone CNN and uses coreset sampling to reduce memory requirements (see the sketch after this list).
  2. Efficient-AD: Introduces a fast patch descriptor and trains a student network to predict the features computed by a pretrained teacher network on nominal training images. Because the student is not trained on anomalous images, it generally fails to mimic the teacher on these. A large distance between the outputs of the teacher and the student thus enables the detection of anomalies at test time.
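A minimal sketch of the Patchcore idea, assuming patch-level features have already been extracted as NumPy arrays; the coreset fraction and the use of Euclidean distance are illustrative choices:

```python
import numpy as np

def greedy_coreset(patches: np.ndarray, fraction: float = 0.1, seed: int = 0) -> np.ndarray:
    """Greedy farthest-point subsampling of nominal patch features (coreset)."""
    rng = np.random.default_rng(seed)
    n = len(patches)
    m = max(1, int(fraction * n))
    selected = [int(rng.integers(n))]
    dists = np.linalg.norm(patches - patches[selected[0]], axis=1)
    for _ in range(m - 1):
        idx = int(dists.argmax())  # farthest remaining patch from the coreset so far
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(patches - patches[idx], axis=1))
    return patches[selected]

def image_anomaly_score(test_patches: np.ndarray, memory_bank: np.ndarray) -> float:
    """Image score = max over its patches of the distance to the nearest nominal patch."""
    d = np.linalg.norm(test_patches[:, None, :] - memory_bank[None, :, :], axis=-1)
    return float(d.min(axis=1).max())

# memory_bank = greedy_coreset(np.concatenate(nominal_patch_features))
# score = image_anomaly_score(patch_features_of_test_image, memory_bank)
```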

Anomaly Detection Scoring

Irrespective of the approach used, and unless the method inherently scores the anomaly, a k-nearest-neighbors (kNN) algorithm is employed to assign an anomaly score to each validation image, and a thresholding operation determines whether the image is anomalous.
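A minimal sketch of this scoring step, assuming scikit-learn and pre-computed embeddings; the number of neighbors and the threshold are illustrative values that would be calibrated on held-out nominal data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def fit_scorer(train_embeddings: np.ndarray, k: int = 5) -> NearestNeighbors:
    # Fit kNN on the nominal training embeddings only.
    return NearestNeighbors(n_neighbors=k).fit(train_embeddings)

def anomaly_score(knn: NearestNeighbors, z: np.ndarray) -> float:
    # Mean distance to the k nearest nominal embeddings; higher = more anomalous.
    dist, _ = knn.kneighbors(z.reshape(1, -1))
    return float(dist.mean())

# knn = fit_scorer(train_embeddings)
# is_anomalous = anomaly_score(knn, z_test) > threshold  # threshold calibrated separately
```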

Evaluation Protocol

For the anomaly detection task we have adopted the industry-standard evaluation protocol and metrics. During training we use only PASS-class images, while during validation we use both PASS and FAIL class images and report key performance metrics: the Precision-Recall curve, the associated Area Under the Curve (AUC) score, and the validation confusion matrix.
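A sketch of how these metrics could be computed with scikit-learn; the scores, labels, and threshold below are dummy illustrative values standing in for per-image anomaly scores and the validation ground truth:

```python
import numpy as np
from sklearn.metrics import auc, confusion_matrix, precision_recall_curve

# Illustrative values: y_true holds 0 = PASS, 1 = FAIL; scores are model anomaly scores.
y_true = np.array([0, 0, 0, 1, 1])
scores = np.array([0.1, 0.3, 0.2, 0.8, 0.6])
chosen_threshold = 0.5  # operating point chosen elsewhere

# Precision-Recall curve and its area under the curve.
precision, recall, _ = precision_recall_curve(y_true, scores)
pr_auc = auc(recall, precision)

# Confusion matrix at the chosen operating threshold.
y_pred = (scores > chosen_threshold).astype(int)
cm = confusion_matrix(y_true, y_pred)

print(f"PR-AUC: {pr_auc:.3f}")
print("Confusion matrix:\n", cm)
```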