Test-Time Training for Industrial Anomaly Segmentation

Paper information

Title: Test Time Training for Industrial Anomaly Segmentation

Authors: Alex Costanzino et al.

Affiliations: CVLAB, Department of Computer Science and Engineering (DISI) – University of Bologna, Italy, and others

Abstract

Anomaly detection and segmentation (AD&S) are critical for industrial quality control. Many existing methods produce per-pixel anomaly score maps, but practical deployment requires binary segmentation masks that identify defects. Because labeled anomalous samples are scarce in many settings, the common approach is to binarize score maps with a threshold derived from statistics of a validation set that contains only normal samples, and this choice degrades segmentation performance. This paper proposes a test-time training strategy, Test Time Training for Anomaly Segmentation (TTT4AS), to improve segmentation. At test time, the method extracts rich features from each anomalous sample and uses them to train a classifier that separates defective pixels from normal ones. The approach can be applied downstream of any AD&S method that outputs anomaly score maps, including multimodal ones. Experiments on MVTec AD and MVTec 3D-AD show consistent improvements over baselines.

Key contributions

First investigation of test-time training (TTT) in AD&S.
Proposal of TTT4AS, a method that refines segmentation maps using anomaly scores from generic AD&S algorithms and features from common pre-trained networks.
Demonstration that the method improves binary anomaly segmentation across RGB and multimodal AD&S approaches, and avoids the need to select a fixed threshold for score map binarization.

Method overview

TTT4AS addresses the gap between anomaly score maps and the binary segmentation masks required in industrial inspection. Instead of relying on a threshold derived from normal-only validation statistics, the method performs a short, per-sample training step at test time. Pseudo-labels are derived from the anomaly score map, and a lightweight classifier is trained on features extracted by a pre-trained backbone to separate defect pixels from normal background within the input sample. Because the classifier is fitted to each sample individually, it can exploit feature-level information from that sample rather than a single global score threshold. The method is compatible with any approach that outputs anomaly score maps, including multimodal pipelines that combine RGB and 3D data.
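The per-sample step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (pseudo_labels, train_linear_svm, ttt_segment), the two percentiles, and the hand-written linear SVM trained by subgradient descent are all hypothetical choices made to keep the example self-contained with NumPy only. The toy data at the end is synthetic, and its first feature channel is idealised.

```python
import numpy as np


def pseudo_labels(score_map, hi_pct=99.0, lo_pct=50.0):
    """Pick confident anomalous / normal pixels from the score map.

    The percentiles are illustrative assumptions, not values from the paper.
    """
    hi = np.percentile(score_map, hi_pct)
    lo = np.percentile(score_map, lo_pct)
    return score_map >= hi, score_map <= lo


def train_linear_svm(x, y, lr=0.1, reg=1e-3, epochs=200):
    """Linear SVM fitted by full-batch subgradient descent on the hinge loss.

    x: (n, d) features, y: (n,) labels in {-1, +1}. Samples are weighted so
    that both classes contribute equally despite the heavy imbalance.
    """
    n_pos, n_neg = (y > 0).sum(), (y < 0).sum()
    s = np.where(y > 0, 0.5 / n_pos, 0.5 / n_neg)
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(epochs):
        active = y * (x @ w + b) < 1.0  # samples violating the margin
        coef = (s * y)[active]
        w -= lr * (reg * w - coef @ x[active])
        b -= lr * (-coef.sum())
    return w, b


def ttt_segment(score_map, features):
    """Train on this sample's pseudo-labels, then classify every pixel.

    score_map: (H, W) anomaly scores; features: (H, W, D) per-pixel features.
    """
    h, w_, d = features.shape
    flat = features.reshape(-1, d)
    # Standardize using statistics of this sample only.
    flat = (flat - flat.mean(axis=0)) / (flat.std(axis=0) + 1e-8)
    pos, neg = (m.ravel() for m in pseudo_labels(score_map))
    x = np.concatenate([flat[pos], flat[neg]])
    y = np.concatenate([np.ones(pos.sum()), -np.ones(neg.sum())])
    w, b = train_linear_svm(x, y)
    return (flat @ w + b > 0).reshape(h, w_)


# Toy sample: a 4x4 defect whose score map only fires strongly at its centre.
truth = np.zeros((16, 16), dtype=bool)
truth[6:10, 6:10] = True
score_map = np.full((16, 16), 0.1)
score_map[6:10, 6:10] = 0.3
score_map[7:9, 7:9] = 0.9
# Two synthetic feature channels: one idealised defect cue, one nuisance ramp.
ramp = np.tile(np.arange(16) / 15.0, (16, 1))
features = np.stack([truth.astype(float), ramp], axis=-1)

mask = ttt_segment(score_map, features)  # per-sample classifier
fixed = score_map > 0.5                  # a fixed global threshold, for contrast
```

In this toy case the fixed threshold keeps only the high-scoring centre of the defect, while the classifier trained on that centre generalises to the rest of the region through the features. Real backbone features are far less clean, so the gap shown here should not be read as indicative of the reported results.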

Qualitative results

On MVTec AD, the authors show, for each category, the RGB image, ground truth mask, anomaly score map, thresholded binary segmentation, and the binary segmentation produced by applying TTT4AS to PatchCore using WideResNet-50 and DINO-v2 backbones.

On MVTec 3D-AD, results include RGB images, point clouds, ground truth, anomaly score maps, thresholded binary segmentation, and binary segmentation using TTT4AS applied to M3DM and CMM multimodal methods.

Experimental results

Evaluation was performed on two datasets: MVTec AD (RGB) and MVTec 3D-AD (RGB plus 3D data). The authors tested TTT4AS with different feature extractors and AD&S backbones.

2D anomaly segmentation: On MVTec AD, experiments with PatchCore using WideResNet-50 and DINO-v2 features show that TTT4AS improves average precision, recall, and F1 score relative to the baseline. With WideResNet-50 features, precision increased by 2.1%, recall by 2.7%, and F1 by 15%. With DINO-v2 features, precision increased by 17.2%, recall by 8.5%, and F1 by 23.8%.
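For reference, pixel-level precision, recall, and F1 for a single predicted mask can be computed as in the sketch below. The function name segmentation_metrics is hypothetical, and how the paper aggregates these values across images and categories is not stated in this summary, so no averaging is shown.

```python
import numpy as np


def segmentation_metrics(pred, gt):
    """Pixel-level precision, recall and F1 for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.logical_and(pred, gt).sum()    # defect pixels correctly predicted
    fp = np.logical_and(pred, ~gt).sum()   # normal pixels predicted as defect
    fn = np.logical_and(~pred, gt).sum()   # defect pixels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return float(precision), float(recall), float(f1)


pred_mask = [[1, 1, 0], [0, 1, 0]]
gt_mask = [[1, 0, 0], [0, 1, 1]]
precision, recall, f1 = segmentation_metrics(pred_mask, gt_mask)
```

Returning 0.0 when a denominator is empty is a convention chosen here for simplicity; other conventions exist for images without any defect or any prediction.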

Multimodal anomaly segmentation: TTT4AS was applied to multimodal methods such as M3DM and CMM, which fuse RGB and 3D data. The method improved average precision and average F1 over baselines. For M3DM, TTT4AS reduced recall in some cases while increasing precision and F1; for CMM, precision and F1 improved by 10.5%.

Ablation studies: Ablations show that the exact percentile used to detect peaks in the score map is not critical, with stable performance across percentiles. Using features from pre-trained models as input to an SVM classifier outperformed directly using raw anomaly scores as classifier input.
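One plausible way to detect peaks with a percentile is sketched below: keep pixels that are the maximum of their local neighbourhood (a simple form of non-maximum suppression) and that also reach a percentile of the score map. This is an assumption about how such a step could look, not the procedure from the paper; the function name find_peaks, the radius parameter, and the toy score map are hypothetical.

```python
import numpy as np


def find_peaks(score_map, pct=99.0, radius=1):
    """Return (row, col) coordinates of local maxima at or above a percentile."""
    score_map = np.asarray(score_map, dtype=float)
    thr = np.percentile(score_map, pct)
    h, w = score_map.shape
    padded = np.pad(score_map, radius, mode="constant", constant_values=-np.inf)
    # Maximum over each (2r+1) x (2r+1) neighbourhood, built from shifted views.
    local_max = np.full((h, w), -np.inf)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            local_max = np.maximum(local_max, padded[dy:dy + h, dx:dx + w])
    # Ties on flat plateaus are all kept; a real pipeline may want to merge them.
    keep = (score_map == local_max) & (score_map >= thr)
    return np.argwhere(keep)


# Toy score map with two isolated peaks and one weaker neighbour.
toy_scores = np.zeros((8, 8))
toy_scores[2, 2] = 1.0
toy_scores[2, 3] = 0.5
toy_scores[5, 6] = 0.8

peaks_97 = find_peaks(toy_scores, pct=97.0)
peaks_99 = find_peaks(toy_scores, pct=99.0)
```

On this toy map the stricter percentile drops the weaker of the two peaks, which shows the kind of sensitivity the ablation examines. The toy map is far smaller and sparser than a real score map, so it says nothing about the stability the authors report.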

Limitations and future work

TTT4AS improves anomaly segmentation by training a per-sample classifier at test time using pseudo-labels from anomaly score maps. Limitations include potential missed peaks in non-maximum suppression and other failure modes that warrant further refinement. Future work can address these limitations and explore broader applications of test-time training in unsupervised industrial anomaly detection.