
Change Detection in Remote Sensing Using Foundation Models

Author: Adrian · September 26, 2025

Change detection is a key research area in remote sensing, playing a central role in observing and analyzing surface changes. Although deep learning methods have achieved notable results, performing high-precision change detection in spatiotemporally complex remote sensing scenarios remains challenging. Recent foundation models, with strong generalization capabilities, offer a potential solution, but bridging gaps across data and task domains is still difficult. This article summarizes a proposed method called Time-Traveling Pixels (TTP), which integrates the general knowledge of the Segment Anything Model (SAM) into change detection. TTP addresses domain shift during knowledge transfer and the difficulty large models have in representing homogeneity and heterogeneity across multi-temporal images. Results on the LEVIR-CD dataset demonstrate the method's effectiveness.

Background

With advances in Earth observation, change detection in remote sensing has become a prominent research topic. The primary goal is to analyze changes of interest across multi-temporal remote sensing products, typically formulated as pixel-level binary classification (changed/unchanged). Surface dynamics captured by remote sensing are influenced by both natural processes and human activities. Accurate detection of these changes is critical for quantitative analysis of land cover and can inform assessments of economic trends, human activity, and climate change.

High-resolution remote sensing imagery is a powerful tool for detecting complex changes. However, robust change detection in complex scenarios remains difficult. The task must isolate "effective changes" from "non-semantic changes": differences caused by atmospheric conditions, sensor variation, and registration errors, as well as semantic changes irrelevant to the downstream application, should all be ignored. This requirement poses significant challenges for accurate change detection.

Deep learning has pushed the field forward. CNN-based algorithms use strong feature extraction to reveal robust change cues, producing impressive results across varied scenes. More recently, Transformer-based methods have advanced the field further by capturing long-range dependencies and providing a global receptive field, which benefits tasks requiring high-level semantic understanding such as change detection. Despite these advances, adaptability in complex, evolving spatiotemporal environments remains limited. The limited annotated data available for change detection further constrains the potential of very large models. Self-supervised representation learning and synthetic data generation have made progress but still cannot cover the full diversity of scenes caused by spatiotemporal variation, nor fully enable large-parameter models to perform consistently across different scenarios.

Foundation Models and Domain Challenges

Foundation models trained on large-scale data have shown strong generalization and adaptability. Visual foundation models such as CLIP and SAM capture broad, transferable representations that can reduce the need for task-specific annotations. However, most visual foundation models are designed for natural images, creating domain gaps when applied to remote sensing imagery for change detection. In addition, these models are typically optimized for single-image understanding and often struggle to represent homogeneity and heterogeneity across multiple temporal images, a capability that is essential for change detection.

Proposed Method: Time-Traveling Pixels (TTP)

This work integrates the general knowledge of visual foundation models into the change detection task, addressing domain shift during knowledge transfer and the challenge of representing multi-temporal image homogeneity and heterogeneity. The proposed Time-Traveling Pixels (TTP) method integrates temporal information into the pixel semantic feature space. Specifically, TTP leverages SAM's general segmentation knowledge and introduces low-rank fine-tuning parameters into the SAM backbone to mitigate spatial-semantic domain shift. TTP also introduces a Time-Travel Activation Gate that allows temporal features to permeate the pixel semantic space, enabling the foundation model to capture homogeneity and heterogeneity between two temporal images. Finally, the method includes a lightweight, efficient multi-level change prediction head to decode dense, high-level change semantics. Together, these components support accurate and efficient remote sensing change detection.
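To make the low-rank fine-tuning idea concrete, the sketch below wraps a frozen linear projection (such as an attention projection inside the SAM image encoder) with trainable low-rank adapters. The rank, scaling, and module names are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of low-rank (LoRA-style) fine-tuning on a frozen linear
# projection, as one might attach it inside the SAM image encoder.
# Rank, scaling, and wiring are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # keep the pretrained SAM weights frozen
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)    # update starts at zero, so the layer initially matches the base
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen pretrained path plus a small trainable low-rank correction
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Usage (attribute names assumed): wrap the qkv projection of each encoder block.
# for blk in sam_image_encoder.blocks:
#     blk.attn.qkv = LoRALinear(blk.attn.qkv, rank=4)
```

Only the adapter weights are trained, which is what keeps the memory footprint small enough for the single-GPU training noted later in this article.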

Main Contributions

  1. The authors address the scarcity of annotated data by transferring the latent general knowledge of foundation models into the change detection task. They bridge spatiotemporal domain gaps in the knowledge transfer process via the Time-Traveling Pixels mechanism.
  2. Technically, they introduce low-rank fine-tuning to alleviate spatial-semantic domain shift, propose a Time-Travel Activation Gate to enhance the foundation model's ability to recognize inter-image correlations (a gating sketch follows this list), and design a lightweight multi-level prediction head to decode dense semantic information encoded in the foundation model.
  3. The method is compared with several advanced approaches on the LEVIR-CD dataset. Results indicate state-of-the-art performance, highlighting the method's effectiveness and potential for further application.
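The Time-Travel Activation Gate referenced in contribution 2 can be pictured as a cross-temporal gating module: a gate computed from both temporal features decides how much information from one time step flows into the pixel features of the other. The sketch below is one plausible reading of that idea under these assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a cross-temporal gate that lets features from one temporal
# image "travel" into the pixel feature space of the other.
import torch
import torch.nn as nn

class TimeTravelGate(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # gate derived from the concatenated bi-temporal features (assumption)
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_t1: torch.Tensor, feat_t2: torch.Tensor):
        g = self.gate(torch.cat([feat_t1, feat_t2], dim=1))
        # each branch is modulated by information from the other time step,
        # exposing homogeneity/heterogeneity between the two images
        out_t1 = feat_t1 + g * feat_t2
        out_t2 = feat_t2 + g * feat_t1
        return out_t1, out_t2
```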

Summary and Practical Notes

By injecting foundation model knowledge into change detection, the proposed approach mitigates generalization limitations in complex spatiotemporal remote sensing scenarios. Low-rank fine-tuning helps bridge spatial-semantic gaps between natural and remote sensing images, and the Time-Travel Activation Gate provides temporal modeling capability to the foundation model. The multi-level change prediction head decodes dense features into change predictions. Experiments on the LEVIR-CD dataset demonstrate the method's effectiveness. The authors note that the approach can be trained on a single NVIDIA RTX 4090 GPU.
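For a sense of what a lightweight multi-level change prediction head might look like, the sketch below differences bi-temporal features from several encoder levels, fuses them, and predicts a binary change logit map. Channel counts and layer choices are assumptions for illustration, not the published architecture.

```python
# Illustrative sketch of a lightweight multi-level change prediction head:
# per-level bi-temporal differences are reduced, fused, and decoded into a
# single change/no-change logit map. Channel sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChangeHead(nn.Module):
    def __init__(self, in_channels=(256, 256, 256), hidden=128):
        super().__init__()
        self.reduce = nn.ModuleList(
            nn.Conv2d(c, hidden, kernel_size=1) for c in in_channels
        )
        self.fuse = nn.Conv2d(hidden * len(in_channels), hidden, kernel_size=3, padding=1)
        self.classify = nn.Conv2d(hidden, 1, kernel_size=1)  # change / no-change logit

    def forward(self, feats_t1, feats_t2):
        # per-level absolute difference captures bi-temporal change cues
        diffs = [r(torch.abs(a - b)) for r, a, b in zip(self.reduce, feats_t1, feats_t2)]
        size = diffs[0].shape[-2:]
        diffs = [F.interpolate(d, size=size, mode="bilinear", align_corners=False) for d in diffs]
        logits = self.classify(self.fuse(torch.cat(diffs, dim=1)))
        # upsample to the input resolution outside the head if needed
        return logits
```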