Overview
Wildfires are increasingly frequent and severe, motivating the need for accurate, high-resolution spatiotemporal wildfire spread forecasts from satellite data. While large foundation models have reshaped AI, applying them directly to geospatial data is difficult due to domain gaps and limited labeled data.
In this work we explore video foundation models (e.g., VideoMAEv2) for wildfire spread prediction using multi-modal satellite and environmental inputs, and introduce a training strategy called Cross-Modal Progressive Fine-Tuning (CMPF).
Contributions
- Cross-Modal Progressive Fine-Tuning (CMPF): a two-part adaptation strategy that (1) selects video-centric transformer backbones aligned with the spatiotemporal nature of wildfire dynamics, and (2) progressively fine-tunes them via an intermediate geospatial task before the final wildfire forecasting task.
- Architectural study of foundation models: comparison of ViT, MViTv2, and VideoMAEv2 for multi-temporal wildfire forecasting, demonstrating the advantages of video-based pretraining.
- Progressive training & data efficiency: CMPF improves Average Precision over direct fine-tuning, accelerates convergence, and yields robust gains across fire sizes, seasons, and regions.
Abstract
Wildfires pose escalating threats to ecosystems, communities, and climate systems, highlighting the urgent need for accurate, high-resolution spatiotemporal forecasting. In this work, we explore the untapped potential of video foundation models for advancing wildfire spread prediction using multimodal satellite data. While large-scale foundation models have transformed artificial intelligence and show promise in Geospatial Artificial Intelligence (GeoAI), their direct application to domain-specific tasks like wildfire forecasting faces two major hurdles: (1) a substantial domain gap between general pretraining data (e.g., natural images and videos) and geospatial data (e.g., multispectral satellite imagery), and (2) limited labeled data for fine-tuning in real-world GeoAI tasks.
To address these challenges, we introduce a Cross-Modal Progressive Fine-Tuning (CMPF) strategy tailored for wildfire forecasting. CMPF combines: (1) informed cross-modal architectural alignment, leveraging video-based Transformers pretrained on spatiotemporal tasks to better capture wildfire dynamics, and (2) progressive fine-tuning, which gradually adapts models to wildfire-specific representations through intermediate domain adaptation before task-specific tuning.
We evaluate CMPF using multiple Transformer backbones, including ViT, MViTv2, and VideoMAEv2, on wildfire spread forecasting benchmarks. Our results show that video foundation models, especially when fine-tuned progressively, outperform conventional CNNs and static vision Transformers in modeling wildfire evolution. These findings also validate the effectiveness of the proposed CMPF approach for adapting general-purpose AI foundation models to complex spatiotemporal geospatial tasks.
Method: Cross-Modal Progressive Fine-Tuning (CMPF)
CMPF is designed to bridge both the modality gap (natural videos vs. satellite time series) and the data gap (limited labeled wildfire events) when adapting large video foundation models to GeoAI tasks.
1. Informed cross-modal architectural choice
We compare three transformer families:
- Vision Transformer (ViT): image-based, with temporal information encoded via channel stacking.
- MViTv2: multiscale, hierarchical vision transformer with progressive spatial downsampling for multi-scale feature learning.
- VideoMAEv2: video foundation model with native 3D spatiotemporal patching and pretraining on large-scale video datasets.
For ViT and MViTv2, multi-temporal inputs are formed by stacking daily frames along the channel dimension, which requires re-initializing the first patch-embedding layer. In contrast, VideoMAEv2 can preserve its pretrained patch-embedding weights after a domain-specific pretraining stage, making it better aligned with spatiotemporal wildfire dynamics.
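The tokenization difference can be sketched in a few lines of numpy. The shapes below (23 channels, 224×224 tiles, 16×16 patches, tubelet depth 1) are illustrative assumptions, not values from the paper; the point is how channel stacking inflates the per-token dimension and discards the temporal axis, while 3D tubelets keep it explicit:

```python
import numpy as np

# Hypothetical shapes for illustration: T daily frames, C channels, H x W tiles.
T, C, H, W = 5, 23, 224, 224
P = 16     # spatial patch size
TUBE = 1   # temporal tubelet depth (assumed; VideoMAEv2 uses 2 on RGB video)

x = np.zeros((T, C, H, W), dtype=np.float32)

# ViT / MViTv2 route: stack time along channels -> one "image" with T*C channels.
# The patch embedding now maps T*C*P*P inputs per token, so the pretrained
# first layer (built for 3*P*P RGB inputs) must be re-initialized.
stacked = x.reshape(T * C, H, W)
vit_tokens = (H // P) * (W // P)
vit_token_dim = T * C * P * P

# VideoMAEv2 route: 3D tubelets keep the temporal axis explicit; only the
# channel count of the first projection changes, so the pretrained
# spatiotemporal attention layers carry over intact.
video_tokens = (T // TUBE) * (H // P) * (W // P)
video_token_dim = TUBE * C * P * P

print(vit_tokens, vit_token_dim)      # 196 tokens of dim 29440
print(video_tokens, video_token_dim)  # 980 tokens of dim 5888
```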
2. Progressive fine-tuning
- Stage 1 – Input adaptation: adapt the patch embedding to the geospatial channel configuration (multi-modal satellite & environmental inputs) and attach a task-specific segmentation head for next-day active fire prediction.
- Stage 2 – Intermediate geospatial adaptation: fine-tune on a large auxiliary wildfire dataset (NextDayWildfireSpread) using only modalities shared with the target dataset, with focal loss to handle extreme class imbalance.
- Stage 3 – Target-task specialization: fine-tune the intermediately adapted model on the target WildfireSpreadTS dataset, using all available modalities and multi-day input sequences (e.g., 1-day vs. 5-day histories).
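The focal loss used in the intermediate stage down-weights easy background pixels so the rare active-fire class dominates the gradient. A minimal numpy sketch of the standard binary formulation (Lin et al.) is below; `alpha` and `gamma` are common defaults, not values reported in the paper:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss for per-pixel fire / no-fire targets.

    p: predicted fire probabilities in [0, 1]; y: binary ground truth.
    The (1 - p_t)^gamma factor shrinks the loss on confidently correct
    (mostly background) pixels, mitigating extreme class imbalance.
    """
    p = np.clip(p, eps, 1.0 - eps)
    pt = np.where(y == 1, p, 1.0 - p)          # probability of the true class
    at = np.where(y == 1, alpha, 1.0 - alpha)  # class-balancing weight
    return float(np.mean(-at * (1.0 - pt) ** gamma * np.log(pt)))
```

In a real training loop this would be applied per pixel to the segmentation head's sigmoid outputs.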
Experiments & Results
We evaluate CMPF on the WildfireSpreadTS dataset for next-day active fire prediction at 375 m resolution, using both 1-day and 5-day input sequences. Average Precision (AP) on the active fire class is the primary metric.
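For reference, the AP metric over the active fire class can be computed directly from per-pixel fire probabilities; a minimal numpy implementation (equivalent to the usual area-under-the-precision-recall-curve definition) is:

```python
import numpy as np

def average_precision(scores, labels):
    """Average Precision over the positive (active fire) class.

    scores: per-pixel fire probabilities; labels: binary ground truth.
    Equals the mean of the precision values at the rank of each true
    positive, i.e. the step-wise area under the PR curve.
    """
    order = np.argsort(-scores)           # rank pixels by descending score
    labels = labels[order]
    tp = np.cumsum(labels)                # true positives at each rank
    precision = tp / np.arange(1, len(labels) + 1)
    return float(precision[labels == 1].mean())

scores = np.array([0.9, 0.8, 0.7, 0.6])
labels = np.array([1, 0, 1, 0])
print(average_precision(scores, labels))  # 0.8333...
```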
Architectural comparison (direct fine-tuning)
- VideoMAEv2 consistently outperforms ViT and MViTv2 when directly fine-tuned on WildfireSpreadTS, for both 1-day and 5-day inputs.
- All transformer models outperform CNN baselines such as U-Net, ConvLSTM, and UTAE, even though their far larger parameter counts make them harder to fine-tune on limited labeled wildfire data.
Impact of CMPF (progressive fine-tuning)
- Introducing the intermediate adaptation stage improves AP over direct fine-tuning for all transformer backbones.
- For VideoMAEv2, CMPF yields up to ~3% AP improvement and faster convergence (reaching baseline performance in fewer total epochs).
- Using the shared-modalities strategy in both stages (Strategy S1) achieves the strongest performance and is robust across architectures.
Robustness across fire characteristics & regions
- CMPF improves detection of small fires (few pixels) as well as medium and large events, mitigating severe class imbalance.
- Gains are consistent across seasons (summer/fall fires dominate) and across Western US states including California, Oregon, Idaho, and Montana.
Impact of auxiliary data volume
- Increasing the amount of auxiliary NextDayWildfireSpread data in the intermediate stage steadily improves AP on WildfireSpreadTS, with the largest gains coming from the first 50% of the auxiliary data.
- Performance begins to plateau beyond ~75% of the auxiliary dataset: using 100% yields the best AP (up to 0.407 with 5-day inputs, T = 5) but improves only marginally over 75%, suggesting a practical trade-off point between accuracy and compute.
Training efficiency
- Compared to direct fine-tuning, CMPF reaches the same or better AP in substantially fewer total epochs.
- With 40–60 intermediate epochs on the auxiliary dataset, the model matches or surpasses the direct fine-tuning baseline in roughly half the total training budget, demonstrating that progressive fine-tuning is both more accurate and more computationally efficient.
Interpretability and feature usage
- Mutual information analysis shows that CMPF’s reliance on each input feature closely follows the feature’s intrinsic relevance to the ground truth, with near-perfect rank alignment between feature–ground-truth and feature–model dependencies.
- Physically meaningful drivers such as active fire presence, wind, and temperature exhibit the highest dependence, indicating that CMPF bases its predictions on plausible geophysical controls rather than spurious correlations.
- Integrated Gradients–based spatial attributions further confirm that CMPF focuses on coherent regions (e.g., fire perimeters, dry and vegetated areas, topographic gradients), and that the relative importance of variables adapts sensibly across different fire events.
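The rank-alignment analysis can be sketched with a histogram-based mutual information estimate: compute MI between each feature and the ground truth, MI between each feature and the model's predictions, and check that the two rankings agree (Spearman correlation). This is a simplified numpy illustration of the idea with synthetic data, not the paper's exact estimator:

```python
import numpy as np

def hist_mi(x, y, bins=16):
    """Histogram-based mutual information estimate between two 1-D arrays."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def rank_alignment(features, gt, pred):
    """Spearman rank correlation between per-feature MI-with-ground-truth
    and MI-with-prediction; 1.0 means the model's reliance on each feature
    exactly follows that feature's intrinsic relevance."""
    mi_gt = np.array([hist_mi(f, gt) for f in features])
    mi_pred = np.array([hist_mi(f, pred) for f in features])
    r_gt = np.argsort(np.argsort(mi_gt)).astype(float)
    r_pred = np.argsort(np.argsort(mi_pred)).astype(float)
    r_gt -= r_gt.mean(); r_pred -= r_pred.mean()
    return float((r_gt @ r_pred) / np.sqrt((r_gt @ r_gt) * (r_pred @ r_pred)))

rng = np.random.default_rng(0)
gt = rng.integers(0, 2, 5000).astype(float)
f0 = gt + 0.3 * rng.normal(size=5000)   # strongly relevant feature
f1 = gt + 1.5 * rng.normal(size=5000)   # weakly relevant feature
f2 = rng.normal(size=5000)              # irrelevant feature
print(rank_alignment([f0, f1, f2], gt, gt))  # 1.0 for a "perfect" model
```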
Datasets
Target task: WildfireSpreadTS
- Multi-modal, multi-temporal dataset for next-day wildfire spread forecasting across the United States.
- 607 fire events, 13,607 daily image sets, 375 m spatial resolution.
- 23 input channels including fuel, topography, historic & forecast weather, and vegetation indices (e.g., NDVI, EVI2).
Auxiliary task: NextDayWildfireSpread
- Large-scale wildfire dataset (18,545 fire events, 2012–2020) for 1-day lead wildfire spread prediction at 1 km resolution.
- 12 input variables including active fire, fuel, topography, and meteorological conditions.
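Stage 2 restricts the auxiliary dataset to modalities shared with the target dataset, which in practice amounts to intersecting the two channel lists and remembering the auxiliary indices. The channel names below are made up for illustration (the real lists come from the two dataset papers); only the selection logic is the point:

```python
# Hypothetical channel names for illustration only.
TARGET_CHANNELS = ["active_fire", "elevation", "slope", "ndvi", "evi2",
                   "wind_speed", "wind_dir", "temperature", "humidity",
                   "precipitation", "fuel_type"]
AUX_CHANNELS = ["active_fire", "elevation", "ndvi", "wind_speed", "wind_dir",
                "temperature", "humidity", "precipitation", "fuel_type",
                "population", "drought_index", "erc"]

# Auxiliary channels that also exist in the target dataset, in target order,
# so Stage 2 input tensors line up with the Stage 3 channel layout.
shared = [c for c in TARGET_CHANNELS if c in AUX_CHANNELS]
aux_idx = [AUX_CHANNELS.index(c) for c in shared]
print(shared)
print(aux_idx)
```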
BibTeX
@ARTICLE{11343839,
  author={Li, Wenwen and Hsu, Chia-Yu and Wang, Sizhe},
  journal={IEEE Transactions on Geoscience and Remote Sensing},
  title={Adapting Video Foundation Models for Spatiotemporal Wildfire Forecasting via Cross-Modal Progressive Fine-Tuning},
  year={2026},
  volume={},
  number={},
  pages={1-1},
  keywords={Foundation models;Geospatial analysis;Wildfires;Forecasting;Artificial intelligence;Adaptation models;Videos;Transformers;Data models;Spatiotemporal phenomena;Geospatial Artificial Intelligence (GeoAI);Foundation Models;Spatiotemporal Forecasting;Wildfire Spread Prediction;Progressive Fine-tuning;Earth Observation;Satellite Imagery;Vision Transformer;Deep Learning},
  doi={10.1109/TGRS.2026.3652453}}