Colloquium

Improving the temporal consistency and accuracy of land cover fraction mapping and change using Vision Transformer

Organised by Laboratory of Geo-information Science and Remote Sensing
Date Tue 15 April 2025, 09:00 to 09:30

Venue Gaia, building number 101
Droevendaalsesteeg 3
6708 PB Wageningen
+31 (0) 317 - 48 17 00
Room 1

By Qin Xu

Abstract
Land cover fraction mapping and change detection are essential for sustainable environmental management and have gained importance with advances in the spatial and temporal resolution of satellite observations. Land cover fraction mapping describes the Earth's surface efficiently because it represents the proportion of each class within a pixel. Vision Transformers (ViTs) are a promising alternative to convolutional neural networks (CNNs), using self-attention to capture global, long-range dependencies.
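For intuition, below is a minimal NumPy sketch of the scaled dot-product self-attention that underlies a ViT; the token count, dimensions, and weight names are illustrative and not taken from the thesis:

    import numpy as np

    def self_attention(x, w_q, w_k, w_v):
        # x: (n_tokens, d_model); each weight matrix: (d_model, d_head).
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.T / np.sqrt(k.shape[-1])         # pairwise token affinities
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all tokens
        return weights @ v                              # each output mixes the whole sequence

    rng = np.random.default_rng(0)
    x = rng.normal(size=(2700, 10))  # e.g. 12*15*15 spatio-temporal tokens, 10 bands (illustrative)
    out = self_attention(x, *(rng.normal(size=(10, 16)) for _ in range(3)))
    print(out.shape)                 # (2700, 16)

Because every token attends to every other token, the model can relate distant pixels and time steps directly, which is what is meant here by global and long-range dependencies.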

In this project, we propose an adapted Vision Transformer designed to capture both spatial and temporal information. Three adapted models were compared: monthly_15 (ViT1), trained on monthly time series with 15×15 spatial context; monthly_5 (ViT2), trained on monthly time series with 5×5 spatial context; and yearly_15 (ViT3), trained on yearly time series with 15×15 spatial context. Sentinel-2 imagery with 10 high-resolution bands was aggregated to 20 m resolution and paired with fractional reference data at the same resolution as model input.
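As a hedged illustration of the three configurations, the inputs could be laid out as (batch, time, bands, height, width) tensors; the batch size, the number of monthly and yearly time steps, and the tokenisation below are assumptions for illustration, not specifications from the thesis:

    import torch

    B, BANDS = 8, 10                                  # assumed batch size; 10 Sentinel-2 bands at 20 m

    x_monthly_15 = torch.randn(B, 12, BANDS, 15, 15)  # monthly_15 ViT1 (12 months assumed)
    x_monthly_5  = torch.randn(B, 12, BANDS, 5, 5)    # monthly_5  ViT2
    x_yearly_15  = torch.randn(B, 3, BANDS, 15, 15)   # yearly_15  ViT3 (3 years assumed)

    # One plausible tokenisation: one token per pixel per time step,
    # so self-attention can mix information across space and time jointly.
    tokens = x_monthly_15.permute(0, 1, 3, 4, 2).reshape(B, -1, BANDS)
    print(tokens.shape)                               # torch.Size([8, 2700, 10])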

The overall error metrics, MAE and RMSE, indicate that yearly_15 (ViT3) achieves the lowest overall error, both across all locations and at locations that experienced change. The fraction range metrics likewise show that ViT3 maps land cover fraction change best once "no change" pixels are excluded. However, monthly_5 (ViT2) yields the lowest error in predicting fraction change according to the overall error metrics. These results suggest that narrow spatial views can detect subtle local transitions effectively, but that their overall scores are strongly influenced by the large number of temporally stable pixels. They also show that a yearly temporal resolution gives better overall accuracy in predicting land cover fractions than a monthly one, despite using fewer time points.

Future research could investigate alternative loss functions and transfer learning on temporal embeddings, and could examine the Swin Transformer with shifted window configurations, varying patch sizes, and different methods of temporal position encoding.
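For concreteness, MAE and RMSE over per-pixel class fractions can be computed as in the sketch below; the toy numbers are invented for illustration and are not results from the study:

    import numpy as np

    def fraction_errors(pred, ref):
        # pred, ref: (n_pixels, n_classes) fractions in [0, 1], each row summing to 1.
        err = pred - ref
        return np.abs(err).mean(), np.sqrt((err ** 2).mean())  # MAE, RMSE

    # Hypothetical toy data: 4 pixels, 3 classes.
    ref  = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [1.0, 0.0, 0.0], [0.4, 0.4, 0.2]])
    pred = np.array([[0.5, 0.4, 0.1], [0.3, 0.4, 0.3], [0.9, 0.1, 0.0], [0.4, 0.3, 0.3]])
    mae, rmse = fraction_errors(pred, ref)
    print(f"MAE={mae:.3f}, RMSE={rmse:.3f}")

Restricting the same computation to pixels whose reference fractions differ between two dates is one way to keep temporally stable pixels from dominating the scores, which is the effect described above.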