Pretrained Model release v0

This changelog summarizes the changes to the pretrained model weights for the Clay model. We follow the “Stanford Foundation Model Transparency Index”.

Model weights released on 2024/01/12.

For release notes for the source code, see


Clay v0 is a self-supervised modified vision transformer model trained on stacks of Sentinel-2, Sentinel-1 & DEM data. It is trained as a Masked Autoencoder (MAE) to reconstruct the original image from a masked image.

With the pre-trained model, you can input stacks of geospatial data and output vector embeddings, which capture spatial, temporal, and spectral information about Earth and represent these relationships numerically in high-dimensional space. Each embedding is representative of a certain area of Earth at a certain point in time.
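One common way to use such embeddings is to compare them with cosine similarity, so that areas with similar characteristics score close to 1. The sketch below is illustrative only: the function names, the chip keys and the 4-dimensional toy vectors are made up, and this document does not state that the Clay tooling uses this measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def most_similar(query, candidates):
    """Return (key, score) of the candidate embedding closest to the query."""
    scored = {key: cosine_similarity(query, emb) for key, emb in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Toy vectors standing in for real embeddings (made-up values).
query = [0.9, 0.1, 0.0, 0.2]
candidates = {
    "chip_a": [0.8, 0.2, 0.1, 0.1],
    "chip_b": [-0.5, 0.9, 0.3, 0.0],
}
best_key, best_score = most_similar(query, candidates)
print(best_key, round(best_score, 3))
```

Real embeddings have many more dimensions, but the comparison works the same way.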

Each data entry is a stack of 10 bands of Sentinel-2, 2 bands of Sentinel-1 & 1 band of DEM data. The model is trained with 3 timesteps of data for each location, with a total of 1203 MGRS tiles globally distributed, each of size 10km x 10km. The data was collected from the Microsoft Planetary Computer.

The model was trained on AWS on 4 NVIDIA A10G GPUs for 25 epochs (~14h per epoch) in December 2023.

Model weights are available on HuggingFace here.

We also generated embeddings for all training data, which can be found on Source Cooperative here.

Model Architecture

Clay is an MAE, with a modified ViT encoder that compresses the input into embeddings, and a decoder that reconstructs the masked parts of the original image. The loss function is the MSE between the original image and the reconstructed image.
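As a minimal sketch of the loss described above, the mean squared error can be computed over flattened pixel values as follows. The function name and the sample values are illustrative, and whether the real loss averages over all pixels or only over masked patches is not stated here, so this version simply averages over everything it is given.

```python
def mse(original, reconstructed):
    """Mean squared error between two equal-length sequences of pixel values."""
    if len(original) != len(reconstructed):
        raise ValueError("inputs must have the same length")
    if not original:
        raise ValueError("inputs must not be empty")
    squared_errors = [(o - r) ** 2 for o, r in zip(original, reconstructed)]
    return sum(squared_errors) / len(squared_errors)


# Made-up pixel values for a tiny flattened "image".
original = [0.0, 1.0, 2.0, 3.0]
reconstructed = [0.0, 1.5, 2.0, 2.0]
loss = mse(original, reconstructed)
print(loss)
```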

For details, check the source code here.


  • Core Framework: Lightning and its dependencies, like PyTorch, etc.

  • Input modalities:

    • Fixed spec of 10 bands of Sentinel-2, 2 bands of Sentinel-1 & 1 band of DEM data. See below for details.

  • Output modalities:

    • As a masked auto-encoder, fixed spec of 10 bands of Sentinel-2, 2 bands of Sentinel-1 & 1 band of DEM data, to mimic the input as closely as possible.

  • Model size:

    • Number of parameters: 127M

    • Model size on disk: ~500MB.

  • Model license:

    • Source code: Apache 2.0

    • Model weights: OpenRAIL-M

      • Prohibited uses: See OpenRAIL-M license section 5.

  • Feedback and redress mechanisms:

    • Please open an issue or discussion on the GitHub repository or send an email to
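The parameter count and on-disk size listed above are consistent with each other if the weights are stored as 32-bit floats. That storage format is an assumption; it is not stated in this document.

```python
NUM_PARAMETERS = 127_000_000  # "127M" from the list above
BYTES_PER_PARAMETER = 4       # assumption: 32-bit floating point weights

size_bytes = NUM_PARAMETERS * BYTES_PER_PARAMETER
size_mb = size_bytes / 1_000_000  # decimal megabytes
print(f"{size_mb:.0f} MB")
```

Any optimizer state or metadata saved alongside the weights would add to this figure.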

Model Card

For v0 of Clay, we used the clay_small model setup.

INPUT SIZE = 13 bands x 512 width x 512 height
PATCH SIZE = 32 x 32

    Learning rate = 1e-4
    Weight decay = 0.05
    Beta 1 = 0.9
    Beta 2 = 0.95

    T_0 = 1000
    T_mult = 2
    eta_min = Learning rate * 10

    dim = 768
    depth = 12
    heads = 12
    dim_head = 64
    mlp_ratio = 4
    dropout = 0.0
    emb_dropout = 0.0

    decoder_dim = 512
    decoder_depth = 8
    decoder_heads = 8
    decoder_dim_head = 64
    decoder_mlp_ratio = 4
    decoder_dropout = 0.0
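The names T_0, T_mult and eta_min match the parameters of cosine annealing with warm restarts (for example PyTorch's CosineAnnealingWarmRestarts). Assuming that is the scheduler in use, the sketch below reimplements its formula in plain Python to show how the learning rate evolves. The function names are illustrative, and whether a "step" is a batch or an epoch depends on how the scheduler is stepped, which is not stated here.

```python
import math


def cosine_warm_restart_lr(step, base_lr, eta_min, t_0, t_mult):
    """Learning rate at `step` under cosine annealing with warm restarts."""
    t_i = t_0     # length of the current cycle
    t_cur = step  # steps elapsed inside the current cycle
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_i)) / 2


def restart_steps(t_0, t_mult, count):
    """Steps at which the first `count` restarts happen."""
    steps, boundary, t_i = [], 0, t_0
    for _ in range(count):
        boundary += t_i
        steps.append(boundary)
        t_i *= t_mult
    return steps


LEARNING_RATE = 1e-4
ETA_MIN = LEARNING_RATE * 10  # as listed above
T_0, T_MULT = 1000, 2

print(restart_steps(T_0, T_MULT, 3))
for step in (0, 500, 999, 1000):
    print(step, cosine_warm_restart_lr(step, LEARNING_RATE, ETA_MIN, T_0, T_MULT))
```

With the listed values, eta_min (1e-3) is larger than the base learning rate (1e-4), so under this formula each cycle starts at 1e-4 and climbs toward 1e-3 before resetting, and each cycle is twice as long as the previous one.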

Data Card

We organize our input dataset creation in MGRS tiles. Each tile is a 10km x 10km area. We have 1203 tiles in total, each with 3 timesteps of data between 2017 and 2023, so 3609 tile-timestep combinations in total. Each timestep is a stack of 10 bands of Sentinel-2, 2 bands of Sentinel-1 & 1 band of DEM data. Each tile is split into 512 x 512 chips, so we have around 1.2 million chips in total. Each chip contains 13 bands: 10 Sentinel-2 bands, 2 Sentinel-1 bands & 1 DEM band. We store each chip as a GeoTIFF, along with the coordinate & timestamp information that is used for model training.
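The counts in the paragraph above can be reproduced with a few lines of arithmetic. The constant names are illustrative, and the total chip count is the approximate figure from the text rather than an exact number.

```python
MGRS_TILES = 1203
TIMESTEPS_PER_TILE = 3
BANDS = {"Sentinel-2": 10, "Sentinel-1": 2, "DEM": 1}
CHIP_SIZE = 512                 # pixels per side
APPROX_TOTAL_CHIPS = 1_200_000  # "~1.2 million" from the text, not an exact count

tile_timesteps = MGRS_TILES * TIMESTEPS_PER_TILE          # tile and timestep pairs
bands_per_chip = sum(BANDS.values())                      # channels in each chip
values_per_chip = bands_per_chip * CHIP_SIZE * CHIP_SIZE  # pixel values per chip
avg_chips_per_tile_timestep = APPROX_TOTAL_CHIPS / tile_timesteps

print(tile_timesteps, bands_per_chip, values_per_chip)
print(round(avg_chips_per_tile_timestep, 1))
```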

Tile locations

  • Training dataset size: 6.4 TB

  • Training dataset source links:

  • Training dataset items:

    • The actual list of files used is available here.

  • Data source selection and curation process:

    • We aim for fully open data, with global and historical coverage, with the highest spatial, temporal and spectral resolution, hosted in a cloud format that makes it easy to search for and download the needed sections.

    • Once these sources are selected, we make a statistical sample based on cover type, so that we have a good coverage of the different landscapes. The land cover data is from ESA WorldCover 2021.

  • Data augmentation:

    • We do not use any data augmentation techniques like affine transformations, random crops (except the masked autoencoder task), etc. We also do not use input mixing like CutMix, MixUp, etc.

    • Clouds, cloud shadows, smog, atmospheric scattering, mid-air planes and other non-ground registrations could be considered natural augmentations. We explicitly filter out chips with a large percentage of cloud cover, but small clouds and their shadows might be present. As we increase the number of observations per location, and bands, we expect the model to learn to ignore single events but register patterns (places that are often cloudy or with smog).

  • PII or harmful content:

    • We believe that satellite images at this resolution (10m/px) are not subject to PII or harmful content concerns.

  • Human evaluation, wages, and annotation process:

    • Besides tweaking the statistical samples as part of the model development team, and the stated dataset hosting partners, we do not use any human evaluation, or annotation process, or third party services.


Normalization parameters

To normalize the data before passing it to the model, we computed the following normalization parameters from a random sample of the training data. The normalization parameters are used in the Data Module. For partial inputs, it will be necessary to subset these as shown in the partial input tutorial.

Band              Standard deviation
Sentinel-2 B02
Sentinel-2 B03
Sentinel-2 B04
Sentinel-2 B05
Sentinel-2 B06
Sentinel-2 B07
Sentinel-2 B08
Sentinel-2 B8A
Sentinel-2 B11
Sentinel-2 B12
Sentinel-1 VV
Sentinel-1 VH
Copernicus DEM
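As a rough sketch of how per-band parameters could be applied and subset for partial inputs, the example below standardizes each band as (value - mean) / std. That form of normalization is an assumption, the helper names are hypothetical, and the MEAN and STD values are placeholders, not the real Clay statistics.

```python
BANDS = [
    "Sentinel-2 B02", "Sentinel-2 B03", "Sentinel-2 B04", "Sentinel-2 B05",
    "Sentinel-2 B06", "Sentinel-2 B07", "Sentinel-2 B08", "Sentinel-2 B8A",
    "Sentinel-2 B11", "Sentinel-2 B12", "Sentinel-1 VV", "Sentinel-1 VH",
    "Copernicus DEM",
]


def subset_params(params, all_bands, selected_bands):
    """Pick the per-band parameters for `selected_bands`, in that order."""
    index = {band: i for i, band in enumerate(all_bands)}
    return [params[index[band]] for band in selected_bands]


def normalize(chip, mean, std):
    """Standardize each band; `chip` is a list of bands, each a flat list of pixels."""
    if not (len(chip) == len(mean) == len(std)):
        raise ValueError("need one mean and one std per band")
    return [[(v - m) / s for v in band] for band, m, s in zip(chip, mean, std)]


# Placeholder statistics for illustration only, NOT the real Clay values.
MEAN = [float(i + 1) for i in range(len(BANDS))]
STD = [2.0] * len(BANDS)

# Partial input: only the three visible bands, in red, green, blue order.
rgb = ["Sentinel-2 B04", "Sentinel-2 B03", "Sentinel-2 B02"]
rgb_mean = subset_params(MEAN, BANDS, rgb)
rgb_std = subset_params(STD, BANDS, rgb)

chip = [[3.0, 5.0], [2.0, 4.0], [1.0, 3.0]]  # 3 bands x 2 made-up pixels
normalized = normalize(chip, rgb_mean, rgb_std)
print(normalized)
```

Subsetting by band name rather than by position keeps the parameters aligned with the data even when the partial input lists its bands in a different order.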



Training Card

  • Compute Resources:

    • AWS EC2 g5.12xlarge with 4 NVIDIA A10G GPUs

  • Batch Size:

    • Batch Size = 10

    • Effective Batch Size = Batch Size x Number of GPUs x Gradient Accumulation Steps = 10 x 4 x 5 = 200

  • Training Time:

    • 25 epochs, each taking ~15h to train.

  • Carbon Emissions:

    • According to the “Customer Carbon Emission Tool”, there were no Scope 1 or Scope 2 carbon emissions. Following the documentation, we believe this is due to the usage of renewable energy sources. We are aware that Scope 3 emissions might be significant for data centers and that these are not included in the estimate.

  • Training stages:

    • While developing the model, we ran small tests locally and on the cloud. We estimate that all testing and development compute is less than the compute used for 1 epoch of training.

    • QA of the model is also done locally and on the cloud, and we estimate that it is less than the compute used for 1 epoch of training.

  • Release and distribution:

    • Model development happens in an open source repository on GitHub here.

    • We release the model weights on HuggingFace here.

    • We release the embeddings on Source Cooperative here.

    • We do not have other distribution channels at this time.

  • Production use:

    • We support our partners to build applications with the model, and we expect them to use the model in production.

    • We are developing a web application and expect to release it in 2024 Q1.
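The batch-size and training-time figures in the list above combine as follows. The constant names are illustrative, and the hours are based on the approximate per-epoch time, so the totals are estimates.

```python
BATCH_SIZE = 10        # per GPU
NUM_GPUS = 4
GRAD_ACCUM_STEPS = 5
EPOCHS = 25
HOURS_PER_EPOCH = 15   # "~15h" from the list above, approximate

# Samples contributing to each optimizer update.
effective_batch_size = BATCH_SIZE * NUM_GPUS * GRAD_ACCUM_STEPS

# Wall-clock time and GPU time for the full run.
total_hours = EPOCHS * HOURS_PER_EPOCH
gpu_hours = total_hours * NUM_GPUS

print(effective_batch_size, total_hours, gpu_hours)
```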

Learning Rate & Epoch

MSE Loss for Pixel Reconstruction


As a foundational model, it is designed to be used as a building block for other models. In this section we only show a sample of the training objective, which is to reconstruct the original image from a 75% masked image.
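To give a sense of what 75% masking means at the patch level, the sketch below splits a 512 x 512 input into 32 x 32 patches and hides three quarters of them at random. It only considers the spatial grid; how bands are grouped into tokens and how the real model samples its mask are not described here, so uniform random sampling and the function name are assumptions.

```python
import random

IMAGE_SIZE = 512
PATCH_SIZE = 32
MASK_RATIO = 0.75


def random_patch_mask(image_size, patch_size, mask_ratio, seed=None):
    """Split a square image into a patch grid and pick patches to hide.

    Returns (masked, visible): two sorted lists of flat patch indices.
    """
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of patch size")
    per_side = image_size // patch_size
    num_patches = per_side * per_side
    num_masked = int(num_patches * mask_ratio)
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return masked, visible


masked, visible = random_patch_mask(IMAGE_SIZE, PATCH_SIZE, MASK_RATIO, seed=0)
print(len(masked), len(visible))
```

Under these assumptions the encoder would see 64 of the 256 spatial patches, and the decoder would have to reconstruct the remaining 192.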


Performance Metrics

The model shows the following performance characteristics for its Masked Autoencoder objective:

  • Training loss: 0.52

  • Validation loss: 0.46

Known Limitations and Biases

  • The model is trained on Sentinel data only.

  • Sentinel data only covers land and coastal waters.

  • We only train on a very small sample of the Sentinel archives, both in terms of spatial coverage and time.

  • We do not train on the poles or the open ocean, nor on oceanic or atmospheric volumetric data.

  • We do not train on night time data.

  • We do not explicitly include extreme events in the training data.

  • We train on at most 3 different timesteps per location.

Ethical Considerations

Our goal is to lower the barrier to using EO data for biodiversity and climate change mitigation and adaptation. We have designed our model to support this goal.

We have also designed our model to be as open as possible, as modular as possible, as undifferentiated and general as possible, and as well documented as possible, so we can maximize the leverage of the resources needed for the creation of this model.

Because the model is fully open, however, we cannot control how it is used. We are aware that EO data can be used for harmful purposes, and we are committed to working with our partners to prevent this from happening.