# Training Data
This section describes how we created the training dataset for the Clay model.
## Data sources selection
The goal for the Clay model is to be as general as possible: it should accept data from satellite, aerial, or drone platforms. The model design is the basis for this. Drawing inspiration from earlier work on foundation models such as Prithvi, SatMAE, ScaleMAE, DOFA, and SpectralGPT, we developed a model architecture capable of accepting inputs with diverse spectral bands and resolutions, in different sizes.
To train such a model, it is necessary to create a training dataset that contains data from multiple platforms and is as varied as possible in terms of

- spectral band definitions
- spatial distribution
- temporal distribution
- ground sampling distance
To achieve this, we first compiled a list of possible input platforms. The list of candidate systems is long and will keep growing; to reduce complexity, we converged on a shorter list of platforms for the first round of model training. The criteria were availability in the cloud, existence of STAC catalogs, and cloud-optimized formats. This resulted in the following list of systems included in the training for Clay v1.
| Platform | Spatial Coverage | Spectral bands | GSD (meters) |
|---|---|---|---|
| Landsat 8 and 9 | Global | 6 optical bands | 30 |
| Sentinel 2 L2A | Global | 10 optical bands | 10 |
| Sentinel 1 RTC | Global | 2 radar bands | 10 |
| NAIP | USA | 4 optical bands | < 1 |
| LINZ | New Zealand | 3 optical bands | < 0.5 |
## Sampling strategy
Once the imagery sources are selected, the next step is to develop a sampling strategy. We are not able to process the entire archives, so it is important to select the right subset of each archive for training.
Our driving principle is that the model should learn natural features as well as human-made features. Human-made features are often smaller and less evenly distributed, which drove several of the sampling decisions described below.
### Global sampling
We created a single sampling strategy for all four global satellite systems included in the model training (Sentinel-1, Sentinel-2, and Landsat 8 and 9). To create a balanced training dataset, we sampled locations based on land cover classes from the ESA WorldCover layer.
Our unit of analysis for sampling was the MGRS tile, the global tiling scheme used for distributing Sentinel-2 imagery. For each MGRS tile, we computed land cover statistics for all classes in the WorldCover layer. To speed up processing, we used the third-level overview of the WorldCover layer, which has a spatial resolution of 80 meters.
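As a rough sketch of this step, the snippet below clips the WorldCover overview to each MGRS tile and computes per-class pixel fractions. The raster and grid file names, and the tile `name` column, are placeholder assumptions, not the exact pipeline we ran; the actual implementation lives in the stacchip processors.

```python
import geopandas as gpd
import numpy as np
import rasterio
from rasterio.mask import mask

rows = []
with rasterio.open("worldcover_80m.tif") as src:  # ~80 m overview (placeholder path)
    # One polygon per MGRS tile, reprojected to the raster CRS (placeholder path).
    mgrs = gpd.read_file("mgrs_grid.fgb").to_crs(src.crs)
    for _, tile in mgrs.iterrows():
        # Clip the overview to the tile and count pixels per class value.
        data, _ = mask(src, [tile.geometry], crop=True, nodata=0)
        values, counts = np.unique(data[data != 0], return_counts=True)
        fractions = {int(val): cnt / counts.sum() for val, cnt in zip(values, counts)}
        rows.append({"tile": tile["name"], **fractions})
```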
The goal of the land cover sampling was to ensure reasonable coverage of each class. For each class, we selected a number of random MGRS tiles out of the subset of MGRS tiles with the highest fraction of that class present.
As an example, for “Wetlands” we selected 50 random tiles out of the MGRS tiles with the highest wetland fraction globally. For the “Built-up” class, on the other hand, we selected the 400 most urban MGRS tiles.
In addition to the land cover classes, we also added diversity by selecting 500 tiles out of the 3000 tiles with the highest number of land cover classes present in the tile.
After selecting MGRS tiles for each of these criteria, we removed duplicates.
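A minimal sketch of this selection logic, assuming a `stats` DataFrame with one fraction column per class; the column names, toy data, and fixed random seed are illustrative only:

```python
import pandas as pd

# Per-tile class fractions, e.g. built from the sketch above (toy data here).
stats = pd.DataFrame({
    "tile": ["32UNA", "32UNB", "32UNC"],
    "built_up": [0.10, 0.70, 0.02],
    "cropland": [0.55, 0.10, 0.20],
    "herbaceous_wetland": [0.00, 0.05, 0.60],
})

# The "diversity" criterion ranks tiles by the number of classes present.
class_columns = ["built_up", "cropland", "herbaceous_wetland"]
stats["diversity"] = stats[class_columns].gt(0).sum(axis=1)

criteria = [
    # (ranking column, tiles to sample, size of the highest-ranked pool),
    # a subset of the full table below.
    ("diversity", 400, 2000),
    ("built_up", 300, 300),
    ("herbaceous_wetland", 50, 500),
]

selected = []
for column, n_tiles, pool_size in criteria:
    pool = stats.nlargest(pool_size, column)
    selected.append(pool.sample(n=min(n_tiles, len(pool)), random_state=42))

# Tiles picked by more than one criterion are kept only once.
sample = pd.concat(selected).drop_duplicates(subset="tile")
```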
The following table summarizes the selection criteria for each class.
| Class | Nr of tiles selected | Selected from highest-ranked |
|---|---|---|
| Diversity | 400 | 2000 |
| Built-up | 300 | 300 |
| Built-up | 1000 | 1500 |
| Herbaceous wetland | 50 | 500 |
| Mangroves | 50 | 500 |
| Moss and lichen | 50 | 500 |
| Cropland | 800 | 3600 |
| Tree cover | 150 | 750 |
| Shrubland | 100 | 500 |
| Grassland | 200 | 500 |
| Bare / sparse vegetation | 50 | 500 |
| Snow and Ice | 25 | 500 |
| Permanent water bodies | 50 | 1000 |
This resulted in a total sample of 2728 MGRS tiles. The resulting sample file can be downloaded from the following link:
https://clay-mgrs-samples.s3.amazonaws.com/mgrs_sample_v02.fgb
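The file is in FlatGeobuf format and can be opened directly with GeoPandas, for example:

```python
import geopandas as gpd

# GeoPandas reads FlatGeobuf, including directly from a URL.
mgrs_sample = gpd.read_file(
    "https://clay-mgrs-samples.s3.amazonaws.com/mgrs_sample_v02.fgb"
)
print(len(mgrs_sample))  # 2728 tiles
```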
We used these locations for all of the global platforms. For more detail on exactly how we implemented the sample selection, review the corresponding stacchip processors.
### Landsat 8 and 9 sampling strategy
To further increase variety in the dataset, we used both the L1 and L2 products for training. For each location and each product level, we selected one random year between 2018 and 2023 and used the least cloudy scene in each quarter of the selected year.
### Sentinel-2 sampling strategy
For each location we selected two random years between 2018 and 2023, and for each year we used the least cloudy scene in each quarter.
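The quarterly least-cloudy selection follows the same pattern for Landsat and Sentinel-2. Below is a sketch of the idea using pystac-client; the Earth Search endpoint, collection name, and sample location are assumptions, and the sort step relies on the STAC API supporting the sort extension:

```python
import random

from pystac_client import Client
from shapely.geometry import Point

catalog = Client.open("https://earth-search.aws.element84.com/v1")  # assumed endpoint
location = Point(-105.0, 40.0)  # e.g. the centroid of one sampled MGRS tile
year = random.randint(2018, 2023)

quarters = [
    ("01-01", "03-31"),
    ("04-01", "06-30"),
    ("07-01", "09-30"),
    ("10-01", "12-31"),
]
scenes = []
for start, end in quarters:
    search = catalog.search(
        collections=["sentinel-2-l2a"],
        intersects=location,
        datetime=f"{year}-{start}/{year}-{end}",
        sortby=[{"field": "properties.eo:cloud_cover", "direction": "asc"}],
        max_items=1,  # keep only the least cloudy scene of the quarter
    )
    scenes.extend(search.items())
```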
### NAIP sampling strategy
The sampling strategy for NAIP was based on Natural Earth data. The sample includes all populated places, protected areas and parks, airports, and ports. In addition, we sampled one random point along each river and one random location within each lake registered in Natural Earth. Finally, we sampled 4000 random points. All data was filtered to be within the CONUS region.
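A sketch of this point sampling with GeoPandas and Shapely, assuming local copies of the Natural Earth rivers and lakes layers (file names are placeholders) and a simple bounding box standing in for the CONUS filter:

```python
import random

import geopandas as gpd
from shapely.geometry import Point

# Natural Earth layers, downloadable from naturalearthdata.com (assumed paths).
rivers = gpd.read_file("ne_10m_rivers_lake_centerlines.shp")
lakes = gpd.read_file("ne_10m_lakes.shp")

# One random point along each river line.
river_points = [
    geom.interpolate(random.random(), normalized=True) for geom in rivers.geometry
]

# One point guaranteed to fall inside each lake polygon.
lake_points = [geom.representative_point() for geom in lakes.geometry]

# 4000 uniform random points over a rough CONUS bounding box.
west, south, east, north = -125.0, 24.5, -66.9, 49.4
random_points = [
    Point(random.uniform(west, east), random.uniform(south, north))
    for _ in range(4000)
]
```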
### LINZ sampling strategy
For LINZ we used simple random subsampling, because there is no STAC API available for spatial search. We selected a random subset of all scenes for each of the sub-collections that are available for LINZ.
More specifically, we randomly selected 50% of the scenes, with a minimum of 10 and a maximum of 2000 scenes for each catalog that was included, and used the latest imagery for each of the available regions of New Zealand. The list of catalogs is in the LINZ processor file.
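A minimal sketch of that subsampling rule; the function name and the plain scene-ID list are illustrative:

```python
import random

def subsample_catalog(scene_ids: list) -> list:
    """Randomly keep 50% of a catalog's scenes, clipped to [10, 2000]."""
    size = min(max(len(scene_ids) // 2, 10), 2000)
    size = min(size, len(scene_ids))  # small catalogs keep everything they have
    return random.sample(scene_ids, size)
```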
## Data preparation
To be able to include multiple platforms in model training, we standardised the processing pipeline. The goal was to develop a framework that can collect data from a large variety of formats and locations in a consistent way. For this we developed stacchip, a library that helps prepare training imagery. Please consult the library's documentation for details; at a high level, the goals of stacchip are to

- keep the data in its original format for as long as possible
- provide scalable, extendable indexing of chips
- provide indexing processors for different platforms
- provide a chipping utility that takes the index and dynamically creates images for training
- use geoparquet as a fast storage option that makes it easy to combine indexes from different platforms
- be usable both for training and for inference on the fly
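Because the chip indexes are geoparquet files, they can be combined with standard tooling. A minimal sketch, with placeholder file paths:

```python
import geopandas as gpd
import pandas as pd

# Read per-platform chip indexes and combine them into a single
# GeoDataFrame that can drive training.
indexes = [
    gpd.read_parquet("index/naip.parquet"),
    gpd.read_parquet("index/sentinel-2-l2a.parquet"),
    gpd.read_parquet("index/landsat-c2l2-sr.parquet"),
]
combined = pd.concat(indexes, ignore_index=True)
```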
## Dataset size
Using stacchip, we created a dataset of 33.8 TB of imagery, comprising about 70 million chips. The following table shows the distribution of imagery chips used for Clay v1 training.
| Source | Number of chips |
|---|---|
| NAIP | 20,984,171 |
| LINZ | 3,299,006 |
| Sentinel-2-l2a | 18,683,945 |
| Landsat-c2l1 | 5,827,333 |
| Landsat-c2l2-sr | 5,790,651 |
| Sentinel-1-rtc | 16,133,394 |
## Older versions
For older versions of the model we used the following sampling strategies.
### For model version v0.1
For v0.1 we used a smaller sample that was slightly less focused on human landscapes. The distribution of the MGRS tiles we used was as follows:
| Class | Nr of tiles selected | Selected from highest-ranked |
|---|---|---|
| Diversity | 500 | 3000 |
| Built-up | 400 | 400 |
| Herbaceous wetland | 50 | 500 |
| Mangroves | 50 | 500 |
| Moss and lichen | 50 | 500 |
| Cropland | 100 | 500 |
| Tree cover | 100 | 500 |
| Shrubland | 50 | 500 |
| Grassland | 50 | 500 |
| Bare / sparse vegetation | 50 | 500 |
| Snow and Ice | 50 | 500 |
| Permanent water bodies | 100 | 1000 |
This resulted in a total sample of 1517 MGRS tiles. The resulting sample file can be downloaded from the following link.