Train classification on Embeddings#

In this notebook we show how to find marinas and baseball fields in the San Francisco area based on about 36k Clay embeddings.

When this is executed as a script on a laptop, it should take only about 2.5 seconds to open the data, train a classifier, and make predictions.

# If any of these are missing, add them to the environment by uncommenting the following line
# ! pip install geopandas lonboard scikit-learn
from pathlib import Path

import geopandas as gpd
import numpy as np
import pandas as pd
import requests
from lonboard import Map, PolygonLayer
from lonboard.colormap import apply_categorical_cmap
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

Download data#

Download all the data from the following Hugging Face dataset into a local data folder.

files = [
    "embeddings_ca_m_3712213_ne_10_060_20220518.gpq",
    "embeddings_ca_m_3712213_nw_10_060_20220518.gpq",
    "embeddings_ca_m_3712213_se_10_060_20220518.gpq",
    "embeddings_ca_m_3712213_sw_10_060_20220518.gpq",
    "embeddings_ca_m_3712214_sw_10_060_20220518.gpq",
    "embeddings_ca_m_3712221_ne_10_060_20220518.gpq",
    "embeddings_ca_m_3712221_nw_10_060_20220518.gpq",
    "embeddings_ca_m_3712221_sw_10_060_20220518.gpq",
    "embeddings_ca_m_3712222_sw_10_060_20220518.gpq",
    "embeddings_ca_m_3712229_ne_10_060_20220518.gpq",
    "embeddings_ca_m_3712230_nw_10_060_20220518.gpq",
    "embeddings_ca_m_3712212_ne_10_060_20220519.gpq",
    "embeddings_ca_m_3712212_nw_10_060_20220519.gpq",
    "embeddings_ca_m_3712212_se_10_060_20220519.gpq",
    "embeddings_ca_m_3712228_ne_10_060_20220519.gpq",
    "embeddings_ca_m_3712221_se_10_060_20220518.gpq",
    "embeddings_ca_m_3712222_nw_10_060_20220518.gpq",
    "embeddings_ca_m_3712220_ne_10_060_20220519.gpq",
    "embeddings_ca_m_3712229_nw_10_060_20220518.gpq",
    "embeddings_ca_m_3712214_nw_10_060_20220518.gpq",
    "marinas.geojson",
    "baseball.geojson",
]

url_template = "https://huggingface.co/datasets/made-with-clay/classify-embeddings-sf-baseball-marinas/resolve/main/{filename}"

# Ensure the local data folder exists before downloading
dst_dir = Path("../../data/classify-embeddings-sf-baseball-marinas")
dst_dir.mkdir(parents=True, exist_ok=True)

for filename in files:
    dst = dst_dir / filename
    print(dst)
    if dst.exists():
        continue
    with requests.get(url_template.format(filename=filename)) as r:
        r.raise_for_status()
        with open(dst, "wb") as f:
            f.write(r.content)
../../data/classify-embeddings-sf-baseball-marinas/embeddings_ca_m_3712213_ne_10_060_20220518.gpq
../../data/classify-embeddings-sf-baseball-marinas/embeddings_ca_m_3712213_nw_10_060_20220518.gpq
../../data/classify-embeddings-sf-baseball-marinas/embeddings_ca_m_3712213_se_10_060_20220518.gpq
../../data/classify-embeddings-sf-baseball-marinas/embeddings_ca_m_3712213_sw_10_060_20220518.gpq
../../data/classify-embeddings-sf-baseball-marinas/embeddings_ca_m_3712214_sw_10_060_20220518.gpq
../../data/classify-embeddings-sf-baseball-marinas/embeddings_ca_m_3712221_ne_10_060_20220518.gpq
../../data/classify-embeddings-sf-baseball-marinas/embeddings_ca_m_3712221_nw_10_060_20220518.gpq
../../data/classify-embeddings-sf-baseball-marinas/embeddings_ca_m_3712221_sw_10_060_20220518.gpq
../../data/classify-embeddings-sf-baseball-marinas/embeddings_ca_m_3712222_sw_10_060_20220518.gpq
../../data/classify-embeddings-sf-baseball-marinas/embeddings_ca_m_3712229_ne_10_060_20220518.gpq
../../data/classify-embeddings-sf-baseball-marinas/embeddings_ca_m_3712230_nw_10_060_20220518.gpq
../../data/classify-embeddings-sf-baseball-marinas/embeddings_ca_m_3712212_ne_10_060_20220519.gpq
../../data/classify-embeddings-sf-baseball-marinas/embeddings_ca_m_3712212_nw_10_060_20220519.gpq
../../data/classify-embeddings-sf-baseball-marinas/embeddings_ca_m_3712212_se_10_060_20220519.gpq
../../data/classify-embeddings-sf-baseball-marinas/embeddings_ca_m_3712228_ne_10_060_20220519.gpq
../../data/classify-embeddings-sf-baseball-marinas/embeddings_ca_m_3712221_se_10_060_20220518.gpq
../../data/classify-embeddings-sf-baseball-marinas/embeddings_ca_m_3712222_nw_10_060_20220518.gpq
../../data/classify-embeddings-sf-baseball-marinas/embeddings_ca_m_3712220_ne_10_060_20220519.gpq
../../data/classify-embeddings-sf-baseball-marinas/embeddings_ca_m_3712229_nw_10_060_20220518.gpq
../../data/classify-embeddings-sf-baseball-marinas/embeddings_ca_m_3712214_nw_10_060_20220518.gpq
../../data/classify-embeddings-sf-baseball-marinas/marinas.geojson
../../data/classify-embeddings-sf-baseball-marinas/baseball.geojson

Load Clay embeddings#

We are going to use embeddings generated from 256x256 pixel chips created from NAIP imagery. The resolution of the imagery is 0.6m, so the chips represent squares of about 154x154 meters.

In this example, the embeddings are stored in GeoParquet format, one file per NAIP scene, with about 1800 embeddings per scene. Each row contains the NAIP scene ID, the embedding vector, and the bounding box geometry of the corresponding chip.

Below we open the separate files and combine them into a single GeoPandas dataframe.

# Open embeddings DB
embeddings = []
for src in Path("../../data/classify-embeddings-sf-baseball-marinas/").glob("*.gpq"):
    gdf = gpd.read_parquet(src)
    embeddings.append(gdf)
embeddings = pd.concat(embeddings)
embeddings
item_id embeddings geometry
0 ca_m_3712212_nw_10_060_20220519 [-0.046016138, -0.04745782, -0.117088005, 0.05... POLYGON ((-122.62556 37.87702, -122.62557 37.8...
1 ca_m_3712212_nw_10_060_20220519 [-0.020038588, -0.0016635053, 0.15959166, -0.1... POLYGON ((-122.62382 37.87701, -122.62382 37.8...
2 ca_m_3712212_nw_10_060_20220519 [0.0027899437, 0.042846236, 0.17805487, -0.166... POLYGON ((-122.62207 37.87701, -122.62208 37.8...
3 ca_m_3712212_nw_10_060_20220519 [-0.023520006, 0.020058865, 0.2087154, -0.1701... POLYGON ((-122.62032 37.87700, -122.62033 37.8...
4 ca_m_3712212_nw_10_060_20220519 [-0.0062850136, 0.051816266, 0.2323465, -0.230... POLYGON ((-122.61858 37.87700, -122.61858 37.8...
... ... ... ...
1781 ca_m_3712228_ne_10_060_20220519 [0.20395686, -0.051615402, 0.0030432416, -0.22... POLYGON ((-122.50602 37.56316, -122.50603 37.5...
1782 ca_m_3712228_ne_10_060_20220519 [0.071957245, -0.07750198, -0.019656746, -0.16... POLYGON ((-122.50428 37.56315, -122.50429 37.5...
1783 ca_m_3712228_ne_10_060_20220519 [0.19023652, -0.038485188, -0.00046398104, -0.... POLYGON ((-122.50254 37.56314, -122.50255 37.5...
1784 ca_m_3712228_ne_10_060_20220519 [0.23228532, 0.008541599, -0.024506139, -0.051... POLYGON ((-122.50080 37.56313, -122.50081 37.5...
1785 ca_m_3712228_ne_10_060_20220519 [0.31645948, -0.039977077, -0.0001847957, -0.0... POLYGON ((-122.49906 37.56313, -122.49907 37.5...

36024 rows × 3 columns
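
As a quick sanity check, we can look at a single embedding vector and confirm that all rows carry vectors of the same length. This is a minimal sketch; the exact dimensionality depends on the Clay model version that produced the files.

# Sanity check on the combined dataframe (assumes the cell above ran)
first_vector = embeddings["embeddings"].iloc[0]
print(f"Number of chips: {len(embeddings)}")
print(f"Embedding dimensionality: {len(first_vector)}")

# All embedding vectors should have the same length
assert embeddings["embeddings"].apply(len).nunique() == 1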

Visualize embedding coverage#

We use lonboard to visualize the data used and produced in this exercise. The following map shows the bounding boxes of all embeddings. Note that neighboring NAIP scenes overlap slightly, so chips along the scene edges are covered twice in the embeddings dataframe.

layer = PolygonLayer.from_geopandas(
    embeddings,
    get_fill_color=[255, 0, 200, 80],
    get_line_color=[130, 65, 100, 80],
    get_line_width=10,
    line_width_max_pixels=3,
)
m = Map(layer)
m
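
As noted above, neighboring scenes overlap, so some chips appear twice in the dataframe. The duplicates are harmless for this example, but if you prefer each location to be represented only once and the chip grids of neighboring scenes align exactly, one optional approach is to drop rows with identical geometries. This sketch is not part of the original workflow.

# Optional: remove chips duplicated by scene overlap.
# Comparing geometries via WKB only removes duplicates when the chip grids align exactly.
deduplicated = embeddings[~embeddings.geometry.to_wkb().duplicated().values]
print(f"Removed {len(embeddings) - len(deduplicated)} duplicate chips")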

Training data#

For this area, we manually created two small datasets that can be used for the example classification.

One dataset marks locations of baseball fields, and the other marks locations of marinas.

To use the training data, we open the point dataset and make a spatial join with the embeddings. This results in a dataframe containing the embeddings and bounding box geometries for all training locations.

The join adds the “class” column, which holds the class label (1 for the target class, 0 for all other locations).

Choose your example#

In the following cell, choose which set of training points to use. The input should be a point dataset with a class column, containing 1 for positive examples and 0 for negative examples.

Use your own dataset or use one of the two provided ones.
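
If you want to bring your own labels, all that is needed is a point dataset with a class column. The sketch below shows how such a file could be created; the coordinates and the output file name are purely illustrative placeholders.

# Hypothetical example of building a custom training dataset.
# Replace the placeholder coordinates with your own labelled locations.
from shapely.geometry import Point

my_points = gpd.GeoDataFrame(
    {
        "class": [1, 0],  # 1 = positive example, 0 = negative example
        "geometry": [Point(-122.4, 37.8), Point(-122.5, 37.7)],
    },
    crs="EPSG:4326",
)
my_points.to_file("my_training_points.geojson", driver="GeoJSON")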

# Open marinas training data
points = gpd.read_file(
    "../../data/classify-embeddings-sf-baseball-marinas/marinas.geojson"
)

# Uncomment this to use the baseball training dataset.
# points = gpd.read_file(
#     "../../data/classify-embeddings-sf-baseball-marinas/baseball.geojson"
# )

# Spatial join of training data with embeddings
merged = embeddings.sjoin(points)
print(f"Found {len(merged)} embeddings to train on")
print(f"{sum(merged['class'])} marked locations")
print(f"{len(merged) - sum(merged['class'])} negative examples")

merged
Found 216 embeddings to train on
29 marked locations
187 negative examples
item_id embeddings geometry index_right class
36 ca_m_3712229_nw_10_060_20220518 [0.13175833, -0.09973948, -0.040465936, 0.0651... POLYGON ((-122.43789 37.62681, -122.43790 37.6... 1 0
37 ca_m_3712229_nw_10_060_20220518 [0.082306586, -0.10373349, 0.020833228, 0.1467... POLYGON ((-122.43615 37.62680, -122.43616 37.6... 3 0
870 ca_m_3712212_ne_10_060_20220519 [0.15164891, -0.08649829, 0.053330045, -0.1168... POLYGON ((-122.50382 37.84638, -122.50383 37.8... 163 0
416 ca_m_3712213_se_10_060_20220518 [-0.09028356, -7.4659765e-05, -0.0054179365, 0... POLYGON ((-122.37530 37.80046, -122.37532 37.7... 151 0
791 ca_m_3712213_se_10_060_20220518 [-0.11083804, 0.08467115, -0.15396121, -0.0175... POLYGON ((-122.38414 37.78667, -122.38415 37.7... 149 0
... ... ... ... ... ...
64 ca_m_3712229_ne_10_060_20220518 [0.054444104, 0.076543145, -0.02271817, -0.036... POLYGON ((-122.39275 37.62552, -122.39276 37.6... 84 0
65 ca_m_3712229_ne_10_060_20220518 [-0.06287075, 0.04157688, -0.12616414, 0.04595... POLYGON ((-122.39101 37.62551, -122.39102 37.6... 90 0
66 ca_m_3712229_ne_10_060_20220518 [-0.0744138, 0.032512434, -0.025697831, 0.1158... POLYGON ((-122.38927 37.62550, -122.38928 37.6... 96 0
67 ca_m_3712229_ne_10_060_20220518 [0.050141867, 0.03144994, -0.089793496, 0.0052... POLYGON ((-122.38753 37.62549, -122.38754 37.6... 100 0
68 ca_m_3712229_ne_10_060_20220518 [0.0016791783, -0.009175467, -0.02405702, 0.06... POLYGON ((-122.38579 37.62549, -122.38580 37.6... 106 0

216 rows × 5 columns

Plot the training data#

Green squares show positive examples, while blue ones mark locations without the target class.

training_layer = PolygonLayer.from_geopandas(
    merged,
    get_fill_color=apply_categorical_cmap(
        merged["class"], {0: [0, 150, 255, 100], 1: [0, 255, 150, 150]}
    ),
    get_line_color=[0, 100, 100, 0],
)
m = Map(training_layer)
m

Train a classifier#

We extract the embeddings as X and the class labels as y from the merged table.

Then we split the data into train and test groups, and fit a Random Forest classifier.

Some basic accuracy statistics are shown.

# Extract X and y and split into test/train set
X = np.array([dat for dat in merged["embeddings"].values])
y = merged["class"].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit Random Forest classifier
model = RandomForestClassifier()
model = model.fit(X_train, y_train)

# Make test prediction and evaluate
pred = model.predict(X_test)
print(f"Accuracy is {accuracy_score(y_test, pred)}")
print(f"Precision is {precision_score(y_test, pred)}")
print(f"Recall is {recall_score(y_test, pred)}")
Accuracy is 0.9076923076923077
Precision is 1.0
Recall is 0.4
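
The low recall is not surprising given the small and imbalanced training set (29 positives versus 187 negatives). One optional tweak, not part of the original notebook, is to stratify the split and let the Random Forest weight the classes, as sketched below. Whether this improves recall depends on the labels, and with so few positives the scores will vary between runs.

# Optional variation: stratified split and balanced class weights
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
balanced_model = RandomForestClassifier(class_weight="balanced", random_state=42)
balanced_model.fit(X_train, y_train)

pred = balanced_model.predict(X_test)
print(f"Accuracy is {accuracy_score(y_test, pred)}")
print(f"Precision is {precision_score(y_test, pred)}")
print(f"Recall is {recall_score(y_test, pred)}")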

Detect target locations in the embeddings dataset#

The last step is to make predictions with the newly trained classifier. We apply the Random Forest model to the entire dataset of about 36k embeddings. On a laptop CPU the prediction takes only a fraction of a second; no GPU is needed.

%%time
# Make inference on entire embedding dataset
X = np.array([x for x in embeddings["embeddings"]])
predicted = model.predict(X)
print(f"Found {np.sum(predicted)} locations")

# Keep only the chips that were classified as the target class
result = embeddings[predicted.astype("bool")]
result = result[["item_id", "geometry"]]
CPU times: user 180 ms, sys: 526 µs, total: 181 ms
Wall time: 180 ms
Found 58 locations
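
The cell above keeps only the chips classified as the target class, but does not write them to disk. To inspect the detections in a GIS, one simple option is to export them to GeoJSON; the file name below is just an example.

# Export the detected locations, e.g. for inspection in QGIS (illustrative file name)
result.to_file("detected_locations.geojson", driver="GeoJSON")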

Plot detected locations#

predicted_layer = PolygonLayer.from_geopandas(
    result,
    filled=False,
    get_line_color=[255, 0, 0, 100],
    get_line_width=50,
    line_width_max_pixels=5,
)
m = Map([training_layer, predicted_layer])
m
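
Depending on your lonboard version, the map can also be written out as a standalone HTML file to share the results; the file name below is an assumption.

# Optionally save the map as a standalone HTML file (requires Map.to_html support)
m.to_html("detected_locations_map.html")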