Overhead Geopose Challenge - Benchmark

by Katie Wetstone

Overhead Geopose Challenge - Benchmark

Overhead satellite imagery provides critical time-sensitive information for use areas like disaster response, navigation, and security. Most current methods for using aerial imagery assume images are taken from directly overhead, or “near-nadir”. However, the first images available are often taken from an angle, or are “oblique”. Effects from these camera orientations complicate useful tasks like change detection, vision-aided navigation, and map alignment.

In the Overhead Geopose Challenge, your goal is to make satellite imagery taken from an angle more useful for time-sensitive applications like disaster and emergency response.

The features for this challenge are a series of RGB satellite images from four cities - Jacksonville, FL; Omaha, NE; Atlanta, GA; and San Fernando, ARG. Images from Jacksonville, Omaha, and San Fernando are cropped from publicly available WorldView (WV) 2 and WV-3 satellite data, provided courtesy of DigitalGlobe. Images from Atlanta were derived from the public SpaceNet dataset (see disclaimer). In this notebook, we go through the process of fitting a model that transforms each of these RGB images to its "geocentric pose". Geocentric pose is an object’s height above the ground and its orientation with respect to gravity. Calculating geocentric pose helps with detecting and classifying objects and determining object boundaries.

RGB to geocentric pose
From left to right: (i) An RGB satellite image taken from an angle (oblique) rather than overhead (nadir). (ii) RGB image transformed into geocentric pose representation. Object height is shown in grayscale, and vectors for orientation to gravity are shown in red. (iii) Rectified height of each pixel in meters based on geocentric pose. Adapted from Christie et al. "Learning Geocentric Object Pose in Oblique Monocular Images." 2020.

In this notebook, we'll cover:

Let's get posing!

Disclaimer: Commercial satellite imagery is provided courtesy of DigitalGlobe. Atlanta images were derived from the public SpaceNet Dataset by SpaceNet Partners, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Getting Set Up

First, we'll import the basic tools we'll need to load our data and do some exploration.

In [97]:
import json
from pathlib import Path
import regex as re
import shutil
import sys
import tarfile

from libtiff import TIFF
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas_path import path
from PIL import Image
from tqdm import tqdm

%matplotlib inline

If you have trouble installing packages, we recommend executing (in the command line):

conda install gdal cython opencv gdal tqdm scikit-image pytorch torchvision cudatoolkit -c pytorch
pip install segmentation-models-pytorch

And installing the remaining packages with pip install [package_name].

Now we can read in our data. You can find everything you need on the data download page.

In [98]:
DATA_PATH = Path("../data/processed/final/public")
sorted([ for f in DATA_PATH.glob("*")])
In [99]:
metadata = pd.read_csv(DATA_PATH / "metadata.csv", index_col="id")
geopose_train = pd.read_csv(DATA_PATH / "geopose_train.csv", index_col="id")
geopose_test = pd.read_csv(DATA_PATH / "geopose_test.csv", index_col="id")

What data do we have?

To get started, make sure to check out the data section of the problem description. The data set for this challenge includes satellite imagery from four cities: Jacksonville, Omaha, Atlanta, and San Fernando.

We're provided three datasets: metadata.csv, geopose_train.csv, geopose_test.csv

And four folders: train, test_rgbs, submission_format, train_nano


Let's look at the metadata.

In [100]:
print("Metadata shape: ", metadata.shape)
Metadata shape:  (6948, 2)
city gsd
RYepOS JAX 0.319
RxztOj OMA 0.358
KEVJOi OMA 0.341
gvIlOD OMA 0.351
PKDXEu ARG 0.316

There are 6,948 images in the data. Our metadata has the following columns:

  • id: a randomly generated unique ID to reference each record
  • city: abbreviation for the geographic location
  • gsd: ground sample distance (GSD) in meters per pixel. GSD is the average pixel size in meters.

Geopose data

We have geopose_test.csv and geopose_train.csv. Let's look at geopose_test.csv first.

In [101]:
print("geopose_test shape: ", geopose_test.shape)
geopose_test shape:  (1025, 1)
cQLdkA ARG_cQLdkA_RGB.j2k
TKaane ARG_TKaane_RGB.j2k
zfoVHd ARG_zfoVHd_RGB.j2k
WxwKQa ARG_WxwKQa_RGB.j2k

There are 1,025 test records. Our test Geopose data has two columns:

  • id: a randomly generated unique ID to reference each record
  • rgb: name of the RGB image file, which is found in the test_rgbs folder

Based on the filepaths, we can see that RGB images have suffix .j2k. This means they are JPEG 2000 files. JPEG 2000 files offer compression similar to JPEG files, but are higher quality.

Now let's look at geopose_train.csv

In [102]:
print("geopose_train shape: ", geopose_train.shape)
geopose_train shape:  (5923, 3)
agl json rgb
RxztOj OMA_RxztOj_AGL.tif OMA_RxztOj_VFLOW.json OMA_RxztOj_RGB.j2k
gvIlOD OMA_gvIlOD_AGL.tif OMA_gvIlOD_VFLOW.json OMA_gvIlOD_RGB.j2k

There are 5,923 training images. Our train geopose data has columns for:

  • id: a randomly generated unique ID to reference each record
  • agl: name of the above ground level (AGL) height image file
  • json: name of the JSON file with vector flow information (scale and angle)
  • rgb: name of the RGB image file

The agl, json, and rgb files listed are all in the train folder. agl and json are part of what we are going to try to predict. They are provided for the training data, but not the test data.

AGL images have suffix .tif, meaning they are TIF images. TIFs are a high-quality graphics format. Since each pixel in our AGL image is only one value (height, cm), a lower-quality storage option would limit us to fairly low precision.

TAR files

The rest of the data is provided in tar.gz archives, or TARs.

  • train: all training data files, including RGB images, AGL images, and JSONs with vector flow information
  • test_rgbs: RGB images for the test data
  • submission_format: An example submission that demonstrates correct submission format. All values are placeholder values
  • train_nano: A small subset of 100 training records (RGB image, AGL image, and JSON vector flow for each). The nano set is provided for participants to more easily experiment with the data processing and modeling pipelines before running them on the full dataset.

For each TAR, let's extract all of the files to a regular directory using the tarfile library. Note that extraction takes a few moments.

In [106]:
tar_gzs = sorted(list(DATA_PATH.glob("*.tar.gz")))
In [107]:
# set folder paths to extract .tar.gzs to
submission_format = DATA_PATH / "submission_format"
TEST_DIR = DATA_PATH / "test_rgbs"
train_nano_dir = DATA_PATH / "train_nano"
newdirs = [submission_format, TEST_DIR, TRAIN_DIR, train_nano_dir]
In [108]:
# extract files from tar.gzs - this step takes a few moments
for tar_gz, newdir in zip(tar_gzs, newdirs):
    print(f"Extracting {} to {}")
    if not (newdir.exists()):
        with as tf:
    print(f"\tExtracted {len(list(newdir.glob('*')))} files to {}")
Extracting submission_format.tar.gz to submission_format
	Extracted 2050 files to submission_format
Extracting test_rgbs.tar.gz to test_rgbs
	Extracted 1025 files to test_rgbs
Extracting train.tar.gz to train
	Extracted 17769 files to train
Extracting train_nano.tar.gz to train_nano
	Extracted 300 files to train_nano

Examine the training data

The train folder contains three types of files for each image:

  • RGB image: JPEG 2000 file
  • AGL height image: TIF file
  • Vector flow information (scale and angle): JSON file

To more easily look through vector flow information, let's pull all of those JSONs into one dataframe first.

In [57]:
json_paths = sorted(list(TRAIN_DIR.glob("*.json")))
json_dicts = [json.load("r")) for pth in json_paths]
ids = ["_([a-zA-Z]*)_", for pth in json_paths]

train_vflow = pd.DataFrame.from_records(json_dicts, index=ids).rename(
    columns={"scale": "vflow_scale", "angle": "vflow_angle"}
vflow_scale vflow_angle
ACffgQ 0.012952 4.641966
AHJXSt 0.011553 4.020991
AHQjtV 0.008819 3.464433
AJMDPW 0.009689 3.931430
AJPPYS 0.012875 5.325513

Now that we can see vector flow info more easily, let's look into a few records from the train folder. We can use the file names in geopose_train.csv.

For exploring images, we use the Image module from the Pillow package (version 8.2.0). Pillow is useful because it is easily able to work with TIF files and JPEG 2000 files, both of which we have in our data.

In [58]:
import PIL

# double check our version of Pillow
In [59]:
def see_training_record(row, train_dir=TRAIN_DIR):

    # load images
    rgb = / row["rgb"])
    agl = / row["agl"])

    # print a few elements from the image arrays
        f"\n-----Features-----\nRGB array: {np.array(rgb)[0]}, dtype={np.array(rgb).dtype}, shape={np.array(rgb).shape}"
        f"\n-----Labels-----\nAGL array: {np.array(agl)[:2]}, dtype={np.array(agl).dtype}, shape={np.array(agl).shape}"

    # show images
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
    ax1.set_title("RGB image")
    ax2.set_title("AGL image")
In [60]:
city gsd
wZHmuM JAX 0.391
agl json rgb
wZHmuM JAX_wZHmuM_AGL.tif JAX_wZHmuM_VFLOW.json JAX_wZHmuM_RGB.j2k
RGB array: [[ 85 113  95]
 [ 74 104  78]
 [ 68 100  68]
 [237 234 224]
 [241 237 226]
 [245 241 229]], dtype=uint8, shape=(2048, 2048, 3)

AGL array: [[ 5 14 16 ...  1  0  0]
 [ 4 11 12 ...  2  0  0]], dtype=uint16, shape=(2048, 2048)
vflow_scale vflow_angle
wZHmuM 0.014955 4.947428

The RGB image is what we are given for the test set, and provides the features for the model. What we are trying to predict - our labels - is the AGL image with heights per pixel, the vector flow scale, and the vector flow angle. Together, these give us geocentric pose.

That was a very straightforward example - let’s look at another record that’s less clear.

In [61]:
city gsd
XVXAtb OMA 0.324
agl json rgb
RGB array: [[158 154 153]
 [159 154 154]
 [161 157 157]
 [ 54  50  51]
 [ 47  52  63]
 [ 40  54  70]], dtype=uint8, shape=(2048, 2048, 3)

AGL array: [[65535 65535 65535 ...   267   330   393]
 [65535 65535 65535 ...   271   335   398]], dtype=uint16, shape=(2048, 2048)
vflow_scale vflow_angle
XVXAtb 0.007886 3.989719

Well that AGL image is odd! Checking the problem description again, we see that 65535 is used as a placeholder for nan values. The missing values in AGL arrays represent locations where the LiDAR that was used to assess true height did not get any data. You don't have to worry about height predictions for these pixels - pixels that are missing in the ground truth AGLs will be excluded from performance evaluation.

Let's get a more useful visual of that AGL. First, what's the max height that's actually recorded in the image?

In [62]:
im = / "OMA_XVXAtb_AGL.tif")
k = sorted(set(np.array(im).flatten()))
[2864, 2876, 2890, 2949, 65535]

It looks like the max height in the image is 2,949 cm (96 ft high). Now we can visualize again, and use the vmax argument:

In [63]:
plt.imshow(im, vmax=3000)
<matplotlib.image.AxesImage at 0x7feb20a8ae80>

Much better! We can see the image, and we can see where there are missing values.

What are the ranges for vector flow scale and angle?

To answer this question, let's first look at how many images there are for each city across both the train set and the test set. We know that there are four cities in the dataset. Based on the metric section of the problem description, our score will be calculated independently for each city, and then those four scores will be averaged to get our final performance.

In [64]:
# add a column to the metadata indicating whether it is in the test set
metadata["test"] = metadata.index.isin(geopose_test.index)
metadata["train"] = ~metadata["test"]
False    5923
True     1025
Name: test, dtype: int64
In [65]:
metadata.groupby("city")[["test", "train"]].sum()
test train
ARG 463 2325
ATL 264 704
JAX 120 1098
OMA 178 1796
  • ARG: San Fernando, Argentina
  • ATL: Atlanta, Georgia
  • JAX: Jacksonville, Florida
  • OMA: Omaha, Nebraska

San Fernando has much more data than the other three cities in both the train and test sets.

Let's check the range of vector flow scale (vflow_scale), and see whether it varies by city.

In [66]:
# merge the metadata with geopose_train to get vector flow info combined with city
train_vflow = train_vflow.merge(
    metadata[["city", "gsd"]], how="left", left_index=True, right_index=True
In [67]:
fig, axs = plt.subplots(1, 4, sharey=True, sharex=True, figsize=(13, 4))
for city, ax in zip(sorted(, axs):
        train_vflow[train_vflow["city"] == city]["vflow_scale"],
fig.suptitle("Vector flow scale histograms", fontsize=14)
In [68]:
count mean std min 25% 50% 75% max
vflow_scale 5923.0 0.009683 0.003286 0.002119 0.008013 0.010234 0.012454 0.015345

What did we learn?

  • vflow_scale ranges from 0.0021 to 0.0153, and most of the images have a scale between 0.008 and 0.0125.
  • This means that for most images, the average pixel size is between 0.008 pixels/cm and 0.0125 pixels/cm.
  • Atlanta appears to be an exception and has lower scale values than other cities.

Per the data labels section of the problem description, scale is the pixels/cm conversion factor between vector field magnitudes in the 2D plane of the image and object height in the real world:

$$ \frac{\textrm{vector flow magnitude}}{\textrm{scale}} = \textrm{pixels}\times \frac{\textrm{cm}}{\textrm{pixels}} = \textrm{real world height (cm)} $$

The geocentric pose vectors are all divided by numbers less than 0.0153 to get real-world height in cm. As expected, this means all of the images were taken from fairly far away and need significant magnification to match real-world height.

Now let's check vector flow angle (vflow_angle)

In [69]:
fig, axs = plt.subplots(1, 4, sharey=True, sharex=True, figsize=(13, 4))
for city, ax in zip(sorted(, axs):
        train_vflow[train_vflow["city"] == city]["vflow_angle"],
fig.suptitle("Vector flow angle histograms", fontsize=14)
In [70]:
count mean std min 25% 50% 75% max
vflow_angle 5923.0 3.437382 1.794936 0.004981 1.849203 3.743574 4.984773 6.282265

vflow_angle ranges from 0.0049 to 6.282. This matches our expectation - since angle is in radians, it can only have a value between 0 and 2$\pi$ = 6.283.

Vector flow angle example

Atlanta has a much smaller range of vector flow scale and angle than any of the other three cities. Atlanta also has fewer images than the other cities, so overall there is a smaller range of satellite paths and possible camera angle in the data. By contrast, vector flow angle for San Fernando covers the full range of possibilities, indicating that there are many satellite paths and possible camera angles in the data.

What is the height range?

What heights are captured in the images and how does it vary by city? To get height, we can open individual AGL images and examine the image array.

In [71]:
city_pixels = dict()
for city in sorted(
    pixel_values = []
    # take a random sample of each city's images for efficiency
    city_images = geopose_train.merge(
        metadata[metadata["city"] == city],
    city_sub = city_images.sample(n=250, replace=False, random_state=50)
    for idx, row in tqdm(city_sub.iterrows(), total=len(city_sub)):
        img = / row["agl"])
        # replace 65535 placeholder with np.nan
        imarray = np.array(img).astype("float32")
        np.putmask(imarray, imarray == 65535, np.nan)
        pixel_values = np.concatenate((pixel_values, imarray), axis=None)
    # filter out nan values
    len_withna = len(pixel_values)
    pixel_values = pixel_values[~np.isnan(pixel_values)]
    print(f"{city}: Dropped {len_withna - len(pixel_values)} pixels with NA values")
    city_pixels[city] = pixel_values
100%|██████████| 250/250 [07:51<00:00,  1.89s/it]
  1%|          | 2/250 [00:00<00:21, 11.53it/s]
ARG: Dropped 219853265 pixels with NA values
100%|██████████| 250/250 [07:49<00:00,  1.88s/it]
  1%|          | 2/250 [00:00<00:15, 15.94it/s]
ATL: Dropped 2466139 pixels with NA values
100%|██████████| 250/250 [07:43<00:00,  1.85s/it]
  1%|          | 2/250 [00:00<00:16, 15.00it/s]
JAX: Dropped 41106031 pixels with NA values
100%|██████████| 250/250 [07:49<00:00,  1.88s/it]
OMA: Dropped 16082855 pixels with NA values
In [77]:
# generate statistics for each city and compile into a dataframe
city_descs = pd.DataFrame(index=["count", "mean", "min", "max", "25%", "50%", "75%"])
for city in sorted(
    pixel_values = city_pixels[city]
    city_desc = [
    city_desc += list(np.percentile(pixel_values, [25, 50, 75]))
    city_descs[city] = city_desc
count mean min max 25% 50% 75%
ARG 8.287227e+08 384.0 0.0 10000.0 15.0 284.0 616.0
ATL 1.046110e+09 845.0 0.0 20000.0 9.0 377.0 1313.0
JAX 1.007470e+09 433.0 0.0 19993.0 0.0 16.0 675.0
OMA 1.032493e+09 214.0 0.0 19276.0 0.0 1.0 270.0
In [74]:
# plot histogram of pixel value by city
bins = list(range(0, 20040, 40))
city_hists = pd.DataFrame(index=bins[:-1])
for city in sorted(
    hist_vals, bin_edges = np.histogram(city_pixels[city], bins=bins)
    city_hists[city] = hist_vals

city_hists.plot(logy=True, figsize=(8, 4))
plt.xlabel("Pixel value (cm)")
plt.ylabel("Log10 of frequency")
plt.title("Pixel value histogram by city")

Based on this sample of 250 images from each city, low pixel height values are fairly common - this is particularly true for Jacksonville and Omaha. In both of those cities at least 25% of pixels have a height of 0 cm. In Omaha half of pixels have a height less than 1 cm (0.4 inches), and Jacksonville is not much taller. That's pretty short! The vast majority of what is captured in the satellite images is relatively close to ground level.

Atlanta is the tallest city, but the majority of pixels are still not large enough to represent tall buildings. 50% of the pixels in Atlanta have a height less than 377 cm (12 ft) and 75% have a height less than 1,313 cm (43 ft).

This means very tall objects are pretty rare in the data, with Atlanta being somewhat of an exception.

Enough playing around. It's model time!

About the benchmark model

The benchmark code is a neural network built using the PyTorch library, created by the Johns Hopkins University Applied Physics Laboratory (JHU/APL). They parameterize the model in a multi-task deep network with a ResNet-34 encoder and U-Net decoder. The model estimates geocentric pose for image $I$ with:

$$ g(I) = \{s, \theta, h\} $$$$ s = \textrm{scale factor (pixels/cm), }\theta = \textrm{angle of parallel projection (radians), }h = \textrm{objects heights (cm)} $$

The model uses the Adam optimizer with a learning rate of $1^{-4}$, which JHU/APL found to improve convergence over the default $1^{-3}$. We will train for 200 epochs.

Affine projection properties.
The benchmark model explicitly encodes the simple affine relationship between object heights and vector field magnitudes, $s=m/h$, with a custom least squares solver layer. Illustration is adapted from Christie et al., 2021.

The benchmark code relies on a number of different functions, specified in a handful of scripts. We will not go over every step of the model fitting process since it requires so many lines of code. Instead, we have shared the full benchmark code in a GitHub repository, and will review some key points here. Participants can clone the repo to access and run the full benchmark code. Our overview here is drawn from the thoughtful example write-up of the benchmark solution by JHU/APL.

Taking advantage of affine projection

The benchmark regression model is designed to exploit the known affine relationship between object heights and their corresponding vector field magnitudes, which map surface-level features to ground level in an image. An affine transformation is a method of mapping variables in an image (ie. pixel values) to new variables by applying a linear combination of translation, rotation, and scaling. Affine transformation preserves the property of parallelism.

Affine projection properties.
The geometric projection of a local sub-image extracted from a large satellite image is well-approximated with an affine camera, which preserves invariant properties of parallelism and ratio of lengths on parallel lines. For each image in our data set we provide object heights at the pixel level, angles at the image level, and scale factors at the image level. Together, these define the vector field that maps surface features to ground level. Illustration is adapted from Christie et al., 2021.

$\theta$ and $s$ are defined by the known relationship between vector field magnitudes $m$ (pixels) and objects heights $h$ (cm) from an affine camera, $s = m/h$.


The total loss $L$ that we'll minimize during training is a weighted sum of mean squared error (MSE) losses for all terms: $$ L=f_\theta L_\theta +f_s L_s + f_h L_h + f_m L_m $$

$\theta$ = vector flow angle (radians), $s$ = scale (pixels/cm), $h$ = real-world object height (cm), $m$ = vector flow magnitude (pixels)

We set weighting factors $f_\theta=10$, $f_s=10$, $f_h=1$, and $f_m=2$ to normalize value ranges: $$ L=10 L_\theta + 10L_s + L_h + 2L_m $$

Note that for the loss equation above, scale is in pixels/meter and height is in meters.

For height and magnitude, MSE is calculated as the mean of MSE values for each labeled image in a batch to both reduce sensitivity to unlabeled pixels and allow for training images without height labels to directly supervise height and magnitude. In the latter case, height and magnitude MSE losses are zero and only the scale loss is back-propagated through those layers.

Augmentation to address bias

The distributions for angle of parallel projection $\theta$, scale factor $s$ relating height and magnitude, and object height $h$ are all heavily biased. Angle and scale are biased by the limited viewing geometries from satellite orbits, and very tall objects are rare. We'll encourage generalization and address bias with image remap augmentations $w_\theta$, $w_s$, and $w_h$. The function $w_\theta$ generates a new angle based on the original angle $\theta$ and some probability of an image's angle being augmented. Similarly, $w_s$ is a function of scale and probability, and $w_h$ is function of height and probability.

Image augmentation for rotation ($w_\theta$) and scale ($w_s$) are commonly applied to regularize training in deep networks, and depend only on image coordinates. This height augmentation ($w_h$) synthetically increases building heights. $w_h$ more specifically addresses the model's ability to accurately predict height for very tall objects, even though they are rare in the training data set.

Affine projection properties.
Left to right: (a) RGB (b) Augmented RGB (c) Vector fields (d) Augmented vectors
The benchmark model remaps above-ground features from the original RGB image (a) and geocentric pose vector fields (c) to an augmented RGB and vector field (b and d) with 2.3x height. Illustration is adapted from Christie et al., 2021.

The following high-level code from utilities/ applies height and rotation augmentation:

def augment_vflow(image, mag, xdir, ydir, angle, scale, agl=None, rotate_prob=0.3, flip_prob=0.3, scale_prob=0.3, agl_prob=0.3):
    # increase heights
    if np.isnan(mag).any() or np.isnan(agl).any():
        agl_prob = 0
    if random.uniform(0, 1) < agl_prob:
        max_agl = np.nanmax(agl)
        max_building_agl = 200.0
        max_factor = 2.0
        max_scale_agl = min(max_factor, (max_building_agl / max_agl))
        scale_height = random.uniform(1.0, max(1.0, max_scale_agl))
        image, mag, agl = warp_agl(image, mag, angle, agl, scale_height, max_factor)

    # rotate
    if random.uniform(0,1) < rotate_prob:
        rotate_angle = random.randint(0,359)
        xdir,ydir = rotate_xydir(xdir, ydir, rotate_angle)
        image,mag,agl = rotate_image(image, mag, agl, rotate_angle)


warp_angle is also defined in utilities/

Note that shadows are not adjusted by this simple but effective augmentation. Relying too much on shadows in model training can be risky because they are only sometimes observed. Maybe YOU could be the one to figure a better way to take advantage of hints from the shadows!


We're going to use a deep neural network to predict geocentric pose. A couple of our key objectives are:

  • Take advantage of what we know about affine projection to define the relationship between objects heights and vector magnitudes, $\textrm{scale} = \frac{\textrm{object height}}{\textrm{vector magnitude}}$
  • Use image augmentation to address some of the bias in the data. In particular, by synthetically increasing building heights to correct for how rare tall objects are in the data.
  • Define loss - what we are going to minimize - as a weighted sum of MSE losses for vector flow angle ($L_\theta$), vector flow scale ($L_s$), object height ($L_h$), and vector flow magnitude ($L_m$)
  • Tools: PyTorch library, ResNet-34 encoder, and U-Net decoder

Executing the benchmark model

1. Clone the repo

In the command line, enter git clone

Let's take a look at the files in the benchmark repo:

├── utilities
│   ├──
│   ├──
│   ├──
│   ├──
│   ├── invert_flow.pyx
│   ├──
│   ├──
│   └──
  • Defines the command line function that runs training and testing. Once arguments are passed to the command line, calls functions in other scripts based on those arguments.

In utilities:

  • implements affine transformation and augmentation - $w_\theta$, $w_s$, and $w_h$ as discussed above
  • down-samples images from a directory and creates a directory at half resolution. This reduces memory requirements and enables model training to use larger batch sizes. When the benchmark was trained on half-resolution data, there was no reduction in prediction accuracy
  • gets the performance metrics (R-squared) for each city and returns the average across cities. This code is what will be used to judge participant submissions. This script can also be called to write rectified height images.
  • miscellaneous functions that are used across scripts - eg. saving images, loading images, and calculating Root Mean Squared Error (RMSE).
  • implements training and testing using PyTorch
  • defines the Unet model class used in, including encoders and key methods for training like forward and predict
  • invert_flow.pyx: code for inverting the flow written in python-type syntax, but able to be parsed by Cython.
  • compiles the invert_flow function

Part of the height augmentation is inverting the vector flow, which is done through invert_flow.pyx and The code for inverting flow in invert_flow.pyx loops over every row and column of each image. This would run VERY slowly in python, so is implemented using Cython to improve speed. Cython is a compiler that allows you to write code in python that will be compiled and executed in C, but is still able to communicate with the rest of the code written in python. Running code in C rather than python significantly increases computational efficiency.

2. Set up environment

If you did not run the conda install command from Getting set up, you should now run:

conda install gdal cython opencv gdal tqdm scikit-image pytorch torchvision cudatoolkit -c pytorch
pip install segmentation-models-pytorch

In addition to the dependencies listed in setup, here we will build the invert_flow Cython function by executing (in the command line):

cd utilities
python build_ext --inplace

3. Down-sample training data

The benchmark is trained with images down-sampled by a factor of two (i.e. at half resolution). It is tested with full-resolution images. We found that training on half-resolution images did not decrease prediction accuracy, but did save time.

Tip: You can execute shell commands in a Jupyter notebook by adding ! before the code, eg. !cd utilities

In [78]:
train_half_res = DATA_PATH / "train-half-resolution"
In [83]:
!python monocular-geocentric-pose/utilities/ \
--indir={str(TRAIN_DIR)} \
--outdir={str(train_half_res)} \
--unit="cm" \
100%|█████████████████████████████████████████| 100/100 [00:47<00:00,  2.11it/s]

4. Train

We will train for 200 epochs. Command line arguments are explained in

Be aware that training the model can take a long time, up to multiple days. With 200 epochs, it took our benchmark a little less than three days to run using a Tesla T4 GPU (about 20 minutes per epoch). Before training on the full data set with your final hyperparameters, we recommend testing your pipeline by training with a small number of epochs and/or on a small "nanoset" of the data, like the one provided on the data download page.

To make sure we don't lose any work if we are disconnected from our remote ssh session, we used tmux to launch our Jupyter notebook and run the model training. In our notebook, we will set up the training cell to print to a .txt file rather than in the notebook.

In [84]:
checkpoints = DATA_PATH / "benchmark_checkpoints"
In [85]:
# # print to txt file instead of console
orig_stdout = sys.stdout
f = open(DATA_PATH / "training.txt", "w")
sys.stdout = f

# the downsampling process converts AGLs from cm to m, so in training we specify units as m
# downsampling also saves out RGBs with the suffix .tif instead of .j2k
!python monocular-geocentric-pose/ \
--train \
--num-epochs=200 \
--checkpoint-dir={str(checkpoints)} \
--dataset-dir={str(train_half_res)} \
--batch-size=4 \
--gpus="0" \
--augmentation \
--num-workers=4 \
--save-best \
--train-sub-dir="" \
--rgb-suffix="tif" \

# # reset to original std out
sys.stdout = orig_stdout

To see the model's progress:

!tail -2 {str(DATA_PATH / "training.txt")}
Epoch: 199
train: 100%|█| 1185/1185 [19:24<00:00,  1.02it/s]

The benchmark code saves Pytorch model checkpoints throughout training in the directory specified by the --checkpoint-dir command line argument. Even though model training has the potential to get interrupted, you should always be able to pick back up where you left off without having to restart training from scratch.

The benchmark does not set aside part of the training data for validation. We recommend that you add a validation step.

Generating a Submission

Now that our model is trained, we are finally ready to perform inference and make a submission. You'll only want to perform inference on the test set once you determine your top performing model, to avoid inadvertently overfitting.

First, let's look at the submission format

Submissions have to match the submission format from the problem description. In the data provided, there is a submission_format folder that demonstrates the correct submission format with placeholder values. You can check your submission against the format folder to make sure it is in the correct format and has the right files.

In [87]:
submission_format_dir = DATA_PATH / "submission_format"
submission_files = sorted(
    [ for filepath in submission_format_dir.glob("*")]

For each image in the test_rgbs folder, the submission format contains two files:

  1. A 2048 x 2048 .tif file with height predictions. The name of the AGL file should be [city_abbreviation]_[image_id]_AGL.tif. AGLs should show height in centimeters and have data type uint16.
  2. A JSON file with vector flow information. The name of the JSON file should be [city_abbreviation]_[image_id]_VFLOW.json. Scale is in pixels/cm. Angle is in radians, starting at 0 from the negative y axis and increasing counterclockwise.

For more details and an explanation of vector flow scale and angle, see the section on data labels in the problem description.

Run inference on the test data

Since the model was trained with 2x down-sampled images, we set downsample=2. This down-samples the test images before making predictions, so that they better match the way the model was trained. Predictions are then up-sampled before writing output files.

The dataset specified consists of test set RGBs, downloaded from the competition data page.

In [91]:
TEST_DIR = DATA_PATH / "test_rgbs"
preds_dir = DATA_PATH / "submission"
In [92]:
!python monocular-geocentric-pose/ \
--test \
--model-path={str(checkpoints / "model_best.pth")} \
--predictions-dir={str(preds_dir)} \
--dataset-dir={str(TEST_DIR)} \
--batch-size=8 \
--gpus=0 \
--downsample=2 \
--test-sub-dir="" \
100%|█████████████████████████████████████████| 129/129 [05:50<00:00,  2.71s/it]
100%|███████████████████████████████████████| 1025/1025 [04:42<00:00,  3.63it/s]

submission now has everything we need to make a submission!

Check against the submission format

To make sure we have all of the right files, we can check that every file in the submission format appears in our predictions folder.

In [93]:
# get a list of file we should have and a list of files we do have
my_preds_files = [ for pth in preds_dir.iterdir()]
submission_format_files = [ for pth in submission_format_dir.iterdir()]

# the following set will be empty if both dirs have exactly the same filenames in them
assert not set(my_preds_files).symmetric_difference(submission_format_files)

Success, we have all of the right files! However, our predictions folder takes up a lot of space - roughly 8.6 GB.

Compress TIF files

Our predictions were generated as float values in meters. To make the size of participant submissions manageable, we'll convert pixel values to centimeters and store as data type uint16. Submissions are expected to be in centimeters and pixel heights in any other unit will not be scored correctly. Submission AGL images should also be saved using a lossless TIFF compression. In the benchmark, we compress each AGL TIFF by passing tiff_adobe_deflate as the compression argument to the function from the Pillow library.

In our command to run inference above, we set --convert-predictions-to-cm-and-compress=False in order to walk through these steps explicitly. Setting that argument to True accomplishes the same thing by calling this function.

In [94]:
def compress_submission_folder(
    Compress a folder of tifs. Returns output directory

        input_dir (pathlib.PosixPath): folder of raw TIF files
        output_dir (pathlib.PosixPath): folder to save compressed TIFs
        compression_type (str, optional): `compression` argument for in Pillow. Defaults to "tiff_adobe_deflate".
        folder_search (str, optional): string to filter files in the input directory to TIFs for compression.
            Defaults to "*_AGL*.tif*"
        replace (bool, optional): whether to overwrite existing files. Defaults to True
        add_jsons (bool, optional): whether to copy over JSON files from input_dir to output_dir
        conversion_factor (int, optional): conversion factor to multiple AGL heights by. Defaults to 
            100 for converting m to cm.
        dtype (str, optional): data type in which to save AGL heights. Defaults to "uint16".
    if not output_dir.exists():

    # AGLS - convert from m to cm and save with compression
    tifs = list(input_dir.glob(folder_search))
    for tif_path in tqdm(tifs, total=len(tifs)):
        path = output_dir /
        if replace or not path.exists():
            imarray = np.array(
            imarray = np.round(imarray * conversion_factor).astype(dtype)
            im = Image.fromarray(imarray)
  , "TIFF", compression=compression_type)

    # JSONS - convert from pixels/m to pixels/cm and save
    if add_jsons:
        for json_path in input_dir.glob("*.json"):
            if replace or not (output_dir /
                vflow = json.load('r'))
                vflow['scale'] = vflow['scale'] / conversion_factor
                new_json_path = output_dir /

    return output_dir
In [95]:
# compress our submission folder
submission_compressed_dir = DATA_PATH / "submission_compressed"
compress_submission_folder(preds_dir, submission_compressed_dir)
100%|██████████| 1025/1025 [04:56<00:00,  3.46it/s]

Our new, shiny, compressed submission folder only takes up roughly 1.7 GB! Much better.

Save into a TAR archive

The required submission format for the platform is a TAR file (.tar.gz) - other formats will be rejected and not scored. We use the tarfile library to create a TAR and then add all of our submission files.

In [96]:
submission_tar_path = DATA_PATH / "submission_compressed.tar.gz"
with, "w:gz") as tar:
    # iterate over files and add each to the TAR
    files = list(submission_compressed_dir.glob("*"))
    for file in tqdm(files, total=len(files)):
100%|██████████| 2050/2050 [00:34<00:00, 59.01it/s]

Your final submission .tar.gz should be around 1.6 GB. Submissions that are very large will be rejected by the platform.

You're ready to submit!

Head over to the competition submissions page and click "Make new submission". Upload submission_compressed.tar.gz.

Congrats! We got a final R-squared of 0.7996. Not bad, but definitely room for improvement. Some of the main instances when the benchmark does not perform well are buildings that:

  • Have an unusual appearance or shape
  • Are observed from very oblique viewpoints, which can make the building appear unusual or oddly shaped
  • Have no hints observed in the image from shadows

In particular, any taller buildings that fall into these categories are difficult to predict.

A couple of ideas to explore for improving the benchmark:

  • Architectural improvements to the workhorse U-Net decode with ResNet34 backbone
  • Ensemble methods
  • Additional creative augmentations to improve image generalization
  • Optimizations to improve training efficiency, since the images are so large
  • Innovative loss functions. For example, noisy height predictions can be inconsistent with input image gradients along roof boundaries, and we did not experiment with additional priors in the loss function. It might also be interesting to experiment with surface normal loss, as that has been shown to improve accuracy in depth prediction tasks.

This competition has a bonus model write-up track. The example writeup of the benchmark solution is a good place to look for more details on suggested improvements (question 7), as well as general model explanation.

Check out the Overhead Geopose Competition challenge homepage to get started. We can't wait to see what you build!

Approved for public release, 21-553.

In [ ]: