How to Estimate Surface-level NO2 using OMI Column NO2 Data

by Tammy Glazer

How to estimate surface-level NO2 using OMI column NO2 data

Welcome to the getting started notebook for the trace gas track of the NASA Airathon: Predict Air Quality challenge! The goal of this tutorial is to provide some helpful pointers and starter code to ingest, preprocess, explore, and work with one of the atmospheric data products derived from the Ozone Monitoring Instrument (OMI) onboard the Aura satellite. The specific scientific parameter covered by this tutorial is Tropospheric Column Nitrogen Dioxide (NO2), made available in Hierarchical Data Format (HDF). If OMI, NO2, or HDF are new concepts to you, then read on!

Before getting started, check out the Airathon competition homepage to learn more about the goal of this important challenge. In this tutorial, we'll:

Let's get started!

The Data

What is OMI?

Aura is a multi-national NASA scientific research satellite that orbits the Earth to study Earth's ozone layer, air quality, and climate. Launched in 2004, it passes from south to north over the equator in the afternoon and is a member of a constellation of satellites with different science objectives that fly in close proximity known as the "A-Train". Their orbits are sun-synchronous, meaning that they pass the equator at the same local solar time on every orbit.

A sun-synchronous orbit allows consistent scientific observations with the angle between the Sun and the Earth's surface remaining relatively constant. Credit: NASA Earth Observatory

OMI is the name of a remote sensing instrument administered by NASA aboard the Aura satellite. This down-looking instrument measures backscattered sunlight in the visible and ultraviolet wavelengths and captures most of the Earth's surface each day. It tracks pollutants that threaten human health and agricultural productivity, and has ability to distinguish between aerosol types, such as smoke, dust, and sulfates. OMI can also measure cloud pressure and coverage, which are used to derive ozone. Sophisticated retrieval algorithms are used to convert measured radiation to pollution concentrations.

This exercise will focus on the OMNO2d Version 3 gridded 0.25 degree x 0.25 degree Level 3 product, which is pre-approved for the competition. Level 3 products represent values that have been filtered, pre-processed, and aggregated across different dates or footprints. 0.25 degree x 0.25 degree represents the resolution.

What is Tropospheric Column NO2?

OMNO2d reports Total and Tropospheric Column NO2. Nitrogen dioxide (NO2) is a harmful pollutant produced during combustion. According to NASA, it has been designated as a criteria pollutant by the U.S. Environmental Protection Agency (EPA) because of its negative effects on human health and the environment. Vertical column density (VCD) refers to the number of molecules between a remote sensing instrument and the Earth's surface. For NO2, the amount high in the atmosphere (in the stratosphere) is often removed to derive the column in the lower part of the atmosphere (in the troposphere), known as Tropospheric Vertical Column Density (TVCD). VCD and TVCD are typically measured in molecules of NO2 per square centimeter, and may be divided by 1015 for readability. These column densities have been found to be helpful for estimating surface-level NO2 levels.

The troposhere is the first major layer in the Earth's atmosphere, extending up to 10 km above the surface of the Earth. Credit: NOAA

While the OMNO2d data product contains a variety of data fields, we will focus on ColumnAmountNO2Trop. You are encouraged to explore and leverage all of the data fields made available by this data product using the OMI Data User's Guide.

Data loading

What's in our metadata?

First, let's organize our local repository using Cookiecutter Datascience and download the grid_metadata.csv, no2_satellite_metadata.csv, submission_format.csv, and train_labels.csv files from the data download page to our data/raw folder.

$ tree data/
├── external
├── interim
├── processed
└── raw
    ├── grid_metadata.csv
    ├── no2_satellite_metadata.csv
    ├── submission_format.csv
    └── train_labels.csv

4 directories, 4 files

Next, let's import some handy tools for working with geospatial data and load the metadata to see what we have.

In [ ]:
from pathlib import Path
import random
from typing import Dict, List, Optional
import warnings

from cloudpathlib import S3Path
import geopandas as gpd
import pandas as pd
import rasterio

DATA_PATH = Path.cwd().parent / "data"
RAW = DATA_PATH / "raw"
INTERIM = DATA_PATH / "interim"

warnings.simplefilter(action='ignore', category=UserWarning)
In [2]:
no2_md = pd.read_csv(
    RAW / "no2_satellite_metadata.csv",
    parse_dates=["time_start", "time_end"],

grid_md = pd.read_csv(
    RAW / "grid_metadata.csv",

Satellite metadata

Our trace gas track satellite metadata file contains information about OMI and TROPOMI data made available for model training and testing. Let's create and inspect a subset of the OMI training data.

In [3]:
Index(['time_start', 'time_end', 'product', 'location', 'split', 'us_url',
       'eu_url', 'as_url', 'cksum', 'granule_size'],
In [4]:
train    3156
test     1857
Name: split, dtype: int64
In [5]:
tropomi_hires    2750
tropomi          1193
omi              1070
Name: product, dtype: int64
In [6]:
omi_md = no2_md[(no2_md["product"] == "omi") & (no2_md["split"] == "train")].copy()
(671, 10)
In [7]:
time_start time_end product location split us_url eu_url as_url cksum granule_size
20190101T235959_omi_dl-la-tpe_0.he5 2019-01-01 00:00:00+00:00 2019-01-01 23:59:59+00:00 omi dl-la-tpe train s3://drivendata-competition-airathon-public-us... s3://drivendata-competition-airathon-public-eu... s3://drivendata-competition-airathon-public-as... 2423440162 11127998
20190102T235959_omi_dl-la-tpe_0.he5 2019-01-02 00:00:00+00:00 2019-01-02 23:59:59+00:00 omi dl-la-tpe train s3://drivendata-competition-airathon-public-us... s3://drivendata-competition-airathon-public-eu... s3://drivendata-competition-airathon-public-as... 3526775882 11169726
20190103T235959_omi_dl-la-tpe_0.he5 2019-01-03 00:00:00+00:00 2019-01-03 23:59:59+00:00 omi dl-la-tpe train s3://drivendata-competition-airathon-public-us... s3://drivendata-competition-airathon-public-eu... s3://drivendata-competition-airathon-public-as... 388814420 10937679

As OMI sweeps along an orbit, it collects data that is sectioned into pieces called granules. Each OMI granule contains Total Column NO2 and Total Tropospheric Column NO2 for all atmospheric conditions and for sky conditions where cloud fraction is less than 30 percent across each pixel. A single granule has a size of 720 x 1440 pixels. Each granule is identified by a unique granule_id with the following naming convention: {time_end}_{product}_{location}_{number}.hdf, where time_end is formatted as YYYYMMDDHHmmss. When locations and times have more than one associated file, this is denoted by number. For additional information about each of these fields, check out the NO2 data download instructions.

We can see that there are 671 OMI files to use for training. Each file covers the Los Angeles South Coast Air Basin, Delhi, and Taipei, denoted by dl-la-tpe.

Let's take a look at the time distribution in our training data.

In [8]:
2019    365
2020    306
Name: time_end, dtype: int64
In [9]:
1     62
2     57
3     62
4     60
5     62
6     60
7     62
8     62
9     60
10    62
11    31
12    31
Name: time_end, dtype: int64

We have a roughly balanced distribution of files across years (2019-2020) and months, with the fewest granules from the months of November and December.

Note that a single granule contains data that was captured within a 24 hour time window. For each location, OMI makes its observations mid-day.

Grid metadata

Next, let's take a look our grid cell metadata.

In [10]:
location tz wkt
1X116 Taipei Asia/Taipei POLYGON ((121.5257644471362 24.97766123020391,...
1Z2W7 Delhi Asia/Calcutta POLYGON ((77.30453178416276 28.54664454217707,...
3A3IE Los Angeles (SoCAB) Etc/GMT+8 POLYGON ((-117.8889090614936 34.13085940091023...
In [11]:
(68, 3)
In [12]:
Delhi                  33
Los Angeles (SoCAB)    28
Taipei                  7
Name: location, dtype: int64

Each grid cell has a 5-character alphanumeric grid_id that uniquely identifies a 5 x 5 km grid cell and is associated with an observation in our train or test set. This file also contains important location and timezone (tz) information that can be used to localize dates, along with the polygonal coordinates of the grid cell. There is metadata available for 68 grid cells.

How do I load a OMI HDF5 file?

Our OMI data is stored in a HDF5 file format. At its lowest level, Hierarchical Data Format (HDF) is a set of file formats (HDF4 and HDF5) designed to store and organize large amounts of scientific data. HDF5 is the newer version of the format and is actively supported by a non-profit corporation called The HDF Group. This file format supports different data models, including multidimensional arrays, raster images, and tables. A single HDF file can hold a mix of related objects that can be accessed as a group or as individual objects.

In Python, we can read in HDF5 files using one of three libraries:

  1. GDAL
  2. PyTables
  3. h5py

For this example, we'll be using h5py and code adapted from a script on the HDFEOS zoo.

The File module of the h5py package supports scientific datasets. Let's import this module along with some helpful tools to get started.

In [13]:
import os
import re

from h5py import File
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pyproj

mpl.rcParams['figure.dpi'] = 100

The OMI metadata file contains the locations of each OMI HDF5 file on s3. We can use the S3Path class from the cloudpathlib package to generate a path to a sample data file.

In [14]:
filepath = S3Path(omi_md.iloc[0].us_url)

To read this file, we can instantiate a File object using a string representation of our path. Note that calling fspath downloads the file locally to a temporary directory.

In [15]:
hdf = File(filepath.fspath, mode="r")

# List available datasets
for k, v in hdf.items():
    print(f"{k}: {v}")
HDFEOS: <HDF5 group "/HDFEOS" (2 members)>

Groups are the container mechanism by which HDF5 files are organized. They operate somewhat like dictionaries, where "keys" (Group objects) denote names of group members and "values" (Dataset objects) are the members themselves. The NO2 data we are interested in is contained in the HDFEOS group within the hdf object.

The File.visititems method allows us to recursively visit all items in a group. Let's use this method to explore the structure of our file.

In [16]:
def print_structure(name, obj):

HDFEOS/GRIDS/ColumnAmountNO2/Data Fields
HDFEOS/GRIDS/ColumnAmountNO2/Data Fields/ColumnAmountNO2
HDFEOS/GRIDS/ColumnAmountNO2/Data Fields/ColumnAmountNO2CloudScreened
HDFEOS/GRIDS/ColumnAmountNO2/Data Fields/ColumnAmountNO2Trop
HDFEOS/GRIDS/ColumnAmountNO2/Data Fields/ColumnAmountNO2TropCloudScreened
HDFEOS/GRIDS/ColumnAmountNO2/Data Fields/Weight

Our hdf object contains four main variables listed under the Data Fields subtree:

  • ColumnAmountNO2: total NO2 column density
  • ColumnAmountNO2CloudScreened: total NO2 column density where cloud fraction is less than 30%
  • ColumnAmountNO2Trop: column density of NO2 in the tropospheric layer
  • ColumnAmountNO2TropCloudScreened: column density of NO2 in the tropospheric layer where cloud fraction is less than 30%

For more information about each of these variables, see this helpful Fact Sheet. For this tutorial, we will focus on the ColumnAmountNO2Trop variable. To access the ColumnAmountNO2Trop raw data, we can use [:] to create a copy of the dataset in memory.

In [17]:
trop_NO2 = hdf["HDFEOS"]["GRIDS"]["ColumnAmountNO2"]["Data Fields"]["ColumnAmountNO2Trop"]
raw_NO2 = trop_NO2[:]
array([[ 4.0071399e+13,  3.9828905e+13,  3.9601280e+13, ...,
         4.0904572e+13,  4.0607393e+13,  4.0330330e+13],
       [ 3.3805671e+13,  3.3609338e+13,  3.3471198e+13, ...,
         3.4300810e+13,  3.4166213e+13,  3.3991558e+13],
       [ 3.1301571e+13,  3.1778666e+13,  3.1665884e+13, ...,
         2.3723984e+13,  2.3852875e+13,  2.9897645e+13],
       [-1.2676506e+30, -1.2676506e+30, -1.2676506e+30, ...,
        -1.2676506e+30, -1.2676506e+30, -1.2676506e+30],
       [-1.2676506e+30, -1.2676506e+30, -1.2676506e+30, ...,
        -1.2676506e+30, -1.2676506e+30, -1.2676506e+30],
       [-1.2676506e+30, -1.2676506e+30, -1.2676506e+30, ...,
        -1.2676506e+30, -1.2676506e+30, -1.2676506e+30]], dtype=float32)
In [18]:
shape = raw_NO2.shape
(720, 1440)
In [19]:
# Plot raw value heatmap
for_plot = raw_NO2.copy() 

f, ax = plt.subplots(1, 1, figsize=(8, 3))
img = plt.imshow(for_plot, origin="lower", cmap="GnBu")
plt.suptitle("NO2 TVCD Raw Value Heatmap", fontsize=12);

According to our raw values, a single swath of our Tropospheric Column NO2 data contains 720 x 1440 pixels that represent averaged values across approximately 14 orbits in a single day. Given that our NO2 data covers the entire globe and our data has a resolution of 0.25 x 0.25 degrees, this shape is to be expected:

  • Given latitude bounds of [-90, 90], 180 / 0.25 = 720 pixels
  • Given longitude bounds of [-180, 180], 360 / 0.25 = 1440 pixels

Data processing

Now that we have our raw data, let's calibrate and map NO2 values to the competition grid cells.

How do I calibrate NO2 data?

To work with this data, we'll need to calibrate raw values by applying a set of accompanying attributes:

  1. remove missing data
  2. remove any remaining negative values
  3. remove the offset and multiply by the scale factor (if applicable)
  4. cast the data to a numpy masked array to handle missing entries

We can save these dataset attributes to a calibration dictionary using attrs on the trop_NO2 object.

In [20]:
# List available attributes
for k, v in trop_NO2.attrs.items():
    print(f"{k}: {v}")
Description: b'Field=ColumnAmountNO2Trop,StdField=ColumnAmountNO2TropStd,SolarZenithAngle=[0:85],VcdQualityFlags=~19,XTrackQualityFlags=~[1:254],TerrainReflectivity=[0:1000]'
MissingValue: [-1.2676506e+30]
Offset: [0.]
ScaleFactor: [1.]
Title: b'NO2 tropospheric column density'
Units: b'molec/cm2'
_FillValue: [-1.2676506e+30]
In [21]:
# Save attributes to a calibration dictionary
calibration_dict = {
    "scale_factor": trop_NO2.attrs["ScaleFactor"][0],
    "offset": trop_NO2.attrs["Offset"][0],
    "missing_data_value": trop_NO2.attrs["_FillValue"][0]
{'scale_factor': 1.0, 'offset': 0.0, 'missing_data_value': -1.2676506e+30}

Scale factors and offsets are simply used to optimize data storage. In this specific case, scaling by a factor of one and adjusting by an offset of zero won't result in any changes to the values.

Missing data denotes low quality readings. This can happen for a number of reasons, including cloud cover. We'll convert missing data to nans.

In [22]:
def calibrate_data(data: File, calibration_dict: Dict):
    """Given an OMI dataset and calibration parameters, return a masked
    array of calibrated data.
        data (File): dataset in the File format (e.g. column NO2).
        calibration_dict (Dict): dictionary containing, at a minimum,
            `scale_factor` (float), `offset` (float),
            `missing_data_value` (float).
        corrected_NO2 ( masked array of calibrated data
            with a fill value of nan.
    # Convert missing data to nan
    data[data == calibration_dict["missing_data_value"]] = np.nan
    # Remove negative values
    data[data < 0] = np.nan
    # Calibrate using offset and scale factor
    data = (
        (data - calibration_dict["offset"]) *
    ) / (10 ** 15) # Divide by 10^15 for readability
    # Convert to masked array
    data =, data)
    data.fill_value = np.nan

    return data
In [23]:
corrected_NO2 = calibrate_data(raw_NO2, calibration_dict)
  data=[[0.040071398948864, 0.039828905263104, 0.039601280385024, ...,
         0.04090457227264, 0.040607393251328, 0.040330330112],
        [0.0338056708096, 0.033609337536512, 0.033471198134272, ...,
         0.034300810493952, 0.03416621318144, 0.033991558168576],
        [0.031301570658304, 0.031778666446848, 0.03166588370944, ...,
         0.023723983765504, 0.023852874727424, 0.029897644834816],
        [--, --, --, ..., --, --, --],
        [--, --, --, ..., --, --, --],
        [--, --, --, ..., --, --, --]],
  mask=[[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],
In [24]:
pd.DataFrame(, columns=['NO2']).describe()
count 507458.000000
mean 0.662362
std 1.918917
min 0.000002
25% 0.092257
50% 0.254832
75% 0.540912
max 70.631948

How do I align NO2 data with coordinates?

We can now align each corrected NO2 pixel with grid coordinates. From there, we will be able to determine which NO2 readings overlap with our training grid cells.

We will follow four steps to map values in a 2-dimensional matrix to 3-dimensional "world" coordinates. This process is known as georeferencing:

  1. identify the original coordinate reference system (CRS) of the dataset
  2. define reference points on our grid
  3. perform any necessary reprojections of these points
  4. connect points to NO2 values

We can choose to use one of two georeferencing approaches: vector- or raster-based. You may choose to use one or the other based on computational performance, memory usage, and/or library preferences.

Vector data is composed of discrete geometric locations, or vertices, that define the shape of a spatial object (e.g. points, lines, or polygons). Each feature can carry multiple attributes about a location. Geospatial vector data can be used with the convenient geopandas.GeoDataFrame class.

Raster data is stored as a grid of values that are rendered on a map as pixels, where each pixel value represents an area on the Earth's surface. Raster data is useful for representing continuous surfaces at high levels of detail and for performing efficient geometric computations. Rasters are commonly saved as GeoTIFFs, which contain geographic metadata.

Vector-based approach

Step 1: A CRS defines the mathematical approach used to transform real locations on the Earth into a flattened map. The CRS of our data is the latitude-longitude projection called WGS 84, also known as EPSG:4326. WGS 84 is the most common CRS in the world and is a standard among GPS datasets.

Step 2: Since the OMI Level 3 NO2 product represents global data, the bounds for latitude are [-90, 90] and the bounds for longitude are [-180, 180]. Given the 0.25 x 0.25 degree resolution of the data, we can create equispaced arrays at a 0.25 interval maintaining the bounds of the latitude and longitude.

In [25]:
def define_coordinates(
    lat_bounds: List[float],
    lon_bounds: List[float],
    resolution: List[float]
    """Given latitude bounds, longitude bounds, and the data product 
    resolution, create a meshgrid of points between bounding coordinates.
        lat_bounds (List): latitude bounds as a list.
        lon_bounds (List): longitude bounds as a list.
        resolution (List): data resolution as a list.
        lon (np.array): x (longitude) coordinates.
        lat (np.array): y (latitude) coordinates.
    # Interpolate points between bounds
    # Add 0.125 buffer, source: (HDFEOS script)
    lon = np.arange(lon_bounds[0], lon_bounds[1], resolution[1]) + 0.125
    lat = np.arange(lat_bounds[0], lat_bounds[1], resolution[0]) + 0.125

    return lon, lat
In [26]:
lon, lat = define_coordinates([-90, 90], [-180, 180], [0.25, 0.25])

Step 3: To determine which NO2 readings overlap with our training grid cells, we'll want to work in a consistent CRS. Luckily, the CRS of the competition grid cells is also WGS 84, so we won't need to reproject either set of coordinates.

Step 4: Finally, we can connect each NO2 value with a specific latitude and longitude by flattening our data into a GeoDataFrame of non-missing values. Each row represents a pixel value at a specific time and location. An OMI granule can contain up to 720 x 1440 rows. Our helper function, convert_array_to_df, will allow us to repeat this process for any data. Notice that we've added an additional keyword argument, total_bounds. Later on, this will allow us to optimize our operations by filtering out irrelevant points. For now, let's take a look at the entire dataset.

In [27]:
from pyproj import CRS

wgs84_crs = CRS.from_epsg("4326")
In [28]:
def convert_array_to_df(
    lat: np.ndarray,
    lon: np.ndarray,
    granule_id: str,
    total_bounds: Optional[np.ndarray] = None
    """Align data values with latitude and longitude coordinates
    and return a GeoDataFrame.
        corrected_arr ( data values for each pixel.
        lat (np.ndarray): latitude for each pixel.
        lon (np.ndarray): longitude for each pixel.
        granule_id (str): granule name.
        total_bounds (np.ndarray, optional): If provided, will filter
            out points that fall outside of these bounds. Composed of
            xmin, ymin, xmax, ymax.
    lons, lats = np.meshgrid(lon, lat)
    values = {
        "lat": lats.ravel(),
        "lon": lons.ravel(),
        "granule_id": [granule_id] * lons.shape[0] * lons.shape[1]
    df = pd.DataFrame(values).dropna()
    if total_bounds is not None:
        x_min, y_min, x_max, y_max = total_bounds
        df = df[df.lon.between(x_min, x_max) &, y_max)]
    gdf = gpd.GeoDataFrame(df)
    gdf["geometry"] = gpd.points_from_xy(gdf.lon, = wgs84_crs
    return gdf[["granule_id", "geometry", "value"]].reset_index(drop=True)
In [29]:
gdf = convert_array_to_df(corrected_NO2, lat, lon, filepath.stem)
granule_id geometry value
0 20190101T235959_omi_dl-la-tpe_0 POINT (-179.87500 -89.87500) 0.040071
1 20190101T235959_omi_dl-la-tpe_0 POINT (-179.62500 -89.87500) 0.039829
2 20190101T235959_omi_dl-la-tpe_0 POINT (-179.37500 -89.87500) 0.039601
In [30]:
(507458, 3)

Notice that our dataframe contains over 500 thousand values. Let's plot the data using matplotlib.

In [31]:
def plot_gdf(gdf: gpd.GeoDataFrame):
    """Plot the Point objects contained in a GeoDataFrame.
        gdf (gpd.GeoDataFrame): GeoDataFrame with, at a minimum,
            columns for `orbit`, `geometry`, and `value`.
    Displays a matplotlib scatterplot.
    f, ax = plt.subplots(1, 1, figsize=(8, 3))
    img = ax.scatter(
    plt.suptitle("Global Tropospheric Column NO2", fontsize=12)
In [32]:

Note that white pixels represent missing data.

Raster-based approach

Steps 1-3: Again, the CRS of our data is WGS 84 or EPSG:4326. Since NO2 represents global data, the bounds for latitude are [-90, 90] and the bounds for longitude are [-180, 180]. Since the CRS of the competition grid cells is WGS 84, we won't need to reproject either set of coordinates.

Step 4: We can define a matrix called an affine transformation that will allow us to transform image coordinates to real-world latitude and longitude coordinates. An affine transformation is a named tuple with elements a, b, c, d, e, f, which correspond to the elements in the matrix equation below, in which a pixel's image coordinates are x, y and its world coordinates are x', y'.

The only parameters we'll need are the bounds and resolution of our data file. We'll use rasterio.transform.from_bounds helper function to compute the matrix product of a translation and a scaling, which is the underlying formula to calculate this transformation matrix.

In [33]:
def compute_affine_transform(
    x0: float,
    x1: float,
    y0: float,
    y1: float,
    height: int,
    width: int
    """Given bounding coordinates and resolution, compute
    the affine transformation.
        x0 (float): left coordinate.
        x1 (float): right coordinate.
        y0 (float): top coordinate.
        y1 (float): bottom coordinate.
        height (int): image height.
        width (int): image width.

        affine_transform (Affine): Affine transformation.
    affine_transform = rasterio.transform.from_bounds(
    return affine_transform
In [34]:
# Compute WGS 84 affine transform
transform_wgs84 = compute_affine_transform(-180, 180, 90, -90, 720, 1440)
Affine(0.25, 0.0, -180.0,
       0.0, -0.25, 90.0)

Putting this all together, we can write out a GeoTIFF file that holds our values, data type, missing data value, shape, and the coordinates of each point described by transform_wgs84.

In [35]:
# We can save our new tif to the "interim" folder
final_data = np.expand_dims(, 0)
with (INTERIM / as f:
    ) as dst:

Note: while it may be computationally advantageous to take the raster-based approach to data alignment, you'll want to experiment with different fill values, reprojections, and resampling methods if you decide to follow a raster-based approach. This tutorial is a good starting point.

Create and save feature data

We now have everything we need to iterate through our training files, calibrate and align NO2 values, mask the images to the extent of our grid cell bounds, and prepare our new training features.

In [36]:
# Identify training OMI granule filepaths
fp = list(omi_md.us_url)

# Load training labels
train_labels = pd.read_csv(RAW / "train_labels.csv", parse_dates=["datetime"])
train_labels.rename(columns={"value": "no2"}, inplace=True)

# Identify training grid cells
gc = grid_md[grid_md.index.isin(train_labels.grid_id)].copy()


Our training data contains up to 66 unique grid cells. Using geopandas, we can transform our grid cell metadata into a GeoDataFrame where each observation represents a Polygon geometry.

In [37]:
polys = gpd.GeoSeries.from_wkt(gc.wkt, crs=wgs84_crs) = "geometry"
polys_gdf = gpd.GeoDataFrame(polys)
polys_gdf["location"] = gc.location

Since the OMI data are sparse, it is expected that many values will not overlap our grid cells. We can use the spatial join sjoin_nearest function from geopandas to identify the nearest non-null NO2 value to each grid cell.

But first, let's do some optimization. Remember that our gdf contains nearly 500 thousand points. Because the granule covers a much larger swath of land than our geographies of interest, the vast majority of those points will be irrelevant. For each city, we can generate a bounding box using the total_bounds attribute of our grid cell GeoDataFrame. We can then filter out points outside of the bounding box with the cx GeoDataFrame method. You may have noticed that in our convert_array_to_df helper function, we added a parameter that allows us to remove points outside of location bounds. By removing these points, we can dramatically speed up spatial joins!

In [38]:
# Determine bounding boxes
locations = polys_gdf.location.unique()
bounds = [polys_gdf[polys_gdf.location == x].total_bounds for x in locations]
In [39]:
# Visualize grid cells and OMI points in LA
xmin, ymin, xmax, ymax = bounds[-1]
la_gdf =[xmin:xmax, ymin:ymax]
la_cells = polys_gdf[polys_gdf.location == locations[-1]]

f, ax = plt.subplots(1, 1, figsize=(8, 3))
la_gdf.plot(ax=ax, c="red", alpha=1, markersize=5)
plt.suptitle("Grid cells and OMI data in LA", fontsize=12);

We're now ready to use a subset of our previously generated gdf to identify points that are close to our training grid cells in a location, such as Los Angeles, and can compute the average Tropospheric Column NO2 per cell.

In [40]:
gpd.sjoin_nearest(la_cells, la_gdf, how="inner").groupby("grid_id")["value"].mean().head(3)
3A3IE    4.481723
3S31A    5.436216
7II4T    6.966879
Name: value, dtype: float64

It's time to pull everything together! As a reminder, we've already created the following helper functions:

  • calibrate_data
  • define_coordinates
  • convert_array_to_df

We'll add a helper function to create_calibration_dict.

In [41]:
from h5py import Dataset
In [42]:
def create_calibration_dict(data: Dataset):
    """Define calibration dictionary given a Dataset object,
    which contains:
        - scale factor
        - offset
        - missing data value
        data (Dataset): dataset in the Dataset format.
        calibration_dict (Dict): dict of calibration parameters.
    calibration_dict = {
        "scale_factor": data.attrs["ScaleFactor"][0],
        "offset": data.attrs["Offset"][0],
        "missing_data_value": data.attrs["_FillValue"][0]
    return calibration_dict

As an example, we'll write a GeoDataframe that contains NO2 values for all points in our grid cells for which we have available Tropospheric Column NO2 data. To speed things up, we can parallelize this process using pqdm. Because we have granules that cover a range of times and locations, we expect the number of images per grid cell to vary.

Keep in mind, this represents just one of many approaches to using the satellite data for estimating surface-level NO2.

For example, you may choose to save out rasters cropped to the extent of the competition grid cells, compute a set of zonal statistics, or may want to include values from other datasets like ColumnAmountNO2 as additional image layers.

In [43]:
from pqdm.processes import pqdm
In [44]:
def preprocess_omi_data(
    granule_path: str,
    grid_cell_gdf: gpd.GeoDataFrame,
    dataset_name: str,
    locations: List[str]
    Given a granule s3 path and competition grid cells, 
    create a GDF of each intersecting point and the accompanying
    dataset value (e.g. Tropospheric NO2).
        granule_path (str): a path to a granule on s3.
        grid_cell_gdf (gpd.GeoDataFrame): GeoDataFrame that contains,
            at a minimum, a `grid_id` and `geometry` column of Polygons.
        dataset_name (str): specific dataset name (e.g. "ColumnAmountNO2Trop").
        locations (List[str]): location names to filter points
            using bounding coordinates. Options include Taipei, Delhi,
            and Los Angeles (SoCAB).
        GeoDataFrame that contains Points and associated values.
    # Load NO2 data
    s3_path = S3Path(granule_path)
    hdf = File(s3_path.fspath, mode="r")
    NO2 = hdf["HDFEOS"]["GRIDS"]["ColumnAmountNO2"]["Data Fields"][dataset_name]
    raw_NO2 = NO2[:]
    shape = raw_NO2.shape

    # Calibrate data
    calibration_dict = create_calibration_dict(NO2)
    corrected_NO2 = calibrate_data(raw_NO2, calibration_dict)

    # Determine location bounds
    bounds = [grid_cell_gdf[grid_cell_gdf.location == x].total_bounds for x in locations]
    # Save values that align with granules
    lon, lat = define_coordinates([-90, 90], [-180, 180], [0.25, 0.25])
    dfs = []
    for i, bound in enumerate(bounds):
        granule_gdf = convert_array_to_df(corrected_NO2, lat, lon,, bound)
        location_gdf = grid_cell_gdf[grid_cell_gdf.location == locations[i]]
        city_df = gpd.sjoin_nearest(location_gdf, granule_gdf, how="inner")
    df = pd.concat(dfs)

    # Clean up files
    return df.drop(columns="index_right").reset_index()

def preprocess_trop_no2(
    granule_paths: List[str],
    grid_cell_gdf: gpd.GeoDataFrame,
    dataset_name: str,
    locations: List[str],
    n_jobs: int = 2
    Given a set of granule s3 paths and competition grid cells, 
    parallelizes creation of GDFs containing NO2 values and points.
        granule_paths (List[str]): list of paths on s3.
        grid_cell_gdf (gpd.GeoDataFrame): GeoDataFrame that contains,
            at a minimum, a `grid_id` and `geometry` column of Polygons.
        dataset_name (str): specific dataset name (e.g. "ColumnAmountNO2Trop").
        locations (List[str]): location names to filter points
            using bounding coordinates. Options include Taipei, Delhi,
            and Los Angeles (SoCAB).
        n_jobs (int): The number of parallel processes. Defaults to 2.
        GeoDataFrame that contains Points and associated values for all granules.
    args = ((gp, grid_cell_gdf, dataset_name, locations) for gp in granule_paths)
    results = pqdm(args, preprocess_omi_data, n_jobs=n_jobs, argument_type="args")
    return pd.concat(results)
In [ ]:
train_gdf = preprocess_trop_no2(fp, polys_gdf, "ColumnAmountNO2Trop", locations)
In [46]:
grid_id geometry location granule_id value
0 1Z2W7 POLYGON ((77.30453 28.54664, 77.30453 28.58609... Delhi 20190101T235959_omi_dl-la-tpe_0.he5 10.310496
1 6EIL6 POLYGON ((77.07995 28.54664, 77.07995 28.58609... Delhi 20190101T235959_omi_dl-la-tpe_0.he5 10.310496
2 7334C POLYGON ((77.12487 28.54664, 77.12487 28.58609... Delhi 20190101T235959_omi_dl-la-tpe_0.he5 10.310496
In [47]:
(26403, 5)

train_gdf describes 26,403 Tropospheric Column NO2 values near our competition training grid cells!

Simple benchmark model

Using train_df, let's prepare an extremely simple benchmark solution for the Airathon trace gas track.

In [48]:
def calculate_features(
    feature_df: gpd.GeoDataFrame,
    label_df: pd.DataFrame,
    stage: str,
    """Given processed NO2 data and training labels (train) or 
    submission format (test), return a feature GeoDataFrame that
    contain a single features for mean NO2.
        feature_df (gpd.GeoDataFrame): GeoDataFrame that contains
            Points and associated values.
        label_df (pd.DataFrame): training labels (train) or
            submission format (test).
        stage (str): "train" or "test".
        full_data (gpd.GeoDataFrame): GeoDataFrame that contains `mean_no2`,
            for each grid cell and datetime.   
    # Add `day` column to `feature_df` and `label_df`
    feature_df["datetime"] = pd.to_datetime(
        feature_df.granule_id.str.split("_", expand=True)[0],
    feature_df["day"] =
    label_df["day"] =

    # Calculate average NO2 per day/grid cell for which we have feature data
    avg_no2_day = (
        feature_df.groupby(["day", "grid_id"]).mean()
                  .rename(columns={"value": "mean_no2"})

    # Join labels/submission format with feature data
    how = "inner" if stage == "train" else "left"
    full_data = pd.merge(
        left_on=["day", "grid_id"],
        right_on=["day", "grid_id"]
    return full_data
In [49]:
full_data = calculate_features(train_gdf, train_labels, stage="train")
datetime grid_id no2 day mean_no2
0 2019-01-01 08:00:00+00:00 3A3IE 8.695000 2019-01-01 4.481723
1 2019-01-01 08:00:00+00:00 3S31A 10.496667 2019-01-01 5.436216
2 2019-01-01 08:00:00+00:00 7II4T 37.208333 2019-01-01 6.966879

We do not want to overstate our model's performance by overfitting to one or more dates or locations. To be sure that our method is sufficiently generalizable, let's set aside a portion of the data to validate the model. To ensure that we are not using data collected in the future to estimate prior NO2, we can subset the training data into training and validation sets using the datetime field.

In [50]:
# 2020 data will be held out for validation
train = full_data[full_data.datetime.dt.year <= 2019].copy()
test = full_data[full_data.datetime.dt.year > 2019].copy()

In scikit-learn, we can set up a random forest with a few lines of code.

In [51]:
from sklearn.ensemble import RandomForestRegressor
In [52]:
# Train model on train set
model = RandomForestRegressor()[["mean_no2"]], train.no2)

# Compute R-squared using our holdout set
model.score(test[["mean_no2"]], test.no2)
In [53]:
# Refit model on entire training set[["mean_no2"]], full_data.no2)

Finally, let's make predictions for the competition test set and save them to the submission format. To keep things simple, we'll impute missing features using the average NO2 value from the training data.

In [ ]:
# Identify test granule s3 paths
test_md = no2_md[(no2_md["product"] == "omi") & (no2_md["split"] == "test")]

# Identify test grid cells
submission_format = pd.read_csv(RAW / "submission_format.csv", parse_dates=["datetime"])
test_gc = grid_md[grid_md.index.isin(submission_format.grid_id)]

# Process test data
test_fp = list(test_md.us_url)

# Create grid cell GeoDataFrame
test_polys = gpd.GeoSeries.from_wkt(test_gc.wkt, crs=wgs84_crs) = "geometry"
test_polys_gdf = gpd.GeoDataFrame(test_polys)
test_polys_gdf["location"] = test_gc.location

# Preprocess NO2 for test granules
test_df = preprocess_trop_no2(test_fp, test_polys_gdf, "ColumnAmountNO2Trop", locations)
In [55]:
grid_id geometry location granule_id value
0 1Z2W7 POLYGON ((77.30453 28.54664, 77.30453 28.58609... Delhi 20180908T235959_omi_dl-la-tpe_0.he5 1.304367
1 6EIL6 POLYGON ((77.07995 28.54664, 77.07995 28.58609... Delhi 20180908T235959_omi_dl-la-tpe_0.he5 1.304367
2 7334C POLYGON ((77.12487 28.54664, 77.12487 28.58609... Delhi 20180908T235959_omi_dl-la-tpe_0.he5 1.304367
In [56]:
# Prepare NO2 features
submission_df = calculate_features(test_df, submission_format, stage="test")

# Impute missing features using training set mean/max/min
submission_df.mean_no2.fillna(train_gdf.value.mean(), inplace=True)
submission_df.drop(columns=["day"], inplace=True)

# Make predictions using NO2 features
submission_df["preds"] = model.predict(submission_df[["mean_no2"]])
submission = submission_df[["datetime", "grid_id", "preds"]].copy()
submission.rename(columns={"preds": "value"}, inplace=True)

# Ensure submission indices match submission format
submission_format.set_index(["datetime", "grid_id"], inplace=True)
submission.set_index(["datetime", "grid_id"], inplace=True)
assert submission_format.index.equals(submission.index)
In [57]:
datetime grid_id
2018-09-08 08:00:00+00:00 3A3IE 26.832817
3S31A 15.404687
7II4T 18.713218
In [58]:
count 16350.000000
mean 23.821617
std 11.104485
min 2.956299
25% 20.288840
50% 20.476257
75% 26.188990
max 85.957874

Finally, we will save our submission locally.

In [59]:
# Save submission in the correct format
final_submission = pd.read_csv(RAW / "submission_format.csv")
final_submission["value"] = submission.reset_index().value
final_submission.to_csv((INTERIM / "submission.csv"), index=False)

We can now head to the competition submissions page and upload our submission to obtain our model's R2.

We should see an R2 of -0.1543 on the leaderboard. Now it's up to you to make improvements and increase this score! We hope this tutorial provides a helpful framework for loading and exploring the data, correcting and aligning values, and working with geospatial vector and raster data.

Head over to the challenge homepage to get started. We can't wait to see what you build!

Banner image, artist's rendering of Aura, courtesy of NASA