A peek inside DrivenData's code execution competitions


by Robert Gibboni

In this post, we introduce an exciting part of the DrivenData competition platform ― code execution competitions! They're not exactly new ― our first code execution competition launched back in November 2019.

Over the past couple of years, the DrivenData community has used this infrastructure to identify wildlife, diagnose cervical biopsies, test differential privacy algorithms, forecast space weather, and map floods!

Now that we have a few under our belt, we thought we'd share a little about the tech that makes this kind of competition possible. This post will cover:

  • What a code execution competition is
  • Why we run code execution competitions
  • How the code execution infrastructure works

If you want to try it out yourself, head over to STAC Overflow: Map Floodwater from Radar Imagery, our current code execution competition that is running through the end of September 2021!

What is a code execution competition?

A table summarizing where different stages of a competition are run: on the participant's own hardware 💻 or in the cloud ☁️. The stages that differ between traditional and code execution competitions are downloading the test features and predicting the target variable.

Competition type | Download train features | Download train labels | Train a model | Write code to predict | Download test features | Predict target variable | Score predictions
Traditional | 💻 | 💻 | 💻 | 💻 | 💻 | 💻 | ☁️
Code execution | 💻 | 💻 | 💻 | 💻 | ☁️ | ☁️ | ☁️

The biggest difference between a "traditional" competition and a code execution competition is in how predictive outputs are generated when testing different solutions.

Let's start by reviewing how a traditional competition works. Let's say you're trying to help conservation researchers use ML to detect wildlife species in camera trap images. In a traditional competition, a participant downloads training images and the matching labels that indicate which species are in each image. They train a model on their own hardware to learn the associations between images and labels. Then, to test the performance of the model, they run it on a set of unlabeled images (the test features) to predict which species are present. They submit those predictions as a CSV file to the DrivenData platform for scoring against the true labels. The resulting score is updated on a live leaderboard.
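
To make that concrete, here is a rough sketch of what the final step of a traditional submission might look like: running a trained model over the downloaded test images and writing the predictions to a CSV for upload. The file layout and column names are made up for illustration and will differ from competition to competition.

```python
# A rough sketch of the last step of a traditional submission: run a trained
# model over locally downloaded test images and write a predictions CSV to
# upload to the platform. File names and columns are illustrative only.
from pathlib import Path

import pandas as pd


def predict_species(image_path: Path) -> str:
    """Stand-in for a participant's trained model."""
    return "zebra"  # a real model would return a predicted label here


test_images = sorted(Path("data/test_features").glob("*.jpg"))
submission = pd.DataFrame(
    {
        "id": [path.stem for path in test_images],
        "species": [predict_species(path) for path in test_images],
    }
)
submission.to_csv("submission.csv", index=False)  # this CSV is what gets uploaded
```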

In a code execution version of this competition, the participant still downloads the training images and labels, and fits the model on their own hardware. However, they do not have direct access to the test images to run their trained model on. Instead, the participant uploads their model and inference code to the platform, which executes them in the cloud. The predictions their model produces (which species are present in each test image) are automatically evaluated against the true labels on the platform, and the leaderboard updates with the best scores.

In this way, code execution competitions bring winning solutions one step closer to application, by requiring that they demonstrate successful execution on unseen test data in a reproducible containerized environment.

Why code execution?

While a code execution competition is not the right fit for every challenge, there are a number of advantages it can bring.

🤝 Leveling the playing field

Not everyone can afford to run a 20-GPU cluster for a week to compute model predictions. While we realize that much of the computational cost comes from training models, participants are on more equal footing if they have access to the same resources for inference. (We are also thinking about ways to bring the training stage into the code execution infrastructure.)

⏳ Sponsors can constrain the winning solutions

Sometimes the best model for an application is not simply the model that gets the highest score. Many applications need to weigh additional factors like resource cost. For example, a sponsor may only want to consider solutions that can run in under x hours on a machine with y GB of GPU RAM. Code execution competitions offer a solution: we can set limits on the total compute time or hardware available for submissions. Sponsors may also have constraints on the software runtime of the solution. NOAA, the sponsor of the MagNet competition, needed the solutions to be written in Python, so the code execution environment only included a Python runtime.

🔒️ Prevent hand-labeling the test data

In some competitions (e.g., TissueNet), the number of examples in the test set is small enough that a motivated (if unethical) participant could hand-label them to obtain a winning score. We would discover this kind of cheating when reviewing their code, but why not avoid the chaos altogether? In a code execution competition, the test features never leave the code execution infrastructure.

☁️ Participants can engage with remote computational resources

Working with multi-terabyte datasets and GPU-accelerated models is computationally intensive and often difficult to run on one machine alone. Containerization allows data scientists to work efficiently with remote resources by developing code on their local machines that can run smoothly on more powerful hardware elsewhere. A code execution competition provides a way for participants to gain experience working with Docker, a popular containerization option. Each competition comes with a repository that has the container definition (as a Dockerfile) and helper scripts to work with the container.

📈 Automatic review of winning submissions

At the end of a traditional competition, winning participants submit their source code and trained models for DrivenData to review. Each participant's submission may run bug-free on their own computer, but it does not always go as smoothly when we move it to ours. It's a matter of runtime, the particular hardware and software context in which code runs, including the specific versions of the operating system, drivers, software packages, etc. We do our best to set up the same runtime that the participant used, but the process of debugging and back-and-forth with the participant usually takes several days, often longer. Wouldn't we rather just have a clear answer as soon as the competition ends? In a code execution competition, we know every submission works, because we ran it ourselves in a standardized runtime as soon as it was submitted!

Sounds pretty cool! So how does it work?

How does code execution work?

Each code execution competition involves a bit of setup prior to competition launch. We create the following resources:

  • Competition runtime: A Docker image that defines the submission runtime (shared on our public Docker Hub account).
  • Submission storage: An Azure Blob Storage account to hold submissions.
  • Competition data storage: Additional Azure storage to hold other competition data, e.g., test features.
  • Kubernetes cluster: An Azure Kubernetes Service (AKS) cluster to run the submissions.

After launch, a participant can download the training dataset and start modeling! Here's a more detailed step-by-step explanation of how it works.

A diagram showing the process of training a model locally and submitting it for execution in the code execution cluster.

On the participant side:

  1. Training: The participant trains a model on the training dataset. They zip up their model and the code needed to run it (infer.py) into submission.zip. (A minimal sketch of such a script appears after this list.)
  2. Local testing: To be sure that their model will run correctly, they test their submission locally using the exact same runtime that will be used in the cloud. They download the Competition runtime from Docker Hub and run it on a sample of the test features.
  3. Submission: When everything is running smoothly locally, the participant uploads their submission.zip to the DrivenData competition platform. From there, the platform sends the submission on a fantastic journey.
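
For a sense of what goes into submission.zip, here is a minimal sketch of an infer.py. The mount points, file names, and output location are placeholders; each competition's runtime repository documents the exact paths and submission format it expects.

```python
# infer.py: a minimal sketch of the inference script inside submission.zip.
# The mount points and output path are placeholders; each competition's
# runtime repository documents the exact paths and submission format.
from pathlib import Path

import pandas as pd

DATA_DIR = Path("/data/test_features")             # read-only test features mounted into the pod
OUTPUT_PATH = Path("/submission/predictions.csv")  # where the platform expects predictions


def load_model():
    """Load the model assets that were zipped up alongside this script."""
    ...  # e.g., torch.load("assets/model.pt")


def predict(model, image_path: Path) -> str:
    """Run a single test image through the model and return a label."""
    ...


def main():
    model = load_model()
    image_paths = sorted(DATA_DIR.glob("*.jpg"))
    predictions = pd.DataFrame(
        {
            "id": [path.stem for path in image_paths],
            "species": [predict(model, path) for path in image_paths],
        }
    )
    predictions.to_csv(OUTPUT_PATH, index=False)


if __name__ == "__main__":
    main()
```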

On the platform side:

  1. Upload: The DrivenData competition platform uploads the submission.zip to Submission storage.
  2. Resource provision: It tells our Kubernetes cluster to create an Inference pod, a self-contained compute unit whose sole purpose is to run the submission. The pod starts up using the Competition runtime hosted on Docker Hub and mounts two "disks": the Submission storage container with submission.zip and the Competition data storage, another cloud storage resource with the test features.
  3. Execution: The Inference pod unzips submission.zip and runs the participant's code. If their submission is bug-free, the code loads and processes the test features and uses the model assets to predict labels for the test set, the test predictions. The test predictions are saved to the same Submission storage as before. At last, the pod shuts down.
  4. Scoring: The platform (which has been babysitting the pod throughout these steps) sees the pod expire, downloads the test predictions from Submission storage, scores them against the true labels, and posts the score to the leaderboard! (A rough sketch of this scoring step follows below.)
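
To give a flavor of that final scoring step, here is a rough sketch of comparing the pod's predictions to the held-out ground truth. The metric and column names are purely illustrative; every competition defines its own.

```python
# A rough sketch of the scoring step: compare the predictions the pod wrote
# against the held-out ground truth. The metric and column names are purely
# illustrative; every competition defines its own.
import pandas as pd
from sklearn.metrics import accuracy_score

predictions = pd.read_csv("predictions.csv", index_col="id")   # downloaded from Submission storage
ground_truth = pd.read_csv("test_labels.csv", index_col="id")  # never leaves the platform

# Align predictions to the ground truth so every test example is scored once.
predictions = predictions.loc[ground_truth.index]

score = accuracy_score(ground_truth["species"], predictions["species"])
print(f"Leaderboard score: {score:.4f}")
```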

Phew! Let's dive into a bit more detail for some of the key components.

One runtime to rule them all

One of the challenges of a code execution competition is getting code written on a participant's machine to run in the code execution cluster. It comes down to differences in runtime―code written for one runtime can easily fail to run on another because of differences in operating system version, drivers, software packages, etc. In our case, there are probably as many unique runtimes as there are participants. How do we ensure that we can run all of their code in our compute cloud? This problem is not unique to DrivenData code execution competitions, and neither is the solution: containerization software (we use Docker) lets us use the exact same runtime across different systems.

Every code execution competition comes with its own submission runtime repo, which has a Dockerfile that specifies the submission runtime. We start with an image that includes the most common machine learning packages (scikit-learn, fast.ai, tensorflow, torch, etc.) and upload it to a public Docker Hub repository. From there, participants can download the container to their local machines, where they can develop and test their code using the exact same runtime that the cloud compute cluster uses.
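
The runtime repositories include helper scripts for pulling and running the container, but conceptually a local test boils down to something like the following sketch, written here with the Docker SDK for Python. The image name, mount paths, and command are placeholders.

```python
# Conceptual sketch of local testing with the Docker SDK for Python. The real
# runtime repositories ship helper scripts for this; the image name, mount
# paths, and command below are placeholders.
import docker

client = docker.from_env()

# Pull the same image that the cloud cluster will use.
client.images.pull("drivendata/competition-runtime", tag="latest")

# Run the submission against a local sample of the test features, with the
# data mounted read-only and networking disabled, mirroring the cloud setup.
logs = client.containers.run(
    "drivendata/competition-runtime:latest",
    command="python /submission/infer.py",
    volumes={
        "/path/to/submission": {"bind": "/submission", "mode": "rw"},
        "/path/to/sample_test_features": {"bind": "/data", "mode": "ro"},
    },
    network_disabled=True,
    remove=True,
)
print(logs.decode())
```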

Since the repo is public, any participant can make a pull request to add a new package to the runtime. It is usually as simple as adding the package to the conda environment YAML files. (For example, here is a pull request to the TissueNet submission runtime repo adding two packages that were useful for working with images of microscope slides.) When a PR is submitted, a GitHub Action automatically tests that the Docker image builds successfully and that all of the packages can be imported. If things look good, we merge the PR. On commits to main, another GitHub Action rebuilds the Docker image with the updates and pushes the updated image to the Docker Hub repository. The next submission will run using the updated image!
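
As a rough illustration, the "can every package be imported?" check could be as simple as the script below; the package list shown is illustrative rather than the actual runtime specification.

```python
# A rough sketch of the CI smoke test: inside the freshly built image, check
# that every package the runtime promises to provide can be imported. The
# package list here is illustrative, not the actual runtime specification.
import importlib

PACKAGES = ["numpy", "pandas", "sklearn", "torch", "tensorflow", "fastai"]

failures = []
for name in PACKAGES:
    try:
        importlib.import_module(name)
    except ImportError as err:
        failures.append(f"{name}: {err}")

if failures:
    raise SystemExit("Failed imports:\n" + "\n".join(failures))

print("All packages imported successfully.")
```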

Running submissions on Azure Kubernetes Service

Kubernetes is an open-source framework for managing containerized resources. We use Azure Kubernetes Service (AKS), Azure's implementation of Kubernetes, to run our code execution cluster. Central to the Kubernetes framework is the notion of a "pod," a self-contained compute environment. We configure a submission pod to run the competition runtime Docker image, and to connect to the necessary storage resources. We also configure the pod with a few security measures:

  • Disallow network/internet access.
  • Set the test dataset permissions to read only.

This prevents an unscrupulous participant from emailing themselves the test labels or overwriting them to hack their score. Once the pod is configured, we give the DrivenData platform credentials to start, stop, and monitor pods.
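
As a simplified sketch (not the platform's actual code), configuring and launching such a pod with the official Kubernetes Python client might look something like this; the names, image, and volume claims are placeholders, and network isolation is handled separately (for example, with a NetworkPolicy).

```python
# A simplified sketch of launching an inference pod with the official
# Kubernetes Python client. Names, image, and volume details are placeholders;
# network isolation is handled separately (e.g., via a NetworkPolicy).
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running on the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="inference-submission-1234"),
    spec=client.V1PodSpec(
        restart_policy="Never",  # run the submission once, then let the pod expire
        containers=[
            client.V1Container(
                name="inference",
                image="drivendata/competition-runtime:latest",  # placeholder image
                command=["python", "/submission/infer.py"],
                volume_mounts=[
                    client.V1VolumeMount(name="submission", mount_path="/submission"),
                    client.V1VolumeMount(name="competition-data", mount_path="/data", read_only=True),
                ],
            )
        ],
        volumes=[
            client.V1Volume(
                name="submission",
                persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                    claim_name="submission-storage"
                ),
            ),
            client.V1Volume(
                name="competition-data",
                persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                    claim_name="competition-data"
                ),
            ),
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="competitions", body=pod)
```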

Storing submissions and competition data on Azure

We configure two cloud storage resources for the purposes of code execution: submission storage and competition data storage.

Submission storage

The submission storage resource holds participant submissions, the zipped files of model assets and code. We found that Azure Blob Storage was a good option, allowing submissions (typically no more than a few hundred MBs) to be uploaded and downloaded in a reasonable amount of time.
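
For illustration, uploading a submission with the azure-storage-blob SDK looks roughly like this; the connection string, container name, and blob name are placeholders.

```python
# A minimal sketch of uploading a submission to Azure Blob Storage with the
# azure-storage-blob SDK. The connection string, container name, and blob
# name are placeholders for illustration.
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<submission-storage-connection-string>",
    container_name="submissions",
    blob_name="user-1234/submission.zip",
)

with open("submission.zip", "rb") as f:
    blob.upload_blob(f, overwrite=True)
```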

Competition data storage: Need for (read) speed

The competition data storage holds the competition assets shared across all submissions (for example, the test features), which we upload once prior to competition launch and which remain static throughout the competition. Each submission pod reads the test features from this resource when running the submission. The choice of storage backend depends on the size of the competition data. For competitions with relatively small test datasets (on the order of a few gigabytes), we found that Azure Blob Storage is fast enough. However, some competitions (e.g., Hakuna Ma-data and TissueNet) involve test datasets on the order of terabytes. Reading the data would take hours if it were hosted on blob storage. In these cases, we turn to some of Azure's more performant storage options―Azure Disks and Azure Files. Here's a quick comparison of the published read speeds for a 1 TB storage resource:

Storage type | Read speed (MB/s)
Standard Disk S30 (HDD) | 60
Standard Disk E30 (SSD) | 60
Premium Azure Files | 122
Premium Disk P30 | 200
Ultra Disk | 2,000

Of course, the faster storage options are more expensive, but they are absolutely critical to working with huge competition datasets.
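
A quick back-of-the-envelope calculation shows why this matters for terabyte-scale datasets: at the speeds in the table above, reading 1 TB ranges from several hours down to under ten minutes.

```python
# Back-of-envelope: time to read 1 TB at the published throughputs above.
speeds_mb_per_s = {
    "Standard Disk S30/E30": 60,
    "Premium Azure Files": 122,
    "Premium Disk P30": 200,
    "Ultra Disk": 2_000,
}

one_tb_in_mb = 1_000_000  # 1 TB is roughly 1,000,000 MB

for name, speed in speeds_mb_per_s.items():
    hours = one_tb_in_mb / speed / 3600
    print(f"{name}: ~{hours:.1f} hours to read 1 TB")

# Standard Disk S30/E30: ~4.6 hours
# Premium Azure Files:   ~2.3 hours
# Premium Disk P30:      ~1.4 hours
# Ultra Disk:            ~0.1 hours (roughly 8 minutes)
```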

Conclusion

There are a bunch of advantages to running a code execution competition―they are a great way for participants to get familiar with remote/containerized workflows, they allow for automatic verification of the results, they start to equalize the computational resources available to participants, and they let sponsors constrain the computational requirements of the winning solutions. It has been rewarding to see the infrastructure we designed handle thousands of submissions (and counting!). We hope you enjoyed hearing about our journey, and we look forward to building out more features for our code execution infrastructure!