Genetic Engineering Attribution - Benchmark

In this post we'll show you how to train a random forest model to predict the lab-of-origin from DNA sequences!

Christine Chung
Senior Data Scientist

Genetic engineering is a powerful tool with a wide array of applications. It's being used to do everything from creating safe alternatives to pesticides to producing critical ingredients for medicines and industry. However, it's rise and increasing accessibility also comes with the potential for lab accidents, purposeful wrongdoing, and abuse.

Genetic engineering attribution is a tool that can help promote accountability and deter bad actors. In our newest competition, we're asking you to to create an algorithm that identifies the lab-of-origin for genetically engineered DNA—with the highest accuracy level possible. The metric for this competition is top ten accuracy.

CDC scientist examines the results of a hemagglutinin inhibition (HI) test.

The DNA for this challenge comes from an online dataset of plasmids. Plasmids are small, circular DNA molecules that replicate independently from chromosomes. Plasmids used commonly in research come from cells in mammals, bacteria, worms, yeast, and insects. Plasmid information can be uploaded and accessed digitally thanks to AddGene, a large, non-profit plasmid repository based in Watertown, MA.

In this benchmark, we'll cover how to:

  • load the data
  • create text-based features from DNA sequences
  • train a random forest model and make predictions
  • create a custom scorer based on top-ten accuracy
  • submit predictions to the competition

To get started, we import the standard tools of the trade.

In [1]:
from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns


%matplotlib inline
sns.set()

Loading the Data

On the data download page, we provide everything you need to get started:

  • Training Features: These are the features you'll use to train a model. There are 39 features, including the plasmid DNA sequence and metadata about the plasmids. Each row has an identifier called sequence_id, a unique 5-character alphanumeric string.
  • Training Labels: These are the labels. Every sequence_id in the training values data has a corresponding label in this file, given by a lab_id. A lab_id is a unique, 8-character alphanumeric identifier for a particular lab. The labels are one-hot encoded.
  • Test Features: These are the features you'll use to make predictions after training a model. We don't give you the labels for these samples; it's up to you to predict labs for these sequence_ids.
  • Submission Format: This is what a submission should look like, including the file, column, and row names. Since evaluation is based on top ten accuracy, you'll need to submit class probabilities instead of a single class prediction. The submission format has random floats between 0 and 1 filled in as a baseline. Your submission to the leaderboard must be in this exact form (with different probability values, of course) in order to be scored successfully!

Since this is a benchmark, we're only going to use a subset of the features in the dataset. It's up to you to take advantage of all the information!

In [2]:
DATA_DIR = Path.cwd().parent / "data" / "final" / "public"
In [3]:
train_values = pd.read_csv(DATA_DIR / "train_values.csv", index_col="sequence_id")
train_labels = pd.read_csv(DATA_DIR / "train_labels.csv", index_col="sequence_id")
test_values = pd.read_csv(DATA_DIR / "test_values.csv", index_col="sequence_id")

Explore the Data

Let's take a quick look at all of the data we have available. First let's check out our features.

Training set

In [4]:
train_values.head()
Out[4]:
sequence bacterial_resistance_ampicillin bacterial_resistance_chloramphenicol bacterial_resistance_kanamycin bacterial_resistance_other bacterial_resistance_spectinomycin copy_number_high_copy copy_number_low_copy copy_number_unknown growth_strain_ccdb_survival ... species_budding_yeast species_fly species_human species_mouse species_mustard_weed species_nematode species_other species_rat species_synthetic species_zebrafish
sequence_id
9ZIMC CATGCATTAGTTATTAATAGTAATCAATTACGGGGTCATTAGTTCA... 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
5SAQC GCTGGATGGTTTGGGACATGTGCAGCCCCGTCTCTGTATGGAGTGA... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
E7QRO NNCCGGGCTGTAGCTACACAGGGCGGAGATGAGAGCCCTACGAAAG... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
CT5FP GCGGAGATGAAGAGCCCTACGAAAGCTGAGCCTGCGACTCCCGCAG... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
7PTD8 CGCGCATTACTTCACATGGTCCTCAAGGGTAACATGAAAGTGATCC... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 40 columns

We have one non-numeric column, sequence, which contains the DNA plasmid sequence.

In [5]:
train_values.dtypes
Out[5]:
sequence                                 object
bacterial_resistance_ampicillin         float64
bacterial_resistance_chloramphenicol    float64
bacterial_resistance_kanamycin          float64
bacterial_resistance_other              float64
bacterial_resistance_spectinomycin      float64
copy_number_high_copy                   float64
copy_number_low_copy                    float64
copy_number_unknown                     float64
growth_strain_ccdb_survival             float64
growth_strain_dh10b                     float64
growth_strain_dh5alpha                  float64
growth_strain_neb_stable                float64
growth_strain_other                     float64
growth_strain_stbl3                     float64
growth_strain_top10                     float64
growth_strain_xl1_blue                  float64
growth_temp_30                          float64
growth_temp_37                          float64
growth_temp_other                       float64
selectable_markers_blasticidin          float64
selectable_markers_his3                 float64
selectable_markers_hygromycin           float64
selectable_markers_leu2                 float64
selectable_markers_neomycin             float64
selectable_markers_other                float64
selectable_markers_puromycin            float64
selectable_markers_trp1                 float64
selectable_markers_ura3                 float64
selectable_markers_zeocin               float64
species_budding_yeast                   float64
species_fly                             float64
species_human                           float64
species_mouse                           float64
species_mustard_weed                    float64
species_nematode                        float64
species_other                           float64
species_rat                             float64
species_synthetic                       float64
species_zebrafish                       float64
dtype: object

The length of these sequences vary. Some are as short as 20 characters and others are as long as 60,000!

In [6]:
sequence_lengths = train_values.sequence.apply(len)

sequence_lengths.describe()
Out[6]:
count    63017.000000
mean      4839.025501
std       3883.148431
min         20.000000
25%        909.000000
50%       4741.000000
75%       7490.000000
max      60099.000000
Name: sequence, dtype: float64
In [7]:
sequence_lengths.plot(
    kind="hist", title="Distribution of DNA Sequence Lengths", bins=100, xlim=(0, 20000)
);  # don't plot super long sequences

The rest of our features are binary, one hot encoded columns.

In [8]:
# exclude the 0th column which is the dna sequence
train_values.iloc[:, 1:].apply(pd.value_counts)
Out[8]:
bacterial_resistance_ampicillin bacterial_resistance_chloramphenicol bacterial_resistance_kanamycin bacterial_resistance_other bacterial_resistance_spectinomycin copy_number_high_copy copy_number_low_copy copy_number_unknown growth_strain_ccdb_survival growth_strain_dh10b ... species_budding_yeast species_fly species_human species_mouse species_mustard_weed species_nematode species_other species_rat species_synthetic species_zebrafish
0.0 19590 60506 48706 62078 58592 18750 57313 50059 61841 61285 ... 61417 61478 36890 57732 61978 62322 56381 61949 55170 61742
1.0 43427 2511 14311 939 4425 44267 5704 12958 1176 1732 ... 1600 1539 26127 5285 1039 695 6636 1068 7847 1275

2 rows × 39 columns

Since these columns are binary, the mean value shows us the prevalence, meaning what proportion of sequences have that characteristic. Check out the problem description page for definitions of each feature.

This plot shows us that a few features are common across sequences but most are quite rare.

In [9]:
sorted_binary_features = train_values.iloc[:, 1:].mean().sort_values()

ax = sorted_binary_features.plot(
    kind="barh", stacked=True, figsize=(5, 12), title="Prevalence of Binary Features"
)
ax.set_xlabel("Proportion of sequences");

Test set

Our test set features are set up in the same way as the training set.

In [10]:
test_values.head()
Out[10]:
sequence bacterial_resistance_ampicillin bacterial_resistance_chloramphenicol bacterial_resistance_kanamycin bacterial_resistance_other bacterial_resistance_spectinomycin copy_number_high_copy copy_number_low_copy copy_number_unknown growth_strain_ccdb_survival ... species_budding_yeast species_fly species_human species_mouse species_mustard_weed species_nematode species_other species_rat species_synthetic species_zebrafish
sequence_id
E0VFT AGATCTATACATTGAATCAATATTGGCAATTAGCCATATTAGTCAT... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
TTRK5 GCGCGCGTTGACATTGATTATTGACTAGTTATTAATAGTAATCAAT... 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2Z7FZ GCTTAAGCGGTCGACGGATCGGGAGATCTCCCGATCCCCTATGGTG... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
VJI6E ATGATGATGATGTCCCTGAACAGCAAGCAGGCGTTTAGCATGCCGC... 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
721FI GGTACCGAGCTCTTACGCGTGCTAGCCATACTATCAGCCACTTGTG... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 40 columns

In [11]:
test_values.sequence.apply(len).describe()
Out[11]:
count    18816.000000
mean      4875.523810
std       4004.117614
min         19.000000
25%        894.750000
50%       4732.000000
75%       7341.250000
max      38638.000000
Name: sequence, dtype: float64

Training Labels

Now let's look at our labels. It's important to recognize that these are one-hot encoded. Each column represents a lab ID. The correct lab for a given sequence_id is indicated by a 1.0 in that column, and 0.0 otherwise.

In [12]:
train_labels.head()
Out[12]:
00Q4V31T 012VT4JK 028IO5W2 03GRNN7N 03Y3W51H 09MQV1TY 0A4AHRCT 0A9M05NC 0B9GCUVV 0CL7QVG8 ... ZQNGGY33 ZSHS4VJZ ZT1IP3T6 ZU6860XU ZU6TVFFU ZU75P59K ZUI6TDWV ZWFD8OHC ZX06ZDZN ZZJVE4HO
sequence_id
9ZIMC 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5SAQC 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
E7QRO 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
CT5FP 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7PTD8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 1314 columns

To make our modeling and analysis simpler, let's collapse the labels dataframe into one column that shows us the lab ID for each sequence.

In [13]:
# get the column with the max value in each row
lab_ids = pd.DataFrame(train_labels.idxmax(axis=1), columns=["lab_id"])
lab_ids.head()
Out[13]:
lab_id
sequence_id
9ZIMC RYUA3GVO
5SAQC RYUA3GVO
E7QRO RYUA3GVO
CT5FP RYUA3GVO
7PTD8 RYUA3GVO

Labs can have anywhere from 1 to over 8,000 sequences in the data. On average, there are about 48 sequences per lab in the training data.

In [14]:
# get the distrubtion of lab prevalence in the training set
lab_ids["lab_id"].value_counts().describe()
Out[14]:
count    1314.000000
mean       47.958143
std       262.552258
min         1.000000
25%         9.000000
50%        15.000000
75%        34.000000
max      8286.000000
Name: lab_id, dtype: float64

Sorting labs by prevalence, we see that lab ID I7FXTVDP accounts for 13% of the sequences in the train set. The top 3 labs account for more than 20% of the data. We'll have to be mindful of class imbalance in our training.

In [15]:
# Sort lab ids by prevalence
(lab_ids["lab_id"].value_counts(normalize=True).sort_values(ascending=False)).head()
Out[15]:
I7FXTVDP    0.131488
RKJHZGDQ    0.043353
GTVTUGVY    0.042401
A18S09P2    0.016884
Q2K8NHZY    0.015440
Name: lab_id, dtype: float64

Construct Features from DNA Sequences

For this benchmark, we're going to use the DNA sequences as the basis for the features in our model. We've got some preprocessing work to do to turn these lengthy strings into useful features!

These sequences are composed of five characters. G, C, A, and T represent the four nucleotides commonly found in DNA (guanine, cytosine, adenine, thymine). N stand for any nucleotide (not a gap).

In [16]:
bases = set("".join(train_values.sequence.values))
bases
Out[16]:
{'A', 'C', 'G', 'N', 'T'}

One common way to turn strings into useful features is to count n-grams, or continuous subsequences of length n. Here, we'll split up the DNA sequences into four-grams, or subsequences consisting of 4 bases.

With 5 unique bases, we can produce 120 different sequence permutations consisting of 4 bases. You can play around with the length of n to see how it affects your model.

In [17]:
from itertools import permutations

n = 4
subsequences = ["".join(permutation) for permutation in permutations(bases, r=n)]
In [18]:
print(f"Number of subsequences: {len(subsequences)}")
subsequences[:10]
Number of subsequences: 120
Out[18]:
['ANTG',
 'ANTC',
 'ANGT',
 'ANGC',
 'ANCT',
 'ANCG',
 'ATNG',
 'ATNC',
 'ATGN',
 'ATGC']

We can now turn our strings into features by taking a simple count of each subsequence. We could do this one of two ways:

  1. Overlapping substrings: To count overlapping substrings, we would use a sliding window such that the sequence "ATTATTA" will result in a count of 2 for the substring "ATTA" ("ATTA-TTA" and "ATT-ATTA").

  2. Non-overlapping substrings: To count non-overlapping substrings, we search for each substring. If we find it, we count it and then continue our search at the end of the substring. In this case, "ATTATTA" will result in a count of 1 for the substring "ATTA" ("ATTA-TTA").

For simplicity, we're going to use the second method of non-overlapping substrings. We can use the built in count method on strings. Feel free to try both methods to see how it affects your model.

In [19]:
# Example of built-in count method on strings
# Because it's non-overlapping, "atta" is only counted twice
"gattattattaca".count("atta")
Out[19]:
2

Now let's create a helper function to generate features. get_ngram_features will create a new dataframe that holds the counts for each subsequence and row in our data.

In [20]:
def get_ngram_features(data, subsequences):
    """Generates counts for each subsequence.

    Args:
        data (DataFrame): The data you want to create features from. Must include a "sequence" column.
        subsequences (list): A list of subsequences to count.

    Returns:
        DataFrame: A DataFrame with one column for each subsequence.
    """
    features = pd.DataFrame(index=data.index)
    for subseq in subsequences:
        features[subseq] = data.sequence.str.count(subseq)
    return features

Using our helper function, we can generate feautures for all the sequences in our training set! This will take a minute or two to run.

In [21]:
# Calculate n-gram features on our training set
ngram_features = get_ngram_features(train_values, subsequences)
ngram_features.head()
Out[21]:
ANTG ANTC ANGT ANGC ANCT ANCG ATNG ATNC ATGN ATGC ... CTNA CTNG CTGA CTGN CGAN CGAT CGNA CGNT CGTA CGTN
sequence_id
9ZIMC 0 0 0 0 0 0 0 0 0 30 ... 0 0 44 0 0 17 0 0 14 0
5SAQC 0 0 0 0 0 0 0 0 0 2 ... 0 0 6 0 0 0 0 0 0 0
E7QRO 0 0 0 1 0 1 0 0 0 1 ... 0 0 2 2 1 1 0 0 0 0
CT5FP 0 0 0 0 0 0 0 0 0 2 ... 0 0 8 0 0 0 0 0 1 0
7PTD8 0 0 0 1 0 0 0 0 0 6 ... 0 0 4 0 0 1 0 0 2 0

5 rows × 120 columns

In [22]:
ngram_features.shape
Out[22]:
(63017, 120)

We now have features for all 120 possible subsequences. Their values show the counts of each 4-gram within the full DNA sequence.

Let's join them with our one-hot endcoded binary features.

In [23]:
all_features = ngram_features.join(train_values.drop("sequence", axis=1))
In [24]:
all_features.head()
Out[24]:
ANTG ANTC ANGT ANGC ANCT ANCG ATNG ATNC ATGN ATGC ... species_budding_yeast species_fly species_human species_mouse species_mustard_weed species_nematode species_other species_rat species_synthetic species_zebrafish
sequence_id
9ZIMC 0 0 0 0 0 0 0 0 0 30 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
5SAQC 0 0 0 0 0 0 0 0 0 2 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
E7QRO 0 0 0 1 0 1 0 0 0 1 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
CT5FP 0 0 0 0 0 0 0 0 0 2 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
7PTD8 0 0 0 1 0 0 0 0 0 6 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 159 columns

In [25]:
all_features.shape
Out[25]:
(63017, 159)

The Error Metric: Top Ten Accuracy

The goal for this competition is to help narrow down the field of possible labs-of-origin from thousands to just a few. To that end, predictions will be evaluated based on top-ten accuracy. That means we'll consider a prediction "correct" if the true lab-of-origin is in the top ten most likely labs.

There is not a built in evaluation metric for top-k accuracy in scikit-learn, so we'll be constructing a custom scorer. We'll use this to determine the final accuracy of our model.

First, we'll need to create a custom scorer that can take in an estimator, validation data, and labels, and output a score based on the top ten results from each predicton. Scikit-learn let's us do this if we adhere to a specific signature.

From the documentation:

For a callable to be a scorer, it needs to meet the protocol specified by the following two rules:

It can be called with parameters (estimator, X, y), where estimator is the model that should be evaluated, X is validation data, and y is the ground truth target for X (in the supervised case) or None (in the unsupervised case).

It returns a floating point number that quantifies the estimator prediction quality on X, with reference to y. Again, by convention higher numbers are better, so if your scorer returns loss, that value should be negated.

In [26]:
def top10_accuracy_scorer(estimator, X, y):
    """A custom scorer that evaluates a model on whether the correct label is in
    the top 10 most probable predictions.

    Args:
        estimator (sklearn estimator): The sklearn model that should be evaluated.
        X (numpy array): The validation data.
        y (numpy array): The ground truth labels.

    Returns:
        float: Accuracy of the model as defined by the proportion of predictions
               in which the correct label was in the top 10. Higher is better.
    """
    # predict the probabilities across all possible labels for rows in our training set
    probas = estimator.predict_proba(X)

    # get the indices for top 10 predictions for each row; these are the last ten in each row
    # Note: We use argpartition, which is O(n), vs argsort, which uses the quicksort algorithm
    # by default and is O(n^2) in the worst case. We can do this because we only need the top ten
    # partitioned, not in sorted order.
    # Documentation: https://numpy.org/doc/1.18/reference/generated/numpy.argpartition.html
    top10_idx = np.argpartition(probas, -10, axis=1)[:, -10:]

    # index into the classes list using the top ten indices to get the class names
    top10_preds = estimator.classes_[top10_idx]

    # check if y-true is in top 10 for each set of predictions
    mask = top10_preds == y.reshape((y.size, 1))

    # take the mean
    top_10_accuracy = mask.any(axis=1).mean()

    return top_10_accuracy

Build the Model

Random forests are often a good first model to try so we'll start there. We'll leave more complicated modeling and feature selection up to you!

Random Forest

It's easy to build a random forest model with Scikit Learn. We're going to create a simple model with a few specified hyperparameters.

Tip: If you wanted to get fancy, you could perform a cross-validated grid search with GridSearchCV. You could even use the custom scorer we made earlier to train your model by passing it in via the scoring parameter.

In [27]:
from sklearn.ensemble import RandomForestClassifier

Let's generate our labels and rename our features to match.

In [28]:
# Rename our feature array
X = all_features

# Create our labels
y = lab_ids.values.ravel()

We've got our features and our labels. Time to train!

Wait a sec... aren't we forgetting something? Oh right, we still have to address the class imbalance we discovered earlier. Luckily, scikit-learn has an easy solution for us. We can set class_weight to "balanced". This will set class weights inversely proportional to the class frequency.

Great! Now let's generate and fit our model.

In [29]:
# instantiate our RF Classifier
rf = RandomForestClassifier(
    n_jobs=4,
    n_estimators=150,
    class_weight="balanced",  # balance classes
    max_depth=3,  # shallow tree depth to prevent overfitting
    random_state=0,  # set a seed for reproducibility
)

# fit our model
rf.fit(X, y)
Out[29]:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight='balanced',
                       criterion='gini', max_depth=3, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=4,
                       oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

Let's see how our model scored on our training data.

In [30]:
rf.score(X, y)
Out[30]:
0.1929955408857927

Nice! 19.30% accuracy isn't too shabby for our first try. However, that's using the default scorer, which is vanilla top-1 accuracy. We should expect to do better on the competition metric, top-10 accuracy. Let's use our custom defined scorer to see how we did:

In [31]:
top10_accuracy_scorer(rf, X, y)
Out[31]:
0.3784534332005649

Our model got about 37.85% top ten accuracy. Not bad! Now let's make some predictions using our newly trained model.

Time to Predict and Submit

Now we'll create features from our test set and generate predictions. We need to make sure to generate probabilities for each lab ID so that the top 10 accuracy metric can be computed.

First, let's create the 4-gram features and join it with the binary features.

In [32]:
test_ngram_features = get_ngram_features(test_values, subsequences)
all_test_features = test_ngram_features.join(test_values.drop("sequence", axis=1))

Make Predictions

It's important to use predict_proba here, which gives us class probabilities, instead of predict, which would give us class predictions.

In [33]:
probas = rf.predict_proba(all_test_features)

# Examine first row
probas[0]
Out[33]:
array([0.00087402, 0.00107841, 0.00107348, ..., 0.00095991, 0.00026639,
       0.00017803])

Save Submission

Now let's get our matrix of probabilities ready for submission.

In [34]:
submission_format = pd.read_csv(
    DATA_DIR / "submission_format.csv", index_col="sequence_id"
)
In [35]:
submission_format.head()
Out[35]:
00Q4V31T 012VT4JK 028IO5W2 03GRNN7N 03Y3W51H 09MQV1TY 0A4AHRCT 0A9M05NC 0B9GCUVV 0CL7QVG8 ... ZQNGGY33 ZSHS4VJZ ZT1IP3T6 ZU6860XU ZU6TVFFU ZU75P59K ZUI6TDWV ZWFD8OHC ZX06ZDZN ZZJVE4HO
sequence_id
E0VFT 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
TTRK5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2Z7FZ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
VJI6E 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
721FI 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 1314 columns

Before we create our submission csv, let's make sure the submission format matches the shape of our predicted probabilities. We're also going to check that the predicted classes are in the same order as the submission format columns.

In [36]:
assert submission_format.shape == probas.shape
assert (rf.classes_ == submission_format.columns).all()
In [37]:
my_submission = pd.DataFrame(
    data=probas, columns=rf.classes_, index=submission_format.index
)
In [38]:
my_submission.head()
Out[38]:
00Q4V31T 012VT4JK 028IO5W2 03GRNN7N 03Y3W51H 09MQV1TY 0A4AHRCT 0A9M05NC 0B9GCUVV 0CL7QVG8 ... ZQNGGY33 ZSHS4VJZ ZT1IP3T6 ZU6860XU ZU6TVFFU ZU75P59K ZUI6TDWV ZWFD8OHC ZX06ZDZN ZZJVE4HO
sequence_id
E0VFT 0.000874 0.001078 0.001073 0.000082 0.005858 0.001078 0.001029 0.001055 0.000215 0.000151 ... 0.000911 0.000564 0.000167 0.000182 0.001259 0.000672 0.001338 0.000960 0.000266 0.000178
TTRK5 0.000838 0.000989 0.001024 0.000162 0.000982 0.001868 0.000762 0.000967 0.000359 0.000110 ... 0.001487 0.000787 0.000294 0.000103 0.000813 0.000824 0.000936 0.002691 0.000228 0.000162
2Z7FZ 0.000665 0.000815 0.000799 0.000290 0.000686 0.001456 0.002528 0.001027 0.000163 0.000117 ... 0.001060 0.000793 0.000116 0.000119 0.001003 0.001022 0.001071 0.000761 0.000434 0.000328
VJI6E 0.000447 0.000135 0.000120 0.001868 0.000245 0.000108 0.000086 0.000122 0.002021 0.002008 ... 0.000158 0.000207 0.002026 0.001736 0.000174 0.000094 0.000115 0.000202 0.001320 0.001670
721FI 0.000947 0.001126 0.001121 0.000232 0.000970 0.000690 0.000904 0.001086 0.000393 0.000213 ... 0.001229 0.000660 0.000401 0.000183 0.000864 0.000709 0.001059 0.001484 0.000332 0.000271

5 rows × 1314 columns

Looks good! Let's save and submit!

In [39]:
my_submission.to_csv("submission.csv")

Submit to Leaderboard

0.134261298564003

Woohoo, we scored 33.2%! We hope this benchmark has shown you how to get started. We can't wait to see what you come up with. Happy modeling!

Stay updated

Join our newsletter or follow us for the latest on our social impact projects, data science competitions and open source work.

There was a problem. Please try again.
Subscribe successful!
Protected by reCAPTCHA. The Google Privacy Policy and Terms of Service apply.

Latest posts

All posts

winners

Meet the winners of the AI for Advancing Instruction Challenge

Learn how the winners of the AIAI challenge leveraged multimodal classroom data to identify instructional activities and classroom discourse content.

case studies

Automating wildlife monitoring with Zamba & Zamba Cloud

DrivenData partnered with conservation researchers to create Zamba, an open-source machine learning solution that helps wildlife researchers process camera trap footage, reducing months of manual review to hours of automated analysis.

community

Community Spotlight: Paola Ruiz, Néstor González, Daniel Crovo

The Community Spotlight features fantastic members from our DrivenData community. Three members of the IGCPHARMA team, Paola Ruiz, Néstor González, and Daniel Crovo talk to us about data science, drug discovery, diverse databases and more!

community

Community Spotlight: Kirill Brodt

The Community Spotlight features fantastic members from our DrivenData community. Kirill Brodt, a researcher in computer graphics at the University of Montreal, talks animation, pose estimation, and data science challenges.

case studies

Jump-starting data infrastructure and in-house data expertise

DrivenData designed and built a data warehouse to centralize, organize, and visualize data across CodePath's operations. Our team also provided technical hiring assistance to find the right talent to carry the work forward.

case studies

A production application to support survivors of human trafficking

DrivenData developed Freedom Lifemap, a digital tool designed to support survivors of human trafficking on their journey toward reintegration and independence.

insights

Life beyond the leaderboard

What happens to winning solutions after a machine learning competition?

insights

(Tech) Infrastructure Week for the Nonprofit Sector

Reflections on how to build data and AI infrastructure in the social sector that serves the needs of nonprofits and their beneficiaries.

winners

Meet the winners of Phase 2 of the PREPARE Challenge

Learn about how winners detected cognitive decline using speech recordings and social determinants of health survey data

insights

AI sauce on everything: Reflections on ASU+GSV 2025

Data, evaltuation, product iteration, and public goods: reflections on the ASU+GSV Summit 2025.

resources

Open-source packages for using speech data in ML

Overview of key open-source packages for extracting features from voice data to support ML applications

tutorial

Getting started with LLMs: a benchmark for the 'What's Up, Docs?' challenge

An introduction to using large language models via the benchmark to a document summarization challenge.

winners

Meet the Winners of the Goodnight Moon, Hello Early Literacy Screening Challenge

Learn about the results and winning methods from the early literacy screening challenge.

case studies

Crowdsourcing solutions for AI-assisted early literacy screening

DrivenData ran a machine learning competition to develop models for scoring audio recordings from literacy screener exercises completed by students in kindergarten through 3rd grade.

resources

Where to find a data job for a good cause

Finding data jobs for good causes can be difficult. Learn strategies, job lists, and tips to find organizations with open positions working on causes you care about.

winners

Meet the Winners of the Youth Mental Health Narratives Challenge

Learn about the winning solutions from the Youth Mental Health Challenge Automated Abstraction and Novel Variables Tracks

winners

Meet the winners of the Forecast and Final Prize Stages of the Water Supply Forecast Rodeo

Learn about the winners and winning solutions from the final stages of the Water Supply Forecast Rodeo.

insights

10 takeaways from 10 years of data science for social good

This year DrivenData celebrates our 10th birthday! We've spent the past decade working to use data science and AI for social good. Here are some lessons we've learned along the way.

case studies

Mapping agricultural trends in Yemen during crisis

DrivenData partnered with The World Bank to use machine learning and remote sensing data to track agricultural changes across Yemen from 2019-2024, providing critical insights for food security planning in a conflict-affected region.

case studies

Making higher education data more accessible

DrivenData partnered with Science for America to develop scipeds, an open source Python library and interactive data visualization platform designed to simplify the analysis of U.S. higher education data from IPEDS and to illuminate trends and disparities in STEM education.

Work with us to build a better world

Learn more about how our team is bringing the transformative power of data science and AI to organizations tackling the world's biggest challenges.