
America's Next Top (Statistical) Model - Benchmark


by Peter Bull

We're excited to launch a new competition. If you're not sure what we're talking about, head to: America's Next Top (Statistical) Model.

US presidential elections come but once every 4 years, and this one's a big one. The new president will help shape policies on education, healthcare, energy, the environment, international relations, aid, and more. There are lots of people trying to predict what will happen. Can you top them?

In this challenge, you'll predict the percent of each state that will vote for each candidate. You can use any data that's available to the public. Come election night, we'll see whose model had the best vision for the country!

In [1]:
%matplotlib inline
import seaborn as sns

# no warnings in our blog post, plz
import warnings
warnings.filterwarnings('ignore')

What do the pollsters say?

First things first, we need some data. The bread and butter of election forecasting is polling data, though some believe that is changing rapidly. The Huffington Post makes available an excellent API for getting data from election polls.
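
If you'd rather pull the polls yourself, a minimal sketch along the lines below should get you started with the Pollster API. Note that the endpoint URL, query parameters, and response fields here are assumptions to check against the API docs, not code from our benchmark:

import pandas as pd
import requests

# NOTE: the endpoint, parameters, and "items" field below are assumptions --
# check the Pollster API documentation for the real URL and response format.
POLLSTER_URL = "https://elections.huffingtonpost.com/pollster/api/v2/polls"

def fetch_state_polls(state="WI", pages=1):
    """ Download raw poll records for one state and return them as a DataFrame. """
    records = []
    for page in range(1, pages + 1):
        resp = requests.get(POLLSTER_URL, params={"state": state, "page": page})
        resp.raise_for_status()
        records.extend(resp.json().get("items", []))
    return pd.DataFrame(records)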

We've gone ahead and collected polling data by state for 2012, and we can read in that CSV using pandas.

In [2]:
import pandas as pd

polls2012 = pd.read_csv("all-polls-2012.csv", index_col=0)
polls2012.head()
Out[2]:
obama date johnson method moe observations other pollster state stein romney
0 53.0 2012-11-03 NaN Internet NaN 454 1.0 Angus-Reid WI NaN 46.0
1 49.0 2012-10-30 NaN Automated Phone 3.0 1000 2.0 Pulse Opinion Research/Let Freedom Ring (R) WI NaN 48.0
2 50.0 2012-11-03 NaN Internet 3.1 1225 1.0 YouGov WI NaN 46.0
3 51.0 2012-11-03 NaN Automated Phone 2.8 1256 1.0 PPP (D) WI NaN 48.0
4 48.0 2012-11-02 NaN Live Phone 4.4 500 8.0 Grove Insight (D-Project New America/USAction) WI NaN 42.0

We can see that the polls differ in date, methodology, number of observations, margin of error, and pollster. We won't dig into the method, margin of error, or number of observations in our first-pass model, but you can see how these could be helpful by looking at other forecasts, for example the 538 model.
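
To give a flavor of how those columns could be used, here's a small sketch (not part of the benchmark model) that collapses one state's polls into a single estimate, down-weighting older polls and up-weighting larger samples. The half-life and square-root weighting are arbitrary illustrative choices:

import numpy as np
import pandas as pd

def weighted_poll_average(state_polls, candidate='obama', half_life_days=14):
    """ Combine one state's polls into a single estimate for a candidate,
        weighting by recency (exponential decay) and by sample size.
    """
    polls = state_polls.dropna(subset=[candidate]).copy()

    # days between each poll and the most recent poll in this state
    dates = pd.to_datetime(polls['date'])
    age_days = (dates.max() - dates).dt.days

    # older polls decay exponentially; larger samples get more weight
    recency_weight = 0.5 ** (age_days / half_life_days)
    size_weight = np.sqrt(polls['observations'].fillna(polls['observations'].median()))

    return np.average(polls[candidate], weights=recency_weight * size_weight)

# for example: weighted_poll_average(polls2012[polls2012.state == 'WI'], 'obama')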

We'll also need actual election results to train and evaluate our model--in this case, we're using the results of the most recent presidential election, 2012. You can imagine that including more recent elections (e.g., congressional or gubernatorial races) may help improve the forecast that we create for the 2016 presidential election.

In [3]:
results2012 = pd.read_csv("data/final/private/2012-actual-returns.csv", index_col=0)
results2012.head()
Out[3]:
Obama Romney Stein Johnson
STATE ABBREVIATION
AK 0.408127 0.548016 0.009707 0.024599
AL 0.383590 0.605458 0.001638 0.005943
AR 0.368790 0.605669 0.008701 0.015219
AZ 0.445898 0.536545 0.003399 0.013961
CA 0.602390 0.371204 0.006568 0.010984

Turn polls into features

In order to predict the vote percentage for a candidate in each state, we need state-level features. For each state, we have a different number of polls, conducted with different methodologies and varying recency (polls are conducted less often in states that aren't up for grabs). Our first decision is how many polls to use. Given that we expect voters to change their minds over time, we'll just work with up to 5 of the most recent polls. With that in mind, for each state, for each of the 5 most recent polls, we'll create the following features:

  • Number of days to the election: This helps the model understand how recent the poll is (and therefore how useful its measurements are likely to be).
  • Margin of error: If the poll has a margin of error, we'll want to incorporate that as a feature in the model.
  • Democrat percentage: The percentage of respondents who say they will vote for the Democratic candidate (for 2012, this is Obama; for 2016, this is Clinton).
  • Republican percentage: The percentage of respondents who say they will vote for the Republican candidate (for 2012, this is Romney; for 2016, this is Trump).
  • Stein percentage: The percentage of respondents who say they will vote for Jill Stein, the Green Party candidate in 2012 and 2016. Not all polls include third-party candidates.
  • Johnson percentage: The percentage of respondents who say they will vote for Gary Johnson, the Libertarian Party candidate in 2012 and 2016. Not all polls include third-party candidates.
In [4]:
from datetime import datetime

def build_features_from_polls(states, all_polls, is2012=True, n_polls=5):
    """ Builds a dataframe where each row is a state, and each column is a
        property of one of the last 5 (n_polls) polls in that state.
    """
    all_states_rows = []

    for st in states:
        st_polls = all_polls[all_polls.state == st]
        st_polls = st_polls.sort_values('date', ascending=False)
        
        row = {}
        limit = min(st_polls.shape[0], n_polls)

        for i in range(limit):
            this_poll = st_polls.iloc[i]

            # calculate the number of days until the election
            election_day = datetime(2012, 11, 6) if is2012 else datetime(2016, 11, 8)
            days_to_election = (election_day - pd.to_datetime(this_poll.date)).days
            
            # get the dem and rep candidates:
            dem_pct = this_poll.obama if is2012 else this_poll.clinton
            rep_pct = this_poll.romney if is2012 else this_poll.trump

            poll_data = {'poll_{}_days_to_election'.format(i): days_to_election,
                         'poll_{}_moe'.format(i): this_poll.moe,
                         'poll_{}_democrat'.format(i): dem_pct,
                         'poll_{}_republican'.format(i): rep_pct,
                         'poll_{}_johnson'.format(i): this_poll.johnson,
                         'poll_{}_stein'.format(i): this_poll.stein}

            row.update(poll_data)

        all_states_rows.append(row)
        
    features = pd.DataFrame(all_states_rows, index=states)
    
    # for unavailable data, generally fill in the mean for that column
    features.fillna(features.mean(axis=0), inplace=True)
    
    # if there is no data in a column (all nans), fill in 0
    features.fillna(0, inplace=True)
    
    return features
In [5]:
features2012 = build_features_from_polls(results2012.index, polls2012)
features2012.head()
Out[5]:
poll_0_days_to_election poll_0_democrat poll_0_johnson poll_0_moe poll_0_republican poll_0_stein poll_1_days_to_election poll_1_democrat poll_1_johnson poll_1_moe ... poll_3_johnson poll_3_moe poll_3_republican poll_3_stein poll_4_days_to_election poll_4_democrat poll_4_johnson poll_4_moe poll_4_republican poll_4_stein
STATE ABBREVIATION
AK 33.326087 47.456522 2.0 3.621667 46.369565 1.0 50.44186 47.023256 6.0 3.566389 ... 4.5 4.07375 46.371429 0.0 58.764706 47.411765 1.666667 3.823448 45.088235 1.333333
AL 132.000000 36.000000 2.0 4.200000 51.000000 1.0 50.44186 47.023256 6.0 3.566389 ... 4.5 4.07375 46.371429 0.0 58.764706 47.411765 1.666667 3.823448 45.088235 1.333333
AR 23.000000 31.000000 2.0 4.000000 58.000000 1.0 50.00000 35.000000 6.0 2.000000 ... 4.5 4.07375 46.371429 0.0 58.764706 47.411765 1.666667 3.823448 45.088235 1.333333
AZ 3.000000 44.000000 2.0 4.100000 52.000000 1.0 3.00000 46.000000 6.0 3.000000 ... 4.5 5.40000 52.000000 0.0 27.000000 42.000000 3.000000 4.400000 40.000000 2.000000
CA 3.000000 55.000000 2.0 3.500000 40.000000 1.0 7.00000 54.000000 6.0 2.600000 ... 4.5 4.00000 41.000000 0.0 16.000000 55.000000 1.666667 3.823448 39.000000 1.333333

5 rows × 30 columns

Time to do some forecasting!


Now it's time to make some predictions. We'll start off with a straightforward model: ordinary least squares regression. OLS can be written as:

$$ y_{i,c} = x_i ^T \beta_c + \varepsilon_i $$

In our model, we can think of $i$ as a state. $y_{i,c}$ is the percent of the vote that candidate $c$ received in state $i$ in the election results. $x_i$ is the vector of poll features for state $i$.

$\beta_c$ is the vector of coefficients that correspond to each property of the polls for candidate $c$. For example, we would expect $\beta_\textrm{stein}$ to put a lot of weight on the poll_stein variables, while those same variables aren't very important for predicting the success of other candidates.
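
For completeness, fitting OLS just means choosing each $\beta_c$ to minimize the squared prediction error across states, which has the familiar closed-form solution:

$$ \hat{\beta}_c = \arg\min_{\beta_c} \sum_i \left( y_{i,c} - x_i^T \beta_c \right)^2 = (X^T X)^{-1} X^T y_c $$

where $X$ is the matrix whose rows are the $x_i$ and $y_c$ is the vector of vote shares for candidate $c$.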

We'll also fit this model using GridSearchCV from sklearn, which will compare the cross-validated scores for different hyperparameter values of the model and choose the best one. For the LinearRegression model that we're starting with, we only look at one possible hyperparameter: whether or not we include an intercept term in our model.

In [6]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

from sklearn.pipeline import make_pipeline

ss = MinMaxScaler()
gscv = GridSearchCV(LinearRegression(),
                    dict(fit_intercept=[True, False], ),
                    scoring='neg_mean_squared_error')

clf = make_pipeline(ss, gscv)

clf.fit(features2012, results2012)
Out[6]:
Pipeline(steps=[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('gridsearchcv', GridSearchCV(cv=None, error_score='raise',
       estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'fit_intercept': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_squared_error', verbose=0))])

We've trained a linear regression model that scales the features we use to be between 0 and 1. Since all of the variables are on the same scale, we can plot the coefficients to get a sense for their relative effect on the prediction. Unsurprisingly, the results from the most recent poll are the most important features!

In [7]:
# get the model coefficients
coeffs = clf.steps[1][1].best_estimator_.coef_

# plot the coefficients
(pd.DataFrame(coeffs, index=results2012.columns, columns=features2012.columns)
   .T
   .plot
   .barh(figsize=(5, 15),
         linewidth=0,
         width=1.0))
Out[7]:
[Horizontal bar chart of the model coefficients for each candidate]

Listen to the voice of the people

Now that we have the model, we can generate predictions for 2012 and evaluate how good the fit is. To do that, we'll predict the percentage of the vote that each candidate will receive in each state.

In [8]:
# predict vote percentages with our trained model
preds = clf.predict(features2012)

# get submission format
submission2012 = pd.read_csv("2012-submission-format.csv", index_col=0)
preds2012 = submission2012.copy()

# fill in our predicted values and write to csv
preds2012.iloc[:, :] = preds
preds2012.to_csv("linear-model.csv")

preds2012.head()
Out[8]:
Obama Romney Stein Johnson
STATE ABBREVIATION
AK 0.490175 0.489861 0.003615 0.011719
AL 0.357734 0.602813 0.002282 0.014384
AR 0.334030 0.628213 0.004050 0.007238
AZ 0.435375 0.554215 0.002013 0.013893
CA 0.593085 0.374303 0.006027 0.008718

If we submit this file on DrivenData, we can see our score against the 2012 election results.
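
Since we also have the actual 2012 returns on hand, we can compute an error metric locally as a quick sanity check before submitting. Here we use root mean squared error over every (state, candidate) cell; this is just a local approximation and may not match the competition's official metric exactly:

import numpy as np

# RMSE between our predicted vote shares and the actual 2012 returns
errors = (preds2012 - results2012[preds2012.columns]).values
print("Local 2012 RMSE: {:.4f}".format(np.sqrt((errors ** 2).mean())))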

Feel the will of the people

Finally, it's time to see what we think about Trump v. Clinton. We've now got a model to predict presidential election outcomes based on poll results. We can use this model to predict the current presidential election.

In [9]:
# raw poll data for 2016
polls2016 = pd.read_csv("all-polls-2016.csv", index_col=0)

# submission format for 2016 election
submission2016 = pd.read_csv("data/final/public/2016-submission-format.csv", index_col=0)

# create our features
features2016 = build_features_from_polls(submission2016.index,
                                         polls2016,
                                         is2012=False)

# make predictions (ensuring that the order of the 2016 columns matches the 2012 columns)
preds = clf.predict(features2016[features2012.columns])

# fill the predictions into the submission format
preds2016 = pd.DataFrame(preds,
                         index=submission2016.index,
                         columns=submission2016.columns)

preds2016.head()
Out[9]:
Clinton Trump Stein Johnson
STATE ABBREVIATION
AK 0.189042 0.440576 -0.032379 0.027787
AL 0.254227 0.512741 -0.023724 0.028735
AR 0.240154 0.504451 -0.023678 0.029431
AZ 0.343192 0.441799 -0.019653 0.025112
CA 0.606404 0.245351 -0.017409 -0.001059

We can see we now have predictions for every major candidate, for every state. We can submit this to DrivenData and mark it as our "Evaluation Submission" to indicate that these are our predictions for 2016.

But, one important question remains: who will win the election? That, of course, is subject to the rules of the Electoral College. The winner needs at least 270 electoral votes to become president. We can calculate that using data about how many electoral votes each state gets:

In [10]:
import numpy as np

electoral_data = pd.read_csv("2012-electoral-college.csv")
electoral_data.sort_values(by='State', inplace=True)
preds2016.sort_index(inplace=True)

preds2016['Dems'] = np.where(preds2016.Clinton > preds2016.Trump,
                                          electoral_data.Electors,
                                          0)

preds2016['Reps'] = np.where(preds2016.Clinton < preds2016.Trump,
                                            electoral_data.Electors,
                                            0)

print("===== PREDICTED ELECTORAL VOTES FOR EACH PARTY =======")
print(preds2016[['Dems', 'Reps']].sum())
===== PREDICTED ELECTORAL VOTES FOR EACH PARTY =======
Dems    331
Reps    207
dtype: int64

That's not too bad!

We can see that our model predicts a Democratic victory given that 270 electoral votes are needed to win the election.

If we look at other election forecasts like 538, the NY Times, and Fox News, we can see that our model is not wildly out of line with current forecasts.

However, we can almost certainly do better! It's now up to you to MAKE THIS MODEL GREAT AGAIN...

For ideas on how to make the model even better, check out our election resources page on the competition website.
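
As one concrete example, you could swap the plain linear regression for a regularized model and let the grid search choose the regularization strength. Here's a sketch using ridge regression in the same pipeline; the alpha grid is an arbitrary starting point:

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# same pipeline as before, but with an L2 penalty whose strength is cross-validated
ridge_gscv = GridSearchCV(Ridge(),
                          dict(alpha=[0.01, 0.1, 1.0, 10.0]),
                          scoring='neg_mean_squared_error')

ridge_clf = make_pipeline(MinMaxScaler(), ridge_gscv)
ridge_clf.fit(features2012, results2012)

# then predict 2016 exactly as before:
# ridge_preds = ridge_clf.predict(features2016[features2012.columns])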

Prediction Map

For fun, here's the familiar electoral map to see what our predictions look like. Looking at other prediction maps, we see that it might be worth gathering more data and digging into the modeling decisions for Iowa, Virginia, and DC, which we have going Republican, unlike most other models.

In [11]:
import json
import folium

# make percentages 0 - 100
mapdata = (preds2016 * 100).astype(float)

# state json has full names, so we need those in our data
mapdata.rename(index={'DC': 'District of Columbia'}, inplace=True)
mapdata.rename(index=electoral_data.set_index('State').Name.to_dict(), inplace=True)

# missing regions
mapdata.loc['Puerto Rico'] =  [0., 0., 0., 0., 0., 0.]

# clip to a minimum and maximum that make sense for %
mapdata.loc[:,:] = np.clip(mapdata.values, 0.0, 100.0)

# add a pctg for the winner
mapdata['winner'] = np.where(mapdata.Clinton > mapdata.Trump,
                             mapdata.Clinton,
                             mapdata.Trump * -1)

# create map centered on US
election_map = folium.Map(location=[ 39.833, -98.583],
                          tiles="Mapbox Bright",
                          zoom_start=4)

# fill in the colors for who wins the state
election_map.choropleth(geo_path="states.json",
                        fill_opacity=0.8,
                        data=mapdata.reset_index(),
                        columns=['STATE ABBREVIATION', 'winner'],
                        fill_color='RdBu',
                        key_on='feature.properties.NAME')

election_map
Out[11]:
[Choropleth map of the US colored by the predicted winner in each state]