
America's Next Top (Statistical) Model - Benchmark


by Peter Bull

We're excited to launch a new competition. If you're not sure what we're talking about, head to: America's Next Top (Statistical) Model.

US presidential elections come but once every 4 years, and this one's a big one. The new president will help shape policies on education, healthcare, energy, the environment, international relations, aid, and more. There are lots of people trying to predict what will happen. Can you top them?

In this challenge, you'll predict the percent of each state that will vote for each candidate. You can use any data that's available to the public. Come election night, we'll see whose model had the best vision for the country!

In [1]:
%matplotlib inline
import seaborn as sns

# no warnings in our blog post, plz
import warnings
warnings.filterwarnings('ignore')

What do the pollsters say?

First things first, we need some data. The bread and butter of election forecasting is polling data, though some believe that is changing rapidly. The Huffington Post makes available an excellent API for getting data from election polls.
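
If you'd rather pull the polls yourself, a minimal sketch along the lines below should get you started with the Pollster API. Note that the endpoint URL, query parameters, and response fields here are assumptions to check against the API docs, not code from our benchmark:

import pandas as pd
import requests

# NOTE: the endpoint, parameters, and "items" field below are assumptions --
# check the Pollster API documentation for the real URL and response format.
POLLSTER_URL = "https://elections.huffingtonpost.com/pollster/api/v2/polls"

def fetch_state_polls(state="WI", pages=1):
    """ Download raw poll records for one state and return them as a DataFrame. """
    records = []
    for page in range(1, pages + 1):
        resp = requests.get(POLLSTER_URL, params={"state": state, "page": page})
        resp.raise_for_status()
        records.extend(resp.json().get("items", []))
    return pd.DataFrame(records)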

We've gone ahead and collected polling data by state for 2012, and we can read in that CSV using pandas.

In [2]:
import pandas as pd

polls2012 = pd.read_csv("all-polls-2012.csv", index_col=0)
polls2012.head()
Out[2]:
obama date johnson method moe observations other pollster state stein romney
0 53.0 2012-11-03 NaN Internet NaN 454 1.0 Angus-Reid WI NaN 46.0
1 49.0 2012-10-30 NaN Automated Phone 3.0 1000 2.0 Pulse Opinion Research/Let Freedom Ring (R) WI NaN 48.0
2 50.0 2012-11-03 NaN Internet 3.1 1225 1.0 YouGov WI NaN 46.0
3 51.0 2012-11-03 NaN Automated Phone 2.8 1256 1.0 PPP (D) WI NaN 48.0
4 48.0 2012-11-02 NaN Live Phone 4.4 500 8.0 Grove Insight (D-Project New America/USAction) WI NaN 42.0

We can see that the polls differ in date, methodology, number of observations, margin of error, and pollster. We won't dig into the method, margin of error, or number of observations in our first-pass model, but you can see how these could be helpful by looking at other forecasts, for example the 538 model.
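
To give a flavor of how those columns could be used, here's a small sketch (not part of the benchmark model) that collapses one state's polls into a single estimate, down-weighting older polls and up-weighting larger samples. The half-life and square-root weighting are arbitrary illustrative choices:

import numpy as np
import pandas as pd

def weighted_poll_average(state_polls, candidate='obama', half_life_days=14):
    """ Combine one state's polls into a single estimate for a candidate,
        weighting by recency (exponential decay) and by sample size.
    """
    polls = state_polls.dropna(subset=[candidate]).copy()

    # days between each poll and the most recent poll in this state
    dates = pd.to_datetime(polls['date'])
    age_days = (dates.max() - dates).dt.days

    # older polls decay exponentially; larger samples get more weight
    recency_weight = 0.5 ** (age_days / half_life_days)
    size_weight = np.sqrt(polls['observations'].fillna(polls['observations'].median()))

    return np.average(polls[candidate], weights=recency_weight * size_weight)

# for example: weighted_poll_average(polls2012[polls2012.state == 'WI'], 'obama')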

We'll also need actual election results to train and evaluate our model--in this case, we're using the results of the most recent presidential election, 2012. You can imagine that including more recent elections (e.g., congressional or gubernatorial races) may help improve the forecast that we create for the 2016 presidential election.

In [3]:
results2012 = pd.read_csv("data/final/private/2012-actual-returns.csv", index_col=0)
results2012.head()
Out[3]:
Obama Romney Stein Johnson
STATE ABBREVIATION
AK 0.408127 0.548016 0.009707 0.024599
AL 0.383590 0.605458 0.001638 0.005943
AR 0.368790 0.605669 0.008701 0.015219
AZ 0.445898 0.536545 0.003399 0.013961
CA 0.602390 0.371204 0.006568 0.010984

Turn polls into features

In order to predict the vote percentage for a candidate in each state, we need state-level features. For each state, we have a different number of polls, conducted with different methodologies and varying recency (polls are conducted less often in states that aren't up for grabs). Our first decision is how many polls to use. Given that we expect voters to change their minds over time, we'll just work with up to 5 of the most recent polls. With that in mind, for each state, for each of the 5 most recent polls, we'll create the following features:

  • Number of days to the election: This helps the model understand how recent the poll is (and therefore how useful its measurements are likely to be).
  • Margin of error: If the poll has a margin of error, we'll want to incorporate that as a feature in the model.
  • Democrat percentage: The percentage of respondents who say they will vote for the Democratic candidate (for 2012, this is Obama; for 2016, this is Clinton).
  • Republican percentage: The percentage of respondents who say they will vote for the Republican candidate (for 2012, this is Romney; for 2016, this is Trump).
  • Stein percentage: The percentage of respondents who say they will vote for Jill Stein, the Green Party candidate in 2012 and 2016. Not all polls include third-party candidates.
  • Johnson percentage: The percentage of respondents who say they will vote for Gary Johnson, the Libertarian Party candidate in 2012 and 2016. Not all polls include third-party candidates.
In [4]:
from datetime import datetime

def build_features_from_polls(states, all_polls, is2012=True, n_polls=5):
    """ Builds a dataframe where each row is a state, and each column is a
        property of one of the last 5 (n_polls) polls in that state.
    """
    all_states_rows = []

    for st in states:
        st_polls = all_polls[all_polls.state == st]
        st_polls = st_polls.sort_values('date', ascending=False)
        
        row = {}
        limit = min(st_polls.shape[0], n_polls)

        for i in range(limit):
            this_poll = st_polls.iloc[i]

            # calculate the number of days until the election
            election_day = datetime(2012, 11, 6) if is2012 else datetime(2016, 11, 8)
            days_to_election = (election_day - pd.to_datetime(this_poll.date)).days
            
            # get the dem and rep candidates:
            dem_pct = this_poll.obama if is2012 else this_poll.clinton
            rep_pct = this_poll.romney if is2012 else this_poll.trump

            poll_data = {'poll_{}_days_to_election'.format(i): days_to_election,
                         'poll_{}_moe'.format(i): this_poll.moe,
                         'poll_{}_democrat'.format(i): dem_pct,
                         'poll_{}_republican'.format(i): rep_pct,
                         'poll_{}_johnson'.format(i): this_poll.johnson,
                         'poll_{}_stein'.format(i): this_poll.stein}

            row.update(poll_data)

        all_states_rows.append(row)
        
    features = pd.DataFrame(all_states_rows, index=states)
    
    # for unavailable data, generally fill in the mean for that column
    features.fillna(features.mean(axis=0), inplace=True)
    
    # if there is no data in a column (all nans), fill in 0
    features.fillna(0, inplace=True)
    
    return features
In [5]:
features2012 = build_features_from_polls(results2012.index, polls2012)
features2012.head()
Out[5]:
poll_0_days_to_election poll_0_democrat poll_0_johnson poll_0_moe poll_0_republican poll_0_stein poll_1_days_to_election poll_1_democrat poll_1_johnson poll_1_moe ... poll_3_johnson poll_3_moe poll_3_republican poll_3_stein poll_4_days_to_election poll_4_democrat poll_4_johnson poll_4_moe poll_4_republican poll_4_stein
STATE ABBREVIATION
AK 33.326087 47.456522 2.0 3.621667 46.369565 1.0 50.44186 47.023256 6.0 3.566389 ... 4.5 4.07375 46.371429 0.0 58.764706 47.411765 1.666667 3.823448 45.088235 1.333333
AL 132.000000 36.000000 2.0 4.200000 51.000000 1.0 50.44186 47.023256 6.0 3.566389 ... 4.5 4.07375 46.371429 0.0 58.764706 47.411765 1.666667 3.823448 45.088235 1.333333
AR 23.000000 31.000000 2.0 4.000000 58.000000 1.0 50.00000 35.000000 6.0 2.000000 ... 4.5 4.07375 46.371429 0.0 58.764706 47.411765 1.666667 3.823448 45.088235 1.333333
AZ 3.000000 44.000000 2.0 4.100000 52.000000 1.0 3.00000 46.000000 6.0 3.000000 ... 4.5 5.40000 52.000000 0.0 27.000000 42.000000 3.000000 4.400000 40.000000 2.000000
CA 3.000000 55.000000 2.0 3.500000 40.000000 1.0 7.00000 54.000000 6.0 2.600000 ... 4.5 4.00000 41.000000 0.0 16.000000 55.000000 1.666667 3.823448 39.000000 1.333333

5 rows × 30 columns

Time to do some forecasting!


Now it's time to make some predictions. We'll start off with a straightforward model: ordinary least squares regression. OLS can be written as:

$$ y_{i,c} = x_i ^T \beta_c + \varepsilon_i $$

In our model, we can think of $i$ as a state. $y_{i,c}$ is the percent of the vote that candidate $c$ received in state $i$ in the election results. $x_i$ is the vector of poll features for state $i$.

$\beta_c$ is the vector of coefficients that correspond to each property of the polls for candidate $c$. For example, we would expect $\beta_\textrm{stein}$ to put a lot of weight on the poll_stein variables, while those same variables aren't very important for predicting the success of other candidates.
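
For completeness, fitting OLS just means choosing each $\beta_c$ to minimize the squared prediction error across states, which has the familiar closed-form solution:

$$ \hat{\beta}_c = \arg\min_{\beta_c} \sum_i \left( y_{i,c} - x_i^T \beta_c \right)^2 = (X^T X)^{-1} X^T y_c $$

where $X$ is the matrix whose rows are the $x_i$ and $y_c$ is the vector of vote shares for candidate $c$.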

We'll also fit this model using GridSearchCV from sklearn, which will compare the cross-validated scores for different hyperparameter values of the model and choose the best one. For the LinearRegression model that we're starting with, we only look at one possible hyperparameter: whether or not we include an intercept term in our model.

In [6]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

from sklearn.pipeline import make_pipeline

ss = MinMaxScaler()
gscv = GridSearchCV(LinearRegression(),
                    dict(fit_intercept=[True, False], ),
                    scoring='neg_mean_squared_error')

clf = make_pipeline(ss, gscv)

clf.fit(features2012, results2012)
Out[6]:
Pipeline(steps=[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('gridsearchcv', GridSearchCV(cv=None, error_score='raise',
       estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'fit_intercept': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_squared_error', verbose=0))])

We've trained a linear regression model that scales the features we use to be between 0 and 1. Since all of the variables are on the same scale, we can plot the coefficients to get a sense for their relative effect on the prediction. Unsurprisingly, the results from the most recent poll are the most important features!

In [7]:
# get the model coefficients
coeffs = clf.steps[1][1].best_estimator_.coef_

# plot the coefficients
(pd.DataFrame(coeffs, index=results2012.columns, columns=features2012.columns)
   .T
   .plot
   .barh(figsize=(5, 15),
         linewidth=0,
         width=1.0))
Out[7]:
[Horizontal bar chart of the model coefficients for each candidate]

Listen to the voice of the people

Now that we have the model, we can generate predictions for 2012 and evaluate how good the fit is. To do that, we'll predict the percentage of the vote that each candidate will receive in each state.

In [8]:
# predict vote percentages with our trained model
preds = clf.predict(features2012)

# get submission format
submission2012 = pd.read_csv("2012-submission-format.csv", index_col=0)
preds2012 = submission2012.copy()

# fill in our predicted values and write to csv
preds2012.iloc[:, :] = preds
preds2012.to_csv("linear-model.csv")

preds2012.head()
Out[8]:
Obama Romney Stein Johnson
STATE ABBREVIATION
AK 0.490175 0.489861 0.003615 0.011719
AL 0.357734 0.602813 0.002282 0.014384
AR 0.334030 0.628213 0.004050 0.007238
AZ 0.435375 0.554215 0.002013 0.013893
CA 0.593085 0.374303 0.006027 0.008718

If we submit this file on DrivenData, we can see our score against the 2012 election results.
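
Since we also have the actual 2012 returns on hand, we can compute an error metric locally as a quick sanity check before submitting. Here we use root mean squared error over every (state, candidate) cell; this is just a local approximation and may not match the competition's official metric exactly:

import numpy as np

# RMSE between our predicted vote shares and the actual 2012 returns
errors = (preds2012 - results2012[preds2012.columns]).values
print("Local 2012 RMSE: {:.4f}".format(np.sqrt((errors ** 2).mean())))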

Feel the will of the people

Finally, it's time to see what we think about Trump v. Clinton. We've now got a model to predict presidential election outcomes based on poll results. We can use this model to predict the current presidential election.

In [9]:
# raw poll data for 2016
polls2016 = pd.read_csv("all-polls-2016.csv", index_col=0)

# submission format for 2016 election
submission2016 = pd.read_csv("data/final/public/2016-submission-format.csv", index_col=0)

# create our features
features2016 = build_features_from_polls(submission2016.index,
                                         polls2016,
                                         is2012=False)

# make predictions (ensuring that the order of the 2016 columns matches the 2012 columns)
preds = clf.predict(features2016[features2012.columns])

# fill the predictions into the submission format
preds2016 = pd.DataFrame(preds,
                         index=submission2016.index,
                         columns=submission2016.columns)

preds2016.head()
Out[9]:
Clinton Trump Stein Johnson
STATE ABBREVIATION
AK 0.189042 0.440576 -0.032379 0.027787
AL 0.254227 0.512741 -0.023724 0.028735
AR 0.240154 0.504451 -0.023678 0.029431
AZ 0.343192 0.441799 -0.019653 0.025112
CA 0.606404 0.245351 -0.017409 -0.001059

We can see we now have predictions for every major candidate, for every state. We can submit this to DrivenData and mark it as our "Evaluation Submission" to indicate that these are our predictions for 2016.

But, one important question remains: who will win the election? That, of course, is subject to the rules of the Electoral College. The winner needs at least 270 electoral votes to become president. We can calculate that using data about how many electoral votes each state gets:

In [10]:
import numpy as np

electoral_data = pd.read_csv("2012-electoral-college.csv")
electoral_data.sort_values(by='State', inplace=True)
preds2016.sort_index(inplace=True)

preds2016['Dems'] = np.where(preds2016.Clinton > preds2016.Trump,
                                          electoral_data.Electors,
                                          0)

preds2016['Reps'] = np.where(preds2016.Clinton < preds2016.Trump,
                                            electoral_data.Electors,
                                            0)

print("===== PREDICTED ELECTORAL VOTES FOR EACH PARTY =======")
print(preds2016[['Dems', 'Reps']].sum())
===== PREDICTED ELECTORAL VOTES FOR EACH PARTY =======
Dems    331
Reps    207
dtype: int64

That's not too bad!

We can see that our model predicts a Democratic victory given that 270 electoral votes are needed to win the election.

If we look at other election forecasts like 538, the NY Times, and Fox News, we can see that our model is not wildly out of line with current forecasts.

However, we can almost certainly do better! It's now up to you to MAKE THIS MODEL GREAT AGAIN...

For ideas on how to make the model even better, check out our election resources page on the competition website.
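
As one concrete example, you could swap the plain linear regression for a regularized model and let the grid search choose the regularization strength. Here's a sketch using ridge regression in the same pipeline; the alpha grid is an arbitrary starting point:

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# same pipeline as before, but with an L2 penalty whose strength is cross-validated
ridge_gscv = GridSearchCV(Ridge(),
                          dict(alpha=[0.01, 0.1, 1.0, 10.0]),
                          scoring='neg_mean_squared_error')

ridge_clf = make_pipeline(MinMaxScaler(), ridge_gscv)
ridge_clf.fit(features2012, results2012)

# then predict 2016 exactly as before:
# ridge_preds = ridge_clf.predict(features2016[features2012.columns])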

Prediction Map

For fun, here's the familiar electoral map to see what our predictions look like. Looking at other prediction maps, we see that it might be worth gathering more data and digging into the modeling decisions for Iowa, Virginia, and DC, which we have going Republican, unlike most other models.

In [11]:
import json
import folium

# make percentages 0 - 100
mapdata = (preds2016 * 100).astype(float)

# state json has full names, so we need those in our data
mapdata.rename(index={'DC': 'District of Columbia'}, inplace=True)
mapdata.rename(index=electoral_data.set_index('State').Name.to_dict(), inplace=True)

# missing regions
mapdata.loc['Puerto Rico'] =  [0., 0., 0., 0., 0., 0.]

# clip to a minimum and maximum that make sense for %
mapdata.loc[:,:] = np.clip(mapdata.values, 0.0, 100.0)

# add a pctg for the winner
mapdata['winner'] = np.where(mapdata.Clinton > mapdata.Trump,
                             mapdata.Clinton,
                             mapdata.Trump * -1)

# create map centered on US
election_map = folium.Map(location=[ 39.833, -98.583],
                          tiles="Mapbox Bright",
                          zoom_start=4)

# fill in the colors for who wins the state
election_map.choropleth(geo_path="states.json",
                        fill_opacity=0.8,
                        data=mapdata.reset_index(),
                        columns=['STATE ABBREVIATION', 'winner'],
                        fill_color='RdBu',
                        key_on='feature.properties.NAME')

election_map
Out[11]:
[Choropleth map of the US colored by the predicted winner in each state]