blog

America's Next Top (Statistical) Model 2020 - Benchmark


by Casey Fitzpatrick

America's Next Top (Statistical) Model - 2020

US presidential elections come but once every 4 years, and this one's a big one. The new president will help shape policies on the pandemic response, healthcare, the environment, the economy, and more. There are lots of people trying to predict what will happen. Can you top them?

In our newest competition, you are asked to predict the fraction of each state that will vote for each major candidate. You can use any data that is freely available to the public. Come election night (or election week... or election month), we'll see who's model had the most accurate vision for the country!

This competition allows the use of any publically available data, so in this post we'll walk through downloading some polling data, cleaning it up a bit, and using it to generate predictions for the 2016 and 2020 elections.

In [1]:
%matplotlib inline

from pathlib import Path
import re

import numpy as np
import pandas as pd
import us

pd.set_option("display.max_columns", 50)
pd.set_option("display.max_rows", 50)

What do the pollsters say?

First things first, we need some data. The bread and butter of election forecasting is polling data, though some believe that is changing rapidly. The forecasting website FiveThirtyEight is a great place to get polling data. Thanks to them, we can load the thousands of polls used in their 2016 and 2020 forecasts directly into pandas in just couple of lines. Scroll around to check out the columns!

In [2]:
polls_2016_url = "http://projects.fivethirtyeight.com/general-model/president_general_polls_2016.csv"
polls_2020_url = "https://projects.fivethirtyeight.com/polls-page/president_polls.csv"

polls_2016 = pd.read_csv(polls_2016_url)
polls_2020 = pd.read_csv(polls_2020_url)

polls_2016.head()
Out[2]:
cycle branch type matchup forecastdate state startdate enddate pollster grade samplesize population poll_wt rawpoll_clinton rawpoll_trump rawpoll_johnson rawpoll_mcmullin adjpoll_clinton adjpoll_trump adjpoll_johnson adjpoll_mcmullin multiversions url poll_id question_id createddate timestamp
0 2016 President polls-plus Clinton vs. Trump vs. Johnson 11/8/16 U.S. 11/3/2016 11/6/2016 ABC News/Washington Post A+ 2220.0 lv 8.720654 47.00 43.00 4.00 NaN 45.20163 41.72430 4.626221 NaN NaN https://www.washingtonpost.com/news/the-fix/wp... 48630 76192 11/7/16 09:35:33 8 Nov 2016
1 2016 President polls-plus Clinton vs. Trump vs. Johnson 11/8/16 U.S. 11/1/2016 11/7/2016 Google Consumer Surveys B 26574.0 lv 7.628472 38.03 35.69 5.46 NaN 43.34557 41.21439 5.175792 NaN NaN https://datastudio.google.com/u/0/#/org//repor... 48847 76443 11/7/16 09:35:33 8 Nov 2016
2 2016 President polls-plus Clinton vs. Trump vs. Johnson 11/8/16 U.S. 11/2/2016 11/6/2016 Ipsos A- 2195.0 lv 6.424334 42.00 39.00 6.00 NaN 42.02638 38.81620 6.844734 NaN NaN http://projects.fivethirtyeight.com/polls/2016... 48922 76636 11/8/16 09:35:33 8 Nov 2016
3 2016 President polls-plus Clinton vs. Trump vs. Johnson 11/8/16 U.S. 11/4/2016 11/7/2016 YouGov B 3677.0 lv 6.087135 45.00 41.00 5.00 NaN 45.65676 40.92004 6.069454 NaN NaN https://d25d2506sfb94s.cloudfront.net/cumulus_... 48687 76262 11/7/16 09:35:33 8 Nov 2016
4 2016 President polls-plus Clinton vs. Trump vs. Johnson 11/8/16 U.S. 11/3/2016 11/6/2016 Gravis Marketing B- 16639.0 rv 5.316449 47.00 43.00 3.00 NaN 46.84089 42.33184 3.726098 NaN NaN http://www.gravispolls.com/2016/11/final-natio... 48848 76444 11/7/16 09:35:33 8 Nov 2016
In [3]:
polls_2020.head()
Out[3]:
question_id poll_id cycle state pollster_id pollster sponsor_ids sponsors display_name pollster_rating_id pollster_rating_name fte_grade sample_size population population_full methodology office_type seat_number seat_name start_date end_date election_date sponsor_candidate internal partisan tracking nationwide_batch ranked_choice_reallocated created_at notes url stage race_id answer candidate_id candidate_name candidate_party pct
0 132135 70662 2020 Montana 1102 Emerson College NaN NaN Emerson College 88.0 Emerson College A- 500.0 lv lv Text U.S. President 0 NaN 10/5/20 10/7/20 11/3/20 NaN False NaN NaN False False 10/7/20 20:43 NaN https://emersonpolling.reportablenews.com/pr/m... general 6237 Biden 13256 Joseph R. Biden Jr. DEM 43.6
1 132135 70662 2020 Montana 1102 Emerson College NaN NaN Emerson College 88.0 Emerson College A- 500.0 lv lv Text U.S. President 0 NaN 10/5/20 10/7/20 11/3/20 NaN False NaN NaN False False 10/7/20 20:43 NaN https://emersonpolling.reportablenews.com/pr/m... general 6237 Trump 13254 Donald Trump REP 56.4
2 132174 70680 2020 NaN 1189 Morning Consult NaN NaN Morning Consult 218.0 Morning Consult B/C 17249.0 lv lv Online U.S. President 0 NaN 10/5/20 10/7/20 11/3/20 NaN False NaN True False False 10/8/20 11:38 NaN https://morningconsult.com/2020-presidential-e... general 6210 Biden 13256 Joseph R. Biden Jr. DEM 52.0
3 132174 70680 2020 NaN 1189 Morning Consult NaN NaN Morning Consult 218.0 Morning Consult B/C 17249.0 lv lv Online U.S. President 0 NaN 10/5/20 10/7/20 11/3/20 NaN False NaN True False False 10/8/20 11:38 NaN https://morningconsult.com/2020-presidential-e... general 6210 Trump 13254 Donald Trump REP 43.0
4 132116 70650 2020 Florida 744 Ipsos 71 Reuters Ipsos 154.0 Ipsos B- 678.0 lv lv Online U.S. President 0 NaN 9/29/20 10/7/20 11/3/20 NaN False NaN NaN False False 10/7/20 14:33 NaN https://www.ipsos.com/sites/default/files/ct/n... general 6220 Biden 13256 Joseph R. Biden Jr. DEM 49.0

The data schema has changed a bit between 2016 and 2020, so let's see what columns these polls have in common.

In [4]:
polls_2016.columns.intersection(polls_2020.columns)
Out[4]:
Index(['cycle', 'state', 'pollster', 'population', 'url', 'poll_id',
       'question_id'],
      dtype='object')

We can use the state column to map into our submission format. But the FiveThirtyEight data uses the full state name for the state, and our submission format uses the state abbreviation. We'll want to

  • Create a map between states and abbreviations using the us library
  • Subset the polling data to only those states in our submission format (removing any national polls or territories)

First, let's load our submission formats

In [5]:
DATA_DIR = Path("../data/processed/public/")
index_col = "state_abbreviation"

submission_format_2016 = pd.read_csv(
    (DATA_DIR / "submission-format-2016.csv"), 
    index_col=index_col
)
submission_format_2020 = pd.read_csv(
    (DATA_DIR / "submission-format-2020.csv"), 
    index_col=index_col
)

# indices are the same for 2016 and 2020
assert (submission_format_2016.index == submission_format_2020.index).all()
print(f"There are {len(submission_format_2016)} abbreviations")  # includes D.C.
submission_format_2016.head()
There are 51 abbreviations
Out[5]:
Clinton Trump Other
state_abbreviation
AK 0.33 0.33 0.33
AL 0.33 0.33 0.33
AR 0.33 0.33 0.33
AZ 0.33 0.33 0.33
CA 0.33 0.33 0.33
In [6]:
submission_format_2020.head()
Out[6]:
Biden Trump Other
state_abbreviation
AK 0.33 0.33 0.33
AL 0.33 0.33 0.33
AR 0.33 0.33 0.33
AZ 0.33 0.33 0.33
CA 0.33 0.33 0.33

It will be our job to replace those 0.33 with some machine-learned predictions of our own!

First, since the FiveThirtyEight data uses the full state name in state, we'll create a map between full state names and abbreviations. This is easy using a fun little package called us.

In [7]:
states = {
    state.name: state.abbr for state in (us.STATES + [us.states.DC])
    if state.abbr in submission_format_2016.index  # ok to check only 2016 since 2020 same
}

assert len(states) == len(submission_format_2016) == len(submission_format_2020)
states
Out[7]:
{'Alabama': 'AL',
 'Alaska': 'AK',
 'Arizona': 'AZ',
 'Arkansas': 'AR',
 'California': 'CA',
 'Colorado': 'CO',
 'Connecticut': 'CT',
 'Delaware': 'DE',
 'Florida': 'FL',
 'Georgia': 'GA',
 'Hawaii': 'HI',
 'Idaho': 'ID',
 'Illinois': 'IL',
 'Indiana': 'IN',
 'Iowa': 'IA',
 'Kansas': 'KS',
 'Kentucky': 'KY',
 'Louisiana': 'LA',
 'Maine': 'ME',
 'Maryland': 'MD',
 'Massachusetts': 'MA',
 'Michigan': 'MI',
 'Minnesota': 'MN',
 'Mississippi': 'MS',
 'Missouri': 'MO',
 'Montana': 'MT',
 'Nebraska': 'NE',
 'Nevada': 'NV',
 'New Hampshire': 'NH',
 'New Jersey': 'NJ',
 'New Mexico': 'NM',
 'New York': 'NY',
 'North Carolina': 'NC',
 'North Dakota': 'ND',
 'Ohio': 'OH',
 'Oklahoma': 'OK',
 'Oregon': 'OR',
 'Pennsylvania': 'PA',
 'Rhode Island': 'RI',
 'South Carolina': 'SC',
 'South Dakota': 'SD',
 'Tennessee': 'TN',
 'Texas': 'TX',
 'Utah': 'UT',
 'Vermont': 'VT',
 'Virginia': 'VA',
 'Washington': 'WA',
 'West Virginia': 'WV',
 'Wisconsin': 'WI',
 'Wyoming': 'WY',
 'District of Columbia': 'DC'}

Let's look at how many polls each state has. Some polls have multiple rows (in 2020) so we'll use the unique poll_id to count.

In [8]:
poll_counts = pd.DataFrame(
    [
        polls_2016.groupby("state").poll_id.count().rename("# 2016 Polls"),
        polls_2020.groupby("state").poll_id.count().rename("# 2020 Polls")
    ]
).T.sort_index(ascending=False)

poll_counts["in_states_map"] = poll_counts.index.isin(states.keys())
poll_counts[poll_counts.in_states_map].plot.barh(figsize=(10, 20), title="Number of polls conducted")
Out[8]:
<AxesSubplot:title={'center':'Number of polls conducted'}>

It's interesting to note that Michigan and Wisconsin, two of the states that are said to have delievered Trump the presidency by about 80,000 votes in 2016, are being polled substantially more in 2020.

Now let's look at the polling data that doesn't appear in our submission formats

In [9]:
poll_counts[~poll_counts.in_states_map]
Out[9]:
# 2016 Polls # 2020 Polls in_states_map
U.S. 3318.0 NaN False
Nebraska CD-3 3.0 NaN False
Nebraska CD-2 6.0 11.0 False
Nebraska CD-1 3.0 2.0 False
Maine CD-2 42.0 44.0 False
Maine CD-1 33.0 40.0 False

A couple of things stand out

  • Nebraska and Maine poll both at the state and congressional district levels (CD). This is because they award electoral votes based on statewide and district returns. Since we are ultimately concerned with statewide vote-shares, we will drop these polls for this benchmark.
  • National polls are grouped in here, apparently as U.S. in 2016 and NaN in 2020. We'll drop those as well.

Then we will have a one-to-one mapping between our polling data and submission formats, since every poll will have a key in our state dictionary.

In [10]:
polls_2016 = polls_2016[polls_2016.state.isin(states.keys())]
polls_2020 = polls_2020[polls_2020.state.isin(states.keys())]

Now any state in our polling data is a state in our submission format.

In [11]:
assert polls_2016.state.isin(states.keys()).all()
assert polls_2020.state.isin(states.keys()).all()

Also note that we're only looking at presidental polling.

In [12]:
# different coding between 2016 and 2020
assert (polls_2016.branch == "President").all()
assert (polls_2020.office_type == "U.S. President").all()

Turning polls into features

In order to predict the vote percentage for a candidate in each state, we need state-level features. For each state, we have a different number of polls, conducted using a different methodology and with varying recency (polls are less often conducted in states that aren't up for grabs). We'll take a very simple approach to start and create the following features:

  • Number of days to the election: This helps the model understand how recent the poll is (therefore, how useful its measures will be).
  • FiveThirtyEight Pollster Grade: FiveThirtyEight puts a lot of work into rating pollsters they use to generate forecasts. This could be a useful feature in our models. (It certainly is in theirs!)
  • Democrat percentage: The percentage of responds that say they will vote for the democratic candidate (For 2016, this is Clinton; for 2020, this is Biden).
  • Republican percentage: The percentage of responds that say they will vote for the republican candidate (Trump).
  • Other percentage: 1 - (republican percentage + democratic percentage). While 3rd party candidates had a important impact in 2016, they are playing less of a role in 2020 so we lump them all together for the 2020 model.

Prepare the FiveThirtyEight data

In order to create the above features, we'll need to make the 2016 and 2020 data consistent

  • Make column names consistent
  • Convert dates to datetime data types
  • Remove extraneous data
In [13]:
rename_2016_to_2020 = {
    "enddate": "end_date",
    "timestamp": "election_date",
    "grade": "fte_grade",
    "startdate": "start_date"
    
}
polls_2016 = polls_2016.rename(rename_2016_to_2020, axis=1)

# convert to datetime
for col in ["start_date", "end_date", "election_date"]:
    polls_2016[col] = pd.to_datetime(polls_2016[col]).dt.date
    polls_2020[col] = pd.to_datetime(polls_2020[col]).dt.date

# check that election dates are correct
assert (polls_2016.election_date == pd.to_datetime("2016-11-08")).all()
assert (polls_2020.election_date == pd.to_datetime("2020-11-03")).all()

For the 2016 election, FiveThirtyEight ran three different forecast models called

  • Polls-only
  • Polls-plus
  • Now-cast

which we can see using the type column

In [14]:
polls_2016.type.value_counts(dropna=False)
Out[14]:
polls-plus    3073
now-cast      3073
polls-only    3073
Name: type, dtype: int64

In 2020, they are only running the Polls-plus model, as described in the model field described on their official data page on Github.

For consistency, we'll reduce the 2016 data to "type" == "polls-plus".

In [15]:
polls_2016 = polls_2016[polls_2016.type == "polls-plus"]
assert polls_2016.poll_id.nunique() == len(polls_2016)

We'll also want to use the 538 grades as a feature. The grading scheme is a bit different between 2016 and 2020 but we will map the union of both to ordinal values for use in all of our models.

In [16]:
all_grades = (
    pd.concat([polls_2016.fte_grade, polls_2020.fte_grade])
    .sort_values(ascending=False, na_position="first")
    .unique()
)
grade_map = {grade: i for i, grade in enumerate(all_grades)}
grade_map
Out[16]:
{nan: 0,
 'D-': 1,
 'D': 2,
 'C/D': 3,
 'C-': 4,
 'C+': 5,
 'C': 6,
 'B/C': 7,
 'B-': 8,
 'B+': 9,
 'B': 10,
 'A/B': 11,
 'A-': 12,
 'A+': 13,
 'A': 14}
In [17]:
def prepare_polls(polls, earliest_poll_date, is_2020, sort_by="days_to_election", grade_map=grade_map):
    polls["days_to_election"] = (polls.election_date - polls.end_date).dt.days
    poll_start_mask = polls.end_date > pd.to_datetime(earliest_poll_date)
    
    rows = []
    for poll_id, subdf in polls[poll_start_mask].groupby("poll_id"):
        assert len(subdf.days_to_election.unique()) == 1
        assert len(subdf.state.unique()) == 1
        assert len(subdf.fte_grade.unique()) == 1
        
        days_to_election = list(subdf.days_to_election.unique()).pop()
        state = list(subdf.state.unique()).pop()
        fte_grade_num = grade_map[list(subdf.fte_grade.unique()).pop()]
        
        if is_2020:
            dem_frac = subdf.loc[subdf.candidate_party == "DEM", "pct"].mean() / 100
            rep_frac = subdf.loc[subdf.candidate_party == "REP", "pct"].mean() / 100
        else:
            dem_frac = subdf.adjpoll_clinton.mean() / 100
            rep_frac = subdf.adjpoll_trump.mean() / 100
        
        other_frac = (100 - (dem_frac + rep_frac)) / 100
        
        rows.append(
            {
                "state": state,
                "days_to_election": days_to_election,
                "dem_frac": dem_frac,
                "rep_frac": rep_frac,
                "other_frac": other_frac,
                "fte_grade_num": fte_grade_num,
                "poll_id": poll_id,
            }
        )
    return pd.DataFrame(rows).sort_values(by=sort_by, ascending=True)
In [18]:
prepared_2020 = prepare_polls(
    polls=polls_2020, 
    earliest_poll_date="2020-08-01",
    is_2020=True
)

prepared_2020.head()
Out[18]:
state days_to_election dem_frac rep_frac other_frac fte_grade_num poll_id
582 Montana 27 0.436 0.564 0.9900 12 70662
576 Arizona 27 0.480 0.460 0.9906 8 70651
575 Florida 27 0.490 0.450 0.9906 8 70650
585 Arizona 28 0.480 0.450 0.9907 7 70681
581 Minnesota 28 0.470 0.400 0.9913 14 70660
In [19]:
prepared_2016 = prepare_polls(
    polls_2016, 
    earliest_poll_date="2016-08-01",
    is_2020=False
)

prepared_2016.head()
Out[19]:
state days_to_election dem_frac rep_frac other_frac fte_grade_num poll_id
2668 Louisiana 1 0.272169 0.592157 0.991357 10 48811
2678 North Dakota 1 0.276940 0.527929 0.991951 10 48821
2679 Nebraska 1 0.356497 0.468763 0.991747 10 48822
2680 New Hampshire 1 0.513274 0.367592 0.991191 10 48823
2681 New Jersey 1 0.446905 0.380967 0.991721 10 48824

Build model features using the last N polls

For each state, use the last N prepared polling data as the state features. Each row of our prepared data now corresponds to a unique poll.

In [20]:
assert prepared_2016.poll_id.nunique() == len(prepared_2016)
assert prepared_2020.poll_id.nunique() == len(prepared_2020)

But states can have many polls

In [21]:
assert not prepared_2016.state.nunique() == len(prepared_2016)
assert not prepared_2020.state.nunique() == len(prepared_2020)

We only want one prediction for a state, so for our first pass model features, each row will be a state and there will be N * len(feature_cols) columns, where N is the number of polls we choose to consider for model features, and feature_cols is a list of column names representing the features we'll consider in each poll. In cases where a state doesn't have enough polling data, we'll simply fillna.

While we're at it, we'll use the state mapping created above to rename states to their abbreviation, in line with our submission formats.

In [22]:
feature_cols = ["days_to_election", "dem_frac", "rep_frac", "other_frac", "fte_grade_num"]


def build_features_from_last_n_polls(polls, num_polls, feature_cols=feature_cols, state_map=states):
    all_states_rows = {}
    for state, subdf in polls.groupby("state"):
        state_row = {}
        subdf = subdf.sort_values(by="days_to_election").iloc[:min(len(subdf), num_polls)]
        state_row.update(
            {
                f"poll_{i}_{col}": poll[col] for col in feature_cols
                for i, (_, poll) in enumerate(subdf.iterrows())
            }
        )
        all_states_rows[state_map[state]] = state_row
    return (
        pd.DataFrame.from_dict(all_states_rows, orient="index")
        .fillna(0)
        .rename_axis("state_abbreviation", axis="index")  
        .sort_index()
    )
In [23]:
NUM_POLLS = 5

features_2020 = build_features_from_last_n_polls(
    polls=prepared_2020, 
    num_polls=NUM_POLLS
)
features_2020.head()
Out[23]:
poll_0_days_to_election poll_1_days_to_election poll_2_days_to_election poll_3_days_to_election poll_4_days_to_election poll_0_dem_frac poll_1_dem_frac poll_2_dem_frac poll_3_dem_frac poll_4_dem_frac poll_0_rep_frac poll_1_rep_frac poll_2_rep_frac poll_3_rep_frac poll_4_rep_frac poll_0_other_frac poll_1_other_frac poll_2_other_frac poll_3_other_frac poll_4_other_frac poll_0_fte_grade_num poll_1_fte_grade_num poll_2_fte_grade_num poll_3_fte_grade_num poll_4_fte_grade_num
state_abbreviation
AK 30 34 41.0 64.0 0.0 0.4600 0.44055 0.46000 0.42165 0.00 0.5000 0.53595 0.4700 0.5667 0.00 0.990400 0.990235 0.99070 0.990117 0.0000 7 1 7.0 1.0 0.0
AL 31 34 64.0 76.0 93.0 0.3720 0.40630 0.34845 0.44000 0.36 0.5680 0.56745 0.6255 0.4800 0.58 0.990600 0.990263 0.99026 0.990800 0.9906 0 1 1.0 7.0 7.0
AR 34 64 0.0 0.0 0.0 0.3901 0.32900 0.00000 0.00000 0.00 0.6014 0.65975 0.0000 0.0000 0.00 0.990085 0.990113 0.00000 0.000000 0.0000 1 1 0.0 0.0 0.0
AZ 27 28 29.0 29.0 30.0 0.4800 0.48000 0.45800 0.47700 0.51 0.4600 0.45000 0.4480 0.4320 0.45 0.990600 0.990700 0.99094 0.990910 0.9904 8 7 7.0 11.0 4.0
CA 34 36 43.0 49.0 51.0 0.6332 0.59000 0.62000 0.67000 0.60 0.3434 0.32000 0.2800 0.2800 0.31 0.990234 0.990900 0.99100 0.990500 0.9909 1 14 0.0 3.0 11.0
In [24]:
features_2016 = build_features_from_last_n_polls(
    polls=prepared_2016,
    num_polls=NUM_POLLS
)
features_2016.head()
Out[24]:
poll_0_days_to_election poll_1_days_to_election poll_2_days_to_election poll_3_days_to_election poll_4_days_to_election poll_0_dem_frac poll_1_dem_frac poll_2_dem_frac poll_3_dem_frac poll_4_dem_frac poll_0_rep_frac poll_1_rep_frac poll_2_rep_frac poll_3_rep_frac poll_4_rep_frac poll_0_other_frac poll_1_other_frac poll_2_other_frac poll_3_other_frac poll_4_other_frac poll_0_fte_grade_num poll_1_fte_grade_num poll_2_fte_grade_num poll_3_fte_grade_num poll_4_fte_grade_num
state_abbreviation
AK 1 1 2 2 2 0.285592 0.296071 0.296745 0.408479 0.382555 0.529407 0.473345 0.463892 0.433350 0.402817 0.991850 0.992306 0.992394 0.991582 0.992146 10 4 4 8 10
AL 1 1 2 2 2 0.255230 0.345708 0.375402 0.346250 0.324165 0.661839 0.543177 0.536972 0.533617 0.546590 0.990829 0.991111 0.990876 0.991201 0.991292 10 4 12 4 10
AR 1 1 2 2 2 0.361209 0.305888 0.326495 0.371534 0.313358 0.523722 0.553260 0.543753 0.532838 0.503176 0.991151 0.991409 0.991298 0.990956 0.991835 10 4 4 12 10
AZ 1 1 2 2 2 0.462694 0.435973 0.445910 0.436611 0.413577 0.414962 0.413299 0.468141 0.423818 0.461778 0.991223 0.991507 0.990859 0.991396 0.991246 10 4 0 4 12
CA 1 1 2 2 2 0.531979 0.545958 0.583381 0.546591 0.582353 0.319255 0.303292 0.310047 0.293806 0.351667 0.991488 0.991508 0.991066 0.991596 0.990660 10 4 12 4 0

Our features are ready! Finally, before we model we'll load up our labels for 2016 (obviously, we do not yet have labels for 2020) and confirm thaat index is ordered as we expect.

In [25]:
results_2016 = pd.read_csv("../data/processed/public/returns-2016.csv", index_col="state_abbreviation")

assert (results_2016.index == features_2016.index).all()
assert (results_2016.index == submission_format_2016.index).all()

results_2016.head()
Out[25]:
Clinton Trump Other
state_abbreviation
AK 0.365509 0.512815 0.121676
AL 0.343579 0.620831 0.035590
AR 0.336519 0.605719 0.057762
AZ 0.451260 0.486716 0.062024
CA 0.617264 0.316171 0.066565

Time to do some forecasting!

Now it's time to make some predictions. We'll start off with a battle-tested out-of-the-box SciKit Learn regression model––the treeb-based GradientBoostingRegressor. These models are ensembles of classification and regression (CART) trees that are trained using gradient descent (of #deeplearning fame). We note that these are non-linear models, which extends their expressive power. Given the differences between 2016 and 2020 polling trends, a non-linear model is likely to be useful.

We'll also fit this model using the GridSearchCV, which will compare the cross-validated scores for different hyperparameter values on the model and choose the best one. Initially, we will vary the learning rate, the number of estimators used in the ensemble, and the number of features used to split nodes on the tree.

In [26]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


clf = make_pipeline(
    MinMaxScaler(),
    GridSearchCV(
        MultiOutputRegressor(GradientBoostingRegressor(random_state=0)),
        dict(
            estimator__learning_rate=[0.1, 0.01, 0.001],
            estimator__n_estimators=[10, 100, 1000],
            estimator__max_features=["auto", "sqrt", "log2"],
        ),
        scoring="neg_mean_squared_error"
    )
)

clf.fit(features_2016, results_2016)
Out[26]:
Pipeline(steps=[('minmaxscaler', MinMaxScaler()),
                ('gridsearchcv',
                 GridSearchCV(estimator=MultiOutputRegressor(estimator=GradientBoostingRegressor(random_state=0)),
                              param_grid={'estimator__learning_rate': [0.1,
                                                                       0.01,
                                                                       0.001],
                                          'estimator__max_features': ['auto',
                                                                      'sqrt',
                                                                      'log2'],
                                          'estimator__n_estimators': [10, 100,
                                                                      1000]},
                              scoring='neg_mean_squared_error'))])

Take note of the best parameters found during grid search.

In [27]:
clf["gridsearchcv"].best_params_
Out[27]:
{'estimator__learning_rate': 0.01,
 'estimator__max_features': 'auto',
 'estimator__n_estimators': 1000}

Listen to the voice of the people

Now that we have the model, we can predict data for 2016 and evaluate how good the fit of that model is. To do that, we'll predict the percentages that each candidate will receive in each state.

In [28]:
# predict vote percentages with our trained model
preds = clf.predict(features_2016)

# get submission format
preds_2016 = submission_format_2016.copy()

# fill in our predicted values and write to csv
preds_2016.iloc[:, :] = preds
preds_2016.to_csv("gbr-model-2016.csv")

preds_2016.head()
Out[28]:
Clinton Trump Other
state_abbreviation
AK 0.366021 0.513101 0.121496
AL 0.343549 0.620807 0.035777
AR 0.336536 0.605627 0.057121
AZ 0.452270 0.486831 0.061630
CA 0.616499 0.316193 0.065879

If we submit on DrivenData, we can see our score against the 2016 election:

Electoral college results

Just for fun, let's see what our 2016 predictions imply for electoral college results. We'll grab the data for how many electoral votes each state gets from the FEC. (Our apologies for excel pandas magic here.)

In [29]:
fec_electoral_college_url = "https://www.fec.gov/documents/1890/federalelections2016.xlsx"

electoral_by_candidate = (
    pd.read_excel(
        fec_electoral_college_url, 
        sheet_name="Table 2. Electoral &  Pop Vote", 
        skiprows=[0, 1]
    )[["STATE", "ELECTORAL VOTE", "ELECTORAL VOTE.1"]]
    .set_index("STATE")
    .loc["AL": "WY"]
    .fillna(0)
)

for col in electoral_by_candidate.columns:
    electoral_by_candidate[col] = electoral_by_candidate[col].map(
        lambda val: int((str(val).replace("*", "")))  # ignore the ** vals in data
    )

electoral_college = (
    electoral_by_candidate.sum(axis=1)
    .to_frame()
    .rename({0: "electoral_votes"}, axis=1)
)
electoral_college.head()
Out[29]:
electoral_votes
STATE
AL 9
AK 3
AZ 11
AR 6
CA 55

How did we do? Recall, it takes 270 electoral votes to win.

In [30]:
electoral_college.sort_values(by='STATE', inplace=True)
preds_2016.sort_index(inplace=True)

preds_2016['Dems'] = np.where(
    preds_2016.Clinton > preds_2016.Trump,
    electoral_college.electoral_votes,
    0
)

preds_2016['Reps'] = np.where(
    preds_2016.Clinton < preds_2016.Trump,
    electoral_college.electoral_votes,
    0
)
In [31]:
print("===== PREDICTED ELECTORAL VOTES FOR EACH PARTY 2016 =======\n")
print(preds_2016[['Dems', 'Reps']].sum())
print("\n===== ACTUAL ELECTORAL VOTES FOR EACH PARTY 2016 =======\n")
print("Dems", 227)
print("Reps", 304)
===== PREDICTED ELECTORAL VOTES FOR EACH PARTY 2016 =======

Dems    228
Reps    303
dtype: int64

===== ACTUAL ELECTORAL VOTES FOR EACH PARTY 2016 =======

Dems 227
Reps 304

That's pretty good, but we may be overfitting and when it comes to predicting elections using polling data, as we all learned in 2016, past performance is not necessarily an indicator of future results!

2020

Feel the will of the people

Finally, it's time to see what we think about Trump v. Biden. We've got a model now to predict presidential election outcomes based poll results. We can use this model to predict the current presidential election.

In [32]:
# predict vote percentages with our trained model
preds = clf.predict(features_2020)

# get submission format
preds_2020 = submission_format_2020.copy()

# fill in our predicted values and write to csv
preds_2020.iloc[:, :] = preds
preds_2020.to_csv("gbr-model-2020.csv")

preds_2020.head()
Out[32]:
Biden Trump Other
state_abbreviation
AK 0.499729 0.512873 0.041805
AL 0.345639 0.600703 0.043336
AR 0.323058 0.331472 0.055669
AZ 0.538705 0.484415 0.039734
CA 0.675142 0.253836 0.044839

We can see we now have predictions for every major candidate, for every state. We can submit this to DrivenData and mark it as our "Evaluation Submission" to indicate that these are our predictions for 2020:

But, one important question remains: who will win the election? As we saw above, that is subject to the rules of the Electoral College. The winner needs at least 270 electoral votes to become president. We can calculate that using data about how many electoral votes each state gets:

In [33]:
electoral_college.sort_values(by='STATE', inplace=True)
preds_2020.sort_index(inplace=True)

preds_2020['Dems'] = np.where(
    preds_2020.Biden > preds_2020.Trump,
    electoral_college.electoral_votes,
    0
)

preds_2020['Reps'] = np.where(
    preds_2020.Biden < preds_2020.Trump,
    electoral_college.electoral_votes,
    0
)
In [34]:
print("===== PREDICTED ELECTORAL VOTES FOR EACH PARTY 2020 =======\n")
print(preds_2020[['Dems', 'Reps']].sum())
print("\n===== ACTUAL ELECTORAL VOTES FOR EACH PARTY 2020 =======\n")
print("We don't know yet...")
===== PREDICTED ELECTORAL VOTES FOR EACH PARTY 2020 =======

Dems    406
Reps    125
dtype: int64

===== ACTUAL ELECTORAL VOTES FOR EACH PARTY 2020 =======

We don't know yet...

Ok, this first-pass model is complete. We have probably done some overfitting, and could certainly do some more feature engineering, but this is a start. It's your job to take it to the next level! And there are many next levels, see the FiveThirtyEight 2020 methodology write-up for proof. All we can do now is stay calm and model on... Just kidding, we can also

πŸ‘‰πŸ‘‰ VOTE πŸ‘ˆπŸ‘ˆ