America's Next Top (Statistical) Model - 2020¶

US presidential elections come but once every 4 years, and this one's a big one. The new president will help shape policies on the pandemic response, healthcare, the environment, the economy, and more. There are lots of people trying to predict what will happen. Can you top them?

In our newest competition, you are asked to predict the fraction of each state that will vote for each major candidate. You can use any data that is freely available to the public. Come election night (or election week... or election month), we'll see who's model had the most accurate vision for the country!

This competition allows the use of any publically available data, so in this post we'll walk through downloading some polling data, cleaning it up a bit, and using it to generate predictions for the 2016 and 2020 elections.

In [1]:

%matplotlib inline

from pathlib import Path
import re

import numpy as np
import pandas as pd
import us

pd.set_option("display.max_columns", 50)
pd.set_option("display.max_rows", 50)

What do the pollsters say?¶

First things first, we need some data. The bread and butter of election forecasting is polling data, though some believe that is changing rapidly. The forecasting website FiveThirtyEight is a great place to get polling data. Thanks to them, we can load the thousands of polls used in their 2016 and 2020 forecasts directly into pandas in just couple of lines. Scroll around to check out the columns!

In [2]:

polls_2016_url = "http://projects.fivethirtyeight.com/general-model/president_general_polls_2016.csv"
polls_2020_url = "https://projects.fivethirtyeight.com/polls-page/president_polls.csv"

polls_2016 = pd.read_csv(polls_2016_url)
polls_2020 = pd.read_csv(polls_2020_url)

polls_2016.head()

Out[2]:

	cycle	branch	type	matchup	forecastdate	state	startdate	enddate	pollster	grade	samplesize	population	poll_wt	rawpoll_clinton	rawpoll_trump	rawpoll_johnson	rawpoll_mcmullin	adjpoll_clinton	adjpoll_trump	adjpoll_johnson	adjpoll_mcmullin	multiversions	url	poll_id	question_id	createddate	timestamp
0	2016	President	polls-plus	Clinton vs. Trump vs. Johnson	11/8/16	U.S.	11/3/2016	11/6/2016	ABC News/Washington Post	A+	2220.0	lv	8.720654	47.00	43.00	4.00	NaN	45.20163	41.72430	4.626221	NaN	NaN	https://www.washingtonpost.com/news/the-fix/wp...	48630	76192	11/7/16	09:35:33 8 Nov 2016
1	2016	President	polls-plus	Clinton vs. Trump vs. Johnson	11/8/16	U.S.	11/1/2016	11/7/2016	Google Consumer Surveys	B	26574.0	lv	7.628472	38.03	35.69	5.46	NaN	43.34557	41.21439	5.175792	NaN	NaN	https://datastudio.google.com/u/0/#/org//repor...	48847	76443	11/7/16	09:35:33 8 Nov 2016
2	2016	President	polls-plus	Clinton vs. Trump vs. Johnson	11/8/16	U.S.	11/2/2016	11/6/2016	Ipsos	A-	2195.0	lv	6.424334	42.00	39.00	6.00	NaN	42.02638	38.81620	6.844734	NaN	NaN	http://projects.fivethirtyeight.com/polls/2016...	48922	76636	11/8/16	09:35:33 8 Nov 2016
3	2016	President	polls-plus	Clinton vs. Trump vs. Johnson	11/8/16	U.S.	11/4/2016	11/7/2016	YouGov	B	3677.0	lv	6.087135	45.00	41.00	5.00	NaN	45.65676	40.92004	6.069454	NaN	NaN	https://d25d2506sfb94s.cloudfront.net/cumulus_...	48687	76262	11/7/16	09:35:33 8 Nov 2016
4	2016	President	polls-plus	Clinton vs. Trump vs. Johnson	11/8/16	U.S.	11/3/2016	11/6/2016	Gravis Marketing	B-	16639.0	rv	5.316449	47.00	43.00	3.00	NaN	46.84089	42.33184	3.726098	NaN	NaN	http://www.gravispolls.com/2016/11/final-natio...	48848	76444	11/7/16	09:35:33 8 Nov 2016

In [3]:

polls_2020.head()

Out[3]:

	question_id	poll_id	cycle	state	pollster_id	pollster	sponsor_ids	sponsors	display_name	pollster_rating_id	pollster_rating_name	fte_grade	sample_size	population	population_full	methodology	office_type	seat_name	start_date	end_date	election_date	sponsor_candidate	internal	partisan	tracking	nationwide_batch	ranked_choice_reallocated	created_at	notes	url	stage	race_id	answer	candidate_id	candidate_name	candidate_party	pct
0	132135	70662	2020	Montana	1102	Emerson College	NaN	NaN	Emerson College	88.0	Emerson College	A-	500.0	lv	lv	Text	U.S. President	NaN	10/5/20	10/7/20	11/3/20	NaN	False	NaN	NaN	False	False	10/7/20 20:43	NaN	https://emersonpolling.reportablenews.com/pr/m...	general	6237	Biden	13256	Joseph R. Biden Jr.	DEM	43.6
1	132135	70662	2020	Montana	1102	Emerson College	NaN	NaN	Emerson College	88.0	Emerson College	A-	500.0	lv	lv	Text	U.S. President	NaN	10/5/20	10/7/20	11/3/20	NaN	False	NaN	NaN	False	False	10/7/20 20:43	NaN	https://emersonpolling.reportablenews.com/pr/m...	general	6237	Trump	13254	Donald Trump	REP	56.4
2	132174	70680	2020	NaN	1189	Morning Consult	NaN	NaN	Morning Consult	218.0	Morning Consult	B/C	17249.0	lv	lv	Online	U.S. President	NaN	10/5/20	10/7/20	11/3/20	NaN	False	NaN	True	False	False	10/8/20 11:38	NaN	https://morningconsult.com/2020-presidential-e...	general	6210	Biden	13256	Joseph R. Biden Jr.	DEM	52.0
3	132174	70680	2020	NaN	1189	Morning Consult	NaN	NaN	Morning Consult	218.0	Morning Consult	B/C	17249.0	lv	lv	Online	U.S. President	NaN	10/5/20	10/7/20	11/3/20	NaN	False	NaN	True	False	False	10/8/20 11:38	NaN	https://morningconsult.com/2020-presidential-e...	general	6210	Trump	13254	Donald Trump	REP	43.0
4	132116	70650	2020	Florida	744	Ipsos	71	Reuters	Ipsos	154.0	Ipsos	B-	678.0	lv	lv	Online	U.S. President	NaN	9/29/20	10/7/20	11/3/20	NaN	False	NaN	NaN	False	False	10/7/20 14:33	NaN	https://www.ipsos.com/sites/default/files/ct/n...	general	6220	Biden	13256	Joseph R. Biden Jr.	DEM	49.0

The data schema has changed a bit between 2016 and 2020, so let's see what columns these polls have in common.

In [4]:

polls_2016.columns.intersection(polls_2020.columns)

Out[4]:

Index(['cycle', 'state', 'pollster', 'population', 'url', 'poll_id',
       'question_id'],
      dtype='object')

We can use the state column to map into our submission format. But the FiveThirtyEight data uses the full state name for the state, and our submission format uses the state abbreviation. We'll want to

Create a map between states and abbreviations using the us library
Subset the polling data to only those states in our submission format (removing any national polls or territories)

First, let's load our submission formats

In [5]:

DATA_DIR = Path("../data/processed/public/")
index_col = "state_abbreviation"

submission_format_2016 = pd.read_csv(
    (DATA_DIR / "submission-format-2016.csv"), 
    index_col=index_col
)
submission_format_2020 = pd.read_csv(
    (DATA_DIR / "submission-format-2020.csv"), 
    index_col=index_col
)

# indices are the same for 2016 and 2020
assert (submission_format_2016.index == submission_format_2020.index).all()
print(f"There are {len(submission_format_2016)} abbreviations")  # includes D.C.
submission_format_2016.head()

There are 51 abbreviations

Out[5]:

	Clinton	Trump	Other
state_abbreviation
AK	0.33	0.33	0.33
AL	0.33	0.33	0.33
AR	0.33	0.33	0.33
AZ	0.33	0.33	0.33
CA	0.33	0.33	0.33

In [6]:

submission_format_2020.head()

Out[6]:

	Biden	Trump	Other
state_abbreviation
AK	0.33	0.33	0.33
AL	0.33	0.33	0.33
AR	0.33	0.33	0.33
AZ	0.33	0.33	0.33
CA	0.33	0.33	0.33

It will be our job to replace those 0.33 with some machine-learned predictions of our own!

First, since the FiveThirtyEight data uses the full state name in state, we'll create a map between full state names and abbreviations. This is easy using a fun little package called us.

In [7]:

states = {
    state.name: state.abbr for state in (us.STATES + [us.states.DC])
    if state.abbr in submission_format_2016.index  # ok to check only 2016 since 2020 same
}

assert len(states) == len(submission_format_2016) == len(submission_format_2020)
states

Out[7]:

{'Alabama': 'AL',
 'Alaska': 'AK',
 'Arizona': 'AZ',
 'Arkansas': 'AR',
 'California': 'CA',
 'Colorado': 'CO',
 'Connecticut': 'CT',
 'Delaware': 'DE',
 'Florida': 'FL',
 'Georgia': 'GA',
 'Hawaii': 'HI',
 'Idaho': 'ID',
 'Illinois': 'IL',
 'Indiana': 'IN',
 'Iowa': 'IA',
 'Kansas': 'KS',
 'Kentucky': 'KY',
 'Louisiana': 'LA',
 'Maine': 'ME',
 'Maryland': 'MD',
 'Massachusetts': 'MA',
 'Michigan': 'MI',
 'Minnesota': 'MN',
 'Mississippi': 'MS',
 'Missouri': 'MO',
 'Montana': 'MT',
 'Nebraska': 'NE',
 'Nevada': 'NV',
 'New Hampshire': 'NH',
 'New Jersey': 'NJ',
 'New Mexico': 'NM',
 'New York': 'NY',
 'North Carolina': 'NC',
 'North Dakota': 'ND',
 'Ohio': 'OH',
 'Oklahoma': 'OK',
 'Oregon': 'OR',
 'Pennsylvania': 'PA',
 'Rhode Island': 'RI',
 'South Carolina': 'SC',
 'South Dakota': 'SD',
 'Tennessee': 'TN',
 'Texas': 'TX',
 'Utah': 'UT',
 'Vermont': 'VT',
 'Virginia': 'VA',
 'Washington': 'WA',
 'West Virginia': 'WV',
 'Wisconsin': 'WI',
 'Wyoming': 'WY',
 'District of Columbia': 'DC'}

Let's look at how many polls each state has. Some polls have multiple rows (in 2020) so we'll use the unique poll_id to count.

In [8]:

poll_counts = pd.DataFrame(
    [
        polls_2016.groupby("state").poll_id.count().rename("# 2016 Polls"),
        polls_2020.groupby("state").poll_id.count().rename("# 2020 Polls")
    ]
).T.sort_index(ascending=False)

poll_counts["in_states_map"] = poll_counts.index.isin(states.keys())
poll_counts[poll_counts.in_states_map].plot.barh(figsize=(10, 20), title="Number of polls conducted")

Out[8]:

<AxesSubplot:title={'center':'Number of polls conducted'}>

It's interesting to note that Michigan and Wisconsin, two of the states that are said to have delievered Trump the presidency by about 80,000 votes in 2016, are being polled substantially more in 2020.

Now let's look at the polling data that doesn't appear in our submission formats

In [9]:

poll_counts[~poll_counts.in_states_map]

Out[9]:

	# 2016 Polls	# 2020 Polls	in_states_map
U.S.	3318.0	NaN	False
Nebraska CD-3	3.0	NaN	False
Nebraska CD-2	6.0	11.0	False
Nebraska CD-1	3.0	2.0	False
Maine CD-2	42.0	44.0	False
Maine CD-1	33.0	40.0	False

A couple of things stand out

Nebraska and Maine poll both at the state and congressional district levels (CD). This is because they award electoral votes based on statewide and district returns. Since we are ultimately concerned with statewide vote-shares, we will drop these polls for this benchmark.
National polls are grouped in here, apparently as U.S. in 2016 and NaN in 2020. We'll drop those as well.

Then we will have a one-to-one mapping between our polling data and submission formats, since every poll will have a key in our state dictionary.

In [10]:

polls_2016 = polls_2016[polls_2016.state.isin(states.keys())]
polls_2020 = polls_2020[polls_2020.state.isin(states.keys())]

Now any state in our polling data is a state in our submission format.

In [11]:

assert polls_2016.state.isin(states.keys()).all()
assert polls_2020.state.isin(states.keys()).all()

Also note that we're only looking at presidental polling.

In [12]:

# different coding between 2016 and 2020
assert (polls_2016.branch == "President").all()
assert (polls_2020.office_type == "U.S. President").all()

Turning polls into features¶

In order to predict the vote percentage for a candidate in each state, we need state-level features. For each state, we have a different number of polls, conducted using a different methodology and with varying recency (polls are less often conducted in states that aren't up for grabs). We'll take a very simple approach to start and create the following features:

Number of days to the election: This helps the model understand how recent the poll is (therefore, how useful its measures will be).
FiveThirtyEight Pollster Grade: FiveThirtyEight puts a lot of work into rating pollsters they use to generate forecasts. This could be a useful feature in our models. (It certainly is in theirs!)
Democrat percentage: The percentage of responds that say they will vote for the democratic candidate (For 2016, this is Clinton; for 2020, this is Biden).
Republican percentage: The percentage of responds that say they will vote for the republican candidate (Trump).
Other percentage: 1 - (republican percentage + democratic percentage). While 3rd party candidates had a important impact in 2016, they are playing less of a role in 2020 so we lump them all together for the 2020 model.

Prepare the FiveThirtyEight data¶

In order to create the above features, we'll need to make the 2016 and 2020 data consistent

Make column names consistent
Convert dates to datetime data types
Remove extraneous data

In [13]:

rename_2016_to_2020 = {
    "enddate": "end_date",
    "timestamp": "election_date",
    "grade": "fte_grade",
    "startdate": "start_date"
    
}
polls_2016 = polls_2016.rename(rename_2016_to_2020, axis=1)

# convert to datetime
for col in ["start_date", "end_date", "election_date"]:
    polls_2016[col] = pd.to_datetime(polls_2016[col]).dt.date
    polls_2020[col] = pd.to_datetime(polls_2020[col]).dt.date

# check that election dates are correct
assert (polls_2016.election_date == pd.to_datetime("2016-11-08")).all()
assert (polls_2020.election_date == pd.to_datetime("2020-11-03")).all()

For the 2016 election, FiveThirtyEight ran three different forecast models called

Polls-only
Polls-plus
Now-cast

which we can see using the type column

In [14]:

polls_2016.type.value_counts(dropna=False)

Out[14]:

polls-plus    3073
now-cast      3073
polls-only    3073
Name: type, dtype: int64

In 2020, they are only running the Polls-plus model, as described in the model field described on their official data page on Github.

For consistency, we'll reduce the 2016 data to "type" == "polls-plus".

In [15]:

polls_2016 = polls_2016[polls_2016.type == "polls-plus"]
assert polls_2016.poll_id.nunique() == len(polls_2016)

We'll also want to use the 538 grades as a feature. The grading scheme is a bit different between 2016 and 2020 but we will map the union of both to ordinal values for use in all of our models.

In [16]:

all_grades = (
    pd.concat([polls_2016.fte_grade, polls_2020.fte_grade])
    .sort_values(ascending=False, na_position="first")
    .unique()
)
grade_map = {grade: i for i, grade in enumerate(all_grades)}
grade_map

Out[16]:

{nan: 0,
 'D-': 1,
 'D': 2,
 'C/D': 3,
 'C-': 4,
 'C+': 5,
 'C': 6,
 'B/C': 7,
 'B-': 8,
 'B+': 9,
 'B': 10,
 'A/B': 11,
 'A-': 12,
 'A+': 13,
 'A': 14}

In [17]:

def prepare_polls(polls, earliest_poll_date, is_2020, sort_by="days_to_election", grade_map=grade_map):
    polls["days_to_election"] = (polls.election_date - polls.end_date).dt.days
    poll_start_mask = polls.end_date > pd.to_datetime(earliest_poll_date)
    
    rows = []
    for poll_id, subdf in polls[poll_start_mask].groupby("poll_id"):
        assert len(subdf.days_to_election.unique()) == 1
        assert len(subdf.state.unique()) == 1
        assert len(subdf.fte_grade.unique()) == 1
        
        days_to_election = list(subdf.days_to_election.unique()).pop()
        state = list(subdf.state.unique()).pop()
        fte_grade_num = grade_map[list(subdf.fte_grade.unique()).pop()]
        
        if is_2020:
            dem_frac = subdf.loc[subdf.candidate_party == "DEM", "pct"].mean() / 100
            rep_frac = subdf.loc[subdf.candidate_party == "REP", "pct"].mean() / 100
        else:
            dem_frac = subdf.adjpoll_clinton.mean() / 100
            rep_frac = subdf.adjpoll_trump.mean() / 100
        
        other_frac = (100 - (dem_frac + rep_frac)) / 100
        
        rows.append(
            {
                "state": state,
                "days_to_election": days_to_election,
                "dem_frac": dem_frac,
                "rep_frac": rep_frac,
                "other_frac": other_frac,
                "fte_grade_num": fte_grade_num,
                "poll_id": poll_id,
            }
        )
    return pd.DataFrame(rows).sort_values(by=sort_by, ascending=True)

In [18]:

prepared_2020 = prepare_polls(
    polls=polls_2020, 
    earliest_poll_date="2020-08-01",
    is_2020=True
)

prepared_2020.head()

Out[18]:

	state	days_to_election	dem_frac	rep_frac	other_frac	fte_grade_num	poll_id
582	Montana	27	0.436	0.564	0.9900	12	70662
576	Arizona	27	0.480	0.460	0.9906	8	70651
575	Florida	27	0.490	0.450	0.9906	8	70650
585	Arizona	28	0.480	0.450	0.9907	7	70681
581	Minnesota	28	0.470	0.400	0.9913	14	70660

In [19]:

prepared_2016 = prepare_polls(
    polls_2016, 
    earliest_poll_date="2016-08-01",
    is_2020=False
)

prepared_2016.head()

Out[19]:

	state	days_to_election	dem_frac	rep_frac	other_frac	fte_grade_num	poll_id
2668	Louisiana	1	0.272169	0.592157	0.991357	10	48811
2678	North Dakota	1	0.276940	0.527929	0.991951	10	48821
2679	Nebraska	1	0.356497	0.468763	0.991747	10	48822
2680	New Hampshire	1	0.513274	0.367592	0.991191	10	48823
2681	New Jersey	1	0.446905	0.380967	0.991721	10	48824

Build model features using the last N polls¶

For each state, use the last N prepared polling data as the state features. Each row of our prepared data now corresponds to a unique poll.

In [20]:

assert prepared_2016.poll_id.nunique() == len(prepared_2016)
assert prepared_2020.poll_id.nunique() == len(prepared_2020)

But states can have many polls

In [21]:

assert not prepared_2016.state.nunique() == len(prepared_2016)
assert not prepared_2020.state.nunique() == len(prepared_2020)

We only want one prediction for a state, so for our first pass model features, each row will be a state and there will be N * len(feature_cols) columns, where N is the number of polls we choose to consider for model features, and feature_cols is a list of column names representing the features we'll consider in each poll. In cases where a state doesn't have enough polling data, we'll simply fillna.

While we're at it, we'll use the state mapping created above to rename states to their abbreviation, in line with our submission formats.

In [22]:

feature_cols = ["days_to_election", "dem_frac", "rep_frac", "other_frac", "fte_grade_num"]


def build_features_from_last_n_polls(polls, num_polls, feature_cols=feature_cols, state_map=states):
    all_states_rows = {}
    for state, subdf in polls.groupby("state"):
        state_row = {}
        subdf = subdf.sort_values(by="days_to_election").iloc[:min(len(subdf), num_polls)]
        state_row.update(
            {
                f"poll_{i}_{col}": poll[col] for col in feature_cols
                for i, (_, poll) in enumerate(subdf.iterrows())
            }
        )
        all_states_rows[state_map[state]] = state_row
    return (
        pd.DataFrame.from_dict(all_states_rows, orient="index")
        .fillna(0)
        .rename_axis("state_abbreviation", axis="index")  
        .sort_index()
    )

In [23]:

NUM_POLLS = 5

features_2020 = build_features_from_last_n_polls(
    polls=prepared_2020, 
    num_polls=NUM_POLLS
)
features_2020.head()

Out[23]:

	poll_0_days_to_election	poll_1_days_to_election	poll_2_days_to_election	poll_3_days_to_election	poll_4_days_to_election	poll_0_dem_frac	poll_1_dem_frac	poll_2_dem_frac	poll_3_dem_frac	poll_4_dem_frac	poll_0_rep_frac	poll_1_rep_frac	poll_2_rep_frac	poll_3_rep_frac	poll_4_rep_frac	poll_0_other_frac	poll_1_other_frac	poll_2_other_frac	poll_3_other_frac	poll_4_other_frac	poll_0_fte_grade_num	poll_1_fte_grade_num	poll_2_fte_grade_num	poll_3_fte_grade_num	poll_4_fte_grade_num
state_abbreviation
AK	30	34	41.0	64.0	0.0	0.4600	0.44055	0.46000	0.42165	0.00	0.5000	0.53595	0.4700	0.5667	0.00	0.990400	0.990235	0.99070	0.990117	0.0000	7	1	7.0	1.0	0.0
AL	31	34	64.0	76.0	93.0	0.3720	0.40630	0.34845	0.44000	0.36	0.5680	0.56745	0.6255	0.4800	0.58	0.990600	0.990263	0.99026	0.990800	0.9906	0	1	1.0	7.0	7.0
AR	34	64	0.0	0.0	0.0	0.3901	0.32900	0.00000	0.00000	0.00	0.6014	0.65975	0.0000	0.0000	0.00	0.990085	0.990113	0.00000	0.000000	0.0000	1	1	0.0	0.0	0.0
AZ	27	28	29.0	29.0	30.0	0.4800	0.48000	0.45800	0.47700	0.51	0.4600	0.45000	0.4480	0.4320	0.45	0.990600	0.990700	0.99094	0.990910	0.9904	8	7	7.0	11.0	4.0
CA	34	36	43.0	49.0	51.0	0.6332	0.59000	0.62000	0.67000	0.60	0.3434	0.32000	0.2800	0.2800	0.31	0.990234	0.990900	0.99100	0.990500	0.9909	1	14	0.0	3.0	11.0

In [24]:

features_2016 = build_features_from_last_n_polls(
    polls=prepared_2016,
    num_polls=NUM_POLLS
)
features_2016.head()

Out[24]:

	poll_0_days_to_election	poll_1_days_to_election	poll_2_days_to_election	poll_3_days_to_election	poll_4_days_to_election	poll_0_dem_frac	poll_1_dem_frac	poll_2_dem_frac	poll_3_dem_frac	poll_4_dem_frac	poll_0_rep_frac	poll_1_rep_frac	poll_2_rep_frac	poll_3_rep_frac	poll_4_rep_frac	poll_0_other_frac	poll_1_other_frac	poll_2_other_frac	poll_3_other_frac	poll_4_other_frac	poll_0_fte_grade_num	poll_1_fte_grade_num	poll_2_fte_grade_num	poll_3_fte_grade_num	poll_4_fte_grade_num
state_abbreviation
AK	1	1	2	2	2	0.285592	0.296071	0.296745	0.408479	0.382555	0.529407	0.473345	0.463892	0.433350	0.402817	0.991850	0.992306	0.992394	0.991582	0.992146	10	4	4	8	10
AL	1	1	2	2	2	0.255230	0.345708	0.375402	0.346250	0.324165	0.661839	0.543177	0.536972	0.533617	0.546590	0.990829	0.991111	0.990876	0.991201	0.991292	10	4	12	4	10
AR	1	1	2	2	2	0.361209	0.305888	0.326495	0.371534	0.313358	0.523722	0.553260	0.543753	0.532838	0.503176	0.991151	0.991409	0.991298	0.990956	0.991835	10	4	4	12	10
AZ	1	1	2	2	2	0.462694	0.435973	0.445910	0.436611	0.413577	0.414962	0.413299	0.468141	0.423818	0.461778	0.991223	0.991507	0.990859	0.991396	0.991246	10	4	0	4	12
CA	1	1	2	2	2	0.531979	0.545958	0.583381	0.546591	0.582353	0.319255	0.303292	0.310047	0.293806	0.351667	0.991488	0.991508	0.991066	0.991596	0.990660	10	4	12	4	0

Our features are ready! Finally, before we model we'll load up our labels for 2016 (obviously, we do not yet have labels for 2020) and confirm thaat index is ordered as we expect.

In [25]:

results_2016 = pd.read_csv("../data/processed/public/returns-2016.csv", index_col="state_abbreviation")

assert (results_2016.index == features_2016.index).all()
assert (results_2016.index == submission_format_2016.index).all()

results_2016.head()

Out[25]:

	Clinton	Trump	Other
state_abbreviation
AK	0.365509	0.512815	0.121676
AL	0.343579	0.620831	0.035590
AR	0.336519	0.605719	0.057762
AZ	0.451260	0.486716	0.062024
CA	0.617264	0.316171	0.066565

Time to do some forecasting!¶

Now it's time to make some predictions. We'll start off with a battle-tested out-of-the-box SciKit Learn regression model––the treeb-based GradientBoostingRegressor. These models are ensembles of classification and regression (CART) trees that are trained using gradient descent (of #deeplearning fame). We note that these are non-linear models, which extends their expressive power. Given the differences between 2016 and 2020 polling trends, a non-linear model is likely to be useful.

We'll also fit this model using the GridSearchCV, which will compare the cross-validated scores for different hyperparameter values on the model and choose the best one. Initially, we will vary the learning rate, the number of estimators used in the ensemble, and the number of features used to split nodes on the tree.

In [26]:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


clf = make_pipeline(
    MinMaxScaler(),
    GridSearchCV(
        MultiOutputRegressor(GradientBoostingRegressor(random_state=0)),
        dict(
            estimator__learning_rate=[0.1, 0.01, 0.001],
            estimator__n_estimators=[10, 100, 1000],
            estimator__max_features=["auto", "sqrt", "log2"],
        ),
        scoring="neg_mean_squared_error"
    )
)

clf.fit(features_2016, results_2016)

Out[26]:

Pipeline(steps=[('minmaxscaler', MinMaxScaler()),
                ('gridsearchcv',
                 GridSearchCV(estimator=MultiOutputRegressor(estimator=GradientBoostingRegressor(random_state=0)),
                              param_grid={'estimator__learning_rate': [0.1,
                                                                       0.01,
                                                                       0.001],
                                          'estimator__max_features': ['auto',
                                                                      'sqrt',
                                                                      'log2'],
                                          'estimator__n_estimators': [10, 100,
                                                                      1000]},
                              scoring='neg_mean_squared_error'))])

Take note of the best parameters found during grid search.

In [27]:

clf["gridsearchcv"].best_params_

Out[27]:

{'estimator__learning_rate': 0.01,
 'estimator__max_features': 'auto',
 'estimator__n_estimators': 1000}

Listen to the voice of the people¶

Now that we have the model, we can predict data for 2016 and evaluate how good the fit of that model is. To do that, we'll predict the percentages that each candidate will receive in each state.

In [28]:

# predict vote percentages with our trained model
preds = clf.predict(features_2016)

# get submission format
preds_2016 = submission_format_2016.copy()

# fill in our predicted values and write to csv
preds_2016.iloc[:, :] = preds
preds_2016.to_csv("gbr-model-2016.csv")

preds_2016.head()

Out[28]:

	Clinton	Trump	Other
state_abbreviation
AK	0.366021	0.513101	0.121496
AL	0.343549	0.620807	0.035777
AR	0.336536	0.605627	0.057121
AZ	0.452270	0.486831	0.061630
CA	0.616499	0.316193	0.065879

If we submit on DrivenData, we can see our score against the 2016 election:

Electoral college results¶

Just for fun, let's see what our 2016 predictions imply for electoral college results. We'll grab the data for how many electoral votes each state gets from the FEC. (Our apologies for excel pandas magic here.)

In [29]:

fec_electoral_college_url = "https://www.fec.gov/documents/1890/federalelections2016.xlsx"

electoral_by_candidate = (
    pd.read_excel(
        fec_electoral_college_url, 
        sheet_name="Table 2. Electoral &  Pop Vote", 
        skiprows=[0, 1]
    )[["STATE", "ELECTORAL VOTE", "ELECTORAL VOTE.1"]]
    .set_index("STATE")
    .loc["AL": "WY"]
    .fillna(0)
)

for col in electoral_by_candidate.columns:
    electoral_by_candidate[col] = electoral_by_candidate[col].map(
        lambda val: int((str(val).replace("*", "")))  # ignore the ** vals in data
    )

electoral_college = (
    electoral_by_candidate.sum(axis=1)
    .to_frame()
    .rename({0: "electoral_votes"}, axis=1)
)
electoral_college.head()

Out[29]:

	electoral_votes
STATE
AL	9
AK	3
AZ	11
AR	6
CA	55

How did we do? Recall, it takes 270 electoral votes to win.

In [30]:

electoral_college.sort_values(by='STATE', inplace=True)
preds_2016.sort_index(inplace=True)

preds_2016['Dems'] = np.where(
    preds_2016.Clinton > preds_2016.Trump,
    electoral_college.electoral_votes,
    0
)

preds_2016['Reps'] = np.where(
    preds_2016.Clinton < preds_2016.Trump,
    electoral_college.electoral_votes,
    0
)

In [31]:

print("===== PREDICTED ELECTORAL VOTES FOR EACH PARTY 2016 =======\n")
print(preds_2016[['Dems', 'Reps']].sum())
print("\n===== ACTUAL ELECTORAL VOTES FOR EACH PARTY 2016 =======\n")
print("Dems", 227)
print("Reps", 304)

===== PREDICTED ELECTORAL VOTES FOR EACH PARTY 2016 =======

Dems    228
Reps    303
dtype: int64

===== ACTUAL ELECTORAL VOTES FOR EACH PARTY 2016 =======

Dems 227
Reps 304

That's pretty good, but we may be overfitting and when it comes to predicting elections using polling data, as we all learned in 2016, past performance is not necessarily an indicator of future results!

2020¶

Feel the will of the people¶

Finally, it's time to see what we think about Trump v. Biden. We've got a model now to predict presidential election outcomes based poll results. We can use this model to predict the current presidential election.

In [32]:

# predict vote percentages with our trained model
preds = clf.predict(features_2020)

# get submission format
preds_2020 = submission_format_2020.copy()

# fill in our predicted values and write to csv
preds_2020.iloc[:, :] = preds
preds_2020.to_csv("gbr-model-2020.csv")

preds_2020.head()

Out[32]:

	Biden	Trump	Other
state_abbreviation
AK	0.499729	0.512873	0.041805
AL	0.345639	0.600703	0.043336
AR	0.323058	0.331472	0.055669
AZ	0.538705	0.484415	0.039734
CA	0.675142	0.253836	0.044839

We can see we now have predictions for every major candidate, for every state. We can submit this to DrivenData and mark it as our "Evaluation Submission" to indicate that these are our predictions for 2020:

But, one important question remains: who will win the election? As we saw above, that is subject to the rules of the Electoral College. The winner needs at least 270 electoral votes to become president. We can calculate that using data about how many electoral votes each state gets:

In [33]:

electoral_college.sort_values(by='STATE', inplace=True)
preds_2020.sort_index(inplace=True)

preds_2020['Dems'] = np.where(
    preds_2020.Biden > preds_2020.Trump,
    electoral_college.electoral_votes,
    0
)

preds_2020['Reps'] = np.where(
    preds_2020.Biden < preds_2020.Trump,
    electoral_college.electoral_votes,
    0
)

In [34]:

print("===== PREDICTED ELECTORAL VOTES FOR EACH PARTY 2020 =======\n")
print(preds_2020[['Dems', 'Reps']].sum())
print("\n===== ACTUAL ELECTORAL VOTES FOR EACH PARTY 2020 =======\n")
print("We don't know yet...")

===== PREDICTED ELECTORAL VOTES FOR EACH PARTY 2020 =======

Dems    406
Reps    125
dtype: int64

===== ACTUAL ELECTORAL VOTES FOR EACH PARTY 2020 =======

We don't know yet...

Ok, this first-pass model is complete. We have probably done some overfitting, and could certainly do some more feature engineering, but this is a start. It's your job to take it to the next level! And there are many next levels, see the FiveThirtyEight 2020 methodology write-up for proof. All we can do now is stay calm and model on... Just kidding, we can also

👉👉 VOTE 👈👈

blog