An Evaluation of American Presidential Elections

A Tour Through the Data Science Lifecycle

Image obtained from CNN

John Curran

This project is done independently for the University of Maryland, with all code and prose written by me. All knowledge for completing this study was obtained through the University and internet sources.

The study in this project concerns American politics, but it is meant to be completely non-partisan. The purpose of this study is not to advance a particular viewpoint, but instead to take an impartial look into what drives the changes in the American political landscape.

Introduction

American electoral politics has been in constant flux since the development of the two-party system, with different party coalitions being built and falling apart over time. Today, the Republican Party enjoys substantial advantages in the South and Great Plains, while the Democratic Party holds a tight grip on the Northeast and West Coast. However, these modern coalitions did not always exist, and they continue to shift rapidly today. The purpose of this study is to identify which factors currently contribute to electoral gains for either party and to ascertain whether those factors have shifted over time.

In 2016, Donald J. Trump was elected the 45th President of the United States with a coalition of traditional Republican voters and new, white working-class voters. This winning strategy was unique for a Republican in the modern era, as some of these voters had been loyally Democratic before he came along. At the same time, Hillary Clinton made substantial gains among more educated and affluent voters across American suburbs. In 2020, Joseph R. Biden won by expanding dramatically upon Clinton's new coalition while slipping even further among the new Trump base. Both major American parties are changing which groups they appeal to, and many factors contribute to these shifts in the electoral map across the 50 states. Preliminary hypotheses from experts suggest that income and education are quickly becoming the leading factors in the changing electoral landscape. This study will investigate these hypotheses, in addition to other factors.


Hypothesis

We will seek to identify the factors that contribute to the changing American political coalitions in the 21st century by looking through historical presidential elections in America's 3,143 counties and determining factors correlated with changes in party support between two elections. This study will cover the entire data science lifecycle, from data collection all the way to conclusive findings. For the election data, we will be looking both at raw win margin in a single election and the shift in win margin across two elections for each party. In order to do this, we must first define several terms.

  • Vote Margin: The difference between the percentage of the vote obtained by the Republican and Democratic candidate in a given election.
    • Example: in 2016, Hillary Clinton received 48.2% of the national popular vote while Donald Trump received 46.1%. Thus, the popular vote margin in 2016 was 2.1 points in favor of Hillary Clinton.
  • Vote Swing: The difference between vote margins in two different elections.
    • Example: the popular vote margin in 2004 was 2.4 points in favor of George Bush (Republican), while the popular vote margin in 2008 was 7.2 points in favor of Barack Obama (Democrat). This means that, from 2004 to 2008, the popular vote swing was 9.6 points in favor of the Democratic Party.
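To make these two definitions concrete in code, here is a minimal sketch (the helper names are my own, and the percentages are the approximate national figures cited above):

# margin: Democratic % minus Republican %, so positive favors the Democrat
def vote_margin(dem_pct, rep_pct):
    return dem_pct - rep_pct

# swing: change in margin between two elections, so positive favors the Democrats
def vote_swing(old_margin, new_margin):
    return new_margin - old_margin

margin_2004 = vote_margin(48.3, 50.7)        # about -2.4 (Bush +2.4)
margin_2008 = vote_margin(52.9, 45.7)        # about +7.2 (Obama +7.2)
print(vote_swing(margin_2004, margin_2008))  # about +9.6 toward the Democrats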

Why Is This Important?

The United States is increasingly a politically divided country. The ground is shifting beneath the politicians on both sides of the aisle, and it is often hard to discern why certain voters may be moving in a certain direction. This study attempts to offer a basic explanation for the changing American landscape by determining the factors that lead to vote swings for either party. Hopefully, this can help clarify things for political observers and open up a dialogue grounded in a more thorough understanding of the American voter.

Part 1: Data Collection

In this section, we will collect all relevant data from several sources using Python. Our analysis will be on the county level. We will start by importing several relevant Python libraries for this study.

In [1]:
# standard Python libraries for data science
import numpy as np
import pandas as pd
import sklearn
from sklearn import linear_model

# Python libraries for HTTP requests of data from web pages
import requests
from bs4 import BeautifulSoup

# standard default dictionary module
from collections import defaultdict

# Python libraries for graphing and visualizations
import matplotlib.pyplot as plt
import statsmodels.api
import statsmodels.formula.api as sm
import seaborn
import folium

Possible factors to explain changes in the American electoral landscape include:

  • Income
  • GDP per capita
  • Education
  • Race
  • Coastal or not

Data Sources

Data for personal income by county: https://www.bea.gov/data/income-saving/personal-income-county-metro-and-other-areas
Data for GDP per capita by county: https://ssti.org/blog/useful-stats-gdp-capita-county-2012-2015
Data for education by county: https://data.ers.usda.gov/reports.aspx?ID=17829
Data for race by county: https://www.census.gov/data/tables/time-series/demo/popest/2010s-counties-detail.html
List of coastal counties: https://www2.census.gov/library/stories/2018/08/coastline-counties-list.xlsx

Data for election results by county will be obtained through HTTP requests to Dave Leip's Atlas of U.S. Presidential Elections.

Obtaining Election Data Through HTTP Requests

County election results for any modern presidential election in any state can be easily retrieved from Dave Leip's Atlas. A simple HTTP request using a parameterized URL will do the trick. This way, we can transform the data into a pandas DataFrame for later use.

First, we will define a dictionary mapping each state to its FIPS value, because this is how the Atlas identifies individual states. We will also create a dictionary for state abbreviations, which will be used in the DataFrames.

In [2]:
# maps states to their corresponding FIPS value for the atlas. Alaska and Louisiana data are not available.
fips = {"Alabama": 1, "Arizona": 4, "Arkansas": 5, "California": 6, "Colorado": 8, "Connecticut" : 9, \
        "Delaware": 10, "DC": 11, "Florida": 12, "Georgia": 13, "Hawaii": 15, "Idaho": 16, "Illinois": 17, \
        "Indiana": 18, "Iowa": 19, "Kansas": 20, "Kentucky": 21, "Maine": 23, "Maryland": 24, \
        "Massachusetts": 25, "Michigan": 26, "Minnesota": 27, "Mississippi": 28, "Missouri": 29, "Montana": 30, \
        "Nebraska": 31, "Nevada": 32, "New Hampshire": 33, "New Jersey": 34, "New Mexico": 35, \
        "New York":36, "North Carolina":37, "North Dakota":38, "Ohio": 39, "Oklahoma": 40, "Oregon": 41, \
        "Pennsylvania": 42, "Rhode Island": 44, "South Carolina": 45, "South Dakota": 46, "Tennessee": 47, \
        "Texas": 48, "Utah": 49, "Vermont": 50, "Virginia": 51, "Washington": 53, "West Virginia": 54, \
        "Wisconsin": 55, "Wyoming": 56}

# map of state abbreviations for use in the DataFrames
abbs = {"Alabama": "AL", "Alaska": "AK", "Arizona": "AZ", "Arkansas": "AR", "California": "CA", "Colorado": "CO", \
        "Connecticut" : "CT", "Delaware": "DE", "DC": "DC", "Florida": "FL", "Georgia": "GA", "Hawaii": "HI", \
        "Idaho": "ID", "Illinois": "IL", "Indiana": "IN", "Iowa": "IA", "Kansas": "KS", "Kentucky": "KY", \
        "Louisiana": "LA", "Maine": "ME", "Maryland": "MD", "Massachusetts": "MA", "Michigan": "MI", \
        "Minnesota": "MN", "Mississippi": "MS", "Missouri": "MO", "Montana": "MT", "Nebraska": "NE", \
        "Nevada": "NV", "New Hampshire": "NH", "New Jersey": "NJ", "New Mexico": "NM", "New York": "NY", \
        "North Carolina": "NC", "North Dakota": "ND", "Ohio":"OH", "Oklahoma": "OK", "Oregon": "OR", \
        "Pennsylvania": "PA", "Rhode Island": "RI", "South Carolina": "SC", "South Dakota": "SD", "Tennessee": "TN", \
        "Texas": "TX", "Utah": "UT", "Vermont": "VT", "Virginia": "VA", "Washington": "WA", \
        "West Virginia": "WV", "Wisconsin": "WI", "Wyoming": "WY"}

all_states = list(abbs.keys())
atlas_states = list(fips.keys())

Function Definitions

Next, we will define our election_results() function. This is our most complex function and is what all of our data collection is built on. This function will accept a state name and election year as arguments, and return a DataFrame containing presidential election results for that year in every county in the state. These results will be in the form of a column for each major candidate containing a vote percentage for each county. This is how we will calculate the margin for every county in the state.

In [3]:
def election_results(state, year):
    request = requests.get("https://uselectionatlas.org/RESULTS/datagraph.php?year=" + \
                           str(year) + "&fips=" + str(fips[state]))
    soup = BeautifulSoup(request.content, "html.parser")
    # list of tables, each table corresponding to a county
    tables = soup.body.find("div", {"class": "info"}).find_all("table")
    results = {} # will contain results for each county
    all_candidates = {} # running list of all candidates on the ballot in this state
    
    for county in tables:
        values = county.find_all("tr") # list of candidate rows for the county		
        county_name = values[0].find("td").b.string

        # deal with exceptions with independent cities and counties of the same name
        for exception,exception_state in [("Baltimore","Maryland"), ("Fairfax","Virginia"), ("Richmond", "Virginia"), \
                                        ("Bedford", "Virginia"), ("Franklin", "Virginia"), ("Roanoke","Virginia"), \
                                          ("St. Louis", "Missouri")]:
            if county_name == exception and state == exception_state:
                if not exception + " County" in list(results.keys()):
                    county_name = exception + " County"
                else:
                    county_name = exception + " City"
                break

        # small changes in county names for consistency
        if county_name == "Dewitt" and state == "Texas":
            county_name = "DeWitt"
        elif county_name == "Desoto" and state == "Florida":
             county_name = "DeSoto"
        elif county_name == "Dade" and state == "Florida":
            county_name = "Miami-Dade"
        elif county_name == "Ormsby" and state == "Nevada":
            county_name = "Carson City"
        elif county_name == "Shannon" and state == "South Dakota":
            county_name = "Oglala Lakota"
            
        candidate_results = defaultdict(lambda: 0.0) # results for each candidate in the county
        for candidate in values[:2]: # first two candidates (Democrat, Republican)
            name = candidate.find("td", {"class":"cnd"})
            if name is None:
                name = candidate.find("td").string
            else:
                name = name.string

            candidate_results[name] = float((candidate.find("td", {"class":"per"}).string)[:-1])
            all_candidates[name] = None

        results[county_name] = candidate_results
        
    # the election results have been retrieved and compiled into dictionary form.
    # they must now be compiled into a pandas DataFrame


    # Current data format for results
    # {
    # county_name:
       # {Candidate: percentage, Candidate: percentage},
    # county_name:
       # {Candidate: percentage, Candidate: percentage},
    # etc
    # }

    # return results in the form of a DataFrame
    df = {"County": list(results.keys())}
    for candidate in all_candidates.keys(): # for every candidate
        df[candidate + "_" + str(year) + "_%"] = [results[county][candidate] for county in results.keys()]

    return pd.DataFrame(df)

We will now define our election_swings() function, which will accept a state and two election results DataFrames as arguments, returning a DataFrame containing the vote swings for each county in that state between the two elections given.

In [4]:
def election_swings(election1, election2):
    # compile results for each state
    results = election1.merge(election2, on="County", how="outer")
    candidates = results.columns[1:]
    year1 = candidates[0][-6:-2]
    year2 = candidates[2][-6:-2]

    # calculate swing
    swing = []
    for i,row in results.iterrows():
        swing.append((row[candidates[2]] - row[candidates[3]]) - (row[candidates[0]] - row[candidates[1]]))
    results["Swing_" + year1 + "_" + year2] = swing

    return results

Here is an example of what these functions return, using the state of Maine.

In [5]:
maine_example = election_results('Maine', 2020)
maine_example.head()
Out[5]:
County Biden_2020_% Trump_2020_%
0 Androscoggin 47.0 49.9
1 Aroostook 39.0 59.0
2 Cumberland 66.5 30.8
3 Franklin 46.4 50.3
4 Hancock 54.8 42.4
In [6]:
election_swings(election_results('Maine', 2016), maine_example).head()
Out[6]:
County Clinton_2016_% Trump_2016_% Biden_2020_% Trump_2020_% Swing_2016_2020
0 Androscoggin 41.4 50.8 47.0 49.9 6.5
1 Aroostook 38.1 55.3 39.0 59.0 -2.8
2 Cumberland 59.9 33.6 66.5 30.8 9.4
3 Franklin 42.6 48.0 46.4 50.3 1.5
4 Hancock 50.2 42.7 54.8 42.4 4.9

As you can see, election_results() shows what percentage of the vote each major party candidate received in each county, and election_swings() shows that information along with the vote swing between the two elections.

Note: for this study, a negative vote swing implies a swing towards the Republican party, while a positive vote swing implies a swing towards the Democratic party. A swing of 0 implies the vote margin did not change between the two elections. This was chosen arbitrarily.

Gathering the Election Data

We will now gather all election data from the 48 states where data is available (we will touch on the other two later). This can be done by looping through every state and merging the DataFrames. Because we will be making 48 HTTP requests, this step will take the longest to compute. No need to worry if it takes anywhere from several seconds up to a minute to complete if you are running this yourself. We will start by collecting all 2012-2016 vote swing data.
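If you plan to re-run the notebook, one optional convenience (not part of the original pipeline) is caching each state's swing DataFrame to disk so that repeated runs skip the HTTP requests entirely. A minimal sketch; the function name and cache directory are hypothetical:

import os
import pickle

def cached_election_swings(state, year1, year2, cache_dir='atlas_cache'):
    # load a previously saved result if one exists, otherwise fetch and save it
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, '{}_{}_{}.pkl'.format(state, year1, year2))
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = election_swings(election_results(state, year1), election_results(state, year2))
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result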

In [7]:
# swing between 2012 and 2016
elections_df = election_swings(election_results(atlas_states[0], 2012), election_results(atlas_states[0], 2016))
elections_df.loc[:, 'State'] = abbs[atlas_states[0]] # add state abbreviations

# iterate over every available state, appending the rows for each state at the end of the DataFrame
for state in atlas_states[1:]:
    new_df = election_swings(election_results(state, 2012), election_results(state, 2016))
    new_df.loc[:, 'State'] = abbs[state]
    elections_df = elections_df.append(new_df, ignore_index=True)
# reorder the columns so that County and State come first
elections_df = elections_df[[elections_df.columns[0]] + [elections_df.columns[-1]] + list(elections_df.columns[1:-1])]
elections_df.head()
Out[7]:
County State Obama_2012_% Romney_2012_% Clinton_2016_% Trump_2016_% Swing_2012_2016
0 Autauga AL 26.5 72.5 23.8 72.8 -3.0
1 Baldwin AL 21.6 77.2 19.4 76.5 -1.5
2 Barbour AL 51.3 48.2 46.5 52.1 -8.7
3 Bibb AL 26.2 72.8 21.2 76.4 -8.6
4 Blount AL 12.3 86.3 8.4 89.3 -6.9

Now let's try adding on the 1992-1996 swing data. We can always add more years, but for now we will be working with just these two sets of swings.

In [8]:
# do the same thing for 1992-1996 election data

df_1992 = election_swings(election_results(atlas_states[0], 1992), election_results(atlas_states[0], 1996))
df_1992.loc[:, 'State'] = abbs[atlas_states[0]]

for state in atlas_states[1:]:
    new_df = election_swings(election_results(state, 1992), election_results(state, 1996))
    new_df.loc[:, 'State'] = abbs[state]
    df_1992 = df_1992.append(new_df, ignore_index=True)

df_1992 = df_1992[[df_1992.columns[0]] + [df_1992.columns[-1]] + list(df_1992.columns[1:-1])]
# outer merge, in order to allow us to address missing data directly
elections_df = elections_df.merge(df_1992, on=['County', 'State'], how='outer')
elections_df.head()
Out[8]:
County State Obama_2012_% Romney_2012_% Clinton_2016_% Trump_2016_% Swing_2012_2016 Clinton_1992_% Bush_1992_% Clinton_1996_% Dole_1996_% Swing_1992_1996
0 Autauga AL 26.5 72.5 23.8 72.8 -3.0 30.9 55.9 32.5 61.7 -4.2
1 Baldwin AL 21.6 77.2 19.4 76.5 -1.5 26.2 56.5 27.1 62.6 -5.2
2 Barbour AL 51.3 48.2 46.5 52.1 -8.7 46.4 42.9 53.5 40.5 9.5
3 Bibb AL 26.2 72.8 21.2 76.4 -8.6 43.2 46.5 44.0 48.2 -0.9
4 Blount AL 12.3 86.3 8.4 89.3 -6.9 32.9 53.8 33.0 59.1 -5.2

Now we have all of the election data that we want to begin with, so it is time to move on to collecting the rest of the data. This will be done by downloading Excel sheets from the data sources listed above.

Gathering Other County Data

Remember: the preliminary independent variables for our study are Income, GDP per capita, Education, Race, and Coastal or not (binary variable). We will start with income.

Here is the format we will be using for each DataFrame:

  • Income: average income per capita for each county
  • GDP per Capita: GDP per capita for each county
  • Education: the percentage of the population in each county that has completed a college education
  • Race: the percentage of the population in each county that is non-Hispanic white
  • Coastal: list of counties that are determined to be coastal by the US Census

Income Data

We will now gather per capita income data from the above source and compile it into a DataFrame.

Note: the footnotes in this Excel sheet were manually removed for formatting purposes. In addition, the formatting for Virginia independent cities has been fixed so that they are interpreted simply as counties in the state, and Virginia "combination areas" have been removed. Finally, a county of "Washington" was added under the District of Columbia section and the title of that section was changed to "DC" for formatting purposes. This is because, in this study, DC is formatted as the state of DC with the lone county of Washington.

In [9]:
income_df = pd.read_excel('income_by_county.xlsx', header=3, usecols='A,D', names=['County', 'Income'])

GDP per Capita Data

Now we will collect the data for GDP per capita for each county.

In [10]:
gdp_df = pd.read_excel('gdp_per_capita_per_county.xlsx', header=1, usecols='B,C,P', \
                       names=['County', 'State', 'GDP'])

Education Data

We will now move on to gathering education data for each county. Specifically, we will determine the percentage of the population in each county that has completed a college education. Because this factor will be used extensively in this study, we will record the data for each decade starting in 1990. Unfortunately, each state must be downloaded as its own separate Excel sheet. I have given them all similar names for convenience.

Note: the Florida and South Dakota datasets were manually edited to merge the Dade/Miami-Dade and Shannon/Oglala Lakota entries, since each pair refers to the same county under an old and a new name.

In [11]:
# we will store all of the education DataFrames in a dictionary

education_dfs = {}
for state in all_states:
    education_dfs[state] = pd.read_excel('education/' + abbs[state].lower() + '_education.xlsx', header=2, \
                                         usecols='B,F,G,I', names=['County', 'College_1990', 'College_2000', 'College'])\
                                        .drop(index=0)
    education_dfs[state] = education_dfs[state][education_dfs[state]['College'].notna()]

Demographics Data

Finally, we will collect race demographics data for each county. Specifically, we are looking at what percentage of the population is non-Hispanic white.

Note: the dataset provided is far too large to use efficiently, as it splits all 3,000+ counties up by 19 age groups and 12 different measurements of the population across 10 years. This leads to more than 700,000 rows. In order to simplify this, I manually deleted all rows in the Excel sheet that were not from measurement 12 (the most recent measurement) and age group 0 (total).
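For reproducibility, the same reduction could be done in pandas rather than by hand-editing the spreadsheet. The sketch below assumes the raw download is a CSV whose measurement and age-group codes live in columns named YEAR and AGEGRP; the file and column names are assumptions to be checked against the actual download:

# hypothetical file and column names -- verify against the downloaded file before using
raw = pd.read_csv('demographics_raw.csv')
reduced = raw[(raw['YEAR'] == 12) & (raw['AGEGRP'] == 0)]  # measurement 12, total age group
reduced.to_excel('demographics_by_county.xlsx', index=False)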

In [12]:
demographics_df = pd.read_excel('demographics_by_county.xlsx', usecols='D,E,G,H,AI,AJ', \
                                names=['State', 'County', 'Age Group', 'Population', 'White Male', 'White Female'])

Coastal Data (Binary)

Here, we will set a binary variable (1 or 0) denoting whether the given county borders the Pacific Ocean, Atlantic Ocean, or Gulf of Mexico.

In [13]:
coastline_df = pd.read_excel('coastline_counties.xlsx', header=3, usecols='D,E', names=['County','State'])\
                            .dropna(how='any')

Part 2: Data Cleaning

Now that we have collected all of our data, it must be cleaned up and put into a format for merging all DataFrames together. This can be tricky, as not all data sources use the same format for county names. We will address each DataFrame individually. After formatting the data to be merged properly, we must deal with any missing data.

Formatting the DataFrames

In order to merge all of our DataFrames into a single DataFrame, they must all have the same format for county names. The format we will be using is one column for the county name and one column for the state abbreviation, similar to the election swing DataFrame we have already created.

Formatting Income Data

In [14]:
income_df.loc[:, 'State'] = np.nan

# used to check NaN values in String columns
def isnan(x):
    try:
        return np.isnan(x)
    except:
        return False
    
current_state = ''
rows_to_delete = []
for i,row in income_df.iterrows():
    # this process aligns with the formatting of the table
    # we do not want to include whole states, so we will check
    # which rows are states and add their abbreviation to
    # every subsequent county, dropping the state at the end
    if isnan(income_df['County'][i]):
        income_df.loc[i+1, 'Income'] = np.nan
        rows_to_delete.append(i)
    elif isnan(income_df['Income'][i]):
        current_state = abbs[row['County']]
        rows_to_delete.append(i)
    else:
        income_df.loc[i, 'State'] = current_state
        
income_df = income_df.drop(index=rows_to_delete).dropna(how='all')
income_df = income_df[['County', 'State', 'Income']]

income_df.head()     
Out[14]:
County State Income
2 Autauga AL 43917.0
3 Baldwin AL 47485.0
4 Barbour AL 35763.0
5 Bibb AL 31725.0
6 Blount AL 36412.0

Formatting GDP per Capita Data

Fortunately, this data is already perfectly formatted.

In [15]:
gdp_df.head()
Out[15]:
County State GDP
0 Los Angeles CA 68352.427798
1 Cook IL 72939.126137
2 Harris TX 86959.562434
3 Maricopa AZ 51848.202103
4 San Diego CA 65061.817106

Formatting Education Data

In [16]:
for state in all_states:
    education_dfs[state].loc[:, 'State'] = \
                                    education_dfs[state]['County'][education_dfs[state].index.values[0]].strip()[-2:]
    
    # remove the word 'county' from each county name and add the relevant state
    education_dfs[state]['County'] = [county.strip()[:-4] for county in education_dfs[state]['County']]
    education_dfs[state] = education_dfs[state][['County', 'State', 'College', 'College_1990', 'College_2000']]
    
education_df = education_dfs[all_states[0]]
for state in all_states[1:]:
    education_df = education_df.append(education_dfs[state], ignore_index=True)
    
education_df['College'] = education_df['College'] * 100 # adjust for percentage
education_df['College_1990'] = education_df['College_1990'] * 100
education_df['College_2000'] = education_df['College_2000'] * 100

education_df.head()
Out[16]:
County State College College_1990 College_2000
0 Autauga AL 27.689286 14.505537 18.021675
1 Baldwin AL 31.345883 16.820637 23.066347
2 Barbour AL 12.215925 11.826519 10.944115
3 Bibb AL 11.489227 4.729260 7.104874
4 Blount AL 12.642895 7.024286 9.598837

Formatting Demographics Data

In [17]:
# only use Age Group 0 (total cumulative age groups)
demographics_df = demographics_df[demographics_df['Age Group'] == 0][['County', 'State', 'Population', \
                                                                      'White Male', 'White Female']]

# combine white male and white female population
demographics_df['White Population'] = demographics_df['White Male'] + demographics_df['White Female']

# gather percentage by dividing by whole population
demographics_df['White'] = demographics_df['White Population'] / demographics_df['Population']

# this is for inconsistencies in county names 
demographics_df['County'] = demographics_df['County'].apply(lambda c: \
                                                            c.replace('District of Columbia', 'Washington County'))
demographics_df['State'] = demographics_df['State'].apply(lambda c: \
                                                            c.replace('District of Columbia', 'DC'))

# add state abbreviations
demographics_df['State'] = [abbs[s] for s in demographics_df['State']]

demographics_df['County'] = [c.replace('Parish', 'County')[:-7]  if c.replace('Parish', 'County')[-6:] == 'County' \
                             else c.replace('Parish', 'County')[:-5] for c in demographics_df['County']]

demographics_df = demographics_df[['County', 'State', 'Population', 'White']]

demographics_df['White'] = demographics_df['White'] * 100 # adjust for percentage

demographics_df.head()
Out[17]:
County State Population White
0 Autauga AL 55869 73.770785
1 Baldwin AL 223234 83.207307
2 Barbour AL 24686 45.511626
3 Bibb AL 22394 74.408324
4 Blount AL 57826 86.770657

Formatting the Coastline List

In [18]:
coastline_df['County'] = coastline_df['County'].apply(lambda c: str(c).replace(' County', '').replace(' Parish', ''))
coastline_df['State'] = coastline_df['State'].apply(lambda s: abbs[s])
coastline_df.head()
Out[18]:
County State
0 Baldwin AL
1 Mobile AL
2 Aleutians East Borough AK
3 Aleutians West Census Area AK
4 Anchorage Borough AK

Merging all DataFrames

Now that we have completed our formatting, we have 6 DataFrames containing all of the data we need to get started on our analysis.

  • elections_df: contains all presidential election results data, from margins to swings
  • income_df: contains all average income per capita data
  • gdp_df: contains all GDP per capita data
  • education_df: contains all college attainment rate data
  • demographics_df: contains the total population and white population data
  • coastline_df: list of all counties determined to be coastal by the US Census

We must now merge all of these DataFrames into a single master DataFrame, containing all necessary data to be used in our analysis. After doing this, and cleaning up any missing data, we can move on to analyzing our data.

In [19]:
df = elections_df.merge(income_df, on=['County', 'State'], how='outer')\
                 .merge(gdp_df, on=['County', 'State'], how='outer')\
                 .merge(education_df, on=['County', 'State'], how='outer')\
                 .merge(demographics_df, on=['County', 'State'], how='outer')

# add a Coastal column, originally set to all 0, and then change the coastal counties to 1
df.loc[:, 'Coastal'] = 0
for i,row in coastline_df.iterrows():
    df.loc[((df['County'] == (row['County'])) & (df['State'] == (row['State']))), 'Coastal'] = 1

df.head()
Out[19]:
County State Obama_2012_% Romney_2012_% Clinton_2016_% Trump_2016_% Swing_2012_2016 Clinton_1992_% Bush_1992_% Clinton_1996_% Dole_1996_% Swing_1992_1996 Income GDP College College_1990 College_2000 Population White Coastal
0 Autauga AL 26.5 72.5 23.8 72.8 -3.0 30.9 55.9 32.5 61.7 -4.2 43917.0 28071.884460 27.689286 14.505537 18.021675 55869.0 73.770785 0
1 Baldwin AL 21.6 77.2 19.4 76.5 -1.5 26.2 56.5 27.1 62.6 -5.2 47485.0 31726.371985 31.345883 16.820637 23.066347 223234.0 83.207307 1
2 Barbour AL 51.3 48.2 46.5 52.1 -8.7 46.4 42.9 53.5 40.5 9.5 35763.0 28319.334450 12.215925 11.826519 10.944115 24686.0 45.511626 0
3 Bibb AL 26.2 72.8 21.2 76.4 -8.6 43.2 46.5 44.0 48.2 -0.9 31725.0 14286.024556 11.489227 4.729260 7.104874 22394.0 74.408324 0
4 Blount AL 12.3 86.3 8.4 89.3 -6.9 32.9 53.8 33.0 59.1 -5.2 36412.0 14231.776350 12.642895 7.024286 9.598837 57826.0 86.770657 0

Fixing Missing Data

Given that we gathered data from several different sources, there are likely to be discrepancies with county and state names that result in missing data. In addition, some counties that existed in 1992 may not exist today, and vice versa. Finally, Dave Leip's Atlas does not readily provide data for Louisiana. We must manage this missing data before moving on to the exploratory phase.

The Alaska Case

Unfortunately, the state of Alaska does not report presidential election results by county, instead reporting them by state legislative district. This makes Alaska rather useless in our analysis. Fortunately, Alaska is a unique and geographically distinct state from the other 49, and removing it from our analysis will not introduce much bias when we are working with so many other counties.

In [20]:
df = df[df['State'] != 'AK']

Evaluating Missing Data

We will now check through all of the rows in our DataFrame that are missing data and determine a cause and potential solution. Before we begin, here are some potential causes for missing data that can be addressed.

  1. County did not exist during both sets of elections
  2. Inconsistent county names across the datasets
  3. Data is actually unavailable
In [21]:
len(df[df.isnull().any(axis=1)])
Out[21]:
218

As we can see, there are 218 rows containing missing values somewhere. This is not ideal. After scouring through all of the rows in the DataFrame, I was able to identify the patterns that explain almost all of the missing data.
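A quick tally of where the missing values fall helps surface these patterns; for example:

missing = df[df.isnull().any(axis=1)]
print(missing['State'].value_counts().head(10))  # which states contribute the most incomplete rows
print(missing.isnull().sum())                    # which columns are missing most often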

Identified Causes for Missing Values

  1. Inconsistent county names across datasets
    • This is especially prevalent among independent cities in Virginia
  2. Grouping of counties makes data unavailable
  3. Counties that no longer exist
  4. Louisiana data is not offered

Addressing the inconsistent county names is fairly simple. All we have to do is define a helper function to change county names and apply it (along with a few direct string replacements) to every situation that calls for it.

In [22]:
# function that changes a county name if the county and state values match
def replace_county(dataframe, county, state, new_county):
    try:
        index = dataframe[(dataframe['County'] == county) & (dataframe['State'] == state)].index.values[0]
        dataframe.loc[index, 'County'] = new_county
        return dataframe
    except:
        return dataframe

# some basic name inconsistencies that can be easily fixed
elections_df = replace_county(elections_df, 'District of Columbia', 'DC', 'Washington')
elections_df = replace_county(elections_df, 'Dewitt', 'TX', 'DeWitt')
elections_df = replace_county(elections_df, 'Desoto', 'FL', 'DeSoto')
elections_df = replace_county(elections_df, 'Dade', 'FL', 'Miami-Dade')
elections_df = replace_county(elections_df, 'Shannon', 'SD', 'Oglala Lakota')
elections_df = replace_county(elections_df, 'Lac Qui Parle', 'MN', 'Lac qui Parle')
elections_df = replace_county(elections_df, 'Dona Ana', 'NM', 'Doña Ana')


# all of these places are both a county and an independent city that share a name.
# this addresses that problem of duplicate names.
for dataframe in [elections_df, income_df, gdp_df, education_df, demographics_df]:
    for county,state in [("Baltimore","MD"), ("Fairfax","VA"), ("Richmond", "VA"), \
                               ("Bedford", "VA"), ("Franklin", "VA"), ("Roanoke","VA"), \
                               ("St. Louis", "MO")]:
        replace_county(dataframe, county, state, county + ' County')
        replace_county(dataframe, county, state, county + ' City')

# more inconsistent names
gdp_df = replace_county(gdp_df, 'District of Columbia', 'DC', 'Washington')
education_df = replace_county(education_df, 'District of Columbia', 'DC', 'Washington')
education_df = replace_county(education_df, 'La Salle', 'LA', 'LaSalle')
demographics_df = replace_county(demographics_df, 'Carson', 'NV', 'Carson City')
        
# more inconsistent county names across the data sets. These must all be fixed
income_df['County'] = income_df['County'].apply(lambda c: c.replace(' (includes Yellowstone National Park)', '')\
                                                           .replace('Lagrange', 'LaGrange')\
                                                           .replace('Maui + Kalawao', 'Maui'))

gdp_df['County'] = gdp_df['County'].apply(lambda c: c.replace('(Independent City)', 'City')\
                                                      .replace(' (includes Yellowstone Park)', '')\
                                                      .replace('Lagrange', 'LaGrange')\
                                                      .replace('Carson City City', 'Carson City')\
                                                      .replace('Maui + Kalawao', 'Maui'))

education_df['County'] = education_df['County'].apply(lambda c: c.replace('Dona Ana', 'Doña Ana'))
In [23]:
# now that the formatting inconsistencies have been fixed, we can merge the data together again
df = elections_df.merge(income_df, on=['County', 'State'], how='outer')\
                 .merge(gdp_df, on=['County', 'State'], how='outer')\
                 .merge(education_df, on=['County', 'State'], how='outer')\
                 .merge(demographics_df, on=['County', 'State'], how='outer')

df.loc[:, 'Coastal'] = 0
for i,row in coastline_df.iterrows():
    df.loc[((df['County'] == (row['County'])) & (df['State'] == (row['State']))), 'Coastal'] = 1

df = df[df['State'] != 'AK'] # remove Alaska data again
In [24]:
# check again how many rows have missing data
len(df[df.isnull().any(axis=1)])
Out[24]:
160

After fixing the inconsistent county names, we are left with 160 rows with missing data. These rows are the result of all of Louisiana's counties missing election data, city and county groupings in Virginia that leave income and GDP data unavailable, a county in Colorado that did not exist in 1992, and a county in Hawaii so small that election data is not recorded.

The new Colorado county and tiny Hawaii county can simply be dropped, as their data would be irrelevant to this study. Unfortunately, we are going to have to drop all of the Virginia rows with missing data as well, because the grouping of counties and independent cities cannot be recovered. Fortunately, this data is Missing Completely at Random (MCAR), meaning there is no pattern to explain which counties are missing, so it is relatively safe to drop.

In [25]:
df = df.drop(df[(df['State'] == 'VA') & (df.isnull().any(axis=1))].index.values)
df = df.drop(df[(df['County'] == 'Broomfield') | (df['County'] == 'Kalawao')].index.values)
In [26]:
len(df[df.isnull().any(axis=1)])
Out[26]:
64

This leaves just the 64 Louisiana counties. This missing data will be dealt with in the exploratory phase.

Part 3: Exploratory Data Analysis

Now that we have put together our comprehensive DataFrame, we can get to work. Our one remaining problem is the missing Louisiana election data. Luckily, this data is Missing at Random (MAR), meaning that the cause of the missing data is an observed characteristic (whether the county is in Louisiana or not). That means there are several methods to impute this missing data with some level of confidence.

Missing Data Imputation

One such method for missing data imputation is hot-deck imputation, where we fill in missing values from another row that matches the given data point as closely as possible on characteristics related to the missing values. To the despair of many, the American South is one of the most racially polarized regions in the United States, meaning race is a very accurate predictor of how a certain area will vote. With this in mind, the racial demographics of a given county in the South will be a good predictor of how a similar county in Louisiana votes. We can use this information to do some hot-deck imputation. We will give race a weight of 90% and education a weight of 10% when determining the southern county most similar to each Louisiana county.
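Written out, the similarity measure used in the function below for a Louisiana county $i$ and a candidate southern county $j$ is

$d(i, j) = 0.9\,|\text{White}_{i} - \text{White}_{j}| + 0.1\,|\text{College}_{i} - \text{College}_{j}|$

and the southern county with the smallest $d$ donates its election results to the Louisiana county.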

For this imputation, we will define the southern states as Arkansas, Tennessee, Mississippi, Alabama, Georgia, South Carolina, and North Carolina. Other southern states do not carry the same level of racial polarization and will thus not be as effective in predicting the missing data for Louisiana.

First, we will define a function for determining which southern county is most similar to a given Louisiana county.

In [27]:
# define the southern states
southern_states = df[(df['State'] == 'AR') | (df['State'] == 'TN') | (df['State'] == 'MS') | (df['State'] == 'AL') | \
                    (df['State'] == 'GA') | (df['State'] == 'SC') | (df['State'] == 'NC')]

# this is an algorithm that will determine the most similar county to the county given as an argument
# this algorithm uses the formula described above
def similar_southern_county(row):
    most_similar = None
    distance = 101
    for i,r in southern_states.iterrows():
        new_distance = (0.9 * abs(row['White'] - r['White'])) + \
                        (0.1 * abs(row['College'] - r['College']))
        
        if new_distance < distance:
            most_similar = r
            distance = new_distance
            
    return most_similar

Now it's time for the imputation.

In [28]:
for i,row in df[df['State'] == 'LA'].iterrows():
    similar_county = similar_southern_county(row)
    # we will leave 1990s data blank because these education and race factors might not apply back then
    for col in df.columns[2:7]:
        df.loc[i, col] = similar_county[col]

df[df['State'] == 'LA'].head()
Out[28]:
County State Obama_2012_% Romney_2012_% Clinton_2016_% Trump_2016_% Swing_2012_2016 Clinton_1992_% Bush_1992_% Clinton_1996_% Dole_1996_% Swing_1992_1996 Income GDP College College_1990 College_2000 Population White Coastal
3051 Acadia LA 24.1 74.8 17.5 80.3 -12.1 NaN NaN NaN NaN NaN 37786.0 25215.117392 13.372207 8.444971 9.448177 62045.0 77.316464 0
3052 Allen LA 25.2 73.5 19.5 78.1 -10.3 NaN NaN NaN NaN NaN 34644.0 23929.340616 12.724334 6.748605 9.258488 25627.0 70.854958 0
3053 Ascension LA 37.7 60.9 33.0 63.0 -6.8 NaN NaN NaN NaN NaN 50671.0 67764.391502 26.793090 9.263189 14.503437 126604.0 67.527882 0
3054 Assumption LA 41.5 57.7 33.9 64.4 -14.3 NaN NaN NaN NaN NaN 47947.0 18402.052850 10.225883 6.721411 7.362431 21891.0 65.396738 0
3055 Avoyelles LA 33.9 65.6 27.8 70.8 -11.3 NaN NaN NaN NaN NaN 39001.0 20121.923883 12.085658 7.395432 8.253777 40144.0 64.931248 0

Given the politics of the American South, we can expect many of these county estimations to be accurate within a margin of error. A brief check on the New York Times election results page can confirm that this hot-deck imputation did exactly what we wanted it to.

Congratulations! We officially have no more missing data for 2012-2016. Now we can get started building a preliminary model!

Preliminary Linear Regression Model

We are going to start by developing a preliminary regression model. We can expect that an early model will have many problems in need of addressing. This section is just to get a feel for linear regression; finalizing our model will occur in a later section.

A linear regression model is used to predict values in a system with a linear correlation between a dependent variable and one or more independent variables. For our initial hypothesis, we predict that the vote swing between 2012 and 2016 in each county is correlated with income, GDP per capita, college attainment, racial demographics, and whether the county is coastal. We can use this hypothesis to create a simple regression model that will help us determine how correlated each characteristic is with vote swing.

Here is the equation that our hypothesis rests on.

$\text{Vote Swing} = \beta_{0} + \beta_{1}(\text{Income per Capita}) + \beta_{2}(\text{GDP per Capita}) + \beta_{3}(\text{College Attainment}) + \beta_{4}(\text{White Population}) + \beta_{5}(\text{Coastal})$

In [29]:
# use the statsmodels library to create a linear regression model from a formula and a DataFrame
InitialModel = sm.ols(formula='Swing_2012_2016 ~ Income + GDP + College + White + Coastal', data=df).fit()
print(InitialModel.summary2())
                  Results: Ordinary least squares
===================================================================
Model:              OLS              Adj. R-squared:     0.468     
Dependent Variable: Swing_2012_2016  AIC:                20803.5875
Date:               2020-12-18 17:58 BIC:                20839.7268
No. Observations:   3051             Log-Likelihood:     -10396.   
Df Model:           5                F-statistic:        536.8     
Df Residuals:       3045             Prob (F-statistic): 0.00      
R-squared:          0.468            Scale:              53.451    
--------------------------------------------------------------------
                Coef.   Std.Err.     t      P>|t|    [0.025   0.975]
--------------------------------------------------------------------
Intercept      -5.6081    0.6821   -8.2218  0.0000  -6.9456  -4.2707
Income         -0.0000    0.0000   -0.3603  0.7186  -0.0000   0.0000
GDP            -0.0000    0.0000   -1.0060  0.3145  -0.0000   0.0000
College         0.5904    0.0194   30.4978  0.0000   0.5524   0.6283
White          -0.2253    0.0069  -32.7635  0.0000  -0.2387  -0.2118
Coastal        -0.7062    0.5426   -1.3014  0.1932  -1.7702   0.3578
-------------------------------------------------------------------
Omnibus:             355.798       Durbin-Watson:          1.128   
Prob(Omnibus):       0.000         Jarque-Bera (JB):       1450.791
Skew:                0.515         Prob(JB):               0.000   
Kurtosis:            6.217         Condition No.:          333902  
===================================================================
* The condition number is large (3e+05). This might indicate
strong multicollinearity or other numerical problems.

Here are some statistics in this model worth reviewing:

  • coef: the estimated coefficient relating the independent variable to the dependent variable
  • Adj. R-squared: the share of variation in the dependent variable explained by the independent variables, adjusted for the number of predictors
  • P>|t|: the p-value, measuring the significance of the independent variable in our model; a value below 0.05 indicates significance at the 95% confidence level
  • [0.025 0.975]: the 95% confidence interval for the given coefficient value

As we can see from this, our model explains 46.8% of the variation in vote swing across the counties. It also seems that, contrary to our original theory, income and GDP, at least, are not significantly correlated with vote swing.

We can determine from the coefficients that, on average, a 1 percentage point increase in college education in a given county will lead to a 0.5904 point increase in the county's swing towards the Democratic Party between two elections, all else equal. Conversely, a 1 percentage point increase in the white population will lead to a 0.2253 point increase in the county's swing towards the Republican party, all else equal. We will have to work on our model to determine if there is any correlation between whether a county is coastal and its vote swing.
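If you prefer to pull these numbers programmatically rather than reading them off the summary table, the fitted statsmodels results object exposes them directly; a brief sketch:

print(InitialModel.rsquared_adj)               # adjusted R-squared
print(InitialModel.params['College'])          # estimated coefficient for college attainment
print(InitialModel.pvalues['College'])         # its p-value
print(InitialModel.conf_int().loc['College'])  # its 95% confidence interval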

Diagnosing Problems

We should now look at potential problems with our model beyond the basic statistics that can be fixed in the next section.

Multicollinearity

Multicollinearity is when two or more of the independent variables are highly correlated with each other. This can cause major problems with the model. We can check for any instances of multicollinearity with a simple correlation matrix of our DataFrame.

In [30]:
df[['Income', 'GDP', 'College', 'White', 'Coastal']].corr()
Out[30]:
Income GDP College White Coastal
Income 1.000000 0.542345 0.666162 0.061213 0.230354
GDP 0.542345 1.000000 0.375878 -0.100608 0.118444
College 0.666162 0.375878 1.000000 0.014349 0.211948
White 0.061213 -0.100608 0.014349 1.000000 -0.164453
Coastal 0.230354 0.118444 0.211948 -0.164453 1.000000

This correlation matrix underscores some of the problems with our original model. It seems, unsurprisingly, that GDP and income are fairly strongly correlated. College attainment and income are even more strongly correlated. Between the high p-values in the model and the high correlations with other variables here, it is becoming clear that income, at the very least, is not necessary for our model and is in fact hurting it.
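Beyond eyeballing pairwise correlations, a common way to quantify multicollinearity is the variance inflation factor (VIF), which measures how well each predictor is explained by the others; values well above 5 or 10 are a warning sign. A minimal sketch using statsmodels (the 'const' row can be ignored):

from statsmodels.stats.outliers_influence import variance_inflation_factor

# build the predictor matrix with an intercept, then print each column's VIF
predictors = statsmodels.api.add_constant(df[['Income', 'GDP', 'College', 'White', 'Coastal']].dropna())
for i, name in enumerate(predictors.columns):
    print(name, variance_inflation_factor(predictors.values, i))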

Diagnostic Plots

Now that we have addressed multicollinearity, it is time to look at some diagnostic plots for our model. Specifically, we will be looking at the Fitted Values vs. Residuals plot, Histogram of Residuals plot, and QQ Plot.

  • The Fitted Values vs. Residuals plot shows the model's fitted swing value for each county against its residual (how far off the prediction was). This plot should have approximately the same variance around the mean across all fitted values, and that mean should be 0.
  • The histogram of residuals plots the probability histogram of the residuals/errors for all counties in the model. This plot should look approximately like a normal distribution.
  • The QQ plot plots the residuals of the models against a theoretical normal distribution. The values should be close to identical, so this graph should look like a 45 degree increase.

Let's define a function for producing diagnostic plots and examine the results.

In [31]:
# this function takes in a model and some dimensions, and displays the diagnostic plots in the given dimensions
def diagnostic_plots(model, x, y):
    
    # use matplotlib to plot some diagnostic plots
    figure, (axis1, axis2, axis3) = plt.subplots(1,3) # creates a figure for the three plots
    figure.set_size_inches(x,y)

    axis1.scatter(model.fittedvalues, model.resid)
    axis1.plot(model.fittedvalues, [0 for i in model.fittedvalues], color='red') # line on 0 for measuring
    axis1.set_title('Fitted Values vs. Residuals')
    axis1.set_xlabel('Fitted Values')
    axis1.set_ylabel('Residuals')

    axis2.hist(model.resid)
    axis2.set_title('Residuals Histogram')
    axis2.set_xlabel('Residuals')

    statsmodels.api.qqplot(model.resid, line='45', ax=axis3)
    axis3.set_title('Residuals QQ Plot')
    plt.show()
In [32]:
diagnostic_plots(InitialModel, 18, 5)

These diagnostic plots, especially the QQ plot, are far from ideal. It appears from all three plots that our swing data is heavily skewed. This is likely explained by an abundance of very small counties with very large swings. It might also be explained by the unique case of Utah, a state that swung dramatically towards the Democrats in 2016 because of Mormon distaste for Donald Trump (more on this later). These problems can be easily addressed when we work to perfect our model in the next section.

Early Visualizations

We can use this section to exercise our data visualization skills. We will create some informative graphs of our data using Matplotlib's pyplot.

Let's start by graphing out some correlations between education and vote swing in some different geographic regions across the United States.

In [33]:
# define subgroups of our data set based on certain states
df_newengland = df[(df['State'] == 'ME') | (df['State'] == 'NH') | (df['State'] == 'VT') | (df['State'] == 'MA') | \
                    (df['State'] == 'RI') | (df['State'] == 'CT')]

df_south = df[(df['State'] == 'AR') | (df['State'] == 'TN') | (df['State'] == 'MS') | (df['State'] == 'AL') | \
                    (df['State'] == 'GA') | (df['State'] == 'SC') | (df['State'] == 'NC') | (df['State'] == 'LA')]

df_southwest = df[(df['State'] == 'TX') | (df['State'] == 'CO') | (df['State'] == 'NM') | (df['State'] == 'AZ') | \
                    (df['State'] == 'NV')]

df_west = df[(df['State'] == 'WA') | (df['State'] == 'OR') | (df['State'] == 'CA')]

df_midwest = df[(df['State'] == 'MI') | (df['State'] == 'WI') | (df['State'] == 'MN') | (df['State'] == 'OH') | \
                    (df['State'] == 'IN') | (df['State'] == 'IL') | (df['State'] == 'IA')]

df_mountainwest = df[(df['State'] == 'MT') | (df['State'] == 'WY') | (df['State'] == 'ID') | (df['State'] == 'ND') | \
                    (df['State'] == 'SD')]
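
As an aside, the same subsets can be written more compactly with pandas' isin; for example, this is equivalent to the df_newengland definition above:

# select all rows whose State abbreviation is in the New England list
new_england = ['ME', 'NH', 'VT', 'MA', 'RI', 'CT']
df_newengland = df[df['State'].isin(new_england)]
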
In [34]:
# create a figure for each geographic region
figure, axes = plt.subplots(3,2)
figure.set_size_inches(15, 18)
figure.suptitle('Education and Vote Shift: 2012-2016', fontsize=20)

# plot the college education against vote swing for each region
df_newengland.plot.scatter('College', 'Swing_2012_2016', title='New England', ax=axes[0][0])
m,b = np.polyfit(df_newengland['College'], df_newengland['Swing_2012_2016'], 1) # this includes a regression line
axes[0][0].plot(df_newengland['College'], m*df_newengland['College'] + b, color='red')
axes[0][0].set_xlabel('College Attainment Rate')
axes[0][0].set_ylabel('Swing Towards Democrats from 2012 to 2016')

df_south.plot.scatter('College', 'Swing_2012_2016', title='The Deep South', ax=axes[0][1])
m,b = np.polyfit(df_south['College'], df_south['Swing_2012_2016'], 1)
axes[0][1].plot(df_south['College'], m*df_south['College'] + b, color='red')
axes[0][1].set_xlabel('College Attainment Rate')
axes[0][1].set_ylabel('Swing Towards Democrats from 2012 to 2016')

df_southwest.plot.scatter('College', 'Swing_2012_2016', title='The Southwest', ax=axes[1][0])
m,b = np.polyfit(df_southwest['College'], df_southwest['Swing_2012_2016'], 1)
axes[1][0].plot(df_southwest['College'], m*df_southwest['College'] + b, color='red')
axes[1][0].set_xlabel('College Attainment Rate')
axes[1][0].set_ylabel('Swing Towards Democrats from 2012 to 2016')

df_west.plot.scatter('College', 'Swing_2012_2016', title='West Coast', ax=axes[1][1])
m,b = np.polyfit(df_west['College'], df_west['Swing_2012_2016'], 1)
axes[1][1].plot(df_west['College'], m*df_west['College'] + b, color='red')
axes[1][1].set_xlabel('College Attainment Rate')
axes[1][1].set_ylabel('Swing Towards Democrats from 2012 to 2016')

df_midwest.plot.scatter('College', 'Swing_2012_2016', title='The Midwest', ax=axes[2][0])
m,b = np.polyfit(df_midwest['College'], df_midwest['Swing_2012_2016'], 1)
axes[2][0].plot(df_midwest['College'], m*df_midwest['College'] + b, color='red')
axes[2][0].set_xlabel('College Attainment Rate')
axes[2][0].set_ylabel('Swing Towards Democrats from 2012 to 2016')

df_mountainwest.plot.scatter('College', 'Swing_2012_2016', title='Mountain West', ax=axes[2][1])
m,b = np.polyfit(df_mountainwest['College'], df_mountainwest['Swing_2012_2016'], 1)
axes[2][1].plot(df_mountainwest['College'], m*df_mountainwest['College'] + b, color='red')
axes[2][1].set_xlabel('College Attainment Rate')
axes[2][1].set_ylabel('Swing Towards Democrats from 2012 to 2016')

plt.show()

As we can see, most geographic regions in the US show a very strong correlation between education and vote swing. Generally, counties with a higher level of college attainment swung more towards the Democratic Party in 2016.

Now, let's see the same for the swings from 1992-1996.

In [35]:
# adjusted to remove Louisiana, because that data is still missing for 1992-1996
df_south = df[(df['State'] == 'AR') | (df['State'] == 'TN') | (df['State'] == 'MS') | (df['State'] == 'AL') | \
                    (df['State'] == 'GA') | (df['State'] == 'SC') | (df['State'] == 'NC')]
In [36]:
# do the same process, but using 1992-1996 data

figure, axes = plt.subplots(3,2)
figure.set_size_inches(15, 18)
figure.suptitle('Education and Vote Shift: 1992-1996', fontsize=20)

df_newengland.plot.scatter('College_1990', 'Swing_1992_1996', title='New England', ax=axes[0][0], c='red')
m,b = np.polyfit(df_newengland['College_1990'], df_newengland['Swing_1992_1996'], 1)
axes[0][0].plot(df_newengland['College_1990'], m*df_newengland['College_1990'] + b)
axes[0][0].set_xlabel('College Attainment Rate')
axes[0][0].set_ylabel('Swing Towards Democrats from 1992 to 1996')

df_south.plot.scatter('College_1990', 'Swing_1992_1996', title='The Deep South', ax=axes[0][1], c='red')
m,b = np.polyfit(df_south['College_1990'], df_south['Swing_1992_1996'], 1)
axes[0][1].plot(df_south['College_1990'], m*df_south['College_1990'] + b)
axes[0][1].set_xlabel('College Attainment Rate')
axes[0][1].set_ylabel('Swing Towards Democrats from 1992 to 1996')

df_southwest.plot.scatter('College_1990', 'Swing_1992_1996', title='The Southwest', ax=axes[1][0], c='red')
m,b = np.polyfit(df_southwest['College_1990'], df_southwest['Swing_1992_1996'], 1)
axes[1][0].plot(df_southwest['College_1990'], m*df_southwest['College_1990'] + b)
axes[1][0].set_xlabel('College Attainment Rate')
axes[1][0].set_ylabel('Swing Towards Democrats from 1992 to 1996')

df_west.plot.scatter('College_1990', 'Swing_1992_1996', title='West Coast', ax=axes[1][1], c='red')
m,b = np.polyfit(df_west['College_1990'], df_west['Swing_1992_1996'], 1)
axes[1][1].plot(df_west['College_1990'], m*df_west['College_1990'] + b)
axes[1][1].set_xlabel('College Attainment Rate')
axes[1][1].set_ylabel('Swing Towards Democrats from 1992 to 1996')

df_midwest.plot.scatter('College_1990', 'Swing_1992_1996', title='The Midwest', ax=axes[2][0], c='red')
m,b = np.polyfit(df_midwest['College_1990'], df_midwest['Swing_1992_1996'], 1)
axes[2][0].plot(df_midwest['College_1990'], m*df_midwest['College_1990'] + b)
axes[2][0].set_xlabel('College Attainment Rate')
axes[2][0].set_ylabel('Swing Towards Democrats from 1992 to 1996')

df_mountainwest.plot.scatter('College_1990', 'Swing_1992_1996', title='Mountain West', ax=axes[2][1], c='red')
m,b = np.polyfit(df_mountainwest['College_1990'], df_mountainwest['Swing_1992_1996'], 1)
axes[2][1].plot(df_mountainwest['College_1990'], m*df_mountainwest['College_1990'] + b)
axes[2][1].set_xlabel('College Attainment Rate')
axes[2][1].set_ylabel('Swing Towards Democrats from 1992 to 1996')

plt.show()

As we can see from these preliminary visuals, the correlation between education and vote swing is a fairly recent phenomenon. We see very strong correlations today, but in the 1990s there was essentially no correlation. With this in mind, we can begin our full model implementation.
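To put a rough number on that visual impression, the two eras' correlation coefficients can be compared directly (pandas skips rows with missing 1990s values, such as the Louisiana counties, automatically); a quick sketch:

modern = df['College'].corr(df['Swing_2012_2016'])
nineties = df['College_1990'].corr(df['Swing_1992_1996'])
print('2012-2016 correlation:', round(modern, 2))
print('1992-1996 correlation:', round(nineties, 2))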

Part 4: Full Model Implementation

Now that we have collected, cleaned, and explored our data, it is time to get to work developing a complete linear regression explaining the factors contributing to America's changing electoral landscape. We will use our independent variables to create a regression model to determine their correlation with the vote swings across the country. This will allow us to conclude what variables are correlated with each party's improving and diminishing fortunes in different places.

In order to do this, we will play around with data transformations and machine learning to perfect our model. We will also use hypothesis testing to determine if these factors have always contributed to electoral swings or if they are a new phenomenon.

Data Transformations to Address Skew

As we saw from our diagnostic plots in Part 3, our county swing data is heavily skewed. There could be several explanations for this. One is that an abundance of small counties with large swings is skewing the data.

Grouping Small Counties

The best way to handle this without dropping all of these small counties is to group them together. Let's try grouping all of the small counties into intervals:

  • 0-999 people
  • 1,000-1,999 people
  • 2,000-2,999 people
  • 3,000-3,999 people
  • 4,000-4,999 people
  • 5,000-9,999 people
  • 10,000-14,999 people
  • 15,000-19,999 people
  • Do not group counties with at least 20,000 people

These intervals will be grouped together, and their data will be averaged. This way, we can get a snapshot of the small counties without them skewing our data. We can also remove income from our list of independent variables, as our preliminary model suggested that it is the least likely to be correlated with vote swing.
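For reference, the same binning idea can also be expressed with pandas.cut and a single groupby, though the loop below additionally appends the averaged rows back onto the ungrouped counties and takes the mode of the Coastal flag. A rough sketch:

# alternative sketch: average each population bin in one groupby
small = df[df['Population'] < 20000]
bins = [0, 1000, 2000, 3000, 4000, 5000, 10000, 15000, 20000]
binned = small.groupby(pd.cut(small['Population'], bins))[
    ['Swing_2012_2016', 'GDP', 'College', 'White']].mean()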

In [37]:
df_group = df[['County', 'State', 'Swing_2012_2016', 'Population', 'Income', 'GDP', 'College', 'White', 'Coastal']]

# loop through each interval, create one grouped object, and drop the other items in that interval
for interval in [1000, 2000, 3000, 4000, 5000, 10000, 15000, 20000]:
    df_interval = df_group[df_group['Population'] < interval]
    
    # get the mean for every data point in this interval
    row = pd.DataFrame({'County':str(interval), 'State':'Interval',\
                       'Swing_2012_2016':df_interval['Swing_2012_2016'].mean(),\
                       'Population':df_interval['Population'].mean(), \
                       'Income':df_interval['Income'].mean(), \
                       'GDP':df_interval['GDP'].mean(), \
                       'College':df_interval['College'].mean(), \
                       'White':df_interval['White'].mean(), \
                       'Coastal':df_interval['Coastal'].mode()})
    
    # drop the items in that interval, because they have all been grouped together in a new single row
    df_group = df_group.drop(index=df_group[(df_group['Population'] < interval) & \
                                            (df_group['State'] != 'Interval')].index.values)
    df_group = df_group.append(row, ignore_index=True)

Now let's try this grouped data out with a new regression model. We'll do this the same as before.

In [38]:
LinearModel = sm.ols(formula='Swing_2012_2016 ~ GDP + College + White + Coastal', data=df_group).fit()
print(LinearModel.summary2())
                  Results: Ordinary least squares
===================================================================
Model:              OLS              Adj. R-squared:     0.554     
Dependent Variable: Swing_2012_2016  AIC:                11820.4400
Date:               2020-12-18 17:58 BIC:                11847.8647
No. Observations:   1781             Log-Likelihood:     -5905.2   
Df Model:           4                F-statistic:        554.0     
Df Residuals:       1776             Prob (F-statistic): 2.26e-310 
R-squared:          0.555            Scale:              44.536    
--------------------------------------------------------------------
                Coef.   Std.Err.     t      P>|t|    [0.025   0.975]
--------------------------------------------------------------------
Intercept      -2.9960    0.8037   -3.7277  0.0002  -4.5723  -1.4197
GDP            -0.0000    0.0000   -4.5129  0.0000  -0.0001  -0.0000
College         0.6142    0.0187   32.8559  0.0000   0.5775   0.6508
White          -0.2479    0.0086  -28.8520  0.0000  -0.2647  -0.2310
Coastal        -1.6718    0.5383   -3.1056  0.0019  -2.7275  -0.6160
-------------------------------------------------------------------
Omnibus:             395.507       Durbin-Watson:          1.146   
Prob(Omnibus):       0.000         Jarque-Bera (JB):       2454.831
Skew:                0.891         Prob(JB):               0.000   
Kurtosis:            8.469         Condition No.:          233328  
===================================================================
* The condition number is large (2e+05). This might indicate
strong multicollinearity or other numerical problems.

This seems to have worked! Our Adjusted R2 value bumped up quite a bit, now topping 50%. This is all good news, but there is probably more to do. Let's check our diagnostic plots.

In [39]:
diagnostic_plots(LinearModel, 15, 3)

The Utah Case

This improved our adjusted R2 value, but did not improve much else. Another way to address skewed data is to remove outliers that have a clear, identifiable cause for being so extreme, separate from the factors under study. Utah counties swung dramatically towards the Democrats in 2016 because Donald Trump performed uniquely poorly among the state's Mormon voters. Because Utah is such a unique state due to its Mormon heritage and culture, we can safely remove it from our dataset without risking serious bias in the data.

It is in cases such as these that thorough industry knowledge can be very helpful in data science. Even those with top-notch data skills may not know to do this without knowledge of American politics.

In [40]:
df_group = df_group[df_group['State'] != 'UT']
LinearModel = sm.ols(formula='Swing_2012_2016 ~ GDP + College + White + Coastal', data=df_group).fit()
print(LinearModel.summary2())
                  Results: Ordinary least squares
===================================================================
Model:              OLS              Adj. R-squared:     0.601     
Dependent Variable: Swing_2012_2016  AIC:                11356.7909
Date:               2020-12-18 17:58 BIC:                11384.1733
No. Observations:   1766             Log-Likelihood:     -5673.4   
Df Model:           4                F-statistic:        665.4     
Df Residuals:       1761             Prob (F-statistic): 0.00      
R-squared:          0.602            Scale:              36.238    
--------------------------------------------------------------------
                Coef.   Std.Err.     t      P>|t|    [0.025   0.975]
--------------------------------------------------------------------
Intercept      -2.6385    0.7260   -3.6341  0.0003  -4.0625  -1.2145
GDP            -0.0000    0.0000   -4.3632  0.0000  -0.0000  -0.0000
College         0.5916    0.0170   34.8022  0.0000   0.5582   0.6249
White          -0.2520    0.0078  -32.4921  0.0000  -0.2672  -0.2368
Coastal        -1.3394    0.4860   -2.7562  0.0059  -2.2926  -0.3863
-------------------------------------------------------------------
Omnibus:               65.345       Durbin-Watson:          1.347  
Prob(Omnibus):         0.000        Jarque-Bera (JB):       185.193
Skew:                  -0.036       Prob(JB):               0.000  
Kurtosis:              4.585        Condition No.:          232676 
===================================================================
* The condition number is large (2e+05). This might indicate
strong multicollinearity or other numerical problems.

It seems like this bumped our adjusted R2 up further. This is good! Let's check some diagnostic plots to see what progress we've made. There is likely more to be done.

In [41]:
diagnostic_plots(LinearModel, 15, 3)

Dependent Data Standardization and GDP Log Transformation

These diagnostic plots still aren't great. It seems that grouping small counties and dropping Utah did not reduce our data's skew to an acceptable level. With skew this persistent, two more adjustments are worth making: standardizing the dependent variable and log-transforming GDP.

Standardization rescales our swing data to a mean of 0 and a standard deviation of 1. Because it is a linear transformation, it does not by itself change the shape of the distribution, but it puts the dependent variable on a consistent scale and makes the coefficients easier to interpret and compare. We only need to do this to our swing data; the log transformation below handles the skew in GDP, and the remaining independent variables are not badly skewed. Let's see if this improves our model.

In [42]:
# get the average and standard deviation of the data, then standardize
u = df_group['Swing_2012_2016'].mean()
o = df_group['Swing_2012_2016'].std()

df_group['Swing_2012_2016_std'] = (df_group['Swing_2012_2016'] - u) / o

It also seems that our GDP data is not fitting the model well. A look at the histogram of the GDP data shows serious right skew, which a log transformation can usually correct.

In [43]:
plt.hist(df_group['GDP'])
plt.title('GDP Histogram')
plt.xlabel('GDP Values')
plt.xticks([])
plt.show()
In [44]:
# perform a logarithm on the GDP data
df_group['GDP_log'] = np.log(df_group['GDP'])
In [45]:
LinearModel = sm.ols(formula='Swing_2012_2016_std ~ GDP_log + College + White + Coastal', data=df_group).fit()
print(LinearModel.summary2())
                   Results: Ordinary least squares
=====================================================================
Model:              OLS                 Adj. R-squared:     0.600    
Dependent Variable: Swing_2012_2016_std AIC:                3397.4055
Date:               2020-12-18 17:58    BIC:                3424.7879
No. Observations:   1766                Log-Likelihood:     -1693.7  
Df Model:           4                   F-statistic:        663.6    
Df Residuals:       1761                Prob (F-statistic): 0.00     
R-squared:          0.601               Scale:              0.39975  
-----------------------------------------------------------------------
              Coef.    Std.Err.      t       P>|t|     [0.025    0.975]
-----------------------------------------------------------------------
Intercept     2.2377     0.4251     5.2633   0.0000    1.4038    3.0715
GDP_log      -0.1644     0.0411    -4.0041   0.0001   -0.2449   -0.0839
College       0.0620     0.0018    34.1676   0.0000    0.0584    0.0655
White        -0.0263     0.0008   -32.4236   0.0000   -0.0279   -0.0248
Coastal      -0.1441     0.0510    -2.8246   0.0048   -0.2442   -0.0441
---------------------------------------------------------------------
Omnibus:               65.291        Durbin-Watson:           1.352  
Prob(Omnibus):         0.000         Jarque-Bera (JB):        184.844
Skew:                  -0.037        Prob(JB):                0.000  
Kurtosis:              4.583         Condition No.:           2305   
=====================================================================
* The condition number is large (2e+03). This might indicate
strong multicollinearity or other numerical problems.
In [46]:
diagnostic_plots(LinearModel, 15, 3)

These diagnostic plots are not perfect, but they are definitely an improvement, enough to say we have significantly reduced the skew of our data. We have also brought our model to a point where every remaining independent variable is significant (recall that income was dropped earlier). This is a big accomplishment! We can now finalize our complete model for this study.

Finalizing the Model

This will be the final version of our model, used for the rest of the study. We will draw all of our main conclusions from this model, including what factors may be contributing to the shifting electoral landscape in the United States. The purpose of this linear regression model is to explain the correlation between college attainment, demographics, and coastal status and a given county's swing towards either party from the 2012 to the 2016 presidential election.

Training and Testing Data

For testing purposes, we will split our data into a training set for building the model (95% of the data) and a testing set for determining how accurate our model is (5% of the data). This is a common practice in data science.

In [47]:
df_model = df[df['State'] != 'UT'] # remove Utah data
In [48]:
# perform the same data transformation as before
df_model = df_model[['County', 'State', 'Swing_2012_2016', 'Population', 'Income', 'GDP', 'College', 'White', 'Coastal']]
for interval in [1000, 2000, 3000, 4000, 5000, 10000, 15000, 20000]:
    df_interval = df_model[df_model['Population'] < interval]
    row = pd.DataFrame({'County':str(interval), 'State':'Interval',\
                       'Swing_2012_2016':df_interval['Swing_2012_2016'].mean(),\
                       'Population':df_interval['Population'].mean(), \
                       'Income':df_interval['Income'].mean(), \
                       'GDP':df_interval['GDP'].mean(), \
                       'College':df_interval['College'].mean(), \
                       'White':df_interval['White'].mean(), \
                       'Coastal':df_interval['Coastal'].mode()})
    
    df_model = df_model.drop(index=df_model[(df_model['Population'] < interval) & \
                                            (df_model['State'] != 'Interval')].index.values)
    df_model = df_model.append(row, ignore_index=True)
In [49]:
u = df_model['Swing_2012_2016'].mean()
o = df_model['Swing_2012_2016'].std()

df_model['Swing_2012_2016_std'] = (df_model['Swing_2012_2016'] - u) / o

df_model['GDP_log'] = np.log(df_model['GDP'])
In [50]:
df_train = df_model.sample(frac=0.95) # random sample of 95% of the data
df_test = df_model.drop(df_train.index) # the rest of the data not in the sample
In [51]:
FinalModel = sm.ols(formula='Swing_2012_2016_std ~ GDP_log + College + White + Coastal', data=df_train).fit()
print(FinalModel.summary2())
                   Results: Ordinary least squares
=====================================================================
Model:              OLS                 Adj. R-squared:     0.603    
Dependent Variable: Swing_2012_2016_std AIC:                3231.8677
Date:               2020-12-18 17:58    BIC:                3258.9945
No. Observations:   1678                Log-Likelihood:     -1610.9  
Df Model:           4                   F-statistic:        637.1    
Df Residuals:       1673                Prob (F-statistic): 0.00     
R-squared:          0.604               Scale:              0.40059  
-----------------------------------------------------------------------
              Coef.    Std.Err.      t       P>|t|     [0.025    0.975]
-----------------------------------------------------------------------
Intercept     2.2358     0.4384     5.0996   0.0000    1.3759    3.0957
GDP_log      -0.1629     0.0423    -3.8472   0.0001   -0.2459   -0.0798
College       0.0622     0.0019    33.2681   0.0000    0.0586    0.0659
White        -0.0266     0.0008   -31.9488   0.0000   -0.0283   -0.0250
Coastal      -0.1616     0.0521    -3.0999   0.0020   -0.2638   -0.0594
---------------------------------------------------------------------
Omnibus:               65.229        Durbin-Watson:           2.007  
Prob(Omnibus):         0.000         Jarque-Bera (JB):        190.932
Skew:                  -0.028        Prob(JB):                0.000  
Kurtosis:              4.652         Condition No.:           2315   
=====================================================================
* The condition number is large (2e+03). This might indicate
strong multicollinearity or other numerical problems.
In [52]:
diagnostic_plots(FinalModel, 18, 5)

Final Model Summation

Our complete linear model concludes that, after adjusting for skew, GDP per capita, college attainment, the white percentage of the population, and coastal status are each correlated with the degree to which a US county swung towards one party or the other from the 2012 to the 2016 presidential election. The exact formula is as follows:

$ \text{Standardized Vote Swing} = \beta_{0} + \beta_{1}\log(\text{GDP}) + \beta_{2}(\text{College}) + \beta_{3}(\text{White}) + \beta_{4}(\text{Coastal}) $

Basic Statistical Conclusions

  1. A 1 unit increase in the logarithm of GDP per capita is correlated with an increase in the standardized swing towards the Republicans
  2. A 1 percentage point increase in college attainment is correlated with an increase in the standardized swing towards the Democrats
  3. A 1 percentage point increase in the white share of the population is correlated with an increase in the standardized swing towards the Republicans
  4. Being a coastal county is correlated with an increase in the standardized swing towards the Republicans
  5. Overall, coastal areas and predominantly white areas are trending towards Republicans in the modern political era
  6. Overall, more college educated areas are trending towards Democrats in the modern political era


Important Caveats

  1. Correlation DOES NOT equal causation!
    • This is one of the most important laws of statistics. This model demonstrates that these factors are correlated, but it absolutely does not prove or even suggest that one causes the other to happen
  2. The swing data here has been standardized, so in order to get the true predicted swing you must calculate the standardized prediction using the model function, multiply it by the standard deviation, and add the mean. You must also take the logarithm of the GDP measurement you are using. (A short sketch follows this list.)
  3. The data is still plagued by moderate skew. Though it is not severe enough to worry too much about, it is worth mentioning
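As a minimal sketch of caveat 2 (the county values below are hypothetical, and we assume the fitted FinalModel along with the mean u and standard deviation o computed earlier), converting a standardized prediction back into a real-world swing looks like this:

# hypothetical county, for illustration only
new_county = pd.DataFrame({'GDP_log': [np.log(45000)],   # log-transform the raw GDP figure first
                           'College': [30.0],            # 30% college attainment
                           'White': [75.0],              # 75% white population
                           'Coastal': [0]})              # not a coastal county

swing_std = FinalModel.predict(new_county)   # prediction on the standardized scale
swing_points = (swing_std * o) + u           # multiply by the standard deviation, add the mean
print(swing_points)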

Hypothesis Testing

Hypothesis testing is an important concept in data science and statistics, and we have already done some implicit hypothesis testing using p-values. The p-value for each independent variable is the probability of estimating a coefficient at least as extreme as the one observed, under the assumption that the true coefficient is 0. Because those probabilities are far below 5% for every variable in our model, we can be confident that all of the coefficients are significantly different from 0. In other words, it is extremely unlikely that our model would have produced these coefficients if the true correlations were 0, so we conclude that each of our independent variables is correlated with the vote swing.
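As a rough illustration (using attributes that statsmodels exposes on a fitted results object), we can pull each coefficient's p-value and 95% confidence interval straight from FinalModel and flag which ones clear a 5% significance threshold:

# inspect the per-coefficient hypothesis tests on the final model
alpha = 0.05
tests = pd.DataFrame({'coef': FinalModel.params,
                      'p_value': FinalModel.pvalues,
                      'significant': FinalModel.pvalues < alpha})

ci = FinalModel.conf_int()     # lower and upper bounds of each 95% interval
tests['ci_lower'] = ci[0]
tests['ci_upper'] = ci[1]
print(tests)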

Machine Learning

To complete our model implementation, we're going to do some basic machine learning and compare its results against our model using the testing dataset. Machine learning is a very complex concept, and this is meant only as a basic introduction.

Simple Linear Gradient Descent Algorithm

We will start with a simple gradient descent approach to linear regression. Gradient descent is an iterative optimization method for estimating model parameters: it searches for the combination of coefficients that minimizes a loss function, here the squared residuals. The algorithm passes over the dataset repeatedly, adjusting the coefficients a little each time in a continuous improvement process. Each coefficient is moved in the direction opposite the slope of the loss function at the current point, nudged down where the loss is increasing and up where it is decreasing. Let's give it a try.

Gradient descent requires choosing a proper learning rate (the rate at which the coefficients are adjusted on each iteration). Our learning rate will be quite small to adjust for our large independent data values.


More information on this topic can be found here.
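To make the idea concrete before reaching for scikit-learn, here is a minimal from-scratch sketch of batch gradient descent for ordinary least squares (the hyperparameter defaults are illustrative assumptions, not tuned values):

def gradient_descent_ols(X, y, learning_rate=1e-9, n_iters=1000):
    """Illustrative batch gradient descent for linear regression.
    X is an (n, p) design matrix that already includes a constant column; y has length n."""
    theta = np.zeros(X.shape[1])                  # start every coefficient at zero
    n = len(y)
    for _ in range(n_iters):
        residuals = X.dot(theta) - y              # current prediction errors
        gradient = (2 / n) * X.T.dot(residuals)   # gradient of the mean squared error
        theta -= learning_rate * gradient         # step in the direction opposite the gradient
    return theta

In practice we will lean on scikit-learn's SGDRegressor, which implements a stochastic variant of the same idea.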

In [53]:
y = df_test['Swing_2012_2016_std'].to_numpy()
X = statsmodels.api.add_constant(df_test[['GDP_log', 'College', 'White', 'Coastal']].to_numpy())

# develop gradient descent model using a learning rate of 0.000000001
gradient_regressor = sklearn.linear_model.SGDRegressor(eta0=0.000000001).fit(X, y)
In [54]:
# capture coefficients calculated by gradient descent and our model
theta_gradient = gradient_regressor.coef_
theta_model = np.array(FinalModel.params)

Now that we have estimated new coefficients using gradient descent, we can test them against our model's coefficients. This can be done by gathering all results into a new DataFrame and comparing the accuracy of gradient descent's estimates against our model's.

In [55]:
df_results = df_test[['County', 'State', 'Swing_2012_2016_std']].copy().reset_index(drop=True)
df_results.columns = ['County', 'State', 'Real Swing']

# we will unstandardize the data so it makes more sense in the real world
df_results['Real Swing'] = (df_results['Real Swing'] * o) + u
df_results['Model Prediction'] = (pd.Series([theta_model.dot(row) for row in X]) * o) + u
df_results['Gradient Prediction'] = (pd.Series([theta_gradient.dot(row) for row in X]) * o) + u

# now calculate how far off both estimations were for each county in our testing set
df_results['Model Miss'] = abs(df_results['Model Prediction'] - df_results['Real Swing'])
df_results['Gradient Miss'] = abs(df_results['Gradient Prediction'] - df_results['Real Swing'])
In [56]:
df_results.head()
Out[56]:
County State Real Swing Model Prediction Gradient Prediction Model Miss Gradient Miss
0 Franklin AL -20.0 -14.946314 -8.938003 5.053686 11.061997
1 Jackson AL -20.2 -17.583422 -8.938061 2.616578 11.261939
2 Pike AL -6.3 -4.252344 -8.937791 2.047656 2.637791
3 Gila AZ -5.2 -8.279857 -8.937870 3.079857 3.737870
4 Graham AZ -0.9 -6.797601 -8.937843 5.897601 8.037843
In [57]:
print("Model Miss:   ", df_results['Model Miss'].mean())
print("Gradient Miss:", df_results['Gradient Miss'].mean())
Model Miss:    4.755105932660072
Gradient Miss: 7.04345412478493

As we can see, both our regression model and the gradient descent algorithm produced fairly accurate results, each missing the true swing value by only a few points on average. Our model appears slightly more precise than the machine learning algorithm, though this is not conclusive, and adjusting the learning rate and iteration count in gradient descent may change it. Overall, gradient descent appears to be a valuable way to estimate regression coefficients when an exact least squares fit is impractical.
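As a quick, hedged example of that kind of tuning (the parameter values here are guesses rather than recommendations), SGDRegressor exposes the learning rate and iteration budget directly:

# re-fit gradient descent with a larger learning rate and more iterations
gradient_regressor_tuned = sklearn.linear_model.SGDRegressor(
    eta0=0.000001,    # larger initial learning rate than before
    max_iter=5000,    # allow more passes over the data
    tol=1e-6          # stop early once improvement becomes negligible
).fit(X, y)

print(gradient_regressor_tuned.coef_)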

Brief Machine Learning Test: 1992-1996 Election Data

We can use machine learning, specifically gradient descent, to test another one of our hypotheses: that the correlation between vote swing and education identified in this study is a fairly recent phenomenon. This is where our 1992-1996 data comes in handy. We will now run gradient descent on both the 2012-2016 and the 1992-1996 election data and see how each performs. If our hypothesis is correct, the 1992-1996 algorithm should perform far worse because there is little correlation to be found. Let's get started.

In [58]:
df_compare = df[['County', 'State', 'Swing_2012_2016', 'Swing_1992_1996', 'College', 'College_1990']]\
                [df['State'] != 'LA'].copy()
In [59]:
u1 = df_compare['Swing_1992_1996'].mean()
o1 = df_compare['Swing_1992_1996'].std()

df_compare['Swing_2012_2016_std'] = (df_compare['Swing_2012_2016'] - u) / o
df_compare['Swing_1992_1996_std'] = (df_compare['Swing_1992_1996'] - u1) / o1
In [60]:
y_2012_2016 = df_compare['Swing_2012_2016_std'].to_numpy()
y_1992_1996 = df_compare['Swing_1992_1996_std']
X_2012_2016 = statsmodels.api.add_constant(df_compare[['College']].to_numpy())
X_1992_1996 = statsmodels.api.add_constant(df_compare[['College_1990']].to_numpy())

# perform gradient descent on both sets of data and return the results
gradient_descent_2012_2016 = sklearn.linear_model.SGDRegressor(eta0=0.000000001).fit(X_2012_2016, y_2012_2016)
gradient_descent_1992_1996 = sklearn.linear_model.SGDRegressor(eta0=0.000000001).fit(X_1992_1996, y_1992_1996)
In [61]:
theta_2012_2016 = gradient_descent_2012_2016.coef_
theta_1992_1996 = gradient_descent_1992_1996.coef_
In [62]:
df_results = df_compare[['County', 'State', 'Swing_2012_2016_std', 'Swing_1992_1996_std']].copy()
df_results.columns = ['County', 'State', 'Real Swing 2012-2016', 'Real Swing 1992-1996']

# we will unstandardize the data so it makes more sense in the real world
df_results['Real Swing 2012-2016'] = (df_results['Real Swing 2012-2016'] * o) + u
df_results['Gradient Descent 2012-2016'] = \
                        (pd.Series([np.transpose(theta_2012_2016).dot(row) for row in X_2012_2016]) * o) + u

df_results['Real Swing 1992-1996'] = (df_results['Real Swing 1992-1996'] * o1) + u1
df_results['Gradient Descent 1992-1996'] = \
                        (pd.Series([np.transpose(theta_1992_1996).dot(row) for row in X_1992_1996]) * o1) + u1

# now calculate how far off both estimations were for each county in our testing set
df_results['2012-2016 Miss'] = abs(df_results['Gradient Descent 2012-2016'] - df_results['Real Swing 2012-2016'])
df_results['1992-1996 Miss'] = abs(df_results['Gradient Descent 1992-1996'] - df_results['Real Swing 1992-1996'])
In [63]:
df_results.head()
Out[63]:
County State Real Swing 2012-2016 Real Swing 1992-1996 Gradient Descent 2012-2016 Gradient Descent 1992-1996 2012-2016 Miss 1992-1996 Miss
0 Autauga AL -3.0 -4.2 -8.937204 -0.949273 5.937204 3.250727
1 Baldwin AL -1.5 -5.2 -8.937136 -0.949256 7.437136 4.250744
2 Barbour AL -8.7 9.5 -8.937493 -0.949293 0.237493 10.449293
3 Bibb AL -8.6 -0.9 -8.937507 -0.949346 0.337507 0.049346
4 Blount AL -6.9 -5.2 -8.937485 -0.949329 2.037485 4.250671
In [64]:
print("2012-2016 Miss:   ", df_results['2012-2016 Miss'].mean())
print("1992-1996 Miss:   ", df_results['1992-1996 Miss'].mean())
2012-2016 Miss:    7.93470494385627
1992-1996 Miss:    5.9054315536007325

This is not entirely what was expected. It seems that the 1992-1996 model is actually slightly more accurate, though the difference is marginal. Let's check the education coefficients for each to gain some insights.

In [65]:
print('2012-2016 Education Coefficient:  ', theta_2012_2016[1], )
print('1992-1996 Education Coefficient:  ', theta_1992_1996[1], )
2012-2016 Education Coefficient:   1.960711687144499e-06
1992-1996 Education Coefficient:   9.591090901006e-07

So while the gradient descent algorithm is about as accurate on the 1990s data, it produces a much smaller education coefficient (roughly half the size). This suggests that education has become much more correlated with vote patterns over time, even if some correlation existed back in the 90s. It also reinforces how effective gradient descent can be at finding workable coefficients for a set of data.

It is important in data science to avoid dismissing data that does not fit your priors. This machine learning test did not result in the output I expected, but it is important to include it in the study and adjust my view of the data accordingly.

Comparing College Education Models

Finally, let's see what our linear regression library says about how college education's correlation with voting patterns has changed since the 90s. We will create two simple linear regression models comparing just the correlation with education and vote swing.

In [66]:
Model_2012_2016 = sm.ols(formula='Swing_2012_2016_std ~ College', data=df_compare).fit()
print("2012-2016 Model", Model_2012_2016.summary2(), end='\n\n\n\n')

Model_1992_1996 = sm.ols(formula='Swing_1992_1996_std ~ College_1990', data=df_compare).fit()
print("1992-1996 Model", Model_1992_1996.summary2())
2012-2016 Model                    Results: Ordinary least squares
=====================================================================
Model:              OLS                 Adj. R-squared:     0.275    
Dependent Variable: Swing_2012_2016_std AIC:                7863.8159
Date:               2020-12-18 17:58    BIC:                7875.8199
No. Observations:   2987                Log-Likelihood:     -3929.9  
Df Model:           1                   F-statistic:        1133.    
Df Residuals:       2985                Prob (F-statistic): 8.13e-211
R-squared:          0.275               Scale:              0.81394  
-----------------------------------------------------------------------
              Coef.    Std.Err.      t       P>|t|     [0.025    0.975]
-----------------------------------------------------------------------
Intercept    -1.4857     0.0418   -35.5257   0.0000   -1.5677   -1.4037
College       0.0602     0.0018    33.6572   0.0000    0.0567    0.0637
---------------------------------------------------------------------
Omnibus:               76.697        Durbin-Watson:           0.829  
Prob(Omnibus):         0.000         Jarque-Bera (JB):        138.980
Skew:                  0.196         Prob(JB):                0.000  
Kurtosis:              3.981         Condition No.:           59     
=====================================================================




1992-1996 Model                    Results: Ordinary least squares
=====================================================================
Model:              OLS                 Adj. R-squared:     0.004    
Dependent Variable: Swing_1992_1996_std AIC:                8465.9711
Date:               2020-12-18 17:58    BIC:                8477.9751
No. Observations:   2987                Log-Likelihood:     -4231.0  
Df Model:           1                   F-statistic:        13.79    
Df Residuals:       2985                Prob (F-statistic): 0.000208 
R-squared:          0.005               Scale:              0.99573  
----------------------------------------------------------------------
                   Coef.   Std.Err.     t     P>|t|    [0.025   0.975]
----------------------------------------------------------------------
Intercept         -0.1420    0.0424  -3.3510  0.0008  -0.2250  -0.0589
College_1990       0.0106    0.0029   3.7135  0.0002   0.0050   0.0162
---------------------------------------------------------------------
Omnibus:              189.107        Durbin-Watson:           1.022  
Prob(Omnibus):        0.000          Jarque-Bera (JB):        495.728
Skew:                 -0.351         Prob(JB):                0.000  
Kurtosis:             4.869          Condition No.:           35     
=====================================================================

It seems from this that a correlation between education and vote swing did exist in 1992-1996, but it was much smaller than it is now. Also, according to the Adjusted R2 value in each model, education explains about 27% of the variation in vote swing today but could explain only about 0.4% of it in the 90s. This suggests that education has increasingly become a dominating factor in the American political landscape.

Hypothesis Test

Let's test this hypothesis by determining whether the education coefficient in one of our models is significantly different from the other's. We can do this by appending the 1992-1996 data onto the end of the 2012-2016 data and adding a binary variable indicating whether each row is from 2012-2016 or from 1992-1996. We can then fit another model on this data using an interaction term, multiplying the college data by the binary indicator. If this interaction term is significant, it tells us that the education coefficient differs significantly between the two periods.

Null Hypothesis: the education coefficient is not significantly different between the 1992-1996 and 2012-2016 data
Alternative Hypothesis: the education coefficient is significantly different between the two periods

We are able to reject our null hypothesis and conclude a difference in correlation if the interaction term is significant in our model.
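Written out, the interaction model takes the following form, where Modern is 1 for 2012-2016 rows and 0 for 1992-1996 rows:

$ \text{Swing} = \beta_{0} + \beta_{1}(\text{College}) + \beta_{2}(\text{Modern}) + \beta_{3}(\text{College} \times \text{Modern}) $

The education slope for 1992-1996 is therefore $\beta_{1}$, and the slope for 2012-2016 is $\beta_{1} + \beta_{3}$, so testing whether $\beta_{3}$ differs from 0 is exactly the test of whether the two slopes differ.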

In [67]:
df_1 = df_compare[['County', 'State', 'Swing_2012_2016', 'College']].copy()
df_2 = df_compare[['County', 'State', 'Swing_1992_1996', 'College_1990']].copy()
df_1.columns = ['County', 'State', 'Swing', 'College']
df_2.columns = ['County', 'State', 'Swing', 'College']

# 'Modern' denotes whether the data was from 2012-2016 or not
df_1.loc[:, 'Modern'] = 1
df_2.loc[:, 'Modern'] = 0

df_hypothesis = df_1.append(df_2, ignore_index=True)
df_hypothesis['Interaction'] = df_hypothesis['College'] * df_hypothesis['Modern']
In [68]:
Hypothesis_Model = sm.ols(formula='Swing ~ College * Modern', data=df_hypothesis).fit()
print(Hypothesis_Model.summary2())
                  Results: Ordinary least squares
===================================================================
Model:              OLS              Adj. R-squared:     0.365     
Dependent Variable: Swing            AIC:                42040.5771
Date:               2020-12-18 17:58 BIC:                42067.3578
No. Observations:   5974             Log-Likelihood:     -21016.   
Df Model:           3                F-statistic:        1144.     
Df Residuals:       5970             Prob (F-statistic): 0.00      
R-squared:          0.365            Scale:              66.600    
-------------------------------------------------------------------
                 Coef.   Std.Err.    t     P>|t|   [0.025   0.975] 
-------------------------------------------------------------------
Intercept        -2.0449   0.3465  -5.9016 0.0000  -2.7241  -1.3656
College           0.0817   0.0233   3.5036 0.0005   0.0360   0.1274
Modern          -21.0510   0.5130 -41.0351 0.0000 -22.0567 -20.0453
College:Modern    0.4922   0.0284  17.3419 0.0000   0.4366   0.5479
-------------------------------------------------------------------
Omnibus:              180.757       Durbin-Watson:          0.915  
Prob(Omnibus):        0.000         Jarque-Bera (JB):       475.753
Skew:                 -0.033        Prob(JB):               0.000  
Kurtosis:             4.381         Condition No.:          135    
===================================================================

Our interaction term is significant! This confirms a significant difference in correlation between the two sets of data.

We can see from this model that college education is correlated with voters swinging towards the Democratic Party, and through our interaction term it becomes clear that the 2012-2016 data shows a much stronger correlation. The interaction term is significant and its 95% confidence interval lies entirely above 0, so the evidence points to a genuinely larger coefficient in 2012-2016. For 1992-1996 (Modern = 0), the education coefficient is 0.0817. For 2012-2016 (Modern = 1), the education coefficient is 0.0817 + 0.4922 = 0.5739. That is a substantial increase over 20 years.

Comparing our two education models and using this interaction term model, we can come to the conclusion that education has become significantly more correlated with vote swings in recent years.

Part 5: Visualizations

In this section we will move into data visualizations, an integral part of the data science lifecycle. Without visualizations, data can be meaningless to all but those experienced in the field; visualizations are essential for communicating data insights to those with less expertise.

How Factors Correlate with GDP

In [69]:
figure, axes = plt.subplots(2,2)
figure.suptitle('How Different Factors Correlate with GDP', fontsize=20)
figure.set_size_inches(15, 15)

# reduce the range of data for visibility
df[(df['Income'] < 80000) & (df['GDP'] < 80000)].sample(frac=0.25).plot.scatter('Income', 'GDP', ax=axes[0][0])
axes[0][0].set_title('Income')

df[df['GDP'] < 100000].sample(frac=0.25).plot.scatter('College', 'GDP', ax=axes[1][0])
axes[1][0].set_title('College Attainment')

df[df['GDP'] < 100000].sample(frac=0.25).plot.scatter('White', 'GDP', ax=axes[0][1])
axes[0][1].set_title('White Population')
axes[0][1].set_xlabel('White Population %')
axes[0][1].set_ylabel('')

coastal_group = df[['Coastal', 'GDP']].groupby('Coastal', as_index=False).mean().sort_values(by='Coastal', ascending=False)
coastal_group['Coastal'] = ['Coastal', 'Not Coastal']
barlist = coastal_group.plot('Coastal', 'GDP', kind='bar', ax=axes[1][1])
axes[1][1].get_children()[1].set_color('green')
axes[1][1].set_title('Coastal Status')
plt.xticks(rotation=0)
plt.show()

How Education Has Become More Significant in Vote Swings

In [70]:
labels = ['Education', 'Other Factors']
r_2016 = round(Model_2012_2016.rsquared * 100, 2)
r_1996 = round(Model_1992_1996.rsquared * 100, 2)

figure, axes = plt.subplots(1,2)
figure.suptitle('The Increasing Role of Education in Presidential Vote Swings', fontsize=20)
figure.set_size_inches(15, 8)

# show the adjusted R^2 values for each model
axes[0].pie([r_1996, 100-r_1996], labels=labels, colors=['red', 'grey'])
axes[0].set_title('1992-1996 Vote Swings Correlations')

axes[1].pie([r_2016, 100-r_2016], labels=labels, colors=['red', 'grey'])
axes[1].set_title('2012-2016 Vote Swings Correlations')
plt.show()

Correlations Between Education and Vote Swing

In [71]:
figure, ax = plt.subplots(1)
figure.set_size_inches(15,8)
plt.title('Correlation of Education with Vote Swing')
plt.xlabel('College Attainment Rate')
plt.ylabel('Swing Towards the Democrats')

data1 = df_hypothesis[df_hypothesis['Modern'] == 1].sample(n=300)
data2 = df_hypothesis[df_hypothesis['Modern'] == 0].sample(n=300)

# plot the correlations together
data1.plot.scatter('College', 'Swing', ax=ax)
m,b = np.polyfit(data1['College'], data1['Swing'], 1)
ax.plot(data1['College'], m*data1['College'] + b, linewidth=4)

data2.plot.scatter('College', 'Swing', ax=ax, color='red', marker="^")
m,b = np.polyfit(data2['College'], data2['Swing'], 1)
ax.plot(data2['College'], m*data2['College'] + b, color='red', linewidth=4)

plt.legend(['2012-2016 Data', '1992-1996 Data'])
plt.show()

How Correlation Has Changed

Here we will attempt to measure education's correlation with vote swings over time. To do this, we will gather election data for every presidential election from 1992 through 2020 and create a model for each swing.

In [72]:
elections_df2 = elections_df.copy()
In [73]:
# gather election swing data from other years, and use it to show how the role of education has changed 
# in vote swings
for year1,year2 in [(2000, 2004), (2008, 2012), (2016,2020)]:
    df_temp = election_swings(election_results(atlas_states[0], year1), election_results(atlas_states[0], year2))
    df_temp.loc[:, 'State'] = abbs[atlas_states[0]]
    for state in atlas_states[1:]:
        new_df = election_swings(election_results(state, year1), election_results(state, year2))
        new_df.loc[:, 'State'] = abbs[state]
        df_temp = df_temp.append(new_df, ignore_index=True)

    df_temp = df_temp[[df_temp.columns[0]] + [df_temp.columns[-1]] + list(df_temp.columns[1:-1])]
    elections_df2 = elections_df2.merge(df_temp, on=['County', 'State'], how='inner')
In [74]:
elections_df2 = elections_df2.merge(education_df, on=['County', 'State'], how='inner')
In [75]:
# create a model for each swing
Model_1992_1996 = sm.ols(formula='Swing_1992_1996 ~ College_1990', data=elections_df2).fit()
Model_2000_2004 = sm.ols(formula='Swing_2000_2004 ~ College_2000', data=elections_df2).fit()
Model_2008_2012 = sm.ols(formula='Swing_2008_2012 ~ College', data=elections_df2).fit()
Model_2012_2016 = sm.ols(formula='Swing_2012_2016 ~ College', data=elections_df2).fit()
Model_2016_2020 = sm.ols(formula='Swing_2016_2020 ~ College', data=elections_df2).fit()
In [76]:
# plot the education coefficient over time
figure, ax = plt.subplots(1)
figure.set_size_inches(12,6)
plt.xlabel('Election Years')
plt.ylabel('Education and Vote Swing Correlation')
plt.title("Education's Correlation with Vote Swing Over Time")

x = ['1992-1996', '2000-2004', '2008-2012', '2012-2016', '2016-2020']
y = [Model_1992_1996.params[1], Model_2000_2004.params[1], \
    Model_2008_2012.params[1], Model_2012_2016.params[1], Model_2016_2020.params[1]]

ax.plot(x, y)
plt.xticks(rotation=45)
plt.show()

Other Correlations by Geographic Region

In [77]:
df_newengland = df[(df['State'] == 'ME') | (df['State'] == 'NH') | (df['State'] == 'VT') | (df['State'] == 'MA') | \
                    (df['State'] == 'RI') | (df['State'] == 'CT')]

df_south = df[(df['State'] == 'AR') | (df['State'] == 'TN') | (df['State'] == 'MS') | (df['State'] == 'AL') | \
                    (df['State'] == 'GA') | (df['State'] == 'SC') | (df['State'] == 'NC')]

df_southwest = df[(df['State'] == 'TX') | (df['State'] == 'CO') | (df['State'] == 'NM') | (df['State'] == 'AZ') | \
                    (df['State'] == 'NV')]

df_west = df[(df['State'] == 'WA') | (df['State'] == 'OR') | (df['State'] == 'CA')]

df_midwest = df[(df['State'] == 'MI') | (df['State'] == 'WI') | (df['State'] == 'MN') | (df['State'] == 'OH') | \
                    (df['State'] == 'IN') | (df['State'] == 'IL') | (df['State'] == 'IA')]

df_mountainwest = df[(df['State'] == 'MT') | (df['State'] == 'WY') | (df['State'] == 'ID') | (df['State'] == 'ND') | \
                    (df['State'] == 'SD')]
In [78]:
# this is the same as the 6 plots from the exploratory section, but using GDP and race instead
figure, axes = plt.subplots(3,2)
figure.set_size_inches(15, 20)
figure.suptitle('GDP and Vote Shift: 2012-2016', fontsize=20)

df_newengland.plot.scatter('GDP', 'Swing_2012_2016', title='New England', ax=axes[0][0])
m,b = np.polyfit(df_newengland['GDP'], df_newengland['Swing_2012_2016'], 1)
axes[0][0].plot(df_newengland['GDP'], m*df_newengland['GDP'] + b, color='red')
axes[0][0].set_ylabel('Swing Towards Democrats from 2012-2016')

df_south.plot.scatter('GDP', 'Swing_2012_2016', title='The Deep South', ax=axes[0][1])
m,b = np.polyfit(df_south['GDP'], df_south['Swing_2012_2016'], 1)
axes[0][1].plot(df_south['GDP'], m*df_south['GDP'] + b, color='red')
axes[0][1].set_ylabel('Swing Towards Democrats from 2012-2016')
axes[0][1].set_xlim([0, 100000])

df_southwest.plot.scatter('GDP', 'Swing_2012_2016', title='The Southwest', ax=axes[1][0])
m,b = np.polyfit(df_southwest['GDP'], df_southwest['Swing_2012_2016'], 1)
axes[1][0].plot(df_southwest['GDP'], m*df_southwest['GDP'] + b, color='red')
axes[1][0].set_ylabel('Swing Towards Democrats from 2012-2016')
axes[1][0].set_xlim([0, 150000])

df_west.plot.scatter('GDP', 'Swing_2012_2016', title='West Coast', ax=axes[1][1])
m,b = np.polyfit(df_west['GDP'], df_west['Swing_2012_2016'], 1)
axes[1][1].plot(df_west['GDP'], m*df_west['GDP'] + b, color='red')
axes[1][1].set_ylabel('Swing Towards Democrats from 2012-2016')
axes[1][1].set_xlim([0, 200000])

df_midwest.plot.scatter('GDP', 'Swing_2012_2016', title='The Midwest', ax=axes[2][0])
m,b = np.polyfit(df_midwest['GDP'], df_midwest['Swing_2012_2016'], 1)
axes[2][0].plot(df_midwest['GDP'], m*df_midwest['GDP'] + b, color='red')
axes[2][0].set_ylabel('Swing Towards Democrats from 2012-2016')

df_mountainwest.plot.scatter('GDP', 'Swing_2012_2016', title='Mountain West', ax=axes[2][1])
m,b = np.polyfit(df_mountainwest['GDP'], df_mountainwest['Swing_2012_2016'], 1)
axes[2][1].plot(df_mountainwest['GDP'], m*df_mountainwest['GDP'] + b, color='red')
axes[2][1].set_ylabel('Swing Towards Democrats from 2012-2016')
axes[2][1].set_xlim([0, 150000])

plt.show()
In [79]:
figure, axes = plt.subplots(3,2)
figure.set_size_inches(15, 20)
figure.suptitle('White Population and Vote Shift: 2012-2016', fontsize=20)

df_newengland.plot.scatter('White', 'Swing_2012_2016', title='New England', ax=axes[0][0])
m,b = np.polyfit(df_newengland['White'], df_newengland['Swing_2012_2016'], 1)
axes[0][0].plot(df_newengland['White'], m*df_newengland['White'] + b, color='green')
axes[0][0].set_xlabel('White Population %')
axes[0][0].set_ylabel('Swing Towards Democrats from 2012-2016')

df_south.plot.scatter('White', 'Swing_2012_2016', title='The Deep South', ax=axes[0][1])
m,b = np.polyfit(df_south['White'], df_south['Swing_2012_2016'], 1)
axes[0][1].plot(df_south['White'], m*df_south['White'] + b, color='green')
axes[0][1].set_xlabel('White Population %')
axes[0][1].set_ylabel('Swing Towards Democrats from 2012-2016')

df_southwest.plot.scatter('White', 'Swing_2012_2016', title='The Southwest', ax=axes[1][0])
m,b = np.polyfit(df_southwest['White'], df_southwest['Swing_2012_2016'], 1)
axes[1][0].plot(df_southwest['White'], m*df_southwest['White'] + b, color='green')
axes[1][0].set_xlabel('White Population %')
axes[1][0].set_ylabel('Swing Towards Democrats from 2012-2016')

df_west.plot.scatter('White', 'Swing_2012_2016', title='West Coast', ax=axes[1][1])
m,b = np.polyfit(df_west['White'], df_west['Swing_2012_2016'], 1)
axes[1][1].plot(df_west['White'], m*df_west['White'] + b, color='green')
axes[1][1].set_xlabel('White Population %')
axes[1][1].set_ylabel('Swing Towards Democrats from 2012-2016')

df_midwest.plot.scatter('White', 'Swing_2012_2016', title='The Midwest', ax=axes[2][0])
m,b = np.polyfit(df_midwest['White'], df_midwest['Swing_2012_2016'], 1)
axes[2][0].plot(df_midwest['White'], m*df_midwest['White'] + b, color='green')
axes[2][0].set_xlabel('White Population %')
axes[2][0].set_ylabel('Swing Towards Democrats from 2012-2016')
axes[2][0].set_xlim([60,100])

df_mountainwest.plot.scatter('White', 'Swing_2012_2016', title='Mountain West', ax=axes[2][1])
m,b = np.polyfit(df_mountainwest['White'], df_mountainwest['Swing_2012_2016'], 1)
axes[2][1].plot(df_mountainwest['White'], m*df_mountainwest['White'] + b, color='green')
axes[2][1].set_xlabel('White Population %')
axes[2][1].set_ylabel('Swing Towards Democrats from 2012-2016')
axes[2][1].set_xlim([60,100])

plt.show()

Interactive Counties

An interactive map built with Folium can be used to highlight the swings in some counties. The user can click a marker and see that county's data. This map is meant to underscore the fact that more educated and diverse counties are moving in one direction while less educated, whiter counties are moving in another.

In [80]:
map_osm = folium.Map(location=[40.8, -74.3118], zoom_start=6.5)
In [81]:
# go through some selected counties
for county,state,lat,long in [("Prince George's", "MD", 38.7849, -76.8721), \
                             ("Montgomery", "MD", 39.1547, -77.2405), \
                             ("Somerset", "MD", 38.0862, -75.8534), \
                             ("Anne Arundel", "MD", 38.9530, -76.5488), \
                             ("Salem", "NJ", 39.5849, -75.3879), \
                             ("Morris", "NJ", 40.8336, -74.5463), \
                             ("Luzerne", "PA", 41.1404, -75.9928), \
                             ("Westchester", "NY", 41.1220, -73.7949), \
                             ("Fairfield", "CT", 41.1408, -73.2613), \
                             ("Windham", "CT", 41.8276, -72.0468), \
                             ("Norfolk", "MA", 42.1767, -71.1449), \
                             ("Berkshire", "MA", 42.3118, -73.1822), \
                             ("Sullivan", "NY", 41.6897, -74.7805), \
                             ("Washington", "NY", 43.2519, -73.3709)]:
    df_section = df[(df['County'] == county) & (df['State'] == state)]
    
    # label the marker light blue or light red depending on how it swung in the election
    i = df_section.index.values[0]
    swing = df_section['Swing_2012_2016'][i]
    if swing > 0:
        c = 'lightblue'
    else:
        c = 'pink'
        
    # use county information as the marker popup
    info = county.upper() + " COUNTY" + \
    " -------- COLLEGE ATTAINMENT: " + str(round(df_section['College'][i], 2)) + "%" + \
    " -------- WHITE POPULATION: " + str(round(df_section['White'][i], 2)) + "%" + \
    " -------- 2012-2016 SWING: " + str(round(abs(swing), 2)) + " points " + \
                                              ("Democratic" if swing > 0 else "Republican")
    
    folium.Marker([lat, long], popup=info, icon=folium.Icon(color=c)).add_to(map_osm)

This map is an interactive selection of 14 counties. Many of these counties are very different from each other in education levels and demographics. Click through these counties, ranging from Maryland up to upstate New York, and see if you can find any patterns between these factors and each county's vote swing from 2012-2016.

In [82]:
map_osm
Out[82]:
(Interactive Folium map of the 14 selected counties, with markers colored by the direction of each county's 2012-2016 swing.)

Part 6: Conclusions

Congratulations! We have completed our data analysis for this study. Now it is time to make some conclusions.

Linear Regression Statistical Conclusions

  1. College attainment is strongly positively correlated with an area's swing towards the Democratic Party from 2012-2016
  2. This correlation also existed from 1992-1996, though it was far weaker
  3. A higher white population share is correlated with an area's swing towards the Republican Party from 2012-2016
  4. Education has become a far more important factor in vote swings in recent years
  5. Machine learning algorithms can be used to predict vote swings with a good degree of accuracy
  6. Income showed little to no correlation with vote swings in our preliminary model


These conclusions lead us to believe that education and demographics, combined with a county's 2012 vote, should be enough to confidently predict its 2016 vote. Let's try it out!

In [83]:
df_final = df[['County', 'State', 'Romney_2012_%', 'Trump_2016_%', 'GDP', 'College', 'White']].copy()
df_final.columns = ['County', 'State', 'Romney', 'Trump', 'GDP', 'College', 'White']

Prediction_Model = sm.ols(formula='Trump ~ Romney + College + White', data=df_final).fit()
print(Prediction_Model.summary2())
                  Results: Ordinary least squares
===================================================================
Model:              OLS              Adj. R-squared:     0.953     
Dependent Variable: Trump            AIC:                16031.5976
Date:               2020-12-18 17:59 BIC:                16055.6905
No. Observations:   3051             Log-Likelihood:     -8011.8   
Df Model:           3                F-statistic:        2.080e+04 
Df Residuals:       3047             Prob (F-statistic): 0.00      
R-squared:          0.953            Scale:              11.194    
--------------------------------------------------------------------
                Coef.   Std.Err.     t      P>|t|    [0.025   0.975]
--------------------------------------------------------------------
Intercept      11.6390    0.3555   32.7379  0.0000  10.9419  12.3361
Romney          0.8154    0.0048  169.9681  0.0000   0.8060   0.8248
College        -0.4340    0.0070  -62.2261  0.0000  -0.4476  -0.4203
White           0.1609    0.0034   47.5591  0.0000   0.1543   0.1675
-------------------------------------------------------------------
Omnibus:             1745.445      Durbin-Watson:         1.044    
Prob(Omnibus):       0.000         Jarque-Bera (JB):      34599.066
Skew:                -2.306        Prob(JB):              0.000    
Kurtosis:            18.840        Condition No.:         595      
===================================================================

Even though the data is heavily skewed, education, demographics, and prior voting explain more than 95% of the variation in a county's 2016 vote! This is helpful insight going forward when studying electoral politics.
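As a quick usage sketch (the input values below describe a hypothetical county, not a real one), the fitted model can predict a county's 2016 Trump share from its 2012 Romney share, college attainment, and white population:

# hypothetical county: Romney 52% in 2012, 28% college attainment, 80% white
hypothetical_county = pd.DataFrame({'Romney': [52.0], 'College': [28.0], 'White': [80.0]})
print(Prediction_Model.predict(hypothetical_county))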

Real World Conclusions

  1. More educated counties are moving towards the Democratic Party in recent years
  2. More white counties are moving towards the Republican Party in recent years
  3. These shifts vary depending on geographic region, underscoring the geographic diversity of America

Final Conclusions

After collecting our data, cleaning our data, and doing some exploratory analysis, several data transformations were required. Vote swing and GDP data were heavily skewed, and adjusting for this greatly improved our model.

We determined through linear regression analysis, visualization, hypothesis testing, and machine learning that college attainment is increasingly becoming the most important factor correlated with a party's success or failure in any given place in the United States. Demographics, GDP, and coastal status are also important factors.

Our conclusions can be used to further understand America's increasing political divide. The coming party coalitions will be built from differences in demographics and education, less so income alone. It is important for politicians, political observers, data scientists, pundits, and ordinary Americans to understand what is driving this monumental change, because little can be accomplished without that understanding.

Going Forward

When it comes to this complicated topic, our study is just the beginning. The linear model showed that the data we collected only explains about 60% of the vote shift from 2012 to 2016. There are many other factors contributing to the political divide (though education may be the most important), and it should be up to political data scientists to dive deeper and get a more thorough understanding. It is also worth noting that American politics is never static, and factors that may be significant now may not be significant in the future. It is the responsibility of all data scientists to keep pursuing this issue as elections go by.

Bibliography

Dave Leip's Atlas

Leip, David. Dave Leip's Atlas of U.S. Presidential Elections. http://uselectionatlas.org (18 December 2020).

Data Sources

https://www.bea.gov/data/income-saving/personal-income-county-metro-and-other-areas
https://ssti.org/blog/useful-stats-gdp-capita-county-2012-2015
https://data.ers.usda.gov/reports.aspx?ID=17829
https://www.census.gov/data/tables/time-series/demo/popest/2010s-counties-detail.html
https://www2.census.gov/library/stories/2018/08/coastline-counties-list.xlsx

Images

https://www.cnn.com/style/article/why-democrats-are-donkeys-republicans-are-elephants-artsy/index.html

Helpful Articles (Industry Knowledge)

https://www.nytimes.com/2016/11/10/us/politics/donald-trump-voters.html
https://www.theguardian.com/commentisfree/2020/nov/11/joe-biden-voters-republicans-trump
https://www.washingtonpost.com/politics/2018/11/23/mississippis-special-election-is-taking-place-one-most-racially-polarized-states-country/
https://www.nytimes.com/elections/2016/results/louisiana
https://time.com/4397192/donald-trump-utah-gary-johnson/

Helpful Articles (Data Science Knowledge)

https://www.geeksforgeeks.org/python-pandas-dataframe/
https://www.getlore.io/knowledgecenter/data-standardization#:~:text=Data%20Standardization%20is%20a%20data,it's%20loaded%20into%20target%20systems.
https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/
https://statisticsbyjim.com/regression/comparing-regression-lines/
https://builtin.com/data-science/gradient-descent