This project is done independently for the University of Maryland, with all code and prose written by me. All knowledge for completing this study was obtained through the University and internet sources.
The study in this project concerns American politics, but it is meant to be completely non-partisan. The purpose of this study is not to advance a particular viewpoint, but instead to take an impartial look into what drives the changes in the American political landscape.
American electoral politics has been in constant flux since the development of the two-party system, with party coalitions forming and falling apart over time. Today, the Republican Party enjoys substantial advantages in the South and Great Plains, while the Democratic Party holds a tight grip on the Northeast and West Coast. However, these modern coalitions did not always exist, and they continue to shift rapidly today. This study aims to identify which factors currently contribute to electoral gains for either party in the modern era, and to ascertain whether these factors have shifted over time.
In 2016, Donald J. Trump was elected the 45th President of the United States with a coalition of traditional Republican voters and new, white working-class voters. This winning strategy was unique for a Republican in the modern era, as some of these voters had been loyally Democratic before he came along. At the same time, Hillary Clinton made substantial gains among more educated and affluent voters across American suburbs. In 2020, Joseph R. Biden won by expanding dramatically upon Clinton's new coalition while slipping further among Trump's new base. Both major American parties are changing which groups they appeal to, and many factors contribute to these shifts in the electoral map across the 50 states. Preliminary hypotheses from experts suggest that income and education are quickly becoming the leading factors driving the changing electoral landscape. This study will investigate these hypotheses, in addition to other factors.
We will seek to identify the factors that contribute to the changing American political coalitions in the 21st century by examining historical presidential election results in America's 3,143 counties and determining which factors are correlated with changes in party support between two elections. This study will cover the entire data science lifecycle, from data collection all the way to conclusive findings. For the election data, we will look both at the raw win margin in a single election and at the shift in win margin across two elections for each party. In order to do this, we must first define several terms.
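As working definitions, consistent with how these quantities are computed in the code later in this notebook, a county's margin in a given election is the Democratic vote share minus the Republican vote share, and the swing between two elections is the change in that margin:
$\text{Margin} = \text{Democratic \%} - \text{Republican \%}$
$\text{Swing} = \text{Margin}_{\text{later election}} - \text{Margin}_{\text{earlier election}}$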
Why Is This Important?
The United States is increasingly a politically divided country. The ground is shifting beneath the politicians on both sides of the aisle, and it is often hard to discern why certain voters may be moving in a certain direction. This study attempts to offer a basic explanation for the changing American landscape by determining the factors that lead to vote swings for either party. Hopefully, this can help clarify things for political observers and open up a dialogue grounded in a more thorough understanding of the American voter.
In this section, we will collect all relevant data from several sources using Python. Our analysis will be on the county level. We will start by importing several relevant Python libraries for this study.
# standard Python libraries for data science
import numpy as np
import pandas as pd
import sklearn
from sklearn import linear_model
# Python libraries for HTTP requests of data from web pages
import requests
from bs4 import BeautifulSoup
# standard default dictionary module
from collections import defaultdict
# Python libraries for graphing and visualizations
import matplotlib.pyplot as plt
import statsmodels.api
import statsmodels.formula.api as sm
import seaborn
import folium
Possible factors to explain changes in the American electoral landscape include:
Data for personal income by county: https://www.bea.gov/data/income-saving/personal-income-county-metro-and-other-areas
Data for GDP per capita by county: https://ssti.org/blog/useful-stats-gdp-capita-county-2012-2015
Data for education by county: https://data.ers.usda.gov/reports.aspx?ID=17829
Data for race by county: https://www.census.gov/data/tables/time-series/demo/popest/2010s-counties-detail.html
List of coastal counties: https://www2.census.gov/library/stories/2018/08/coastline-counties-list.xlsx
Data for election results by county will be obtained through HTTP requests to Dave Leip's Atlas of U.S. Presidential Elections.
County election results for any modern presidential election in any state can be easily retrieved from Dave Leip's Atlas. A simple HTTP request with a parameterized URL will do the trick. We can then transform the returned data into a pandas DataFrame for later use.
First, we will define dictionaries for the FIPS value of each state, because this is how the Atlas identifies individual states. We will also create a dictionary for state abbreviations, which will be used in the DataFrame.
# maps states to their corresponding FIPS value for the atlas. Alaska and Louisiana data are not available.
fips = {"Alabama": 1, "Arizona": 4, "Arkansas": 5, "California": 6, "Colorado": 8, "Connecticut" : 9, \
"Delaware": 10, "DC": 11, "Florida": 12, "Georgia": 13, "Hawaii": 15, "Idaho": 16, "Illinois": 17, \
"Indiana": 18, "Iowa": 19, "Kansas": 20, "Kentucky": 21, "Maine": 23, "Maryland": 24, \
"Massachusetts": 25, "Michigan": 26, "Minnesota": 27, "Mississippi": 28, "Missouri": 29, "Montana": 30, \
"Nebraska": 31, "Nevada": 32, "New Hampshire": 33, "New Jersey": 34, "New Mexico": 35, \
"New York":36, "North Carolina":37, "North Dakota":38, "Ohio": 39, "Oklahoma": 40, "Oregon": 41, \
"Pennsylvania": 42, "Rhode Island": 44, "South Carolina": 45, "South Dakota": 46, "Tennessee": 47, \
"Texas": 48, "Utah": 49, "Vermont": 50, "Virginia": 51, "Washington": 53, "West Virginia": 54, \
"Wisconsin": 55, "Wyoming": 56}
# map of state abbreviations for use in MapCharts
abbs = {"Alabama": "AL", "Alaska": "AK", "Arizona": "AZ", "Arkansas": "AR", "California": "CA", "Colorado": "CO", \
"Connecticut" : "CT", "Delaware": "DE", "DC": "DC", "Florida": "FL", "Georgia": "GA", "Hawaii": "HI", \
"Idaho": "ID", "Illinois": "IL", "Indiana": "IN", "Iowa": "IA", "Kansas": "KS", "Kentucky": "KY", \
"Louisiana": "LA", "Maine": "ME", "Maryland": "MD", "Massachusetts": "MA", "Michigan": "MI", \
"Minnesota": "MN", "Mississippi": "MS", "Missouri": "MO", "Montana": "MT", "Nebraska": "NE", \
"Nevada": "NV", "New Hampshire": "NH", "New Jersey": "NJ", "New Mexico": "NM", "New York": "NY", \
"North Carolina": "NC", "North Dakota": "ND", "Ohio":"OH", "Oklahoma": "OK", "Oregon": "OR", \
"Pennsylvania": "PA", "Rhode Island": "RI", "South Carolina": "SC", "South Dakota": "SD", "Tennessee": "TN", \
"Texas": "TX", "Utah": "UT", "Vermont": "VT", "Virginia": "VA", "Washington": "WA", \
"West Virginia": "WV", "Wisconsin": "WI", "Wyoming": "WY"}
all_states = list(abbs.keys())
atlas_states = list(fips.keys())
Next, we will define our election_results() function. This is our most complex function and is what all of our data collection is built on. This function will accept a state name and election year as arguments, and return a DataFrame containing presidential election results for that year in every county in the state. These results will be in the form of a column for each major candidate containing a vote percentage for each county. This is how we will calculate the margin for every county in the state.
def election_results(state, year):
request = requests.get("https://uselectionatlas.org/RESULTS/datagraph.php?year=" + \
str(year) + "&fips=" + str(fips[state]))
soup = BeautifulSoup(request.content, "html.parser")
# list of tables, each table corresponding to a county
tables = soup.body.find("div", {"class": "info"}).find_all("table")
results = {} # will contain results for each county
all_candidates = {} # running list of all candidates on the ballot in this state
for county in tables:
values = county.find_all("tr") # list of candidate rows for the county
county_name = values[0].find("td").b.string
# deal with exceptions with independent cities and counties of the same name
for exception,exception_state in [("Baltimore","Maryland"), ("Fairfax","Virginia"), ("Richmond", "Virginia"), \
("Bedford", "Virginia"), ("Franklin", "Virginia"), ("Roanoke","Virginia"), \
("St. Louis", "Missouri")]:
if county_name == exception and state == exception_state:
if not exception + " County" in list(results.keys()):
county_name = exception + " County"
else:
county_name = exception + " City"
break
# small changes in county names for consistency
if county_name == "Dewitt" and state == "Texas":
county_name = "DeWitt"
elif county_name == "Desoto" and state == "Florida":
county_name = "DeSoto"
elif county_name == "Dade" and state == "Florida":
county_name = "Miami-Dade"
elif county_name == "Ormsby" and state == "Nevada":
county_name = "Carson City"
elif county_name == "Shannon" and state == "South Dakota":
county_name = "Oglala Lakota"
candidate_results = defaultdict(lambda: 0.0) # results for each candidate in the county
for candidate in values[:2]: # first two candidates (Democrat, Republican)
name = candidate.find("td", {"class":"cnd"})
if name is None:
name = candidate.find("td").string
else:
name = name.string
candidate_results[name] = float((candidate.find("td", {"class":"per"}).string)[:-1])
all_candidates[name] = None
results[county_name] = candidate_results
# the election results have been retrieved and compiled into dictionary form.
# they must now be compiled into a pandas DataFrame
# Current data format for results
# {
# county_name:
# {Candidate: percentage, Candidate: percentage},
# county_name:
# {Candidate: percentage, Candidate: percentage},
# etc
# }
# return results in the form of a DataFrame
df = {"County": list(results.keys())}
for candidate in all_candidates.keys(): # for every candidate
df[candidate + "_" + str(year) + "_%"] = [results[county][candidate] for county in results.keys()]
return pd.DataFrame(df)
We will now define our election_swings() function, which accepts two election results DataFrames as arguments and returns a DataFrame containing the vote swings for each county between the two elections given.
def election_swings(election1, election2):
# compile results for each state
results = election1.merge(election2, on="County", how="outer")
candidates = results.columns[1:]
year1 = candidates[0][-6:-2]
year2 = candidates[2][-6:-2]
# calculate swing
swing = []
for i,row in results.iterrows():
swing.append((row[candidates[2]] - row[candidates[3]]) - (row[candidates[0]] - row[candidates[1]]))
results["Swing_" + year1 + "_" + year2] = swing
return results
Here is an example of what these functions return, using the state of Maine.
maine_example = election_results('Maine', 2020)
maine_example.head()
election_swings(election_results('Maine', 2016), maine_example).head()
As you can see, election_results() shows what percentage of the vote each major party candidate received in each county, and election_swings() shows that information as well as the vote swing between each election.
Note: for this study, a negative vote swing implies a swing towards the Republican party, while a positive vote swing implies a swing towards the Democratic party. A swing of 0 implies the vote margin did not change between the two elections. This was chosen arbitrarily.
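As a quick worked example with made-up numbers (not a real county): suppose the Democratic candidate won a county 55% to 45% in the first election and 52% to 48% in the second.
# hypothetical illustration of the sign convention (made-up percentages, not a real county)
margin_first = 55 - 45    # Democratic margin in the earlier election: +10
margin_second = 52 - 48   # Democratic margin in the later election: +4
print(margin_second - margin_first)  # -6, i.e. a 6-point swing towards the Republicans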
We will now gather all election data from the 48 states (plus DC) where data is available (we will touch on the other two states later). This can be done by looping through every state and merging the DataFrames. Because this involves two HTTP requests for each state, this step will take the longest to compute. No need to worry if it takes several seconds up to a minute to complete if you are doing this yourself. We will start by collecting all 2012-2016 vote swing data.
# swing between 2012 and 2016
elections_df = election_swings(election_results(atlas_states[0], 2012), election_results(atlas_states[0], 2016))
elections_df.loc[:, 'State'] = abbs[atlas_states[0]] # add state abbreviations
# iterate over every available state, appending the rows for each state at the end of the DataFrame
for state in atlas_states[1:]:
new_df = election_swings(election_results(state, 2012), election_results(state, 2016))
new_df.loc[:, 'State'] = abbs[state]
elections_df = elections_df.append(new_df, ignore_index=True)
# reset column names
elections_df = elections_df[[elections_df.columns[0]] + [elections_df.columns[-1]] + list(elections_df.columns[1:-1])]
elections_df.head()
Now let's try adding on 1992-1996 swing data. We can always add more years, but for now we will be working just with these two sets of swings.
# do the same thing for 1992-1996 election data
df_1992 = election_swings(election_results(atlas_states[0], 1992), election_results(atlas_states[0], 1996))
df_1992.loc[:, 'State'] = abbs[atlas_states[0]]
for state in atlas_states[1:]:
new_df = election_swings(election_results(state, 1992), election_results(state, 1996))
new_df.loc[:, 'State'] = abbs[state]
df_1992 = df_1992.append(new_df, ignore_index=True)
df_1992 = df_1992[[df_1992.columns[0]] + [df_1992.columns[-1]] + list(df_1992.columns[1:-1])]
# outer merge, in order to allow us to address missing data directly
elections_df = elections_df.merge(df_1992, on=['County', 'State'], how='outer')
elections_df.head()
Now we have all of the election data that we want to begin with, so it is time to move on to collecting the rest of the data. This will be done by downloading Excel sheets from the data sources listed above.
Remember: the preliminary independent variables for our study are Income, GDP per capita, Education, Race, and Coastal or not (binary variable). We will start with income.
Here is the format we will be using for each DataFrame:
We will now gather per capita income data from the above source and compile it into a DataFrame.
Note: the footnotes in this Excel sheet were manually removed for formatting purposes. In addition, the formatting for Virginia independent cities has been fixed so that they are interpreted simply as counties in the state, and Virginia "combination areas" have been removed. Finally, the county of "Washington" was added under the Washington section and the title of the section was changed to "DC" for formatting purposes, because in this study DC is formatted as the state of DC with the lone county of Washington.
income_df = pd.read_excel('income_by_county.xlsx', header=3, usecols='A,D', names=['County', 'Income'])
Now we will collect the data for GDP per capita for each county.
gdp_df = pd.read_excel('gdp_per_capita_per_county.xlsx', header=1, usecols='B,C,P', \
names=['County', 'State', 'GDP'])
We will now move on to gathering education data for each county. Specifically, we will determine the percentage of the population in each county that has completed a college education. Because this factor will be used extensively in this study, we will record the data for each decade starting in 1990. Unfortunately, each state must be downloaded as its own separate Excel sheet. I have given them all similar file names for convenience.
Note: the Florida and South Dakota datasets were manually edited to merge the two Miami-Dade counties and Oglala Lakota/Shannon name change counties. These two pairs of counties refer to the same county.
# we will store all of the education DataFrames in a dictionary
education_dfs = {}
for state in all_states:
education_dfs[state] = pd.read_excel('education/' + abbs[state].lower() + '_education.xlsx', header=2, \
usecols='B,F,G,I', names=['County', 'College_1990', 'College_2000', 'College'])\
.drop(index=0)
education_dfs[state] = education_dfs[state][education_dfs[state]['College'].notna()]
Finally, we will collect race demographics data for each county. Specifically, we are looking at what percentage of the population is non-Hispanic white.
Note: the dataset provided is far too large to use efficiently, as it splits all 3,000+ counties up by 19 age groups and 12 different measurements of the population across 10 years. This leads to more than 700,000 rows. In order to simplify this, I manually deleted all rows in the Excel sheet that were not from measurement 12 (the most recent measurement) and age group 0 (total).
demographics_df = pd.read_excel('demographics_by_county.xlsx', usecols='D,E,G,H,AI,AJ', \
names=['State', 'County', 'Age Group', 'Population', 'White Male', 'White Female'])
Here, we will set a binary variable (1 or 0) denoting whether the given county borders the Pacific Ocean, Atlantic Ocean, or Gulf of Mexico.
coastline_df = pd.read_excel('coastline_counties.xlsx', header=3, usecols='D,E', names=['County','State'])\
.dropna(how='any')
Now that we have collected all of our data, it must be cleaned up and put into a format for merging all DataFrames together. This can be tricky, as not all data sources use the same format for county names. We will address each DataFrame individually. After formatting the data to be merged properly, we must deal with any missing data.
In order to merge all of our DataFrames into a single DataFrame, they must all have the same format for county names. The format we will be using is one column for the county name and one column for the state abbreviation, similar to the election swing DataFrame we have already created.
income_df.loc[:, 'State'] = np.nan
# used to check NaN values in String columns
def isnan(x):
try:
return np.isnan(x)
except:
return False
current_state = ''
rows_to_delete = []
for i,row in income_df.iterrows():
# this process aligns with the formatting of the table
# we do not want to include whole states, so we will check
# which rows are states and add their abbreviation to
# every subsequent county, dropping the state at the end
if isnan(income_df['County'][i]):
income_df.loc[i+1, 'Income'] = np.nan
rows_to_delete.append(i)
elif isnan(income_df['Income'][i]):
current_state = abbs[row['County']]
rows_to_delete.append(i)
else:
income_df.loc[i, 'State'] = current_state
income_df = income_df.drop(index=rows_to_delete).dropna(how='all')
income_df = income_df[['County', 'State', 'Income']]
income_df.head()
Fortunately, this data is already perfectly formatted.
gdp_df.head()
for state in all_states:
education_dfs[state].loc[:, 'State'] = \
education_dfs[state]['County'][education_dfs[state].index.values[0]].strip()[-2:]
# remove the word 'county' from each county name and add the relevant state
education_dfs[state]['County'] = [county.strip()[:-4] for county in education_dfs[state]['County']]
education_dfs[state] = education_dfs[state][['County', 'State', 'College', 'College_1990', 'College_2000']]
education_df = education_dfs[all_states[0]]
for state in all_states[1:]:
education_df = education_df.append(education_dfs[state], ignore_index=True)
education_df['College'] = education_df['College'] * 100 # adjust for percentage
education_df['College_1990'] = education_df['College_1990'] * 100
education_df['College_2000'] = education_df['College_2000'] * 100
education_df.head()
# only use Age Group 0 (total cumulative age groups)
demographics_df = demographics_df[demographics_df['Age Group'] == 0][['County', 'State', 'Population', \
'White Male', 'White Female']]
# combine white male and white female population
demographics_df['White Population'] = demographics_df['White Male'] + demographics_df['White Female']
# gather percentage by dividing by whole population
demographics_df['White'] = demographics_df['White Population'] / demographics_df['Population']
# this is for inconsistencies in county names
demographics_df['County'] = demographics_df['County'].apply(lambda c: \
c.replace('District of Columbia', 'Washington County'))
demographics_df['State'] = demographics_df['State'].apply(lambda c: \
c.replace('District of Columbia', 'DC'))
# add state abbreviations
demographics_df['State'] = [abbs[s] for s in demographics_df['State']]
demographics_df['County'] = [c.replace('Parish', 'County')[:-7] if c.replace('Parish', 'County')[-6:] == 'County' \
else c.replace('Parish', 'County')[:-5] for c in demographics_df['County']]
demographics_df = demographics_df[['County', 'State', 'Population', 'White']]
demographics_df['White'] = demographics_df['White'] * 100 # adjust for percentage
demographics_df.head()
coastline_df['County'] = coastline_df['County'].apply(lambda c: str(c).replace(' County', '').replace(' Parish', ''))
coastline_df['State'] = coastline_df['State'].apply(lambda s: abbs[s])
coastline_df.head()
Now that we have completed our formatting, we have 6 DataFrames containing all of the data we need to get started on our analysis.
We must now merge all of these DataFrames into a single master DataFrame, containing all necessary data to be used in our analysis. After doing this, and cleaning up any missing data, we can move on to analyzing our data.
df = elections_df.merge(income_df, on=['County', 'State'], how='outer')\
.merge(gdp_df, on=['County', 'State'], how='outer')\
.merge(education_df, on=['County', 'State'], how='outer')\
.merge(demographics_df, on=['County', 'State'], how='outer')
# add a Coastal column, originally set to all 0, and then change the coastal counties to 1
df.loc[:, 'Coastal'] = 0
for i,row in coastline_df.iterrows():
df.loc[((df['County'] == (row['County'])) & (df['State'] == (row['State']))), 'Coastal'] = 1
df.head()
Given that we gathered data from several different sources, there are likely to be discrepancies with county and state names that result in missing data. In addition, some counties that existed in 1992 may not exist today, and vice versa. Finally, Dave Leip's Atlas does not readily provide data for Louisiana. We must manage this missing data before moving on to the exploratory phase.
Unfortunately, the state of Alaska does not report presidential election results by county, instead reporting them by state legislative district. This makes Alaska rather useless in our analysis. Fortunately, Alaska is a very unique and geographically distinct state from the other 49, and removing it from our analysis will not introduce much bias when we are working with so many other counties.
df = df[df['State'] != 'AK']
We will now check through all of the rows in our DataFrame that are missing data and determine a cause and potential solution. Before we begin, here are some potential causes for missing data that can be addressed.
len(df[df.isnull().any(axis=1)])
As we can see, there are 218 rows containing missing values somewhere. This is not ideal. After scouring through all of the rows in the DataFrame, I am able to identify the patterns that explain almost all of the missing data.
Identified Causes for Missing Values
Addressing the inconsistent county names is fairly simple. All we have to do is develop functions to change state and county names, and apply them to all situations that call for it.
# function that changes a county name if the county and state values match
def replace_county(dataframe, county, state, new_county):
try:
index = dataframe[(dataframe['County'] == county) & (dataframe['State'] == state)].index.values[0]
dataframe.loc[index, 'County'] = new_county
return dataframe
except:
return dataframe
# some basic name inconsistencies that can be easily fixed
elections_df = replace_county(elections_df, 'District of Columbia', 'DC', 'Washington')
elections_df = replace_county(elections_df, 'Dewitt', 'TX', 'DeWitt')
elections_df = replace_county(elections_df, 'Desoto', 'FL', 'DeSoto')
elections_df = replace_county(elections_df, 'Dade', 'FL', 'Miami-Dade')
elections_df = replace_county(elections_df, 'Shannon', 'SD', 'Oglala Lakota')
elections_df = replace_county(elections_df, 'Lac Qui Parle', 'MN', 'Lac qui Parle')
elections_df = replace_county(elections_df, 'Dona Ana', 'NM', 'Doña Ana')
# all of these places are both a county and an independent city that share a name.
# this addresses that problem of duplicate names.
for dataframe in [elections_df, income_df, gdp_df, education_df, demographics_df]:
for county,state in [("Baltimore","MD"), ("Fairfax","VA"), ("Richmond", "VA"), \
("Bedford", "VA"), ("Franklin", "VA"), ("Roanoke","VA"), \
("St. Louis", "MO")]:
replace_county(dataframe, county, state, county + ' County')
replace_county(dataframe, county, state, county + ' City')
# more inconsistent names
gdp_df = replace_county(gdp_df, 'District of Columbia', 'DC', 'Washington')
education_df = replace_county(education_df, 'District of Columbia', 'DC', 'Washington')
education_df = replace_county(education_df, 'La Salle', 'LA', 'LaSalle')
demographics_df = replace_county(demographics_df, 'Carson', 'NV', 'Carson City')
# more inconsistent county names across the data sets. These must all be fixed
income_df['County'] = income_df['County'].apply(lambda c: c.replace(' (includes Yellowstone National Park)', '')\
.replace('Lagrange', 'LaGrange')\
.replace('Maui + Kalawao', 'Maui'))
gdp_df['County'] = gdp_df['County'].apply(lambda c: c.replace('(Independent City)', 'City')\
.replace(' (includes Yellowstone Park)', '')\
.replace('Lagrange', 'LaGrange')\
.replace('Carson City City', 'Carson City')\
.replace('Maui + Kalawao', 'Maui'))
education_df['County'] = education_df['County'].apply(lambda c: c.replace('Dona Ana', 'Doña Ana'))
# now that the formatting inconsistencies have been fixed, we can merge the data together again
df = elections_df.merge(income_df, on=['County', 'State'], how='outer')\
.merge(gdp_df, on=['County', 'State'], how='outer')\
.merge(education_df, on=['County', 'State'], how='outer')\
.merge(demographics_df, on=['County', 'State'], how='outer')
df.loc[:, 'Coastal'] = 0
for i,row in coastline_df.iterrows():
df.loc[((df['County'] == (row['County'])) & (df['State'] == (row['State']))), 'Coastal'] = 1
df = df[df['State'] != 'AK'] # remove Alaska data again
# check again how many rows have missing data
len(df[df.isnull().any(axis=1)])
After correcting the inconsistent county names, we are left with 160 rows with missing data. These rows are a result of all of Louisiana's counties missing election data, city and county groupings in Virginia that leave income and GDP data unavailable, a county in Colorado that did not exist in 1992, and a county in Hawaii so small that election data is not recorded.
The new Colorado county and tiny Hawaii county can simply be dropped, as their data would be irrelevant to this study. Unfortunately, we are going to have to drop all of the Virginia rows with missing data as well, because the grouping of counties and independent cities cannot be recovered. Fortunately, this data is Missing Completely at Random (MCAR), meaning there is no pattern to explain which counties are missing, so it is relatively safe to drop.
df = df.drop(df[(df['State'] == 'VA') & (df.isnull().any(axis=1))].index.values)
df = df.drop(df[(df['County'] == 'Broomfield') | (df['County'] == 'Kalawao')].index.values)
len(df[df.isnull().any(axis=1)])
This leaves just the 64 Louisiana counties. This missing data will be dealt with in the exploratory phase.
Now that we have put together our comprehensive DataFrame, we can get to work. Our one remaining problem is the missing Louisiana election data. Luckily, this data is Missing at Random (MAR), meaning that the cause of the missing data is an observed characteristic (whether the county is in Louisiana or not). That means there are several methods to impute this missing data with some level of confidence.
One such method is hot-deck imputation, where we fill in missing values from another row that matches the given data point as closely as possible on characteristics related to the missing values. To the despair of many, the American South is one of the most racially polarized regions in the United States, meaning race is a very accurate predictor of how a given area will vote. With this in mind, the racial demographics of a southern county should be a good predictor of how a similar county in Louisiana will vote. We will use this to perform hot-deck imputation, giving 90% of the weight to race and 10% to education when determining the southern county most similar to each Louisiana county.
For this imputation, we will define the southern states as Arkansas, Tennessee, Mississippi, Alabama, Georgia, South Carolina, and North Carolina. Other southern states do not carry the same level of racial polarization and will thus not be as effective in predicting the missing data for Louisiana.
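Formally, the similarity measure described above is a weighted distance between a Louisiana county $i$ and a candidate southern county $j$:
$d(i, j) = 0.9\,\lvert \text{White}_{i} - \text{White}_{j} \rvert + 0.1\,\lvert \text{College}_{i} - \text{College}_{j} \rvert$
The southern county with the smallest $d$ serves as the donor for the missing election values.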
First, we will define a function for determining which southern county is most similar to a given Louisiana county.
# define the southern states
southern_states = df[(df['State'] == 'AR') | (df['State'] == 'TN') | (df['State'] == 'MS') | (df['State'] == 'AL') | \
(df['State'] == 'GA') | (df['State'] == 'SC') | (df['State'] == 'NC')]
# this is an algorithm that will determine the most similar county to the county given as an argument
# this algorithm uses the formula described above
def similar_southern_county(row):
most_similar = None
distance = 101
for i,r in southern_states.iterrows():
new_distance = (0.9 * abs(row['White'] - r['White'])) + \
(0.1 * abs(row['College'] - r['College']))
if new_distance < distance:
most_similar = r
distance = new_distance
return most_similar
Now it's time for the imputation.
for i,row in df[df['State'] == 'LA'].iterrows():
similar_county = similar_southern_county(row)
# we will leave 1990s data blank because these education and race factors might not apply back then
for col in df.columns[2:7]:
df.loc[i, col] = similar_county[col]
df[df['State'] == 'LA'].head()
Given the politics of the American South, we can expect many of these county estimations to be accurate within a margin of error. A brief check on the New York Times election results page can confirm that this hot-deck imputation did exactly what we wanted it to.
Congratulations! We officially have no more missing data for 2012-2016. Now we can get started building a preliminary model!
We are going to start by developing a preliminary regression model. We can expect that an early model will have many problems in need of addressing. This section is just to get the feel of linear regression; finalizing our model will occur in a later section.
A linear regression model is used to predict values in a system with a linear correlation between a dependent variable and one or more independent variables. For our initial hypothesis, we predict that the vote swing between 2012 and 2016 in each county is correlated with income, GDP per capita, racial demographics, and whether the county is coastal. We can use this hypothesis to create a simple regression model that will help us determine how correlated each characteristic is with vote swing.
Here is the equation that our hypothesis rests on.
$\text{Vote Swing} = \beta_{0} + \beta_{1}(\text{Income per Capita}) + \beta_{2}(\text{GDP per Capita}) + \beta_{3}(\text{College Attainment}) + \beta_{4}(\text{White Population}) + \beta_{5}(\text{Coastal})$
# use the statsmodels library to create linear regression model using an inputed equation and DataFrame
InitialModel = sm.ols(formula='Swing_2012_2016 ~ Income + GDP + College + White + Coastal', data=df).fit()
print(InitialModel.summary2())
Here are some statistics from this model worth reviewing:
As we can see, our model explains 46.8% of the variation in vote swing across the counties. It seems that, contrary to our original theory, income and GDP are not significantly correlated with vote swing.
We can determine from the coefficients that, on average and all else equal, a 1 percentage point increase in college attainment in a given county is associated with a 0.5904 point larger swing towards the Democratic Party between the two elections. Conversely, a 1 percentage point increase in the white share of the population is associated with a 0.2253 point larger swing towards the Republican Party, all else equal. We will have to refine our model to determine whether there is any correlation between a county's coastal status and its vote swing.
We should now look at potential problems with our model beyond the basic statistics that can be fixed in the next section.
Multicollinearity is when two or more of the independent variables are highly correlated with each other. This can cause major problems with the model. We can check for any instances of multicollinearity with a simple correlation matrix of our DataFrame.
df[['Income', 'GDP', 'College', 'White', 'Coastal']].corr()
This correlation matrix underscores some of the problems with our original model. It seems, unsurprisingly, that GDP and income are fairly highly correlated. It also seems that college attainment and income are very highly correlated. Between the high p-values in the model and the high correlations with other variables here, it is becoming clear that income, at least, is not necessary for our model and is in fact hurting it.
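As a complementary check beyond the correlation matrix, variance inflation factors (VIFs) quantify how much each coefficient's variance is inflated by collinearity with the other predictors; values above roughly 5-10 are commonly treated as a warning sign. Here is a minimal sketch using statsmodels (shown as a supplement, not part of the original correlation-matrix approach):
# variance inflation factor for each predictor; large values indicate multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
predictors = statsmodels.api.add_constant(df[['Income', 'GDP', 'College', 'White', 'Coastal']].dropna())
for idx, name in enumerate(predictors.columns):
    if name != 'const':
        print(name, variance_inflation_factor(predictors.values, idx))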
Now that we have examined multicollinearity, it is time to look at some diagnostic plots for our model. Specifically, we will look at the Fitted Values vs. Residuals plot, the Histogram of Residuals, and the QQ Plot.
Let's define a function for producing diagnostic plots and examine the results.
# this function takes in a model and some dimensions, and displays the diagnostic plots in the given dimensions
def diagnostic_plots(model, x, y):
# use matplotlib to plot some diagnostic plots
figure, (axis1, axis2, axis3) = plt.subplots(1,3) # creats a figure for the three plots
figure.set_size_inches(x,y)
axis1.scatter(model.fittedvalues, model.resid)
axis1.plot(model.fittedvalues, [0 for i in model.fittedvalues], color='red') # line on 0 for measuring
axis1.set_title('Fitted Values vs. Residuals')
axis1.set_xlabel('Fitted Values')
axis1.set_ylabel('Residuals')
axis2.hist(model.resid)
axis2.set_title('Residuals Histogram')
axis2.set_xlabel('Residuals')
statsmodels.api.qqplot(model.resid, line='45', ax=axis3)
axis3.set_title('Residuals QQ Plot')
plt.show()
diagnostic_plots(InitialModel, 18, 5)
These diagnostic plots, especially the QQ plot, are far from ideal. It appears from all three plots that our swing data is heavily skewed. This is likely explained by an abundance of very small counties with very large swings. It might also be explained by the unique case of Utah, a state that swung dramatically towards the Democrats in 2016 because of Mormon distaste for Donald Trump (more on this later). These problems can be easily addressed when we work to perfect our model in the next section.
We can use this section to practice our data visualization skills. We will create some informative graphs using Pyplot and our data.
Let's start by graphing out some correlations between education and vote swing in some different geographic regions across the United States.
# define subgroups of our data set based on certain states
df_newengland = df[(df['State'] == 'ME') | (df['State'] == 'NH') | (df['State'] == 'VT') | (df['State'] == 'MA') | \
(df['State'] == 'RI') | (df['State'] == 'CT')]
df_south = df[(df['State'] == 'AR') | (df['State'] == 'TN') | (df['State'] == 'MS') | (df['State'] == 'AL') | \
(df['State'] == 'GA') | (df['State'] == 'SC') | (df['State'] == 'NC') | (df['State'] == 'LA')]
df_southwest = df[(df['State'] == 'TX') | (df['State'] == 'CO') | (df['State'] == 'NM') | (df['State'] == 'AZ') | \
(df['State'] == 'NV')]
df_west = df[(df['State'] == 'WA') | (df['State'] == 'OR') | (df['State'] == 'CA')]
df_midwest = df[(df['State'] == 'MI') | (df['State'] == 'WI') | (df['State'] == 'MN') | (df['State'] == 'OH') | \
(df['State'] == 'IN') | (df['State'] == 'IL') | (df['State'] == 'IA')]
df_mountainwest = df[(df['State'] == 'MT') | (df['State'] == 'WY') | (df['State'] == 'ID') | (df['State'] == 'ND') | \
(df['State'] == 'SD')]
# create a figure for each geographic region
figure, axes = plt.subplots(3,2)
figure.set_size_inches(15, 18)
figure.suptitle('Education and Vote Shift: 2012-2016', fontsize=20)
# plot the college education against vote swing for each region
df_newengland.plot.scatter('College', 'Swing_2012_2016', title='New England', ax=axes[0][0])
m,b = np.polyfit(df_newengland['College'], df_newengland['Swing_2012_2016'], 1) # this includes a regression line
axes[0][0].plot(df_newengland['College'], m*df_newengland['College'] + b, color='red')
axes[0][0].set_xlabel('College Attainment Rate')
axes[0][0].set_ylabel('Swing Towards Democrats from 2012 to 2016')
df_south.plot.scatter('College', 'Swing_2012_2016', title='The Deep South', ax=axes[0][1])
m,b = np.polyfit(df_south['College'], df_south['Swing_2012_2016'], 1)
axes[0][1].plot(df_south['College'], m*df_south['College'] + b, color='red')
axes[0][1].set_xlabel('College Attainment Rate')
axes[0][1].set_ylabel('Swing Towards Democrats from 2012 to 2016')
df_southwest.plot.scatter('College', 'Swing_2012_2016', title='The Southwest', ax=axes[1][0])
m,b = np.polyfit(df_southwest['College'], df_southwest['Swing_2012_2016'], 1)
axes[1][0].plot(df_southwest['College'], m*df_southwest['College'] + b, color='red')
axes[1][0].set_xlabel('College Attainment Rate')
axes[1][0].set_ylabel('Swing Towards Democrats from 2012 to 2016')
df_west.plot.scatter('College', 'Swing_2012_2016', title='West Coast', ax=axes[1][1])
m,b = np.polyfit(df_west['College'], df_west['Swing_2012_2016'], 1)
axes[1][1].plot(df_west['College'], m*df_west['College'] + b, color='red')
axes[1][1].set_xlabel('College Attainment Rate')
axes[1][1].set_ylabel('Swing Towards Democrats from 2012 to 2016')
df_midwest.plot.scatter('College', 'Swing_2012_2016', title='The Midwest', ax=axes[2][0])
m,b = np.polyfit(df_midwest['College'], df_midwest['Swing_2012_2016'], 1)
axes[2][0].plot(df_midwest['College'], m*df_midwest['College'] + b, color='red')
axes[2][0].set_xlabel('College Attainment Rate')
axes[2][0].set_ylabel('Swing Towards Democrats from 2012 to 2016')
df_mountainwest.plot.scatter('College', 'Swing_2012_2016', title='Mountain West', ax=axes[2][1])
m,b = np.polyfit(df_mountainwest['College'], df_mountainwest['Swing_2012_2016'], 1)
axes[2][1].plot(df_mountainwest['College'], m*df_mountainwest['College'] + b, color='red')
axes[2][1].set_xlabel('College Attainment Rate')
axes[2][1].set_ylabel('Swing Towards Democrats from 2012 to 2016')
plt.show()
As we can see, most geographic regions in the US show a strong correlation between education and vote swing. Generally, counties with a higher level of college attainment swung more towards the Democratic Party in 2016.
Now, let's see the same for the swings from 1992-1996.
# adjusted to remove Louisiana, because that data is still missing for 1992-1996
df_south = df[(df['State'] == 'AR') | (df['State'] == 'TN') | (df['State'] == 'MS') | (df['State'] == 'AL') | \
(df['State'] == 'GA') | (df['State'] == 'SC') | (df['State'] == 'NC')]
# do the same process, but using 1992-1996 data
figure, axes = plt.subplots(3,2)
figure.set_size_inches(15, 18)
figure.suptitle('Education and Vote Shift: 1992-1996', fontsize=20)
df_newengland.plot.scatter('College_1990', 'Swing_1992_1996', title='New England', ax=axes[0][0], c='red')
m,b = np.polyfit(df_newengland['College_1990'], df_newengland['Swing_1992_1996'], 1)
axes[0][0].plot(df_newengland['College_1990'], m*df_newengland['College_1990'] + b)
axes[0][0].set_xlabel('College Attainment Rate')
axes[0][0].set_ylabel('Swing Towards Democrats from 1992 to 1996')
df_south.plot.scatter('College_1990', 'Swing_1992_1996', title='The Deep South', ax=axes[0][1], c='red')
m,b = np.polyfit(df_south['College_1990'], df_south['Swing_1992_1996'], 1)
axes[0][1].plot(df_south['College_1990'], m*df_south['College_1990'] + b)
axes[0][1].set_xlabel('College Attainment Rate')
axes[0][1].set_ylabel('Swing Towards Democrats from 1992 to 1996')
df_southwest.plot.scatter('College_1990', 'Swing_1992_1996', title='The Southwest', ax=axes[1][0], c='red')
m,b = np.polyfit(df_southwest['College_1990'], df_southwest['Swing_1992_1996'], 1)
axes[1][0].plot(df_southwest['College_1990'], m*df_southwest['College_1990'] + b)
axes[1][0].set_xlabel('College Attainment Rate')
axes[1][0].set_ylabel('Swing Towards Democrats from 1992 to 1996')
df_west.plot.scatter('College_1990', 'Swing_1992_1996', title='West Coast', ax=axes[1][1], c='red')
m,b = np.polyfit(df_west['College_1990'], df_west['Swing_1992_1996'], 1)
axes[1][1].plot(df_west['College_1990'], m*df_west['College_1990'] + b)
axes[1][1].set_xlabel('College Attainment Rate')
axes[1][1].set_ylabel('Swing Towards Democrats from 1992 to 1996')
df_midwest.plot.scatter('College_1990', 'Swing_1992_1996', title='The Midwest', ax=axes[2][0], c='red')
m,b = np.polyfit(df_midwest['College_1990'], df_midwest['Swing_1992_1996'], 1)
axes[2][0].plot(df_midwest['College_1990'], m*df_midwest['College_1990'] + b)
axes[2][0].set_xlabel('College Attainment Rate')
axes[2][0].set_ylabel('Swing Towards Democrats from 1992 to 1996')
df_mountainwest.plot.scatter('College_1990', 'Swing_1992_1996', title='Mountain West', ax=axes[2][1], c='red')
m,b = np.polyfit(df_mountainwest['College_1990'], df_mountainwest['Swing_1992_1996'], 1)
axes[2][1].plot(df_mountainwest['College_1990'], m*df_mountainwest['College_1990'] + b)
axes[2][1].set_xlabel('College Attainment Rate')
axes[2][1].set_ylabel('Swing Towards Democrats from 1992 to 1996')
plt.show()
As we can see from these preliminary visuals, the correlation between education and vote swing is a fairly recent phenomenon. We see strong correlations today, but in the 1990s there was essentially no correlation. With this in mind, we can begin our full model implementation.
Now that we have collected, cleaned, and explored our data, it is time to get to work developing a complete linear regression explaining the factors contributing to America's changing electoral landscape. We will use our independent variables to create a regression model to determine their correlation with the vote swings across the country. This will allow us to conclude what variables are correlated with each party's improving and diminishing fortunes in different places.
In order to do this, we will play around with data transformations and machine learning to perfect our model. We will also use hypothesis testing to determine if these factors have always contributed to electoral swings or if they are a new phenomenon.
As we saw from our diagnostic plots in Part 3, our county swing data is heavily skewed. There could be several explanations for this. One is that an abundance of small counties with large swings is skewing data.
The best way to handle this without dropping all of these small counties is to group them together. Let's try grouping the small counties into population intervals: under 1,000; 1,000-2,000; 2,000-3,000; 3,000-4,000; 4,000-5,000; 5,000-10,000; 10,000-15,000; and 15,000-20,000 residents.
Each interval will be collapsed into a single row whose values are the averages of the counties in it. This way, we get a snapshot of the small counties without letting them skew our data. We can also remove income from our list of independent variables, as our preliminary model suggested that it is the least likely to be correlated with vote swing.
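Before the full grouping step below, here is a compact way to inspect the same population bins using pandas' cut() function (illustrative only; the transformation actually used in this study is the loop that follows):
# illustrative look at the small-county bins and their average swings using pd.cut
bins = [0, 1000, 2000, 3000, 4000, 5000, 10000, 15000, 20000]
small_counties = df[df['Population'] < 20000].copy()
small_counties['Interval'] = pd.cut(small_counties['Population'], bins=bins)
print(small_counties.groupby('Interval')['Swing_2012_2016'].mean())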
df_group = df[['County', 'State', 'Swing_2012_2016', 'Population', 'Income', 'GDP', 'College', 'White', 'Coastal']]
# loop through each interval, create one grouped object, and drop the other items in that interval
for interval in [1000, 2000, 3000, 4000, 5000, 10000, 15000, 20000]:
df_interval = df_group[df_group['Population'] < interval]
# get the mean for every data point in this interval
row = pd.DataFrame({'County':str(interval), 'State':'Interval',\
'Swing_2012_2016':df_interval['Swing_2012_2016'].mean(),\
'Population':df_interval['Population'].mean(), \
'Income':df_interval['Income'].mean(), \
'GDP':df_interval['GDP'].mean(), \
'College':df_interval['College'].mean(), \
'White':df_interval['White'].mean(), \
'Coastal':df_interval['Coastal'].mode()})
# drop the items in that interval, because they have all been grouped together in a new single row
df_group = df_group.drop(index=df_group[(df_group['Population'] < interval) & \
(df_group['State'] != 'Interval')].index.values)
df_group = df_group.append(row, ignore_index=True)
Now let's try this grouped data out with a new regression model. We'll do this the same as before.
LinearModel = sm.ols(formula='Swing_2012_2016 ~ GDP + College + White + Coastal', data=df_group).fit()
print(LinearModel.summary2())
This seems to have worked! Our Adjusted R2 value bumped up quite a bit, now topping 50%. This is all good news, but there is probably more to do. Let's check our diagnostic plots.
diagnostic_plots(LinearModel, 15, 3)
This improved our adjusted R2 value, but did not improve much else. Another way to address skewed data is to remove outliers that have a clear cause for being so extreme, separate from other factors. Utah counties swung dramatically towards the Democrats in 2016 because of unique Mormon distaste for Donald Trump. Because Utah is such a unique state due to its Mormon heritage and culture, we can safely remove Utah from our dataset without risking serious bias in the data.
It is in cases such as these where thorough industry knowledge can be very helpful in data science. Even those with top notch data skills may not know to do this without knowledge of American politics.
df_group = df_group[df_group['State'] != 'UT']
LinearModel = sm.ols(formula='Swing_2012_2016 ~ GDP + College + White + Coastal', data=df_group).fit()
print(LinearModel.summary2())
It seems like this bumped our adjusted R2 up further. This is good! Let's check some diagnostic plots to see what progress we've made. There is likely more to be done.
diagnostic_plots(LinearModel, 15, 3)
These diagnostic plots still aren't great. It seems that grouping small counties and dropping Utah did not reduce our data's skew to an acceptable level. With skew this large, a reasonable next step is to standardize the data.
Standardization rescales our swing data to a mean of 0 and a standard deviation of 1. This puts every county's swing on a common scale measured in standard deviations, which makes extreme values easier to handle and the resulting coefficients easier to interpret, even though it does not by itself change the shape of the distribution. We only need to do this to our swing data, because our independent variables are not as heavily skewed. Let's see if this improves our model.
# get the average and standard deviation of the data, then standardize
u = df_group['Swing_2012_2016'].mean()
o = df_group['Swing_2012_2016'].std()
df_group['Swing_2012_2016_std'] = (df_group['Swing_2012_2016'] - u) / o
It also seems that our GDP data is not fitting the model well. A look at the histogram of the GDP data shows serious skew, which can likely be reduced by a log transformation.
plt.hist(df_group['GDP'])
plt.title('GDP Histogram')
plt.xlabel('GDP Values')
plt.xticks([])
plt.show()
# perform a logarithm on the GDP data
df_group['GDP_log'] = np.log(df_group['GDP'])
LinearModel = sm.ols(formula='Swing_2012_2016_std ~ GDP_log + College + White + Coastal', data=df_group).fit()
print(LinearModel.summary2())
diagnostic_plots(LinearModel, 15, 3)
These diagnostic plots are not perfect, but they are a clear improvement and enough to say we have significantly reduced the skew in our data. We have also brought our model to a point where all of the remaining independent variables (income having been dropped) are significant. This is a big accomplishment! We can now finalize our complete model for this study.
This is going to be the final edition of our model, used for the rest of the study. We will draw all of our main conclusions from this model, including what factors may be contributing to the shifting electoral landscape in the United States. The purpose of this linear regression model is to explain the correlation between college attainment, demographics, and coastal status with a given county's swing towards either party from the 2012 to the 2016 presidential election.
For testing purposes, we will split our data into a training set for building the model (95% of the data) and testing set for determining how accurate our model is (5% of the data). This is a common practice in data science.
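For reference, scikit-learn provides an equivalent helper for this kind of split; the snippet below is only an illustration of the same idea on a dummy DataFrame, while the analysis itself uses pandas' sample() as shown next.
# equivalent 95/5 split using scikit-learn, shown on a small dummy DataFrame
from sklearn.model_selection import train_test_split
dummy = pd.DataFrame({'x': range(100), 'y': range(100)})
dummy_train, dummy_test = train_test_split(dummy, test_size=0.05, random_state=0)
print(len(dummy_train), len(dummy_test))  # 95 5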
df_model = df[df['State'] != 'UT'] # remove Utah data
# perform the same data transformation as before
df_model = df_model[['County', 'State', 'Swing_2012_2016', 'Population', 'Income', 'GDP', 'College', 'White', 'Coastal']]
for interval in [1000, 2000, 3000, 4000, 5000, 10000, 15000, 20000]:
df_interval = df_model[df_model['Population'] < interval]
row = pd.DataFrame({'County':str(interval), 'State':'Interval',\
'Swing_2012_2016':df_interval['Swing_2012_2016'].mean(),\
'Population':df_interval['Population'].mean(), \
'Income':df_interval['Income'].mean(), \
'GDP':df_interval['GDP'].mean(), \
'College':df_interval['College'].mean(), \
'White':df_interval['White'].mean(), \
'Coastal':df_interval['Coastal'].mode()})
df_model = df_model.drop(index=df_model[(df_model['Population'] < interval) & \
(df_model['State'] != 'Interval')].index.values)
df_model = df_model.append(row, ignore_index=True)
u = df_model['Swing_2012_2016'].mean()
o = df_model['Swing_2012_2016'].std()
df_model['Swing_2012_2016_std'] = (df_model['Swing_2012_2016'] - u) / o
df_model['GDP_log'] = np.log(df_model['GDP'])
df_train = df_model.sample(frac=0.95) # random sample of 95% of the data
df_test = df_model.drop(df_train.index) # the rest of the data not in the sample
FinalModel = sm.ols(formula='Swing_2012_2016_std ~ GDP_log + College + White + Coastal', data=df_train).fit()
print(FinalModel.summary2())
diagnostic_plots(FinalModel, 18, 5)
Our complete linear model concludes that, after adjusting for skew, GDP per capita, college attainment, the white share of the population, and coastal status are each correlated with the degree to which a US county swung towards one party or the other from the 2012 to the 2016 presidential election. The exact formula is as follows:
$\text{Standardized Vote Swing} = \beta_{0} + \beta_{1}\log(\text{GDP}) + \beta_{2}(\text{College}) + \beta_{3}(\text{White}) + \beta_{4}(\text{Coastal})$
Basic Statistical Conclusions
Important Caveats
Hypothesis testing is an important concept in data science and statistics, and we have already done some subtle hypothesis testing using p-values. The p-value for each independent variable is the probability of estimating a coefficient at least as extreme as the one observed under the assumption that the true coefficient is 0. Because these probabilities are far below 5% for our variables, we can be confident that all of our coefficients are significantly different from 0. In other words, it is extremely unlikely that our model would have produced these coefficients if the true correlations were 0, which means a statistically significant correlation exists for each of our independent variables.
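As a concrete check, the fitted statsmodels results object exposes these p-values directly, so the comparison against the 5% significance level can be done programmatically:
# p-value for each term in the final model; values below 0.05 are significant at the 5% level
print(FinalModel.pvalues)
print(FinalModel.pvalues < 0.05)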
To complete our model implementation, we're going to do some basic machine learning and compare its results against our model using the testing dataset. Machine learning is a very complex concept, and this is meant only as a basic introduction.
We will start by setting up a simple gradient descent regression. Gradient descent is an iterative optimization technique for estimating model parameters: it searches for the combination of coefficients that minimizes the loss, here the squared residuals. The algorithm passes over the dataset repeatedly, adjusting the coefficients a little each time. Each coefficient is moved in the direction opposite the slope of the loss function at the current point, so that the loss decreases with every step. Let's give it a try.
Gradient descent requires choosing a proper learning rate (the rate at which the coefficients are adjusted on each iteration). Our learning rate will be quite small to account for our large independent variable values.
More information on this topic can be found here.
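To make the update rule concrete, here is a minimal hand-rolled batch gradient descent for linear regression. This is an illustrative sketch with arbitrary default settings, separate from the scikit-learn SGDRegressor used below.
# minimal batch gradient descent for linear regression (illustrative sketch)
def batch_gradient_descent(X_demo, y_demo, learning_rate=1e-6, iterations=1000):
    theta = np.zeros(X_demo.shape[1])  # start all coefficients at zero
    n = len(y_demo)
    for _ in range(iterations):
        residuals = X_demo.dot(theta) - y_demo        # current prediction errors
        gradient = (2 / n) * X_demo.T.dot(residuals)  # slope of the mean squared error loss
        theta = theta - learning_rate * gradient      # step against the gradient
    return theta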
y = df_test['Swing_2012_2016_std'].to_numpy()
X = statsmodels.api.add_constant(df_test[['GDP_log', 'College', 'White', 'Coastal']].to_numpy())
# develop gradient descent model using a learning rate of 0.000000001
gradient_regressor = sklearn.linear_model.SGDRegressor(eta0=0.000000001).fit(X, y)
# capture coefficients calculated by gradient descent and our model
theta_gradient = gradient_regressor.coef_
theta_model = np.array(FinalModel.params)
Now that we have estimated new coefficients using gradient descent, we can test them against our model's coefficients. This can be done by gathering all results into a new DataFrame and comparing the accuracy of gradient descent's estimations against our model's.
df_results = df_test[['County', 'State', 'Swing_2012_2016_std']].copy().reset_index(drop=True)
df_results.columns = ['County', 'State', 'Real Swing']
# we will unstandardize the data so it makes more sense in the real world
df_results['Real Swing'] = (df_results['Real Swing'] * o) + u
df_results['Model Prediction'] = (pd.Series([theta_model.dot(row) for row in X]) * o) + u
df_results['Gradient Prediction'] = (pd.Series([theta_gradient.dot(row) for row in X]) * o) + u
# now calculate how far off both estimations were for each county in our testing set
df_results['Model Miss'] = abs(df_results['Model Prediction'] - df_results['Real Swing'])
df_results['Gradient Miss'] = abs(df_results['Gradient Prediction'] - df_results['Real Swing'])
df_results.head()
print("Model Miss: ", df_results['Model Miss'].mean())
print("Gradient Miss:", df_results['Gradient Miss'].mean())
As we can see, both our regression model and the gradient descent algorithm produced fairly accurate results, missing the true swing value by only a few points each. It seems that our model is slightly more precise than the machine learning algorithm, though this is not conclusive and an adjustment to the learning rate and iteration count in gradient descent may change this. Overall, gradient descent appears to be a valuable resource when standard Python regression libraries are not available.
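For instance, with scikit-learn's SGDRegressor the learning rate and iteration count mentioned above correspond to the eta0 and max_iter parameters; the values below are arbitrary examples rather than tuned settings.
# refit gradient descent with a different (arbitrary) learning rate and iteration cap
tuned_regressor = sklearn.linear_model.SGDRegressor(eta0=0.00000001, max_iter=5000, \
                                                    learning_rate='constant').fit(X, y)
print(tuned_regressor.coef_)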
We can use machine learning, specifically gradient descent, to test another one of our hypotheses: that the correlation between vote swing and education that we've identified in this study is a fairly recent phenomenon. This is where our data from 1992-1996 comes in handy. We will now run gradient descent on both the 2012-2016 election data and the 1992-1996 election data to see how each performs. If our hypothesis is correct, the 1992-1996 model will perform far worse because there is no correlation to be found. Let's get started.
df_compare = df[['County', 'State', 'Swing_2012_2016', 'Swing_1992_1996', 'College', 'College_1990']]\
[df['State'] != 'LA'].copy()
u1 = df_compare['Swing_1992_1996'].mean()
o1 = df_compare['Swing_1992_1996'].std()
df_compare['Swing_2012_2016_std'] = (df_compare['Swing_2012_2016'] - u) / o
df_compare['Swing_1992_1996_std'] = (df_compare['Swing_1992_1996'] - u1) / o1
y_2012_2016 = df_compare['Swing_2012_2016_std'].to_numpy()
y_1992_1996 = df_compare['Swing_1992_1996_std']
X_2012_2016 = statsmodels.api.add_constant(df_compare[['College']].to_numpy())
X_1992_1996 = statsmodels.api.add_constant(df_compare[['College_1990']].to_numpy())
# perform gradient descent on both sets of data and return the results
gradient_descent_2012_2016 = sklearn.linear_model.SGDRegressor(eta0=0.000000001).fit(X_2012_2016, y_2012_2016)
gradient_descent_1992_1996 = sklearn.linear_model.SGDRegressor(eta0=0.000000001).fit(X_1992_1996, y_1992_1996)
theta_2012_2016 = gradient_descent_2012_2016.coef_
theta_1992_1996 = gradient_descent_1992_1996.coef_
df_results = df_compare[['County', 'State', 'Swing_2012_2016_std', 'Swing_1992_1996_std']].copy()
df_results.columns = ['County', 'State', 'Real Swing 2012-2016', 'Real Swing 1992-1996']
# we will unstandardize the data so it makes more sense in the real world
df_results['Real Swing 2012-2016'] = (df_results['Real Swing 2012-2016'] * o) + u
df_results['Gradient Descent 2012-2016'] = \
(pd.Series([np.transpose(theta_2012_2016).dot(row) for row in X_2012_2016]) * o) + u
df_results['Real Swing 1992-1996'] = (df_results['Real Swing 1992-1996'] * o1) + u1
df_results['Gradient Descent 1992-1996'] = \
(pd.Series([np.transpose(theta_1992_1996).dot(row) for row in X_1992_1996]) * o1) + u1
# now calculate how far off both estimations were for each county in our testing set
df_results['2012-2016 Miss'] = abs(df_results['Gradient Descent 2012-2016'] - df_results['Real Swing 2012-2016'])
df_results['1992-1996 Miss'] = abs(df_results['Gradient Descent 1992-1996'] - df_results['Real Swing 1992-1996'])
df_results.head()
print("2012-2016 Miss: ", df_results['2012-2016 Miss'].mean())
print("1992-1996 Miss: ", df_results['1992-1996 Miss'].mean())
This is not entirely what was expected. It seems that the 1992-1996 model is actually slightly more accurate, though the difference is marginal. Let's check the education coefficients for each to gain some insights.
print('2012-2016 Education Coefficient: ', theta_2012_2016[1])
print('1992-1996 Education Coefficient: ', theta_1992_1996[1])
So while the gradient descent algorithm is just as accurate on the 1990s data, it produces a much smaller education coefficient (about half the size). This suggests that education has become much more correlated with vote patterns over time, even if it did have some correlation back in the 90s. It also reinforces just how effective gradient descent can be at finding proper coefficients for a set of data.
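For readers who want to see what gradient descent is doing under the hood, here is a bare-bones batch version for simple linear regression written with plain NumPy on made-up data; it is a sketch of the idea, not the exact routine used in this study.
import numpy as np
def batch_gradient_descent(X, y, learning_rate=0.01, iterations=5000):
    # X is an (n, k) design matrix that already includes a constant column
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(iterations):
        # gradient of the mean squared error with respect to theta
        gradient = (2 / n) * X.T.dot(X.dot(theta) - y)
        theta -= learning_rate * gradient
    return theta
# tiny usage example with made-up, roughly standardized data
rng = np.random.default_rng(1)
college_demo = rng.normal(size=200)
swing_demo = 0.5 * college_demo + rng.normal(scale=0.8, size=200)
X_demo = np.column_stack([np.ones(200), college_demo])
print(batch_gradient_descent(X_demo, swing_demo))  # converges towards [intercept, slope]
With enough iterations and a sensible learning rate, the returned coefficients approach the same values ordinary least squares would give.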
It is important in data science to avoid dismissing data that does not fit your priors. This machine learning test did not result in the output I expected, but it is important to include it in the study and adjust my view of the data accordingly.
Finally, let's see what our linear regression library says about how college education's correlation with voting patterns has changed since the 90s. We will create two simple linear regression models, each regressing vote swing on college attainment alone.
Model_2012_2016 = sm.ols(formula='Swing_2012_2016_std ~ College', data=df_compare).fit()
print("2012-2016 Model", Model_2012_2016.summary2(), end='\n\n\n\n')
Model_1992_1996 = sm.ols(formula='Swing_1992_1996_std ~ College_1990', data=df_compare).fit()
print("1992-1996 Model", Model_1992_1996.summary2())
It seems from this that a correlation between education and vote swing did exist in 1992-1996, but it was much weaker than it is now. According to the adjusted R^2 value in each model, education explains about 27% of the variation in vote swing today but could explain only about 0.4% of it in the 90s. This suggests that education has increasingly become a dominant factor in the American political landscape.
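As a quick sanity check, and assuming the two fitted model objects above are still in scope, the adjusted R^2 figures quoted here can be read directly off the results:
print('2012-2016 adjusted R^2:', round(Model_2012_2016.rsquared_adj, 4))
print('1992-1996 adjusted R^2:', round(Model_1992_1996.rsquared_adj, 4))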
Let's test this hypothesis by determining whether the coefficient for one of our models is significantly different from the other's. We can do this by appending the 1992-1996 data onto the end of the 2012-2016 data and adding a binary variable indicating whether it is from 2012-2016 or from 1992-1996. We can then create another model with this data using an interaction term, multiplying the college data with the binary indicator data. If this interaction is significant, it will tell us that using 1992-1996 data does significantly change the coefficient of correlation.
Null Hypothesis: the correlation between education and vote swing is not significantly different between the 1992-1996 data and the 2012-2016 data
Alternative Hypothesis: the correlation is significantly different between the two periods
We are able to reject our null hypothesis and conclude a difference in correlation if the interaction term is significant in our model.
df_1 = df_compare[['County', 'State', 'Swing_2012_2016', 'College']].copy()
df_2 = df_compare[['County', 'State', 'Swing_1992_1996', 'College_1990']].copy()
df_1.columns = ['County', 'State', 'Swing', 'College']
df_2.columns = ['County', 'State', 'Swing', 'College']
# 'Modern' denotes whether the data was from 2012-2016 or not
df_1.loc[:, 'Modern'] = 1
df_2.loc[:, 'Modern'] = 0
df_hypothesis = df_1.append(df_2, ignore_index=True)
# the 'College * Modern' term in the model formula below generates this interaction automatically;
# the explicit column is kept here for reference
df_hypothesis['Interaction'] = df_hypothesis['College'] * df_hypothesis['Modern']
Hypothesis_Model = sm.ols(formula='Swing ~ College * Modern', data=df_hypothesis).fit()
print(Hypothesis_Model.summary2())
Our interaction term is significant! This confirms a significant difference in correlation between the two sets of data.
We can see from this model that college education is correlated with voters swinging towards the Democratic Party, and through our interaction term it becomes clear that the 2012-2016 data shows a much stronger correlation. The interaction term is significant, and its 95% confidence interval lies entirely above zero, so we can be 95% confident that the 2012-2016 data shows an increase in correlation. For 1992-1996 (interaction = 0), the education coefficient is 0.0817; for 2012-2016 (interaction = 1), it is 0.5739. That is a substantial increase over 20 years.
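Those two era-specific slopes do not have to be read off the summary table by hand. Assuming the Hypothesis_Model object above, they can be recovered from the fitted parameters (with the formula interface, statsmodels names the interaction term 'College:Modern'):
# education slope in each era, derived from the interaction model
slope_1990s = Hypothesis_Model.params['College']
slope_modern = slope_1990s + Hypothesis_Model.params['College:Modern']
# 95% confidence interval for the change in slope between the two eras
low, high = Hypothesis_Model.conf_int().loc['College:Modern']
print('1992-1996 education slope: ', round(slope_1990s, 4))
print('2012-2016 education slope: ', round(slope_modern, 4))
print('95% CI for the change in slope:', (round(low, 4), round(high, 4)))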
Comparing our two education models and using this interaction term model, we can come to the conclusion that education has become significantly more correlated with vote swings in recent years.
In this section we will move into data visualizations, an integral part of the data science lifecycle. Without visualizations, data can be opaque to all but those experienced in the field; they are essential for communicating insights to audiences with less expertise.
figure, axes = plt.subplots(2,2)
figure.suptitle('How Different Factors Correlate with GDP', fontsize=20)
figure.set_size_inches(15, 15)
# reduce the range of data for visibility
df[(df['Income'] < 80000) & (df['GDP'] < 80000)].sample(frac=0.25).plot.scatter('Income', 'GDP', ax=axes[0][0])
axes[0][0].set_title('Income')
df[df['GDP'] < 100000].sample(frac=0.25).plot.scatter('College', 'GDP', ax=axes[1][0])
axes[1][0].set_title('College Attainment')
df[df['GDP'] < 100000].sample(frac=0.25).plot.scatter('White', 'GDP', ax=axes[0][1])
axes[0][1].set_title('White Population')
axes[0][1].set_xlabel('White Population %')
axes[0][1].set_ylabel('')
coastal_group = df[['Coastal', 'GDP']].groupby('Coastal', as_index=False).mean().sort_values(by='Coastal', ascending=False)
coastal_group['Coastal'] = ['Coastal', 'Not Coastal']
barlist = coastal_group.plot('Coastal', 'GDP', kind='bar', ax=axes[1][1])
axes[1][1].get_children()[1].set_color('green')
axes[1][1].set_title('Coastal Status')
plt.xticks(rotation=0)
plt.show()
labels = ['Education', 'Other Factors']
# use the adjusted R^2 value from each model, matching the figures quoted above
r_2016 = round(Model_2012_2016.rsquared_adj * 100, 2)
r_1996 = round(Model_1992_1996.rsquared_adj * 100, 2)
figure, axes = plt.subplots(1,2)
figure.suptitle('The Increasing Role of Education in Presidential Vote Swings', fontsize=20)
figure.set_size_inches(15, 8)
# show the adjusted R^2 values for each model
axes[0].pie([r_1996, 100-r_1996], labels=labels, colors=['red', 'grey'])
axes[0].set_title('1992-1996 Vote Swing Correlations')
axes[1].pie([r_2016, 100-r_2016], labels=labels, colors=['red', 'grey'])
axes[1].set_title('2012-2016 Vote Swing Correlations')
plt.show()
figure, ax = plt.subplots(1)
figure.set_size_inches(15,8)
plt.title('Correlation of Education with Vote Swing')
plt.xlabel('College Attainment Rate')
plt.ylabel('Swing Towards the Democrats')
data1 = df_hypothesis[df_hypothesis['Modern'] == 1].sample(n=300)
data2 = df_hypothesis[df_hypothesis['Modern'] == 0].sample(n=300)
# plot the correlations together
data1.plot.scatter('College', 'Swing', ax=ax)
m,b = np.polyfit(data1['College'], data1['Swing'], 1)
ax.plot(data1['College'], m*data1['College'] + b, linewidth=4)
data2.plot.scatter('College', 'Swing', ax=ax, color='red', marker="^")
m,b = np.polyfit(data2['College'], data2['Swing'], 1)
ax.plot(data2['College'], m*data2['College'] + b, color='red', linewidth=4)
plt.legend(['2012-2016 Data', '1992-1996 Data'])
plt.show()
Here we will attempt to measure how the correlation between education and vote swing has changed over time. To do this, we will gather election swing data spanning 1992 through 2020 and fit a separate model for each swing.
elections_df2 = elections_df.copy()
# gather election swing data from other years, and use it to show how the role of education has changed
# in vote swings
for year1, year2 in [(2000, 2004), (2008, 2012), (2016, 2020)]:
    df_temp = election_swings(election_results(atlas_states[0], year1), election_results(atlas_states[0], year2))
    df_temp.loc[:, 'State'] = abbs[atlas_states[0]]
    for state in atlas_states[1:]:
        new_df = election_swings(election_results(state, year1), election_results(state, year2))
        new_df.loc[:, 'State'] = abbs[state]
        df_temp = df_temp.append(new_df, ignore_index=True)
    df_temp = df_temp[[df_temp.columns[0]] + [df_temp.columns[-1]] + list(df_temp.columns[1:-1])]
    elections_df2 = elections_df2.merge(df_temp, on=['County', 'State'], how='inner')
elections_df2 = elections_df2.merge(education_df, on=['County', 'State'], how='inner')
# create a model for each swing
Model_1992_1996 = sm.ols(formula='Swing_1992_1996 ~ College_1990', data=elections_df2).fit()
Model_2000_2004 = sm.ols(formula='Swing_2000_2004 ~ College_2000', data=elections_df2).fit()
Model_2008_2012 = sm.ols(formula='Swing_2008_2012 ~ College', data=elections_df2).fit()
Model_2012_2016 = sm.ols(formula='Swing_2012_2016 ~ College', data=elections_df2).fit()
Model_2016_2020 = sm.ols(formula='Swing_2016_2020 ~ College', data=elections_df2).fit()
# plot the education coefficient over time
figure, ax = plt.subplots(1)
figure.set_size_inches(12,6)
plt.xlabel('Election Years')
plt.ylabel('Education and Vote Swing Correlation')
plt.title("Education's Correlation with Vote Swing Over Time")
x = ['1992-1996', '2000-2004', '2008-2012', '2012-2016', '2016-2020']
y = [Model_1992_1996.params[1], Model_2000_2004.params[1], \
Model_2008_2012.params[1], Model_2012_2016.params[1], Model_2016_2020.params[1]]
ax.plot(x, y)
plt.xticks(rotation=45)
plt.show()
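One optional refinement, sketched below on the assumption that the five fitted models above are still in scope, is to shade the 95% confidence interval around each education coefficient so the uncertainty of the trend is visible:
# plot the education coefficient over time with a shaded 95% confidence band
models = [Model_1992_1996, Model_2000_2004, Model_2008_2012, Model_2012_2016, Model_2016_2020]
lower = [m.conf_int().iloc[1, 0] for m in models]   # lower bound of each education coefficient
upper = [m.conf_int().iloc[1, 1] for m in models]   # upper bound of each education coefficient
pos = range(len(x))
figure, ax = plt.subplots(1)
figure.set_size_inches(12, 6)
ax.plot(pos, y)
ax.fill_between(pos, lower, upper, alpha=0.2)
plt.xticks(pos, x, rotation=45)
plt.xlabel('Election Years')
plt.ylabel('Education and Vote Swing Correlation')
plt.title("Education's Correlation with Vote Swing Over Time (with 95% Confidence Intervals)")
plt.show()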
df_newengland = df[df['State'].isin(['ME', 'NH', 'VT', 'MA', 'RI', 'CT'])]
df_south = df[df['State'].isin(['AR', 'TN', 'MS', 'AL', 'GA', 'SC', 'NC'])]
df_southwest = df[df['State'].isin(['TX', 'CO', 'NM', 'AZ', 'NV'])]
df_west = df[df['State'].isin(['WA', 'OR', 'CA'])]
df_midwest = df[df['State'].isin(['MI', 'WI', 'MN', 'OH', 'IN', 'IL', 'IA'])]
df_mountainwest = df[df['State'].isin(['MT', 'WY', 'ID', 'ND', 'SD'])]
# this is the same as the 6 plots from the exploratory section, but using GDP and race instead
figure, axes = plt.subplots(3,2)
figure.set_size_inches(15, 20)
figure.suptitle('GDP and Vote Shift: 2012-2016', fontsize=20)
df_newengland.plot.scatter('GDP', 'Swing_2012_2016', title='New England', ax=axes[0][0])
m,b = np.polyfit(df_newengland['GDP'], df_newengland['Swing_2012_2016'], 1)
axes[0][0].plot(df_newengland['GDP'], m*df_newengland['GDP'] + b, color='red')
axes[0][0].set_ylabel('Swing Towards Democrats from 2012-2016')
df_south.plot.scatter('GDP', 'Swing_2012_2016', title='The Deep South', ax=axes[0][1])
m,b = np.polyfit(df_south['GDP'], df_south['Swing_2012_2016'], 1)
axes[0][1].plot(df_south['GDP'], m*df_south['GDP'] + b, color='red')
axes[0][1].set_ylabel('Swing Towards Democrats from 2012-2016')
axes[0][1].set_xlim([0, 100000])
df_southwest.plot.scatter('GDP', 'Swing_2012_2016', title='The Southwest', ax=axes[1][0])
m,b = np.polyfit(df_southwest['GDP'], df_southwest['Swing_2012_2016'], 1)
axes[1][0].plot(df_southwest['GDP'], m*df_southwest['GDP'] + b, color='red')
axes[1][0].set_ylabel('Swing Towards Democrats from 2012-2016')
axes[1][0].set_xlim([0, 150000])
df_west.plot.scatter('GDP', 'Swing_2012_2016', title='West Coast', ax=axes[1][1])
m,b = np.polyfit(df_west['GDP'], df_west['Swing_2012_2016'], 1)
axes[1][1].plot(df_west['GDP'], m*df_west['GDP'] + b, color='red')
axes[1][1].set_ylabel('Swing Towards Democrats from 2012-2016')
axes[1][1].set_xlim([0, 200000])
df_midwest.plot.scatter('GDP', 'Swing_2012_2016', title='The Midwest', ax=axes[2][0])
m,b = np.polyfit(df_midwest['GDP'], df_midwest['Swing_2012_2016'], 1)
axes[2][0].plot(df_midwest['GDP'], m*df_midwest['GDP'] + b, color='red')
axes[2][0].set_ylabel('Swing Towards Democrats from 2012-2016')
df_mountainwest.plot.scatter('GDP', 'Swing_2012_2016', title='Mountain West', ax=axes[2][1])
m,b = np.polyfit(df_mountainwest['GDP'], df_mountainwest['Swing_2012_2016'], 1)
axes[2][1].plot(df_mountainwest['GDP'], m*df_mountainwest['GDP'] + b, color='red')
axes[2][1].set_ylabel('Swing Towards Democrats from 2012-2016')
axes[2][1].set_xlim([0, 150000])
plt.show()
figure, axes = plt.subplots(3,2)
figure.set_size_inches(15, 20)
figure.suptitle('White Population and Vote Shift: 2012-2016', fontsize=20)
df_newengland.plot.scatter('White', 'Swing_2012_2016', title='New England', ax=axes[0][0])
m,b = np.polyfit(df_newengland['White'], df_newengland['Swing_2012_2016'], 1)
axes[0][0].plot(df_newengland['White'], m*df_newengland['White'] + b, color='green')
axes[0][0].set_xlabel('White Population %')
axes[0][0].set_ylabel('Swing Towards Democrats from 2012-2016')
df_south.plot.scatter('White', 'Swing_2012_2016', title='The Deep South', ax=axes[0][1])
m,b = np.polyfit(df_south['White'], df_south['Swing_2012_2016'], 1)
axes[0][1].plot(df_south['White'], m*df_south['White'] + b, color='green')
axes[0][1].set_xlabel('White Population %')
axes[0][1].set_ylabel('Swing Towards Democrats from 2012-2016')
df_southwest.plot.scatter('White', 'Swing_2012_2016', title='The Southwest', ax=axes[1][0])
m,b = np.polyfit(df_southwest['White'], df_southwest['Swing_2012_2016'], 1)
axes[1][0].plot(df_southwest['White'], m*df_southwest['White'] + b, color='green')
axes[1][0].set_xlabel('White Population %')
axes[1][0].set_ylabel('Swing Towards Democrats from 2012-2016')
df_west.plot.scatter('White', 'Swing_2012_2016', title='West Coast', ax=axes[1][1])
m,b = np.polyfit(df_west['White'], df_west['Swing_2012_2016'], 1)
axes[1][1].plot(df_west['White'], m*df_west['White'] + b, color='green')
axes[1][1].set_xlabel('White Population %')
axes[1][1].set_ylabel('Swing Towards Democrats from 2012-2016')
df_midwest.plot.scatter('White', 'Swing_2012_2016', title='The Midwest', ax=axes[2][0])
m,b = np.polyfit(df_midwest['White'], df_midwest['Swing_2012_2016'], 1)
axes[2][0].plot(df_midwest['White'], m*df_midwest['White'] + b, color='green')
axes[2][0].set_xlabel('White Population %')
axes[2][0].set_ylabel('Swing Towards Democrats from 2012-2016')
axes[2][0].set_xlim([60,100])
df_mountainwest.plot.scatter('White', 'Swing_2012_2016', title='Mountain West', ax=axes[2][1])
m,b = np.polyfit(df_mountainwest['White'], df_mountainwest['Swing_2012_2016'], 1)
axes[2][1].plot(df_mountainwest['White'], m*df_mountainwest['White'] + b, color='green')
axes[2][1].set_xlabel('White Population %')
axes[2][1].set_ylabel('Swing Towards Democrats from 2012-2016')
axes[2][1].set_xlim([60,100])
plt.show()
An interactive Folium map can be used to highlight the swings in a selection of counties. Clicking a marker shows that county's data. The map underscores that more educated, more diverse counties are moving in one direction while less educated, whiter counties are moving in the other.
map_osm = folium.Map(location=[40.8, -74.3118], zoom_start=6.5)
# go through some selected counties
for county, state, lat, long in [("Prince George's", "MD", 38.7849, -76.8721),
                                 ("Montgomery", "MD", 39.1547, -77.2405),
                                 ("Somerset", "MD", 38.0862, -75.8534),
                                 ("Anne Arundel", "MD", 38.9530, -76.5488),
                                 ("Salem", "NJ", 39.5849, -75.3879),
                                 ("Morris", "NJ", 40.8336, -74.5463),
                                 ("Luzerne", "PA", 41.1404, -75.9928),
                                 ("Westchester", "NY", 41.1220, -73.7949),
                                 ("Fairfield", "CT", 41.1408, -73.2613),
                                 ("Windham", "CT", 41.8276, -72.0468),
                                 ("Norfolk", "MA", 42.1767, -71.1449),
                                 ("Berkshire", "MA", 42.3118, -73.1822),
                                 ("Sullivan", "NY", 41.6897, -74.7805),
                                 ("Washington", "NY", 43.2519, -73.3709)]:
    df_section = df[(df['County'] == county) & (df['State'] == state)]
    # label the marker light blue or light red depending on how it swung in the election
    i = df_section.index.values[0]
    swing = df_section['Swing_2012_2016'][i]
    if swing > 0:
        c = 'lightblue'
    else:
        c = 'pink'
    # use county information as the marker popup
    info = county.upper() + " COUNTY" + \
           " -------- COLLEGE ATTAINMENT: " + str(round(df_section['College'][i], 2)) + "%" + \
           " -------- WHITE POPULATION: " + str(round(df_section['White'][i], 2)) + "%" + \
           " -------- 2012-2016 SWING: " + str(round(abs(swing), 2)) + " points " + \
           ("Democratic" if swing > 0 else "Republican")
    folium.Marker([lat, long], popup=info, icon=folium.Icon(color=c)).add_to(map_osm)
This map is an interactive selection of 14 counties. Many of them differ sharply from one another in education levels and demographics. Click through these counties, which range from Maryland up to upstate New York, and see whether you can spot patterns between these factors and each county's 2012-2016 vote swing.
map_osm
Congratulations! We have completed our data analysis for this study. Now it is time to make some conclusions.
These conclusions lead us to believe that a county's education level and demographics, combined with its 2012 vote, should be enough to confidently predict its 2016 vote. Let's try it out!
df_final = df[['County', 'State', 'Romney_2012_%', 'Trump_2016_%', 'GDP', 'College', 'White']].copy()
df_final.columns = ['County', 'State', 'Romney', 'Trump', 'GDP', 'College', 'White']
Prediction_Model = sm.ols(formula='Trump ~ Romney + College + White', data=df_final).fit()
print(Prediction_Model.summary2())
Even though the data is heavily skewed, education, demographics, and prior voting explain more than 95% of a county's vote pattern! This is a helpful insight going forward when studying electoral politics.
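A number like that invites an overfitting check. The sketch below, assuming the df_final frame from above, refits the same formula on a random 80% of counties and measures how far its predictions land from the actual Trump share in the held-out 20%:
# quick out-of-sample check of the prediction model
train = df_final.sample(frac=0.8, random_state=0)
test = df_final.drop(train.index)
check_model = sm.ols(formula='Trump ~ Romney + College + White', data=train).fit()
predictions = check_model.predict(test)
print('Mean miss on held-out counties:', round(abs(predictions - test['Trump']).mean(), 2), 'points')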
After collecting and cleaning our data and doing some exploratory analysis, we found that several data transformations were required. Vote swing and GDP data were heavily skewed, and adjusting for this greatly improved our models.
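As a generic illustration of that kind of adjustment (not necessarily the exact transformation applied earlier in the study), a log transform is one common way to tame a right-skewed variable such as GDP:
# county GDP values are positive here, so a plain log transform can be applied
print('GDP skewness before log transform:', round(df['GDP'].skew(), 2))
print('GDP skewness after log transform: ', round(np.log(df['GDP']).skew(), 2))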
Through linear regression analysis, visualization, hypothesis testing, and machine learning, we determined that college attainment is becoming the most important factor correlated with a party's success or failure in any given place in the United States. Demographics, GDP, and coastal status are meaningful factors as well.
Our conclusions can be used to further understand America's increasing political divide. The coming party coalitions appear likely to be built on differences in demographics and education, less so on income alone. It is important for politicians, political observers, data scientists, pundits, and ordinary Americans to understand what is driving this monumental change, because little can be accomplished without that understanding.
When it comes to this complicated topic, our study is just the beginning. The linear model showed that the data we collected only explains about 60% of the vote shift from 2012 to 2016. There are many other factors contributing to the political divide (though education may be the most important), and it should be up to political data scientists to dive deeper and get a more thorough understanding. It is also worth noting that American politics is never static, and factors that may be significant now may not be significant in the future. It is the responsibility of all data scientists to keep pursuing this issue as elections go by.
Leip, David. Dave Leip's Atlas of U.S. Presidential Elections. http://uselectionatlas.org (18 December 2020).
https://www.bea.gov/data/income-saving/personal-income-county-metro-and-other-areas
https://ssti.org/blog/useful-stats-gdp-capita-county-2012-2015
https://data.ers.usda.gov/reports.aspx?ID=17829
https://www.census.gov/data/tables/time-series/demo/popest/2010s-counties-detail.html
https://www2.census.gov/library/stories/2018/08/coastline-counties-list.xlsx
https://www.nytimes.com/2016/11/10/us/politics/donald-trump-voters.html
https://www.theguardian.com/commentisfree/2020/nov/11/joe-biden-voters-republicans-trump
https://www.washingtonpost.com/politics/2018/11/23/mississippis-special-election-is-taking-place-one-most-racially-polarized-states-country/
https://www.nytimes.com/elections/2016/results/louisiana
https://time.com/4397192/donald-trump-utah-gary-johnson/
https://www.geeksforgeeks.org/python-pandas-dataframe/
https://www.getlore.io/knowledgecenter/data-standardization#:~:text=Data%20Standardization%20is%20a%20data,it's%20loaded%20into%20target%20systems.
https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/
https://statisticsbyjim.com/regression/comparing-regression-lines/
https://builtin.com/data-science/gradient-descent