
Predicting Voting Behavior

Abstract

The 2016 US Presidential Election came as a shock to many when Donald Trump won the presidency, a result that even skilled statistician Nate Silver predicted incorrectly. Predicting election outcomes is inherently challenging because many different factors can influence why a voter chooses, or doesn't choose, a particular candidate. This project recreates some of the machine learning methods Nate Silver used in 2016, applying them to the actual election data to determine how accurately they predict the final results. The methods we used were Principal Component Analysis, Hierarchical Clustering, Decision Trees, Logistic Regression, and Lasso Regularization. In addition, we applied other classification methods such as K-Nearest Neighbors and Random Forest, and explored the possibility of Simpson's Paradox in the dataset used for these algorithms.

In our decision tree, the key variables with the largest impact on predicting the winner were transit, county total, white, and unemployment. For logistic regression, the key variables were citizen, income per cap, professional, service, production, drive, employed, unemployment, and county total. Judging by the records table, logistic regression was the best algorithm, with the lowest test error of 0.0634 (6.34%). However, the ROC curve and the area under the curve (AUC) suggest that the lasso logistic regression (AUC = 0.9488) performs slightly better than the plain logistic regression model (AUC = 0.9482). Because AUC is the more informative metric for evaluating a classification model's performance, and because lasso regularization resolves the perfect-separation problem we encountered, we prefer the lasso logistic regression for classifying the results of the 2016 US Presidential Election.

Method and Analysis

Exploratory tools such as numerical summaries and principal component analysis were used to identify interesting patterns in the data and relationships among the variables. Principal component analysis (PCA) in particular served as a dimension-reduction step, extracting informative features and removing unwanted predictors from the dataset.
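The PCA step can be sketched as follows. This is a minimal illustration using scikit-learn rather than the project's actual code; the input file name and the exact preprocessing are assumptions.

```python
# A minimal sketch of the PCA step, assuming scikit-learn; the file name
# and column handling are illustrative, not the project's actual code.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

census = pd.read_csv("census.csv")                    # hypothetical input file
X = census.select_dtypes(include="number").dropna()   # numeric predictors only

# Standardize first so variables on different scales (percentages, dollar
# incomes, raw counts) contribute comparably to the components.
X_std = StandardScaler().fit_transform(X)

pca = PCA()
scores = pca.fit_transform(X_std)

# Proportion of variance explained by each component, used to decide how
# many components to retain for dimension reduction.
print(pca.explained_variance_ratio_[:5])
```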

The machine learning models used in this project are hierarchical clustering, decision trees, logistic regression, and lasso logistic regression. Hierarchical clustering, an unsupervised learning algorithm, was used to identify subgroups in the data and to understand what constituted a grouping across states in the 2016 election. The decision tree, logistic regression, and lasso logistic regression are supervised models, fit to identify important variables and predict the winning candidate in each county and state. A sketch of this modeling suite follows.
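The sketch below uses scipy and scikit-learn as stand-ins for the project's actual code; the data arrays and parameter choices are placeholders, not the project's settings.

```python
# A minimal sketch of the modeling suite described above. `X` and `y` are
# synthetic placeholders; in the project, X holds county-level census
# predictors and y indicates the winning candidate per county.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                        # placeholder features
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)  # placeholder labels

# Unsupervised: complete-linkage hierarchical clustering, cut into a small
# number of groups to look for subgroups of similar counties/states.
Z = linkage(X, method="complete")
groups = fcluster(Z, t=4, criterion="maxclust")

# Supervised: a depth-limited decision tree, an unpenalized logistic
# regression (penalty=None requires scikit-learn >= 1.2), and an
# L1-penalized ("lasso") logistic regression.
tree = DecisionTreeClassifier(max_depth=5).fit(X, y)
logit = LogisticRegression(penalty=None, max_iter=1000).fit(X, y)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
```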


Results

In this project, we applied several machine learning algorithms to election and census data with the goal of finding the method that best classifies the actual results. From the decision tree model, key factors that may influence the election include transit, county total, white, and unemployment. Other important predictors identified by the logistic regression model were citizen, income per cap, professional, service, production, drive, employed, and private work; among these, service and professional have the greatest impact on a candidate's probability of winning.

Using cross-validation to choose the best model, we found logistic regression to have the lowest test error, 0.0634 (6.34%), compared to 0.0650 (6.50%) for the decision tree and 0.0696 (6.96%) for the lasso logistic regression. However, the logistic regression suffered from perfect separation, which may have arisen from overfitting. Regularization corrects this issue and relates to the bias-variance tradeoff: by shrinking the coefficients toward zero, it reduces the variance of the model.

Using the ROC curve to compute the area under the curve (AUC), where values closer to 1 are better, we found the lasso logistic regression (0.9488) to perform slightly better than the logistic regression model (0.9482). Because AUC is the more informative metric for evaluating a classification model's performance, and because the lasso resolves the perfect-separation problem, we prefer the lasso logistic regression for classifying the results of the 2016 US Presidential Election. Since this model works well with our data, we may infer that our assumption that the election data are "sparse" holds.
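The comparison between the two logistic models can be sketched as below. The data here are synthetic placeholders; the test errors and AUCs quoted above come from the project's election data, not from this code.

```python
# A minimal, self-contained sketch of the model comparison: test error and
# AUC for plain vs. L1-penalized ("lasso") logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))                                  # placeholder data
y = (X[:, 0] - X[:, 1] + rng.normal(size=400) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    # Unpenalized logistic regression can fail under perfect separation;
    # penalty=None requires scikit-learn >= 1.2.
    "logistic": LogisticRegression(penalty=None, max_iter=1000),
    # The L1 penalty shrinks coefficients toward zero, restoring a
    # well-defined fit and performing variable selection.
    "lasso logistic": LogisticRegression(penalty="l1", solver="liblinear", C=0.5),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    test_error = 1 - model.score(X_te, y_te)   # misclassification rate
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: test error = {test_error:.4f}, AUC = {auc:.4f}")
```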
