How to think about uncertainty in election models
What can we know (and measure) about the world?
Friends,
Happy Saturday. What follows is a long and long-overdue post on uncertainty in election forecasting. I think there’s a lot we can know about the world but more that we can’t. The only way to model things is to analyze that uncertainty empirically and rigorously.
(PS: I typed much of the post on my phone so please be forgiving of any typos.)
Elliott
I often find myself flipping back and forth between thinking that (a) most political outcomes are fundamentally predictable ahead of time and (b) there is too much uncertainty in the world to really know anything confidently at all.
On the one hand, we like to think that our attitudes and behaviors are shaped by measurable externalities and rules. On the other, those external forces are constantly changing, and unpredictable things happen more often than people think. In Michael Lewis's biography of Amos Tversky and Daniel Kahneman, the author excerpts remarks from one of their students, who says Kahneman was rarely ever certain about anything except doubt itself. As a general rule, I think that's pretty good advice for people who anticipate events or outcomes for a living. But it's not the only rule we can follow (and rarely a practical one for modelers, let alone a satisfying one).
I've been giving this a lot of thought recently as I’m seeing more and more election forecasts (including my own) pop up. Disclaimer: These are just the musings of an amateur social scientist/data analyst/statistician, etc., but I think (though I can't be sure) that I can provide some insights worth sharing with y'all.
So here’s the overarching question I’m pondering: What happens if the key predictors of electoral performance break down this year? And, just as important: What's the likelihood of that happening?
The fundamentals
First, consider that there could be an unforeseen disconnect between the results of the election and the political and economic fundamentals that we typically use to predict them. This disconnect could show up both in the average predicted outcome and in the range of outcomes we assign to the forecast (statistically, the mean and the variance of our distribution). Maybe GDP growth doesn't correlate with vote shares this time around because Trump doesn't get blamed for the recession the way presidents normally do (some polling suggests that people are blaming covid-19 instead). For now, set aside the clear caveat that Trump is currently being indirectly blamed for the recession because of his poor handling of the virus.
Consider also the chance that the president underperforms the political fundamentals (his approval rating) because polarization has increased the congruence between how people feel about the president and whether they vote for him. In the past, an incumbent with a -5 net approval rating would be favored to win a bare popular vote victory. But you don't see a lot of reluctant Trump voters (those who disapprove of him but say they will support him) in the polls right now, so he might come in on the low side of whatever we’d predict for him at -15 (his rating right now).
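To make that concrete, here's a minimal sketch of the approval-to-vote-share mapping I'm describing. Every number below is invented for illustration (these are not real election results); the point is just that the mapping is a regression estimated from a small handful of past elections, and it carries its own error bar that polarization could quietly widen:

```python
import numpy as np

# Toy fundamentals model: regress the incumbent's two-party vote share
# on net approval. All data here are made up for illustration.
net_approval = np.array([-18, -10, -5, 0, 5, 8, 15, 20], dtype=float)
vote_share = np.array([45.9, 47.5, 49.8, 50.4, 51.0, 52.1, 54.0, 55.2])

# Ordinary least squares fit (np.polyfit returns slope, then intercept)
slope, intercept = np.polyfit(net_approval, vote_share, 1)

# Point prediction for an incumbent sitting at -15 net approval
pred = intercept + slope * -15
residuals = vote_share - (intercept + slope * net_approval)
sigma = residuals.std(ddof=2)  # residual std. error (2 parameters fit)

print(f"predicted share: {pred:.1f}% (rough 95% interval: +/- {2*sigma:.1f})")
```

If the reluctant, disapprove-but-support voter has vanished, the true 2020 slope is steeper than the historical one, and the honest move is to widen that interval rather than trust the point estimate.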
As I've written about extensively, we can also fall into the trap of relying too much on recent electoral history when we build these underlying regression models. If an entirely new event occurs that isn't included in the training set and causes the models to implode, we're not only concerned about the chance that our mean prediction is off, but also about assigning too low a likelihood to the ultimate result.
We do have some techniques for making sure these models aren’t overfit to historical data. For one, we can hold each past election out of the training data, predict it as if it were out of sample, and calibrate our uncertainty accordingly. But beyond that there’s no real solution. There are some creative ways to add extra error to our predictions, increasing the height of the tails of the distribution of outcomes, but this too is no certain fix.
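Here's a small sketch of that first technique, often called leave-one-out validation: refit the model with each election held out, predict the held-out year, and size the forecast error from those out-of-sample misses rather than the flattering in-sample residuals. (Again, the data are placeholders, and the Student-t trick at the end is just one way to fatten tails, not a settled method.)

```python
import numpy as np

# Leave-one-out validation for the toy fundamentals model above.
X = np.array([-18, -10, -5, 0, 5, 8, 15, 20], dtype=float)
y = np.array([45.9, 47.5, 49.8, 50.4, 51.0, 52.1, 54.0, 55.2])

oos_errors = []
for i in range(len(X)):
    mask = np.arange(len(X)) != i              # hold election i out
    slope, intercept = np.polyfit(X[mask], y[mask], 1)
    oos_errors.append(y[i] - (intercept + slope * X[i]))

oos_rmse = np.sqrt(np.mean(np.square(oos_errors)))
print(f"out-of-sample RMSE: {oos_rmse:.2f} points")

# One (imperfect) way to fatten the tails: draw simulated errors from
# a Student-t with few degrees of freedom instead of a normal.
sims = oos_rmse * np.random.default_rng(0).standard_t(df=4, size=10_000)
```

A t-distribution puts more mass on large misses than a normal with the same scale, which is exactly the property we want when history may not contain the worst case.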
The polls
Our model could also go wrong if polls are uniformly (and drastically) worse this year than they have been in recent elections. The polling error that most people use for parameterizing their models is equal to the average (or, my preference, the root-mean-squared) error of national polls going back to the mid-1900s. (If you want to be fancier, you'll also let state polling errors be a bit larger than the national ones, but that’s beside the point for now.)
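The difference between those two summary statistics matters more than it looks. A quick sketch, with placeholder numbers rather than the actual historical record:

```python
import numpy as np

# Final national polling-average miss for each past cycle, in points
# (signed: positive = polls overshot one party). Placeholder values.
errors = np.array([1.2, -2.1, 0.8, 3.0, -1.5, 2.4, -0.9])

mean_abs = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors**2))   # squaring penalizes the big misses

print(f"mean absolute error: {mean_abs:.2f}  RMSE: {rmse:.2f}")
```

RMSE is always at least as large as the mean absolute error, which is why I prefer it: the worst polling years should count for more when we set the width of our intervals.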
However, past error might not be enough to account for the chance that all the polls in one state—or a group of states—are biased against one candidate in the same direction because of yet another unforeseen event. In 2016, the polls were off by 6-7 points in Wisconsin (depending on who you ask) and by similar amounts in Michigan and New Hampshire. Those misses landed outside the margin of error of many poll-based forecasts, especially ones that didn't take the high number of undecideds into account and inflate the amount of potential error. What if that happens again this year? What if Trump voters decide to lie to pollsters en masse (something we don't have evidence of, for what it's worth)? What if the pandemic causes pollsters to overestimate turnout in the cities?
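One standard guard against that kind of across-the-board miss (this is a sketch of the general idea, not any particular forecaster's actual code, and the margins and error scales are invented) is to add a single shared bias term to every state in each simulation, so the whole map can swing together:

```python
import numpy as np

rng = np.random.default_rng(2020)
poll_margins = {"WI": 0.5, "MI": 1.0, "NH": 2.0}   # hypothetical margins

n_sims = 10_000
# One shared national bias per simulation: all states move together.
national_bias = rng.normal(0.0, 2.0, size=n_sims)

win_prob = {}
for state, margin in poll_margins.items():
    state_noise = rng.normal(0.0, 3.0, size=n_sims)  # idiosyncratic error
    sims = margin + national_bias + state_noise
    win_prob[state] = (sims > 0).mean()              # share of sims won

print(win_prob)
```

Without the shared term, a simultaneous 6-point miss in Wisconsin, Michigan and New Hampshire looks nearly impossible; with it, those joint misses show up in a meaningful share of simulations.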
We also know (not just from 2016, but from the elections that came before) that polls usually err by similar amounts in states with similar social, political and demographic compositions. But those correlations aren't constant across elections. That's because the correlates of vote choice and turnout change from year to year. In 2004, income was one of the strongest predictors of which party people voted for, and education wasn't. Back then, poor, working-class whites were still aligned with the Democratic Party, and there was almost no education gradient in the election whatsoever. But by 2016 that had flipped.
If a forecaster failed to take the racial and educational composition of a state into account when determining the correlation of polling errors, they would have missed 2016 by more than most forecasters did. Similarly, if there's a new primary correlate of vote choice in 2020 that we don't capture in the model, the model won't be as predictive as it could be.
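A toy version of that idea (the demographic figures are invented, and real models use many more features than two) is to let the correlation between two states' polling errors come from the similarity of their demographic profiles:

```python
import numpy as np

# Invented state profiles: [% white non-college, % urban]. Illustrative only.
features = {"WI": [0.55, 0.25], "MI": [0.50, 0.35], "GA": [0.25, 0.45]}
names = list(features)
F = np.array([features[s] for s in names])
F = (F - F.mean(axis=0)) / F.std(axis=0)       # standardize each column

# Cosine similarity between profiles doubles as an error correlation
# matrix (a Gram matrix of unit vectors, so valid by construction).
U = F / np.linalg.norm(F, axis=1, keepdims=True)
corr = U @ U.T

# Draw correlated polling errors with a 3-point statewide scale
scale = 3.0
cov = scale**2 * corr
draws = np.random.default_rng(1).multivariate_normal(
    np.zeros(len(names)), cov, size=5000)
print(np.corrcoef(draws.T).round(2))           # recovered correlations
```

In draws like these, demographically similar states (Wisconsin and Michigan here) miss in the same direction far more often than dissimilar ones, which is the behavior 2016 punished forecasters for leaving out. The catch, as I said, is that the features that matter can change between elections.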
The conditional factors
Finally, any predictive model explicitly ignores the factors you don't plug into it. We typically say that there is a set of variables on which our model is conditioned. In covid prediction models, for example, the model is conditional on the virus not mutating into a new, deadlier or more infectious strain. In election prediction, most of those conditions involve rare events that we don't think we need to model, or don't know how to. That's stuff like (a) both parties' nominees being alive, (b) both being on the ballot on November 3rd, (c) the election actually happening, and (yes) (d) a meteor not hitting the Earth.
But if one of those events does happen, there's really no way to ensure that the model is running "correctly."
I hope this post has given you an idea of what we can and cannot know about the world. We have shown how we can control for the “known unknowns” (fundamentals error, pollster error, modeling error) that we already have implicit data on. But what about the “unknown unknowns?” The things that might impact Election Day that we can’t know about and control for ahead of time?
Well, that is where my doubt really kicks in. If we have already tried our best to account for everything we can, the only thing left is to hope that the unforeseen events don’t break things.
Whatever the polls show, the missing variable here involves political administration. Ted Lowi used to argue that political scientists, like Macy's, treated the 3 to 4 percent that slipped through due to corruption as an acceptable loss, given the costs of controlling for it. But what if the 70,000 "monitors" the RNC plans to dispatch, and other state-level actions, actually do severely and disproportionately depress turnout? Not your area of expertise, I know, but is anyone looking at this?