What makes a model good? 📊 August 9, 2020
A follow-up to some tweets on how we train and test statistical models
One way to make a model more uncertain is to use a statistical distribution with fat tails. But that’s a little demeaning—the Cauchy prefers to be called “big-boned.” And this is my weekly newsletter.
I’m G. Elliott Morris, a data journalist and political analyst who mostly covers polls, elections, and political science. As always, I invite you to drop me a line (or just respond to this email). Please hit the ❤️ below the title if you like what you’re reading; it’s our little trick to sway Substack’s curation algorithm. If you want more content, I publish subscriber-only posts 1-2x a week.
I imagine we will be getting some more election-forecasting models in the coming weeks. No doubt they will differ from each other. This is a blog post about a Twitter discussion between me and Nate Silver on how to determine what sets different models apart. In particular, we will answer the questions: What gives us faith in a model that hasn’t made real-world predictions before? Can we be confident in its future performance, or are indicators just a rough guide? And, if the latter, is that still helpful?
One thing I’ve learned about modeling since 2016 is that it takes a lot of hard work and rethinking to get it right. It’s easy to get stuck in loops of affirmational thinking about a formula you’ve landed on or parameterization you think is good. I think one of Nate’s real strengths is figuring out how to break out of those loops and really land on a good, robust solution. Or maybe he’s just really, really lucky. Either way, this is something I can probably work on.
I was on NPR’s On Point radio show on Friday to talk about uncertainty in political polling, and why poll-based election forecasts aren’t gospel. I think we (me and the other panelists) had a good discussion and I think you should give the show a listen, even though it is not directly germane to the topic of this post.
The On Point segment and my recent history of interacting with Nate Silver on Twitter made me think that this tweet he sent Saturday evening was about me. I’m going to be excerpting the Twitter discussions in text form here, but you can read the tweets for yourself at this link.
It is generally better to think more carefully about how to pragmatically account for real-world uncertainty in a forecast than to not think carefully about it but then engage in a whole lot of existential conversation about it.
Maybe it was just my ego, but I was pretty sure he was referencing me and the On Point segment. So I typed this back:
So... assuming this is directed at me, We include the potential for systematic bias and terms for pollster-level, mode, population, and weighting effects & have extra measurement error to account for other non-sampling variance. I think that mirrors the NPR interview pretty well!
We’re certainly open to having a discussion about extra uncertainty from covid, which we think mostly is sucked into variance over LV modeling. But it seems your tweet is more about historical accuracy in polls, in which case: consult our record!
To which Nate replied:
It wasn't really about you or anyone in particular but if you're talking about a historical record for a forecast that **hadn't actually been published before this year** then that likely indicates a lot of barking up the wrong tree.
Fundamentally, the failure to understand that *fitting a model* is not the same thing as *out-of-sample forecasting* is the problem. And it's not really about you, at all. It's why a lot of "data science" curriculum sucks.
One more tweet from me before I start adding more context:
Well, ok then. But now I’m curious: can we not study things unless we make predictions about them before they happen? If your argument is not to trust our model because we haven’t published it before, I don’t know what to say to you. Aside from maybe let’s talk again on Nov 4th?
Okay, let’s start dissecting claims and figure out where the real point of tension is before moving forward.
First, the original premise gets dropped immediately. We never talk about uncertainty in polling and the election model I helped build for The Economist again. Instead, we’re going to get into some technical banter about model-building and validating predictions.
Nate’s central claim is that there is a difference between (1) a procedure used to test and calibrate statistical models, known as “train-test-validation splits,” and (2) actually making predictions about events before they happen. I obviously agree with this. And something that may have been muddied by the reactive nature of conversing on Twitter is that I think the performance of models on the latter basis is the best way to gauge reliability. Full stop.
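To make the first item concrete, here is a minimal Python sketch of a holdout check. Everything in it is an assumption for illustration: the data are simulated, the one-predictor linear model and the names `fit_line` and `rmse` are hypothetical, and none of it is the model I helped build for The Economist. It fits on one slice of observations and then reports error separately on a slice the fit never saw.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def rmse(params, xs, ys):
    """Root-mean-squared error of the fitted line on the given points."""
    a, b = params
    squared = [(y - (a + b * x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


# Simulated data (an assumption, not real election results):
# x is a made-up predictor, y is a made-up vote share.
rng = random.Random(0)
xs = [rng.uniform(-5, 5) for _ in range(18)]
ys = [50 + 1.5 * x + rng.gauss(0, 2) for x in xs]

# Fit on the first 12 observations; hold out the last 6.
train_x, train_y = xs[:12], ys[:12]
valid_x, valid_y = xs[12:], ys[12:]

params = fit_line(train_x, train_y)
train_rmse = rmse(params, train_x, train_y)  # in-sample fit
valid_rmse = rmse(params, valid_x, valid_y)  # held-out ("validation") error

print(f"in-sample RMSE: {train_rmse:.2f}")
print(f"held-out RMSE:  {valid_rmse:.2f}")
```

The held-out number is still not a forecast made before the fact: the person building the model has seen those outcomes even if the fitting routine has not.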
So I asked Nate a related question: whether comparing models on the basis of their validation performance—the term we use for how accurate our predictions are on events or observations that have occurred in reality but which our models don’t technically see beforehand—is helpful to determine if they are worthwhile.
My read of the following tweet is that he doesn’t think so. In fact, he calls validation-set accuracy “irrelevant.”
You're really not getting it. You didn't publish a model in 2016. It's neither interesting nor relevant what a model designed in 2020 would have said about 2016. This is such a fundamental point that it depresses me a bit about the status of "data science" education tbh. […] Given "reasonable" choices of parameters, backtesting often tells you little about how a model will perform on unknown data.
There is no "validation set" other than reality. You can't determine predictive accuracy until a model actually makes (publicly disclosed ahead-of-time) predictions. And to the extent this seems like a heterodox view, it reflects a lot of failings in data-science education.
This is a bit hard for me to agree with. I believe that how a model performs on unseen data is actually very relevant. That’s not because testing procedures will tell you with 100% accuracy how the model will do in the future, but because robust testing can tell you which of two candidate models is likely to do better on totally unseen observations (e.g., the 2020 election). I lay out my case with an example in four tweets:
Play this out. Model A uses polls and economic data to predict elections. B uses yard signs. The literal application of your argument means you can’t quantify which one is likelier to do better in Nov. That’s total BS because you have validation data to serve as a rough guide!
The more reasonable interpretation of what you’re saying is that validation-set performance is meaningless for comparing models. In which case I’d love to know your development procedure, bc how else do you judge whether including one new variable or spec improves performance?
If we can’t know that a given model will be improved by something using train-test-validation splits, how do we do it? Make one minor revision every four years and then wait to find out how the model did? That’s silly!
Again the claim here is not that validation-set performance is the best measure of a model’s quality. The claim is that **comparing different models on the basis of their validation-set performance, holding information constant, is instructive.** Can you engage with that?
Nate never responded to this tweet, but we did have a bit more back-and-forth elsewhere. This is where the discussion stopped being productive so I’m done quoting tweets.
What makes a model good?
Let me make one thing clear for readers before continuing: I value Nate’s work and think he’s a good statistician. His use of data to learn about politics was one of my original inspirations for pursuing a career in political data science. I don’t think he’s a dummy and I believe we learn a lot when we engage. So I’m happy he’s taking the time to respond to me on Twitter.
That being said, I just don’t think we’re getting through to each other. So let me make two things clear:
First, I 100% agree that measuring predictive accuracy on “unseen” data is not a suitable replacement for calculating how a model, built before an election, performed on said election after-the-fact. That’s a pretty simple statistical point and I’m not sure if Nate came away with the impression that I knew what he was talking about. I did, but was focusing on the latter point, which is two-fold:
Second, comparing how candidate models do on the same “unseen” validation data, with the same choice of parameters and the same available information, is crucial to developing a good final model. It’s also a reasonable basis for calibrating expectations for separate models on future, truly unseen data.
It seems like we never really got around to discussing the second point, which I don’t really see a glaring fault in. The overall thrust of my argument is that if a model performed similarly to Nate’s in 2008, 2012, and 2016, we should expect that it will achieve similar performance in 2020, with some room for error.
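The comparison I have in mind can be sketched in a few lines of Python. This is an illustration under loud assumptions, not anyone's actual forecasting model: the data are simulated, the "polls" and "yard signs" predictors are stand-ins borrowed from my tweet above, and `fit_line` and `loo_rmse` are hypothetical helpers. Each candidate is scored on the same held-out observations by leaving one out at a time, refitting, and predicting the one left out.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def loo_rmse(xs, ys):
    """Leave-one-out RMSE: refit without point i, then predict point i."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        squared_errors.append((ys[i] - (a + b * xs[i])) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Simulated data (an assumption, not real results): the outcome tracks
# the "polls" predictor plus noise and is unrelated to "yard signs".
rng = random.Random(1)
n = 15
polls = [rng.uniform(40, 60) for _ in range(n)]
yard_signs = [rng.uniform(0, 100) for _ in range(n)]
outcome = [p + rng.gauss(0, 2) for p in polls]

# Both candidates are scored against the same held-out outcomes.
score_a = loo_rmse(polls, outcome)       # candidate A: polls
score_b = loo_rmse(yard_signs, outcome)  # candidate B: yard signs

print(f"Model A (polls) held-out RMSE:      {score_a:.2f}")
print(f"Model B (yard signs) held-out RMSE: {score_b:.2f}")
```

Whichever candidate has the lower held-out error is the one I would expect, with room for error, to do better on truly unseen data. The sketch says nothing about how large that room should be.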
And, to bring it full circle to the NPR interview: one way that Nate could help modelers determine whether their models are capturing the uncertainty in polling as well as his before the election happened is if he compared his validation-set accuracy to theirs. Holding prior information constant and comparing accuracy should provide some useful data for calibrating expectations.
That’s my central claim. It seems pretty reasonable to me. If a model is so demonstrably bad, one should be able to demonstrate why.
Posts for subscribers
August 9: Are the polls tightening? Five questions to ask yourself when it looks like the race might be shifting
What I'm Reading and Working On
I moved apartments this week, which means I now have 300 books stacked on the floor waiting to be shelved. Maybe I’ll get to that on November 4th…
This week I will be writing about the history of the electoral college and doing some math about how it confers advantages to different groups of people for arbitrary reasons that hurt democracy.
Thanks for reading!
Thanks for reading. I’ll be back in your inbox next Sunday. In the meantime, follow me online or reach out via email if you’d like to engage. I’d really love to hear from you.
If you want more content, I publish subscriber-only posts on Substack 1-3 times each week. Sign up today for $5/month (or $50/year). Even if you don't want the extra posts, the funds go toward supporting the time spent writing this free, weekly letter. Your support makes this all possible!
Nobody sent in a picture of their pets this week, which is more than a little disappointing. Send in a photo of your pets for next week to firstname.lastname@example.org. In the meantime, here’s my cat Bacon: