Every week, teams each submit not only a point forecast predicting a single number outcome (say, that in one week there will be 500 deaths). They also submit probabilistic predictions that quantify the uncertainty by estimating the likelihood of the number of cases or deaths at intervals, or ranges, that get narrower and narrower, targeting a central forecast. For instance, a model might predict that there’s a 90 percent probability of seeing 100 to 500 deaths, a 50 percent probability of seeing 300 to 400, and 10 percent probability of seeing 350 to 360.
“It’s like a bull’s eye, getting more and more focused,” says Reich.
Funk adds: “The sharper you define the target, the less likely you are to hit it.” It’s fine balance, since an arbitrarily wide forecast will be correct, and also useless. “It should be as precise as possible,” says Funk, “while also giving the correct answer.”
In collating and evaluating all the individual models, the ensemble tries to optimize their information and mitigate their shortcomings. The result is a probabilistic prediction, statistical average, or a “median forecast.” It’s a consensus, essentially, with a more finely calibrated, and hence more realistic, expression of the uncertainty. All the various elements of uncertainty average out in the wash.
The study by Reich’s lab, which focused on projected deaths and evaluated about 200,000 forecasts from mid-May to late-December 2020 (an updated analysis with predictions for four more months will soon be added), found that the performance of individual models was highly variable. One week a model might be accurate, the next week it might be way off. But, as the authors wrote, “In combining the forecasts from all teams, the ensemble showed the best overall probabilistic accuracy.”
And these ensemble exercises serve not only to improve predictions, but also people’s trust in the models, says Ashleigh Tuite, an epidemiologist at the Dalla Lana School of Public Health at the University of Toronto. “One of the lessons of ensemble modeling is that none of the models is perfect,” Tuite says. “And even the ensemble sometimes will miss something important. Models in general have a hard time forecasting inflection points—peaks, or if things suddenly start accelerating or decelerating.”
The use of ensemble modeling is not unique to the pandemic. In fact, we consume probabilistic ensemble forecasts every day when Googling the weather and taking note that there’s 90 percent chance of precipitation. It’s the gold standard for both weather and climate predictions.
“It’s been a real success story and the way to go for about three decades,” says Tilmann Gneiting, a computational statistician at the Heidelberg Institute for Theoretical Studies and the Karlsruhe Institute of Technology in Germany. Prior to ensembles, weather forecasting used a single numerical model, which produced, in raw form, a deterministic weather forecast that was “ridiculously overconfident and wildly unreliable,” says Gneiting (weather forecasters, aware of this problem, subjected the raw results to subsequent statistical analysis that produced reasonably reliable probability of precipitation forecasts by the 1960s).
Gneiting notes, however, that the analogy between infectious disease and weather forecasting has its limitations. For one thing, the probability of precipitation doesn’t change in response to human behavior—it’ll rain, umbrella or no umbrella—whereas the trajectory of the pandemic responds to our preventative measures.
Forecasting during a pandemic is a system subject to a feedback loop. “Models are not oracles,” says Alessandro Vespignani, a computational epidemiologist at Northeastern University and ensemble hub contributor, who studies complex networks and infectious disease spread with a focus on the “techno-social” systems that drive feedback mechanisms. “Any model is providing an answer that is conditional on certain assumptions.”
When people process a model’s prediction, their subsequent behavioral changes upend the assumptions, change the disease dynamics and render the forecast inaccurate. In this way, modeling can be a “self-destroying prophecy.”
And there are other factors that could compound the uncertainty: seasonality, variants, vaccine availability or uptake; and policy changes like the swift decision from the CDC about unmasking. “These all amount to huge unknowns that, if you actually wanted to capture the uncertainty of the future, would really limit what you could say,” says Justin Lessler, an epidemiologist at the Johns Hopkins Bloomberg School of Public Health, and a contributor to the COVID-19 Forecast Hub.
The ensemble study of death forecasts observed that accuracy decays, and uncertainty grows, as models make predictions farther into the future—there was about two times the error looking four weeks ahead versus one week (four weeks is considered the limit for meaningful short-term forecasts; at the 20-week time horizon there was about five times the error).
“It’s fair to debate when things worked and when things didn’t.”
But assessing the quality of the models—warts and all—is an important secondary goal of forecasting hubs. And it’s easy enough to do, since short-term predictions are quickly confronted with the reality of the numbers tallied day-to-day, as a measure of their success.
Most researchers are careful to differentiate between this type of “forecast model,” aiming to make explicit and verifiable predictions about the future, which is only possible in the short- term; versus a “scenario model,” exploring “what if” hypotheticals, possible plotlines that might develop in the medium- or long-term future (since scenario models are not meant to be predictions, they shouldn’t be evaluated retrospectively against reality).
During the pandemic, a critical spotlight has often been directed at models with predictions that were spectacularly wrong. “While longer-term what-if projections are difficult to evaluate, we shouldn’t shy away from comparing short-term predictions with reality,” says Johannes Bracher, a biostatistician at the Heidelberg Institute for Theoretical Studies and the Karlsruhe Institute of Technology, who coordinates a German and Polish hub, and advises the European hub. “It’s fair to debate when things worked and when things didn’t,” he says. But an informed debate requires recognizing and considering the limits and intentions of models (sometimes the fiercest critics were those who mistook scenario models for forecast models).
Similarly, when predictions in any given situation prove particularly intractable, modelers should say so. “If we have learned one thing, it’s that cases are extremely difficult to model even in the short run,” says Bracher. “Deaths are a more lagged indicator and are easier to predict.”
In April, some of the European models were overly pessimistic and missed a sudden decrease in cases. A public debate ensued about the accuracy and reliability of pandemic models. Weighing in on Twitter, Bracher asked: “Is it surprising that the models are (not infrequently) wrong? After a 1-year pandemic, I would say: no.” This makes it all the more important, he says, that models indicate their level of certainty or uncertainty, that they take a realistic stance about how unpredictable cases are, and about the future course. “Modelers need to communicate the uncertainty, but it shouldn’t be seen as a failure,” Bracher says.
Trusting some models more than others
As an oft-quoted statistical aphorism goes, “All models are wrong, but some are useful.” But as Bracher notes, “If you do the ensemble model approach, in a sense you are saying that all models are useful, that each model has something to contribute”—though some models may be more informative or reliable than others.
Observing this fluctuation prompted Reich and others to try “training” the ensemble model—that is, as Reich explains, “building algorithms that teach the ensemble to ‘trust’ some models more than others and learn which precise combination of models works in harmony together.” Bracher’s team now contributes a mini-ensemble, built from only the models that have performed consistently well in the past, amplifying the clearest signal.
“The big question is, can we improve?” Reich says. “The original method is so simple. It seems like there has to be a way of improving on just taking a simple average of all these models.” So far, however, it is proving harder than expected—small improvements seem feasible, but dramatic improvements may be close to impossible.
A complementary tool for improving our overall perspective on the pandemic beyond week-to-week glimpses is to look further out on the time horizon, four to six months, with those “scenario modeling.” Last December, motivated by the surge in cases and the imminent availability of the vaccine, Lessler and collaborators launched the COVID-19 Scenario Modeling Hub, in consultation with the CDC.