Issue 76

Forecasting lessons

Delivered on 04 May 2020 by Justin Pyvis. About a 6 min read.

Eleven years ago, programmers at Google thought they had solved a long-running problem: early detection of disease activity. As they wrote in a 2009 article in the journal Nature, such an innovation could save millions of lives:

In addition to seasonal influenza, a new strain of influenza virus against which no previous immunity exists and that demonstrates human-to-human transmission could result in a pandemic with millions of fatalities. Early detection of disease activity, when followed by a rapid response, can reduce the impact of both seasonal and pandemic influenza.

The ability to rapidly trace COVID-19 outbreaks might have helped many countries avoid costly lockdowns. Here's how Google tackled the problem:

Because the relative frequency of certain queries is highly correlated with the percentage of physician visits in which a patient presents with influenza-like symptoms, we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day. This approach may make it possible to use search queries to detect influenza epidemics in areas with a large population of web search users.

Google's method was a black box: four years' worth of data covering 50 million search queries fed 450 million different models, which were tested across a distributed computing framework with hundreds of machines. Sounds impressive, right? The problem was, it didn't work:

GFT failed—and failed spectacularly—missing at the peak of the 2013 flu season by 140 percent. When Google quietly euthanized the program, called Google Flu Trends (GFT), it turned the poster child of big data into the poster child of the foibles of big data.

For example, Google’s algorithm was quite vulnerable to overfitting to seasonal terms unrelated to the flu, like “high school basketball.” With millions of search terms being fit to the CDC’s data, there were bound to be searches that were strongly correlated by pure chance, and these terms were unlikely to be driven by actual flu cases or predictive of future trends. Google also did not take into account changes in search behavior over time. After the introduction of GFT, Google introduced its suggested search feature as well as a number of new health-based add-ons to help people more effectively find the information they need. While this is great for those using Google, it also makes some search terms more prevalent, throwing off GFT’s tracking.
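
The "correlated by pure chance" problem is easy to reproduce. The sketch below is not GFT's method: it uses purely synthetic random numbers, and the series lengths, candidate count and seed are arbitrary assumptions. It screens a couple of thousand unrelated series against a target, keeps the best in-sample match, and then checks that match on data it has not seen.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(76)  # fixed seed so the run is repeatable
TRAIN, HOLDOUT, CANDIDATES = 20, 20, 2000  # arbitrary illustrative sizes

# A made-up "flu activity" series: pure noise, so nothing can truly predict it.
target = [rng.gauss(0, 1) for _ in range(TRAIN + HOLDOUT)]

# Made-up "search term" series, each generated independently of the target.
candidates = [[rng.gauss(0, 1) for _ in range(TRAIN + HOLDOUT)]
              for _ in range(CANDIDATES)]

# Pick the term that best matches the target on the training window only.
best = max(candidates, key=lambda c: pearson(c[:TRAIN], target[:TRAIN]))

train_r = pearson(best[:TRAIN], target[:TRAIN])
holdout_r = pearson(best[TRAIN:], target[TRAIN:])
print(f"in-sample r = {train_r:.2f}, out-of-sample r = {holdout_r:.2f}")
```

The winning series looks like a strong predictor on the training window even though, by construction, it has no relationship with the target at all. With millions of real search terms rather than 2,000 synthetic ones, the opportunity for this kind of accident only grows.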

The lesson? For forecasting, simplicity is often better:

Our team from Northeastern University, the University of Houston, and Harvard University compared the performance of GFT with very simple models based on the CDC’s data, finding that GFT had begun to perform worse. Moreover, we highlighted a persistent pattern of GFT performing well for two to three years and then failing significantly and requiring substantial revision.

All Big Data, machine learning and artificial intelligence (AI) systems share the same flaw: they can only draw inferences from past events. That would be fine if we lived in a mostly static, unchanging world. But we don't, as a more recent Google AI medical experiment discovered:

When introducing new technologies, planners, policy makers, and technology designers did not account for the dynamic and emergent nature of issues arising in complex healthcare programs. The authors argue that attending to people—their motivations, values, professional identities, and the current norms and routines that shape their work—is vital when planning deployments.

We live in a world with people, not robots. Dynamic, independent people. Occasionally one of them might think it's a good idea to chow down on a bat sandwich.

It seems that, at least in the private sector, forecasters are starting to acknowledge that fact:

Most retail companies rely on some type of model or algorithm to help predict what their customers will want, whether it be a simple Excel spreadsheet or a refined, engineer-built program. Normally, those models are fairly reliable and work well. But just like everything else, they’re affected by the pandemic.

“When you have something like COVID-19, it’s just a total outlier,” says Joel Beal, the co-founder of the consumer goods analytics company Alloy. “No model can predict that.”

Because of the massive, worldwide disruptions, the normal data feeding the models — which include buying patterns over years — aren’t as relevant.

“You’re probably going to not use as much historical data or will not be weighing that as much as you expected,” Beal says. Instead, companies are likely using much more recent data: looking to last week to predict next week, for example, or just relying on the few months of information on what was purchased since the pandemic took off worldwide.
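
To see why recent data gets more weight after a shock, here is a rough sketch using made-up weekly sales figures (not Alloy's data or method). It compares a forecast that averages all available history with simple exponential smoothing. The smoothing factor `alpha` is an assumed value chosen purely for illustration.

```python
def full_history_mean(series):
    """Forecast the next period as the average of everything seen so far."""
    return sum(series) / len(series)

def exponential_smoothing(series, alpha):
    """Forecast the next period, weighting recent observations more heavily.

    alpha close to 1 leans almost entirely on the latest data;
    alpha close to 0 behaves more like a long-run average.
    """
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

# Hypothetical weekly unit sales: six normal weeks, then a pandemic-style jump.
weekly_sales = [100, 102, 98, 101, 99, 100, 180, 190, 185]

long_run = full_history_mean(weekly_sales)
recent = exponential_smoothing(weekly_sales, alpha=0.7)
print(f"full-history forecast: {long_run:.1f}")
print(f"recency-weighted forecast: {recent:.1f}")
```

On these numbers the full-history average sits well below the new level, while the recency-weighted forecast tracks it closely. The trade-off is that heavy recency weighting also chases noise once conditions settle down, which is one reason a human still has to decide how much history to trust.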

For now, judgement has replaced black box models as the most important component in decision making. According to Beal:

“Companies have to rely more on good demand planners and forecasting people, who will say, ‘do I believe this?’ Rather than believing these models will be able to capture everything that’s going on.”

My personal view is that models are useful as a sanity check. If you expect X to be Y at a certain date, a model can tell you whether that is reasonable by running a regression against historical data. Does adding judgement bias the forecast? Sure. But unlike a model's bias, that bias can be defended, and it brings the transparency and flexibility that many so-called sophisticated models sorely lack.
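
As a minimal sketch of that kind of check, the code below fits a straight line to a short, hypothetical history and asks whether an expected value at a future date falls within a rough band around the trend. The data, the function names and the band width of two standard errors are all assumptions for illustration. It is a crude screen, not a formal statistical test.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def is_plausible(xs, ys, target_x, expected_y, width=2.0):
    """True if expected_y is within `width` standard errors of the trend at target_x."""
    n = len(xs)
    intercept, slope = fit_line(xs, ys)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    resid_var = sum(r ** 2 for r in residuals) / (n - 2)
    # Standard error of a new observation at target_x; it grows the
    # further target_x sits from the centre of the historical data.
    pred_se = (resid_var * (1 + 1 / n + (target_x - mean_x) ** 2 / sxx)) ** 0.5
    predicted = intercept + slope * target_x
    return abs(expected_y - predicted) <= width * pred_se

# Hypothetical history: X observed over six periods.
periods = [0, 1, 2, 3, 4, 5]
values = [10.0, 12.1, 13.9, 16.2, 17.8, 20.1]

print(is_plausible(periods, values, target_x=8, expected_y=26.3))  # close to trend
print(is_plausible(periods, values, target_x=8, expected_y=40.0))  # far from trend
```

A failed check does not mean the expectation is wrong. It means the forecaster should be able to explain why this time is different from the historical pattern.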

Enjoy the rest of this week's issue. Cheers,

— Justin

Other bits of interest

Speaking of models

This hasn't aged well. We're now at the start of May, meaning Sweden would need to record another 49,000 COVID-19 deaths in the next two months just to reach the lower bound of this study's 95% confidence interval. Sweden currently has about 2,600 COVID-19 deaths.

If only a bit of judgement had been used 🤔 Emphasis mine:

Our model for Sweden shows that, under conservative epidemiological parameter estimates, the current Swedish public-health strategy will result in a peak intensive-care load in May that exceeds pre-pandemic capacity by over 40-fold, with a median mortality of 96,000 (95% CI 52,000 to 183,000).
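
For anyone who wants to check the gap, the arithmetic uses only the figures quoted in this section. The 2,600 current-deaths figure is the approximate number cited above.

```python
current_deaths = 2_600   # approximate figure cited in this section
ci_lower = 52_000        # lower bound of the study's 95% CI
median = 96_000          # study's median mortality estimate

# Additional deaths needed just to reach the lower bound.
print(ci_lower - current_deaths)           # 49400

# How many times larger the median estimate is than the current toll.
print(round(median / current_deaths, 1))   # 36.9
```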

Note that Sweden's approach is not 'live and let live'; the government has generally encouraged voluntary social distancing, closed its secondary schools and universities, banned visits to nursing homes, required minimum distancing at restaurants, banned gatherings of more than 50 people, and instructed people aged over 70 (i.e. those most at risk) to self-isolate.

Trust us, we're from the government

The Australian government is doing everything it can to encourage people not to use its COVID-19 tracing app, to the point where it has even ignored its own software development guidelines:

The Digital Transformation Agency's (DTA) Digital Service Standard has 13 criteria "to help government agencies design and deliver services that are simple, clear and fast".

The eighth criteria is to "make all new source code open by default".

According to the agency, making source code open saves money, increases transparency, and adds benefits through improvements by other developers.

By the time an app goes live, the DTA said the developers should be able to show how they are making the source code open and reusable, provided guidance for open source contributors, and detailed how they are going to handle bug fixes and updates to the code.

This is the same government that was just hacked, "revealing the personal details of 774,000 migrants and people aspiring to migrate to Australia".

Trust us. Yeah. Sure.

Other approaches to contact tracing

It's important for governments to get their contact tracing apps right so that they incentivise cooperation and become widely adopted. That means following the German or Italian decentralised approaches, not the unnecessarily risky and privacy-invasive centralised approaches of the UK or Australia. Otherwise, liability-conscious employers will fill the void:

The companies, which also include smaller US start-ups Locix and Microshare, want to give employers the confidence to reopen their facilities, to take action to control outbreaks, and to alert staff if they come into contact with infected colleagues.

While governments and tech companies are working on voluntary tools that send similar alerts, these may not be widely adopted. By contrast, PwC said companies could make its tool mandatory.

“You really need a majority of people to do this,” said Rob Mesirow, who leads PwC’s connected solutions practice. “US businesses are going to have to [tell employees]: If you’re going to come back to the work environment, you need this app on your phone.”
