Answers to Quora questions.: Q: How accurate is Google Flu Trends?

Monday, February 4, 2013

Q: How accurate is Google Flu Trends?

Update March 14, 2013: In 2012–13, Google Flu Trends did not successfully track the target flu indexes in the U.S., France, or Japan. Here are my slides from a talk at the Children's Health Informatics Program (March 14, 2013).

Why this happened is a mystery. Google has said they will present their own view some time this fall. I think the divergence suggests that one needs to be careful about trusting these kinds of machine-generated estimators, even when they work well for three years in a row. It can be hard to predict when they will fall down. (And without an underlying index that is still measured, you might never know when it has stopped working.)

I did an interview with WBUR's CommonHealth blog in January and again in February, and spoke on the radio in January.

Summary: At this point [Feb. 4, 2013], it appears likely that Google Flu Trends has considerably overstated this year's flu activity in the U.S. But we won't be able to draw a firmer conclusion until after the flu season has ended. I don't know why the model broke down this year, but I am eager to learn, when and if Google comes to a similar conclusion. For now, I suspect this episode may provide a cautionary tale about the limits of inference from "big data" and the perils of overconfidence in a sophisticated and seemingly omniscient statistical model.
I am not an expert on the flu and you should not make health decisions based on Quora. You should get vaccinated, wash hands often, cover a cough, stay home from work if sick, and follow the CDC's advice: Seasonal Influenza (Flu).

Here are the nationwide figures Google Flu Trends has estimated since its launch, and the underlying CDC index it tries to predict:

The CDC reported Friday (Feb. 1) that for the week of January 13, 2013, through January 19, 2013, 4.5% of doctor visits were by patients with a fever > 100 degrees F and a sore throat or cough (what's known as influenza-like illness). (This was a revision of the original figure of 4.3% that they had published the previous Friday.)

By contrast, on Jan. 20 Google had finalized its prediction for the same statistic: 10.6%. This difference of roughly 6 percentage points is larger than any that has occurred before.

For that week, in eight out of the ten HHS regions, Google's predictor was more than double the CDC's figure. (For example, here in New England where I am, Google was saying 14.2% of doctor visits were for influenza-like illness, vs. a CDC figure of only 2.9%.)
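The size of these gaps can be checked directly from the figures quoted above. Here is a minimal Python sketch that uses only the rounded percentages given in this post; the dictionary layout and the `compare` helper are hypothetical names chosen for illustration. Because the published figures are rounded to one decimal place, a gap computed this way can differ by about a tenth of a point from one computed on unrounded data.

```python
# Figures quoted in the text: percent of doctor visits for influenza-like illness,
# week of January 13-19, 2013. Published values are rounded to one decimal place.
figures = {
    "U.S. (national)": {"google": 10.6, "cdc": 4.5},
    "New England": {"google": 14.2, "cdc": 2.9},
}


def compare(google: float, cdc: float) -> tuple[float, float]:
    """Return (gap in percentage points, ratio of Google's estimate to the CDC figure)."""
    return google - cdc, google / cdc


for region, f in figures.items():
    gap, ratio = compare(f["google"], f["cdc"])
    print(f"{region}: gap = {gap:.1f} points, Google/CDC ratio = {ratio:.2f}")
```

A ratio above 2 corresponds to the "more than double" description used for the HHS regions.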

Google's scientists quite properly point out that the CDC will adjust its data retroactively, as more data comes in for past weeks. Thus the "red line" on the above graph can change over time, whereas the "blue line" is locked down in all but its very last data point. However, the CDC's adjustments have been modest so far, and even for recent weeks where the Google-CDC divergence is very large, the total number of sites reporting to the CDC has now reached close to the most it ever hits. Further dramatic revision of those data points seems unlikely.

In my nonexpert view, at this point there is little chance that Google Flu Trends's estimates can somehow be vindicated. A similar episode happened in 2009, causing the previous tweak seen above. Some of their approach may need to change if they want these divergences, repaired only after the fact, to happen less often.

Why the model did not work this year is an exciting mystery, and I am eager to learn the answer, when and if Google reaches a similar conclusion that something went awry.

Did one or two of the 160 search queries dramatically increase in popularity? Or was the effect seen over all 160 queries used in the model?
Did Google's decision not to retrain the model since 2009 make the difference? If they had retrained the model knowing what they knew in mid-2012, would this season's estimates have been better or worse? Were there hints last summer that, in retrospect, suggest they were mistaken not to retrain the model, and if so can those hints improve the decision process for retraining in the future?
Can we evaluate the effect of different retraining policies (e.g. retrain every year, retrain every month vs. the current policy of retraining only on certain triggers), and how they balance various risks? What should the triggers be?
Is it possible to estimate flu intensity using queries that a computer selects for their retrospective accuracy, without human intervention? Would the predictions be better if humans intervened to make sure every query made sense as a flu predictor? Or would this simply introduce more problems?
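One way to approach the question about retraining policies is a walk-forward backtest: replay history week by week, refresh the model on whatever schedule the policy dictates, and compare the resulting errors. The Python sketch below is only an illustration under loud assumptions. The data is synthetic, the one-parameter model is a stand-in and not Google's actual model, and `make_synthetic`, `fit_slope`, `backtest`, and all parameter values are hypothetical. It assumes the relationship between search volume and illness drifts steadily over time, which may or may not resemble what happened in 2012-13.

```python
import math


def make_synthetic(weeks: int = 208):
    """Made-up data: a seasonal query signal and a target whose relation to it drifts."""
    queries, target = [], []
    for t in range(weeks):
        q = 2.0 + math.sin(2 * math.pi * t / 52)  # seasonal search volume (invented)
        slope = 1.0 - 0.4 * t / weeks             # illness per unit of search slowly falls
        queries.append(q)
        target.append(slope * q)
    return queries, target


def fit_slope(q, y):
    """Least-squares fit of y = slope * q (no intercept)."""
    return sum(a * b for a, b in zip(q, y)) / sum(a * a for a in q)


def backtest(queries, target, train_weeks=104, retrain_every=None, window=52, lag=2):
    """Walk forward one week at a time and return the mean absolute error.

    retrain_every=None means the model is fit once and never refreshed.
    `lag` is the number of most recent weeks whose target is not yet known
    at retraining time (the measured index arrives with a delay).
    """
    slope = fit_slope(queries[:train_weeks], target[:train_weeks])
    errors = []
    for t in range(train_weeks, len(queries)):
        if retrain_every and (t - train_weeks) % retrain_every == 0:
            end = t - lag
            start = end - window
            slope = fit_slope(queries[start:end], target[start:end])
        errors.append(abs(slope * queries[t] - target[t]))
    return sum(errors) / len(errors)


queries, target = make_synthetic()
for label, every in [("never", None), ("yearly", 52), ("monthly", 4)]:
    mae = backtest(queries, target, retrain_every=every)
    print(f"retrain {label}: mean absolute error = {mae:.3f}")
```

With a steady drift like this one, more frequent retraining gives lower error, but that is a property of the invented data. With real data, frequent retraining could also chase noise, which is exactly the trade-off a backtest like this would be meant to expose.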

Real-time disease-activity estimation is a valuable idea that has the potential to save lives. Google is the most sophisticated company in the world at this kind of inference, and the fact that even they can apparently stumble suggests that this is really, really tricky. I hope Google improves their technique and continues to attempt it. I'm happy to help them in any way if they could use it, but I doubt they will need me.

===

My graph above may yield a different impression from Google's own graph of their performance, at http://www.google.org/flutrends/about/how.html . That graph looks like this:

Although there is nothing incorrect about this plot, there are a few things that Google could make clearer to reduce the possibility of confusion among readers who have not closely read the scientific papers:

The graph shows a "hindcast" from Google's "2009" Flu Trends predictor, the one launched in September 2009 just after the end of the plot. No data point on this graph was actually displayed to the public as a contemporary estimate. From the launch in November 2008 until the end of this plot, Google displayed predictions from the "2008 algorithm." The data from that algorithm is no longer available from Google's site; I had to trace it from a bitmap image that Google submitted in a scientific paper to PLoS ONE about the algorithm's difficulties in 2009 and subsequent improvements (Assessing Google Flu Trends Performance in the United States during the 2009 Influenza Virus A (H1N1) Pandemic).
The graph is almost entirely of training data; in other words the CDC data that was used to design the 2009 Google algorithm. It could avoid some confusion if Google were clearer about how much of this graph was actually predicted, versus information that the algorithm was given going in. That would look like this:
The plot ends at the launch of the current algorithm and hasn't been extended to the present. In other words, every single data point in my plot above is missing from Google's plot, and none of the data points on Google's plot are on my plot. In my view, the important thing is the prospective performance of the Google Flu Trends system as it actually estimated the flu. Google's own plot implies the exact opposite view of what the relevant figure of merit is. (I'm not saying they really believe this; just what the graph says to me.)
Although the data shown is from the "2009" algorithm, the only reference Google gives is to their earlier scientific paper about the 2008 algorithm, published in "Nature" (http://research.google.com/archive/papers/detecting-influenza-epidemics.pdf). There is no mention of the later paper or the change in algorithm.
In the Nature paper, Google reported that the 2008 algorithm had a 97% mean correlation with the CDC data on a held-out verification set, which is a fantastic result (even higher than the 90% that Google had achieved on their own training set!). In the PLoS ONE paper, published three years later in a less prestigious venue, Google reported that the 2008 algorithm's actual correlation with the first wave of early-2009 "swine flu" was only 29%. I have heard cynics hyperbolize that the purpose of Nature is to publish fantastic-seeming results so that they can be debunked under subsequent scrutiny by less-prestigious journals. It's depressing to see a case where that was somewhat realized.
Although Google wrote in the PLoS ONE paper that "We will continue to perform annual updates of Flu Trends models to account for additional changes in behavior, should they occur," and made a similar statement in the Nature supplement, in practice Google has not updated the algorithm since September 2009. As they write below, they determined that an update wasn't necessary. But they could make this clearer, and discuss how they determine whether or not to update the model, in their papers and on their Web site.
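The contrast between a high correlation on data that resembles the training period and a low correlation on an unusual, out-of-season wave can be illustrated with made-up numbers. The Python sketch below computes a plain Pearson correlation on two synthetic series. It does not use Google's or the CDC's data and is not meant to reproduce the 97% or 29% figures; the series shapes and all names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


# Invented "ordinary seasons": two years of weekly data in which the estimate
# follows the seasonal target closely, apart from a small wiggle.
season_target = [2 + math.sin(2 * math.pi * t / 52) for t in range(104)]
season_estimate = [v + 0.1 * math.cos(2 * math.pi * t / 13)
                   for t, v in enumerate(season_target)]

# Invented "off-season wave": the target rises sharply over 20 weeks,
# but the estimate barely responds to it.
bump = [math.exp(-((t - 10) / 3) ** 2) for t in range(20)]
wave_target = [1 + 2 * b for b in bump]
wave_estimate = [1 + 0.05 * b + 0.1 * math.cos(2 * math.pi * t / 13)
                 for t, b in enumerate(bump)]

print(f"ordinary seasons: r = {pearson(season_target, season_estimate):.2f}")
print(f"off-season wave:  r = {pearson(wave_target, wave_estimate):.2f}")
```

The point of the sketch is only that a strong score on one stretch of data says little about performance once conditions change, which is why prospective performance matters more than a hindcast.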

The fact that Google decided not to update the model for 2012-13, and that the model subsequently performed poorly in 2012-13, suggests that the procedure for deciding when an update is necessary may need to be reworked. On the other hand, it's possible that even if Google had updated the model, the divergence would have been just as bad (or worse). These two possibilities have different implications for how Google Flu Trends can be improved in the future. These are questions I sincerely hope Google examines and answers in a future scientific paper or Web site update.