These estimates have been generally consistent with conventional surveillance data collected by health agencies, both nationally and regionally.
A linear model relates the log-odds of an influenza-like illness (ILI) physician visit to the log-odds of an ILI-related search query: logit(P) = β0 + β1 × logit(Q) + ε, where P is the percentage of ILI physician visits, Q is the ILI-related query fraction computed in previous steps, β0 is the intercept, β1 is the coefficient, and ε is the error term.
This process produces a list of the top queries that give the most accurate predictions of CDC ILI data when used in the linear model.
Using the sum of the top 45 ILI-related queries, the linear model is fitted to the weekly ILI data between 2003 and 2007 to estimate the coefficient.
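The fitting step can be illustrated with a minimal sketch. It assumes the model has the form logit(P) = β0 + β1 × logit(Q) + ε, that P and Q are expressed as fractions between 0 and 1, and that the fit is done by ordinary least squares. The function names (`fit_logit_linear`, `predict_ili`) and the weekly values are hypothetical and chosen for illustration only; this is not Google's implementation.

```python
import math


def logit(p):
    """Log-odds of a proportion p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Inverse of logit: maps log-odds back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_logit_linear(query_fractions, ili_fractions):
    """Least-squares fit of logit(P) = b0 + b1 * logit(Q).

    Returns the intercept b0 and the slope b1.
    """
    xs = [logit(q) for q in query_fractions]
    ys = [logit(p) for p in ili_fractions]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def predict_ili(b0, b1, query_fraction):
    """Estimated ILI visit fraction for a given query fraction."""
    return inv_logit(b0 + b1 * logit(query_fraction))


# Made-up weekly values for illustration only (not real CDC or Google data).
weekly_q = [0.0010, 0.0015, 0.0022, 0.0030, 0.0041]  # summed query fraction
weekly_p = [0.011, 0.016, 0.024, 0.031, 0.043]       # ILI visit fraction

b0, b1 = fit_logit_linear(weekly_q, weekly_p)
estimate = predict_ili(b0, b1, 0.0025)
```

Working in log-odds keeps every back-transformed estimate strictly between 0 and 1, which a linear fit on the raw fractions would not guarantee.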
[10] They conceded that the use of user-generated data could support public health efforts in significant ways, but expressed concern that "user-specific investigations could be compelled, even over Google's objection, by court order or Presidential authority".
An initial motivation for GFT was that being able to identify disease activity early and respond quickly could reduce the impact of seasonal and pandemic influenza.
However, Google's search query data on flu symptoms showed the same spike two weeks before the CDC report was released.
“This seems like a really clever way of using data that is created unintentionally by the users of Google to see patterns in the world that would otherwise be invisible,” said Thomas W. Malone, a professor at the Sloan School of Management at MIT.
[7] In fall 2013, Google began attempting to compensate for increases in searches caused by the prominence of flu in the news, which had been found to skew earlier results.
[20] Similar projects, such as the flu-prediction project[21] by the Institute of Cognitive Science at Universität Osnabrück, carry the basic idea forward by combining social media data (e.g. Twitter) with CDC data and structural models that infer the spatial and temporal spread[22] of the disease.