February 26, 2013
This weekend, I read an interesting article in The New York Times’ Bits section: “Data Without Context Tells a Misleading Story.”
The article covered Google’s release of new data from an algorithm it created to estimate how many Americans had the flu, based on flu-related Google searches and those searchers’ locations. The company also used that algorithm to create Google Flu Trends, a tool that compiled a number of “good indicators” to estimate “flu activity.”
The problem: Nature, a science journal, found inaccuracies in Google’s data, reporting that its estimate was double the number released by the Centers for Disease Control and Prevention. But the two appear to have been talking about different things. Google’s data described the correlation between the number of people who think they have the flu and search for information on it, and the number of people who actually have the flu. Nature thought Google was predicting exactly how many people had the flu.
Now, there’s some backlash against Google for releasing its data and its flu algorithm. The company obviously has tons of great data sets, but Google’s mistake was releasing that data without context. Basically, Google said: Here’s our data, draw your own conclusion.
This is a perfect example of why data needs context, and of why a correlation between two data points doesn’t necessarily imply anything more. Google isn’t alone. One of the most common data errors is assuming that correlation implies causation. Two other examples:
1) Mac users spend more on vacations. Last summer, Orbitz had to handle a bit of a press blitz when The Wall Street Journal reported that the website was showing higher-priced hotels to Mac users than PC users once its data revealed Mac users spent more on vacations overall.
Great correlation. That’s exactly what the data showed, but there was no context. Without context, Orbitz didn’t take into account that the socioeconomic status of a Mac user is probably higher than that of a cheap Netbook user. So it wasn’t really a case of cause and effect. It wasn’t really new information. Rather, it boiled down to the fact that people with higher incomes spend more money. In fact, uncovering that Mac users spend more money is just noise. It’s a secondary effect of a fact that’s already been established.
2) Nightlights make children nearsighted. Data woes aren’t limited to Google and travel suppliers. A University of Pennsylvania study released in 1999 found that leaving nightlights on increased the likelihood a child would be nearsighted. This research was fact-checked and peer-reviewed, then published and widely accepted as true.
The problem? The researchers confused correlation with causation. What really happened? Nearsightedness is a hereditary trait. So if a child is nearsighted, there’s a high probability that their parents are nearsighted. And if the parents are nearsighted, there’s a good chance they turn the nightlight on in order to see their children better. In fact, nightlights have absolutely nothing to do with whether a child is nearsighted.
In both of these examples, the data told a story, but it wasn’t refined or put into context. And data without context is meaningless.
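To make the nightlight example concrete, here is a rough simulation sketch in Python. Every number in it is an assumption invented for illustration, not a figure from the study: parents’ nearsightedness raises both the chance of using a nightlight and the chance of a nearsighted child, while the nightlight itself has no effect at all. The function names are hypothetical as well.

```python
import random

def simulate(n=50000, seed=0):
    """Generate (parent_myopic, nightlight, child_myopic) rows.

    All probabilities are made-up assumptions. The nightlight never
    influences the child's eyesight; only the parents' trait does.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        parent_myopic = rng.random() < 0.3
        # Nearsighted parents are assumed more likely to use a nightlight.
        nightlight = rng.random() < (0.7 if parent_myopic else 0.2)
        # The child's risk depends on the parents only.
        child_myopic = rng.random() < (0.5 if parent_myopic else 0.1)
        rows.append((parent_myopic, nightlight, child_myopic))
    return rows

def myopia_rate(rows, nightlight, parent=None):
    """Share of nearsighted children in a group, optionally filtered by parent."""
    group = [c for p, n, c in rows
             if n == nightlight and (parent is None or p == parent)]
    return sum(group) / len(group)

rows = simulate()

# Without context: compare nightlight vs. no nightlight across everyone.
naive_gap = myopia_rate(rows, True) - myopia_rate(rows, False)

# With context: make the same comparison within each parent group.
gap_myopic_parents = myopia_rate(rows, True, True) - myopia_rate(rows, False, True)
gap_other_parents = myopia_rate(rows, True, False) - myopia_rate(rows, False, False)

print(f"Gap ignoring parents:       {naive_gap:+.3f}")
print(f"Gap, nearsighted parents:   {gap_myopic_parents:+.3f}")
print(f"Gap, other parents:         {gap_other_parents:+.3f}")
```

Ignoring the parents, nightlights look strongly linked to nearsightedness. Once you compare children whose parents have the same eyesight, the gap should shrink to roughly zero. The same pattern would apply to the Orbitz example if income took the place of the parents’ eyesight.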
All of this reminds me of Ann Winblad, a venture capitalist, who is famously quoted as saying: “Data is the new oil.” Catchy, right? Data is very powerful and there’s a lot companies can learn from their raw analytics. But when you think about the data-as-oil comparison, it’s actually a faulty analogy.
Can you use oil? Not unless you’re an oil company. Oil has to be refined. A meaningful percentage of newly drilled oil is sludge and impurities, and you have to remove those to have anything useful. Even when you refine oil, you have to have a goal in order to do it the right way to get the right end product: Are you making kerosene? Are you making gasoline? Are you refining the oil into plastic so you can make polyester for a cheap suit?
The same is true for data. You have to refine data with a goal in mind. Otherwise, you will just have a bunch of interesting tidbits that won’t change your business. If you’re not asking a “Wow!” question about your data, you’ll never get a “Wow!” answer.
Focus on context. Begin with the end in mind before you jump into your data. Start with a premise or hypothesis. Ask yourself: Should we care about the answer to this question? If we get this answer, can we do something with it? Should we do something with it?
Once you have a hypothesis, and once you have some data to confirm or refute that hypothesis, you can start testing your theories. It’s the only way to prove that the effect you’re observing is real and worth pursuing. It’s the only way to make sure your data will have context.
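One simple way to test whether an observed effect is more than noise is a permutation test. The sketch below is only an illustration under stated assumptions: the two groups and their numbers are invented, and the scenario (visitors randomly shown one of two page versions, with spend recorded for each) is hypothetical. The test shuffles the group labels many times and asks how often chance alone produces a gap as large as the one observed.

```python
import random

def permutation_p_value(a, b, trials=10000, seed=0):
    """Estimate how often a random relabeling of the pooled values yields
    a difference in means at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        x, y = pooled[:len(a)], pooled[len(a):]
        if abs(sum(x) / len(x) - sum(y) / len(y)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)

# Made-up spend figures for two randomly assigned groups (illustration only).
group_a = [120, 135, 150, 160, 145, 170, 155, 140]
group_b = [100, 110, 125, 105, 130, 115, 120, 95]

p_value = permutation_p_value(group_a, group_b)
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests the gap is unlikely to be a fluke, but it says nothing about why the gap exists. That part still depends on context, such as whether the groups were assigned at random or differ in some other way, like income.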
Lack of context is exactly what hurt Google. Google wasn’t wrong, but it wasn’t trying hard enough to be right. Google shared information without context. Its data was unrefined oil. And you put unrefined oil into the gas tank at your own peril.