In 2008, Google and its team of researchers believed they had found the perfect demonstration of big data’s potential. Despite having no expertise in epidemic research, they declared that they could predict flu trends two weeks earlier, and more accurately, than the US Centers for Disease Control and Prevention. They then launched Google Flu Trends (GFT), a web service that predicted flu outbreaks in real time by analyzing flu-related Google Search keywords gathered from users all over the US. The theory was that people in flu-affected areas search for flu-related content online, so a spike in those searches would signal an outbreak.
Ultimately, GFT failed, and it failed badly. Individual predictions were off by as much as 50%, and between 2011 and 2013 it overestimated flu prevalence in 100 out of 108 weeks. Google stopped publishing GFT estimates in 2015.
Failures like GFT are common. Gartner analysts have estimated that as many as 85% of big data projects fail. The main reason seems to be “big data hubris”: the assumption that big data can substitute for (rather than supplement) traditional data collection and analysis.
Simply put, people are becoming overconfident about big data. We’re starting to believe that it is the answer to everything, that it can replace traditional analysis and human intervention.
There is a misconception that, with enough data, we can train machines to solve problems automatically. However, while it’s often categorized under artificial intelligence, data science is largely a human-centric process. In fact, today’s machine learning technology has three significant limitations.
1. Machines cannot define goals.
King Midas, a famous figure in Greek mythology, wished for everything he touched to turn into gold. The wish backfired: his food turned to gold the moment he touched it, so he could no longer eat.
Computers are as literal as Midas’s wish: they do exactly what humans tell them to do, and they cannot work out on their own what should not be turned into gold. If a self-driving car is instructed only to reach a location as fast as possible, it will drive across lawns and through houses, damaging anything in its way that does not slow it down.
Setting proper goals requires a holistic understanding of the world and ethics, which only humans have. If goals are not properly defined by a human, no amount of data will result in anything useful.
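The self-driving car scenario can be made concrete with a toy sketch. Everything in it is hypothetical: the `Route` class, the two routes, and the numbers are invented purely for illustration, and real route planners are far more involved. The point is only that the machine optimizes whatever objective it is handed, and nothing more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """A hypothetical route; all values below are made up for illustration."""
    name: str
    minutes: float
    property_damaged: int  # lawns, fences, houses hit along the way


ROUTES = [
    Route("main road", minutes=12.0, property_damaged=0),
    Route("through the lawns", minutes=7.0, property_damaged=5),
]


def fastest(route: Route) -> float:
    """Naive goal: travel time is the only thing that counts."""
    return route.minutes


def fastest_without_damage(route: Route) -> float:
    """Human-refined goal: any property damage makes a route unacceptable."""
    return float("inf") if route.property_damaged > 0 else route.minutes


# The optimizer is identical in both cases; only the goal differs.
naive_choice = min(ROUTES, key=fastest)
refined_choice = min(ROUTES, key=fastest_without_damage)

print("Naive goal picks:  ", naive_choice.name)
print("Refined goal picks:", refined_choice.name)
```

The same `min` call yields a destructive answer or a sensible one depending entirely on the goal a human wrote down. The code has no way of knowing that lawns matter unless someone says so.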
2. Machines do not know what data should be collected.
In a 2016 TED Talk, Tricia Wang recounted her experience working for Nokia back in 2009. She had collected a sample of qualitative insights from consumers which suggested that the threat to Nokia’s business from Apple’s new iPhone was real. Nokia, however, quickly dismissed her worries, believing in the company’s big data, which suggested smartphones were nothing but a fad.
We all know how that turned out.
Nokia’s executives limited their scope to data from their own customers. No matter how much of this data they had, it never would have pointed them in the right direction. This is a textbook example of garbage in, garbage out (GIGO).
Human beings are irreplaceable when it comes to identifying what kind of data should be collected. The important data, which Wang calls “thick data,” is often not readily available. Identifying it requires curiosity, creativity, empathy, and a well-defined goal, and collecting it requires a humanistic process.
3. Machines cannot interpret data.
When GFT was still up and running, it was found that the search term “high school basketball” had a strong correlation with the flu. Were high school basketball players more susceptible to getting sick? As it turned out, the season for high school basketball in the US coincides with the winter flu season. Computers identified the relationship, but only humans were able to make sense of it. Researchers at Google had to manually identify and remove terms like “high school basketball” that would have contributed to incorrect predictions.
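The basketball effect is easy to reproduce with synthetic data. The sketch below is not GFT’s method and uses no real search or flu figures: the series names, the `winter_peak` curve, and all numbers are assumptions made for illustration. Two series that depend only on the time of year, plus independent noise, correlate strongly even though neither has anything to do with the other.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def winter_peak(week):
    """Made-up seasonal curve that is highest around the new year."""
    return 1.0 + math.cos(2 * math.pi * week / 52)


rng = random.Random(0)    # fixed seed so the run is repeatable
weeks = list(range(156))  # three synthetic years

# Each series is the shared winter season plus its own independent noise.
flu_cases = [100 * winter_peak(w) + rng.gauss(0, 10) for w in weeks]
basketball_searches = [40 * winter_peak(w) + rng.gauss(0, 5) for w in weeks]

r_raw = pearson(flu_cases, basketball_searches)

# Subtract the known seasonal component and correlate what is left over.
flu_resid = [f - 100 * winter_peak(w) for f, w in zip(flu_cases, weeks)]
bb_resid = [b - 40 * winter_peak(w) for b, w in zip(basketball_searches, weeks)]
r_resid = pearson(flu_resid, bb_resid)

print(f"Correlation of raw series:          {r_raw:.2f}")
print(f"Correlation after removing season:  {r_resid:.2f}")
```

Once the shared season is removed, the apparent relationship largely disappears. The code can only remove it, however, because a human already decided that winter was the real explanation. In practice nobody hands the model the confounder, which is why Google’s researchers had to weed out such terms by hand.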
Data alone is not sufficient for making business decisions. The role of data is to support the context and courses of action that humans conceive, giving them a more complete picture. To quote Canon CEO Fujio Mitarai, “Telling a story without numbers and telling numbers without a story are both meaningless.”
Thirty years ago, before the internet era, data collection was expensive. Companies went to great lengths to define goals, decide what kinds of stories they wanted to tell, and identify the data needed before actually going out to collect it. Even with all of today’s technological advancements, the human role in data analysis remains unchanged. If anything, it is more important than ever.