Data did not suddenly become big. While it is true that a few new sources of data have emerged in recent years and that we generate and collect data in increasing quantities, changes have been incremental—a matter of degree—not a qualitative departure from the past. Essentially, “big data” is a marketing campaign.
Like many terms that have been coined to promote new interest in data-based decision support (business intelligence, business analytics, business performance monitoring, etc.), big data is more hype than substance, and it thrives on remaining ill-defined. If you perform a quick Web search on the term, all of the top links other than the Wikipedia entry are to business intelligence (BI) vendors. Interest in big data today is a direct result of vendor marketing; it didn’t emerge naturally from the needs of users. Some of the claims about big data are little more than self-serving fantasies that are meant to inspire big revenues for companies that play in this space. Here’s an example from McKinsey Global Institute (MGI):
MGI studied big data in five domains—healthcare in the United States, the public sector in Europe, retail in the United States, and manufacturing and personal-location data globally. Big data can generate value in each. For example, a retailer using big data to the full could increase its operating margin by more than 60 percent. Harnessing big data in the public sector has enormous potential, too. If US healthcare were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year. Two-thirds of that would be in the form of reducing US healthcare expenditure by about 8 percent. In the developed economies of Europe, government administrators could save more than €100 billion ($149 billion) in operational efficiency improvements alone by using big data, not including using big data to reduce fraud and errors and boost the collection of tax revenues. And users of services enabled by personal-location data could capture $600 billion in consumer surplus.
If you’re willing to put your trust in claims such as a 60% increase in operating margin, a $300 billion annual increase in value, an 8% reduction in expenditures, and a $600 billion consumer surplus, don’t embarrass yourself by trying to quantify these benefits after spending millions of dollars on big data technologies. Using data more effectively can indeed lead to great benefits, including those that are measured in monetary terms, but these benefits can’t be predicted in the manner, to the degree, or with the precision that McKinsey suggests.
When I ask representatives of BI vendors what they mean by big data, two characteristics dominate their definitions:
- New data sources: These consist primarily of unstructured data sources, such as text-based information related to social media, and new sources of transactional data, such as from sensors.
- Increased data volume: Data, data everywhere, in massive quantities.
Collecting data from new sources rarely introduces data of a new nature; it just adds more of the same. For example, even if new types of sensors measure something that we’ve never measured before, a measurement is a measurement—it isn’t a new type of data that requires special handling. What about all of those new sources of unstructured data, such as that generated by social media (Twitter and its cohorts)? Don’t these unstructured sources require new means of data sensemaking? They may require new means of data collection, but rarely new means of data exploration and analysis.
Do new sources of data require new means of visualization? If so, it isn’t obvious. Consider unstructured social networking data. This information must be structured before it can be visualized, and once it’s structured, we can visualize it in familiar ways. Want to know what people are talking about on Twitter? To answer this question, you search for particular words and phrases that you’ve tied to particular topics and you count their occurrences. Once it’s structured in this way, you can visualize it simply, such as by using a bar graph with a bar for each topic sized by the number of occurrences in ranked order from high to low.
If you want to know who’s talking to whom in an email system or what’s linked to what on your Web site, you glean those interactions from your email or Web server and count them. Because these interactions are structured as a network of connections (i.e., not a linear or hierarchical arrangement), you can visualize them as a network diagram: an arrangement of nodes and links. Nodes can be sized to indicate popular people or content and links (i.e., lines that connect the nodes) can vary in thickness to show the volume of interactions between particular pairs of nodes. Never used nodes and links to visualize, explore, and make sense of a network of relationships? This might be new to you, but it’s been around for many years and information visualization researchers have studied the hell out of it.
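To make the structuring step concrete, here is a minimal sketch in Python. The tweet texts and the mapping of keywords to topics are invented for illustration; nothing here comes from a real feed or API. The point is simply that a few lines of counting turn “unstructured” text into the ranked topic totals a bar graph needs.

```python
from collections import Counter

# Hypothetical sample of tweet texts (illustrative only).
tweets = [
    "Loving the new dashboard features",
    "This dashboard is slow today",
    "Great visualization of sales data",
    "More sales data please",
    "Sales dashboard needs work",
]

# Topics tied to particular words and phrases (also hypothetical).
topics = {
    "dashboard": ["dashboard"],
    "sales": ["sales"],
    "visualization": ["visualization", "chart", "graph"],
}

# Count how many tweets mention each topic's keywords.
counts = Counter()
for text in tweets:
    lower = text.lower()
    for topic, keywords in topics.items():
        if any(kw in lower for kw in keywords):
            counts[topic] += 1

# Ranked from high to low -- exactly the structure a bar graph displays.
for topic, n in counts.most_common():
    print(topic, n)
```

The same counting idea extends to the network example: tally (sender, recipient) pairs from an email log with a `Counter`, and the resulting counts become node sizes and link thicknesses in a node-link diagram. Nothing about the data’s origin demands a new form of visualization.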
What about exponentially increasing data volumes? Do they have an effect on data visualization? Not significantly. In my 30 years of experience using technology to squeeze meaning and usefulness from data, data volumes have always been big. When wasn’t there more data than we could handle? Although it is true that the volume of data continues to grow at an increasing rate, did it cross some threshold in the last few years that has made it qualitatively different from before? I don’t think so. The ability of technology to adequately store and access data has always remained just a little behind what we’d like to have in capacity and performance. A little more and a little faster have always been on our wish list. While information technology has struggled to catch up, mostly by pumping itself up with steroids, it has lost sight of the objective: to better understand the world—at least one’s little part of it (e.g., one’s business)—so we can make it better. Our current fascination with big data has us looking for better steroids to increase our brawn rather than better skills to develop our brains. In the world of analytics, brawn will only get us so far; it is better thinking that will open the door to greater insight.
Big data is built on the unquestioned premise that more is better. More of the right data can be useful, but more for the sake of more does nothing but complicate our lives. In the words of the 21st Century Information Fluency Project, we live in a time of “infowhelm.” Just because we can generate and collect more and more data doesn’t mean that we should. We certainly shouldn’t until we figure out how to make sense and use of the data we already have. This seems obvious, but almost no attention is being given to building the skills and technologies that help us use data more effectively. As Richards J. Heuer, Jr. argued in The Psychology of Intelligence Analysis (1999), the primary failures of analysis are less due to insufficient data than to flawed thinking. To succeed analytically, we must invest a great deal more of our resources in training people to think effectively and we must equip them with tools that augment cognition. Heuer spent 45 years supporting the work of the CIA. Identifying a potential terrorist plot requires that analysts sift through a lot of data (yes, big data), but more importantly, it relies on their ability to connect the dots. Contrary to Heuer’s emphasis on thinking skills, big data is merely about more, more, more, which will bury most of the organizations that embrace it deeper in shit.
Is there anything new about data today, big or otherwise, that should be leading us to visualize data differently? I was asked to think about this recently when advising a software vendor that’s trying to develop powerful visualization solutions specifically for managing big data. After racking my brain, I came up with little. Almost everything that we should be doing to support the visual exploration, analysis, and presentation of data today involves better implementations of visualizations, statistical calculations, and data interactions that we’ve known about for years. Even though these features are old news, they still aren’t readily available in most commercial software today; certainly not in ways that work well. Rather than “going where no one has gone before,” vendors need to do the less glorious work of supporting the basics well, and data analysts need to further develop their data sensemaking skills. This effort may not lend itself to an awe-inspiring marketing campaign, but it will produce satisfied customers, and revenues will follow.
I’m sure that new sources of data and increasing volumes might require a few new approaches to data visualization, though I suspect that most are minor tweaks rather than significant departures from current approaches. If you can think of any big data problems that visualization should address in new ways, please share them with us. Let’s see if we can identify a few efforts that vendors should support to truly make data more useful.