Most of us first heard the term “Big Data” in the context of a marketing campaign by information technology vendors to promote their products and services. It is this marketing campaign that popularized the term and eventually made it the household name it is today. Despite its popularity, it remains a term in search of a definitive meaning. There are as many definitions of Big Data as there are individuals and organizations that stand to benefit from the belief that it exists. My objective in this brief blog article is to ask, “Does Big Data signify anything that is actually happening, and if so, what is it?”
Long before the term came into common usage around 2010, it popped up here and there in the late 1990s. It first appeared in the context of data visualization in 1997, at the IEEE 8th Conference on Visualization, in a paper by Michael Cox and David Ellsworth titled “Application-controlled demand paging for out-of-core visualization.” The paper begins as follows:
Visualization provides an interesting challenge for computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources.
Two years later, at the 1999 IEEE Conference on Visualization, a panel was convened titled “Automation or interaction: what’s best for big data?”
In February 2001, Doug Laney, at the time an analyst with the Meta Group and now with Gartner, published a research note titled “3D Data Management: Controlling Data Volume, Velocity, and Variety.” The term Big Data did not appear in the note, but a decade later the “3Vs” of volume, velocity, and variety had become the attributes most commonly used to define Big Data.
The first time that I personally ran across the term was in a 2005 email from the software company Insightful, the maker of S+ (a commercial derivative of the statistical language S, the same language on which R is based), in the title of a course: “Working with Big Data.”
By 2008 the term had gained enough currency in scientific circles to warrant a special issue of Nature. Even so, it didn’t come into broader use until February 2010, when Kenneth Cukier wrote a special report for The Economist titled “Data, Data Everywhere,” in which he said:
…the world contains an unimaginably vast amount of digital information which is getting ever vaster ever more rapidly… The effect is being felt everywhere, from business to science, from governments to the arts. Scientists and computer engineers have coined a new term for the phenomenon: “big data.”
It was around this time that the term was snatched from the world of academia to become the most successful information technology marketing campaign of the current decade. (I found most of these historical references to the term Big Data in Gil Press’s June 6, 2012 Forbes blog post, “A Very Short History of Big Data.”)
Because Big Data has no commonly accepted definition, discussions about it are rarely meaningful or useful. Not once have I encountered a definition of Big Data that identifies anything actually new about data or its use. The exponential increases in data volume, velocity, and variety that Doug Laney’s 3Vs describe have been occurring since the advent of the computer decades ago. You might think that technological milestones such as the personal computer, the Internet, or social networking created exponential increases in data, but they merely sustained exponential increases that were already under way. Had it not been for these advances, the growth of data would have ceased to be exponential. More recently, definitions have emphasized the notion that Big Data is data that cannot be processed by conventional technologies. But what distinguishes conventional from unconventional technologies? My most recent encounter with this line of reasoning was the claim that Big Data is whatever cannot be processed by a desktop computer. By this rather silly definition, Big Data has always existed, because personal computers have never been capable of processing many of the datasets that organizations collect.
So, if Big Data hasn’t been defined in an agreed-upon manner, and if none of the existing definitions identify anything about data or its use that is actually new, does the term really describe anything? I’ve thought about this a great deal, and I’ve concluded that it describes only one thing that has actually occurred in recent years:
Big Data is a rapid increase in public awareness that data is a valuable resource for discovering useful and sometimes potentially harmful knowledge.
Even if Big Data is this and nothing more, you might think I’d be grateful for it. I make my living helping people understand and communicate information derived from data, so Big Data has produced a greater appreciation for my work. Here’s the rub: Big Data, a term with no clear definition that serves as a marketing campaign for technology vendors, encourages people to put their faith in technologies without first developing the skills needed to use them. As a result, organizations waste their money and time chasing the latest so-called Big Data technologies (some useful, some not) to no effect, because technologies can only augment the analytical abilities of humans; they cannot make up for the skills we lack, nor can they replace those skills entirely. Data is indeed a valuable resource, but only if we develop the skills to make sense of it and to find, within the vast and exponentially growing noise, those relatively few signals that actually matter. Big Data doesn’t do this; people do, people who have taken the time to learn.