If Big Data Is Anything at All, This Is It
The first time that all but a few of us heard the term “Big Data,” we heard it in the context of a marketing campaign by information technology vendors to promote their products and services. That marketing campaign made the term popular, eventually turning it into the household name that it is today. Despite its popularity, it remains a term in search of a definitive meaning. There are as many definitions of Big Data as there are individuals and organizations that would like to benefit from the belief that it exists. My objective in this brief blog article is to ask, “Does Big Data signify anything that is actually happening, and if so, what is it?”
Long before the term came into common usage around the year 2010, it began to pop up here and there in the late 1990s. It first appeared in the context of data visualization in 1997 at the IEEE 8th Conference on Visualization in a paper by Michael Cox and David Ellsworth titled “Application-controlled demand paging for out-of-core visualization.” The article begins as follows:
Visualization provides an interesting challenge for computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources.
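The problem Cox and Ellsworth describe, data too large to fit in main memory, is typically handled today by out-of-core (streaming) processing rather than by acquiring more resources. As a rough illustration only (their paper concerns demand paging for visualization, not summary statistics), here is a minimal Python sketch of computing a mean over a file too large to load at once; the one-number-per-line file format is a hypothetical example:

```python
def chunked_mean(path, hint_bytes=1_000_000):
    """Compute the mean of one-number-per-line data without loading
    the whole file into memory: stream it in bounded-size chunks."""
    total = 0.0
    count = 0
    with open(path) as f:
        while True:
            # readlines(hint) returns complete lines totaling roughly
            # hint_bytes, so memory use stays bounded regardless of file size.
            chunk = f.readlines(hint_bytes)
            if not chunk:
                break
            total += sum(float(line) for line in chunk)
            count += len(chunk)
    return total / count if count else float("nan")
```

The point is only that the accumulator (`total`, `count`) stays tiny while the data streams past, which is the essence of the “out-of-core” idea the paper names.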
Two years later, at the 1999 IEEE Conference on Visualization, a panel titled “Automation or interaction: what’s best for big data?” was convened.
In February of 2001, Doug Laney, at the time an analyst with the Meta Group, now with Gartner, published a research note titled “3D Data Management: Controlling Data Volume, Velocity, and Variety.” The term Big Data did not appear in the note, but a decade later, the “3Vs” of volume, velocity, and variety became the most common attributes that are used to define Big Data.
The first time that I personally ran across the term was in a 2005 email from the software company Insightful, maker of S+, a commercial implementation of the statistical programming language S (the same language that R implements), in the title of a course: “Working with Big Data.”
By 2008 the term was being used widely enough in scientific circles to warrant a special issue of Nature. It didn’t begin to reach a broader audience until February 2010, when Kenneth Cukier wrote a special report for The Economist titled “Data, Data Everywhere,” in which he said:
…the world contains an unimaginably vast amount of digital information which is getting ever vaster ever more rapidly… The effect is being felt everywhere, from business to science, from governments to the arts. Scientists and computer engineers have coined a new term for the phenomenon: “big data.”
It was around this time that the term was snatched from the world of academia to become the most successful information technology marketing campaign of the current decade. (I found most of the historical references to the term Big Data in the Forbes June 6, 2012 blog post by Gil Press titled “A Very Short History of Big Data.”)
Because Big Data has no commonly accepted definition, discussions about it are rarely meaningful or useful. Not once have I encountered a definition of Big Data that identifies anything that is actually new about data or its use. Doug Laney’s 3Vs describe exponential increases in data volume, velocity, and variety, but such increases have been happening since the advent of the computer many years ago. You might think that technological milestones such as the personal computer, the Internet, or social networking created exponential increases in data, but they merely sustained exponential increases that were already underway; had it not been for these advances, the growth of data would have ceased to be exponential. More recent definitions have emphasized the notion that Big Data is data that cannot be processed by conventional technologies. But what separates conventional from unconventional technologies? My most recent encounter with this was the claim that Big Data is data that cannot be processed by a desktop computer. By this rather silly definition, Big Data has always existed, because personal computers have never been capable of processing many of the datasets that organizations collect.
So, if Big Data hasn’t been defined in an agreed-upon manner and if none of the existing definitions identify anything about data or its use that is actually new, does the term really describe anything? I’ve thought about this a great deal and I’ve concluded that it describes one thing only that has actually occurred in recent years:
Big Data is a rapid increase in public awareness that data is a valuable resource for discovering useful and sometimes potentially harmful knowledge.
Even if Big Data is this and nothing more, you might think that I’d be grateful for it. I make my living helping people understand and communicate information derived from data, so Big Data has produced a greater appreciation for my work. Here’s the rub: Big Data, as a term with no clear definition, which serves as a marketing campaign for technology vendors, encourages people to put their faith in technologies without first developing the skills that are needed to use those technologies. As a result, organizations waste their money and time chasing the latest so-called Big Data technologies—some useful, some not—to no effect because technologies can only augment the analytical abilities of humans; they cannot make up for our lack of skills or entirely replace our skills. Data is indeed a valuable resource, but only if we develop the skills to make sense of it and find within the vast and exponentially growing noise those relatively few signals that actually matter. Big Data doesn’t do this, people do—people who have taken the time to learn.
Take care,
6 Comments on “If Big Data Is Anything at All, This Is It”
Hi Steve,
I have watched with interest your (and others) blogs over the years regarding “Big Data”.
It reminds me very much of the arguments and discussions that went around when people tried to define the differences between data, information, knowledge, and wisdom.
The main difference I see here is that the historical arguments relating to data, information, knowledge, and wisdom seemed to be grounded in a solid academic and professional desire to have a base on which to develop workable definitions for the fields in which they operate. In the case of Big Data, that seems to be missing: the formation of viable definitions.
It seems big data is the dragon in the woods. Invisible to all, but ever present to scare small children!
In my opinion, nothing has changed. We still have data, information, etc., and as you say, its volume has grown, but hasn’t it always? I am reminded here of the work of Peter Checkland and Sue Holwell in “Information, Systems and Information Systems: Making Sense of the Field,” especially pages 86 onwards.
I wonder… are we covering old ground? Do we really need to worry? Can’t we just see the “dragon” for what it is and, instead of listening to the scaremongers (marketeers), listen to the storytellers (academics/professionals) whose goal is truth, not myth?
Philip.
Steve,
I think one important point that is missing from your summary is this: it’s not just that more people are becoming aware that data is a valuable resource; more small companies are beginning to store and analyze more (bigger) data.
While I am sure it is true that large corporations have had data at (relatively speaking) large scale to crunch for decades, companies like the one I work for are starting to cross a threshold: we are looking for new ways to scale applications to process, aggregate, store, and retrieve it all, and to do so in ways that are dynamic and flexible.
In other words, it may not be new to everyone, but it’s new to a wider market of smaller organizations.
Does that help with a definition of Big Data? Not at all :)
But I think it helps explain what some of the fuss is all about, in terms of the data community in general.
I think you are right and the propaganda around big data is wrong.
But there are new developments that matter just not what the propaganda claims.
The real gains have come from new tools designed to handle big data more easily. Cloud vendors like Amazon (with Redshift) and Google (with BigQuery) have developed web-based tools that let people handle huge datasets quickly and with very small upfront investments in infrastructure. People with the right skills can interact with big datasets fast enough to pose far more questions in a given amount of time, which makes a huge difference to how productive skilled analysts and data scientists can be, even with old but large datasets. This is the real revolution, not the 3Vs of the propaganda.
jbriggs — It is no doubt true that this growing awareness is leading some organizations to take greater advantage of data, but it’s also true that it is leading many organizations to waste their time investing in technologies–some that are useful and some that are not–without first developing data sensemaking skills. I haven’t seen any clear signs that all the talk about Big Data has produced positive change overall.
steve black — Technological advances like the products that you mentioned have occurred all along. Some are helpful, some are not, but none of these are truly revolutionary.
“On two occasions I have been asked, ‘Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?’ … I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.”
—Charles Babbage, Passages from the Life of a Philosopher
Some people just want to believe in technology.
I thought this was a good video, and I think Andrew has nailed it. He is saying much the same as you: there are big data problems, but no such thing as “Big Data.”
http://www.youtube.com/watch?v=Ow6047-3LXg