One of my favorite Italian words is “basta,” followed by an exclamation point. No, basta does not mean “bastard”; it means “enough,” as in “I’ve had ENOUGH of you!” We’ve definitely had enough of Big Data. As a term, Big Data has been an utter failure. It has never managed to mean anything in particular. A term that means nothing in particular means nothing at all. The term can legitimately claim two outcomes that some consider useful:
- It has sold a great many products and services. Those who have collected the revenues love the term.
- It has awakened some people to the power of data to inform decisions. The usefulness of this outcome, however, is tainted by the deceit that some data today is substantially different from data of the past. As a result, Big Data encourages organizations to waste time and money seeking an illusion.
If you’ve thought much about Big Data, you’ve noticed the confusion that plagues the term. What is Big Data? This question lacks a clear answer for the following reasons:
- There are almost as many definitions of Big Data as there are people with opinions.
- None of the many definitions that have been proposed describe anything about data and its use that is substantially different from the past.
- Most of the definitions are so vague or ambiguous that they cannot be used to determine, one way or the other, if a particular set of data or use of data qualifies as Big Data.
The term remains what it was when it first became popular: a marketing campaign and, as such, a source of fabricated need and endless confusion. Nevertheless, like spam, it refuses to go away. Why does this matter? Because chasing illusions is a waste of time and money that also carries a high cost of lost opportunity. It makes no sense to chase Big Data, whatever you think it is, if you continue to derive little or no value from the data that you already have.
Ill-defined terms that capture minds and hearts, as Big Data has, often exert influence in irresponsible and harmful ways. Big Data has been the basis for several audacious claims, such as, “Now that we have Big Data…”
- “…we no longer need to be concerned with data quality”
- “…we no longer need to understand the nature of causality”
- “…science has become a thing of the past”
- “…we can’t survive without it”
People who make such claims either don’t understand data and its use or they are trying to sell you something. Even more disturbing in some respects are the ways in which the seemingly innocuous term Big Data has been used to justify unethical practices, such as gleaning information from our private emails to support targeted ads, a practice that Google is only now abandoning.
Data has always been big and getting bigger. Data has always been potentially valuable for informing better evidence-based decisions. On the other hand, data has always been useless unless it can inform us about something that matters. Even potentially informative data remains useless until we manage to make sense of it. How we make sense of data involves skills and methods that have, with few exceptions, been around for a long time. Skilled data sensemakers have always made good use of data. Those who don’t understand data and its use mask their ignorance and ineffectiveness by introducing new terms every few years as a bit of clever sleight of hand.
The definitions of Big Data that I’ve encountered fall into a few categories. Big Data is…
- …data sets that are extremely large (i.e., an exclusive emphasis on volume)
- …data from various sources and of various types, some of which are relatively new (i.e., an exclusive emphasis on variety)
- …data that is both large in volume and derived from various sources (and is sometimes also produced and acquired at fast speeds, to complete the three Vs of volume, velocity, and variety)
- …data that is especially complex
- …data that provides insights and informs decisions
- …data that is processed using advanced analytical methods
- …any data at all that is associated with a current fad
Let’s consider the problems that are associated with the definitions in each of these categories.
Data Sets That Are Extremely Large
According to the statistical software company SAS:
Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis.
SAS.com
This definition fails in several respects, not the least of which is its limitation to business data. The fundamental problem with definitions such as this, which focus primarily on the size of data as the defining factor, is their failure to specify how large data must be to qualify as Big Data rather than just plain data. Large data sets have always existed. What threshold must be crossed to transition from data to Big Data? This definition doesn’t say.
Here’s a definition that attempts to identify the threshold:
Big Data is a phrase used to mean a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques.
Vangie Beal, Webopedia.com
Do you recognize the problem of defining the threshold in this manner? What are “traditional database and software techniques”? The following definition is slightly less vague:
Big data means data that cannot fit easily into a standard relational database.
Hal Varian, Chief Economist, Google
(Source Note: All of the quoted definitions that are attributed to an individual rather than to a particular publication appeared in “What is Big Data?,” an article by Jennifer Dutcher of the U.C. Berkeley School of Information, published on September 3, 2014.)
In theory, there is no limit to the amount of data that can be stored in a relational database; in practice, databases of all types have limits. People have suggested technology-based volume thresholds of various kinds, including anything that cannot fit into an Excel spreadsheet. All of these definitions establish arbitrary limits, and some rest on arbitrary measures as well.
Big data is data that even when efficiently compressed still contains 5-10 times more information (measured in entropy or predictive power, per unit of time) than what you are used to right now.
Vincent Granville, Co-Founder, Data Science Central
So, if you are accustomed to 1,000-row Excel tables, a simple SQL Server database consisting of 5,000 to 10,000 rows qualifies as Big Data. Such definitions highlight the uselessness of arbitrary limits on data volume, as the sketch below illustrates.
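What would it even take to apply Granville’s threshold? Here is a minimal Python sketch, using made-up columns of data (nothing in it comes from Granville’s own work), that computes Shannon entropy, one way to operationalize the “entropy” his definition invokes. Whether one data set contains “5-10 times more information” than another depends entirely on what you choose to count.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy, in bits, of the distribution of values in a column."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Two hypothetical columns: which one crosses the "5-10 times" threshold?
familiar = ["yes", "no"] * 500   # 1,000 rows, two distinct values
bigger = list(range(5_000))      # 5,000 rows, every value distinct

print(shannon_entropy(familiar))  # 1.0 bit per value
print(shannon_entropy(bigger))    # roughly 12.3 bits per value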
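Measured by rows, the second column is five times larger; measured by entropy, it holds roughly twelve times more information per value. The answer depends entirely on the measure you pick, which is precisely the problem with thresholds of this kind. Here’s another definition that acknowledges its arbitrary nature: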
Big data is when…the standard, simple methods (maybe it’s SQL, maybe it’s k-means, maybe it’s a single server with a cron job) break down on the size of the data set, causing time, effort, creativity, and money to be spent crafting a solution to the problem that leverages the data without simply sampling or tossing out records.
John Foreman, Chief Data Scientist, MailChimp
Some definitions acknowledge the arbitrariness of the threshold without recognizing it as a definitional failure:
The term big data is really only useful if it describes a quantity of data that’s so large that traditional approaches to data analysis are doomed to failure. That can mean that you’re doing complex analytics on data that’s too large to fit into memory or it can mean that you’re dealing with a data storage system that doesn’t offer the full functionality of a standard relational database. What’s essential is that your old way of doing things doesn’t apply anymore and can’t just be scaled out.
John Myles White
What good is a definition that is based on a subjective threshold in data volume?
The following definition acknowledges that, when based on data volume, what qualifies as Big Data not only varies from organization to organization, but over time as well:
Big data is data that contains enough observations to demand unusual handling because of its sheer size, though what is unusual changes over time and varies from one discipline to another. Scientific computing is accustomed to pushing the envelope, constantly developing techniques to address relentless growth in dataset size, but many other disciplines are now just discovering the value – and hence the challenges – of working with data at the unwieldy end of the scale.
Annette Greiner, Lecturer, UC Berkeley School of Information
Not only do these definitions identify Big Data in a manner that lacks objective boundaries, they also acknowledge (perhaps inadvertently) that so-called Big Data has always been with us, for data has always been on the increase in a manner that leads to processing challenges. In other words, Big Data is just data.
There is a special breed of volume-based definitions that advocate “Collect and store everything.” Here is the most thorough definition of this sort that I’ve encountered:
The rising accessibility of platforms for the storage and analysis of large amounts of data (and the falling price per TB of doing so) has made it possible for a wide variety of organizations to store nearly all data in their purview – every log line, customer interaction, and event – unaggregated and for a significant period of time. The associated ethos of “store everything now and ask questions later” to me more than anything else characterizes how the world of computational systems looks under the lens of modern “big data” systems.
Josh Schwartz, Chief Data Scientist, Chartbeat
These definitions change the nature of the threshold from a measure of volume to an assumption: you should collect everything at the lowest level of granularity, whether useful or not, for you never know when it might be useful. Definitions of this type are a hardware vendor’s dream, but they are an organization’s nightmare, for the cost of unlimited storage extends well beyond the cost of hardware. The time and resources that are required to do this are enormous and rarely justified. Anyone who actually works with data knows that the vast majority of the data that exists in the world is noise and will always be noise. Don’t line the pockets of hardware vendor executives with gold by buying into this ludicrous assumption.
Data from Various Sources and of Various Types
Independent of a data set’s size, some definitions of Big Data emphasize its variety. Here’s one of the clearest:
What’s “big” in big data isn’t necessarily the size of the databases, it’s the big number of data sources we have, as digital sensors and behavior trackers migrate across the world. As we triangulate information in more ways, we will discover hitherto unknown patterns in nature and society – and pattern-making is the wellspring of new art, science, and commerce.
Quentin Hardy, Deputy Tech Editor, The New York Times
Definitions that emphasize variety suffer from the same problems as those that emphasize volume: where is the threshold? How many data sources are needed to qualify data as Big Data? In what sense does the addition of new data sources (something that is hardly new) change the nature of data or its use? It doesn’t. New data sources are sometimes useful and sometimes not. Collecting and storing every possible source of data is no more productive than collecting and storing every instance of data.
Data That Exhibits High Volume, Velocity, and Variety
I’ll use Gartner’s definition to represent this category in honor of the fact that Doug Laney of Gartner was the first to identify the three Vs (volume, velocity, and variety) as game changers.
Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
Gartner
Combining volume and variety, plus adding velocity (the speed at which data is generated and acquired), produces definitions that suffer from all of the problems that I’ve already identified. Increases in volume, velocity, and variety have always been with us. They have not fundamentally changed the nature of data or its use.
Data That Is Especially Complex
Some definitions focus on the complexity of data.
While the use of the term is quite nebulous and is often co-opted for other purposes, I’ve understood “big data” to be about analysis for data that’s really messy or where you don’t know the right questions or queries to make – analysis that can help you find patterns, anomalies, or new structures amidst otherwise chaotic or complex data points.
Philip Ashlock, Chief Architect, Data.gov
You can probably anticipate what I’ll say about definitions of this sort: once again, they lack a clear threshold and identify a quality that has always been true of data. How complex is complex enough, and at what point in history has data not exhibited complexity?
Data That Provides Insights and Informs Decisions
As you no doubt already anticipate, definitions in this category exhibit the same problems as those in the categories that we’ve already considered. Here’s an example:
Big Data has the potential to help companies improve operations and make faster, more intelligent decisions. This data…can help a company to gain useful insight to increase revenues, get or retain customers, and improve operations.
Vangie Beal, Webopedia.com
Data That Is Processed Using Advanced Analytical Methods
According to definitions in this category, it is not the data itself that determines Big Data but rather the methods that are used to make sense of it.
The term “big data” often refers simply to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data.
Wikipedia
Some of these definitions allow quite a bit of leeway regarding the nature of the advanced methods, while others are more specific, such as the following:
Big data is an umbrella term that means…the possibility of doing extraordinary things using modern machine learning techniques on digital data. Whether it is predicting illness, the weather, the spread of infectious diseases, or what you will buy next, it offers a world of possibilities for improving people’s lives.
Shashi Upadhyay, CEO and Founder, Lattice Engines
What analytical methods qualify as Big Data? The answer usually depends on the methods that the person who is quoted uses or sells. Can you guess what kind of software Lattice Engines sells?
Methods that are considered advanced have been around for a long time; even most of the methods that are labeled advanced today when defining Big Data have existed for quite some time. For example, even though computers were not always powerful enough to run machine-learning algorithms on large data sets, these algorithms are fundamentally based on traditional statistical methods.
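To make this concrete, here is a minimal sketch in Python, with made-up data, of logistic regression, a technique routinely marketed today as machine learning. At its heart it is maximum-likelihood estimation, a statistical method that predates modern computing; the computer merely adds the ability to run it quickly on large data sets.

```python
import numpy as np

# Logistic regression fitted by gradient ascent on the log-likelihood.
# Sold today as machine learning, it is maximum-likelihood estimation,
# a traditional statistical method. The data below is made up.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                              # two predictor variables
y = (X @ np.array([1.5, -2.0]) + rng.normal(size=200) > 0).astype(float)

w = np.zeros(2)                       # coefficients to estimate
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ w)))    # predicted probabilities
    w += 0.5 * X.T @ (y - p) / len(y) # step along the log-likelihood gradient

print(w)  # converges to the maximum-likelihood estimates that a statistician
          # could have computed, by other means, decades ago
```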
A few of the definitions in this category have emphasized advanced skills rather than technologies, such as the following:
As computational efficiency continues to increase, “big data” will be less about the actual size of a particular dataset and more about the specific expertise needed to process it. With that in mind, “big data” will ultimately describe any dataset large enough to necessitate high-level programming skill and statistically defensible methodologies in order to transform the data asset into something of value.
Reid Bryant, Data Scientist, Brooks Bell
Once again, however, there is nothing new about these skills.
Any Data at All That Is Associated with a Current Fad
Some definitions of Big Data apply the term to anything data related that is trending. Here’s an example:
I see big data as storytelling – whether it is through information graphics or other visual aids that explain it in a way that allows others to understand across sectors.
Mike Cavaretta, Data Scientist and Manager, Ford Motor Company
This tendency has been directly acknowledged by Ryan Swanstrom in his Data Science 101 blog: “Now big data has become a buzzword to mean anything related to data analytics or visualization.” This is what happens with fuzzy definitions. They can be easily manipulated to mean anything you wish. As such, they are meaningless and useless.
Now What?
The definitional messiness and thus uselessness of the term Big Data is far from unique. Many information technology terms exhibit these dysfunctional traits. I’ve worked in the field that goes by the name “business intelligence” for many years, but this industry has never adhered to or lived up to the definition provided by Howard Dresner, who coined the term: “Concepts and methods to improve business decision making by using fact-based support systems.” Instead, the term has primarily functioned as a name for technologies and processes that are used to collect, store, and produce automated reports of data. Rarely has there been an emphasis on “concepts and methods to improve business decision making,” which features humans rather than technologies. This failure of emphasis has resulted in the failure of most business intelligence efforts, which have produced relatively little intelligence.
All of the popular terms that have emerged during my career to describe the work that I and many others do with data, including decision support, data warehousing, analytics, data science, and of course, Big Data, have been plagued by definitional dysfunction, leading to confusion and bad practices. I prefer the term “data sensemaking” for the concepts, methods, and practices that we engage in to understand data. And to promote the value of data as the raw material from which understanding is woven, the field of healthcare has contributed one of the most useful terms: “evidence-based medicine.” In its generic form, “evidence-based decision making” is simple, straightforward, and clear. If we used these terms to describe the work and its importance, we would stop wasting time chasing illusions and would focus on what’s fundamentally needed: data sensemaking skills, augmented by good technologies, to support evidence-based decision making. Perhaps then, we would make more progress.
Let’s say “goodbye” to the term Big Data. It doesn’t mean anything in particular, and all of the many things that people have used it to mean merely refer to data. Do we really need a new term to promote the importance of evidence-based decision making? The only people who are prepared to glean real value from data don’t need a new term or a marketing campaign. Rallying those who don’t understand data or its use will only lead to good outcomes if we begin by helping them understand. Meaningless terms lead in the opposite direction.
Take care,
