A More Thoughtful but No More Convincing View of Big Data

I have a problem with Big Data. As someone who makes his living working with data and helping others do the same as effectively as possible, I don’t object to data itself; my objection stems instead from the misleading claims that people often make about data when they refer to it as Big Data. I have frequently described Big Data as nothing more than a marketing campaign cooked up by companies that sell information technologies either directly (software and hardware vendors) or indirectly (analyst groups such as Gartner and Forrester). Not everyone who promotes Big Data falls into this specific camp, however. For example, several academics and journalists also write articles and books and give keynotes at conferences about Big Data. Perhaps people who aren’t directly motivated by increased sales revenues talk about Big Data in ways that are more meaningful? To examine this possibility, I recently read the best-selling book on the topic, Big Data: A Revolution That Will Transform How We Live, Work, and Think, by Viktor Mayer-Schönberger and Kenneth Cukier. Mayer-Schönberger is an Oxford professor of Internet governance and regulation, and Cukier is the data editor of the Economist.


Big Data: A Revolution That Will Transform How We Live, Work, and Think
Viktor Mayer-Schönberger and Kenneth Cukier
Houghton Mifflin Harcourt, 2013

I figured that if anyone had something useful to say about Big Data, these were the guys. What I found in their book, however, left me convinced even more than before that Big Data is a ruse, and one that should concern us.

What Is Big Data?

One of the problems with Big Data, like so many examples of techno-hype, is that it is ill-defined. What is Big Data exactly? The authors address this concern early in the book:

There is no rigorous definition of big data. Initially the idea was that the volume of information had grown so large that the quantity being examined no longer fit into the memory that computers use for processing, so engineers needed to revamp the tools they used for analyzing it all. (p. 6)

Given this state of affairs, I was hoping that the authors would propose a definition of their own to reduce some of the confusion. Unfortunately, they never actually define the term, but they do describe it in various ways. Here’s one of the descriptions:

The sciences like astronomy and genomics, which first experienced the explosion in the 2000s, coined the term “big data.” The concept is now migrating to all areas of human endeavor. (p. 6)

Actually, the term “Big Data” was coined back in 1997 in the proceedings of IEEE’s Visualization conference by Michael Cox and David Ellsworth in a paper titled “Application-controlled demand paging for out-of-core visualization.” Scientific data in the early 2000s was not our first encounter with huge datasets. For instance, I was helping the telecommunications and banking industries handle what they experienced as explosions in data quantity back in the early 1980s. What promoters of Big Data fail to realize is that data has been increasing at an exponential rate since the advent of the computer long ago. We have not experienced any actual explosions in the quantity of data in recent years. The exponential rate of increase has continued unabated all along.

How else do the authors define the term?

[Big Data is]…the ability of society to harness information in novel ways to produce useful insights or goods and services of significant value. (p. 2)

On the contrary, we’ve been finding “novel ways to produce” value from data forever, not just in the last few years. We haven’t crossed any threshold recently.

How else do they define it?

At its core, big data is about predictions. Though it is described as part of the branch of computer science called artificial intelligence, and more specifically, an area called machine learning, this characterization is misleading. Big data is not about trying to “teach” a computer to “think” like humans. Instead, it’s about applying math to huge quantities of data in order to infer probabilities. (pp. 11 and 12)

So, Big Data is essentially about “predictive analytics.” Did we only in recent years begin applying math to huge quantities of data in an attempt to infer probabilities? We neither began this activity recently, nor did data suddenly become huge.

Is Big Data a technology?

But where most people have considered big data as a technological matter, focusing on the hardware or the software, we believe the emphasis needs to shift to what happens when the data speaks. (p. 190)

I wholeheartedly agree. Where I and the authors appear to differ, however, is in our understanding of the methods that are used to find and understand the messages in data. The ways that we do this are not new. The skills that we need to make sense of data—skills that go by the names statistics, business intelligence, analytics, and data science—have been around for a long time. Technologies incrementally improve to help us apply these skills with greater ease to increasingly larger datasets, but the skills themselves have changed relatively little, even though we come up with new names for the folks who do this work every few years.

So what is it exactly that separates Big Data from data of the past?

One way to think about the issue today—and the way we do in this book—is this: big data refers to things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value, in ways that change markets, organizations, the relationship between citizens and governments, and more.

But this is just the start. The era of big data challenges the way we live and interact with the world. Most strikingly, society will need to shed some of its obsession for causality in exchange for simpler correlations: not knowing why but only what. This overturns centuries of established practices and challenges our most basic understanding of how to make decisions and comprehend reality. (pp. 6 and 7)

To “make decisions and comprehend reality” we no longer need to understand why things happen together (i.e., causation) but only what things happened together (correlation). When I read this an eerie feeling crawled up my spine. The implications of this line of thinking are scary. Should we really race into the future satisfied with a level of understanding that is relatively superficial?

According to the authors, Big Data consists of “things one can do at a large scale that cannot be done at a smaller one.” And what are these things, and how exactly do they change how “we live and interact with the world”? Let’s see if the authors tell us.

Does Big Data Represent a Change of State?

One claim that the authors make, which is shared by other promoters of Big Data, is that data has grown so large and so fast that the increase in quantity constitutes a qualitative change of state.

Not only is the world awash with more information than ever before, but that information is growing faster. The change of scale has led to a change of state. The quantitative change has led to a qualitative one. (p. 6)

The essential point about big data is that change of scale leads to change of state. (p. 151)

All proponents make this claim about Big Data, but I’ve yet to see anyone substantiate it. Perhaps it is true that things can grow to such a size and at such a rate that they break through some quantitative barrier into the realm of qualitative change, but what evidence do we have that this has happened with data? This book contains many examples of data analytics that have been useful during the past 20 years or so, which the authors classify as Big Data, but I believe this attribution is contrived. Not one of the examples demonstrates a radical departure from the past. Here’s one involving Google:

Big data operates at a scale that transcends our ordinary understanding. For example, the correlation Google identified between a handful of search terms and the flu was the result of testing 450 million mathematical models. (p. 179)

I suspect that Google’s discovery of a correlation between search activity and incidence of the flu in particular areas resulted, not from 450 million distinct mathematical models, but rather from a predictive analytics algorithm making millions of minor adjustments during the process of building a single model. If they really created 450 million different models, or even if they actually made that many tweaks to an evolving model to find this relatively simple correlation, is this really an example of progress? A little statistical thinking by a human being could have found this correlation with the help of a computer much more directly. Regardless of how many models were actually used, the final model was not overly complicated. What was done here does not transcend the ordinary understanding of data analysts.
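To make the contrast concrete, here is a minimal sketch in Python, using weekly counts invented purely for illustration (this is not Google’s data or method), of how directly a single least-squares model can surface this kind of correlation:

    import numpy as np

    # Hypothetical weekly counts: searches for a flu-related term and reported
    # flu cases in the same region (both series are made up).
    searches = np.array([120, 340, 560, 980, 1500, 2100, 1800, 900, 400, 150], dtype=float)
    cases = np.array([30, 85, 140, 260, 410, 560, 470, 230, 110, 40], dtype=float)

    # Pearson correlation between the two series.
    r = np.corrcoef(searches, cases)[0, 1]

    # One least-squares model: cases ~= slope * searches + intercept.
    slope, intercept = np.polyfit(searches, cases, 1)

    print(f"correlation r = {r:.3f}")
    print(f"cases ~= {slope:.3f} * searches + {intercept:.1f}")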

And now, for the paradigm-shattering implications of this change of state:

Big data is poised to reshape the way we live, work, and think. The change we face is in some ways even greater than those sparked by earlier epochal innovations that dramatically expanded the scope and scale of information in society. The ground beneath our feet is shifting. Old certainties are being questioned. Big data requires fresh discussion of the nature of decision-making, destiny, justice. A worldview we thought was made of causes is being challenged by a preponderance of correlations. The possession of knowledge, which once meant any understanding of the past, is coming to mean an ability to predict the future. (p. 190)

Does any of this strike you as particularly new? Everything that the authors claim as particular and new to Big Data is in fact old news. If you’re wondering what they mean by the “worldview we thought was made of causes is being challenged by a preponderance of correlations,” stay tuned; we’ll look into this soon.

Who Is the Star of Big Data?

Who or what in particular is responsible for the capabilities and potential benefits of big data? Are technologies responsible? Are data scientists responsible? Here’s the authors’ answer:

The real revolution is not in the machines that calculate data but in data itself and how we use it. (p. 7)

What we can do with data today was not primarily enabled by machines, but it is also not intrinsic to the data itself. Nothing about the nature of data has changed. Data is always noise until it provides an answer to a question that is asked to solve a problem or take advantage of an opportunity.

In the age of big data, all data will be regarded as valuable, in and of itself. (p. 100)

With big data, the value of data is changing. In the digital age, data shed its role of supporting transactions and often became the good itself that was traded. In a big-data world, things change again. Data’s value shifts from its primary use to its potential future uses. That has profound consequences. It affects how businesses value the data they hold and who they let access it. It enables, and may force, companies to change their business models. It alters how organizations think about data and how they use it. (p. 99)

God help us. This notion should concern us because most data will always remain noise beyond its initial use. We certainly find new uses for data that was originally generated for another purpose, such as transaction data that we later use for analytical purposes to improve decisions, but in the past we rarely collected data primarily for potential secondary uses. Perhaps this is a characteristic that actually qualifies as new. Regardless, we must ask the question, “Is this a viable business model?” Should all organizations begin collecting and retaining more data in hope of finding unforeseen secondary uses for it in the future? I find it hard to imagine that secondary uses of data will provide enough benefit to warrant collecting everything and keeping it forever, as the authors seem to believe. Despite their argument that this is a no-brainer based on decreasing hardware costs, the price is actually quite high. The price is not based on the cost of hardware alone.

Discarding data may have been appropriate when the cost and complexity of collecting, storing, and analyzing it were high, but this is no longer the case. (p. 60)

Every single dataset is likely to have some intrinsic, hidden, not yet unearthed value, and the race is on to discover and capture all of it. (p. 15)

Data’s true value is like an iceberg floating in the ocean. Only a tiny part of it is visible at first sight, while much of it is hidden beneath the surface. (p. 103)

Imagine the time that will be wasted and how much it will cost. Only a tiny fraction of data that is being generated today will ever be valuable beyond its original use. A few nuggets of gold might exist in that iceberg below the water line, but do we really need to collect and save it all? Even the authors exhibit concern for the persistent value of data.

Most data loses some of its utility over time. In such circumstances, continuing to rely on old data doesn’t just fail to add value; it actually destroys the value of fresher data. (p. 110)

Collecting, storing, and retaining everything will make it harder and harder to focus on the little that actually has value. Nevertheless, the authors believe that the prize will go to those with the most.

Scale still matters, but it has shifted [from technical infrastructure]. What counts is scale in data. This means holding large pools of data and being able to capture even more of it with ease. Thus large data holders will flourish as they gather and store more of the raw material of their business, which they can reuse to create additional value. (p. 146)

If this were true, wouldn’t the organizations with the most data today be the most successful? This isn’t the case. In fact, many organizations with the most data are drowning in it. I know, because I’ve tried to help them change this dynamic. Having lots of data is useless unless you know how to make sense of it and how to apply what you learn.

Attempts to measure the potential value of data used for secondary purposes have so far been little more than wild guesses. Consider the way that Facebook was valued prior to going public.

Doug Laney, vice president of research at Gartner, a market research firm, crunched the numbers during the period before the initial public offering (IPO) and reckoned that Facebook had collected 2.1 trillion pieces of “monetizable content” between 2009 and 2011, such as “likes,” posted material, and comments. Compared against its IPO valuation, this means that each item, considered as a discrete data point, had a value of about five cents. Another way of looking at it is that every Facebook user was worth around $100, since users are the source of the information that Facebook collects.

How to explain the vast divergence between Facebook’s worth under accounting standards ($6.3 billion) and what the market initially valued it at ($104 billion)? There is no good way to do so. (p. 119)

Indeed, we are peering at tea leaves, trying to find meaning in drippy green clumps. Unforeseen uses for data certainly exist, but they are by definition difficult to anticipate. Do we really want to collect, store, and retain everything possible on the off chance that it might be useful? Perhaps, instead, we should look for ways to identify data with the greatest potential for future use and focus on collecting that? The primary problem that we still have with data is not the lack of it but our inability to make sense of it.
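For what it’s worth, the per-item and per-user figures in Laney’s estimate are simple division. Here is that arithmetic, assuming (my assumption, not a figure stated in the quote) roughly one billion Facebook users at the time of the IPO:

    # Rough arithmetic behind Laney's estimate. The valuation and item count are
    # the figures quoted above; the user count is an assumed round number.
    ipo_valuation = 104e9      # dollars
    content_items = 2.1e12     # "monetizable" items collected between 2009 and 2011
    users = 1.0e9              # assumed number of users around the IPO

    print(f"value per item: ${ipo_valuation / content_items:.3f}")  # about $0.05
    print(f"value per user: ${ipo_valuation / users:.0f}")          # $104 with a round one-billion-user assumption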

Does Correlation Alone Suffice With No Concern for Causation?

The authors introduce one of their strangest claims in the following sentence:

The ideal of identifying causal mechanisms is a self-congratulatory illusion; big data overturns this. (p. 18)

Are they arguing that Big Data eliminates our quest for an understanding of cause altogether?

Causality won’t be discarded, but it is being knocked off its pedestal as the primary fountain of meaning. Big data turbocharges non-causal analyses, often replacing causal investigations. (p. 68)

Apparently science can now take a back seat to Big Data.

Correlations exist; we can show them mathematically. We can’t easily do the same for causal links. So we would do well to hold off from trying to explain the reason behind the correlations: the why instead of the what. (p. 67)

This notion scares me. Correlations, although useful in and of themselves, must be used with caution until we understand the causal mechanisms related to them.

We progress by seeking and finding ever better explanations for reality—what is, how it works, and why. Explanations—the commitment to finding them and the process of developing and confirming them—are the essence of science. By rebelling against authority as the basis of knowledge, the Enlightenment began the only sustained period of progress that our species has ever known (see The Beginning of Infinity, by David Deutsch). Trust in established authority was replaced by a search for testable explanations, called science. The information technologies of today are a result of this search for explanations. To say that we should begin to rely on correlations alone without concern for causation encourages a return from the age of science to the age of ignorance that preceded it. Prior to science, we lived in a world of myth. Even then, however, we craved explanations, but we lacked the means to uncover them, so we fabricated explanations that provided comfort or that kept those in power in control. To say that explanations are altogether unnecessary today in the world of Big Data is a departure from the past that holds no hope for the future. Making use of correlations without understanding causation might indeed be useful at times, but it isn’t progress, and it is prone to error. Manipulation of reality without understanding is a formula for disaster.

The authors take this notion even further.

Correlations are powerful not only because they offer insights, but also because the insights they offer are relatively clear. These insights often get obscured when we bring causality back into the picture. (p. 66)

Actually, thinking in terms of causality is the only way that correlations can be fully understood and utilized with confidence. Only when we understand the why (cause) can we intelligently leverage our understanding of the what (correlation). This is essential to science. As such, we dare not diminish its value.

Big data does not tell us anything about causality. (p. 163)

Huh? Without data we cannot gain an understanding of cause. Does Big Data lack information about cause that other data contains? No. Data contains this information no matter what its size.

According to the authors, Big Data leverages correlations alone in an enlightening way that wasn’t possible in the past.

Correlations are useful in a small-data world, but in the context of big data they really shine. Through them we can glean insights more easily, faster, and more clearly than before. (p. 52)

Big data is all about seeing and understanding the relations within and among pieces of information that, until very recently, we struggled to fully grasp. (p. 19)

What exactly is it about Big Data that enables us to see and understand relationships among data that were elusive in the past? What evidence is there that this is happening? Nowhere in the book do the authors answer these questions in a satisfying way. We have always taken advantage of known correlations, even when we have not yet discovered what causes them, but this has never deterred us in our quest to understand causation. God help us if it ever does.

Does Big Data Transform Messiness into a Benefit?

Not only do we no longer need to concern ourselves with causation, according to the authors we can also stop worrying about data quality.

In a world of small data, reducing errors and ensuring high quality of data was a natural and essential impulse. Since we only collected a little information, we made sure that the figures we bothered to record were as accurate as possible…Analyzing only a limited number of data points means errors may get amplified, potentially reducing the accuracy of the overall results…However, in many new situations that are cropping up today, allowing for imprecision—for messiness—may be a positive feature, not a shortcoming. It is a tradeoff. In return for relaxing the standards of allowable errors, one can get ahold of much more data. It isn’t just that “more trumps some,” but that, in fact, sometimes “more trumps better.” (pp. 32 and 33)

Not only can we stop worrying about messiness in data, we can embrace it as beneficial.

In dealing with ever more comprehensive datasets, which capture not just a small sliver of the phenomenon at hand but much more or all of it, we no longer need to worry so much about individual data points biasing the overall analysis. Rather than aiming to stamp out every bit of inexactitude at increasingly high cost, we are calculating with messiness in mind…Though it may seem counterintuitive at first, treating data as something imperfect and imprecise lets us make superior forecasts, and thus understand our world better. (pp. 40 and 41)

Hold on. Something is amiss in the authors’ reasoning here. While it is true that a particular amount of error in a set of data becomes less of a problem if that quantity holds steady as the total amount of data increases and becomes huge, a particular rate of error remains just as much of a problem as the data set grows in size. More does not trump better data.
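A toy example, with numbers invented purely for illustration, makes the distinction plain:

    # A fixed *amount* of error becomes negligible as the dataset grows,
    # but a fixed *rate* of error does not.
    fixed_amount = 500  # the same 500 bad records, however large the dataset gets
    error_rate = 0.05   # 5% of the records are bad, however large the dataset gets

    for n in (1_000, 1_000_000, 1_000_000_000):
        share_from_fixed_amount = fixed_amount / n
        share_from_fixed_rate = error_rate          # unaffected by n
        print(f"n = {n:>13,}: fixed amount -> {share_from_fixed_amount:.6%} bad, "
              f"fixed rate -> {share_from_fixed_rate:.0%} bad")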

The way we think about using the totality of information compared with smaller slivers of it, and the way we may come to appreciate slackness instead of exactness, will have profound effects on our interaction with the world. As big-data techniques become a regular part of everyday life, we as a society may begin to strive to understand the world from a far larger, more comprehensive perspective than before, a sort of N = all of the mind. And we may tolerate blurriness and ambiguity in areas where we used to demand clarity and certainty, even if it had been a false clarity and an imperfect certainty. We may accept this provided that in return we get a more complete sense of reality—the equivalent of an impressionist painting, wherein each stroke is messy when examined up close, but by stepping back one can see a majestic picture.

Big data, with its emphasis on comprehensive datasets and messiness, helps us get closer to reality than did our dependence on small data and accuracy. (p. 48)

Will impressionist data provide a more accurate and useful view of the world? I love impressionist paintings, but they’re not what I study to get a clear picture of the world.

Now, if you work in the field of data quality, let me warn you that what’s coming next will shock and dismay you. Perhaps you should sit down and take a valium before reading on.

The industry of business intelligence and analytics software was long built on promising clients “a single version of the truth”…But the idea of “a single version of the truth” is doing an about-face. We are beginning to realize not only that it may be impossible for a single version of the truth to exist, but also that its pursuit is a distraction. To reap the benefits of harnessing data at scale, we have to accept messiness as par for the course, not as something we should try to eliminate. (p. 44)

Seriously? Those who work in the realm of data quality realize that if we give up on the idea of consistency in our data and embrace messiness, the problems that are created by inconsistency will remain. When people in different parts of the organization are getting different answers to the same questions because of data inconsistencies, no amount of data will make this go away.

So what is it that we get in exchange for our willingness to embrace messiness?

In return for living with messiness, we get tremendously valuable services that would be impossible at their scope and scale with traditional methods and tools. According to some estimates only 5 percent of all digital data is ‘structured’—that is, in a form that fits neatly into a traditional database. Without accepting messiness, the remaining 95 percent of unstructured data, such as web pages and videos, remain dark. By allowing for imprecision, we open a window into an untapped universe of insights. (p. 47)

The authors seem to ignore the fact that most data cannot be analyzed until it is structured and quantified. Only then can it produce any of the insights that the authors applaud in this book. It doesn’t need to reside in a so-called structured database, but it must at a minimum be structured in a virtual sense.
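To make “structured in a virtual sense” concrete, here is a trivial sketch using a few invented text snippets: before any analysis can happen, the text has to be reduced to something countable, such as a table of term frequencies.

    from collections import Counter
    import re

    # Invented snippets standing in for "unstructured" web content.
    pages = [
        "flu season arrives early as flu cases climb",
        "new phone released; reviewers praise the camera",
        "flu vaccine supplies running low in several regions",
    ]

    # Impose structure: one row per document, with columns of term counts.
    rows = [Counter(re.findall(r"[a-z]+", page.lower())) for page in pages]

    for i, row in enumerate(rows):
        counts = ", ".join(f"{term}={n}" for term, n in sorted(row.items()))
        print(f"doc {i}: {counts}")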

Does Big Data Reduce the Need for Subject Matter Expertise?

I was surprised when I read these two authors, both subject matter experts in their particular realms, state the following:

We are seeing the waning of subject-matter experts’ influence in many areas. (p. 141)

They argue that subject matter experts will be substantially displaced by Big Data, because data contains a better understanding of the world than the experts.

Yet expertise is like exactitude: appropriate for a small-data world where one never has enough information, or the right information, and thus has to rely on intuition and experience to guide one’s way. (p. 142)

This separation of subject matter expertise on the one hand from what we can learn from data on the other is artificial. All true experts are informed by data. The best experts are well informed by data. Data about things existed long before the digital age.

At one point in the book the authors quote Hal Varian, formerly a professor at UC Berkeley and now Google’s chief economist: “Data is so widely available and so strategically important that the scarce thing is the knowledge to extract wisdom from it” (p. 125). What they seem to miss is the fact that Varian, in the same interview from which this quotation was derived, talks about the need for subject matter experts such as managers to become better informed by data to do their jobs. People become better subject matter experts when they become better acquainted with pertinent data. These experts will not be displaced by data; they will be enriched by it as they always have been, hopefully to an increasing degree.

As we’ve seen, the pioneers in big data often come from fields outside the domain where they make their mark. They are specialists in data analysis, artificial intelligence, mathematics, or statistics, and they apply those skills to specific industries. (p. 142)

These Big Data pioneers don’t perform these wonders independent of domain expertise but in close collaboration with it. Data sensemaking skills do not replace or supplant subject matter expertise; they inform it.

To illustrate how Big Data is displacing the subject matter experts in one industry, the authors write the following about the effects of Big Data in journalism:

This is a humbling reminder to the high priests of mainstream media that the public is in aggregate more knowledgeable than they are, and that cufflinked journalists must compete against bloggers in their bathrobes. (p. 103)

“Cufflinked journalists”? From where did this characterization of journalists come? Perhaps the author Cukier, the data editor of the Economist, has a bone to pick with other journalists who don’t understand or respect his work. Whatever the source of this enmity against mainstream media, I don’t want to rely on bloggers for my news of the world. While bloggers have useful information to share, unless they develop journalistic skills, they will not replace mainstream journalists. This is definitely one of those cases in which the amount of information—noise in the blogosphere—cannot replace thoughtful and skilled reporting.

Perhaps the strangest twist on this theme that the authors promote is contained in the following paragraph:

Perhaps, then, the crux of the value is really in the skills? After all, a gold mine isn’t worth anything if you can’t extract the gold. Yet the history of computing suggests otherwise. Today expertise in database management, data science, analytics, machine-learning algorithms, and the like are in hot demand. But over time, as big data becomes more a part of everyday life, as the tools get better and easier to use, and as more people acquire the expertise, the value of the skills will also diminish in relative terms…Today, in big data’s early stages, the ideas and the skills seem to hold the greatest worth. But eventually most value will be in the data itself. (p. 134)

The value of skills and expertise will not diminish over time. When programming jobs started being offshored, the value of programming wasn’t diminished, even though the value of individual programmers was diminished through competition. No shift in value will occur from skills and expertise to data itself. Data will forever remain untapped, inert, and worthless without the expertise that is required to make sense of it and tie it to existing knowledge.

What Is a Big Data Mindset and Is It New?

Data represents potential. It always has. From back when our brains first evolved to conceive of data (facts) through the development of language, followed by writing, the invention of movable type, the age of enlightenment, the emergence of computers, and the advent of the Internet until now, we have always recognized the potential of data to become knowledge when understood and to be useful when applied. Has a new data mindset arisen in recent years?

Seeing the world as information, as oceans of data that can be explored at ever greater breadth and depth, offers us a perspective on reality that we did not have before. It is a mental outlook that may penetrate all areas of life. Today, we are a numerate society because we presume that the world is understandable with numbers and math. And we take for granted that knowledge can be transmitted across time and space because the idea of the written word is so ingrained. Tomorrow, subsequent generations may have a “big-data consciousness”—the presumption that there is a quantitative component to all that we do, and that data is indispensable for society to learn from. The notion of transforming the myriad dimensions of reality into data probably seems novel to most people at present. But in the future, we will surely treat it as a given. (p. 97)

Perhaps this mindset is novel for some, but it is ancient in origin, and it has been my mindset for my entire 30-year career. Nothing about this is new.

In a big-data age, we finally have the mindset, ingenuity, and tools to tap data’s hidden value. (p. 104)

Poppycock! We’ve always searched for hidden value in data. What we haven’t done as much in the past is collect everything in the hope that it contains a goldmine of hidden wealth if we only dig hard and long enough. What has yet to be determined is the net value of this venture.

What Are the Risks of Big Data?

Early in the book the authors point out that they are observers of Big Data, not evangelists. They seem to be both. They certainly promote Big Data with enthusiasm. Their chapter on the risks of Big Data does not negate this fact. What’s odd is that the risk that seems to concern them most is one that is and will probably always remain science fiction. They are concerned that those in power, such as governments, will use Big Data to predict the bad behavior of individuals and groups and then, based on those predictions alone, act preemptively by arresting people for crimes that have not yet been committed.

It is the quintessential slippery slope—leading straight to the society portrayed in Minority Report, a world in which individual choice and free will have been eliminated, in which our individual moral compass has been replaced by predictive algorithms and individuals are exposed to the unencumbered brunt of collective fiat. If so employed, big data threatens to imprison us—perhaps literally—in probabilities. (p. 163)

Despite preemptive acts of government that were later exposed as mistakes (e.g., the invasion of Iraq because they supposedly had weapons of mass destruction) and those of insurance companies or credit agencies that deny coverage or loans by profiling certain groups of people as risky in the aggregate, the threat of being arrested because an algorithm predicted that I would commit a crime does not concern me.

Big data erodes privacy and threatens freedom. But big data also exacerbates a very old problem: relying on the numbers when they are far more fallible than we think. (p. 163)

This is indeed a threat, but the authors’ belief that we can allow messiness in Big Data exacerbates this problem.

The threat is that we will let ourselves be mindlessly bound by the output of our analyses even when we have reasonable grounds for suspecting something is amiss. Or that we will become obsessed with collecting facts and figures for data’s sake. (p. 166)

This obsession with “collecting facts and figures for data’s sake” is precisely what the authors promote in this book.

In Summary

The authors of this book are indeed evangelists for the cause of Big Data. Even though one is an academic and the other an editor, both make their living by observing, using, and promoting technology. There’s nothing wrong with this, but the objective observer’s perspective on Big Data that I was hoping to find in this book wasn’t there.

Is Big Data the paradigm-shifting new development that the authors and technology companies claim it to be, or is the data of today part of a smooth continuum extending from the past? Should we adopt the mindset that all data is valuable in and of itself and that “more trumps better”? Should we dig deep into our wallets to create the ever-growing infrastructure that would be needed to indiscriminately collect, store, and retain more?

Let me put this into perspective. While recently reading the book Predictive Analytics by Eric Siegel, I learned about the research of Larry Smarr, the director of a University of California-based research center, who is “tracking all bodily functions, including the scoop on poop, in order to form a working computational model of the body as an ecosystem.” Smarr asks and answers:

Have you ever figured how information-rich your stool is? There are about 100 billion bacteria per gram. Each bacterium has DNA…This means that human stool has a data capacity of 100,000 terabytes of information stored per gram.

This is fascinating. I’m not being sarcastic; it really is. I think Smarr’s research is worthwhile. I don’t think, however, that we should all continuously save what we contribute to the toilet, convert its contents into data, and then store that data for the rest of our lives. If we did, we would become quite literally buried in shit. Not all data is of equal importance.
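For the curious, Smarr’s figure is roughly consistent with a back-of-envelope calculation, assuming (my assumptions, not the book’s) a typical gut bacterium’s genome of about four million base pairs stored at two bits per base:

    # Back-of-envelope check of the quoted figure; genome size and encoding are assumptions.
    bacteria_per_gram = 100e9         # as quoted: about 100 billion bacteria per gram
    base_pairs_per_genome = 4e6       # assumed: roughly 4 million base pairs per bacterium
    bytes_per_base_pair = 0.25        # assumed: 2 bits per base pair

    bytes_per_gram = bacteria_per_gram * base_pairs_per_genome * bytes_per_base_pair
    print(f"{bytes_per_gram / 1e12:,.0f} terabytes per gram")  # prints 100,000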

The authors of the book Big Data: A Revolution That Will Transform How We Live, Work, and Think, in a thoughtful moment of clarity, included the following note of realism:

What we are able to collect and process will always be just a tiny fraction of the information that exists in the world. It can only be a simulacrum of reality, like the shadows on the wall of Plato’s cave. (p. 197)

This is beautifully expressed and absolutely true. Data exists in a potentially infinite supply. Given this fact, wouldn’t it be wise to determine with great care what we collect, store, retain, and mine for value? To the extent that more people are turning to data for help these days, learning to depend on evidence rather than intuition alone to inform their decisions, should we accept the Big Data campaign as helpful? We can turn people on to data without claiming that something miraculous has changed in the data landscape over the last few years. The benefits of data today are the same benefits that have always existed. The skills that are needed to tap this potential have changed relatively little over the course of my long career. As data continues to increase in volume, velocity, and variety as it has since the advent of the computer, its potential for wise use increases as well, but only if we refine our ability to separate the signals from the noise. More does not trump better. Without the right data and skills, more will only bury us.

Take care,

21 Comments on “A More Thoughtful but No More Convincing View of Big Data”


By Doug Ross. May 14th, 2013 at 11:46 am

Big Data = More Garbage In, Faster Garbage Out.

I read the book. All too much hype for a few cherry picked examples of the power of Big Data. Am I supposed to be impressed that when I buy a cookbook for my daughter on Amazon, that the next time I return, I am presented with another cookbook to purchase? Or else be forced to edit my purchase history to help Amazon improve its recommendations? How about when I go to Google one time to search for “silver coins” to find the value of some coins my wife got from an estate and now I am presented with Gold IRA ads. Is that supposed to prove the power of Big Data?
Wake me up when Big Data can predict the results of NFL football games at > 80% or tell me with 99.5% certainty whether it will rain next Thursday. Then I’ll be impressed.

By Jordan Goldmeier. May 14th, 2013 at 3:06 pm

You might enjoy the work of Evgeny Morozov. I think you and he often overlap in your critiques of technology. Here, for instance, is a quote from his latest book, which a friend recently brought to my attention:

Recasting all complex social situations either as neat problems with definite, computable solutions or as transparent and self-evident processes that can be easily optimized—if only the right algorithms are in place!—this quest is likely to have unexpected consequences that could eventually cause more damage than the problems they seek to address. I call the ideology that legitimizes and sanctions such aspirations “solutionism,”…an unhealthy preoccupation with sexy, monumental, and narrow-minded solutions—the kind of stuff that wows audiences at TED Conferences—to problems that are extremely complex, fluid, and contentious

By Colin Michael. May 15th, 2013 at 8:27 am

This took me two days to get through, but it was worthwhile. Now I finally realize the error of my ways. I’ve spent years trying to help marketers and product developers move away from relying almost solely on “intuition” (a.k.a. informed wild guesses) and toward careful measurement, clean data, and truth-based predictive analytics. Now I see that this has been pure folly! The “real” world is messy and so should our plans for it be. Of course! Developing better products is not the way to go. We just need to develop more products. Someone will find a use for one of them someday. And that might even happen before we go broke! Maybe.

Methinks Big Data is like Big Government… [insert your own anecdote here]

By Dale Lehman. May 20th, 2013 at 3:46 pm

Thank you for a very thoughtful review that has hit on many things I have been preoccupied with lately (I work with data and teach data analysis to MBA students but do not feel up to speed with machine learning/big data approaches though I do work with some very big datasets). I have two thoughts that I’d be very interested in your reactions (as well as others) to:
1. I view some of the current hype as a reaction to the generally poor job we have done in the past teaching people to do data analysis. Indeed, data analysis is a more appropriate focus than “statistics” in a data-rich environment. In particular, inferential statistics is less important relative to good descriptive and visual analysis than traditionally taught (though far from irrelevant). As an example, many business statistics courses are still taught without the use of any software – clearly an invitation for people to look for hype that promises to make it unnecessary to think about causal models.
2. I have entered some data competitions in order to see what it is about – and my best efforts (though admittedly I have not spent a great deal of time building models) don’t come close to the machine learners. So, I believe they are onto something – but, exactly what? They build better predictive models, but I have not yet seen the evidence that these models are stable or yield valuable insight (of course, there are anecdotal successes, but we never hear about the failures – where “big data” was used and resulted in poor decisions). The intriguing thing is the idea that we can train machines to process vast amounts of data, make decisions on the basis of this data, without any human intervention. In theory, it is possible that this will be better than employing subject matter experts to intervene in the process – but again, the evidence has not been provided (and there are clear counterexamples, such as Nate Silver’s description of the evolution of baseball statistical analysis, where scouts have not been replaced by models, but are complementary resources).

By Bert Brijs. May 21st, 2013 at 12:58 am

Couldn’t agree with you more. I will post an article by the end of May making a serious attempt to get the definition of Big Data clearer. That might help to reduce the Big Clutter and give a more realistic look on Big Promises about Big Investments.
Keep up the good work!

By Scott Reida. May 22nd, 2013 at 10:58 am

Great post (as always). You mentioned the issue we have with analyzing data that is in video, sentence, etc. format. I think this is where ‘Big Data’ work will actually help. The tools being developed to perform this conversion is what we should really be excited about in this ‘movement’ (pardon my stool reference). Once this is done…you’re right…we should revert to most of the analytical tools outside of ‘Big Data’. Executives want to know that the recommendations they receive have some credibility behind them. How can this be given in good faith without the analyst understanding causation in the equation?

By Stephen Few. May 25th, 2013 at 8:42 am

Dale,

It is certainly true that too little has been done in the past to equip people with data sensemaking skills, but I don’t think this provides much of the fuel that is feeding Big Data hype today. Big Data is an invention of technology vendors and analysts to market their products and services. People are eager to believe that there is a techno-magical answer to their inability to derive value from data. It’s easier to face than the harsh reality of study, practice, and hard work.

Regarding the predictive models that are currently being produced by “machine learning” (improperly named, in my opinion), especially the complex models that result from the combined models of many groups that join forces to win Big Data competitions, if you haven’t read Eric Siegel’s book Predictive Analytics, you might find it interesting. Siegel does a good job of explaining predictive analytics, including the particular problems that machine learning is best suited to solve.

By Matthew Iles. May 25th, 2013 at 10:38 am

Stephen, I am a huge fan of your work. I’ve read your books and practice your methods in my role as a digital strategy consultant. This post is seriously music to my ears. I just pulled about 10 quotes from here to share with students in my Digital Marketing course. So first off, thank you for injecting some much-needed sanity and reason into this discussion.

As a marketer myself, I am constantly amazed at how prolific this industry is in its production of meaningless but oh-so-shiny buzzwords. I try to choose my words carefully in an effort to clearly and precisely convey my thoughts and opinions, so I am endlessly frustrated when others hijack words and phrases, twist their meaning or apply a thick coat of gloss, and prevent me from using previously apt language for fear of coming across as completely full of shit. Because that’s what happens when you use buzzwords to communicate.

For me, Big Data is the most offensive and annoying buzzword yet. Usually, terms earn BS status through over-application. Like saying a word out loud over and over again until it just loses its meaning. “Engagement” is a classic example. It’s virtually impossible to use this word now when discussing your digital marketing strategy without sounding like a know-nothing jackass.

But Big Data just skipped all that! Think about the boldness of it all. Can you imagine the boardroom meeting where this phrase was cooked up?

Marketing Guru 1: “It seems the smartest, most successful companies are no longer steering their business on intuition, but instead analyze their performance data to learn what works and what doesn’t to make informed decisions about what to augment and what to diminish.”

Marketing Guru 2: “Golly, that was a big mouthful! Did you say data?” **Lightbulb!**

In and of itself, data is worthless. It is puzzle pieces scattered on a tabletop, atoms scattered across the universe, or letters scattered across a page. It’s all just trash if we can’t put it together into something more. The English language wouldn’t be more descriptive or communicable or even poetic if we simply had more letters. In fact, that sounds horrible. Instead, it’s how we fit them together to form new words, new turns of phrase, better and more beautiful descriptions that make all the difference. As one of my colleagues put it, “Big Data is a problem, not a solution.”

What we really need is Simple Data. The potential of data lies in our ability to unearth insights, and the potential of insights lies in our ability to communicate them to others. Gather data, analyze for patterns and trends, tell the story. Simply and clearly and directly. That’s the game, and the best players know this.

Thank you, Stephen, for lighting the way. :)

By Enrico Bertini. June 3rd, 2013 at 10:02 am

Stephen,

I am not a big fan of the term big data either, nonetheless I am not sure I would totally buy the idea that what we are witnessing today is not qualitatively different from what we have had in the past. When I think about the phenomenon people call big data for me the real difference is the pervasiveness and democratization of data. This is for me what has really changed during the last decade. Data used to be a matter for a few geeks/statisticians/analysts but it’s no longer like that.

This fact became clear to me a few years ago when I had a long collaboration with a group of biochemists at the University of Konstanz. These are people that basically overnight went from analyzing a handful of cells in the wet lab to having a little robot in their room (well … maybe the next door) which spits out hundreds of thousands of data points in a few hours. You would have to see the look they have in their eyes when they try to sort these data manually in Excel spreadsheets.

I don’t like to call it big data but it is “something”. Something changed at the level of professions and society. Pervasiveness is a big thing here and yes, it is new. How many people like those I met in Konstanz are around the world? I guess many many many.

I am curious to hear your opinion on that.

Take care.

By Stephen Few. June 3rd, 2013 at 10:50 am

Enrico,

The pervasiveness of data and the democratization of data are both examples of a change in degree, not a change of state. These characteristics have both steadily increased since the advent of the computer. There is no evidence that either has experienced a sudden leap that constitutes a change of state. In fact, these characteristics have been increasing rapidly since the advent of the printing press, but even that probably did not qualify as a change of state, according to the best historians.

By Enrico Bertini. June 3rd, 2013 at 1:12 pm

I am not sure I understand. Are you suggesting a change in degree (as opposed to a change of state) is not interesting or worthy of special consideration? Data pervasiveness and democratization mean to me that new layers of our society are experiencing for the first time the challenges and opportunities offered by data analysis. The effect of this is not trivial, even if it’s “just” a change of degree and not state.

For data professionals and teachers like us it also means we have to develop solutions and strategies that target a completely different kind of audience, with different expectations and background.

There is more. As decisions and recommendations are increasingly taken on the basis of what some “experts” do with data we have a huge issue with what people take for “the truth” and how it is perceived and communicated.

Similarly, given the large availability of data and its democratization people can now do their own data analysis and build their own truth so easily. I don’t think this was true even ten years ago or so.

Please note that I am not purposely trying to argue against your point of view. I am just trying to check if I fully understand what you mean when you seem to suggest nothing has really changed.

By Stephen Few. June 3rd, 2013 at 1:39 pm

Enrico,

Every proponent of Big Data that I’ve read argues that it represents a change of state, not a mere change of degree. This, they argue, is why it is appropriate to give data today a new name (i.e., Big Data). Data is forever growing bigger and more available. This has been true throughout my lifetime and yours. It did not suddenly change in character.

I’m not sure that I agree that “new layers of our society are experiencing for the first time the challenges and opportunities offered by data analysis.” Business intelligence software vendors have been making this claim for a generation, but in truth the increase in the breadth of people that analyze data has been minor. Having access to more data and the use of a tool does not make someone a data analyst. I definitely don’t agree that “people can now do their own data analysis and build their own truth easily.” In my experience, if the percentage of people who are capable of analyzing data and determining truth based on data has actually increased, the percentage is extremely small. This is based on many years of observing this phenomenon firsthand.

I’m not sure what you’re saying regarding experts and what people take “for truth.” Perhaps you could explain.

What you are saying is consistent with the marketing hype about Big Data, but not with the actual experience of people in the world. I work with people from organizations of all types who are struggling to learn basic data sensemaking skills. What I’ve found is that little has changed. Even in organizations that we think of as analytically savvy (I’ve worked with many), I’ve found that these organizations are no more analytically sophisticated in general today, with the exception of a handful of skilled individuals, than organizations were 30 years ago.

By Enrico Bertini. June 3rd, 2013 at 3:43 pm

Stephen,

I am not sure I agree. Two examples come into my mind. Let’s go back to my experience with biochemistry. The way the lab looked 30 years ago is very very different from the way it looks today. Analyzing 30 vs. millions of molecules is very different in practice. The whole area of bioinformatics witnessed a huge change with the advent of computing and some of the skills necessary to a biologist today have radically changed. In some areas if you are not technically savvy you are lost.

Another example is the whole area of data journalism. These people might not be what you call “data analysts” yet they do data analysis all the time and publish articles people read and tend to trust. I conjecture they may end up trusting them even more given the allure of presenting a story based on data analysis. Did we have this 20 years ago? No. Not at the scale and speed we are witnessing today at least.

Maybe things have not changed much in business intelligence and in the corporate world but there are segments of the population that are confronted for the first time with data analysis at such a scale. The two cases I have mentioned above are just two examples but I bet there are many more.

By Stephen Few. June 3rd, 2013 at 4:37 pm

Enrico,

Your first example of the change in what biochemists can do is an example of what has gone on with data since the advent of the computer, not an example of so-called Big Data. Since the advent of the computer long ago data has increased at an exponential rate and in variety. This has certainly been true of biochemistry. This was already happening before the computer, just not as fast. I’m not arguing that new data does not become available, nor that we cannot do specific things today that we couldn’t do in the past. I’m arguing that this has always been the case. During what era did data not increase? During what era since the Enlightenment has technology not improved? What’s happening today is consistent with what has happened in the past.

At what point in history did journalists not base their reporting on data? Did they do this 20 years ago? Yes, they did. The fact that those with the requisite skills can do it today at a scale and speed that is greater than 20 years ago is not evidence of so-called Big Data. Twenty years ago they could do it at a scale and speed that was greater than 20 years before that.

I and others like me who counter the hype of Big Data are not saying that data hasn’t increased in volume, velocity, and variety, but merely that it has done so since the advent of the computer at roughly the same exponential rate of increase. We are not expressing disrespect for the use of information. Rather, we are pointing out that the basic nature of data has not changed and that the same basic skills that people have used for generations to make sense of data have changed very little. More, faster, and in greater variety does not change the basic nature of data or its use. Until data is understood, it is worthless. The emphasis should be on understanding data, not on its unrestricted growth. While data is increasing at an exponential rate, most of it is noise.

By Enrico Bertini. June 3rd, 2013 at 8:04 pm

Thanks Steve, I understand your point much better now.

By Janne Pyykkö. June 4th, 2013 at 10:49 pm

I have mixed feelings about the term also. Something came into my mind, though, that might be worth considering.

Aleksandr Solzhenitsyn was awarded the Nobel Prize in Literature in 1970 for his writings about the Soviet labor camp system. He wrote several books, most notably The Gulag Archipelago, which was the term that he used about the issue. While this new term adds nothing new to good old “labor camp system”, it definitely raised awareness (in the Western world) that the volume, velocity, and variety of brutal human society has reached a new level (and still, you can argue that it’s nothing new, this has happened many times, it’s the same old thing but on a bigger scale).

So, I have mixed feelings about Big Data. One part of me thinks sometimes we need a new term, though it’s only an increase in quantity.

By Stephen Few. June 5th, 2013 at 9:10 am

Janne,

I would have no concern about the term Big Data if it were only being used to draw people’s attention to the potential of data. Unfortunately, it is being used to spread misinformation, some of which is harmful. It is a vivid example of “technological solutionism.” For insight into the harm this engenders, you might be interested in reading “To Save Everything, Click Here: The Folly of Technological Solutionism” by Evgeny Morozov.

By Enrico Bertini. June 5th, 2013 at 4:51 pm

Did anyone here read "Raw Data Is An Oxymoron" (http://www.amazon.com/dp/0262518287)? It looks very much related to the matter we are discussing here, especially the historical perspective on data production and accumulation.

From the description: ” This book reminds us that data is anything but “raw,” that we shouldn’t think of data as a natural resource but as a cultural one that needs to be generated, protected, and interpreted. The book’s essays describe eight episodes in the history of data from the predigital to the digital.”

By Stephen Few. June 5th, 2013 at 5:12 pm

Enrico,

I haven’t read “Raw Data Is An Oxymoron,” but I’ll order it now and perhaps review it in the future.

By Michael McAllister. June 13th, 2013 at 3:38 pm

I was going down the rabbit hole today from an article elsewhere, and came across a piece Nassim Taleb wrote for Wired. The article is titled “Beware the big errors of ‘big data'” (http://www.wired.com/opinion/2013/02/big-data-means-big-errors-people/). Taleb is the author of books some of us may have read (Fooled by Randomness, The Black Swan), and critiques big data from the perspective of the signal vs noise. I hope this article adds something for all of us to heed and consider.

By stephen black. June 16th, 2013 at 7:39 am

[Note: I, Stephen Few, have responded to Stephen Black’s comments in italics below. All sections of text that appear in italics enclosed in brackets are my responses.]

As someone who is currently working to exploit BigData tools for practical uses, I actually agree with the thrust of your critique. I think many of the evangelists of BigData have not understood why the modern tools emerging to handle it will actually make a difference.

They believe the big changes will come from the Big of big data (the equation is something like more data=more insight). This ignores Nassim Taleb’s criticism that the number of spurious correlations rises faster than the number of new data points.

But I have seen major gains from the use of BigData tools like Google’s BigQuery (especially when linked live to data visualization tools like Tableau so the datasets can be investigated interactively). But the gains we have seen are the result of speed, not size. And we are not getting insights by using new data but from old datasets that have been around for some time.

[Be cautious. Most organizations have not begun to look in useful ways at data sets that they’ve had for ages. We shouldn’t suggest that only new data sets have something to offer. This is a part of the Big Data hype that is misleading and harmful: the notion that new data sets are essential. Most organizations would benefit more from first learning how to look at existing data in meaningful ways.]

For example, England’s NHS has been collecting detailed data about hospital admissions for all its hospitals for more than a decade and has been adding or publishing new collections covering more health activity every couple of years (e.g., they recently released a dataset to the public containing summary data at GP practice level of all primary care drug prescriptions, which contains about 1.5GB a month of data). Academics and NHS analysts have been deriving insights from this data for some time. But every analysis is time-consuming and tedious. This is primarily because accessing the full datasets is slow. We used to run even simple queries as overnight batch jobs and hope our SQL was OK and we had asked the right question so our next day would not be spent in rework. The current datasets are also usually stored in one-year chunks to maximise speed and limit issues with storage and revisions. Understandably this limits the questions you can ask and you have to have a good reason to bother. If you want to answer questions that need data from multiple years, you have a lot of tedious extracting and joining to do. But we recently tested loading all the datasets for all years into single BigQuery tables. The result is that we can now interact with the data even across multiple datasets (relationships between primary care prescriptions and hospital admissions, no problem!). Most queries return results in 15-45 seconds. Stick Tableau on top of this and you have an interactive tool to explore the most complex question without even having to write SQL (Tableau 8 has a connector to BigQuery).

[I want to add another caution here. What you’re describing is not new. I began working in the field of data warehousing when it first emerged 30 years ago, and we’ve been advocating the data structures that you’re describing for well over 20 years. Technologies have enabled easier ways to accomplish this over time, but the data handling abilities that you’re describing are not new.]

The result of this speed is that we now have an agile process for testing interesting hypotheses about health data. Instead of having to throw away most ideas because we don’t have time to test them and having to apply a big “is this question valuable enough to be worth all the effort” filter, we can now go straight from the hypothesis to testing and refining with no significant delay. Cycles of analysis, response, refinement, decision that previously took weeks or even months can now happen interactively during a single meeting. And we can test ideas without worrying about how much it will cost to do it. So far more ideas can be tested.

[And what exactly is a “Big Data tool”? Every data-handling tool on the market is now being called a Big Data tool. Improvements in data handling are not a result of Big Data; they are a result of incremental improvements that have been happening all along. My point is that the name Big Data has been applied to today’s tools even though they are not substantially different from those of the past. When a new processing chip is released that operates at three or four times the speed of its predecessors, it wouldn’t be appropriate to call it revolutionary or to give it a brand new name to imply that it’s a radical departure from the past.]

[Again, this has been possible for many years. The fact that the NHS is just now getting to this is not a result of Big Data. They could have done this long ago.]

Also, Big data tools don’t require a lot of up-front investment in how the data is structured to achieve this speed (at least for datasets that are not full of bad data that has to be cleaned up).

So, for us, the real benefit of BigData is that its tools fundamentally change the process of data analysis, making it incredibly fast and interactive. So the bottleneck for data analysis is pushed back to our ability to think of interesting ideas to test and interesting ways to analyse, not the tedious process of extracting and processing data.

[Your statement underscores my fundamental concern about Big Data. The tools have not fundamentally changed the process of data analysis. The tools have gradually enabled, to an increasing degree, a process that has existed for over a generation. Data analysis has not been held back primarily because the tools have been lacking (although they certainly have) but because people have not developed the requisite skills.]

The gainers from this are the people who do have content knowledge because they will find that knowledge easier to apply (the idea that BigData somehow reduces their role is just silly).

In summary, the gurus of BigData have misunderstood its biggest benefit: speed. It isn’t the new data that matters, it is the ability to look at old data with new eyes, speedily, that matters.

[Although faster data processing speeds are certainly beneficial, they do not constitute anything deserving of a new name (e.g., Big Data), and they have been increasing since the advent of the computer, not suddenly in the last few years as the term Big Data implies.]