I have a problem with Big Data. As someone who makes his living working with data and helping others do the same as effectively as possible, my objection doesn’t stem from a problem with data itself, but instead from the misleading claims that people often make about data when they refer to it as Big Data. I have frequently described Big Data as nothing more than a marketing campaign cooked up by companies that sell information technologies either directly (software and hardware vendors) or indirectly (analyst groups such as Gartner and Forrester). Not everyone who promotes Big Data falls into this specific camp, however. For example, several academics and journalists also write articles and books and give keynotes at conferences about Big Data. Perhaps people who aren’t directly motivated by increased sales revenues talk about Big Data in ways that are more meaningful? To examine this possibility, I recently read the best-selling book on the topic, Big Data: A Revolution That Will Transform How We Live, Work, and Think, by Viktor Mayer-Schönberger and Kenneth Cukier. Mayer-Schönberger is an Oxford professor of Internet governance and regulation and Cukier is the data editor of the Economist.
Big Data: A Revolution That Will Transform How We Live, Work, and Think
Viktor Mayer-Schönberger and Kenneth Cukier
Houghton Mifflin Harcourt, 2013
I figured that if anyone had something useful to say about Big Data, these were the guys. What I found in their book, however, left me convinced even more than before that Big Data is a ruse, and one that should concern us.
What Is Big Data?
One of the problems with Big Data, like so many examples of techno-hype, is that it is ill-defined. What is Big Data exactly? The authors address this concern early in the book:
There is no rigorous definition of big data. Initially the idea was that the volume of information had grown so large that the quantity being examined no longer fit into the memory that computers use for processing, so engineers needed to revamp the tools they used for analyzing it all. (p. 6)
Given this state of affairs, I was hoping that the authors would propose a definition of their own to reduce some of the confusion. Unfortunately, they never actually define the term, but they do describe it in various ways. Here’s one of the descriptions:
The sciences like astronomy and genomics, which first experienced the explosion in the 2000s, coined the term “big data.” The concept is now migrating to all areas of human endeavor. (p. 6)
Actually, the term “Big Data” was coined back in 1997 in the proceedings of IEEE’s Visualization conference by Michael Cox and David Ellsworth in a paper titled “Application-controlled demand paging for out-of-core visualization.” Scientific data in the early 2000s was not our first encounter with huge datasets. For instance, I was helping the telecommunications and banking industries handle what they experienced as explosions in data quantity back in the early 1980s. What promoters of Big Data fail to realize is that data has been increasing at an exponential rate since the advent of the computer long ago. We have not experienced any actual explosions in the quantity of data in recent years. The exponential rate of increase has continued unabated all along.
How else do the authors define the term?
[Big Data is]…the ability of society to harness information in novel ways to produce useful insights or goods and services of significant value. (p. 2)
On the contrary, we’ve been finding “novel ways to produce” value from data forever, not just in the last few years. We haven’t crossed any threshold recently.
How else do they define it?
At its core, big data is about predictions. Though it is described as part of the branch of computer science called artificial intelligence, and more specifically, an area called machine learning, this characterization is misleading. Big data is not about trying to “teach” a computer to “think” like humans. Instead, it’s about applying math to huge quantities of data in order to infer probabilities. (pp. 11 and 12)
So, Big Data is essentially about “predictive analytics.” Did we only in recent years begin applying math to huge quantities of data in an attempt to infer probabilities? We neither began this activity recently, nor did data suddenly become huge.
Is Big Data a technology?
But where most people have considered big data as a technological matter, focusing on the hardware or the software, we believe the emphasis needs to shift to what happens when the data speaks. (p. 190)
I wholeheartedly agree. Where I and the authors appear to differ, however, is in our understanding of the methods that are used to find and understand the messages in data. The ways that we do this are not new. The skills that we need to make sense of data—skills that go by the names statistics, business intelligence, analytics, and data science—have been around for a long time. Technologies incrementally improve to help us apply these skills with greater ease to increasingly larger datasets, but the skills themselves have changed relatively little, even though we come up with new names for the folks who do this work every few years.
So what is it exactly that separates Big Data from data of the past?
One way to think about the issue today—and the way we do in this book—is this: big data refers to things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value, in ways that change markets, organizations, the relationship between citizens and governments, and more.
But this is just the start. The era of big data challenges the way we live and interact with the world. Most strikingly, society will need to shed some of its obsession for causality in exchange for simpler correlations: not knowing why but only what. This overturns centuries of established practices and challenges our most basic understanding of how to make decisions and comprehend reality. (pp. 6 and 7)
To “make decisions and comprehend reality” we no longer need to understand why things happen together (i.e., causation) but only what things happened together (correlation). When I read this, an eerie feeling crawled up my spine. The implications of this line of thinking are scary. Should we really race into the future satisfied with a level of understanding that is relatively superficial?
According to the authors, Big Data consists of “things one can do at a large scale that cannot be done at a smaller one.” And what are these things and how exactly do they change how “we live and interact with the world”? Let’s see if the authors tell us.
Does Big Data Represent a Change of State?
One claim that the authors make, which is shared by other promoters of Big Data, is that data has grown so large and so fast that the increase in quantity constitutes a qualitative change of state.
Not only is the world awash with more information than ever before, but that information is growing faster. The change of scale has led to a change of state. The quantitative change has led to a qualitative one. (p. 6)
The essential point about big data is that change of scale leads to change of state. (p. 151)
All proponents make this claim about Big Data, but I’ve yet to see anyone substantiate it. Perhaps it is true that things can grow to such a size and at such a rate that they break through some quantitative barrier into the realm of qualitative change, but what evidence do we have that this has happened with data? This book contains many examples of data analytics that have been useful during the past 20 years or so, which the authors classify as Big Data, but I believe this attribution is contrived. Not one of the examples demonstrates a radical departure from the past. Here’s one involving Google:
Big data operates at a scale that transcends our ordinary understanding. For example, the correlation Google identified between a handful of search terms and the flu was the result of testing 450 million mathematical models. (p. 179)
I suspect that Google’s discovery of a correlation between search activity and incidents of the flu in particular areas resulted not from 450 million distinct mathematical models but rather from a predictive analytics algorithm making millions of minor adjustments during the process of building a single model. If they really created 450 million different models, or even if they actually made that many tweaks to an evolving model to find this relatively simple correlation, is this really an example of progress? A little statistical thinking by a human being could have found this correlation with the help of a computer much more directly. Regardless of how many models were actually used, the final model was not overly complicated. What was done here does not transcend the ordinary understanding of data analysts.
And now, for the paradigm-shattering implications of this change of state:
Big data is poised to reshape the way we live, work, and think. The change we face is in some ways even greater than those sparked by earlier epochal innovations that dramatically expanded the scope and scale of information in society. The ground beneath our feet is shifting. Old certainties are being questioned. Big data requires fresh discussion of the nature of decision-making, destiny, justice. A worldview we thought was made of causes is being challenged by a preponderance of correlations. The possession of knowledge, which once meant any understanding of the past, is coming to mean an ability to predict the future. (p. 190)
Does any of this strike you as particularly new? Everything that the authors claim as particular and new to Big Data is in fact old news. If you’re wondering what they mean by the “worldview we thought was made of causes is being challenged by a preponderance of correlations,” stay tuned; we’ll look into this soon.
Who Is the Star of Big Data?
Who or what in particular is responsible for the capabilities and potential benefits of big data? Are technologies responsible? Are data scientists responsible? Here’s the authors’ answer:
The real revolution is not in the machines that calculate data but in data itself and how we use it. (p. 7)
What we can do with data today was not primarily enabled by machines, but it is also not intrinsic to the data itself. Nothing about the nature of data has changed. Data is always noise until it provides an answer to a question that is asked to solve a problem or take advantage of an opportunity.
In the age of big data, all data will be regarded as valuable, in and of itself. (p. 100)
With big data, the value of data is changing. In the digital age, data shed its role of supporting transactions and often became the good itself that was traded. In a big-data world, things change again. Data’s value shifts from its primary use to its potential future uses. That has profound consequences. It affects how businesses value the data they hold and who they let access it. It enables, and may force, companies to change their business models. It alters how organizations think about data and how they use it. (p. 99)
God help us. This notion should concern us because most data will always remain noise beyond its initial use. We certainly find new uses for data that was originally generated for another purpose, such as transaction data that we later use for analytical purposes to improve decisions, but in the past we rarely collected data primarily for potential secondary uses. Perhaps this is a characteristic that actually qualifies as new. Regardless, we must ask the question, “Is this a viable business model?” Should all organizations begin collecting and retaining more data in hope of finding unforeseen secondary uses for it in the future? I find it hard to imagine that secondary uses of data will provide enough benefit to warrant collecting everything and keeping it forever, as the authors seem to believe. Despite their argument that this is a no-brainer based on decreasing hardware costs, the price is actually quite high. The price is not based on the cost of hardware alone.
Discarding data may have been appropriate when the cost and complexity of collecting, storing, and analyzing it were high, but this is no longer the case. (p. 60)
Every single dataset is likely to have some intrinsic, hidden, not yet unearthed value, and the race is on to discover and capture all of it. (p. 15)
Data’s true value is like an iceberg floating in the ocean. Only a tiny part of it is visible at first sight, while much of it is hidden beneath the surface. (p. 103)
Imagine the time that will be wasted and how much it will cost. Only a tiny fraction of data that is being generated today will ever be valuable beyond its original use. A few nuggets of gold might exist in that iceberg below the water line, but do we really need to collect and save it all? Even the authors exhibit concern for the persistent value of data.
Most data loses some of its utility over time. In such circumstances, continuing to rely on old data doesn’t just fail to add value; it actually destroys the value of fresher data. (p. 110)
Collecting, storing, and retaining everything will make it harder and harder to focus on the little that actually has value. Nevertheless, the authors believe that the prize will go to those with the most.
Scale still matters, but it has shifted [from technical infrastructure]. What counts is scale in data. This means holding large pools of data and being able to capture even more of it with ease. Thus large data holders will flourish as they gather and store more of the raw material of their business, which they can reuse to create additional value. (p. 146)
If this were true, wouldn’t the organizations with the most data today be the most successful? This isn’t the case. In fact, many organizations with the most data are drowning in it. I know, because I’ve tried to help them change this dynamic. Having lots of data is useless unless you know how to make sense of it and how to apply what you learn.
Attempts to measure the potential value of data used for secondary purposes have so far been little more than wild guesses. Consider the way that Facebook was valued prior to going public.
Doug Laney, vice president of research at Gartner, a market research firm, crunched the numbers during the period before the initial public offering (IPO) and reckoned that Facebook had collected 2.1 trillion pieces of “monetizable content” between 2009 and 2011, such as “likes,” posted material, and comments. Compared against its IPO valuation, this means that each item, considered as a discrete data point, had a value of about five cents. Another way of looking at it is that every Facebook user was worth around $100, since users are the source of the information that Facebook collects.
How to explain the vast divergence between Facebook’s worth under accounting standards ($6.3 billion) and what the market initially valued it at ($104 billion)? There is no good way to do so. (p. 119)
Indeed, we are peering at tea leaves, trying to find meaning in drippy green clumps. Unforeseen uses for data certainly exist, but they are by definition difficult to anticipate. Do we really want to collect, store, and retain everything possible on the off chance that it might be useful? Perhaps, instead, we should look for ways to identify data with the greatest potential for future use and focus on collecting that? The primary problem that we still have with data is not the lack of it but our inability to make sense of it.
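For readers who want to check the arithmetic behind the Facebook figures quoted above, here is a minimal sketch. It uses only the numbers given in the quoted passage (p. 119); the variable names are my own, and the implied user count is simply what the book's rounded $100-per-user figure works out to, not an independently sourced statistic.

```python
# Figures as quoted from the book (p. 119); nothing here is independently sourced.
ipo_valuation_usd = 104e9     # what the market initially valued Facebook at
book_value_usd = 6.3e9        # worth under accounting standards
content_items = 2.1e12        # "monetizable content" items, 2009 to 2011
value_per_user_usd = 100      # the book's rounded per-user figure

# Valuation spread evenly over every item of content.
value_per_item_usd = ipo_valuation_usd / content_items

# Number of users implied by the rounded $100-per-user figure.
implied_users = ipo_valuation_usd / value_per_user_usd

# The part of the valuation that accounting standards do not explain.
unexplained_gap_usd = ipo_valuation_usd - book_value_usd
market_to_book = ipo_valuation_usd / book_value_usd

print(f"value per item:       ${value_per_item_usd:.4f}")
print(f"implied users:        {implied_users:,.0f}")
print(f"unexplained gap:      ${unexplained_gap_usd / 1e9:.1f} billion")
print(f"market-to-book ratio: {market_to_book:.1f}")
```

The per-item figure comes out just under five cents, matching the book, and the market valuation is roughly 16.5 times the accounting value. The division itself explains nothing about why the gap exists, which is the point: a number per data item is an average derived from a guess, not a measurement of value.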
Does Correlation Alone Suffice With No Concern for Causation?
The authors introduce one of their strangest claims in the following sentence:
The ideal of identifying causal mechanisms is a self-congratulatory illusion; big data overturns this. (p. 18)
Are they arguing that Big Data eliminates our quest for an understanding of cause altogether?
Causality won’t be discarded, but it is being knocked off its pedestal as the primary fountain of meaning. Big data turbocharges non-causal analyses, often replacing causal investigations. (p. 68)
Apparently science can now take a back seat to Big Data.
Correlations exist; we can show them mathematically. We can’t easily do the same for causal links. So we would do well to hold off from trying to explain the reason behind the correlations: the why instead of the what. (p. 67)
This notion scares me. Correlations, although useful in and of themselves, must be used with caution until we understand the causal mechanisms related to them.
We progress by seeking and finding ever better explanations for reality—what is, how it works, and why. Explanations—the commitment to finding them and the process of developing and confirming them—are the essence of science. By rebelling against authority as the basis of knowledge, the Enlightenment began the only sustained period of progress that our species has ever known (see The Beginning of Infinity, by David Deutsch). Trust in established authority was replaced by a search for testable explanations, called science. The information technologies of today are a result of this search for explanations. To say that we should begin to rely on correlations alone without concern for causation encourages a return from the age of science to the age of ignorance that preceded it. Prior to science, we lived in a world of myth. Even then, however, we craved explanations, but we lacked the means to uncover them, so we fabricated explanations that provided comfort or that kept those in power in control. To say that explanations are altogether unnecessary today in the world of Big Data is a departure from the past that holds no hope for the future. Making use of correlations without understanding causation might indeed be useful at times, but it isn’t progress, and it is prone to error. Manipulation of reality without understanding is a formula for disaster.
The authors take this notion even further.
Correlations are powerful not only because they offer insights, but also because the insights they offer are relatively clear. These insights often get obscured when we bring causality back into the picture. (p. 66)
Actually, thinking in terms of causality is the only way that correlations can be fully understood and utilized with confidence. Only when we understand the why (cause) can we intelligently leverage our understanding of the what (correlation). This is essential to science. As such, we dare not diminish its value.
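To make the danger concrete, here is a small simulation, offered as a sketch under stated assumptions rather than as anything drawn from the book. It assumes a hypothetical hidden factor z that drives two measured variables, x and y, while x has no influence on y at all. The variable names, noise levels, and sample size are all invented for illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
N = 5_000

# Hypothetical hidden cause z drives both x and y; x never influences y.
z = [rng.gauss(0, 1) for _ in range(N)]
x_observed = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

# Intervention: we set x ourselves, independently of z, and y is left as it was.
x_forced = [rng.gauss(0, 1) for _ in range(N)]

r_observed = pearson(x_observed, y)
r_forced = pearson(x_forced, y)

print(f"correlation when x is merely observed: {r_observed:.2f}")
print(f"correlation when x is set by us:       {r_forced:.2f}")
```

In the observed data, x and y are strongly correlated, so x looks like a fine predictor of y. The moment we act on that correlation by manipulating x, the relationship vanishes, because the cause was z all along. Knowing the what was enough to predict, but only knowing the why tells us what will happen when we intervene.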
Big data does not tell us anything about causality. (p. 163)
Huh? Without data we cannot gain an understanding of cause. Does Big Data lack information about cause that other data contains? No. Data contains this information no matter what its size.
According to the authors, Big Data leverages correlations alone in an enlightening way that wasn’t possible in the past.
Correlations are useful in a small-data world, but in the context of big data they really shine. Through them we can glean insights more easily, faster, and more clearly than before. (p. 52)
Big data is all about seeing and understanding the relations within and among pieces of information that, until very recently, we struggled to fully grasp. (p. 19)
What exactly is it about Big Data that enables us to see and understand relationships among data that were elusive in the past? What evidence is there that this is happening? Nowhere in the book do the authors answer these questions in a satisfying way. We have always taken advantage of known correlations, even when we have not yet discovered what causes them, but this has never deterred us in our quest to understand causation. God help us if it ever does.
Does Big Data Transform Messiness into a Benefit?
Not only do we no longer need to concern ourselves with causation, according to the authors we can also stop worrying about data quality.
In a world of small data, reducing errors and ensuring high quality of data was a natural and essential impulse. Since we only collected a little information, we made sure that the figures we bothered to record were as accurate as possible…Analyzing only a limited number of data points means errors may get amplified, potentially reducing the accuracy of the overall results…However, in many new situations that are cropping up today, allowing for imprecision—for messiness—may be a positive feature, not a shortcoming. It is a tradeoff. In return for relaxing the standards of allowable errors, one can get ahold of much more data. It isn’t just that “more trumps some,” but that, in fact, sometimes “more trumps better.” (pp. 32 and 33)
Not only can we stop worrying about messiness in data, we can embrace it as beneficial.
In dealing with ever more comprehensive datasets, which capture not just a small sliver of the phenomenon at hand but much more or all of it, we no longer need to worry so much about individual data points biasing the overall analysis. Rather than aiming to stamp out every bit of inexactitude at increasingly high cost, we are calculating with messiness in mind…Though it may seem counterintuitive at first, treating data as something imperfect and imprecise lets us make superior forecasts, and thus understand our world better. (pp. 40 and 41)
Hold on. Something is amiss in the authors’ reasoning here. While it is true that a particular amount of error in a set of data becomes less of a problem if that quantity holds steady as the total amount of data increases and becomes huge, a particular rate of error remains just as much of a problem as the data set grows in size. More data does not trump better data.
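The difference between a fixed amount of error and a fixed rate of error can be shown with a few lines of arithmetic. The sketch below rests on assumptions of my own: every correct record reads 100, every erroneous record reads 150, and the errors are therefore systematic, always pushing in the same direction. Purely random errors that are as likely to fall low as high would tend to cancel out in an average, so this illustrates the systematic case only.

```python
TRUE_VALUE = 100.0  # hypothetical correct reading for every record
BAD_VALUE = 150.0   # hypothetical erroneous reading (always too high)


def mean_bias(n_records: int, n_bad: int) -> float:
    """How far the average drifts from the truth when n_bad records are wrong."""
    total = (n_records - n_bad) * TRUE_VALUE + n_bad * BAD_VALUE
    return total / n_records - TRUE_VALUE


FIXED_BAD_COUNT = 50  # the same 50 bad records, however large the data set
BAD_PERCENT = 5       # 5 percent of records are bad, however large the data set

for n in (1_000, 10_000, 100_000, 1_000_000):
    bias_fixed_count = mean_bias(n, FIXED_BAD_COUNT)
    bias_fixed_rate = mean_bias(n, n * BAD_PERCENT // 100)
    print(f"n={n:>9,}  fixed count: {bias_fixed_count:7.4f}  "
          f"fixed rate: {bias_fixed_rate:7.4f}")
```

With a fixed count of bad records, the distortion in the average shrinks toward zero as the data set grows. With a fixed rate, the distortion stays exactly where it started no matter how much data is added. Messy collection methods produce the second situation, not the first.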
The way we think about using the totality of information compared with smaller slivers of it, and the way we may come to appreciate slackness instead of exactness, will have profound effects on our interaction with the world. As big-data techniques become a regular part of everyday life, we as a society may begin to strive to understand the world from a far larger, more comprehensive perspective than before, a sort of N = all of the mind. And we may tolerate blurriness and ambiguity in areas where we used to demand clarity and certainty, even if it had been a false clarity and an imperfect certainty. We may accept this provided that in return we get a more complete sense of reality—the equivalent of an impressionist painting, wherein each stroke is messy when examined up close, but by stepping back one can see a majestic picture.
Big data, with its emphasis on comprehensive datasets and messiness, helps us get closer to reality than did our dependence on small data and accuracy. (p. 48)
Will impressionist data provide a more accurate and useful view of the world? I love impressionist paintings, but they’re not what I study to get a clear picture of the world.
Now, if you work in the field of data quality, let me warn you that what’s coming next will shock and dismay you. Perhaps you should sit down and take a valium before reading on.
The industry of business intelligence and analytics software was long built on promising clients “a single version of the truth”…But the idea of “a single version of the truth” is doing an about-face. We are beginning to realize not only that it may be impossible for a single version of the truth to exist, but also that its pursuit is a distraction. To reap the benefits of harnessing data at scale, we have to accept messiness as par for the course, not as something we should try to eliminate. (p. 44)
Seriously? Those who work in the realm of data quality realize that if we give up on the idea of consistency in our data and embrace messiness, the problems that are created by inconsistency will remain. When people in different parts of the organization are getting different answers to the same questions because of data inconsistencies, no amount of data will make this go away.
So what is it that we get in exchange for our willingness to embrace messiness?
In return for living with messiness, we get tremendously valuable services that would be impossible at their scope and scale with traditional methods and tools. According to some estimates only 5 percent of all digital data is ‘structured’—that is, in a form that fits neatly into a traditional database. Without accepting messiness, the remaining 95 percent of unstructured data, such as web pages and videos, remain dark. By allowing for imprecision, we open a window into an untapped universe of insights. (p. 47)
The authors seem to ignore the fact that most data cannot be analyzed until it is structured and quantified. Only then can it produce any of the insights that the authors applaud in this book. It doesn’t need to reside in a so-called structured database, but it must at a minimum be structured in a virtual sense.
Does Big Data Reduce the Need for Subject Matter Expertise?
I was surprised when I read these two authors, both subject matter experts in their particular realms, state the following:
We are seeing the waning of subject-matter experts’ influence in many areas. (p. 141)
They argue that subject matter experts will be substantially displaced by Big Data, because data contains a better understanding of the world than the experts.
Yet expertise is like exactitude: appropriate for a small-data world where one never has enough information, or the right information, and thus has to rely on intuition and experience to guide one’s way. (p. 142)
This separation of subject matter expertise on the one hand from what we can learn from data on the other is artificial. All true experts are informed by data. The best experts are well informed by data. Data about things existed long before the digital age.
At one point in the book the authors quote Hal Varian, formerly a professor in the computer science department at UC Berkeley and now Google’s chief economist: “Data is so widely available and so strategically important that the scarce thing is the knowledge to extract wisdom from it” (p. 125). What they seem to miss is the fact that Varian, in the same interview from which this quotation was derived, talks about the need for subject matter experts such as managers to become better informed by data to do their jobs. People become better subject matter experts when they become better acquainted with pertinent data. These experts will not be displaced by data; they will be enriched by it as they always have, hopefully to an increasing degree.
As we’ve seen, the pioneers in big data often come from fields outside the domain where they make their mark. They are specialists in data analysis, artificial intelligence, mathematics, or statistics, and they apply those skills to specific industries. (p. 142)
These Big Data pioneers don’t perform these wonders independent of domain expertise but in close collaboration with it. Data sensemaking skills do not replace or supplant subject matter expertise; they inform it.
To illustrate how Big Data is displacing the subject matter experts in one industry, the authors write the following about the effects of Big Data in journalism:
This is a humbling reminder to the high priests of mainstream media that the public is in aggregate more knowledgeable than they are, and that cufflinked journalists must compete against bloggers in their bathrobes. (p. 103)
“Cufflinked journalists”? From where did this characterization of journalists come? Perhaps the author Cukier, the data editor of the Economist, has a bone to pick with other journalists who don’t understand or respect his work. Whatever the source of this enmity against mainstream media, I don’t want to rely on bloggers for my news of the world. While bloggers have useful information to share, unless they develop journalistic skills, they will not replace mainstream journalists. This is definitely one of those cases in which the amount of information—noise in the blogosphere—cannot replace thoughtful and skilled reporting.
Perhaps the strangest twist on this theme that the authors promote is contained in the following paragraph:
Perhaps, then, the crux of the value is really in the skills? After all, a gold mine isn’t worth anything if you can’t extract the gold. Yet the history of computing suggests otherwise. Today expertise in database management, data science, analytics, machine-learning algorithms, and the like are in hot demand. But over time, as big data becomes more a part of everyday life, as the tools get better and easier to use, and as more people acquire the expertise, the value of the skills will also diminish in relative terms…Today, in big data’s early stages, the ideas and the skills seem to hold the greatest worth. But eventually most value will be in the data itself. (p. 134)
The value of skills and expertise will not diminish over time. When programming jobs started being offshored, the value of programming wasn’t diminished, even though competition reduced the value of individual programmers. No shift in value will occur from skills and expertise to data itself. Data will forever remain untapped, inert, and worthless without the expertise that is required to make sense of it and tie it to existing knowledge.
What Is a Big Data Mindset and Is It New?
Data represents potential. It always has. From back when our brains first evolved to conceive of data (facts) through the development of language, followed by writing, the invention of movable type, the age of enlightenment, the emergence of computers, and the advent of the Internet until now, we have always recognized the potential of data to become knowledge when understood and to be useful when applied. Has a new data mindset arisen in recent years?
Seeing the world as information, as oceans of data that can be explored at ever greater breadth and depth, offers us a perspective on reality that we did not have before. It is a mental outlook that may penetrate all areas of life. Today, we are a numerate society because we presume that the world is understandable with numbers and math. And we take for granted that knowledge can be transmitted across time and space because the idea of the written word is so ingrained. Tomorrow, subsequent generations may have a “big-data consciousness”—the presumption that there is a quantitative component to all that we do, and that data is indispensable for society to learn from. The notion of transforming the myriad dimensions of reality into data probably seems novel to most people at present. But in the future, we will surely treat it as a given. (p. 97)
Perhaps this mindset is novel for some, but it is ancient in origin, and it has been my mindset for my entire 30-year career. Nothing about this is new.
In a big-data age, we finally have the mindset, ingenuity, and tools to tap data’s hidden value. (p. 104)
Poppycock! We’ve always searched for hidden value in data. What we haven’t done as much in the past is collect everything in the hope that it contains a goldmine of hidden wealth if we only dig hard and long enough. What has yet to be determined is the net value of this venture.
What Are the Risks of Big Data?
Early in the book the authors point out that they are observers of Big Data, not evangelists. They seem to be both. They certainly promote Big Data with enthusiasm. Their chapter on the risks of Big Data does not negate this fact. What’s odd is that the risk that seems to concern them most is one that is and will probably always remain science fiction. They are concerned that those in power, such as governments, will use Big Data to predict the bad behavior of individuals and groups and then, based on those predictions alone, act preemptively by arresting people for crimes that have not yet been committed.
It is the quintessential slippery slope—leading straight to the society portrayed in Minority Report, a world in which individual choice and free will have been eliminated, in which our individual moral compass has been replaced by predictive algorithms and individuals are exposed to the unencumbered brunt of collective fiat. If so employed, big data threatens to imprison us—perhaps literally—in probabilities. (p. 163)
Despite preemptive acts of government that were later exposed as mistakes (e.g., the invasion of Iraq because it supposedly had weapons of mass destruction) and those of insurance companies or credit agencies that deny coverage or loans by profiling certain groups of people as risky in the aggregate, the threat of being arrested because an algorithm predicted that I would commit a crime does not concern me.
Big data erodes privacy and threatens freedom. But big data also exacerbates a very old problem: relying on the numbers when they are far more fallible than we think. (p. 163)
This is indeed a threat, but the authors’ belief that we can allow messiness in Big Data exacerbates this problem.
The threat is that we will let ourselves be mindlessly bound by the output of our analyses even when we have reasonable grounds for suspecting something is amiss. Or that we will become obsessed with collecting facts and figures for data’s sake. (p. 166)
This obsession with “collecting facts and figures for data’s sake” is precisely what the authors promote in this book.
The authors of this book are indeed evangelists for the cause of Big Data. Even though one is an academic and the other an editor, both make their living by observing, using, and promoting technology. There’s nothing wrong with this, but the objective observer’s perspective on Big Data that I was hoping to find in this book wasn’t there.
Is Big Data the paradigm-shifting new development that the authors and technology companies claim it to be, or is the data of today part of a smooth continuum extending from the past? Should we adopt the mindset that all data is valuable in and of itself and that “more trumps better”? Should we dig deep into our wallets to create the ever-growing infrastructure that would be needed to indiscriminately collect, store, and retain more?
Let me put this into perspective. While recently reading the book Predictive Analytics by Eric Siegel, I learned about the research of Larry Smarr, the director of a University of California-based research center, who is “tracking all bodily functions, including the scoop on poop, in order to form a working computational model of the body as an ecosystem.” Smarr asks and answers:
Have you ever figured how information-rich your stool is? There are about 100 billion bacteria per gram. Each bacterium has DNA…This means that human stool has a data capacity of 100,000 terabytes of information stored per gram.
This is fascinating. I’m not being sarcastic; it really is. I think Smarr’s research is worthwhile. I don’t think, however, that everyone should continuously save what we contribute to the toilet, convert its contents into data, and then store that data for the rest of our lives. If we did, we would become quite literally buried in shit. Not all data is of equal importance.
The authors of the book Big Data: A Revolution That Will Transform How We Live, Work, and Think, in a thoughtful moment of clarity, included the following note of realism:
What we are able to collect and process will always be just a tiny fraction of the information that exists in the world. It can only be a simulacrum of reality, like the shadows on the wall of Plato’s cave. (p. 197)
This is beautifully expressed and absolutely true. Data exists in a potentially infinite supply. Given this fact, wouldn’t it be wise to determine with great care what we collect, store, retain, and mine for value? To the extent that more people are turning to data for help these days, learning to depend on evidence rather than intuition alone to inform their decisions, should we accept the Big Data campaign as helpful? We can turn people on to data without claiming that something miraculous has changed in the data landscape over the last few years. The benefits of data today are the same benefits that have always existed. The skills that are needed to tap this potential have changed relatively little over the course of my long career. As data continues to increase in volume, velocity, and variety, as it has since the advent of the computer, its potential for wise use increases as well, but only if we refine our ability to separate the signals from the noise. More does not trump better. Without the right data and skills, more will only bury us.