Big Data and the NSA

June 18th, 2013

In a recent blog post titled “Big data NSA spying is not even an effective strategy,” Francis Gouillart raised concerns about Big Data that are very much in line with mine. Gouillart’s is a refreshing and rare voice of sanity. He’s been around long enough to recognize marketing hype when he sees it, and as an independent thinker with ethics, not a shill for technology vendors, he is one among few who are speaking the truth. Here’s a sample:

The evidence for big data is scant at best. To date, large fields of data have generated meaningful insights at times, but not on the scale many have promised…Yet, for years now, corporations and public organizations have been busy buying huge servers and business intelligence software, pushed by technology providers and consultants armed with sales pitches with colorful anecdotes such as the Moneyball story in which general manager Billy Beane triumphed by using player statistics to predict the winning strategies for the Oakland A’s baseball team. If it worked for Billy Beane, it will work for your global multinational, too, right? Well, no.

The worship of big data is not new. Twenty-five years ago, technology salespeople peddled data using an old story about a retailer that spotted a correlation between diaper purchases and beer drinking, allowing a juicy cross-promotion of the two products for young fathers. Today, most data warehouses are glorified repositories of transaction data, with very little intelligence.

Working with multinationals as a management consultant, I have chased big data insights all my life and have never found them. What I have learned, however, is that local data has a lot of value. Put another way, big data is pretty useless, but small data is a rich source of insights. The probability of discovering new relationships at a local, highly contextual level and expanding it to universal insights is much higher than of uncovering a new law from the massive crunching of large amounts of data.

Read Gouillart’s article in full and pass it on. It’s time to usher in a quiet voice of sanity in this noisy, naive world of “more is better.”

Take care,

Predictive Analytics – Eric Siegel Lights the Way

May 31st, 2013

Predictive analytics is one of the most popular IT terms of our day, and like the others (Big Data, Data Science, etc.), it’s often defined far too loosely. People who work in the field of predictive analytics, however, use the term fairly precisely and meaningfully. No one, in my experience, does a better job of explaining predictive analytics—what it is, how it works, and why it’s important—than Eric Siegel, the founder of Predictive Analytics World, Executive Editor of the Predictive Analytics Times, and author of the new best-selling book in the field, Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die.

Predictive analytics is a computer-based application of statistics that has grown out of an academic discipline that is traditionally called machine learning. Yes, even though computers can’t think, they can learn (i.e., acquire useful knowledge from data). Siegel defines predictive analytics as “technology that learns from experience (data) to predict the future behavior of individuals in order to drive better decisions.” (p. 11)
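
To make Siegel’s definition concrete, here is a minimal sketch of a model that learns from data to score the likelihood that an individual will respond to an offer. It is purely illustrative: the data is synthetic, the feature names are hypothetical, and it uses scikit-learn’s logistic regression rather than anything from Siegel’s book.

```python
# Purely illustrative: a tiny model that "learns from experience (data)" to
# score the likelihood that an individual will respond to an offer.
# All data is synthetic and the feature names are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000

# Hypothetical features describing each individual
past_purchases = rng.poisson(3, n)
days_since_visit = rng.exponential(30, n)
emails_opened = rng.binomial(10, 0.4, n)
X = np.column_stack([past_purchases, days_since_visit, emails_opened])

# Synthetic "truth": response probability rises with engagement
logits = 0.4 * past_purchases - 0.03 * days_since_visit + 0.25 * emails_opened - 2.0
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# The model assigns each individual a probability of responding, which is
# what "driving better decisions" (whom to contact, whom to skip) acts on.
scores = model.predict_proba(X_test)[:, 1]
top_decile = scores >= np.quantile(scores, 0.9)
print("Response rate, top-scored decile:", round(y_test[top_decile].mean(), 3))
print("Response rate, everyone:         ", round(y_test.mean(), 3))
```

Even a toy model like this illustrates the practical appeal: ranking individuals by predicted probability, even imperfectly, changes who gets targeted and how decisions are made.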

I appreciate the fact that Siegel doesn’t gush about the wonders of data and technology to the hyperbolic degree that is common today; he keeps a level head as he describes what can be done in realistic and practical terms. Here’s what he says about data:

As data piles up, we have ourselves a genuine gold rush. But data isn’t the gold. I repeat, data in its raw form is boring crud. The gold is what’s discovered therein. (p. 4)

And again here:

Big data does not exist. The elephant in the room is that there is no elephant in the room. What’s exciting about data isn’t how much of it there is, but how quickly it is growing. We’re in a persistent state of awe at data’s sheer quantity because of one thing that does not change: There’s always so much more today than yesterday. Size is relative, not absolute. If we use the word big today, we’ll quickly run out of adjectives: “big data,” “bigger data,” “even bigger data,” and “biggest data.” The International Conference on Very Large Databases has been running since 1975. We have a dearth of vocabulary with which to describe a wealth of data…

There’s a ton of it—so what? What guarantees that all this residual rubbish, this by-product of organizational functions, holds value? It’s no more than an extremely long list of observed events, an obsessive-compulsive enumeration of things that have happened.

The answer is simple. Everything is connected to everything else—if only indirectly—and this is reflected in data…

Data always speaks. It always has a story to tell, and there’s always something to learn from it…Pull some data together and, although you can never be certain what you’ll find, you can be sure you’ll discover valuable connections by decoding the language it speaks and listening. (pp. 78 and 79)

Siegel demonstrates that you can embrace technology without becoming a drooling idiot sitting around the campfire singing Kumbayah and toasting the imminence of the Singularity while chugging homemade wine produced by an algorithm:

I have good news: a little prediction goes a long way. I call this The Prediction Effect, a theme that runs throughout the book. The potency of prediction is pronounced—as long as the predictions are better than guessing. The Effect renders predictive analytics believable. We don’t have to do the impossible and attain true clairvoyance. The story is exciting yet credible: Putting odds on the future to lift the fog just a bit off our hazy view of tomorrow means pay dirt. In this way, predictive analytics combats financial risk, fortifies healthcare, conquers spam, toughens crime fighting, and boosts sales. (p. XVI)

This is a great introduction to predictive analytics. It won’t teach you how to develop predictive models, but it surveys the territory, explains why it’s worthwhile, and points you in the right direction if you want to claim some of this territory as your own.

Take care,

Are You a Data Scientist?

May 30th, 2013

I’ve written a great deal during the past few months about Big Data, which is the most annoying, constantly-in-your-face information technology term of recent history. Another term has arisen in connection with Big Data that has generated its own share of hype and confusion: Data Science. I haven’t written about data science until now, but the following email from a data analyst named Kelly Martin has spurred me into action.

Dear Sir,

I am writing to ask you to please compose a post specifically addressing the new ‘data scientist’ hype, as this term is filtering down into organizations and causing all kinds of havoc. As a data analyst with a solid combination of education and experience, I am used to having terms thrown about by management who think ‘data mining’ is using VLOOKUP in Excel and love to present all their metric results as ‘significant’. Usually we can ignore it; they’ll find some poor kid to do what they want, and hopefully someone will point out that the Emperor has no clothes before they present their latest innovative analysis to a VP.

The Data Scientist buzz is powerful right now—perhaps it’s the exposure of Nate Silver’s election predictions. Managers who have zero statistical or data understanding are now expecting their analysts to produce inferential statistics believing this to be the solution to all their problems (or at least a career builder). These same managers don’t even know the difference between descriptive and inferential statistics or that they are asking for a research project that requires methodological rigor and time. (Research design to these guys means searching the Internet.)

Sometimes I wonder if your courses shouldn’t be specifically directed to middle and upper management. Most of the people data analysts are managed by have no real understanding of data, analysis, or statistics. This is why they are such easy marks for BI Vendors and Consultants.

I personally am no longer willing to educate my bosses. Organizations restructure so frequently now, who has time to train 3 managers a year? Besides, they often just grab on to terminology and use it inappropriately in meetings to sound knowledgeable. But I believe if you were to do a post explaining what a data analyst is and what the ‘data scientist’ hype is all about, we analysts could quietly forward it to our bosses in hopes they could learn something other than what they hear from Vendors and Consultants.

By the way, I love the Aptitudes and Attitudes of Effective Analysts section in your book Now You See It—you nailed it. Nowhere have I read a better description—it should be used for job descriptions. I just wish I could have gotten one boss to read it.

Thank-you

Kelly Martin

Kelly raises a legitimate concern, shared by many. She articulated it so well and with so much feeling, I couldn’t ignore her plea.

Similar to the term Big Data, which was coined in 1997, the term data science isn’t new. Gil Press did a great job of tracing its historical roots in a recent Forbes article titled “A Very Short History of Data Science.” Precursors of the term can be found in the writings of Princeton statistician John Tukey dating back to the 1960s. The precise term first appeared in Peter Naur’s 1974 book Concise Survey of Computer Methods, and it was later used in the title of a conference for the first time in 1996 (“Data science, classification, and related methods” at the biennial conference of the International Federation of Classification Societies).

It was in a paper written in 2001, however, by statistician William S. Cleveland, one of data visualization’s great pioneers, that the term was first used in a manner that’s fairly consistent with its current use: “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics.”

This document describes a plan to enlarge the major areas of technical work of the field of statistics. Because the plan is ambitious and implies substantial change, the altered field will be called “data science.” The focus of the plan is the practicing data analyst. A basic premise is that technical areas of data science should be judged by the extent to which they enable the analyst to learn from data.

Cleveland wanted to expand the perspective and toolkit of statisticians and apply their efforts to a broader range of real-world problems. He spoke of data science as a multidisciplinary effort, rooted in historical precedents.

The single biggest stimulus of new tools and theories of data science is the analysis of data to solve problems posed in terms of the subject matter under investigation. Creative researchers, faced with problems posed by data, will respond with a wealth of new ideas that often apply much more widely than the particular data sets that gave rise to the ideas. If we look back on the history of statistics—for example, R. A. Fisher inventing the design of experiments stimulated by agriculture data, John Tukey inventing numerical spectrum analysis stimulated by physical science and engineering data, and George Box inventing response surface analysis based on chemical process data—we see that the greatest advances have been made by people close to the analysis of data.

[Data science] carries statistical thinking to subject matter disciplines. This is vital. A very limited view of data science is that it is practiced by statisticians. The wide view is that data science is practiced by statisticians and subject matter analysts alike, blurring exactly who is and who is not a statistician.

Cleveland believed that statistics should be integrated more intimately into the real world and embrace a broader range of tools to explore and make sense of data. His perspective echoed that of Tukey, who many years earlier expressed concern that the term statistician reflected a narrow approach to data sensemaking, so he encouraged use of the term data analyst to promote a more open-minded statistics that embraced the use of software (still in its infancy at the time, long before personal computers) and data visualization. Tukey’s use of the term data analysis and Cleveland’s use of the term data science encouraged statisticians to embrace all of the technologies, techniques, and skills that were needed to derive greater value from a rapidly growing body of data. In the same spirit, today we can work to free data sensemaking from compartmentalization into separate disciplines of statistics, financial analysis, computer programming, data warehousing, data mining, business intelligence, data storytelling, data visualization, predictive analytics, and any other confined specialty. We can accomplish more by collaborating, tearing down the boundaries that have traditionally and in many respects artificially separated these disciplines, sometimes creating conflict and competition between them.

When defined in this way, the term data science is meaningful and useful. Understood in this way, the work of data science is not new; it is what good data analysts by any name have been doing for quite a while. As the term is used by technology vendors and technology analysts, or on resumes, however, data science is seldom more than marketing hype—a new name for old technologies and skills that rarely rise to the level of science.

When uttered from the lips of technology vendors, technology analysts, and suddenly rebranded data practitioners, like many marketing terms, data science is an ill-defined miasma of confusion. They usually use the term synonymously with Big Data. Here’s the description that appears on the website www.datasciencecentral.com: “Data Science Central is the industry’s online resource for big data practitioners.” New York University’s initiative in data science intimately joins these terms as well: “In order to unlock the powerful potential of this big data, the world needs researchers and professionals skilled in developing and utilizing automated methods of analyzing it. These individuals are called ‘data scientists’…” (http://datascience.nyu.edu/about/). Insight Data Science, which offers a training program that in six weeks equips people to “succeed as data scientists,” enthusiastically describes the term more broadly, but no more meaningfully.

Nowhere has the benefits of analyzing data been felt more strongly than at top technology companies. Silicon Valley companies are not only leading in the production of data, they are also on the cutting edge of using insights from that data to benefit their users. In fact, the role of data scientist, now used throughout industry to describe highly specialized analysts with deep quantitative abilities, was coined by the heads of the early data teams at Facebook and LinkedIn. They realized the process of asking questions about product use cases, taking measurements, verifying hypotheses and building upon those results closely mirrored the process by which science is done. The individuals, therefore, who apply their curiosity, quantitative skills and intellect toward understanding big data are now known as data scientists—a job title that is one of the most in-demand job roles at today’s leading technology companies.

Imagine becoming one of these “highly specialized analysts with deep quantitative abilities” in just six weeks. Despite the fact that it has been obvious to data sensemakers all along that their work is scientific in nature and method, the IT industry seized on the term “data scientist” as a revelation: one that it could harness to make money.

In some cases, vendors offer incoherent descriptions of data science beyond an association with Big Data that seem intentionally designed to keep the term nebulous. It is often to a vendor’s advantage to define what they sell in broad, vague strokes to seemingly fit the diverse expectations of customers. Here’s how data science is described on IBM’s website:

So what does a data scientist do?

A data scientist represents an evolution from the business or data analyst role. The formal training is similar, with a solid foundation typically in computer science and applications, modeling, statistics, analytics and math. What sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings to both business and IT leaders in a way that can influence how an organization approaches a business challenge. Good data scientists will not just address business problems, they will pick the right problems that have the most value to the organization.

The data scientist role has been described as “part analyst, part artist.” Anjul Bhambhri, vice president of big data products at IBM, says, “A data scientist is somebody who is inquisitive, who can stare at data and spot trends. It’s almost like a Renaissance individual who really wants to learn and bring change to an organization.”

Whereas a traditional data analyst may look only at data from a single source—a CRM system, for example—a data scientist will most likely explore and examine data from multiple disparate sources. The data scientist will sift through all incoming data with the goal of discovering a previously hidden insight, which in turn can provide a competitive advantage or address a pressing business problem. A data scientist does not simply collect and report on data, but also looks at it from many angles, determines what it means, then recommends ways to apply the data.

Data scientists are inquisitive: exploring, asking questions, doing “what if” analysis, questioning existing assumptions and processes. Armed with data and analytical results, a top-tier data scientist will then communicate informed conclusions and recommendations across an organization’s leadership structure.

So, according to IBM, a data scientist is someone who exhibits the following characteristics:

  • An evolved state of business analysis or data analysis expertise
  • Business knowledge
  • An ability to communicate to technical and business people alike
  • Inquisitiveness
  • Influence
  • A focus on problems that matter
  • Artistry
  • Pattern detection
  • A Jack of all trades (Renaissance individual)
  • A lifelong learner
  • A facilitator of change
  • A multi-source data integrator
  • A seeker of hidden gems
  • A multi-perspective observer
  • A person of action (insight applier)
  • Skepticism
  • Someone who asks “what if?”
  • Someone to whom leaders listen

Have I missed any? These are all great qualities, but they don’t constitute a useful definition of a data scientist. In fact, we could replace the title “data scientist” with “data analyst,” “statistician,” “BI professional,” or any of the other job titles that have been used over the years for expert data sensemakers and the list would work as well.

According to Wikipedia, data scientists are more narrowly defined: they must be trained in a specific scientific discipline and work as members of a team.

Data scientists solve complex data problems through employing deep expertise in some scientific discipline. It is generally expected that data scientists are able to work with various elements of mathematics, statistics and computer science, although expertise in these subjects are not required. However, a data scientist is most likely to be an expert in only one or two of these disciplines and proficient in another two or three. There is probably no living person who is an expert in all of these disciplines—if so they would be extremely rare. This means that data science must be practiced as a team, where across the membership of the team there is expertise and proficiency across all the disciplines.

There is no general agreement about the meaning of the term. Despite the good intentions of some who use it, such as William Cleveland when he first introduced it in 2001, it has led to confusion through freewheeling, self-serving use. When all is said and done, deriving value from data comes down to effective data sensemaking, resulting in greater understanding and better decisions. Regarding those who do the work, it only matters that they’re qualified to do it well. If you’re a hiring manager looking for someone to glean meaningful insights from your data, keep in mind that the title “data scientist” on a resume means nothing. IT titles are notoriously inflated.

In a recent article (“Data scientists don’t scale,” ZDNet, May 22, 2013), Andrew Brust described the situation poignantly:

There’s a risk that many technologists will become “data scientists” in the name of finding a better gig, in exactly the same way that happened with other lofty titles in technology (“architect,” for example). Title inflation happens in any field, but in the tech field, terms and titles are in any case viewed as metaphors, more than literal descriptions. Tech folks tend to take poetic license with titles, and those who don’t do so find themselves at a disadvantage compared to those who do.

If you manage people who work as data sensemakers, take the time to understand what they do and what’s required to do it well. Learn enough to see past the hype that vendors attach to this work to get you to buy their products and hire their consultants. Know enough to support your employees in their work, and then get out of their way. Most of all, never forget that only people can make sense of data. At best, tools can augment the abilities of talented people. There is indeed a science to data sensemaking, but data science by any other name (and there are many) would smell as sweet.

Take care,

A Preview of Tableau 9: Gauges?!

May 20th, 2013

If George Peck has his way (Peck wrote the only authorized book about Tableau 8 and has also served as the featured speaker in many of the Tableau 8 Roadshow events), the next version of Tableau will add flashy gauges to its library of charts. Here are his thoughts on the matter, as recorded in the latest newsletter from Peck’s consultancy The Ablaze Group:

Why Tableau Should Add a Gauge to Version 9

We’re wrapping up the Tableau 8 Roadshow (having now been shut out of two cities, including our own hometown of Denver, by airport weather cancellations). Tableau 8 is available and is enjoying rave reviews. And, while I was just getting around to fully digesting the old Tableau controversy about removing WikiLeaks visualizations, I just now heard about the new one that erupted when Stephen Few dissed Tableau about version 8. Despite my behind-the-time-ness, I simply must offer a contrast to Mr. Few’s thoughts.

There’s a place for visualization “experts.” Varying points of view are good. Educated opinions on visual best practices contribute to improved toolsets. But, can we all remember that there’s not any one person who knows all, or sees all, about any particular topic? My philosophy about “informed opinions,” including mine, is “Put this in your bucket of thoughts, shake or stir thoroughly, and benefit from the mix.” With this spirit of “mix of opinions” in mind, add the following to your bucket and shake it up.

Tableau should add a gauge mark type to Version 9.

We had an existing SAP BusinessObjects customer (Stephen Few’s never-ending scorn for this product is legendary) who approached us a while back inquiring about Tableau. “Our existing BI system has some issues. Some parts of it are slow and difficult to maintain. Can you give us an idea of where Tableau might improve this?” My reply was, “Sure… let me know where you have particular issues — where are your most painful areas?” But, before we could even begin to address these basic salient points, the prospect took it upon themselves to download a Tableau demo and begin to explore the product. The first follow-up was almost immediate; “How do I create a gauge in Tableau?”

I tried to move the customer back to the initial issues that had, theoretically, been the impetus for their initial inquiry. “Well, we can explore that. But, rather than just trying to mirror your current visuals, can we talk about where you have problems? Maybe there’s a better way, such as use of a bullet chart, to analyze those types of metrics. When can we talk?” The response, “Yeah, let me see when I can work a call into my schedule. But, for now, can you tell me how to create a gauge in Tableau?” I responded with a fairly extensive comparison of gauges versus bullet charts to analyze actual/goal data (this particular customer’s application of gauges). The next, final response: “Sorry, Tableau isn’t right for us.”

Yes, we can talk all day about this customer’s lack of insight, inability/unwillingness to look at anything other than “the way we’ve always done things,” and their refusal to sit down for even a basic discussion of their issues. In other words, this was a fairly normal situation (so, I’m admittedly now adopting the Stephen Few approach of “A few insults never hurt anybody”). In the final evaluation, Tableau’s lack of this ubiquitous mark type immediately prevented this (admittedly uninformed) prospect from discovering the beautiful, blazingly fast, tool that’s Tableau. Once they would have gotten over their gauge-itis, they too would have come on board.

Here’s the bottom line (expressed with an appropriate metaphor relating to New York Mayor Bloomberg’s recent come-uppance from a State Supreme Court): It’s not Tableau’s place to keep the 18 ounce sugary drink off the Mark Type Cafe menu. Visualization Seat Belts may help at the time of the Data Discovery collision, but Stephen Few simply doesn’t have the authority to mandate self-protection by all BI passengers.

Even if they choose to add the following confirmation dialog:

Tableau should offer a gauge in their next major release.

So many wise lessons can be found in this short article. Here’s what I’ve learned from Mr. Peck:

  1. “There’s a place for visualization ‘experts.'” Thank God for this. Previously, I had my doubts. (I have the impression, however, that my place, according to Peck, is in one of the rings of Dante’s Inferno.)
  2. To my great disappointment, I have now discovered that I do not know all or see all. At best, I must be satisfied with the status of semi-omniscient demigod.
  3. All opinions are of equal value. We needn’t vet them, but should just put them all in a hat, shake them up, and then…well I don’t actually know what we’re supposed to do with them next, but somehow we will “benefit from the mix.”
  4. Software vendors should give customers what they want, even when those things don’t work and are potentially harmful. If you give your customers crap, they will eventually figure out that it’s crap and reject it in favor of the other stuff that you gave them that isn’t crap.
  5. Teaching data visualization best practices is an attempt to make it illegal for people to do otherwise. Apparently, I don’t have the right to make people do what I want. Damn! Once again my divine self-image has been dismissed as an illusion.
  6. Visualizing data ineffectively isn’t such a big deal. It’s a lot like using a dirty word. At worst you might offend some prickly data Nazi like me, so go right ahead and do your worst.

You might think that I shouldn’t need to be taught these same lessons over and over again. How many times have I been told that I should shut up and stop caring about visualizing data in ways that actually work? Too many times to count. Chances are, I’ll never learn. I suspect that I will once again toss these lessons into that round object (no, not a hat) that keeps my life uncluttered by nonsense.

As Peck so graciously concedes, everyone has the right to an opinion, even pesky “informed opinions,” and he certainly has a right to his. What concerns me, however, is the degree to which his opinion reflects the perspective of Tableau. I recently learned that when my review of Tableau 8 was published, Tableau employees were forbidden from responding publicly. That makes this brief article by Peck the closest thing to a public response from Tableau that I’ve received.

Not all that long ago, I would have said that George Peck, someone who has supported SAP BusinessObjects for many years and continues to do so today, couldn’t possibly represent the position of Tableau. Now, I’m no longer sure. Unless the ban at Tableau on responding to me publicly is perpetual, I’d love to find out to what extent Tableau is still committed to best practices, which is what once made the product great and unique. Tableau’s customers, especially those who fell in love with the product because it avoided the silly stuff and made it easy to derive real value from data, deserve to know if Tableau has changed course. Who is calling the shots at Tableau these days: sales and marketing, from a near-term perspective of the quick win, or the information visualization experts such as Chris Stolte, Pat Hanrahan, and Jock Mackinlay who built the product with a clear vision rooted in best practices?

Take care,

A More Thoughtful but No More Convincing View of Big Data

May 13th, 2013

I have a problem with Big Data. As someone who makes his living working with data and helping others do the same as effectively as possible, I don’t object to data itself, but instead to the misleading claims that people often make about data when they refer to it as Big Data. I have frequently described Big Data as nothing more than a marketing campaign cooked up by companies that sell information technologies either directly (software and hardware vendors) or indirectly (analyst groups such as Gartner and Forrester). Not everyone who promotes Big Data falls into this specific camp, however. For example, several academics and journalists also write articles and books and give keynotes at conferences about Big Data. Perhaps people who aren’t directly motivated by increased sales revenues talk about Big Data in ways that are more meaningful? To examine this possibility, I recently read the best-selling book on the topic, Big Data: A Revolution That Will Transform How We Live, Work, and Think, by Viktor Mayer-Schönberger and Kenneth Cukier. Mayer-Schönberger is an Oxford professor of Internet governance and regulation, and Cukier is the data editor of the Economist.


Big Data: A Revolution That Will Transform How We Live, Work, and Think
Viktor Mayer-Schönberger and Kenneth Cukier
Houghton Mifflin Harcourt, 2013

I figured that if anyone had something useful to say about Big Data, these were the guys. What I found in their book, however, left me convinced even more than before that Big Data is a ruse, and one that should concern us.

What Is Big Data?

One of the problems with Big Data, like so many examples of techno-hype, is that it is ill-defined. What is Big Data exactly? The authors address this concern early in the book:

There is no rigorous definition of big data. Initially the idea was that the volume of information had grown so large that the quantity being examined no longer fit into the memory that computers use for processing, so engineers needed to revamp the tools they used for analyzing it all. (p. 6)

Given this state of affairs, I was hoping that the authors would propose a definition of their own to reduce some of the confusion. Unfortunately, they never actually define the term, but they do describe it in various ways. Here’s one of the descriptions:

The sciences like astronomy and genomics, which first experienced the explosion in the 2000s, coined the term “big data.” The concept is now migrating to all areas of human endeavor. (p. 6)

Actually, the term “Big Data” was coined back in 1997 in the proceedings of IEEE’s Visualization conference by Michael Cox and David Ellsworth in a paper titled “Application-controlled demand paging for out-of-core visualization.” Scientific data in the early 2000s was not our first encounter with huge datasets. For instance, I was helping the telecommunications and banking industries handle what they experienced as explosions in data quantity back in the early 1980s. What promoters of Big Data fail to realize is that data has been increasing at an exponential rate since the advent of the computer long ago. We have not experienced any actual explosions in the quantity of data in recent years. The exponential rate of increase has continued unabated all along.

How else do the authors define the term?

[Big Data is]…the ability of society to harness information in novel ways to produce useful insights or goods and services of significant value. (p. 2)

On the contrary, we’ve been finding “novel ways to produce” value from data forever, not just in the last few years. We haven’t crossed any threshold recently.

How else do they define it?

At its core, big data is about predictions. Though it is described as part of the branch of computer science called artificial intelligence, and more specifically, an area called machine learning, this characterization is misleading. Big data is not about trying to “teach” a computer to “think” like humans. Instead, it’s about applying math to huge quantities of data in order to infer probabilities. (pp. 11 and 12)

So, Big Data is essentially about “predictive analytics.” Did we only in recent years begin applying math to huge quantities of data in an attempt to infer probabilities? We neither began this activity recently, nor did data suddenly become huge.

Is Big Data a technology?

But where most people have considered big data as a technological matter, focusing on the hardware or the software, we believe the emphasis needs to shift to what happens when the data speaks. (p. 190)

I wholeheartedly agree. Where I and the authors appear to differ, however, is in our understanding of the methods that are used to find and understand the messages in data. The ways that we do this are not new. The skills that we need to make sense of data—skills that go by the names statistics, business intelligence, analytics, and data science—have been around for a long time. Technologies incrementally improve to help us apply these skills with greater ease to increasingly larger datasets, but the skills themselves have changed relatively little, even though we come up with new names for the folks who do this work every few years.

So what is it exactly that separates Big Data from data of the past?

One way to think about the issue today—and the way we do in this book—is this: big data refers to things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value, in ways that change markets, organizations, the relationship between citizens and governments, and more.

But this is just the start. The era of big data challenges the way we live and interact with the world. Most strikingly, society will need to shed some of its obsession for causality in exchange for simpler correlations: not knowing why but only what. This overturns centuries of established practices and challenges our most basic understanding of how to make decisions and comprehend reality. (pp. 6 and 7)

In other words, to “make decisions and comprehend reality” we no longer need to understand why things happen together (i.e., causation) but only what things happen together (correlation). When I read this, an eerie feeling crawled up my spine. The implications of this line of thinking are scary. Should we really race into the future satisfied with a level of understanding that is relatively superficial?

According to the authors, Big Data consists of “things one can do at a large scale that cannot be done at a smaller one.” And what are these things, and how exactly do they change how “we live and interact with the world”? Let’s see if the authors tell us.

Does Big Data Represent a Change of State?

One claim that the authors make, which is shared by other promoters of Big Data, is that data has grown so large and so fast that the increase in quantity constitutes a qualitative change of state.

Not only is the world awash with more information than ever before, but that information is growing faster. The change of scale has led to a change of state. The quantitative change has led to a qualitative one. (p. 6)

The essential point about big data is that change of scale leads to change of state. (p. 151)

All proponents make this claim about Big Data, but I’ve yet to see anyone substantiate it. Perhaps it is true that things can grow to such a size and at such a rate that they break through some quantitative barrier into the realm of qualitative change, but what evidence do we have that this has happened with data? This book contains many examples of data analytics that have been useful during the past 20 years or so, which the authors classify as Big Data, but I believe this attribution is contrived. Not one of the examples demonstrates a radical departure from the past. Here’s one involving Google:

Big data operates at a scale that transcends our ordinary understanding. For example, the correlation Google identified between a handful of search terms and the flu was the result of testing 450 million mathematical models. (p. 179)

I suspect that Google’s discovery of a correlation between search activity and incidents of the flu in particular areas resulted, not from 450 million distinct mathematical models, but rather from a predictive analytics algorithm making millions of minor adjustments during the process of building a single model. If they really created 450 million different models, or even if they actually made that many tweaks to an evolving model to find this relatively simple correlation, is this really an example of progress? A little statistical thinking by a human being could have found this correlation with the help of a computer much more directly. Regardless of how many models were actually used, the final model was not overly complicated. What was done here does not transcend the ordinary understanding of data analysts.
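
To illustrate what I mean by “a little statistical thinking,” the sketch below screens a set of candidate search terms by their correlation with flu incidence. The data is entirely synthetic and the approach is ordinary correlation screening, not Google’s actual methodology; it simply shows how directly a relationship like this can be surfaced.

```python
# Illustrative only: synthetic weekly series standing in for search-term
# volumes and reported flu incidence. Plain correlation screening; this is
# not Google's actual methodology.
import numpy as np

rng = np.random.default_rng(1)
weeks = 150

# Synthetic flu incidence with a seasonal pattern
flu = np.clip(np.sin(np.linspace(0, 6 * np.pi, weeks)) + rng.normal(0, 0.2, weeks), 0, None)

# Hypothetical candidate search terms: a couple track the flu, the rest are noise
terms = {f"term_{i}": rng.normal(0, 1, weeks) for i in range(200)}
terms["flu symptoms"] = 3 * flu + rng.normal(0, 0.3, weeks)
terms["fever remedy"] = 2 * flu + rng.normal(0, 0.5, weeks)

# Rank every candidate term by the strength of its correlation with incidence
ranked = sorted(
    ((name, np.corrcoef(series, flu)[0, 1]) for name, series in terms.items()),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
for name, r in ranked[:5]:
    print(f"{name:15s}  r = {r:+.2f}")
```

However Google actually proceeded, the underlying relationship is the kind that straightforward screening, guided by a person who knows what question to ask, can uncover without hundreds of millions of models.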

And now, for the paradigm-shattering implications of this change of state:

Big data is poised to reshape the way we live, work, and think. The change we face is in some ways even greater than those sparked by earlier epochal innovations that dramatically expanded the scope and scale of information in society. The ground beneath our feet is shifting. Old certainties are being questioned. Big data requires fresh discussion of the nature of decision-making, destiny, justice. A worldview we thought was made of causes is being challenged by a preponderance of correlations. The possession of knowledge, which once meant any understanding of the past, is coming to mean an ability to predict the future. (p. 190)

Does any of this strike you as particularly new? Everything that the authors claim as particular and new to Big Data is in fact old news. If you’re wondering what they mean by the “worldview we thought was made of causes is being challenged by a preponderance of correlations,” stay tuned; we’ll look into this soon.

Who Is the Star of Big Data?

Who or what in particular is responsible for the capabilities and potential benefits of big data? Are technologies responsible? Are data scientists responsible? Here’s the authors’ answer:

The real revolution is not in the machines that calculate data but in data itself and how we use it. (p. 7)

What we can do with data today was not primarily enabled by machines, but it is also not intrinsic to the data itself. Nothing about the nature of data has changed. Data is always noise until it provides an answer to a question that is asked to solve a problem or take advantage of an opportunity.

In the age of big data, all data will be regarded as valuable, in and of itself. (p. 100)

With big data, the value of data is changing. In the digital age, data shed its role of supporting transactions and often became the good itself that was traded. In a big-data world, things change again. Data’s value shifts from its primary use to its potential future uses. That has profound consequences. It affects how businesses value the data they hold and who they let access it. It enables, and may force, companies to change their business models. It alters how organizations think about data and how they use it. (p. 99)

God help us. This notion should concern us because most data will always remain noise beyond its initial use. We certainly find new uses for data that was originally generated for another purpose, such as transaction data that we later use for analytical purposes to improve decisions, but in the past we rarely collected data primarily for potential secondary uses. Perhaps this is a characteristic that actually qualifies as new. Regardless, we must ask the question, “Is this a viable business model?” Should all organizations begin collecting and retaining more data in hope of finding unforeseen secondary uses for it in the future? I find it hard to imagine that secondary uses of data will provide enough benefit to warrant collecting everything and keeping it forever, as the authors seem to believe. Despite their argument that this is a no-brainer based on decreasing hardware costs, the price is actually quite high. The price is not based on the cost of hardware alone.

Discarding data may have been appropriate when the cost and complexity of collecting, storing, and analyzing it were high, but this is no longer the case. (p. 60)

Every single dataset is likely to have some intrinsic, hidden, not yet unearthed value, and the race is on to discover and capture all of it. (p. 15)

Data’s true value is like an iceberg floating in the ocean. Only a tiny part of it is visible at first sight, while much of it is hidden beneath the surface. (p. 103)

Imagine the time that will be wasted and how much it will cost. Only a tiny fraction of the data that is being generated today will ever be valuable beyond its original use. A few nuggets of gold might exist in that iceberg below the water line, but do we really need to collect and save it all? Even the authors concede that data’s value doesn’t persist indefinitely.

Most data loses some of its utility over time. In such circumstances, continuing to rely on old data doesn’t just fail to add value; it actually destroys the value of fresher data. (p. 110)

Collecting, storing, and retaining everything will make it harder and harder to focus on the little that actually has value. Nevertheless, the authors believe that the prize will go to those with the most.

Scale still matters, but it has shifted [from technical infrastructure]. What counts is scale in data. This means holding large pools of data and being able to capture even more of it with ease. Thus large data holders will flourish as they gather and store more of the raw material of their business, which they can reuse to create additional value. (p. 146)

If this were true, wouldn’t the organizations with the most data today be the most successful? This isn’t the case. In fact, many organizations with the most data are drowning in it. I know, because I’ve tried to help them change this dynamic. Having lots of data is useless unless you know how to make sense of it and how to apply what you learn.

Attempts to measure the potential value of data used for secondary purposes have so far been little more than wild guesses. Consider the way that Facebook was valued prior to going public.

Doug Laney, vice president of research at Gartner, a market research firm, crunched the numbers during the period before the initial public offering (IPO) and reckoned that Facebook had collected 2.1 trillion pieces of “monetizable content” between 2009 and 2011, such as “likes,” posted material, and comments. Compared against its IPO valuation, this means that each item, considered as a discrete data point, had a value of about five cents. Another way of looking at it is that every Facebook user was worth around $100, since users are the source of the information that Facebook collects.

How to explain the vast divergence between Facebook’s worth under accounting standards ($6.3 billion) and what the market initially valued it at ($104 billion)? There is no good way to do so. (p. 119)

Indeed, we are peering at tea leaves, trying to find meaning in drippy green clumps. Unforeseen uses for data certainly exist, but they are by definition difficult to anticipate. Do we really want to collect, store, and retain everything possible on the off chance that it might be useful? Perhaps, instead, we should look for ways to identify data with the greatest potential for future use and focus on collecting that? The primary problem that we still have with data is not the lack of it but our inability to make sense of it.
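
For what it’s worth, the quoted figures are easy to check. The quick arithmetic below assumes roughly one billion users around the time of the IPO, an assumption of mine rather than a number stated in the quotation:

```python
# Back-of-the-envelope check of the quoted Facebook figures. The user count
# is my assumption (roughly one billion around the 2012 IPO); the other
# numbers come from the passage quoted above.
ipo_valuation = 104e9    # market valuation at IPO, in dollars
book_value = 6.3e9       # worth under accounting standards, in dollars
content_items = 2.1e12   # "monetizable" items collected 2009-2011 (Laney's estimate)
users = 1.0e9            # assumed number of users

print(f"Value per content item:   ${ipo_valuation / content_items:.3f}")      # about five cents
print(f"Value per user:           ${ipo_valuation / users:,.0f}")             # about $100
print(f"Gap unexplained by books: ${(ipo_valuation - book_value) / 1e9:.1f} billion")
```

The point is not the precision of these numbers but how little they explain: nearly $98 billion of valuation rests on guesses about the future value of data, which is exactly the tea-leaf reading I’m describing.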

Does Correlation Alone Suffice With No Concern for Causation?

The authors introduce one of their strangest claims in the following sentence:

The ideal of identifying causal mechanisms is a self-congratulatory illusion; big data overturns this. (p. 18)

Are they arguing that Big Data eliminates our quest for an understanding of cause altogether?

Causality won’t be discarded, but it is being knocked off its pedestal as the primary fountain of meaning. Big data turbocharges non-causal analyses, often replacing causal investigations. (p. 68)

Apparently science can now take a back seat to Big Data.

Correlations exist; we can show them mathematically. We can’t easily do the same for causal links. So we would do well to hold off from trying to explain the reason behind the correlations: the why instead of the what. (p. 67)

This notion scares me. Correlations, although useful in and of themselves, must be used with caution until we understand the causal mechanisms related to them.

We progress by seeking and finding ever better explanations for reality—what is, how it works, and why. Explanations—the commitment to finding them and the process of developing and confirming them—are the essence of science. By rebelling against authority as the basis of knowledge, the Enlightenment began the only sustained period of progress that our species has ever known (see The Beginning of Infinity, by David Deutsch). Trust in established authority was replaced by a search for testable explanations, called science. The information technologies of today are a result of this search for explanations. To say that we should begin to rely on correlations alone without concern for causation encourages a return from the age of science to the age of ignorance that preceded it. Prior to science, we lived in a world of myth. Even then, however, we craved explanations, but we lacked the means to uncover them, so we fabricated explanations that provided comfort or that kept those in power in control. To say that explanations are altogether unnecessary today in the world of Big Data is a departure from the past that holds no hope for the future. Making use of correlations without understanding causation might indeed be useful at times, but it isn’t progress, and it is prone to error. Manipulation of reality without understanding is a formula for disaster.

The authors take this notion even further.

Correlations are powerful not only because they offer insights, but also because the insights they offer are relatively clear. These insights often get obscured when we bring causality back into the picture. (p. 66)

Actually, thinking in terms of causality is the only way that correlations can be fully understood and utilized with confidence. Only when we understand the why (cause) can we intelligently leverage our understanding of the what (correlation). This is essential to science. As such, we dare not diminish its value.
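
A classic confounding scenario shows why. In the sketch below (my own illustration, not an example from the book), a hidden variable drives both x and y. The two correlate strongly, yet changing x accomplishes nothing, and the correlation alone cannot tell you that:

```python
# Illustrative only: a hidden common cause z drives both x and y.
# x and y correlate strongly, but changing x has no effect on y.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

z = rng.normal(0, 1, n)              # hidden common cause
x = 2 * z + rng.normal(0, 0.5, n)    # x depends on z, not on y
y = 3 * z + rng.normal(0, 0.5, n)    # y depends on z, not on x

print("corr(x, y) =", round(np.corrcoef(x, y)[0, 1], 2))   # strong correlation

# "Intervene" on x: set it to whatever we like. y is untouched, because the
# causal path runs through z, not through x.
x_new = rng.normal(10, 1, n)
print("corr(x_new, y) =", round(np.corrcoef(x_new, y)[0, 1], 2))   # roughly zero
```

Acting on the first correlation without asking why it exists would be a costly mistake, and no amount of additional data changes that.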

Big data does not tell us anything about causality. (p. 163)

Huh? Without data we cannot gain an understanding of cause. Does Big Data lack information about cause that other data contains? No. Data contains this information no matter what its size.

According to the authors, Big Data leverages correlations alone in an enlightening way that wasn’t possible in the past.

Correlations are useful in a small-data world, but in the context of big data they really shine. Through them we can glean insights more easily, faster, and more clearly than before. (p. 52)

Big data is all about seeing and understanding the relations within and among pieces of information that, until very recently, we struggled to fully grasp. (p. 19)

What exactly is it about Big Data that enables us to see and understand relationships among data that were elusive in the past? What evidence is there that this is happening? Nowhere in the book do the authors answer these questions in a satisfying way. We have always taken advantage of known correlations, even when we have not yet discovered what causes them, but this has never deterred us in our quest to understand causation. God help us if it ever does.

Does Big Data Transform Messiness into a Benefit?

Not only do we no longer need to concern ourselves with causation, according to the authors we can also stop worrying about data quality.

In a world of small data, reducing errors and ensuring high quality of data was a natural and essential impulse. Since we only collected a little information, we made sure that the figures we bothered to record were as accurate as possible…Analyzing only a limited number of data points means errors may get amplified, potentially reducing the accuracy of the overall results…However, in many new situations that are cropping up today, allowing for imprecision—for messiness—may be a positive feature, not a shortcoming. It is a tradeoff. In return for relaxing the standards of allowable errors, one can get ahold of much more data. It isn’t just that “more trumps some,” but that, in fact, sometimes “more trumps better.” (pp. 32 and 33)

Not only can we stop worrying about messiness in data, we can embrace it as beneficial.

In dealing with ever more comprehensive datasets, which capture not just a small sliver of the phenomenon at hand but much more or all of it, we no longer need to worry so much about individual data points biasing the overall analysis. Rather than aiming to stamp out every bit of inexactitude at increasingly high cost, we are calculating with messiness in mind…Though it may seem counterintuitive at first, treating data as something imperfect and imprecise lets us make superior forecasts, and thus understand our world better. (pp. 40 and 41)

Hold on. Something is amiss in the authors’ reasoning here. While it is true that a particular amount of error in a set of data becomes less of a problem if that quantity holds steady as the total amount of data increases and becomes huge, a particular rate of error remains just as much of a problem as the data set grows in size. More does not trump better data.
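
A quick simulation, my own illustration rather than anything from the book, makes the distinction plain. A fixed number of bad records is diluted as the dataset grows; a fixed rate of bad records is not:

```python
# Illustrative only: a fixed *amount* of error washes out as n grows;
# a fixed *rate* of error does not.
import numpy as np

rng = np.random.default_rng(3)
true_mean = 50.0

for n in (1_000, 100_000, 10_000_000):
    data = rng.normal(true_mean, 5, n)

    fixed_amount = data.copy()
    fixed_amount[:100] += 500            # the same 100 corrupted records at any size

    fixed_rate = data.copy()
    bad = rng.random(n) < 0.01           # 1 percent of records corrupted at any size
    fixed_rate[bad] += 500

    print(f"n = {n:>10,}   "
          f"bias from fixed amount: {fixed_amount.mean() - true_mean:7.3f}   "
          f"bias from fixed rate: {fixed_rate.mean() - true_mean:7.3f}")
```

More data dilutes a fixed quantity of errors, but a constant error rate contaminates a huge dataset just as thoroughly as a small one.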

The way we think about using the totality of information compared with smaller slivers of it, and the way we may come to appreciate slackness instead of exactness, will have profound effects on our interaction with the world. As big-data techniques become a regular part of everyday life, we as a society may begin to strive to understand the world from a far larger, more comprehensive perspective than before, a sort of N = all of the mind. And we may tolerate blurriness and ambiguity in areas where we used to demand clarity and certainty, even if it had been a false clarity and an imperfect certainty. We may accept this provided that in return we get a more complete sense of reality—the equivalent of an impressionist painting, wherein each stroke is messy when examined up close, but by stepping back one can see a majestic picture.

Big data, with its emphasis on comprehensive datasets and messiness, helps us get closer to reality than did our dependence on small data and accuracy. (p. 48)

Will impressionist data provide a more accurate and useful view of the world? I love impressionist paintings, but they’re not what I study to get a clear picture of the world.

Now, if you work in the field of data quality, let me warn you that what’s coming next will shock and dismay you. Perhaps you should sit down and take a valium before reading on.

The industry of business intelligence and analytics software was long built on promising clients “a single version of the truth”…But the idea of “a single version of the truth” is doing an about-face. We are beginning to realize not only that it may be impossible for a single version of the truth to exist, but also that its pursuit is a distraction. To reap the benefits of harnessing data at scale, we have to accept messiness as par for the course, not as something we should try to eliminate. (p. 44)

Seriously? Those who work in the realm of data quality realize that if we give up on the idea of consistency in our data and embrace messiness, the problems that are created by inconsistency will remain. When people in different parts of the organization are getting different answers to the same questions because of data inconsistencies, no amount of data will make this go away.

So what is it that we get in exchange for our willingness to embrace messiness?

In return for living with messiness, we get tremendously valuable services that would be impossible at their scope and scale with traditional methods and tools. According to some estimates only 5 percent of all digital data is ‘structured’—that is, in a form that fits neatly into a traditional database. Without accepting messiness, the remaining 95 percent of unstructured data, such as web pages and videos, remain dark. By allowing for imprecision, we open a window into an untapped universe of insights. (p. 47)

The authors seem to ignore the fact that most data cannot be analyzed until it is structured and quantified. Only then can it produce any of the insights that the authors applaud in this book. It doesn’t need to reside in a so-called structured database, but it must at a minimum be structured in a virtual sense.
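
Even the “unstructured” 95 percent yields insight only after someone gives it structure. The sketch below shows the most minimal version of that step, turning raw text into a table of term counts that can actually be analyzed; it is a toy illustration, not any vendor’s pipeline:

```python
# Illustrative only: "unstructured" text becomes analyzable the moment it is
# structured and quantified, here as a simple document-by-term count table.
from collections import Counter

documents = [
    "shipment delayed again, customer unhappy",
    "customer praised the fast shipment",
    "delayed delivery, refund requested by customer",
]

def tokens(text):
    """Lowercase words with trailing punctuation stripped."""
    return [word.strip(",.").lower() for word in text.split()]

vocabulary = sorted({word for doc in documents for word in tokens(doc)})
table = [[Counter(tokens(doc))[term] for term in vocabulary] for doc in documents]

print(vocabulary)
for row in table:
    print(row)
```

Whether this structuring happens in a database schema or on the fly in code, it has to happen before the data can “speak.”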

Does Big Data Reduce the Need for Subject Matter Expertise?

I was surprised when I read these two authors, both subject matter experts in their particular realms, state the following:

We are seeing the waning of subject-matter experts’ influence in many areas. (p. 141)

They argue that subject matter experts will be substantially displaced by Big Data, because data contains a better understanding of the world than the experts do.

Yet expertise is like exactitude: appropriate for a small-data world where one never has enough information, or the right information, and thus has to rely on intuition and experience to guide one’s way. (p. 142)

This separation of subject matter expertise on the one hand from what we can learn from data on the other is artificial. All true experts are informed by data. The best experts are well informed by data. Data about things existed long before the digital age.

At one point in the book the authors quote Hal Varian, formerly a professor in the computer science department at UC Berkeley and now Google’s chief economist: “Data is so widely available and so strategically important that the scarce thing is the knowledge to extract wisdom from it” (p. 125). What they seem to miss is the fact that Varian, in the same interview from which this quotation was derived, talks about the need for subject matter experts such as managers to become better informed by data to do their jobs. People become better subject matter experts when they become better acquainted with pertinent data. These experts will not be displaced by data; they will be enriched by it as they always have been, hopefully to an increasing degree.

As we’ve seen, the pioneers in big data often come from fields outside the domain where they make their mark. They are specialists in data analysis, artificial intelligence, mathematics, or statistics, and they apply those skills to specific industries. (p. 142)

These Big Data pioneers don’t perform these wonders independently of domain expertise but in close collaboration with it. Data sensemaking skills do not replace or supplant subject matter expertise; they inform it.

To illustrate how Big Data is displacing the subject matter experts in one industry, the authors write the following about the effects of Big Data in journalism:

This is a humbling reminder to the high priests of mainstream media that the public is in aggregate more knowledgeable than they are, and that cufflinked journalists must compete against bloggers in their bathrobes. (p. 103)

“Cufflinked journalists”? Where did this characterization of journalists come from? Perhaps co-author Kenneth Cukier, the data editor of The Economist, has a bone to pick with other journalists who don’t understand or respect his work. Whatever the source of this enmity toward mainstream media, I don’t want to rely on bloggers for my news of the world. While bloggers have useful information to share, unless they develop journalistic skills, they will not replace mainstream journalists. This is definitely one of those cases in which the sheer amount of information, much of it noise in the blogosphere, cannot replace thoughtful and skilled reporting.

Perhaps the strangest twist on this theme that the authors promote is contained in the following paragraph:

Perhaps, then, the crux of the value is really in the skills? After all, a gold mine isn’t worth anything if you can’t extract the gold. Yet the history of computing suggests otherwise. Today expertise in database management, data science, analytics, machine-learning algorithms, and the like are in hot demand. But over time, as big data becomes more a part of everyday life, as the tools get better and easier to use, and as more people acquire the expertise, the value of the skills will also diminish in relative terms…Today, in big data’s early stages, the ideas and the skills seem to hold the greatest worth. But eventually most value will be in the data itself. (p. 134)

The value of skills and expertise will not diminish over time. When programming jobs began to be offshored, the value of programming wasn’t diminished, even though competition reduced what individual programmers could charge. No shift in value will occur from skills and expertise to data itself. Data will forever remain untapped, inert, and worthless without the expertise required to make sense of it and tie it to existing knowledge.

What Is a Big Data Mindset and Is It New?

Data represents potential. It always has. From the time our brains first evolved to conceive of data (facts), through the development of language, writing, movable type, the Age of Enlightenment, the emergence of computers, and the advent of the Internet, we have always recognized the potential of data to become knowledge when understood and to be useful when applied. Has a new data mindset arisen in recent years?

Seeing the world as information, as oceans of data that can be explored at ever greater breadth and depth, offers us a perspective on reality that we did not have before. It is a mental outlook that may penetrate all areas of life. Today, we are a numerate society because we presume that the world is understandable with numbers and math. And we take for granted that knowledge can be transmitted across time and space because the idea of the written word is so ingrained. Tomorrow, subsequent generations may have a “big-data consciousness”—the presumption that there is a quantitative component to all that we do, and that data is indispensable for society to learn from. The notion of transforming the myriad dimensions of reality into data probably seems novel to most people at present. But in the future, we will surely treat it as a given. (p. 97)

Perhaps this mindset is novel for some, but it is ancient in origin, and it has been my mindset for my entire 30-year career. Nothing about this is new.

In a big-data age, we finally have the mindset, ingenuity, and tools to tap data’s hidden value. (p. 104)

Poppycock! We’ve always searched for hidden value in data. What we haven’t done as much in the past is collect everything in the hope that it contains a goldmine of hidden wealth if we only dig hard and long enough. What has yet to be determined is the net value of this venture.

What Are the Risks of Big Data?

Early in the book the authors point out that they are observers of Big Data, not evangelists. They seem to be both. They certainly promote Big Data with enthusiasm. Their chapter on the risks of Big Data does not negate this fact. What’s odd is that the risk that seems to concern them most is one that is and will probably always remain science fiction. They are concerned that those in power, such as governments, will use Big Data to predict the bad behavior of individuals and groups and then, based on those predictions alone, act preemptively by arresting people for crimes that have not yet been committed.

It is the quintessential slippery slope—leading straight to the society portrayed in Minority Report, a world in which individual choice and free will have been eliminated, in which our individual moral compass has been replaced by predictive algorithms and individuals are exposed to the unencumbered brunt of collective fiat. If so employed, big data threatens to imprison us—perhaps literally—in probabilities. (p. 163)

Despite preemptive acts of government that were later exposed as mistakes (e.g., the invasion of Iraq on the grounds that it supposedly possessed weapons of mass destruction), and despite insurance companies and credit agencies that deny coverage or loans by profiling certain groups of people as risky in the aggregate, the threat of being arrested because an algorithm predicted that I would commit a crime does not concern me.

Big data erodes privacy and threatens freedom. But big data also exacerbates a very old problem: relying on the numbers when they are far more fallible than we think. (p. 163)

This is indeed a threat, but the authors’ willingness to embrace messiness in Big Data only exacerbates it.

The threat is that we will let ourselves be mindlessly bound by the output of our analyses even when we have reasonable grounds for suspecting something is amiss. Or that we will become obsessed with collecting facts and figures for data’s sake. (p. 166)

This obsession with “collecting facts and figures for data’s sake” is precisely what the authors promote in this book.

In Summary

The authors of this book are indeed evangelists for the cause of Big Data. Even though one is an academic and the other an editor, both make their living by observing, using, and promoting technology. There’s nothing wrong with this, but the objective observer’s perspective on Big Data that I was hoping to find in this book wasn’t there.

Is Big Data the paradigm-shifting new development that the authors and technology companies claim it to be, or is the data of today part of a smooth continuum extending from the past? Should we adopt the mindset that all data is valuable in and of itself and that “more trumps better”? Should we dig deep into our wallets to create the ever-growing infrastructure that would be needed to indiscriminately collect, store, and retain more?

Let me put this into perspective. While recently reading the book Predictive Analytics by Eric Siegel, I learned about the research of Larry Smarr, the director of a University of California-based research center, who is “tracking all bodily functions, including the scoop on poop, in order to form a working computational model of the body as an ecosystem.” Smarr asks and answers:

Have you ever figured how information-rich your stool is? There are about 100 billion bacteria per gram. Each bacterium has DNA…This means that human stool has a data capacity of 100,000 terabytes of information stored per gram.

This is fascinating. I’m not being sarcastic; it really is. I think Smarr’s research is worthwhile. I don’t think, however, that we should all continuously save what we contribute to the toilet, convert its contents into data, and then store that data for the rest of our lives. If we did, we would quite literally become buried in shit. Not all data is of equal importance.
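
For what it’s worth, Smarr’s figure holds up to a rough back-of-the-envelope check. The sketch below is my own arithmetic: the bacteria count comes from the quotation above, but the genome size and the two-bits-per-base encoding are my assumptions, not figures from Siegel’s book.

```python
# Back-of-the-envelope check of the "100,000 terabytes per gram" figure.
# Assumptions (mine, not Smarr's): a bacterial genome of roughly 4-5 million
# base pairs, encoded at 2 bits per base pair.
bacteria_per_gram = 100e9                           # ~100 billion bacteria (from the quote)
base_pairs_per_genome = 4.5e6                       # assumed typical bacterial genome
bytes_per_genome = base_pairs_per_genome * 2 / 8    # 2 bits per base pair -> ~1.1 MB

bytes_per_gram = bacteria_per_gram * bytes_per_genome
terabytes_per_gram = bytes_per_gram / 1e12
print(f"{terabytes_per_gram:,.0f} TB per gram")     # on the order of 100,000 TB
```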

The authors of the book Big Data: A Revolution That Will Transform How We Live, Work, and Think, in a thoughtful moment of clarity, included the following note of realism:

What we are able to collect and process will always be just a tiny fraction of the information that exists in the world. It can only be a simulacrum of reality, like the shadows on the wall of Plato’s cave. (p. 197)

This is beautifully expressed and absolutely true. Data exists in a potentially infinite supply. Given this fact, wouldn’t it be wise to determine with great care what we collect, store, retain, and mine for value? To the extent that more people are turning to data for help these days, learning to depend on evidence rather than intuition alone to inform their decisions, should we accept the Big Data campaign as helpful?

We can turn people on to data without claiming that something miraculous has changed in the data landscape over the last few years. The benefits of data today are the same benefits that have always existed. The skills needed to tap this potential have changed relatively little over the course of my long career. As data continues to increase in volume, velocity, and variety, as it has since the advent of the computer, its potential for wise use increases as well, but only if we refine our ability to separate the signals from the noise. More does not trump better. Without the right data and skills, more will only bury us.

Take care,