Big Mouths on Big Data

Yesterday, I read an article on the website of Scientific American titled “Saving Big Data from Big Mouths” by Cesar A. Hidalgo. As you know if you read this blog regularly, I have grave concerns about the hyperbolic claims of Big Data and believe that it is little more than a marketing campaign to sell expensive technology products and services. In his article, Dr. Hidalgo, who teaches in the MIT Media Lab, challenges several recent articles in prominent publications that criticize the claims of Big Data. The fact that he characterizes naysayers as “big mouths” clues you into his perspective on the matter. In reading his article, I discovered that Dr. Hidalgo’s understanding of Big Data is limited, as is often the case with academics, and his position suffers from a fundamental problem with Big Data, which is that Big Data, as he defines it, doesn’t actually exist. When I say that it doesn’t exist, I’m arguing that Big Data, as he’s defined it, isn’t new or qualitatively different from data in the past. Big Data is just data.

I expressed my concerns to Dr. Hidalgo by posting the following comments in response to his article:

Dr. Hidalgo,

I’m one of the naysayers in response to the claims of so-called Big Data. I’m concerned primarily with the hype that leads organizations to waste money chasing new technologies rather than developing the skills that are needed to glean value from data. One of the fundamental problems with Big Data is the fact that no two people define it in the same way, so it is difficult to discuss it intelligently. In this article, you praised the benefits of Big Data, but did not define it. What do you mean by Big Data? How is Big Data different from other data? When did data become big? Are the means of gleaning value from so-called Big Data different from the means of gleaning value from data in general?

Thanks,

Stephen Few

He was kind enough to respond with the following.

Dear Stephen Few,

These are all very good questions.

First, regarding the definition of big data:

As you probably know well, the term big data is used colloquially to refer mostly to digital traces of human activities. These include cell phone data, credit card records and social media activity. Big data is also used occasionally to refer to data generated by some scientific experiments (like CERN or genomic data), although this is not the most common use of the phrase so I will stick to the “digital traces of human activity” definition for now.

Beyond the colloquial definition, I have a working definition of big data that I use on occasion. To keep things simple I say that big data needs to be three times big, meaning that it needs to be big in size, resolution and scope. The size dimension is relatively obvious (data on 20 or 30 individuals is not the same as data on hundreds of thousands of them). The resolution dimension is better explained by an example. Consider having credit card data on a million people. A low resolution version of this dataset would consist only of the total yearly expenditure of each individual. A high resolution version would include information on when and where the purchases were made. In this example, it is the resolution of the data that allows us to use it to study, for instance, the mobility of this particular group of people (notice that I am not generalizing to the general population, since this subpopulation might be worthy of study on its own). Finally, I require data to be big in scope. By this, I mean that the data must be useful for applications other than the ones for which it was originally collected. For instance, mobile phone records are used by operators for billing purposes, but could be used to forecast traffic or to identify the location of mobile phone users prior to a natural disaster (and use this information to help speed up search and rescue operations). When data is big in size, resolution and scope, I am comfortable saying that it is big.
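A minimal sketch of the resolution and scope distinctions (the records below are invented, purely for illustration): the same purchases can be kept as a low-resolution yearly total per person or as high-resolution, timestamped and located transactions, and only the latter supports a secondary use such as studying movement.

```python
# Illustrative only: a handful of invented credit card records.
from collections import defaultdict

# High resolution: individual purchases with time and place.
transactions = [
    {"person": "A", "amount": 42.50, "when": "2014-03-02 09:15", "where": "Cambridge"},
    {"person": "A", "amount": 12.00, "when": "2014-03-02 18:40", "where": "Boston"},
    {"person": "B", "amount": 230.00, "when": "2014-06-11 14:05", "where": "Somerville"},
]

# Low resolution: the same data reduced to a yearly total per person.
yearly_total = defaultdict(float)
for t in transactions:
    yearly_total[t["person"]] += t["amount"]
print(dict(yearly_total))        # {'A': 54.5, 'B': 230.0}

# Scope: the high-resolution records answer a question the data was never
# collected for, such as which places each person moved between.
places = defaultdict(list)
for t in transactions:
    places[t["person"]].append(t["where"])
print(dict(places))              # {'A': ['Cambridge', 'Boston'], 'B': ['Somerville']}
```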

Second: Your question about when big data became big.

This is an interesting question because it points to the evolution of language. During the last decade the word data has begotten at least two children: “metadata” and “big data”. The word metadata grew in popularity in the wake of the NSA scandal, as people needed to differentiate between the content of messages and their metadata. Big data, on the other hand, emerged as people searched for a short way to refer to the digital traces of human activity that were collected for operational purposes by service providers serving large populations, and that could be used for purposes that were beyond those for which the data was originally collected. Certainly, the phrase “big data” provides an economy of language, and as someone who enjoys writing I always appreciate that.

Regarding the time at which this transition happened, I remember that when I started working with mobile phone data (in 2004) people were not using the word big data. As more people entered the field, the word began to gain force (around 2008). With the financial crisis, the hype of big data entered full swing, as many framed big data as the new asset, or technology, that could save the economy. I guess at that time, everyone wanted to believe them :-).

Third and Final: You ask whether the means of gleaning value from big data include the methods used to glean value from data in general.

In short, the answer is yes (it is data after all). Multivariate regressions in all of their forms and specifications are still useful and welcome (I use them often). Yet, these new datasets have also stimulated the proliferation of some additional techniques. For instance, visualizations have progressed enormously during recent years, since exploring these datasets is not easy and large datasets involve more exploration. As an example, check dataviva.info. This site makes available more than 100 million visualizations to help people explore Brazil’s formal sector economy. By taking different combinations of visualizations it is possible to weave stories about industries and locations. An example of these stories for a related project, The Observatory of Economic Complexity (atlas.media.mit.edu), can be seen in this video (http://vimeo.com/40565955). Here you will be able to see how these visualization techniques allow people to quickly compose stories about a topic.

Finally, it is worth noting that different people might mean different things when they refer to gleaning value from data. For some people, this might involve explaining the mechanisms that gave rise to the observed patterns, or using the data to learn about an aspect of the world. This is a common approach in the social sciences. For other people, value might emerge from predictions that are not cognitively penetrable but are nevertheless accurate, like the ones obtained with machine learning techniques such as neural networks or those based on abstract features. The latter of these approaches, which is often used by computer scientists, can be very useful for sites that require accurate predictions, such as Netflix or Amazon. Here, the value is certainly more commercial, but it is also a valid answer to the question.

I hope these answers help clarify your questions.

All the best

C

I wanted to respond in kind, but for some unknown reason the Scientific American website is rejecting my comments, so I’ll continue the discussion here in my own blog.

Dr. Hidalgo,

Your response regarding the definition of Big Data demonstrates the problem that I’m trying to expose: Big Data has not been defined in a manner that lends itself to intelligent discussion. Your definition does not at all represent a generally accepted definition of Big Data. It is possible that the naysayers with whom you disagree define Big Data differently than you do. I’ve observed a great many false promises and much wasted effort in the name of Big Data. Unless you’re involved with a broad audience of people who work with data in organizations of all sorts (not just academia), you might not be aware of some of the problems that exist with Big Data.

Your working definition of Big Data is somewhat similar to the popular definition involving the 3 Vs (volume, velocity, and variety) that is often cited. The problem with the 3 Vs and your “size, resolution, and scope” definition is that they define Big Data in a way that could be applied to the data that I worked with when I began my career 30 years ago. Back then I routinely worked with data that was big in size (a.k.a., volume), detailed in resolution, and useful for purposes other than that for which it was originally generated. By defining Big Data as you have, you are supporting the case that I’ve been making for years that Big Data has always existed and therefore doesn’t deserve a new name.

I don’t agree that the term Big Data emerged as a “way to refer to digital traces of human activity that were collected for operational purposes by service providers serving large populations, and that could be used for purposes that were beyond those for which the data was originally collected.” What you’ve described has been going on for many years. In the past we called it data, with no need for the new term “Big Data.” What I’ve observed is that the term Big Data emerged as a marketing campaign by technology vendors and those who support them (e.g., large analyst firms such as Gartner) to promote sales. Every few years vendors come up with a new name for the same thing. Thirty years ago, we called it decision support. Not long after that we called it data warehousing. Later, the term business intelligence came into vogue. Since then we’ve been subjected to marketing campaigns associated with analytics and data science. These campaigns keep organizations chasing the latest technologies, believing that they’re new and necessary, which is rarely the case. All the while, they never slow down long enough to develop the basic skills of data sensemaking.

When you talk about data visualization, you’re venturing into territory that I know well. It is definitely not true that data visualization has “progressed enormously during recent years.” As a leading practitioner in the field, I am painfully aware that progress in data visualization has been slow and, in actual practice, is taking two steps backwards, repeating past mistakes, for every useful step forwards.

What various people and organizations value from data certainly differs, as you’ve said. The question that I asked, however, is whether or not the means of gleaning value from data, regardless of what we deem valuable, are significantly different from the past. I believe that the answer is “No.” While it is true that we are always making gradual progress in the development of analytical techniques and technologies, what we do today is largely the same as what we did when I first began my work in the field 30 years ago. Little has changed, and what has changed is an extension of the past, not a revolutionary or qualitative departure.

I hope that Dr. Hidalgo will continue our discussion here and that many of you will contribute as well.

Take care,

12 Comments on “Big Mouths on Big Data”


By Jason Beyer. May 1st, 2014 at 5:29 am

I enjoy seeing both sides of this.
I see the defense of big data as someone trying to defend using a knife to scoop your peas when you have a wonderfully perfect spoon right next to you.

One piece I would like to add that comes to mind is that Stephen mentions the following right at the end, “While it is true that we are always making gradual progress in the development of analytical techniques and technologies, what we do today is largely the same as what we did when I first began my work in the field 30 years ago. Little has changed, and what has changed is an extension of the past, not a revolutionary or qualitative departure.”

One piece I would like to add to this that I’ve seen is with the analytics techniques mentioned. Even though they are the same (or at least should be), I believe there are many more people exposed to them on a daily basis. Are they actually applying the techniques? Maybe not, but there is greater exposure to the field of analytics.

But this is just my observation….thanks for the post Stephen.
Jason

By Gretchen Peterson. May 1st, 2014 at 9:33 am

I like the idea of simply getting people to slow down and develop the basic skills to make sense out of data. Good discussion, thank you.

By David Leppik. May 1st, 2014 at 1:31 pm

While the data and most of the visualization techniques are not new, they are far more accessible than they used to be. When I worked at Net Perceptions during the dot-com boom, I was in a truly rarefied place by having access to large customer transaction datasets; our clients considered them top secret. Only a few years before, a marketing company claimed it could accurately describe your personality and (more importantly) buying habits, given just your gender and zip code, and people believed it!

These days, we are awash in large public datasets. And the public is getting used to not just seeing a map or bar graph, but being able to interact with it: applying arbitrary filters and zooming in to see their own neighborhoods.

You may not see this as a huge change, since many of the visualization techniques date back to at least Schneiderman’s starfield displays (1994). But Schneiderman didn’t anticipate being able to scale from a globe down to a street view. And trust me, back then it was hard to find a public dataset that was big enough for that to be useful.

Similarly, two presidential elections ago people paid attention to individual polls. As a result, every few months you’d have (as you’d expect) an outlier poll: barely statistically significant, with surprising results, and people took it seriously. Last election, Nate Silver popularized Monte Carlo simulation meta-analysis. And by “popularized” I mean to the public; real analysts had been using that technique for years. And by “real analysts” I don’t mean “most analysts”, but more like the top 1%.
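A deliberately simplified Monte Carlo sketch of that kind of poll aggregation (the poll figures are invented, and a real model would also account for house effects, timing, and correlated errors):

```python
# Combine several polls by repeatedly simulating plausible outcomes,
# rather than reacting to any single poll in isolation.
import random

# Each poll: (reported share for candidate X, sample size). Invented numbers.
polls = [(0.52, 800), (0.49, 1000), (0.51, 600), (0.55, 400)]

def simulate_once():
    # Draw one plausible value per poll from its sampling error, then
    # average the draws weighted by sample size.
    draws, weights = [], []
    for share, n in polls:
        se = (share * (1 - share) / n) ** 0.5   # standard error of a proportion
        draws.append(random.gauss(share, se))
        weights.append(n)
    return sum(d * w for d, w in zip(draws, weights)) / sum(weights)

runs = 10_000
wins = sum(simulate_once() > 0.5 for _ in range(runs))
print(f"Candidate X leads in {wins / runs:.1%} of {runs} simulated outcomes")
```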

One last case in point. When I was in college in the 1990s, very few social sciences required a knowledge of statistics. Today, very few don’t.

Big data, where by “big” I simply mean “large enough to filter out 90% and still have statistically significant data”, is by no means new, but big AND public is. Novices are trading their Excel spreadsheets for Google Maps. That’s leading to a proliferation of big data that’s actually useful to the public.

So it’s not the novelty, it’s the accessibility. Which makes your job even more important, as being able to analyze big data has gone from an elite specialty to a mainstream undergraduate requirement. And being able to judge big data visualizations has become important to being an informed citizen.

By Stephen Few. May 1st, 2014 at 3:29 pm

David,

Thanks for sharing these observations. They match my own, with only two exceptions:

1) I suspect that Ben Shneiderman (notice that there is no “c” in Shneiderman) would not have found it difficult to imagine zooming from a global view down to a street view 20 years ago.

2) My own experience suggests that as long ago as the late 1970s students in the social sciences were required to take courses in statistics. I certainly was. It might be that more extensive training in statistics is required today, but I’m not sure of that. A common problem in published research papers today, in both the social sciences and the hard sciences, is a misunderstanding of basic statistics, resulting in erroneous conclusions. Despite exposure to the fine work of people like Nate Silver, I doubt that the statistical aptitude of society in general is any greater today than it was 30 years ago.

Your observations prompt me to point out that I am by no means suggesting that nothing related to data and data technology has changed over the last 30 years. What I’m arguing is that (1) most of the claims of so-called Big Data are erroneous (many of them intentionally fabricated to market products and services), (2) Big Data, as it is usually defined, is not qualitatively different from data in the past, and (3) when organizations pursue Big Data technologies without first developing basic skills in data sensemaking, they are wasting their money and postponing the time when they will eventually begin to derive real value from data.

By Stacey Barr. May 2nd, 2014 at 12:23 am

Steve, have you read Bernard Marr’s definition of Big Data?

http://theintelligentcompany.blogspot.com.au/2013/08/what-hell-is-big-data.html

My undergrad and post-grad degrees were in mathematical statistics, so I really don’t see any difference between big data and data in general. As is to be expected, as society evolves so will our sources, amounts, and uses of data and its analysis. But the fundamental procedures and principles we apply to collect, analyse, and make sense of it should be roughly the same. Frame a question or hypothesis, design the analysis that will explore or answer it, design the data collection/capture method, get the data, clean the data, do the analysis, discuss the findings, and if needed, accept or reject the hypothesis.
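A bare-bones sketch of that procedure on invented measurements (scipy’s two-sample t-test stands in for whatever analysis a real question would call for):

```python
# Frame a question, get the data, clean it, analyze it, accept or reject.
from statistics import mean
from scipy import stats

# 1. Question: does the new process produce higher scores than the old one?
# 2. Collection: invented measurements standing in for real observations.
old = [72, 68, 75, 71, 69, 74, 70, 73]
new = [76, 74, 79, 73, 77, 80, 75, 78]

# 3. Cleaning: discard invalid records (nothing to discard in this toy case).
old = [x for x in old if x is not None]
new = [x for x in new if x is not None]

# 4. Analysis: two-sample t-test of the difference in means.
t_stat, p_value = stats.ttest_ind(new, old, equal_var=False)

# 5. Decision: reject the null hypothesis of "no difference" at the 5% level.
print(f"means: old={mean(old):.1f}, new={mean(new):.1f}")
print(f"t={t_stat:.2f}, p={p_value:.4f}",
      "-> reject null" if p_value < 0.05 else "-> fail to reject null")
```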

By Stephen Few. May 2nd, 2014 at 9:02 am

Stacey,

Until now, I hadn’t read Bernard Marr’s definition of Big Data. What I found in reading it, however, is fairly typical. He begins by saying that the term “‘big data’ is not very well defined and is, in fact, not well chosen.” With this, I definitely agree. He then went on to say that he would “explain what’s behind the massive ‘big data’ buzz and demystify some of the hype.” At this point in reading the article, I got excited. What he then said, however, was fairly consistent with most of the definitions that I’ve read.

Here’s his definition: “Basically, big data refers to our ability to collect and analyze the vast amounts of data we are now generating in the world. The ability to harness the ever-expanding amounts of data is completely transforming our ability to understand the world and everything within it.” He then went on to describe the four V’s (volume, velocity, variety, and veracity), which are part and parcel of most of the Big Data promotional literature.

The problem with this definition is that it merely describes data, not anything new about data that deserves a new term. The data that we deal with today and the ways that we deal with it are on a continuum with the past, not something qualitatively different. There’s nothing wrong with growing excitement about the potential of data, and it might make sense to describe this growing excitement as Big Data, but not to suggest that it is based on anything intrinsic to data itself. In fact, I believe that the only phenomenon that the term Big Data truly describes has little to do with data and everything to do with an emerging, more widespread consciousness about its potential. I and many others who have worked with data to support decision making for many years have always possessed this consciousness. It’s great that it is spreading as long as it remains focused on data that actually matters, ways of interacting with data that actually enlighten, and uses of data that benefit humankind.

By PeggySue Werthessen. May 5th, 2014 at 5:58 am

Hi Stephen,

Thank you so much for this discussion.

As part of my work, I frequently speak to groups of IT/business leaders on the topic of Big Data. It is enough to say that I have struggled greatly with this task because of the very reasons that you speak of. Everyone seems to be anxious for someone to tell them something magic. But there is no magic.

In my presentation, I always show a picture of W. Edwards Deming and remind everyone that we have been striving to use data to manage and control business for ages. This is not new.

As technology has become cheaper, faster and easier to deploy, we have always leveraged it in new and often interesting ways – creating new hardware and software models along the way. This is not new.

Data and math minded individuals have always used their craft to glean new insight. This was true well before the days of “big”. The planning for the human genome project began in 1984. This is not new.

We have reached a tipping point though.

The costs have gotten low enough and the tools have gotten mature enough that businesses can decide to “invest” without purpose. We can datify almost anything as the ability to apply sensors to the everyday becomes near ‘free’. (As an example, Oral-B has a new toothbrush coming to market which datifies tooth-brushing activity. Now, I wouldn’t want one for myself but since my 10 year old habitually just chews his toothbrush for two minutes instead of brushing, I could see the use for one in my house.)

So what is new is the accessibility and ubiquity. But, at the end of the day, the same rules apply regarding the need to apply business acumen. There is no magic.

A few lines from your posts really resonated with me:

– “In fact, I believe that the only phenomenon that the term Big Data truly describes has little to do with data and everything to do with an emerging, more widespread consciousness about its potential.”
– “All the while, they never slow down long enough to develop the basic skills of data sensemaking.”
– “I do believe that Big Data simply describes the mainstreaming idea that data has value… something that many of us have known for ages. But I worry that we are not educating our children to survive in this world.”

We stress the idea of literacy in our country and all but ignore real numeracy. Sure, we teach math in the schools but not as ‘poetry’ or ‘literature’. We teach students to execute math problems but not to think mathematically.

I wish I had a specific example off hand but I know that I am often frustrated when listening to NPR as they quote numbers/statistics as part of the news without context, questioning or understanding. They provide a number as if it has meaning on its own. But Romeo and Juliet wouldn’t stand for tragic love if you had never read the story. And, numbers without a story are the same way.

I guess that was more rant than reply. But again, thank you for being a voice of reason in a sea of hype!

P.S. I have always liked the term Decision Support best.

By Stephen Few. May 5th, 2014 at 6:20 am

PeggySue,

I appreciate your comments, but I don’t agree with your assessment of what is new with Big Data. You wrote:

“The costs have gotten low enough and the tools have gotten mature enough that businesses can decide to ‘invest’ without purpose. We can datify almost anything as the ability to apply sensors to the everyday becomes near ‘free’. (As an example, Oral-B has a new toothbrush coming to market which datifies tooth-brushing activity. Now, I wouldn’t want one for myself but since my 10 year old habitually just chews his toothbrush for two minutes instead of brushing, I could see the use for one in my house.)

So what is new is the accessibility and ubiquity.”

The costs associated with pursuing so-called Big Data are in fact enormous. The notion that data technologies have become so inexpensive that they can be pursued “without purpose” is a myth propagated by technology vendors. The costs of these technologies are high, and the time that’s spent implementing them is huge and often wasted. Neither accessibility nor ubiquity is new; both are just somewhat greater than in the past.

By Kathryn Hurchla. May 6th, 2014 at 3:06 am

I’d just like to say that this statement by PeggySue Werthessen fascinates me.

“We stress the idea of literacy in our country and all but ignore real numeracy. Sure, we teach math in the schools but not as ‘poetry’ or ‘literature’. We teach students to execute math problems but not to think mathematically.”

From school age, I have tended towards art and math. As a young professional who works with data and also the mother of a young child, it’s very important to me that early education prepare kids for a climate where data comprehension and wrangling (for lack of a more fitting term) are increasingly essential to nearly any field.

I excelled in reading, writing, art, and math in high school. I earned a Bachelor of Fine Arts in visual art, and have found my way in the workforce. It’s taken me a decade or more to figure this out, but creative and dimensional thinking combined with research-driven and conceptually based practices uniquely prepared me to do more than just accurately represent data.

Will my further studies in so-called data science and statistics make my work all the stronger? Of course. But surely a balanced education in the arts (literary, visual, music/sound) will develop numeracy as well. Unfortunately, these disciplines are not supported throughout our current K-12 education system in the US, especially in many financially strained and test-driven urban public school districts.

A growing effort to change the S.T.E.M. field approach to S.T.E.A.M. (science, technology, engineering, arts, mathematics) will benefit our kids and the future workforce. When I became a parent, I first wanted to drive my daughter away from art, even jokingly forbidding “coloring” on her chalkboard in favor of algorithms and taxonomies. Your mention of true numeracy v. math proficiency vividly describes what I’ve learned, far more concisely than my personal story.

Art & Design do not only complicate or wrongly steer data visualizations (or decision support), though they can and do in many visualizations I’ve seen; these disciplines can also contribute to the most effective and far-reaching work. If you look, I’m certain you’ll find many a jazz or literary enthusiast behind statisticians, report designers, programmers, and every sort of business end user. The arts are rooted in communication, and offer powerful methodologies and frameworks for organizing, analyzing, and sharing data as information. S.T.E.A.M. is publicized by the Rhode Island School of Design at http://stemtosteam.org/.

I’m glad I found Perceptual Edge – the content is very informative and critical, as with this particular post and comments.

By PeggySue Werthessen. May 6th, 2014 at 5:51 am

Hi Stephen,

Thanks for your reply. I agree that the money being invested in big data is absolutely staggering.

To provide some context, my background is 20 years of corporate IT. Most of this time was in support of investment professionals, where money was more accessible than for most IT organizations. And yet, I still have deep scars from years of fighting through the financial approval process for IT resources.

What I have seen through my continued contacts in the corporate world and in surveys (such as from Gartner) is this willingness – even eagerness – to invest in big data prior to proving the business case. This feels very new to me.

Thanks
PeggySue

By PeggySue Werthessen. May 6th, 2014 at 6:09 am

Hi Kathryn,

Thanks for your comments and for the S.T.E.A.M. link. What a fantastic program.

I have a son in the 5th grade who is very math minded but we have really struggled with the school system. He consumes math and logic puzzle books at home but really dislikes math at school.

I just thought I would point you at Destination Imagination in case you have it in your area. http://www.destinationimagination.org/ My son started at age 8 and it has been extremely rewarding for both of us.

All my best.

By Dan. May 30th, 2014 at 8:08 am

Perhaps a better name for Big Data is Bad Data. The Internet has made questionable data freely available to everyone. So now making sense of all of this low quality data requires more vigilance and filtering than ever before (and it always required this). Unfortunately, a lot of people seem to think that Big Data means Better Data, which it isn’t.