Yesterday, I read an article on the website of Scientific American titled “Saving Big Data from Big Mouths” by Cesar A. Hidalgo. As you know if you read this blog regularly, I have grave concerns about the hyperbolic claims of Big Data and believe that it is little more than a marketing campaign to sell expensive technology products and services. In his article, Dr. Hidalgo, who teaches in the MIT Media Lab, challenges several recent articles in prominent publications that criticize the claims of Big Data. The fact that he characterizes naysayers as “big mouths” clues you in to his perspective on the matter. In reading his article, I discovered that Dr. Hidalgo’s understanding of Big Data is limited, as is often the case with academics, and his position suffers from a fundamental problem with Big Data, which is that Big Data, as he defines it, doesn’t actually exist. When I say that it doesn’t exist, I’m arguing that Big Data, as he’s defined it, isn’t new or qualitatively different from data in the past. Big Data is just data.
I expressed my concerns to Dr. Hidalgo by posting the following comments in response to his article:
I’m one of the naysayers in response to the claims of so-called Big Data. I’m concerned primarily with the hype that leads organizations to waste money chasing new technologies rather than developing the skills that are needed to glean value from data. One of the fundamental problems with Big Data is the fact that no two people define it in the same way, so it is difficult to discuss it intelligently. In this article, you praised the benefits of Big Data, but did not define it. What do you mean by Big Data? How is Big Data different from other data? When did data become big? Are the means of gleaning value from so-called Big Data different from the means of gleaning value from data in general?
He was kind enough to respond with the following.
Dear Stephen Few,
These are all very good questions.
First, regarding the definition of big data:
As you probably know well, the term big data is used colloquially to refer mostly to digital traces of human activities. These include cell phone data, credit card records and social media activity. Big data is also used occasionally to refer to data generated by some scientific experiments (like CERN or genomic data), although this is not the most common use of the phrase so I will stick to the “digital traces of human activity” definition for now.
Beyond the colloquial definition, I have a working definition of big data that I use on occasion. To keep things simple I say that big data needs to be three times big, meaning that it needs to be big in size, resolution and scope. The size dimension is relatively obvious (data on 20 or 30 individuals is not the same as data on hundreds of thousands of them). The resolution dimension is better explained by an example. Consider having credit card data on a million people. A low resolution version of this dataset would consist only of the total yearly expenditure of each individual. A high resolution version would include information on when and where the purchases were made. In this example, it is the resolution of the data that allows us to use it to study, for instance, the mobility of this particular group of people (notice that I am not generalizing to the general population, since this subpopulation might be worthy of study on its own). Finally, I require data to be big in scope. By this, I require data to be useful for applications other than the ones for which it was originally collected. For instance, mobile phone records are used by operators for billing purposes, but could be used to forecast traffic or to identify the location of mobile phone users prior to a natural disaster (and use this information to help speed up search and rescue operations). When data is big in size, resolution and scope, I am comfortable saying that it is big.
Second: your question about when big data became big.
This is an interesting question because it points to the evolution of language. During the last decade the word data has begotten at least two children: “metadata” and “big data”. The word metadata grew in popularity in the wake of the NSA scandal, as people needed to differentiate between the content of messages and their metadata. Big data, on the other hand, emerged as people searched for a short way to refer to the digital traces of human activity that were collected for operational purposes by service providers serving large populations, and that could be used for purposes that were beyond those for which the data was originally collected. Certainly, the phrase “big data” provides an economy of language, and as someone who enjoys writing I always appreciate that.
Regarding the time at which this transition happened, I remember that when I started working with mobile phone data (in 2004) people were not using the term big data. As more people entered the field, the term began to gain force (around 2008). With the financial crisis, the hype of big data entered full swing, as many framed big data as the new asset, or technology, that could save the economy. I guess at that time, everyone wanted to believe them :-).
Third and final: you ask whether the means of gleaning value from big data include the methods used to glean value from data in general.
In short, the answer is yes (it is data after all). Multivariate regressions in all of their forms and specifications are still useful and welcome (I use them often). Yet, these new datasets have also stimulated the proliferation of some additional techniques. For instance, visualizations have progressed enormously during recent years since exploring these datasets is not easy, and large datasets involve more exploration. As an example, check dataviva.info. This site makes available more than 100 million visualizations to help people explore Brazil’s formal sector economy. By taking different combinations of visualizations it is possible to weave stories about industries and locations. An example of these stories for a related project, The Observatory of Economic Complexity (atlas.media.mit.edu), can be seen in this video (http://vimeo.com/40565955). Here you will be able to see how these visualization techniques allow people to quickly compose stories about a topic.
Finally, it is worth noting that different people might mean different things when they refer to gleaning value from data. For some people, this might involve explaining the mechanisms that gave rise to the observed patterns, or using the data to learn about an aspect of the world. This is a common approach in the social sciences. For other people value might emerge from predictions that are not cognitively penetrable but nevertheless accurate, such as the ones people obtain with different machine learning techniques, such as neural networks or those based on abstract features. The latter of these approaches, which is often used by computer scientists, can be very useful for sites that require accurate predictions, such as Netflix or Amazon. Here, the value is certainly more commercial, but is also a valid answer to the question.
I hope these answers help clarify your questions.
All the best
I wanted to respond in kind, but for some unknown reason the Scientific American website is rejecting my comments, so I’ll continue the discussion here on my own blog.
Your response regarding the definition of Big Data demonstrates the problem that I’m trying to expose: Big Data has not been defined in a manner that lends itself to intelligent discussion. Your definition does not at all represent a generally accepted definition of Big Data. It is possible that the naysayers with whom you disagree define Big Data differently than you do. I’ve observed a great many false promises and much wasted effort in the name of Big Data. Unless you’re involved with a broad audience of people who work with data in organizations of all sorts (not just academia), you might not be aware of some of the problems that exist with Big Data.
Your working definition of Big Data is somewhat similar to the popular definition involving the 3 Vs (volume, velocity, and variety) that is often cited. The problem with the 3 Vs and your “size, resolution, and scope” definition is that they define Big Data in a way that could be applied to the data that I worked with when I began my career 30 years ago. Back then I routinely worked with data that was big in size (a.k.a. volume), detailed in resolution, and useful for purposes other than those for which it was originally generated. By defining Big Data as you have, you are supporting the case that I’ve been making for years that Big Data has always existed and therefore doesn’t deserve a new name.
I don’t agree that the term Big Data emerged as a “way to refer to digital traces of human activity that were collected for operational purposes by service providers serving large populations, and that could be used for purposes that were beyond those for which the data was originally collected.” What you’ve described has been going on for many years. In the past we called it data, with no need for the new term “Big Data.” What I’ve observed is that the term Big Data emerged as a marketing campaign by technology vendors and those who support them (e.g., large analyst firms such as Gartner) to promote sales. Every few years vendors come up with a new name for the same thing. Thirty years ago, we called it decision support. Not long after that we called it data warehousing. Later, the term business intelligence came into vogue. Since then we’ve been subjected to marketing campaigns associated with analytics and data science. These campaigns keep organizations chasing the latest technologies, believing that they’re new and necessary, which is rarely the case. All the while, they never slow down long enough to develop the basic skills of data sensemaking.
When you talk about data visualization, you’re venturing into territory that I know well. It is definitely not true that data visualization has “progressed enormously during recent years.” As a leading practitioner in the field, I am painfully aware that progress in data visualization has been slow and, in actual practice, is taking two steps backwards, repeating past mistakes, for every useful step forwards.
What various people and organizations value from data certainly differs, as you’ve said. The question that I asked, however, is whether or not the means of gleaning value from data, regardless of what we deem valuable, are significantly different from the past. I believe that the answer is “No.” While it is true that we are always making gradual progress in the development of analytical techniques and technologies, what we do today is largely the same as what we did when I first began my work in the field 30 years ago. Little has changed, and what has changed is an extension of the past, not a revolutionary or qualitative departure.
I hope that Dr. Hidalgo will continue our discussion here and that many of you will contribute as well.