The Three Vs and the Big O
It’s often useful to take a fresh look at things through the eyes of an outsider. My friend Leanne recently provided me with an outsider’s perspective after reading a blog article of mine regarding Big Data. In it I referred to the three Vs—volume, velocity, and variety—as a common theme of Big Data definitions, which struck Leanne as misapplied. Being trained in health care and, perhaps more importantly, being a woman, Leanne pointed out that the three Vs don’t seem to offer any obvious advantages to data, but they’re highly desirable when applied to the Big O. What’s the Big O? Leanne was referring to the “oh, oh, oh, my God” Big O more commonly known as the female ORGASM. When it comes to the rock-my-world experience of the Big O:
- Volume is desirable—the more the better;
- Velocity is desirable—reaching terminal velocity quickly with little effort is hard to beat; and
- Variety is desirable—getting there through varied and novel means is a glorious adventure.
The three Vs are a perfect fit for the Big O, but not for data. More data coming at us faster from an ever-growing variety of sources offers few advantages and often distracts from the ultimate goal. Leanne doesn’t understand why data geeks (her words, not mine) are spending so much time arguing about terminology and technology instead of focusing on content—what data has to say—and putting that content to good use. I couldn’t agree more.
Take care,
7 Comments on “The Three Vs and the Big O”
Nice frame and clever post ;-)
Methinks you will get a different class of comment on this post from what you are used to receiving. However, since many of those responding are likely to be male, the usefulness of the comments may fall well below that to which your audience is accustomed.
Wonderful! And useful!
Now, then… about Big Data.
The term Big Data was born of our new ability to distribute and process in parallel. The ability to crunch all the data is better than being forced to choose (sample) what we are going to process. More data is better.
We now have the ability to look at streaming data, leaving batch behind. We don’t have to wait until it’s all brought together and then start crunching data that gets older by the minute. Faster is better.
A wider variety of sources lets us correlate data sets we simply could not in the past. The spice of life applies to data as well.
Hence, being able to look at whole data sets, closer in time to their creation and across more types of data, allows us to ask new and exciting questions.
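As a loose illustration of the batch-versus-streaming contrast described above, here is a minimal single-machine sketch in Python (the toy data and variable names are illustrative assumptions, not anything from the discussion): a batch computation waits for the whole data set, while a streaming one keeps a running answer as each event arrives.

    import random
    import statistics

    # Toy event stream: 10,000 arbitrary numeric measurements.
    events = [random.random() for _ in range(10_000)]

    # Batch: wait until everything has arrived, then compute once.
    batch_mean = statistics.mean(events)

    # Streaming: update a running mean per event, so an answer
    # exists at every moment instead of only at the end.
    count, running_mean = 0, 0.0
    for value in events:
        count += 1
        running_mean += (value - running_mean) / count

    print(batch_mean, running_mean)  # same answer, different latency

The two results are identical; the difference is that the streaming version never has to stop and "bring it all together" before answering.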
Or, as Leanne might have put it:
1. Volume is desirable—the more the better;
2. Velocity is desirable—reaching new heights of insight quickly with little effort is hard to beat; and
3. Variety is desirable—getting there through varied and novel means is a glorious adventure that is sure to be ever more revealing.
Jim,
Your comments in response to this tongue-in-cheek article exhibit an ironic obsession with terminology and technology rather than data content and use. Leanne finds this both amusing and sad.
Your understanding of Big Data underscores the problem that I've been writing about for years now: the term means something different to everyone who speaks of it, which makes debates fruitless. It also underscores the fact that most definitions of Big Data identify characteristics that have been true of data and data processing since long before the notion of so-called Big Data arose. As such, so-called Big Data is just data, an unbroken connection with the past that exhibits no qualitative differences.
Regarding your particular understanding of the term, Big Data did not grow out of our ability to process data in parallel as you believe. We've been processing data in parallel for many years, beginning long before the term Big Data came into use. The same is true of streaming data.

Regarding complete data sets versus samples, most statistical studies that required sampling in the past still require sampling today, because complete data sets are not available and are costly to collect. This hasn't changed. Facebook and Twitter data supports relatively few scientific endeavors. Also, using complete data sets is hardly new in many realms. For instance, in the world of business, where I've worked most of my life, we've used complete data sets since businesses first became computerized.

Regarding increases in our ability to make new correlations based on new sources of data, the claim of many Big Data advocates that we can now rely on correlations without concern for causation is getting us into trouble. Most correlations that are being found are of little use and are bound to mislead if we don't take time to understand their cause.
Our obsession with data volume, velocity, and variety is distracting us from what matters. For my opinion on this, you might find it interesting to read my blog article titled "The Slow Data Movement."
Stephen –
I was an English major in college. That means my obsession with terminology is not ironic but born of necessity, to solve the very problem you identified: Big Data means something different to everyone, which makes debate *necessary*. Once we can agree on terms, we can have a rational conversation.
Spoiler Alert: We are going to end up in violent agreement at the end of this, but the journey is important.
Qualifier: I too am focused on business data. I cannot speak to issues of bio-data, astrophysics, highest-energy particle physics, etc.
You say that Big Data is “just data, an unbroken connection with the past that exhibits no qualitative differences.” I disagree. I feel we have crossed two thresholds that changed the game. The first relates to variety (and hence volume) and the second relates to technology.
First, the business data we had been crunching for generations was transactional and structured. The "new" data is behavioral (clickthroughs and pageviews) and unstructured (social media sentiment). The sheer volume of both of these prompted the word "Big" to enter the vocabulary, which propelled the hype machine into overdrive.
[Note: The following comments in italics are responses to Jim Sterne’s comment by Stephen Few.]
Instead of thousands of transactions per day, we're working with millions of data points per second. This step-change in volume was first experienced in the online world by the likes of Google.
The second threshold is technical. In response to this big-ness, Google published their approach, called MapReduce ("MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat, 2004, http://research.google.com/archive/mapreduce.html). Their approach was codified, became open source, and is now a mainstay of pre-processing large volumes of unstructured data into rows and columns so that classical analytics techniques can be brought to bear. This sea change in technology is a new style of massively parallel processing, made possible by dramatic drops in the cost of storage and processing.
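For readers who haven't met MapReduce, here is a minimal single-process sketch in Python of the programming model described above; the function names and toy data are my own, and a real deployment distributes the map, shuffle, and reduce phases across many machines, which is the whole point.

    from collections import defaultdict
    from itertools import chain

    def map_words(document):
        # Map phase: emit a (word, 1) pair for every word in a record.
        for word in document.lower().split():
            yield (word, 1)

    def shuffle(pairs):
        # Shuffle phase: group all emitted values by their key.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups.items()

    def reduce_counts(key, values):
        # Reduce phase: collapse each key's values to a single result.
        return key, sum(values)

    documents = ["big data is just data", "more data is not always better"]
    mapped = chain.from_iterable(map_words(d) for d in documents)
    result = dict(reduce_counts(k, v) for k, v in shuffle(mapped))
    print(result)  # {'big': 1, 'data': 3, 'is': 2, ...}

Because each map call touches only one record and each reduce call touches only one key, every phase can be spread across thousands of machines without the pieces needing to coordinate.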
If we can agree on the above, we can have a substantive conversation about causation discovery.
To begin, take your statement that "the claim of many Big Data advocates that we can now rely on correlations without concern for causation is getting us into trouble." You are righter than right. This is the very heart of the problem.
This new data and these new tools allow us to see patterns and correlations that may have nothing to do with reality aside from the human brain's desire to see patterns. Worse still, as you put it, "Most correlations that are being found are of little use and are bound to mislead if we don't take time to understand their cause."
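To make that concrete, here is a small sketch (my own illustration, not part of the original exchange; the sample sizes are arbitrary) showing how strong-looking correlations emerge by chance alone once enough variables are compared:

    import numpy as np

    rng = np.random.default_rng(0)

    # 1,000 independent random "metrics", 30 observations each;
    # by construction, nothing here is causally related to anything.
    data = rng.normal(size=(30, 1000))

    # Correlate the first metric against the other 999.
    r = np.corrcoef(data, rowvar=False)[0, 1:]

    print(f"strongest chance correlation: {np.abs(r).max():.2f}")
    print(f"metrics with |r| > 0.4: {(np.abs(r) > 0.4).sum()}")

With enough comparisons, some pairs of pure noise will correlate impressively, which is exactly why correlations found by trawling large data sets demand a causal explanation before we act on them.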
Yesterday I learned the word iatrogenic: of or relating to illness caused by medical examination or treatment. I believe we are about to see a myriad of very bad decisions based on the misinterpretation of data. But these will not be the fault of the amount of data nor of the technology. Just because we can drive fast does not exempt us from driving well.
Your blog article "The Slow Data Movement" hits the nail smack on the head: we must separate the signals from the noise, and data sense-making and decision-making are human endeavors. However, I feel we might be throwing the baby out with the bathwater when you suggest that we have to choose between spending more time making use of the data we have and getting wrapped up in the acquisition of more.
These are not mutually exclusive.
We will surely agree with David Spiegelhalter when he says, “There are a lot of small data problems that occur in big data. They don’t disappear because you’ve got lots of the stuff. They get worse.” (http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz312xwVrbK).
I contend that with new ways to correlate new types of data, we have increased the opportunities to suss out what might be causative. We have new tools with which to decipher the mysteries of the marketplace, the comedy of human interactions and the world around us.
I am delighted Leanne found this amusing and saddened that she found it sad. In truth, more is not always better, faster is not always better and lots of different kinds are not always better – even with sex.
Jim,
First of all, I appreciate your thoughtful comments. Nevertheless, I also disagree with most of them. I’ve posted my responses above as italicized statements in the midst of your comments.
Very interesting and intellectually spicy debate on big data.
As a newcomer to the area of data and its use for decision support… I can see one main point of alignment in all the debates: the relatively weak ability to transform Big Data, or just data, into a useful understanding of root causes and future actions/decisions. This should be the key focus.