The Three Vs and the Big O

It’s often useful to take a fresh look at things through the eyes of an outsider. My friend Leanne recently provided me with an outsider’s perspective after reading a blog article of mine regarding Big Data. In it I referred to the three Vs—volume, velocity, and variety—as a common theme of Big Data definitions, which struck Leanne as misapplied. Being trained in health care and, perhaps more importantly, being a woman, Leanne pointed out that the three Vs don’t seem to offer any obvious advantages to data, but they’re highly desirable when applied to the Big O. What’s the Big O? Leanne was referring to the “oh, oh, oh, my God” Big O more commonly known as the female ORGASM. When it comes to the rock-my-world experience of the Big O:

  1. Volume is desirable—the more the better;
  2. Velocity is desirable—reaching terminal velocity quickly with little effort is hard to beat; and
  3. Variety is desirable—getting there through varied and novel means is a glorious adventure.

The three Vs are a perfect fit for the Big O, but not for data. More data coming at us faster from an ever-growing variety of sources offers few advantages and often distracts from the ultimate goal. Leanne doesn’t understand why data geeks (her words, not mine) are spending so much time arguing about terminology and technology instead of focusing on content—what data has to say—and putting that content to good use. I couldn’t agree more.

Take care,

7 Comments on “The Three Vs and the Big O”


By Steve Ardire. May 6th, 2014 at 9:11 am

Nice frame and clever post ;-)

By Colin Michael. May 6th, 2014 at 9:54 am

Methinks you will get a different class of comment on this post from what you are used to receiving. However, since many of those responding are likely to be male, the usefulness of the comments may fall well below that to which your audience is accustomed.

By Jim Sterne. May 6th, 2014 at 6:48 pm

Wonderful! And useful!
Now, then… about Big Data.

The term Big Data was born of our new ability to distribute and process in parallel. The ability to crunch all the data is better than being forced to choose (sample) what we are going to process. More data is better.

We now have the ability to look at streaming data, leaving batch behind. We don’t have to wait until it’s all brought together and then start crunching data that gets older by the minute. Faster is better.

A wider variety of sources lets us correlate data sets we simply could not in the past. The spice of life applies to data as well.

Hence, being able to look at whole data sets, closer in time to their creation and across more types of data, allows us to ask new and exciting questions.

Or, as Leanne might have put it:

1. Volume is desirable—the more the better;
2. Velocity is desirable—reaching new heights of insight quickly with little effort is hard to beat; and
3. Variety is desirable—getting there through varied and novel means is a glorious adventure that is sure to be ever more revealing.

By Stephen Few. May 6th, 2014 at 9:04 pm

Jim,

Your comments in response to this tongue-in-cheek article exhibit an ironic obsession with terminology and technology rather than data content and use. Leanne finds this both amusing and sad.

Your understanding of Big Data underscores the problem that I’ve been writing about for years now: the term means something different to everyone who speaks of it, which makes debates fruitless. It also underscores the fact that most definitions of Big Data identify characteristics that have been true of data and data processing since long before the notion of so-called Big Data arose. As such, so-called Big Data is just data, an unbroken connection with the past that exhibits no qualitative differences.

Regarding your particular understanding of the term, Big Data did not grow out of our ability to process data in parallel as you believe. We’ve been processing in parallel for many years, beginning long before the term Big Data came into use. The same is true of streaming data.

Regarding complete data sets versus samples, most statistical studies that have required sampling in the past still require sampling today because the data sets are not available and are costly to collect. This hasn’t changed. Facebook and Twitter data supports relatively few scientific endeavors. Also, using complete data sets is hardly new in many realms. For instance, in the world of business, where I’ve worked most of my life, we’ve used complete data sets since businesses first became computerized.

Regarding increases in our ability to make new correlations based on new sources of data, the claim of many Big Data advocates that we can now rely on correlations without concern for causation is getting us into trouble. Most correlations that are being found are of little use and are bound to mislead if we don’t take time to understand their cause.

Our obsession with data volume, velocity, and variety is distracting us from what matters. For my opinion on this, you might find it interesting to read my blog article titled “The Slow Data Movement”.

By Jim Sterne. May 7th, 2014 at 9:42 am

Stephen –

I was an English major in college. That means my obsession with terminology is not ironic but born of necessity to solve the very problem you identified: “Big Data means something different to everyone, which makes debate *necessary*.” Once we can agree on terms, we can have a rational conversation.

Spoiler Alert: We are going to end up in violent agreement at the end of this, but the journey is important.

Qualifier: I too am focused on business data. I cannot speak to issues of bio-data, astrophysics, or highest-energy particle physics, etc.

You say that Big Data is “just data, an unbroken connection with the past that exhibits no qualitative differences.” I disagree. I feel we have crossed two thresholds that changed the game. The first relates to variety (and hence volume) and the second relates to technology.

First, the business data we had been crunching for generations was transactional and structured. The ‘new’ data is behavioral (clickthroughs and pageviews) and unstructured (social media sentiment). The sheer volume of both of these prompted the word “Big” to enter the vocabulary, which propelled the hype machine into overdrive.

[Note: The following comments in italics are responses to Jim Sterne’s comment by Stephen Few.]

Jim—This is not a qualitative change from the past. I dealt with clickthrough and page view data in the early days of the Web, long before the term Big Data was used. Unstructured data is not new, nor is sentiment data. The volume of unstructured data did not prompt the popularity of the term Big Data. If you trace the origins, you will find that the term became popular because technology vendors and their supporters, such as analyst groups like Gartner, began to market the term.

Instead of thousands of transactions per day, we’re working with millions of data points per second. This step-change in volume was first experienced in the online world by the likes of Google.

This increase in volume is not a step-change; it is on a continuum that connects to the past without significant change. Data volume has been increasing exponentially since the advent of the computer. Take a look at historical changes in data volume and you will find no evidence to the contrary.

The second threshold is technical. In response to this big-ness, Google published their approach called MapReduce (MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat 2004, http://research.google.com/archive/mapreduce.html). Their approach was codified, became open source and is now a mainstay of pre-processing large volumes of unstructured data into rows and columns so that classical analytics techniques can be brought to bear. This sea-change in technology is a new style of massively parallel processing made possible by and taking advantage of dramatic drops in the cost of storage and processing.
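
[Note: The following sketch is an editorial illustration and comes from neither commenter. It shows the MapReduce pattern Jim describes, reduced to a single machine in Python: a map step emits (key, value) pairs from raw text, a shuffle step groups them by key, and a reduce step collapses each group into the rows and columns that classical analytics expects. In a distributed implementation, these same steps run in parallel across a cluster; the record contents and function names below are purely hypothetical.]

    from collections import defaultdict

    # Map step: turn each raw record (here, a line of text) into (key, value) pairs.
    def map_words(line):
        for word in line.lower().split():
            yield (word, 1)

    # Shuffle step: group values by key, as a MapReduce framework would do across machines.
    def shuffle(pairs):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    # Reduce step: collapse each group into a single row (word, count).
    def reduce_counts(groups):
        return {key: sum(values) for key, values in groups.items()}

    lines = ["big data is just data", "more data is not always better"]
    mapped = (pair for line in lines for pair in map_words(line))
    print(reduce_counts(shuffle(mapped)))  # {'big': 1, 'data': 3, 'is': 2, ...}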

Drops in the cost of storage and processing have been occurring at a consistent rate since the advent of the computer.

If we can agree on the above, we can have a substantive conversation about causation discovery.

To begin, take your statement that “the claim of many Big Data advocates that we can now rely on correlations without concern for causation is getting us into trouble.” You are righter than right. This is the very heart of the problem.

This new data and these new tools allow us to see patterns and correlations that may have nothing to do with reality aside from the human brain’s desire to see patterns. Worse still, “Most correlations that are being found are of little use and are bound to mislead if we don’t take time to understand their cause.”

Yesterday I learned the word iatrogenic: of or relating to illness caused by medical examination or treatment. I believe we are about to see a myriad of very bad decisions based on the misinterpretation of data. But these will not be the fault of the amount of data or the technology. Just because we can drive fast does not exempt us from driving well.

Driving fast does not exempt us from driving well, but it makes it more difficult. In the realm of driving, we know that speed kills. Although a few people can drive well at high speeds, they can do so only through the exercise of extreme skill and attention. The faster we drive, the less time we have to react to conditions and consider our choices. This is just as true of data sensemaking as it is of driving.

Your blog article “The Slow Data Movement” hits the nail smack on the head: we must separate the signals from the noise, and data sense-making and decision-making are human endeavors. However, I feel we might be losing the baby with the bath water when you suggest that we have to choose between spending more time making use of the data we have and getting wrapped up in the acquisition of more.

These are not mutually exclusive.

I am not arguing that we should stop accumulating data. I’m arguing that the accumulation of data is harmful if we have not already developed the skills that are needed to make sense of it and use it wisely. Those who can’t handle what they already have will lose ground if they add more to the heap.

We will surely agree with David Spiegelhalter when he says, “There are a lot of small data problems that occur in big data. They don’t disappear because you’ve got lots of the stuff. They get worse.” (http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz312xwVrbK).

I contend that with new ways to correlate new types of data, we have increased the opportunities to suss out what might be causative. We have new tools with which to decipher the mysteries of the marketplace, the comedy of human interactions and the world around us.

More of the right data can be useful in detecting useful correlations. The problem with so-called Big Data is the belief that we no longer need to be concerned with getting the right data when we can acquire everything. First of all, we cannot acquire everything. Secondly, the gratuitous acquisition of everything possible will reduce our focus on getting useful data.

I am delighted Leanne found this amusing and saddened that she found it sad. In truth, more is not always better, faster is not always better and lots of different kinds are not always better – even with sex.

Jim—With this, I agree.

By Stephen Few. May 7th, 2014 at 12:03 pm

Jim,

First of all, I appreciate your thoughtful comments. Nevertheless, I also disagree with most of them. I’ve posted my responses above as italicized statements in the midst of your comments.

By Ashraf A.. May 10th, 2014 at 3:17 am

Very interesting and intellectually spicy debate on big data.

As a newcomer to the area of data and its use for decision support, I can see one main point of alignment in all the debates, which is the relatively weak ability to transform big data, or just data, into a useful understanding of root causes and future actions/decisions. This should be the key focus.