How Scalable Do Analytics Solutions Need to Be?

While being briefed on a product earlier this week, the company’s founder and I agreed on one point only: most of the people who are currently tasked with data analysis lack the skills that are required to do the work. He and I, however, imagine conflicting solutions to this problem. He believes that technology must come to the rescue by doing the work for these people who can’t do it for themselves. I believe that even the best technologies cannot do the work of skilled data analysts and that the problem can only be effectively addressed by helping people develop analytical skills. He agreed that equipping people with the necessary skills would work better, but dismissed it because it is not a “scalable solution.” The essence of his case went something like this: “Data is increasing at an exponential rate, so our need for analytics cannot be solved by investing in human resources because humans are not sufficiently scalable, but technologies are.”

Consider this line of reasoning for a moment. It relies on the following premise: “Exponential increases in data can only be addressed by exponential increases in analytical horsepower.” This premise is fallacious. Nate Silver made this point in his book The Signal and the Noise when he wrote:

If the quantity of information is increasing by 2.5 quintillion bytes per day, the amount of useful information certainly isn’t. Most of it is just noise, and the noise is increasing faster than the signal. There are so many hypotheses to test, so many data sets to mine—but a relatively constant amount of objective truth.

The exponential growth in raw data that we’re experiencing is mostly producing noise. The amount of useful information is not increasing exponentially; therefore, the need for analytical horsepower is also not increasing exponentially. Data sensemaking is a human activity that can at best be augmented and assisted by analytical tools. The only viable solution to the analytical challenges that we face is to develop the human resources that we need. This is where our attention and our investments should be focused. Don’t trust a technology vendor who claims that skilled data analysts can be replaced with his product. That analytical product does not exist.

This company’s founder claims that his product can analyze a data set and present all of the potentially useful findings in a series of simple graphs and plain English explanations without any human involvement. During the briefing, he made an off-the-cuff comment that caused the hairs on the back of my neck to bristle. He said that his product “empowers users.” He must understand empowerment quite differently than I do. As I understand it, empowerment involves an increase in ability. Software that does for you what you could do better yourself with proper training isn’t empowering.

This fellow’s notion of empowerment bothered me because I work hard to actually empower people by teaching them analytical skills. I know how much it means to people to become truly empowered with useful abilities that enable them to affect the world in beneficial ways. No one with an ounce of integrity wants to bear the title “data analyst” while doing nothing but delivering a computer’s findings to someone else without adding any value. If this is the future that analytics technologies promise, count me out. Fortunately, this isn’t a future that technologies are likely to achieve.

Take care,

Stephen Few

10 Comments on “How Scalable Do Analytics Solutions Need to Be?”


By Jan Kurianski. August 6th, 2015 at 3:24 pm

An increase in sample size (from 100 to 1,000,000 to “big data”) does not increase the complexity of analysis. If you have a categorical variable with a limited choice of responses, the only thing a bigger sample does is help confirm the significance of results by bringing your sample closer to the truth of the actual population.
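To put rough numbers on that point, here is a quick sketch in Python (the 30% true proportion and the sample sizes are made up): the analysis of a single categorical proportion is the same at any sample size; only the uncertainty around the estimate shrinks.

import math

# Hypothetical true proportion for one response category (a made-up 30%).
p = 0.30

# The analysis is identical at every sample size: estimate one proportion.
# Only the uncertainty around the estimate shrinks as the sample grows.
for n in (100, 10_000, 1_000_000, 100_000_000):
    se = math.sqrt(p * (1 - p) / n)  # standard error of a sample proportion
    print(f"n = {n:>11,}: {p:.2f} +/- {1.96 * se:.4f} (approx. 95% interval)")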

What is complex and not scalable is the number of variables that we can hold in our heads (for such things as multivariate analysis, or for comparing results on a chart). However, it is never made clear in most articles whether the increase of data in the world is an increase in sample size or an increase in interacting variables. I wonder how many Google- or Facebook-like graph problems most companies really have. It would be interesting to hear which was being referred to by the person in your article.

By Stephen Few. August 6th, 2015 at 3:43 pm

Jan,

Thanks for the astute observation. You’ve exposed one of the logical flaws in Big Data hype.

I decided to keep the product that I was briefed on anonymous because I don’t want to give it any exposure. Their claims are not valid, but their marketing is slick.

By James Pearce. August 6th, 2015 at 11:38 pm

We have heard these kinds of arguments for years from technology vendors, and their promised future never arrives. The hardest thing about analytics is framing the problem properly. Technology is never going to help in that regard.

The role of technology, as I see it, is to make the data analyst more powerful: able to spend more time thinking about problems and how to solve them.

By Stian Lågstad. August 7th, 2015 at 4:42 am

Not having read Nate Silver’s book, how do you (and/or he) back the claim that the exponential growth in raw data is mostly noise?

By Stephen Few. August 7th, 2015 at 10:49 am

Stian,

I always appreciate it when someone asks me to back up my claims if I haven’t already done so sufficiently. Let me begin by defining what I mean by a signal as opposed to noise in the context of analytics. A signal is a fact that is useful. By useful, I mean that it informs us of something that matters and deserves a response. Items of data are just facts. Most of the facts that are observed by our senses or are recorded in our information systems are not useful. Of those that are useful, most are rarely useful. Only a tiny, tiny fraction of facts are often useful. This is the nature of data.

The vast majority of the data that exists in your environment at any one moment remains unobserved by your senses. Of the data that is observed, the vast majority of that is filtered out as useless by unconscious sensory processes in your brain. Of the data that isn’t filtered out, the vast majority of that is used by unconscious processes in your brain to enable you to automatically function in the world. And finally, only a tiny fraction of data is passed on from unconscious processes in your brain to conscious awareness. A primary function of our brains is to act as a filter.

I believe that the way that our brains process data is analogous to the way that data analysts should process data. Only a tiny portion of the data that is being electronically generated by humans and machines is actually useful. This is more true today than ever in the past. Think about the nature of the data that is responsible for exponential growth during recent years, such as social media data, photos, videos, and measurements by sensors. I think it’s safe to say that for analytical purposes most of this data is noise. Even the data from a sensor that monitors a critical process by taking frequent measurements is mostly noise—business as usual—and not important. The fundamental challenge of analytics is to separate the signals from the noise. What actually matters in the world is not increasing at an exponential rate.
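To make the sensor example concrete, here is a rough sketch (the readings and operating limits below are invented): out of thousands of measurements, only the handful that fall outside the expected limits deserve a response; everything else is business as usual.

import random

random.seed(42)

# Hypothetical sensor stream: 10,000 temperature readings, almost all routine.
# A reading only counts as a signal if it drifts outside normal operating limits.
LOW, HIGH = 64.0, 76.0  # invented operating limits
readings = [random.gauss(70, 2) for _ in range(10_000)]

signals = [r for r in readings if not (LOW <= r <= HIGH)]

print(f"total readings     : {len(readings):,}")
print(f"actionable signals : {len(signals):,}")
print(f"noise              : {1 - len(signals) / len(readings):.2%}")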

By Andrew. August 7th, 2015 at 10:59 am

Technology can be a great thing if it really “empowers users” – but if it is designed by people who lack a real understanding of the intended users’ skills (as most BI software seems to be), it will only empower the user to shoot herself in the foot.

Among the soul-crushing details in the average cubicle-dweller’s daily experience, one of the worst is to have your skills hampered by bad technology. I see a LOT of this in BI software. Someone should really explain this to BI vendors: Good technology stays out of the way.

And also – because I’m so very tired of hearing about Big Data – while technology does help us better handle the fact that “data is increasing at an exponential rate”, shouldn’t that just be a given? Don’t all BI vendors claim their software does this? Do BI vendors really think they’re setting themselves apart from competition by offering THE EXACT SAME THING that every other BI product claims to do?

By Stian Lågstad. August 7th, 2015 at 11:52 am

Thank you for answering. I like the metaphor with the brain – I’ll remember it!

By Kyle Hale. August 12th, 2015 at 8:17 am

@ Stian :

You can also consider all the data feeding a decision to be like a system of planetary objects. Each data point exerts a certain force on the decision, and the equation for force between two objects is just

mass 1 * mass 2 / distance squared

In this case, you get to choose the values of the equation, but you should be as objective as possible.

Most traditional proprietary enterprise data is the solar system. Sales/CRM is the Sun, ERP and supply is the Moon, etc.

Most Big Data should be treated like completely different galaxies. And occasionally something useful can be found out there and then you can move it into your solar system.

So even though we’re discovering new galaxies all the time, it’s not really changing our understanding of the universe the way that tiny apple hitting Newton’s head did.
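To put toy numbers on the analogy (every mass and distance below is invented purely for illustration), a familiar enterprise source sitting close to a decision easily outweighs an enormous external data set that bears on it only distantly:

# Toy version of the analogy: influence = mass1 * mass2 / distance^2.
# "Mass" is the weight you assign to a source and to the decision itself;
# "distance" is how directly the source bears on the decision.
decision_mass = 10.0

sources = {
    "CRM/sales history": (8.0, 1.0),       # (source mass, distance to the decision)
    "ERP/supply data": (6.0, 2.0),
    "external social feed": (50.0, 40.0),   # enormous, but far from this decision
}

for name, (mass, distance) in sources.items():
    influence = decision_mass * mass / distance ** 2
    print(f"{name:<22} influence = {influence:.2f}")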

By Steve Emberly. September 21st, 2015 at 8:12 am

I agree with you completely… it’s far better to have tools in the tool kit than a technology “hammer” that people may be ill equipped to take best advantage of. There’s a saying… when all you have is a hammer, everything becomes a nail.

One question about scalability… I’ve always felt that, in the right hands, technology would be most useful for categorizing the ever-increasing volumes of data and for helping to separate the “signal from the noise.” With ever-larger data sets, some tools start to choke. What are your thoughts on this aspect of the scalability argument?

By Stephen Few. September 21st, 2015 at 8:43 am

Steve,

Data analysis tools have always struggled to keep up with increasing volumes of data. This is a challenge that vendors relish, however, because it keeps everyone’s attention on something that the vendors feel comfortable addressing with incremental improvements in data handling performance and away from what really matters–the needs and abilities of their users–which they rarely know how to address. By focusing exclusively on analytical technologies, we will always remain behind the curve, struggling to catch up. By focusing on human skills and tools that actually support those skills well, we might actually catch up.

Leave a Reply