Big Data, Big Errors in Prediction: Nate Silver’s “The Signal and the Noise”

I found a kindred spirit when I recently read Nate Silver’s new book The Signal and the Noise (Penguin Press, 2012). I want to give you a sense of the book and its powerful message by sharing a few excerpts from the introduction.

This is a book about information, technology, and scientific progress. This is a book about competition, free markets, and the evolution of ideas. This is a book about the things that make us smarter than any computer, and a book about human error. This is a book about how we learn, one step at a time, to come to knowledge of the objective world, and why we sometimes take a step back.

This is a book about prediction, which sits at the intersection of all these things. It is a study of why some predictions succeed and why some fail. My hope is that we might gain a little more insight into planning our futures and become a little less likely to repeat our mistakes.

He talks about the greatest revolution in information technology since the invention of writing—not the so-called information age of today, but the invention of the printing press in 1440.

The amount of information was increasing much more rapidly than our understanding of what to do with it, or our ability to differentiate the useful information from the mistruths. Paradoxically, the result of having so much more shared knowledge was increasing isolation along national and religious lines. The instinctual shortcut that we take when we have “too much information” is to engage with it selectively, picking out the parts we like and ignoring the remainder, making allies with those who have made the same choices and enemies of the rest.

But somehow in the midst of this, the printing press was starting to produce scientific and literary progress. Galileo was sharing his (censored) ideas, and Shakespeare was producing his plays.

“[But] men may construe things after their fashion / Clean from the purpose of the things themselves,” Shakespeare warns us through the voice of Cicero—good advice for anyone seeking to pluck through their newfound wealth of information. It was hard to tell the signal from the noise. The story the data tells us is often the one we’d like to hear, and we usually make sure that it has a happy ending.

Silver goes on to describe the impact of the printing press over the next few centuries, and concludes with this note of caution:

The explosion of information produced by the printing press had done us a world of good, it turned out. It had just taken 330 years—and millions dead in battlefields around Europe—for those advantages to take hold.

Here’s the clincher:

We face danger whenever information growth outpaces our understanding of how to process it. The last forty years of human history imply that it can still take a long time to translate information into useful knowledge, and that if we are not careful, we may take a step back in the meantime.

I’ve been working to squeeze value from information technology for 30 of these last 40 years. I’ve watched with great dismay as we’ve taken many steps back and want to scream as I see this happening to an unprecedented degree under the banner of Big Data. Progress can be made—the opportunity is ripe for the plucking—but the answers do not lie in the solutions that BI vendors are selling; they lie within us.

Silver goes on:

The exponential growth in information is sometimes seen as a cure-all, as computers were in the 1970s. Chris Anderson, the editor of Wired magazine, wrote in 2008 that the sheer volume of data would obviate the need for theory, and even the scientific method.

This is an emphatically pro-science and pro-technology book, and I think of it as a very optimistic one. But it argues that these views are badly mistaken. The numbers have no way of speaking for themselves. We speak for them.

Data-driven predictions can succeed—and they can fail. It is when we deny our role in the process that the odds of failure rise. Before we demand more of our data, we need to demand more of ourselves.

And here’s one final excerpt from the book’s introduction:

Big Data will produce progress—eventually. How quickly it does, and whether we regress in the meantime, will depend on us.

Meanwhile, if the quantity of information is increasing by 2.5 quintillion bytes per day, the amount of useful information almost certainly isn’t. Most of it is just noise, and the noise is increasing faster than the signal. There are so many hypotheses to test, so many data sets to mine—but a relatively constant amount of objective truth.

The signal is the truth. The noise is what distracts us from the truth.

After reading these words in the book’s introduction, you can imagine how excited I was to dive headlong into the book. It was time well spent.

Some chapters interested me more than others (I care little about baseball), but each illustrates revealing failures and successes of prediction. The one disappointment for me was the chapter on climate change. This falls outside of Silver’s expertise and I’m assured by colleagues who know climate science well that Silver didn’t get a balanced perspective when he relied on others to bring him up to speed, including one fellow in particular who is paid by powerful interests to cast misleading doubt on the merits of climate models.

If you’re looking for a book that will teach you how to develop reliable predictive models, you’ll need to look elsewhere. This is not a how-to book. Silver points to Bayesian thinking as the approach that will inform and continually improve good predictive models, but he provides only a rough introduction to this approach. This is intentional. The skills that are required to build good predictive models cannot be learned from a few chapters. The Signal and the Noise will raise awareness and point the way to better predictions. It is up to us to develop the skills to make this happen.
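
To give a flavor of what Bayesian thinking means in practice, here is a minimal sketch in Python. It is not taken from the book: the function names and the probabilities are illustrative assumptions chosen only to show the mechanics. The idea is to start with a prior belief in a hypothesis and then revise it each time new evidence arrives, in proportion to how much more likely that evidence is if the hypothesis is true than if it is false.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Return the posterior probability of a hypothesis after one piece of evidence.

    prior               -- belief in the hypothesis before seeing the evidence (0..1)
    p_evidence_if_true  -- chance of seeing this evidence if the hypothesis is true
    p_evidence_if_false -- chance of seeing this evidence if the hypothesis is false
    """
    numerator = prior * p_evidence_if_true
    denominator = numerator + (1.0 - prior) * p_evidence_if_false
    return numerator / denominator


def update_sequence(prior, p_evidence_if_true, p_evidence_if_false, observations):
    """Apply the same update repeatedly and return the belief after each observation."""
    beliefs = []
    belief = prior
    for _ in range(observations):
        belief = bayes_update(belief, p_evidence_if_true, p_evidence_if_false)
        beliefs.append(belief)
    return beliefs


if __name__ == "__main__":
    # Hypothetical numbers: start undecided (50%), and assume each observation
    # is twice as likely when the hypothesis is true (0.8) as when it is false (0.4).
    for step, belief in enumerate(update_sequence(0.5, 0.8, 0.4, 3), start=1):
        print(f"after observation {step}: {belief:.3f}")
```

With these assumed numbers, each observation doubles the odds in favor of the hypothesis, so belief climbs from 50% toward certainty without ever jumping there in one step. Real predictive models are far more involved, but the habit of stating a prior and revising it as evidence accumulates is the core of the approach.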

Take care,

3 Comments on “Big Data, Big Errors in Prediction: Nate Silver’s “The Signal and the Noise””


By Jason. November 5th, 2012 at 7:14 pm

Ah yes, he was interviewed on The Daily Show maybe a month ago about this book. Funny and quite informative too. Worth watching.

By rkw. November 6th, 2012 at 8:26 am

It’s important to digest the material on baseball. I have repeatedly seen trends in statistical analysis, data presentation, and visualization incubate in baseball before spilling over to other fields including business data.

For more information, explore Nate Silver’s background in Sabermetrics and how it relates to his current FiveThirtyEight blog on political polling. Visit fangraphs.com for intelligent community discussion on statistical methods. Consider the infrastructure that MLB had to invest in to automatically collect and catalog PITCHf/x data on every single pitch, or the business decision to make most of that data publicly available.

I know it seems like a tenuous connection on the surface, but the world of “baseball data” is a huge, mostly transparent case study that could hold many good lessons for the world of “business data.”

By Faisal. December 6th, 2012 at 1:40 pm

I was looking for a technical review of this book in order to understand if Nate was explaining any statistical modeling in this book. Thanks for this write up!