Signal Detection: An Important Skill in a Noisy World

This summer I’ve been spending most of my time working on a new book. The current working title is Signal. As the title suggests, this book will focus on analytical techniques for detecting signals in the midst of noisy data. And guess what? All data sets are noisy. In fact, at any given moment, most of the data that we collect are noise. This will always be true, because signals in data are the exception, not the rule.

Signal detection is actually getting harder with the advent of so-called Big Data. By its very nature, most Big Data will never be anything but noise. Collecting everything possible, on the Big Data argument that the costs of doing so are negligible and that even data you can’t imagine as useful today could become useful tomorrow, rests on a dangerous premise. The costs of collecting and storing everything extend far beyond the hardware that’s used to store it. People already struggle to use data effectively. This will become dramatically harder as the volume of data grows. Finding a needle in a haystack doesn’t get easier as you’re tossing more and more hay on the pile.

Most people who are responsible for data analysis in organizations have never been trained to do this work. An insidious assumption exists, promoted by software vendors, that knowing how to use a particular data analysis software product “auto-magically” imbues one with the skills of a data analyst. Even with good software—something that’s rare—this is far from true. Just as with any area of expertise, data analysis requires training and practice, practice, practice. Because few people whose work involves data analysis possess the required skills, much time is wasted and money lost as analysts pore over data without knowing what to look for. They end up chasing patterns that mean nothing and missing those that are gold. Essentially, data analysis is the process of signal detection.

Data that do not convey useful knowledge are noise. When data are displayed, noise can exist both as data that don’t provide useful knowledge and also as useless non-data elements of the display (e.g., irrelevant visual attributes, such as a third dimension of depth in bars, meaningless color variation, and effects of light and shadow). Both sources of noise must be filtered to find and focus on the signals.

When we rely on data for decision making, what qualifies as a signal and what is merely noise? In and of themselves, data are neither. Data are merely facts. When facts are useful, they serve as signals. When they aren’t useful, data clutter the environment with distracting noise.

For data to be useful, they must:

  • Address something that matters
  • Promote understanding
  • Provide an opportunity for action to achieve or maintain a desired state

When any of these qualities are missing, data remain noise.

Signals are always signs of something in particular. In a sense, a signal is not a thing but a relationship. Data become useful knowledge of something that matters when they connect understanding to a question to form an answer. This connection (relationship) is the signal.

As I work on this book to define the nature of signals and to describe techniques for detecting them, I could benefit from your thoughts on the matter. In your experience, what data qualify as signals? How do you find them? What do you do to understand them? What do you do about them once found? What examples have you seen in your own organization or in others of time wasted chasing noise? What can we do to reduce noise? Please share with me any thoughts that you have along these lines.

Take care

20 Comments on “Signal Detection: An Important Skill in a Noisy World”


By Colin Michael. August 13th, 2013 at 6:38 am

You are right that the signal is getting buried in more hay than ever. I can remember setting up instruments to send us hourly readings from key systems (fluid temp, pressure, and flow) in the nuclear power plant where I worked. It was a big step up from the once-per-shift reading we were getting. We had an argument at the time about how we might get lazy about analyzing the data that would just flow in all day and night, whereas before it would take over an hour just to gather readings from various gauges, make calculations, etc., which seemed to make the results much more prominent and important. Those same readings are recorded several times per second now, even though it may take several days to establish a trend or any significance. Not to mention that all of the other readings we took once per shift are also coming in multiple times per second. There is so much data that we just program computers to look for trends and then disengage our brains and wait for the Klaxon to break the stupor.

By Neil Barrett. August 14th, 2013 at 9:05 am

There’s the noise of the ‘ill-formed performance indicator that must *never* be red’ — it’s difficult to interpret meaningfully but easy to draw ‘light’ conclusions from, with much time wasted on teasing out what it means, why it’s good/bad, why the result shown isn’t the real result.

And a personal favourite is the ‘spurious detail associated with something that is important’ — see it in all its multilayered splits.

And of course, there’s the ‘show me all the numbers in a table because I’m an accountant and incapable of reasoned thinking’ type of noise. Even when faced with something straightforward (a line chart), this isn’t tractable to the techniques of accountants (does it reconcile…), and hence the pretty pictures are just noise to them.

In fact, I think quite a lot of noise comes down to the needs of a few individuals to have the right contextual information to understand something that may not be framed in the way to which they have become accustomed. The context replaces the essence of the signal. And it can, indeed, replace the signal with false interpretations and misprioritisation. Take the classic market research report — the data doesn’t say a lot, but as soon as you’ve done all the biographic splits and organisational splits, done some regression stats, and worked out some confidence indicator here and there, there’s always a ‘just-so’ story to tell. All noise.

Cathartic, thank you ;-)

By Meic Goodyear. August 16th, 2013 at 6:35 am

What is signal and what is noise varies from reader to reader, and from time to time for the same reader. For example, in the deprived inner city borough where I work, the population experiences high all-age and premature mortality rates from a number of potentially manageable long term conditions, and we know our General Practitioners (family physicians in US terms) identify far less prevalence of these conditions than the best available predictive models imply should be present. We wanted to see whether these two facts were linked, so I calculated directly age-standardised mortality rates (DASR) for each GP Practice’s 3-year data and regressed these against recorded-to-expected prevalence ratios. For this exercise the month-to-month figures and the uncertainty around each GP Practice’s DASR were effectively noise, though I calculate 95% CIs for DASRs as a matter of course. However, the actual results of the calculations threw up serious questions about mortality rates for two specific GP practices, which meant the former “noise” became the main signal, explored via caterpillar charts, funnel plots, and cusum graphs, and leading to a detailed investigation into who was actually responsible for care at certain nursing homes.
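For readers who haven’t met direct standardisation, a directly age-standardised rate is simply a weighted sum of age-specific rates, with the weights drawn from a reference (“standard”) population. The notation below is mine, not Meic’s:

```latex
\mathrm{DASR} \;=\; \sum_{i} w_i \, r_i ,
\qquad
r_i = \frac{d_i}{n_i} ,
\qquad
w_i = \frac{N_i^{\mathrm{std}}}{\sum_j N_j^{\mathrm{std}}}
```

where, for age band $i$, $d_i$ is the number of deaths observed, $n_i$ the person-years at risk in the study population, and $N_i^{\mathrm{std}}$ the size of that band in the standard population. The 95% CIs Meic mentions are then computed around this weighted sum.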

By Stephen Few. August 16th, 2013 at 7:24 am

Great example, Meic. Signals aren’t always anticipated. They sometimes provide an answer to a worthwhile question that we haven’t yet thought to ask. You mentioned funnel plots, which are a valuable form of display that most people haven’t encountered (not to be confused with those silly sales funnel charts). For this reason, they’ll have a place in my new book.

By Cecelia. August 22nd, 2013 at 2:17 pm

In your experience, what data qualify as signals? How do you find them? What do you do to understand them? What do you do about them once found? What examples have you seen in your own organization or in others of time wasted chasing noise? What can we do to reduce noise?
I’m a thin film engineer for an industrial manufacturer. We use large vacuum systems to apply thin films to large parts, and there are many different factors to control and many different responses to observe. Identifying controls and their effects is a big challenge when we want to make a product.

In a large vacuum system, things like gas pressure, line speed, electrical power, and conditioning time are common controls. Things like film thickness, density, uniformity, and color are common responses. A signal we often look for is a change in a response from a change in a control. To understand these signals, we will perform screening DOEs (designs of experiments). The output of a well-designed DOE is evidence that one or more factors have a significant effect on the responses, commonly shown in a group of effects plots. Once you have identified the factor/response relationship, you can optimize that relationship and manufacture with the greatest possible efficiency.

An example of wasted time: often there is no effect from a single factor, but there may be an effect from the interaction of two factors. Maybe a low line speed and a high conditioning time give you the thickest, most uniform film possible, but you wouldn’t know it by looking at either factor individually. Ignoring these combined effects is a big risk.

Reducing this noise: increase awareness of good “signal detection” techniques. Explore a wider range of operating spaces, and show the entire capability map.
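The interaction point Cecelia makes is easy to demonstrate. Here is a minimal sketch, with hypothetical factor names and simulated data rather than her actual process, assuming Python with pandas and statsmodels: the interaction term carries the signal even though neither main effect does.

```python
# Minimal sketch: a two-factor model with an interaction term.
# Factor names and data are hypothetical, not Cecelia's actual process.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 40
runs = pd.DataFrame({
    "line_speed": rng.choice([-1.0, 1.0], n),         # coded low/high level
    "conditioning_time": rng.choice([-1.0, 1.0], n),  # coded low/high level
})
# Simulated response: thickness driven almost entirely by the interaction.
runs["thickness"] = (
    100.0
    + 0.2 * runs["line_speed"]
    + 0.3 * runs["conditioning_time"]
    - 4.0 * runs["line_speed"] * runs["conditioning_time"]
    + rng.normal(0.0, 1.0, n)
)

# 'a * b' in the formula expands to both main effects plus the a:b interaction.
model = smf.ols("thickness ~ line_speed * conditioning_time", data=runs).fit()
print(model.summary())  # the line_speed:conditioning_time row is the one that matters
```

Looking only at the two main-effect rows of the summary would suggest nothing is happening; the interaction row is where the combined low-speed, long-conditioning effect shows up.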

By David Dotson. August 23rd, 2013 at 9:01 am

Steve,
It’s always a welcome experience to read about any topic that generates reflection when the author is in command of the subject matter. For that I say thank you. Having consulted executives and major ERP providers about the value of real-time business intelligence, I believe your observations about “Signals” are right on the money.
I would add that in-memory computing and other Big Data advancements present us with improved ways to not only identify Signals but also distill the “meaningful” data from the enterprise with greater accuracy. The increase in “meaningful data”, taking three forms (historical, current, and predictive), can be tailored to meet the needs of specific industries, organizations, and departments. This “meaningful data” will foster greater collaboration and effectiveness. Just as we look back with a smile at Bill Gates, who once pondered how anyone would need more computing power than the mid-90s PC delivered, people will look back at this day and chuckle at how they thought dashboards reflecting KPIs were the pinnacle of business intelligence. Big Data consumer sophistication will soon advance, and users will ask why we aren’t linking dashboards to industry-level and macroeconomic indicators so we can dramatically improve our ability to predict the future and identify new, meaningful opportunities to reach our vision.

By Stephen Few. August 23rd, 2013 at 9:15 am

David,

In my blog post I argued that so-called Big Data is actually making it harder to detect signals in data. You seem to believe the opposite. You are applauding the potential of Big Data to “not only identify Signals but also distill the ‘meaningful’ data from the enterprise with greater accuracy.” This sounds like a line that we might read in a vendor’s promotional literature. Do you have any evidence that we’re doing a better job of detecting signals in data today than we did five years ago, ten years ago, or even twenty years ago? Perhaps we’re making more mistakes today in our efforts to detect signals as the volume, velocity, and variety of data grow. On what does your confidence in Big Data rest?

By Naveen Michaud-Agrawal. August 23rd, 2013 at 10:25 am

Hi Cecelia,

That’s a very common engineering problem (optimize a number of controls to get responses within a certain tolerance, with correlation between controls). I’ve never really seen a good commercial tool for easily exploring this kind of dataset. There are some interesting research prototypes from the 90s (for example, see this brief paper from 1995 – http://www.ee.ic.ac.uk/r.spence/pubs/TSDS95.pdf, and other research from Lisa Tweedie – http://scholar.google.com/scholar?q=Tweedie%2C+Lisa&btnG=&hl=en&as_sdt=0%2C33). Unfortunately, none of those ideas ever made it out of the lab.

By Larry Keller. September 2nd, 2013 at 5:57 am

Perhaps the content will include an approach to a visual signal-to-noise ratio.

By Rich Bradford. September 3rd, 2013 at 8:08 am

Stephen,

Thank you for the insightful article – an excellent example of signal in and of itself. One of the most meaningful learnings for me came in analyzing “door-to-balloon” (DTB) times for a group of cardiologists. In 2011, their patients had a mean DTB time of 64 minutes – some 4 minutes above their desired time. The group implemented a number of process improvement initiatives, and the 2012 mean DTB time increased to 65 minutes – much to the physicians’ dismay. However, when the variability in times was presented with a simple box plot, a completely different picture emerged. Although the mean DTB time increased, the variability in times decreased dramatically. This decrease in DTB time variability validated their process improvement efforts and highlighted the fact that a single value, such as the mean, may be misleading (and in this case disheartening (no pun intended)). Importantly, the cardiologists came to appreciate that the mean DTB time alone might martingale around a stationary value and be misleading if not accompanied by some measures of dispersion. It has been my experience that too often presentations overlook the signal embedded in the variability of a distribution. This exercise also reminded me of the value professional analysts bring to an organization.

Regards,

Rich
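Rich’s point is easy to reproduce with simulated numbers (these are made up, not his actual DTB data): a distribution with a slightly higher mean but a much tighter spread looks worse as a single average and clearly better as a box plot. A quick sketch in Python:

```python
# Simulated (not actual) door-to-balloon times: 2012 has a slightly higher
# mean than 2011 but far less variability, which a lone mean conceals.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
dtb_2011 = rng.normal(loc=64, scale=18, size=200)  # wide spread
dtb_2012 = rng.normal(loc=65, scale=6, size=200)   # tighter spread

for year, times in (("2011", dtb_2011), ("2012", dtb_2012)):
    iqr = np.percentile(times, 75) - np.percentile(times, 25)
    print(f"{year}: mean = {times.mean():.1f} min, IQR = {iqr:.1f} min")

plt.boxplot([dtb_2011, dtb_2012], labels=["2011", "2012"])
plt.ylabel("Door-to-balloon time (minutes)")
plt.title("Mean up slightly, variability down sharply")
plt.show()
```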

By Stephen Few. September 3rd, 2013 at 8:51 am

Rich,

Thanks for the great example of the signals that reside in variation. The importance of understanding variation will be featured in my book (variation within categories, variation within measures, variation through time, variation across space, and variation within relationships between measures). If we understand these forms of variation, we understand our data.

By Michelle. September 9th, 2013 at 4:18 am

There is one little thin book which taught me so much about seeing through noise when it comes to data: Understanding Variation by Donald Wheeler.

By Stephen Few. September 9th, 2013 at 7:37 am

Michelle,

All of Donald Wheeler’s books are excellent, especially Understanding Variation. Much that I’ve learned from Wheeler’s books will be reflected in my own.

By Peter Z. September 19th, 2013 at 9:48 am

Speaking of Donald Wheeler, here’s a link to Wheeler’s online articles at Quality Digest: http://www.qualitydigest.com/read/content_by_author/12852. The articles are a fairly easy read, content-rich, and insightful.

By Kris Erickson. September 23rd, 2013 at 6:10 am

Stephen,

In this article (http://www.livescience.com/23709-blind-people-picture-reality.html) it says that “When blind people read Braille using touch, the sensory data is being sent to and processed in the visual cortex.” Could touch and temperature be used to give the blind a parallelism that would allow the visual cortex to receive and perceive two different stimuli?

By Bill Dean. September 26th, 2013 at 11:06 am

I’m surprised that no one’s mentioned social media yet. Here’s a paradoxical resource that is said to be unfailingly “noisy”, but it is a series of signals for the discovery of events (concerts, flash deals), problems (product issues, illness, natural disasters, “amber” alerts), or new products (XBOX, iPhone, etc.). A signal to one person is noise to another (e.g., the 49ers losing is just noise that works to drown out my XBOX signal). The key is to do this at the right context and altitude. Because social media creates an environment that connects related people and companies, context is somewhat curated naturally.
The mathematical challenges lie both in keeping it as simple as possible (and still in a useful form) for those who consume it, and in the complex computation options (are the boundaries for what qualifies as a signal derived from variation, absolute units, percentage change, or some more complex algorithm…and calculated/compared at what time units…and is it compared to the last data points or to the same time period for the same day of week over the past few weeks?). The right model will differ by domain (percent change might be okay when building airplanes because few units exist, but something more complicated is required when measuring Twitter mentions).
The real challenge is being both accurate and quick in identifying signals. If you identify them quickly, you might have too many false positives. Minimizing false positives means you are likely waiting too long to declare something a “signal”. You can’t wait until waves are hitting buildings to confirm the signal of a potential tsunami. Ring the bell too often when there’s no danger, and the “signal” bell will soon be ignored.
In my business, we focus on product launches as well as identifying issues that can happen “out in the wild”. Signals across social media, support calls, etc. are critical in knowing when a signal is begging for an action. The action is wholly dependent on the signal, which can include download, installation, or account problems. These potential issues are defined in a generic enough way to cast a wide net, but not one so big that you can’t see it move (perfectly defining these things isn’t necessary). A baseline is defined by the typical (read: average) volume for that hour (Fridays and Sundays aren’t necessarily equal for all measures—therefore Friday 10–11 is compared to the previous Fridays at 10–11). If every hour is homogeneous, then this isn’t necessary…it depends on the domain.
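A minimal sketch of the hour-of-week baseline Bill describes, with made-up data and an arbitrary three-sigma threshold (the function name and cutoff are mine, not his):

```python
# Hypothetical sketch: compare each hour's volume to the same hour of the week
# over prior weeks, and flag hours that sit far outside that baseline.
import pandas as pd

def flag_signals(volumes: pd.Series, weeks_back: int = 4, k: float = 3.0) -> pd.DataFrame:
    """volumes: hourly event counts on a regular hourly DatetimeIndex."""
    df = volumes.to_frame("volume")
    # Baseline: the same hour on the same weekday, one to weeks_back weeks earlier
    # (168 hours = one week on a regular hourly index).
    history = pd.concat(
        [volumes.shift(168 * i) for i in range(1, weeks_back + 1)], axis=1
    )
    df["baseline_mean"] = history.mean(axis=1)
    df["baseline_std"] = history.std(axis=1)
    df["signal"] = (df["volume"] - df["baseline_mean"]).abs() > k * df["baseline_std"]
    return df

# Usage with a made-up series: steady volume with one injected spike.
hours = pd.date_range("2013-09-01", periods=24 * 7 * 6, freq="h")
volumes = pd.Series(100.0, index=hours)
volumes.iloc[-5] += 300.0  # the spike we hope to flag
print(flag_signals(volumes)[["volume", "baseline_mean", "signal"]].tail(6))
```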
I’m looking forward to this upcoming book. I’ve loved the design and content of the previous ones and expectations are high.
The books “2-second Advantage” (by Ranadive) and “Pulse” (by Hubbard) are nice complements to “Predictive Analytics” with a focus on signals and have great examples.

By Mike Sharkey. October 2nd, 2013 at 9:02 am

Thanks for bringing up the ‘Big Data’ topic. As the founder of an analytics company, I try to shy away from using that term. The focus is really on the value you deliver, not the data/medium/tool that you use to achieve the goal. It’s the same way that the terms “report” and “dashboard” get conflated. I don’t care what you call the thing…as long as it conveys some valuable bit to me in a simple fashion.

As far as your signal/noise topic goes, I’ve done a fair amount of work in delivering data-driven results in higher education (e.g., a model to determine which students are at risk of failure). One of my takeaways is that the hard part isn’t the analysis (separating the signal from the noise). It’s work, but it’s just the core part of the problem. The nuts and bolts are getting the data in (from multiple sources) and getting it back out (so someone can act on it). It’s a subtle difference, but for those of us in the trenches, it’s vital.

By Stephen Few. October 2nd, 2013 at 7:54 pm

Mike,

Getting the data and whipping it into usable shape can be challenging, but while focusing on this aspect of data management, the BI/data warehousing industry has never fully appreciated the fact that effective data analysis requires a level of skill that is more formidable than the work of data extraction, transformation, and loading (ETL). I say this having done a great deal of both. The fact that more time is typically spent on ETL to prepare data than is later spent on data analysis is a misallocation of resources, based on a skewed sense of priorities. We dare not skimp on ETL, but the real work only begins once the data is available for exploration and analysis. What passes for data analysis in most organizations, however, is little more than standard report development. The kind of data analysis that is required to separate true signals from the noise and make sense of those signals requires more sophisticated thinking skills than the back-end tasks of data management. Both require a great deal of skill, but back-end work is more specialized and technical (i.e., learning to use particular tools), whereas data analysis requires a broader array and subtler collaboration of cognitive abilities.

By Glenn Exton. October 17th, 2013 at 8:52 am

Steve,

Many years ago, whilst I was in the Department of Defence, I was heavily involved in the surveillance and analysis of ‘digital communication signals’ (phase shift keying, spread spectrum, etc.). We conducted many experiments in order to extract meaning out of signals that seemed so random and, at times, just seemed like noise. However, by applying various techniques such as cross-correlation, autocorrelation, and FFTs, to name a few, we were able to find meaning. Hence, I’m quite interested in how Big Data is now entering similar ground to what we faced over 20 years ago. I’m very curious about the angle that you are exploring and how to detect meaning when the ‘noise’ floor is rising fast, masking true information and lowering the S/N ratio. Regards, Glenn
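As a small illustration of the cross-correlation idea Glenn mentions (a generic sketch with synthetic data, nothing to do with his actual work): a known template buried in noise still produces a clear correlation peak at the right offset.

```python
# Generic sketch: find a known waveform buried in noise via cross-correlation.
import numpy as np

rng = np.random.default_rng(42)
template = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 200))  # known 200-sample waveform

received = rng.normal(0.0, 1.0, 2000)  # noise of the same order as the signal
true_offset = 1200
received[true_offset:true_offset + 200] += template  # bury the template in the stream

# Slide the template across the received stream; the correlation peak marks it.
corr = np.correlate(received, template, mode="valid")
estimated_offset = int(np.argmax(corr))
print(f"true offset: {true_offset}, estimated offset: {estimated_offset}")
```

For long records the same correlation can be computed via FFTs, which is one reason the two techniques tend to travel together.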

By Stephen Few. October 17th, 2013 at 5:24 pm

Glenn,

I suspect that the point of your surveillance activity was better defined than what Big Data advocates. The basic tenet of Big Data is “collect everything without worrying about actual use, which may or may not eventually reveal itself, because the cost of data collection and storage is inconsequential.” I don’t espouse this wasteful and unfocused hope. I also know that the cost of data collection and storage is far from inconsequential. The methods that I teach for signal detection today are basically the same techniques that have been used to detect signals in data for years. All I’m doing is paring them down to the most generally useful techniques, explaining them as clearly as possible, and showing how today’s tools can be used to do the work more easily than in the past. Relatively little, if anything, that goes by the name Big Data provides new ways of detecting signals. Instead, Big Data is mostly just making the pile of data bigger and harder to navigate.