This summer I’ve been spending most of my time working on a new book. The current working title is Signal. As the title suggests, this book will focus on analytical techniques for detecting signals in the midst of noisy data. And guess what? All data sets are noisy. In fact, at any given moment, most of the data that we collect are noise. This will always be true, because signals in data are the exception, not the rule.
Signal detection is actually getting harder with the advent of so-called Big Data. By its very nature, most Big Data will never be anything but noise. Collecting everything possible, based on the Big Data argument that the costs of doing so are negligible and that even data you can’t imagine as useful today could become useful tomorrow, rests on a dangerous premise. The costs of collecting and storing everything extend far beyond the hardware that’s used to store it. People already struggle to use data effectively. This will become dramatically harder as the volume of data grows. Finding a needle in a haystack doesn’t get easier as you toss more and more hay on the pile.
Most people who are responsible for data analysis in organizations have never been trained to do this work. An insidious assumption exists, promoted by software vendors, that knowing how to use a particular data analysis software product “auto-magically” imbues one with the skills of a data analyst. Even with good software—something that’s rare—this is far from true. Just as with any area of expertise, data analysis requires training and practice, practice, practice. Because few people whose work involves data analysis possess the required skills, much time is wasted and money lost as analysts pore over data without knowing what to look for. They end up chasing patterns that mean nothing and missing those that are gold. Essentially, data analysis is the process of signal detection.
Data that do not convey useful knowledge are noise. When data are displayed, noise can exist both as data that don’t provide useful knowledge and as useless non-data elements of the display (e.g., irrelevant visual attributes, such as a third dimension of depth in bars, meaningless color variation, and effects of light and shadow). Both sources of noise must be filtered out to find and focus on the signals.
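As a purely illustrative sketch (not a technique from the book), one of the simplest ways to suppress noise in a data series and expose an underlying signal is a moving average. Here, a slow upward trend (the signal) is buried in random noise, and a rolling mean recovers it; the series, window size, and values are all invented for the example:

```python
import random

def rolling_mean(values, window):
    """Smooth a noisy series with a simple moving average."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

random.seed(42)
# A slow upward trend (the signal) buried in Gaussian noise.
series = [0.1 * i + random.gauss(0, 2.0) for i in range(100)]
smoothed = rolling_mean(series, window=20)

# The smoothed series tracks the trend far more closely than the raw data:
# its end is clearly higher than its start, while adjacent raw values
# bounce around too much to show the trend directly.
```

The point isn't the particular filter; it's that even a crude model of what you expect (here, a smooth trend) lets you separate signal from noise, whereas collecting more raw data alone does not.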
When we rely on data for decision making, what qualifies as a signal and what is merely noise? In and of themselves, data are neither. Data are merely facts. When facts are useful, they serve as signals. When they aren’t useful, data clutter the environment with distracting noise.
For data to be useful, they must:
- Address something that matters
- Promote understanding
- Provide an opportunity for action to achieve or maintain a desired state
When any of these qualities are missing, data remain noise.
Signals are always signs of something in particular. In a sense, a signal is not a thing but a relationship. Data become useful knowledge of something that matters when they connect understanding to a question to form an answer. This connection (relationship) is the signal.
As I work on this book to define the nature of signals and to describe techniques for detecting them, I could benefit from your thoughts on the matter. In your experience, what data qualify as signals? How do you find them? What do you do to understand them? What do you do about them once found? What examples have you seen, in your own organization or others, of time wasted chasing noise? What can we do to reduce noise? Please share with me any thoughts that you have along these lines.