Thanks for taking the time to read my thoughts about Visual Business Intelligence. This blog provides me (and others on occasion) with a venue for ideas and opinions that are either too urgent to wait for a full-blown article or too limited in length, scope, or development to require the larger venue. For a selection of articles, white papers, and books, please visit my library.


How Scalable Do Analytics Solutions Need to Be?

August 6th, 2015

While being briefed on a product earlier this week, the company’s founder and I agreed on one point only: most of the people who are currently tasked with data analysis lack the skills that are required to do the work. He and I, however, imagine conflicting solutions to this problem. He believes that technology must come to the rescue by doing the work for these people who can’t do it for themselves. I believe that even the best technologies cannot do the work of skilled data analysts and that the problem can only be effectively addressed by helping people develop analytical skills. He agreed that equipping people with the necessary skills would work better, but dismissed it because it is not a “scalable solution.” The essence of his case went something like this: “Data is increasing at an exponential rate, so our need for analytics cannot be solved by investing in human resources because humans are not sufficiently scalable, but technologies are.” Consider this line of reasoning for a moment. It relies on the following premise: “Exponential increases in data can only be addressed by exponential increases in analytical horsepower.” This premise is fallacious. Nate Silver made this point in his book The Signal and the Noise when he wrote:

If the quantity of information is increasing by 2.5 quintillion bytes per day, the amount of useful information certainly isn’t. Most of it is just noise, and the noise is increasing faster than the signal. There are so many hypotheses to test, so many data sets to mine—but a relatively constant amount of objective truth.

The exponential growth in raw data that we’re experiencing is mostly producing noise. The amount of useful information is not increasing exponentially; therefore, the need for analytical horsepower is not increasing exponentially either. Data sensemaking is a human activity that can at best be augmented and assisted by analytical tools. The only viable solution to the analytical challenges that we face is to develop the human resources that we need. This is where our attention and our investments should be focused. Don’t trust a technology vendor who claims that skilled data analysts can be replaced with his product. That analytical product does not exist.

This company’s founder claims that his product can analyze a data set and present all of the potentially useful findings in a series of simple graphs and plain English explanations without any human involvement. During the briefing, he made an off-the-cuff comment that caused the hairs on the back of my neck to bristle. He said that his product “empowers users.” He must understand empowerment quite differently than I do. As I understand it, empowerment involves an increase in ability. Software that does for you what you could do better yourself with proper training isn’t empowering.

This fellow’s notion of empowerment bothered me because I work hard to actually empower people by teaching them analytical skills. I know how much it means to people to become truly empowered with useful abilities that enable them to affect the world in beneficial ways. No one with an ounce of integrity wants to bear the title “data analyst” while doing nothing but delivering a computer’s findings to someone else without adding any value. If this is the future that analytics technologies promise, count me out. Fortunately, this isn’t a future that technologies are likely to achieve.

Take care,


Morality, Fast and Slow

July 28th, 2015

Morality is a function of the brain. When we make a distinction between matters of the heart and head, we are in fact distinguishing two modes of thinking that take place in our brains: System 1 thinking, which is fast, emotional, and intuitive (heart), and System 2 thinking, which is slow, rational, and deliberative (head). Daniel Kahneman beautifully describes these two modes of thinking in his book Thinking, Fast and Slow. Both modes are useful, but they are best suited for different tasks. Some situations call for going with our hearts (or guts, as in “gut feelings”), and others require the higher-order rational thinking that the prefrontal cortex (PFC) evolved to handle. In his book Moral Tribes: Emotion, Reason, and the Gap Between Us and Them, Joshua Greene, who heads the Moral Cognition Lab in Harvard University’s department of psychology, explains how this distinction applies to morality.

The moral sense that we feel in our guts and experience intuitively is a product of System 1 thinking. Some things just feel wrong and others just feel right. This moral sense evolved in our species to enable cooperation within groups. Social cooperation created an Us (our tribe) that could better compete against Them (other tribes). This automatic sense of morality takes on somewhat different forms from tribe to tribe (i.e., cultural groups, including distinct subcultures within societies, such as liberals and conservatives), but it is largely universal in nature, dissuading us from cheating our neighbors and killing our friends. Because it evolved to help us compete against other groups to give us an advantage for propagating our own kind, this moral sense pits Us against Them in a way that complicates matters in the modern world. The kind of morality that is needed to embrace a global Us is a product of System 2 thinking. Just as we need to know when to shift into System 2 thinking to solve our personal and group problems, we must do the same to solve our global problems by creating a shared metamorality for the modern world.

In his book Moral Tribes, Joshua Greene explains how morality evolved, how it works in our brains, and how it can be shaped in rational ways to enable our species and the world that we share to survive and perhaps even flourish. Here’s an excerpt from the book:

The human brain is like a dual-mode camera with both automatic settings and a manual mode. A camera’s automatic settings are optimized for typical photographic situations (“portrait,” “action,” “landscape”). The user hits a single button and the camera automatically configures the ISO, aperture, exposure, et cetera — point and shoot. A dual-mode camera also has a manual mode that allows the user to adjust all of the camera’s settings by hand. A camera with both automatic settings and a manual mode exemplifies an elegant solution to a ubiquitous design problem, namely the trade-off between efficiency and flexibility. The automatic settings are highly efficient, but not very flexible, and the reverse is true of the manual mode. Put them together, however, and you get the best of both worlds, provided that you know when to manually adjust your settings and when to point and shoot.

The rational means that Greene proposes as the basis for System 2 (manual camera mode) moral thinking is an old philosophy with an unfortunate name: utilitarianism. Despite the name, utilitarianism doesn’t frame life in cold, mechanistic terms, but strives to achieve the greatest life experiences for the most people possible without partiality. It is its impartiality that allows us to exceed the boundaries of our separate tribes. This 19th-century philosophy of Jeremy Bentham and John Stuart Mill offers new hope for our species to shape a truly moral world.

Moral Tribes is an important book. This is not merely because it is thoughtful, well written, and innovative, but also because it teaches a lesson that we desperately need to learn. Despite tremendous historical strides in reducing violence in our world, our potential for doing harm to the earth and one another due to the power of our modern technologies is far greater than in the past. We who work with information technologies dare not ignore the concerns of morality by compartmentalizing them as irrelevant to our work. What we do with information has a moral dimension that is considered far too seldom. The moral thinking that we need today is not the morality of our forebears. We owe it to future generations to get this right.

Take care,


Abela’s Folly – A Thought Confuser

July 21st, 2015

On the home page of my website, I quote the mathematician and philosopher Alfred North Whitehead who said, “Seek simplicity and distrust it.” This is wise advice. We want to keep things as simple as possible, but we should never oversimplify to the point of losing essential complexity. As data visualization has become increasingly popular during the last decade, efforts to explain it have often become simplistic (i.e., oversimplified) to a harmful degree. We humans long for simple answers. The world, however, is in many ways complex. Data sensemaking and presentation skills are easy to learn, and once we’ve learned them, they seem simple and even obvious, but there is no denying that the concepts, principles, and practices are complex. We should “seek simplicity and distrust it.”

During the last year or so I’ve come across several people and organizations that were promoting the use of a chart selection diagram that was developed by Dr. Andrew Abela for his book Advanced Presentation by Design. The diagram, titled “Chart Suggestion—A Thought Starter,” serves as a guide for selecting an appropriate graph. This guide is simplistic and misleading. To be blunt, it is a confusing mess of internal contradictions and errors. While Abela might understand many aspects of effective presentation, his knowledge of data visualization is cursory at best.

Most recently, I encountered this diagram in an otherwise sane blog article by the software company iCharts. In the article, ironically titled “How to Avoid Misleading Your Audience,” they recommend the diagram as a useful guide. I responded immediately by warning them against it. Several months ago, I encountered the diagram when a fellow who was writing a book asked for my advice regarding a chapter about data visualization. He was planning to include Abela’s diagram in that chapter. Why? It conveniently fit on a page.

One-page diagrams are a tempting way to teach people new skills, but they often result in confusion. Those of you who are familiar with my work might be thinking, “But Steve, don’t you provide a one-page Graph Selection Matrix as a guide for novices?” I do provide a Graph Selection Matrix, but not as a guide for novices. I provide it as a single-page summary of the information about graph selection that’s covered in my book Show Me the Numbers. It’s only useful if you’ve already learned the concepts that it summarizes by reading the book or taking the corresponding course.

I first learned of Abela’s work several years ago when a large corporation asked me to provide ongoing data visualization training for its employees in conjunction with Abela’s presentation skills courses. Before responding to their request, I purchased and read Abela’s book. I not only found that his understanding of graphs was confused and fundamentally flawed, but also that his presentation principles were at times naïve. I told the company that I could not teach in conjunction with Abela because confusion would result.

Here’s Abela’s chart selection diagram. Take a few minutes to examine it on your own before reading my critique below. Does it make sense? Are its suggestions valid?

Where to begin? Let’s follow the sequence of choices moving outwards from the center: “What would you like to show?” We’re given four choices: 1) Comparison, 2) Distribution, 3) Composition, and 4) Relationship. This suggests that at the highest level we always want to show one of these four things in a graph. I call them “things” because I can’t think of a common term that describes these mismatched concepts. Comparison is an activity that all graphs are designed to support, Distribution is a specific feature of a set of quantitative values, Composition refers to that of which something is composed, and Relationship is a feature that exists between values in some form or another in all data sets. These concepts don’t go together. Only one of the four—Distribution—clearly describes a specific attribute of data. Distribution refers to the manner in which a set of quantitative values belonging to a single variable is distributed from lowest to highest. For example, we might want to show how employees in a company are distributed by age from youngest to oldest. Unless we need to display the distribution of a set of quantitative values (and we understand that this is what Abela means by the term distribution), the diagram leaves us hanging with no clear direction.
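To make the one clear term concrete, here is a minimal Python sketch of what a distribution display is built from: counts of values within equal-width intervals, ordered from lowest to highest. The ages are invented for illustration, and the function name `age_distribution` and the ten-year interval width are assumptions of this sketch, not anything taken from Abela’s diagram or book.

```python
from collections import Counter


def age_distribution(ages, bin_width=10):
    """Count how many ages fall into each equal-width interval.

    Assumes a non-empty list of non-negative whole-number ages.
    Returns (low, high, count) tuples ordered from lowest to highest,
    including empty intervals so that gaps in the data stay visible.
    """
    counts = Counter((age // bin_width) * bin_width for age in ages)
    first, last = min(counts), max(counts)
    return [
        (start, start + bin_width - 1, counts[start])
        for start in range(first, last + bin_width, bin_width)
    ]


# Hypothetical employee ages, invented for illustration only.
ages = [23, 27, 31, 34, 35, 38, 41, 42, 45, 47, 52, 58, 63]

# A crude text histogram: one mark per employee in each interval.
for low, high, count in age_distribution(ages):
    print(f"{low}-{high}: {'#' * count}")
```

The interval width is a judgment call. Different widths can reveal or hide features of the same set of values, which is one reason a distribution display takes more thought than a flow chart can supply.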

You might assume that Abela’s book clarifies these choices, but it doesn’t. Here’s an example of the brief explanations that his book provides: “The last option…is composition: this is when you want to highlight the components of your data.” In this context, what does “components of your data” mean? You and I might understand that he’s referring to the items that make up a categorical variable, but would someone with no training in data visualization understand this? How about the terms distribution and comparison? Abela doesn’t explain what he means by these terms at all. The closest that he comes to an adequate explanation of these high-level choices is when he provides an example of a relationship: “If you want to show that your data provides evidence of a relationship, for example, between advertising and sales revenues, then you should move to [that part of the diagram].” This reveals that by the term “relationship” he’s referring to a correlation between two or more quantitative variables, but this is certainly not the only relationship that exists in quantitative data.

If we choose Relationship, we can follow the flow diagram to the only section that provides valid graph suggestions: a Scatter Chart if we need to show a correlation between two quantitative variables and a Bubble Chart if we want to show a correlation between three variables. Other than here in this one section, Abela’s suggestions are seriously flawed.
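As a numeric companion to the scatter chart, the strength of a linear correlation between two quantitative variables can be summarized with Pearson’s correlation coefficient. The sketch below is a minimal illustration in Python. The advertising and sales figures are invented, and the function name `pearson_r` is an assumption of this sketch, not something from Abela’s book.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences.

    Assumes at least two paired values and that neither variable is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of paired deviations from the means.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / sqrt(ss_x * ss_y)


# Hypothetical monthly figures (in thousands), invented for illustration only.
advertising = [10, 15, 20, 25, 30, 35]
sales = [120, 150, 165, 210, 220, 260]

print(f"r = {pearson_r(advertising, sales):.2f}")
```

A single coefficient describes only the strength and direction of a linear relationship. The scatter chart remains necessary for seeing the shape of the relationship, along with any clusters, gaps, and outliers.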

Let’s move on to the Comparison section. Our first choice is to show comparisons Among Items or Over Time. Those of us who are experienced data analysts know that by “Items” Abela is referring to the items that make up a categorical variable, but would this be clear to a novice? “Over Time” is clear, but why do values that change over time belong to the comparison section any more than the many other comparisons that we routinely need to enable in graphs? Also, Changing Over Time, as distinct from merely Over Time, appears as a first-level choice in the Composition section, which we’ll encounter later. Suffice it to say, this is not a clear and useful taxonomy of graph types.

Let’s proceed. If we select Among Items rather than Over Time, we are then faced with the choices Two Variables per Item, which leads us to a graph that few products support and for good reason—a vertical bar graph that varies not only the heights of the bars, but also their widths to simultaneously display two quantitative variables—or One Variable per Item, which leads us to the following choice: Many Categories or Few Categories. If we select Many Categories, we are told that we should use a Table or a Table with Embedded Charts. So, according to Abela, when are tables useful? Apparently, only when we must display many categories. Oh my. What about the usefulness of tables when people need precise values? What about their usefulness when people need an easy way to look up specific values? And how do we choose between a Table and a Table with Embedded Charts? By a Table with Embedded Charts, Abela is referring to what Edward Tufte calls a small multiples display—specifically one that is arranged in both columns and rows. According to Abela, small multiples are only useful when we want to make comparisons among items, but this is hardly the case. What about a series of small multiples that are used to compare values “Over Time” rather than “Among Items,” composed of line graphs? Or what about a small multiples display of scatter plots for comparing correlations? Apparently, these aren’t appropriate options.

Onward once again. If we choose Few Categories rather than Many Categories, we must then choose between Many Items and Few Items. So, if we only need to display a few categories, what if some of them contain many items and others contain few items? What do we do then? Let’s keep it simple so we can proceed. Let’s say that we need to display a single categorical variable that consists of many items, perhaps a category called product that consists of fifty individual products, rather than a category called product family that consists of only four items. In this case, according to the diagram, we should use a horizontal bar graph rather than a vertical bar graph (a.k.a. column chart). It is certainly true that it would be harder to fit fifty vertical bars side by side with their labels positioned underneath than it would be to fit fifty horizontal bars one above the next with their labels positioned to the left, but is this the only circumstance that suggests an advantage of horizontal over vertical bars? How about when the labels are long and cannot be easily placed under vertical bars, but can be placed to the left of horizontal bars quite easily? If we have few items, and therefore choose vertical bars, here’s the illustration of this chart that the diagram provides:

Abela's Vertical Bar Graph

Arranging bars to overlap is not an effective practice. Positioning them side-by-side without any overlap treats different series of bars equally and supports easy comparisons.

Let’s get through the remaining suggestions in the Comparison section and then put an end to this detailed review in favor of hitting the highlights only. If we need to show comparisons Over Time rather than Among Items, the diagram leads us to next choose between Many Periods and Few Periods, but this is a meaningless choice. Whether there are many or few periods actually has no bearing on the type of chart that we should choose in this case. For this reason, let’s skip this choice in the decision tree and go directly to the next four options: Cyclical Data, Non-Cyclical Data, Single or Few Categories, and Many Categories. Abela suggests that Cyclical Data should be displayed as a Circular Area Chart (actually illustrated by a radar chart, not an area chart), but this is rarely an effective choice. Cyclical data does not equate to a circular chart. For Non-Cyclical Data, he recommends a Line Chart, but line graphs usually work best for both cyclical and linear data. For Single or Few Categories, he recommends a Column Chart, but this is absurd. Several lines in a graph are much easier to interpret and compare than multiple sets of bars. Finally, for Many Categories, he suggests a Line Chart again, which is fine, but the number of categories does not determine the usefulness of a line graph. Line graphs excel in their ability to display patterns of change through time, which is something that bar graphs do poorly. Abela says nothing about featuring patterns of change rather than comparing values at particular points in time. The latter scenario is the only time when a bar graph should be used for a time series.

Now for the remaining highlights. In the Distribution section, a Column Histogram and a Line Histogram (i.e., a frequency polygon) are both appropriate, but the choice between them is not determined by Few Data Points vs. Many Data Points. The remaining suggestions (a Scatter Chart for Two Variables and a 3D Area Chart for Three Variables) don’t belong here because they feature correlations, not distributions. In fact, a 3D area graph is rarely useful and never for a general audience.

In the remaining Composition section, we are first asked to choose either Changing Over Time or Static. By continuing through the choices we learn that, by Composition, Abela is referring to part-to-whole displays, but none of the graphs that he goes on to suggest display parts of a whole effectively, despite the fact that people use them regularly for this purpose.

That’s it; we’re done. This diagram is not a “Thought-Starter,” as Abela suggests, but a thought confuser. This diagram was no doubt Abela’s well-intentioned attempt to help people select appropriate graphs for use in presentations. Despite his intentions, however, it fails because Abela lacks expertise in data visualization. He should have stuck to his area of expertise. He also should never have tried to squeeze chart selection guidance into a one-page diagram. Even if the diagram contained logical, well-organized, and valid suggestions, this guidance cannot be effectively conveyed in a single-page diagram without harmful reduction. Abela attributes great benefit to the single-page display as an aid to presentations. In his book, when describing the ideal length of a “conference room style presentation”—one that is designed to “engage, persuade, come to some conclusion, and drive action”—he writes:

The theoretical, ideal length of a conference room style presentation is one page—with lots of detail, well laid out. Why? Because if you can achieve the goals of your presentation in one page, why would you use two, or ten, or forty? If you are able to distill your message down to one page, your audience will get the sense that you have really captured the essence of the subject. They will also appreciate (and probably be stunned by) the brevity of your presentation.

There are, in fact, many reasons why you might choose to not squeeze everything onto a single page for your audience to view all at once. Sometimes information needs to be presented in a particular sequence, revealed one point at a time. Sometimes a single-page display “with lots of detail” will appear overwhelming, causing your audience’s attention to dissolve immediately upon showing it.

I could create a single-page diagram that provides valid graph selection guidance, but I won’t, even though I’ve been asked to do this several times. I’ve refrained because, by relying on the diagram for guidance without understanding why the recommended graph works best, novices would never learn the principles. They would forever follow a set of rules without understanding them, which isn’t enough. We don’t need an army of mindless robots. We need people who can think, who know how to apply the rules to new situations and when to break them.

My concern about Abela’s work runs deeper than his chart selection guidance. Despite much good advice, what Abela teaches in his book about presentations contains fundamental flaws. For example, the book’s first chapter instructs readers to identify the personality types of their audience, based primarily on the Myers-Briggs Type Indicator (MBTI) assessment, so they can create a presentation that’s tailored to their audience’s preferences. (By the way, according to Myers-Briggs, I’m an INTJ, in case that matters to you.) Even if a clear set of presentation standards could be tied to each Myers-Briggs personality type, which might not be possible, this is the most impractical suggestion I’ve ever encountered in a book about presentation skills, unless you are presenting to one person only and can get that person to take the MBTI assessment in advance. It’s hard to imagine that readers and students don’t raise this objection immediately upon encountering this dumbfounding advice.

People and organizations, including software vendors in the analytics space, crave shortcuts to achievement. There are efficient paths to skill development, but no shortcuts. Data sensemaking and presentation skills require learning and a great deal of practice. When we were children, our underdeveloped brains believed in magic. Just say abracadabra, rub the lamp, or wave the wand and reality will bend to your wishes. I believed that I could run faster and jump higher if my parents would only buy me a pair of tennis shoes called PF Flyers. (If you recognize the reference, you’ve been around for a while.) We who engage in data sensemaking and then work to present what we’ve found to others are no longer children. In the Bible, the following words are attributed to the Apostle Paul:

When I was a child, I spoke as a child, I understood as a child, I thought as a child: but when I became a man, I put away childish things. (1 Corinthians 13:11; King James 2000 version)

Simplistic solutions are not the product of an adult mind. Isn’t it time to grow up?

Take care,


Statistical Failures in Scientific Research

July 16th, 2015

In recent years we’ve become increasingly aware of abundant errors in scientific research, many of which stem from bad statistics. Many of the ways in which these studies fail statistically are encouraged by academic journals. The way in which p-values are relied on excessively but almost routinely misused is a common example. Even though researchers must understand a particular subset of statistical concepts and techniques to do their work, relatively few possess this understanding. In some fields of scientific study you can even earn a doctorate without taking a single course in statistics. I’m encouraged by the fact that this problem is getting more attention today and that some fields of science and some scientific journals are diligently working on solutions.

If you’re involved in scientific research, even if you believe that you possess all of the statistical chops that are necessary, you owe it to yourself, your field of study, and the people who stand to benefit from or be harmed by your work, to read Statistics Done Wrong: The Woefully Complete Guide, by Alex Reinhart. This book was written for you.

Reinhart is a statistics instructor and PhD student at Carnegie Mellon University who is especially tuned into the statistics that are required for scientific research and the ways in which they frequently go astray. He covers the gamut, including inadequate samples, misunderstandings about statistical significance, disregard for statistical power and measures of confidence, plus many other errors that routinely plague research studies and render them useless and sometimes even harmful. Reinhart knows this territory intimately and cares about it passionately. He exposes and explains the problems, describes the impact that they’re having on science and the world at large, and also shows us how to solve them.
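A back-of-the-envelope calculation shows why disregard for statistical power matters. The sketch below is not drawn from Reinhart’s book. It applies the standard base-rate arithmetic to an assumed scenario: of all hypotheses tested, some fraction (`prior`) are actually true, a study detects a true effect with probability `power`, and a false effect passes the significance threshold with probability `alpha`. The function name and every input value are illustrative assumptions, and the calculation ignores bias, multiple testing, and other real-world complications.

```python
def share_of_true_findings(prior, power, alpha=0.05):
    """Fraction of statistically significant results that reflect real effects.

    prior: assumed fraction of tested hypotheses that are actually true
    power: probability that a study detects an effect that is real
    alpha: probability that a study flags an effect that is not real
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)


# Illustrative assumption: 1 in 10 tested hypotheses is actually true.
for power in (0.8, 0.5, 0.2):
    share = share_of_true_findings(prior=0.1, power=power)
    print(f"power {power:.1f}: {share:.0%} of significant findings are true")
```

Under these assumptions, studies with adequate samples (power of 0.8) yield significant findings that are true about two times out of three, while underpowered studies (power of 0.2) yield significant findings that are false more often than not, even though every one of them passed the same significance test.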

This is not a primer in statistics, so don’t begin with this book. Reinhart assumes that you’ve taken a course or read a book or two, but that your exposure hasn’t fully equipped you with the statistical understanding and skills that are needed to do scientific research, especially experimental studies. You could skip this book and get by as a scientist who produces erroneous and worthless work but manages nonetheless to get published and advance through the ranks. Is that who you want to be? Is it enough to be recognized as a success or do you actually want your findings to be accurate and count for something? If you choose the latter, Statistics Done Wrong will help you develop some of the skills that are required.

Take care,


The Incentives Are All Wrong

July 7th, 2015

Perhaps you’ve heard that a large proportion of published scientific research findings are false. If anyone qualified took the time to validate them, half or more could be exposed as erroneous at the time of publication. Most of the errors are due to bad research methodology. Most of the bad research methodology is due to insufficient training, incentives that are in conflict with good science, or both. The primary area of training that’s lacking is statistics. Researchers are incentivized to be productive and innovative above all else. It’s all about getting published and then getting cited. It isn’t about getting it right.

A recent commentary in the Lancet, medicine’s foremost journal, titled “Offline: What is medicine’s 5 sigma?” (Volume 385, April 11, 2015) by Editor-in-Chief Richard Horton, describes his concern that “something has gone fundamentally wrong with one of our greatest human creations.”

The case against science is straightforward: much of the scientific literature, perhaps half, may simply be untrue. Afflicted by studies with small sample sizes, tiny effects, invalid exploratory analyses, and flagrant conflicts of interest, together with an obsession for pursuing fashionable trends of dubious importance, science has taken a turn towards darkness.

If poorly conducted research exists to this extent in mature fields of study such as medicine and physics, it isn’t surprising that it’s even more prevalent in the fledgling field of information visualization. Horton ended his commentary with some good news and some bad news:

The good news is that science is beginning to take some of its worst failings very seriously. The bad news is that nobody is ready to take the first step to clean up the system.

So far in the realm of information visualization research, only the bad news applies.

Take care,