Thanks for taking the time to read my thoughts about Visual Business Intelligence. This blog provides me (and others on occasion) with a venue for ideas and opinions that are either too urgent to wait for a full-blown article or too limited in length, scope, or development to require the larger venue. For a selection of articles, white papers, and books, please visit my library.

 

Statisticians, Remember Your Native Tongue

August 25th, 2016

A common problem among many professions is the inability of expert practitioners to communicate with their clients. Attorneys are often guilty of speaking legalese to the folks that they represent, unaware that it is unfamiliar to them. Medical doctors sometimes struggle in the same way, even though their effectiveness relies on their ability to communicate clearly with their patients. Statisticians struggle with this problem more than most. You can be the most advanced statistician in the world, but if you cannot clearly report your findings to decision makers, your work is wasted. Learning to express statistical findings in ways that non-statisticians can understand should be a fundamental requirement of statistical training. I suspect that this problem is often due not to inability but to a lack of awareness. It is indeed difficult to refrain from using statistical speak once you’ve become fluent in it, but I think that most statisticians lose awareness of the fact that others don’t understand it, so they rarely even try to overcome the problem. The solution to this problem begins with awareness. I’ll use an example from the work of a talented statistician, Howard Wainer, to illustrate this problem and its solution.

On the inside cover of Howard Wainer’s newest book, Truth or Truthiness, appear the words, “This wise book is a must-read for anyone who’s ever wanted to challenge the pronouncements of authority figures.” Including “truthiness” in the title (a word that was coined by the comedian Stephen Colbert) further suggests that Wainer’s intended audience is broad, certainly not limited to statisticians. Over the course of a long and productive career, Wainer has contributed a tremendous amount to the fields of statistics and data visualization. I’ve learned a great deal from his books. When reading them, however, I have at times cringed in response to sections that general readers would find confusing or even misleading due to a lack of statistical training. I find this frustrating, because I want the basic concepts of statistics to be more broadly understood. I celebrate those rare statisticians who manage to speak of their craft in accessible ways. Charles Wheelan, the author of Naked Statistics, and Nate Silver, the author of The Signal and the Noise, are two statisticians who haven’t lost touch with the world at large.

In Truth or Truthiness, Wainer critiques a graph that appeared in the New York Times and redesigns it in a way that, in his opinion, is more effective. Here’s the original graph:

This combination of a bubble plot and a bar graph tells the story of increases in China’s acquisitions outside of the country, both in the number of deals and in the costs of those deals in millions of dollars. Although Wainer believes that this could be displayed more effectively, as do I, he credits it with two positive characteristics.

The New York Times’s plot of China’s increasing acquisitiveness has two things going for it. It contains thirty-four data points, which by mass media standards is data rich, showing vividly the concomitant increases in the two data series over a seventeen-year period…And second, by using Playfair’s circle representation it allows the visibility of expenditures over a wide scale.

(Truth or Truthiness, Howard Wainer, Cambridge University Press, 2016, p. 105)

While it is true that the New York Times does a better job of basing their stories on sufficient data than most news publications, I wouldn’t cite their use of bubbles in the upper chart as a benefit. Bubbles, which encode values based on their areas, require less vertical space to show this wide range of values than bars, but this slight advantage is wiped out by the fact that people cannot judge the relative areas of circles easily or accurately, nor can they easily compare bars to bubbles to clearly see the relationship between these two variables as they change through time. Wainer points out that the use of bubbles was introduced by William Playfair, the great pioneer of graphical communication, but Playfair did not have the benefit of our knowledge of visual perception when he used this technique. Statisticians must learn what works perceptually as part of their training in data visualization. Part of understanding your audience is understanding a few things about how their brains work.

Let’s look now at the alternative display that Wainer proposes.

Wainer's Scatterplots of Chinese Acquisitions

Before critiquing this ourselves, let’s hear what Wainer has to say.

Might other alternatives perform better? Perhaps. In Figure 9.14 is a two-paneled display in which each panel carries one of the data series. Panel 9.14a [the upper panel] is a straightforward scatter plot showing the linear increases in the number of acquisitions that China has made over the past seventeen years. The slope of the fitted line tells us that over those seventeen years China has, on average, increased its acquisitions by 5.5/year. This crucial detail is missing from the sequence of bars but is obvious from the fitted regression line in the scatter plot. Panel 9.14b [the lower panel] shows the increase in money spent on acquisitions over the same seventeen years. The plot is on a log scale, and its overall trend is well described by a straight line. That line has a slope of 0.12 in the log scale and hence translates to an increase of about 32 percent per year. Thus, the trend established over these seventeen years shows that China has both increased the number of assets acquired each year and also has acquired increasingly expensive assets.

The key advantage of using paired scatter plots with linearizing transformations and fitted straight lines is that they provide a quantitative measure of how China’s acquisitiveness has changed. This distinguishes Figure 9.14 from the New York Times plot, which, although it contained all the quantitative information necessary to do these calculations, had primarily a qualitative message.

(ibid., p. 105)
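The step from a slope of 0.12 on the log scale to "about 32 percent per year" is a short calculation that the quoted passage leaves unstated. Here is a minimal sketch in Python, assuming the log scale is base 10 (the passage does not say so, but that base reproduces the quoted figure). The function name is hypothetical and chosen only for illustration.

```python
def annual_growth_from_log_slope(slope, base=10.0):
    """Convert the slope of a straight line fitted on a log scale
    into a constant year-over-year growth rate."""
    # A rise of `slope` log units per year means the value is
    # multiplied by base**slope each year.
    return base ** slope - 1.0


# Slope quoted from Wainer's lower panel; base 10 is an assumption.
rate = annual_growth_from_log_slope(0.12)
print(f"{rate:.1%}")  # prints 31.8%
```

Read this way, a straight line on the log scale describes growth by a constant percentage each year, not by a constant dollar amount.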

Wainer’s scatterplots and his explanation of them include several assumptions about his audience’s knowledge that miss the boat. Even if his readers all understand how to read scatterplots, a scatterplot is not a good choice for this information. Clearly, a central theme of this story is how China’s acquisitions changed through time, but this isn’t easy to see in a scatterplot. Merely by connecting the values in each graph with a line, the patterns of change through time and their comparisons would become clearly visible.

About the upper graph, Wainer says,

The slope of the fitted line tells us that over those seventeen years China has, on average, increased its acquisitions by 5.5/year. This crucial detail is missing from the sequence of bars but is obvious from the fitted regression line in the scatter plot.

This is a vivid example of the disconnection from the world at large that plagues many statisticians. Most people do not understand the meaning of the slope of a trend line in a scatterplot other than the fact that, in this case, it is trending upwards. Without the annotation that he included in the chart, the 5.5/year increase in deals per year on average would remain unknown. I also don’t think that pointing this 5.5/year increase out is an appropriate summary of the story, for it suggests greater consistency than we see in the data.

The lower scatterplot introduces a number of problems for typical readers. First of all, most people don’t know how to interpret log scales. In fact, many readers might not even notice that the scale is logarithmic. They certainly wouldn’t know what the slope of the trend line means, nor would they understand that this straight line of best fit with a log scale indicates an exponential rate of increase, which Wainer fails to mention. Most readers would be inclined to compare the trend lines and conclude that the patterns of change are nearly the same. Also, one of Wainer’s statements about the data isn’t entirely correct:

The trend established over these seventeen years shows that China has both increased the number of assets acquired each year and also has acquired increasingly expensive assets.

China did not increase the number of assets or the amount of money spent on those assets each year. There are many examples of years when these values decreased, which to me is an important part of the story.

In the final paragraph of his explanation, Wainer claims:

The key advantage of using paired scatter plots with linearizing transformations and fitted straight lines is that they provide a quantitative measure of how China’s acquisitiveness has changed.

This would only be an advantage if readers knew how to read these “paired scatter plots with linearizing transformations and fitted straight lines.” Unfortunately, most readers would not. In fact, phrases such as “linearizing transformations” might cause them to flee in horror.

The news story that the New York Times was attempting to tell could have covered all of the important facts in ways that were easily understood by a general audience. If the relationship between the number of acquisitions and the costs of those acquisitions was important to the story, a single scatterplot designed in the following way with a bit of text to explain it could have done the job.

A Scatter Plot of the Chinese Acquisition Data

I’ve intentionally used linear scales for both axes so that the trend line clearly exhibits the exponential nature of the correlation between the two variables. I wouldn’t rely on the graph alone to tell this part of the story, but would explain in words that when a line curves upwards in this fashion it exhibits an exponential rate of increase: the cost of the acquisitions does not increase in increments that are equal to the number of them, but instead increases by ever greater amounts as the number of acquisitions increases. In addition to the overall nature of the relationship, this graph also clearly exhibits the fact that the relationship varies somewhat, which is especially illustrated by the outlier that strays far from the trend line in the lower right corner, showing that in a particular year the number of acquisitions was not associated with an exponential increase in costs.

It is doubtful that the New York Times was particularly concerned with the nature of the relationship between the two variables, but mostly wanted to show how both variables increased through this period of time. To tell this story, I would suggest a couple of displays, starting with the paired line graphs below.

Two line graphs of the Chinese acquisition data

This would be easy for general readers to understand, and it supports the basic message well. What it doesn’t do especially well, however, is show the pattern of change in the value of acquisitions. Because the graph must be scaled to include the last two extremely high values, most of the values reside in the bottom 25% of the scale (i.e., from 0 to 40 billion dollars out of a total scale that extends to 160 billion dollars), resulting in a line that looks a great deal flatter than it would if the graph were scaled to exclude the last two values. If this pattern of change needed to be displayed more clearly, and if we were assured that our readers understood logarithmic scales, the patterns would be more comparable if both variables were scaled logarithmically, rather than displaying the number of acquisitions on a linear scale and the value of acquisitions on a log scale. In fact, expressing the value of acquisitions in billions of dollars, rather than in millions as Wainer did, would allow us to use exactly the same log scale in both graphs, making them fully comparable, as follows.

Line Graphs with Log Scales

Let’s assume, however, that it is best to avoid log scales altogether to prevent confusion, which would be the case with a general audience, even with readers of the New York Times.

One potential improvement would be to place both lines in a single graph without creating a confusing and potentially misleading dual-scaled graph. To do this, we must express both sets of values using the same unit of measure and scale. One simple and common way is to express both time series as the percentage difference of each value compared to the initial value (i.e., the value for the year 1990). Another common approach, perhaps even easier for people to understand, is to express each year’s value as its percentage of the total for the entire period, as follows:

Single Line Graph of the Chinese Acquisitions Data
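The two ways of putting both series on a common scale can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the numbers are invented for the example; they are not the acquisition data.

```python
def percent_change_from_first(values):
    """Express each value as its percentage difference from the first value."""
    first = values[0]
    return [100.0 * (v - first) / first for v in values]


def percent_of_total(values):
    """Express each value as its percentage of the total for the whole period."""
    total = sum(values)
    return [100.0 * v / total for v in values]


# Hypothetical numbers for illustration only, NOT the acquisition data.
deals = [2, 4, 6, 8]
cost = [10, 10, 30, 50]

print(percent_change_from_first(deals))  # [0.0, 100.0, 200.0, 300.0]
print(percent_of_total(deals))           # [10.0, 20.0, 30.0, 40.0]
print(percent_of_total(cost))            # [10.0, 10.0, 30.0, 50.0]
```

Either way, both series end up in the same unit (percent), so they can share one axis without a dual scale.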

Now that the two lines appear in the same graph, they are easier to compare. It is clear that the number of acquisitions and their dollar value trended upward during this period, but not always and not always together. In other words, the correlation between the number and dollar amounts of acquisitions is there, but it isn’t particularly strong. Even though we have the scaling problem caused by the extremely high dollar values in 2005 and 2006, patterns of change during 1990 through 2004 are relatively clear and easy to compare. If this were not the case, however, we could address the scaling problem by providing a second line graph that only includes data from 1990 through 2004, as follows:

Single Line Graph with Outliers Removed

Now, let’s return to the main point. Those who do the work of data analysis must know how to clearly present their findings to those who rely on that information to make decisions and take action. This is an essential skill. Highly skilled statisticians are incredibly valuable, but only if they can explain their findings in understandable terms. This requires communications skills, both in the use of words and in the use of graphics. Training in these skills is every bit as important as training in statistics.

Take care,


The Myth of Self-Service Analytics

August 17th, 2016

Exploring and analyzing data is not at all like pumping your own gas. We should all be grateful that when gas stations made the transition from full service to self service many years ago, they did not relegate auto repair to the realm of self service as well. Pumping your own gas involves a simple procedure that requires little skill.

Pumping Gas

Repairing a car, however, requires a great deal of skill and the right tools.

Car repair

The same is true of data exploration and analysis (i.e., data sensemaking).

Self service has become one of the most lucrative marketing campaigns of the last few years in the realms of business intelligence (BI) and analytics, second only to Big Data. Every vendor in the BI and analytics space makes this claim, with perhaps no exception. Self-service data sensemaking, however, is an example of false advertising that’s producing a great deal of harm. How many bad decisions are being made based on specious analytical findings by unskilled people in organizations that accept the self-service myth? More bad decisions than good, I fear.

Describing analytics as “self service” suggests that it doesn’t require skill. Rather, it suggests that the work can be done by merely knowing how to use the software tool that supports “self-service analytics.” Data sensemaking, however, is not something that tools can do for us. Computers are not sentient; they do not possess understanding. Tools can at best assist us by augmenting our thinking skills, if they’re well designed, but most of the so-called self-service BI and analytics tools are not well designed. At best, these dysfunctional tools provide a dangerous illusion of understanding, not the basis on which good decisions can be made.

Some software vendors frame their products as self service out of ignorance: they don’t understand data sensemaking and therefore don’t understand that self service doesn’t apply. To them, data sensemaking really is like pumping your own gas. The few software vendors that understand data sensemaking frame their products as self service because the deceit produces sales, resulting in revenues. They don’t like to think of it as deceitful, however, but merely as marketing, the realm in which anything goes.

How did it become acceptable for companies that support data sensemaking—the process of exploring and analyzing data to find and understand the truth—to promote their products with lies? Why would we ever put our trust in companies that disrespect the goal of data sensemaking—the truth—to this degree? Honest vendors would admit that their products, no matter how well designed, can only be used effectively by people who have developed analytical skills, and only to the degree that they’ve developed them. This shouldn’t be a difficult admission, but vendors lack the courage and integrity that’s required to make it.

Some vendors take the self-service lie to an extreme, arguing that their tools take the human out of the loop of data sensemaking entirely. You simply connect their tools to a data set and then sit back and watch in amazement as they explore and analyze the data at blinding speeds, resulting in a simple and complete report of useful findings. At least one vendor of this ilk, BeyondCore, is being hailed as a visionary by Gartner. This is the antithesis of vision. No skilled data analyst would fall for this ruse, but skilled analysts unfortunately are not the folks who are usually involved in software purchase decisions.

Let’s be thankful that we can save a little money and time by pumping our own gas, but let’s not extend this to the realm of untrained data sensemaking. Making sense of data requires skills. Anyone of reasonable intelligence who wishes can develop these skills, just as they develop all other skills, through study and deliberate practice. That’s how I did it, and these skills have been richly rewarding. The people and organizations who recognize self-service analytics for the absurd lie that it is and take time to develop analytical skills will emerge as tomorrow’s analytical leaders.

Take care,


The Myth of Expertise Transference

August 12th, 2016

During the long course of my professional life, I’ve observed a disturbing trend. People sometimes claim expertise in one field based on experience in another. This is a fallacious and deceitful claim. I have extensive experience in visual design, but I cannot claim expertise in architecture. Any building that I designed would most certainly crumble around me. I’m a skilled teacher, but this does not qualify me as a psychotherapist. That hasn’t stopped me from occasionally giving advice to friends, but without charge, which probably matches its worth. Although these fields of endeavor overlap in some ways, expertise in one does not convey expertise in another. No concert violinist would claim the transfer of that virtuosity to the saxophone, but IT professionals sometimes make claims that are every bit as audacious.

The field of business intelligence (BI) provides striking examples of this trend. When BI initially emerged, data warehousing was the pre-existing field of endeavor that supplied BI with most of its initial workers and technologies. Years earlier, relational database theory and management supplied most of the initial workers and technologies of data warehousing. Today, the field of endeavor that goes by such names as analytics, data science, data visualization, and performance management, is the domain of workers and technologies that were previously associated with BI and in many cases still are. I know several individuals who began their careers as experts in relational databases, who then moved into data warehousing, and then into BI, and finally into analytics and its kin without actually developing expertise in any but their initial field of endeavor. Instead, they made names for themselves in relational databases or data warehousing, and then transferred that moniker to each subsequent field of endeavor with little study or experience, and thus little skill. Many of the people who give keynotes today at BI/Analytics/Big Data conferences and who write white papers on related topics fall into this category. This is one of the reasons why domains related to analytics are so confusing, hype-filled, and poorly realized.

The skill sets that were needed to design and build relational databases or even data warehouses are significantly different from those that are needed for expert data sensemaking. I know this quite well, because early in my career I studied and taught relational database design, but when data warehousing emerged, I found that most of my skills were not transferable. I learned this the hard way: I initially tried to build data warehouses using my relational database skills and failed miserably. Over the course of years, I retooled. When BI stole the limelight from data warehousing, I became fascinated by its intentions and vision, defined initially by Howard Dresner as “concepts and methods to improve business decision making by using fact-based support systems.” This harkened back to my first full-time job in IT, when I worked in the “Decision Support” group of a large semiconductor company. Even though I began my career helping people use data to support better decisions, when I began focusing on BI, relatively little that I had learned about data warehousing was useful. I had to shift my technology-centric focus back to a perspective that was in line with my university studies in the social sciences. I needed to understand the human brain, the process of decision making, and the ways that technologies could assist in this essentially human activity. This took years and led me to entirely new areas of study, including human-computer interface design. Later, when I narrowed my focus to data visualization, once again I had to humbly accept the position of a novice. My previous studies and diverse areas of experience contributed a great deal to the eventual richness of my expertise in data visualization, but they did not bestow upon me the mantle of expertise. That, I had to earn through diligent study and years of deliberate practice. It is by these same diligent means that I continue to deepen and broaden my data visualization expertise today.

Many of those who think themselves data visualization experts today base this belief primarily on experience in graphic design. While it is true that expertise in graphic design can contribute to the development of expertise in data visualization, there is a great deal more to learn and practice if you wish to understand and effectively practice data visualization. As an expert in data visualization, I have as much of a right to claim expertise in graphic design as an expert graphic designer can rightfully claim expertise in data visualization, which is very little.

I’m tempted to say that “Expertise isn’t what it used to be.” It certainly seems that people make claims of expertise today with little actual knowledge or experience, but I suppose this might have always been so. I doubt it, however, for I believe that the ready availability of information on the web has inclined people to think that expertise is equally accessible. It isn’t. Whereas information can be looked up easily and quickly, expertise requires effort and time. It’s worthy of both.

Take care,


Exploratory Data Analysis Tool Features: What’s Needed and What Should Be Left Out

August 9th, 2016

I’ve spent a great deal of time over the years, and especially during the last few months, thinking deeply about the role of technologies in exploratory data analysis. When we create technologies of all types, we should always think carefully about their effects. Typically, new technologies are created to solve particular problems or to satisfy particular needs, so we attempt to consider how well they will succeed in doing this. But this isn’t enough. We must also consider potential downsides—ways in which those technologies might cause harm. This is especially true of information technologies, and data sensemaking technologies in particular, but this is seldom done by the companies that make them. The prevailing attitude in our current technopoly is that information technologies are always inherently good—what possible harm could there be? Some attention is finally being given to the ways in which information can be misused, but this isn’t the only problem.

Whenever we hand tasks over to computers that we have always done ourselves, we run the risk of losing critical skills and settling for results that are inferior. Tasks that involve thinking strike at the core of humanity’s strength. We sit on the top of the evolutionary heap because of the unique abilities of our brains. Surrendering thinking tasks to technologies ought to be approached with great caution.

I’d like to share a few guidelines that I believe software companies should follow when adding features to exploratory data analysis tools. Please review the following list and then share with me your thoughts about these guidelines.

  • Leave out any task that humans can do better than computers.
  • Leave out any task that’s associated with an important skill that would be lost if we allowed computers to do it for us.
  • Leave out any feature that is ineffective.
  • Add features to perform tasks that computers can do better than humans.
  • Add features to perform tasks that humans do not benefit from performing in some important way.
  • Add features that are recognized as useful by skilled data analysts, but only after considering the full range of implications.
  • Never add a feature simply because it can be added or because it would be convenient to add.
  • Never add a feature merely because existing or potential customers ask for it.
  • Never add a feature simply because an executive wants it.
  • Never design a feature in a particular way because it is easier than designing it in a way that works better.
  • Never design a feature that requires human-computer interaction without a clear understanding of the human brain—its strengths and limitations.
  • Never design a feature that requires human-computer interaction in a way that forces people to think and act like computers.

Take care,


Deep and Wide

July 25th, 2016

Like many of you, I grew up attending Sunday school. One of my memories of that experience involves a song that was a favorite among us kids: “Deep and Wide.” It consists of only a few words, sung over and over:

Deep and wide, deep and wide,
There’s a fountain flowing deep and wide.

What we loved about the song was not the words, which we didn’t understand (I still don’t), but the hand motions that went with them. For “deep,” we would hold our arms out in front of us with one hand extended high and the other low. For “wide,” we would extend our arms to the sides as far as we could reach, hoping to smack the kids to our left and right. That joke never got old.

These words came flooding back into my memory today as I was thinking about the need in data sensemaking to dig deep into data but also to explore data broadly. Deep and wide, focus and context, detail and summary, trees and forest are all expressions that capture these two fundamental perspectives from which we should view our data if we wish to understand it. Errors are routinely made when we dig into a specific issue and form judgments without understanding it in context. Exploring data from every possible angle provides the context that’s necessary to understand the details. It keeps us from getting lost among the trees, wandering from one false conclusion to another, fools rushing in and rushing out, never really knowing where we’ve been.

Take care,
