Thanks for taking the time to read my thoughts about Visual Business
Intelligence. This blog provides me (and others on occasion) with a venue for ideas and opinions
that are either too urgent to wait for a full-blown article or too
limited in length, scope, or development to require the larger venue.
For a selection of articles, white papers, and books, please visit
October 17th, 2016
The unique ability of the human brain to create technologies has taken us far. The benefits of technology, however, are not guaranteed, yet we celebrate and pursue them with abandon. When we imagine new technological abilities, we tend to ask one question only: “Can we?” “Can we create such a thing?” However, we’re good at creating what we can but shouldn’t. “Should we?”, though rarely asked, is the more important question by far.
I recently read a book by Samuel Arbesman, entitled Overcomplicated. I found it intriguing, yet also utterly frightening. Arbesman is Scientist in Residence at Lux Capital, a science and technology venture capital firm. He is a fine spokesperson for his employer’s interests, for he gives the technologies that make venture capitalists rich free license to do what they will by calling it inevitable.
Many modern technologies are now complicated in ways and to degrees that place them beyond our understanding. Arbesman accepts these over-complications as a given. In light of this, he proposes ways to study them that might yield a bit more understanding, even though, in his opinion, they will forever remain beyond our full grasp. He argues that modern technologies are like biological systems—the result of evolution rather than design—sometimes a mishmash of kluges embedded in millions of lines of programming code and sometimes the results of computers generating their own code with little or no human involvement. At no point in the book does Arbesman ask the question that was constantly screaming in my head as I read it: “Should we?” Should we create technologies that exceed our understanding and can therefore never be fully controlled? The only rational and moral answer to this question is “No, we shouldn’t.”
Arbesman assumes that we often cannot design and develop modern technologies in ways that remain within the reach of human understanding. Even though he acknowledges several examples of technologies that have created havoc because they were not understood, such as financial trading systems and power grids, he accepts these over-complications as inevitable.
As a technology professional of many years, I see things differently. These technological monsters that we create today as the products of kluges are over-complicated not because they cannot be kept within the realm of understanding and our control but because of poor, sloppy, undisciplined, and shortsighted design. Arbesman and others who pull the strings of modern information technologies want us to believe that these technologies are inherently and necessarily beyond human understanding, but this is a lie. Those who create these technologies are simply not willing to do the work that’s required to build them well.
We have a choice. We could demand better design. We could and should set the limits of human understanding as the unyielding boundary of our technologies. We can choose to only build what we can understand. This is harder than quickly and carelessly throwing together kluges or trusting algorithms to manage themselves, but it is a path that we must take to avoid destruction.
Arbesman advocates humility in the face of technologies that we cannot understand, but this is an odd humility, for it’s wrapped in hubris—a belief that we have the right to unleash on the world that which we can neither understand nor control. We may have this ability, but we do not have this right, for it is an almost certain path to destruction. Along with most of the technologists that he admiringly quotes in the book, Arbesman seems to embrace all information technologies that can be created as both inevitable and good—a reverence for Technology with a capital “T” that is both irrational and dangerous.
I’m certainly not the only technology professional who is concerned about this. Many share my perspective and express it, but our concerns are not backed by the deep pockets of technology companies, which currently set the agendas and shape the values of cultures throughout the developed world. The fear that our technologies could do great harm if left uncontrolled has been around for ages. This is a reasonable fear. In his film Jurassic Park, Steven Spielberg poignantly expressed this fear regarding biological technologies. There’s a great scene in the movie when a scientist played by the actor Jeff Goldblum asks the questions that we should always ask about potential technologies before we create and unleash them on the world. The scene accurately frames the problem as one that results from the selfishness of those who care only about their own immediate gains, never raising their eyes to look further into the future and never doubting the essential goodness of their creations, despite the monsters we are capable of creating.
Although this concern about unbridled technological development is occasionally expressed, it has had little effect on modern culture so far. Each of us who cares about the future of humanity and understands that the arc of technological development can be brought into line with the interests of humanity without sacrificing anything of real value should do what we can to voice our concerns. In your own organization, when an opportunity to create, modify, or uniquely apply a technology arises, you can ask, “Should we?” This might not be the path to popularity—those who choose to do good are often unappreciated for a time—but it is the only path that doesn’t lead to destruction. Be courageous, because you should.
September 12th, 2016
I recently wrote about The Myth of Self-service Analytics in this blog. Some of you seemed to think that I was exaggerating the claims that vendors make about self-service analytics, in particular that their tools eliminate the need for analytical skills. To back my argument, I’ve collected a few examples of these claims from several vendors.
Self-service BI and analytics isn’t just about giving tools to analysts; it’s about empowering every user with actionable and relevant information for confident decision-making. (link).
Self-service Analytics for Everyone…Who’s Everyone? Your entire universe of employees, customers, and partners. Our WebFOCUS Business Intelligence (BI) and Analytics platform empowers people inside and outside your organization to attain insights and make better decision. (link)
Drive insight discovery with the data visualization app that anyone can use. With Qlik Sense, everyone in your organization can easily create flexible, interactive visualizations and make meaningful decisions.
Explore data with smart visualizations that automatically adapts to the parameters you set — no need for developers, data scientists or designers. (link)
Analytics anyone can use. (link)
The Spotfire Platform delivers self-service analytics to everyone in your company. (link)
Self-service analytics gives end users the ability to analyze and visualize their own data whenever they need to. (link)
This tool is intended for those who need to do analysis but are not Analysts nor wish to become them. (link)
Welcome to a new era of data visualization software. An era of self-service BI where instant access to insights wins the day time and time again. With Wave Analytics, now anyone can organize and present information in a much more intuitive way. Without a team of analysts. (link)
With self-service analytics, you can instantly slice and dice data on any device, without waiting for IT or analysts. (link)
Zoomdata brings the power of self-service BI to the 99%—the non-data geeks of the world who thirst for a simple, intuitive, and collaborative way to visually interact with data to solve business problems. (link)
TARGIT Decision Suite gives you self-service analytics solutions intuitive enough for the casual user… (link)
September 2nd, 2016
When evolution was purely biological, there were no reins to direct it, for evolution followed the course of nature. With homo sapiens, however, another form of evolution emerged that is exponentially faster—cultural evolution—which we can direct to some degree through deliberate choices. We haven’t taken the reins yet, however, and seldom even recognize that the reins exist, but we must if we wish to survive.
In the early days of our species, when our brains initially evolved the means to think abstractly, resulting in language and the invention of tools, we were not aware of our opportunity to direct our evolution. We are no longer naïve, or certainly shouldn’t be. We recognize and celebrate the power of our technologies, but seldom take responsibility for the effects of that power. Cultural evolution has placed within our reach not only the means of progress but also the means of regress. The potential consequences of our technologies have grown. Though we can choose to ignore these consequences and often work hard to do so, they’re now right up in our faces, screaming for attention.
Some of our technologies, beginning with the industrial revolution and continuing until now, contained seeds of destruction. Technologies that rely on fossil fuels, which contribute to global warming, are a prominent example. We can work to undo their harm either by (1) abstaining from their use, (2) developing new technologies to counter their effects, or (3) developing alternative technologies to replace them. When we create technologies, we should first consider their effects and proceed responsibly. We’re not doing this with information technologies. Instead, we embrace them without question, naively assuming that they are good, or at worst benign. Most information technologies provide potential benefits, but also potential harms.
Technologies that support data sensemaking and communication should be designed and used with care. We should automate only those activities that a computer can perform well and humans cannot. Effective data sensemaking relies on reflective, rational thought, resulting in understanding moderated by ethics. Computers can manipulate data in various ways but they cannot understand data and they have no concept of ethics. Computers should only assist in the data sensemaking process by augmenting our abilities without diminishing them.
You might think that I’m fighting to defend and preserve the dignity and value of humanity against the threat of potentially superior technologies. I care deeply about human dignity and the value of human lives, but these aren’t my primary motives. If we could produce a better world for our own and other species by granting information technologies free rein, I would heartily embrace the effort, but we can’t. By shifting more data sensemaking work to information technologies, as we are currently doing, we are inviting inferior results and a decline in human abilities.
Despite our many flaws, as living, sentient creatures we humans are able to make sense of the world and attempt to act in its best interests in ways that our information technologies cannot. We don’t always do this, but we can. Computers can be programmed to identify some of the analytical findings that we once believed only humans could discover, but they cannot perform these tasks as we do, with awareness, understanding, and care. Their algorithms lack the richness of our perceptions, thoughts, values, and feelings. We dare not entrust independent decisions about our lives and the world to computer algorithms.
We must understand our strengths and limitations and resist the temptation to create and rely on technologies to do what we can do better. We should not sit idly by as those who benefit from promoting information technologies without forethought do so simply because it is in their interests as the creators and owners of those technologies. No matter how well-intentioned technology companies and their leaders believe themselves to be, their judgments are deeply biased.
Technologies—especially information technologies—change who we are and how we relate to one another and the world. We are capable of thinking deeply about data when we develop the requisite skills, but we lose this capability when we allow computers to remove us from the loop of data sensemaking. The less we engage in deep thinking, the less we’re able to do it. So, we’re facing more than the problem that computers cannot fully reproduce the results of our brains; we’re also facing the forfeiture of these abilities if we cease to use them. By sacrificing these abilities, we would lose much that makes us human. We would devolve.
At any point in history, one question is always fundamental: “What are we going to do now?” We can’t change the past, but we must take the reins of the future. Among a host of useful actions, we must resist anyone who claims that their data sensemaking tools will do our thinking for us. They have their own interests in mind, not ours. Resistance isn’t futile; at least not yet.
August 31st, 2016
As a data sensemaker and communicator, and also as a teacher, I strive for clarity. However, the field in which I work—data visualization—is a cauldron of terminological confusion. Is it a chart, graph, or plot? Is it a data visualization, information visualization, or infographic? Is it a report, dashboard, or an analytical display? None of these terms have clear definitions. Even the word “data” itself breaks the world into camps: Is data plural or singular? This terminological (and terminal) confusion distracts us from the things that matter.
What inspired me to write this was a recent observation while reading Andy Kirk’s new book Data Visualization: A Handbook for Data Driven Design. I got excited when I noticed that the book’s opening chapter bears the title “Defining Data Visualization.” My enthusiasm waned, however, when I read the following definition of data visualization:
The representation and presentation of data to facilitate understanding.
I was pleased that Andy declared data visualization’s goal as understanding, but I was disappointed, above all else, by the fact that it said nothing about the visual nature of data visualization. There are many ways to present data to facilitate understanding other than those that are visual (a.k.a., graphical). According to Andy’s definition, if I express data in the form of a paragraph consisting of words and numbers to facilitate understanding, I’ve created a data visualization. But that’s not right. Its visual nature is integral.
I was also bothered by the juxtaposition of “representation” and “presentation.” I realize that by “representation” Andy means displaying data in a way that is different from its raw form and that by “presentation” he means passing it on to others, but the combination of these two words that vary only in the existence of “re” (i.e., again) felt awkward and somehow out of sequence, as if we do something again (represent) before doing it in the first place (present). Why not replace both with a single word, such as “display”? Doing this and adding something about the visual nature of data visualization could result in the following:
The visual display of data to facilitate understanding.
Whether we display data to make sense of it or to communicate it to others, no distinction (“representation and presentation”) is necessary in the definition.
So far, I haven’t mentioned the fact that the term “data” itself might need some clarification as well. What is (or are) data? Does data visualization deal with data of all types or only particular types? I bring this up because what we traditionally mean by data visualization always involves quantitative data. A list of people’s names alone is not a data visualization. Associate something quantitative with those names, such as people’s ages, and display them visually, such as by using bars of varying lengths, and we have a data visualization. So, perhaps we should add one more word to our definition:
The visual display of quantitative data to facilitate understanding.
Now we have a definition that’s simple, accurate, and clear. This certainly isn’t the only possible definition, but it’s one that works.
(P.S. I sent an email to Andy recently to question the omission of “visual” from his definition. He indicated that, if the book were still in the editing phase, he would probably add it. It’s amazing how easy it is to miss the obvious, which is why good editors are priceless.)
August 25th, 2016
A common problem among many professions is the inability of expert practitioners to communicate with their clients. Attorneys are often guilty of speaking legalese to the folks that they represent, unaware that it is unfamiliar to them. Medical doctors sometimes struggle in the same way, even though their effectiveness relies on their ability to communicate clearly with their patients. Statisticians struggle with this problem more than most. You can be the most advanced statistician in the world, but if you cannot clearly report your findings to decision makers, your work is wasted. Learning to express statistical findings in ways that non-statisticians can understand should be a fundamental requirement of statistical training. I suspect that this problem is often due, not to inability, but instead, to a lack of awareness. It is indeed difficult to refrain from using statistical speak once you’ve become fluent in it, but I think that most statisticians lose awareness of the fact that others don’t understand it, so they rarely even try to overcome the problem. The solution to this problem begins with awareness. I’ll use an example from the work of a talented statistician, Howard Wainer, to illustrate this problem and its solution.
On the inside cover of Howard Wainer’s newest book, Truth or Truthiness, appear the words, “This wise book is a must-read for anyone who’s ever wanted to challenge the pronouncements of authority figures.” Including “truthiness” in the title (a word coined by the comedian Stephen Colbert) further suggests that Wainer’s intended audience is broad; certainly not limited to statisticians. Over the course of a long and productive career, Wainer has contributed a tremendous amount to the fields of statistics and data visualization. I’ve learned a great deal from his books. When reading them, however, I have at times cringed in response to sections that general readers, lacking statistical training, would find confusing or even misleading. I find this frustrating, because I want the basic concepts of statistics to be more broadly understood. I celebrate those rare statisticians who manage to speak of their craft in accessible ways. Charles Wheelan, the author of Naked Statistics, and Nate Silver, the author of The Signal and the Noise, are two statisticians who haven’t lost touch with the world at large.
In Truth or Truthiness, Wainer critiques a graph that appeared in the New York Times and redesigns it in a way that, in his opinion, is more effective. Here’s the original graph:
This combination of a bubble plot and a bar graph tells the story of increases in China’s acquisitions outside of the country, both in the number of deals and in the costs of those deals in millions of dollars. Although Wainer believes that this could be displayed more effectively, as do I, he credits it with two positive characteristics.
The New York Times’s plot of China’s increasing acquisitiveness has two things going for it. It contains thirty-four data points, which by mass media standards is data rich, showing vividly the concomitant increases in the two data series over a seventeen-year period…And second, by using Playfair’s circle representation it allows the visibility of expenditures over a wide scale.
(Truth or Truthiness, Howard Wainer, Cambridge University Press, 2016, p. 105)
While it is true that the New York Times does a better job of basing their stories on sufficient data than most news publications, I wouldn’t cite their use of bubbles in the upper chart as a benefit. Bubbles, which encode values based on their areas, require less vertical space to show this wide range of values than bars, but this slight advantage is wiped out by the fact that people cannot judge the relative areas of circles easily or accurately, nor can they easily compare bars to bubbles to clearly see the relationship between these two variables as they change through time. Wainer points out that the use of bubbles was introduced by William Playfair, the great pioneer of graphical communication, but Playfair did not have the benefit of our knowledge of visual perception when he used this technique. Statisticians must learn what works perceptually as part of their training in data visualization. Part of understanding your audience is understanding a few things about how their brains work.
Let’s look now at the alternative display that Wainer proposes.
Before critiquing this ourselves, let’s hear what Wainer has to say.
Might other alternatives perform better? Perhaps. In Figure 9.14 is a two-paneled display in which each panel carries one of the data series. Panel 9.14a [the upper panel] is a straightforward scatter plot showing the linear increases in the number of acquisitions that China has made over the past seventeen years. The slope of the fitted line tells us that over those seventeen years China has, on average, increased its acquisitions by 5.5/year. This crucial detail is missing from the sequence of bars but is obvious from the fitted regression line in the scatter plot. Panel 9.14b [the lower panel] shows the increase in money spent on acquisitions over the same seventeen years. The plot is on a log scale, and its overall trend is well described by a straight line. That line has a slope of 0.12 in the log scale and hence translates to an increase of about 32 percent per year. Thus, the trend established over these seventeen years shows that China has both increased the number of assets acquired each year and also has acquired increasingly expensive assets.
The key advantage of using paired scatter plots with linearizing transformations and fitted straight lines is that they provide a quantitative measure of how China’s acquisitiveness has changed. This distinguishes Figure 9.14 from the New York Times plot, which, although it contained all the quantitative information necessary to do these calculations, had primarily a qualitative message.
(ibid., p. 105)
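As an aside, the step from a slope of 0.12 to roughly 32 percent per year in the passage quoted above can be checked with a few lines of Python. This is only a minimal sketch. It assumes that the log scale is base 10, which the book excerpt does not state but which is consistent with the quoted figures, and the function names are my own illustrative choices rather than anything taken from the book.

```python
import math


def annual_growth_from_log10_slope(slope: float) -> float:
    """Convert the slope of a straight line fitted on a base-10 log scale
    into a fractional rate of increase per year.

    Assumption: the vertical axis is log base 10 and the horizontal axis
    is measured in years.
    """
    return 10 ** slope - 1


def doubling_time_in_years(slope: float) -> float:
    """Years needed for the fitted value to double at this log10 slope."""
    return math.log10(2) / slope


growth = annual_growth_from_log10_slope(0.12)
doubling = doubling_time_in_years(0.12)

print(f"Increase per year: {growth:.1%}")
print(f"Doubling time: {doubling:.1f} years")
```

A straight line on a log scale therefore describes a constant percentage increase per year, which is exponential growth, not the constant absolute increase that a straight line describes on a linear scale.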
Wainer’s scatterplots and his explanation of them include several assumptions about his audience’s knowledge that miss the boat. Even if his readers all understand how to read scatterplots, a scatterplot is not a good choice for this information. Clearly, a central theme of this story is how China’s acquisitions changed through time, but this isn’t easy to see in a scatterplot. Merely by connecting the values in each graph with a line, the patterns of change through time and their comparisons would become clearly visible.
About the upper graph, Wainer says,
The slope of the fitted line tells us that over those seventeen years China has, on average, increased its acquisitions by 5.5/year. This crucial detail is missing from the sequence of bars but is obvious from the fitted regression line in the scatter plot.
This is a vivid example of the disconnection from the world at large that plagues many statisticians. Most people do not understand the meaning of the slope of a trend line in a scatterplot other than the fact that, in this case, it is trending upwards. Without the annotation that he included in the chart, the average increase of 5.5 deals per year would remain unknown. I also don’t think that pointing out this 5.5/year increase is an appropriate summary of the story, for it suggests greater consistency than we see in the data.
The lower scatterplot introduces a number of problems for typical readers. First of all, most people don’t know how to interpret log scales. In fact, many readers might not even notice that the scale is logarithmic. They certainly wouldn’t know what the slope of the trend line means, nor would they understand that this straight line of best fit with a log scale indicates an exponential rate of increase, which Wainer fails to mention. Most readers would be inclined to compare the trend lines and conclude that the patterns of change are nearly the same. Also, one of Wainer’s statements about the data isn’t entirely correct:
The trend established over these seventeen years shows that China has both increased the number of assets acquired each year and also has acquired increasingly expensive assets.
China did not increase the number of assets or the amount of money spent on those assets each year. There are many examples of years when these values decreased, which to me is an important part of the story.
In the final paragraph of his explanation, Wainer claims:
The key advantage of using paired scatter plots with linearizing transformations and fitted straight lines is that they provide a quantitative measure of how China’s acquisitiveness has changed.
This would only be an advantage if readers knew how to read these “paired scatter plots with linearizing transformations and fitted straight lines.” Unfortunately, most readers would not. In fact, phrases such as “linearizing transformations” might cause them to flee in horror.
The news story that the New York Times was attempting to tell could have covered all of the important facts in ways that were easily understood by a general audience. If the relationship between the number of acquisitions and the costs of those acquisitions was important to the story, a single scatterplot designed in the following way with a bit of text to explain it could have done the job.
I’ve intentionally used linear scales for both axes so that the trend line clearly exhibits the exponential nature of the correlation between the two variables. I wouldn’t rely on the graph alone to tell this part of the story, but would explain in words that when a line curves upwards in this fashion it exhibits an exponential rate of increase: the cost of the acquisitions does not increase in increments that are equal to the number of them, but instead increases by ever greater amounts as the number of acquisitions increases. In addition to the overall nature of the relationship, this graph also clearly exhibits the fact that the relationship varies somewhat, which is especially illustrated by the outlier that strays far from the trend line in the lower right corner, showing that in a particular year the number of acquisitions was not associated with an exponential increase in costs.
It is doubtful that the New York Times was particularly concerned with the nature of the relationship between the two variables, but mostly wanted to show how both variables increased through this period of time. To tell this story, I would suggest a couple of displays, starting with the paired line graphs below.
This would be easy for general readers to understand and it supports the basic message well. What it doesn’t do especially well, however, is clearly show the pattern of change in the value of acquisitions. Because the graph must be scaled to include the last two extremely high values, most of the values reside in the bottom 25% of the scale (i.e., from 0 to 4 billion dollars out of a total scale that extends to 16 billion dollars), resulting in a line that looks a great deal flatter than it would if the graph were scaled to exclude the last two values. If this pattern of change needed to be displayed more clearly, and if we were assured that our readers understood logarithmic scales, we could scale both graphs logarithmically, rather than displaying the number of acquisitions on a linear scale and the value of acquisitions on a log scale, which would make the patterns more comparable, as follows.
Let’s assume, however, that it is best to avoid log scales altogether to prevent confusion, which would be the case with a general audience, even with readers of the New York Times.
One potential improvement would be to place both lines in a single graph, but to do this without creating a confusing and potentially misleading dual-scaled graph. To do this, we must express both sets of values using the same unit of measure and scale. One simple and common way to do this is to express both time series as the percentage difference of each value compared to the initial value (i.e., the value for the year 1990). Another common expression of the values that is perhaps even easier for people to understand involves expressing each year’s value as its percentage of the total for the entire period, as follows:
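For anyone who wants to prepare data in this way, the two rescalings just described can be sketched in a few lines of Python. This is a rough illustration only. The numbers below are hypothetical values that I made up to keep the arithmetic easy to follow; they are not the figures from the New York Times graphic, and the function names are my own.

```python
def percent_change_from_first(values):
    """Express each value as its percentage difference from the first value."""
    first = values[0]
    return [100 * (v - first) / first for v in values]


def percent_of_total(values):
    """Express each value as its percentage of the total for the whole period."""
    total = sum(values)
    return [100 * v / total for v in values]


# Hypothetical five-year series (NOT the actual acquisition data).
deals = [2, 4, 3, 8, 13]
dollars_millions = [50, 40, 200, 900, 2810]

# Both series now share one unit (percent), so they can share one scale.
deals_share = percent_of_total(deals)
dollars_share = percent_of_total(dollars_millions)

for year_index, (d, m) in enumerate(zip(deals_share, dollars_share)):
    print(f"year {year_index}: deals {d:.1f}%  dollars {m:.1f}%")

# The alternative rescaling, relative to the first year.
print(percent_change_from_first(deals))
```

Either rescaling puts both series into the same unit of measure, which is what allows them to share a single quantitative scale without resorting to a dual-scaled graph.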
Now that the two lines appear in the same graph, they are easier to compare. It is clear that the number of acquisitions and their dollar value trended upward during this period, but not always and not always together. In other words, the correlation between the number and dollar amounts of acquisitions is there, but it isn’t particularly strong. Even though we have the scaling problem caused by the extremely high dollar values in 2005 and 2006, patterns of change during 1990 through 2004 are relatively clear and easy to compare. If this were not the case, however, we could address the scaling problem by providing a second line graph that only includes data from 1990 through 2004, as follows:
Now, let’s return to the main point. Those who do the work of data analysis must know how to clearly present their findings to those who rely on that information to make decisions and take action. This is an essential skill. Highly skilled statisticians are incredibly valuable, but only if they can explain their findings in understandable terms. This requires communication skills, both in the use of words and in the use of graphics. Training in these skills is every bit as important as training in statistics.