Thanks for taking the time to read my thoughts about Visual Business
Intelligence. This blog provides me (and others on occasion) with a venue for ideas and opinions
that are either too urgent to wait for a full-blown article or too
limited in length, scope, or development to require the larger venue.
For a selection of articles, white papers, and books, please visit
April 4th, 2016
We review published research studies for several reasons. One is to become familiar with the authors’ findings. Another is to provide useful feedback to the authors. I review infovis research papers for several other reasons as well. My primary reason is to learn, and this goal is always satisfied—I always learn something—but the insights are often unintended by the authors. By reviewing research papers, I sharpen my ability to think critically. I’d like to illustrate the richness of this experience by sharing the observations that I made when I recently reviewed a study by Drew Skau, Lane Harrison, and Robert Kosara titled “An Evaluation of the Impact of Visual Embellishments in Bar Charts,” published in the Eurographics Conference on Visualization (EuroVis). My primary purpose here is not to reveal flaws in this study, but to show how a close review can lead to new ways of thinking and to thinking about new things.
This research study sought to compare the effectiveness of bar graphs that have been visually embellished in various ways to those of normal design to see if the embellishments led to perceptual difficulties, resulting in errors. The following figure from the paper illustrates a graph of normal design (baseline) and six types of embellishments (rounded tops, triangular bars, capped bars, overlapping triangular bars, quadratically increasing bars, and bars that extend below the baseline).
The study consisted of two experiments. The first involved “absolute judgments” (i.e., decoding the value of a single bar) and the second involved “relative judgments” (i.e., determining the percentage of one bar’s height relative to another). Here’s an example question that test subjects were asked in the “absolute judgments” experiment: “In the chart below, what is the value of C?”
As you can see, the Y axis and scale only include two values: 0 at the baseline and 100 at the top. More about this later. Here’s an example question in the “relative judgments” experiment: “In the chart below, what percentage is B of A?”
As you can see, when relative judgments were tested, the charts did not include a Y axis with a quantitative scale.
Let’s consider one of the first concerns that I encountered when reviewing this study. Is the perceptual task that subjects performed in the “absolute judgment” experiment actually different from the one they performed in the “relative judgment” experiment? By absolute judgment, the authors meant that subjects would use the quantitative scale along the Y axis to decode the specified bar’s value. Ordinarily, we read a value in a bar graph by associating the bar’s height with the nearest value along the quantitative scale and then adjusting the estimate slightly up or down depending on whether the top of the bar is above or below that value. In this experiment, however, only the value of 100 on the scale is useful for interpreting a bar’s value. Because the top of the Y axis marked a value of 100, the axis’ height represented a value of 100% to which the bar could be compared. In other words, the task involved a relative comparison of a bar’s height to the Y axis’ height of 100%, which is perceptually the same as comparing the height of one bar to another. Although perceptually equal, tasks in the “absolute judgment” experiment were slightly easier cognitively because the height of the Y axis was labeled 100, as in 100%, which provided some assistance that was missing when subjects were asked to compare the relative heights of two bars, neither of which had values associated with them.
Why did the authors design two experiments of perception that they described as different when both involved the same perceptual task? They didn’t notice that they were in fact the same. I suspect that this happened because they designed their graphs in a manner that emulated the design that was used by Cleveland and McGill in their landmark study titled “Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods.” In that original study, the graphs all had a Y axis with a scale that included only the values 0 and 100, but test subjects were only asked to make relative judgments, similar to those that were performed in the “relative judgment” experiment in the new study. The authors of the new study went wrong when they added an experiment to test “absolute judgments” without giving the graphs a normal quantitative scale that consisted of several values between 0 and 100.
Despite the equivalence of the perceptual tasks that subjects performed in both experiments, the authors went on to report significant differences between the results of these experiments. Knowing that the perceptual tasks were essentially the same, this led me to speculate about the causes of these differences. This speculation led me to a realization that I’d never previously considered. It occurred to me that in the “relative judgment” experiment, subjects might have been asked at times to determine “What percentage is A of B?” when A was larger than B. Think about it. Relative comparisons between two values (i.e., what is the percentage of bar A compared to bar B) are more difficult when A is larger than B. For example, it is fairly easy to assess a relative proportion when bar A is four-fifths the height of bar B (i.e., 80%), but more difficult to assess a relative proportion when bar A is five-fourths the height of bar B (i.e., 125%). The former operation can be performed as a single perceptual task, but the latter requires a multi-step process. Comparing A to B when A is 25% greater in value than B requires one to perceptually isolate the portion of bar A that extends above the height of bar B, compare that portion only to the height of bar B, and then add the result of 25% to 100% to get the full relative value of 125%. This is cognitively more complex, involving a mathematical operation, and more perceptually difficult because the portion of bar A that exceeds the height of bar B is not aligned with the base of bar B.
Observation #1: Relative comparisons of one bar to another are more difficult when you must express the proportional difference of the greater bar.
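To make the arithmetic of the two cases concrete, here is a minimal sketch in Python. The function names and the bar heights are my own, chosen only for illustration; nothing here comes from the study itself. The first function is the direct comparison; the second follows the multi-step route described above for a bar that is taller than the one to which it is compared.

```python
def percent_of(a_height, b_height):
    """Return bar A's height as a percentage of bar B's height."""
    return 100 * a_height / b_height


def percent_of_via_excess(a_height, b_height):
    """Same quantity when A is taller than B, computed in the three steps described in the text."""
    excess = a_height - b_height          # step 1: isolate the part of A that extends above B
    excess_pct = 100 * excess / b_height  # step 2: compare only that part to B's height
    return 100 + excess_pct               # step 3: add the result to 100%


# Hypothetical heights: A is four-fifths of B, then A is five-fourths of B.
shorter_case = percent_of(4, 5)
taller_case = percent_of_via_excess(5, 4)
print(shorter_case, taller_case)
```

Both routes give the same number for the taller bar, of course. The point is only that the second route has more steps, each of which is an opportunity for error.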
Equipped with this possible explanation for the differences in the two experiments’ results, I emailed the authors to request a copy of their data so I could confirm my hypothesis. This led to the next useful insight. Although the authors were receptive to my request, only one of them had access to the data and it was not readily accessible. The one author with the data was buried in activity. I finally received it after waiting for six weeks. I understand that people get busy and my request was certainly not this fellow’s priority. What surprised me, however, is that the data file wasn’t already prepared for easy distribution. A similar request to a different team of authors also resulted in a bit of a delay, but in that case only about half of the data that I requested was ever provided because the remainder was missing, even though the paper had only been recently published. These two experiences have reinforced my suspicion that data sets associated with published studies are not routinely prepared for distribution and might not even exist. This seems like a glaring hole in the process of publishing research. Data must be made available for review. Checking the data can reveal errors in the work and sometimes even intentional fabrication of the results. In fact, I’ll cause the infovis research community to gasp in dismay by arguing that peer reviews should routinely involve a review of the data. Peer reviewers are not paid for their time and many of them review several papers each year. As a result, many peer reviews are done at lightning speed with little attention, resulting in poor quality. To most reviewers, a requirement that they review the data would make participation in the process unattractive and impractical. Without this, however, the peer review process is incomplete.
Observation #2: Data sets associated with research studies are not routinely made available.
When I first got my hands on the data, I quickly checked to see if greater errors in relative judgments were related to comparing bars when the first bar was greater in value than the second, as I hypothesized. What I soon discovered, however, was something that the authors didn’t mention in their paper: in all cases the first bar was shorter than the second. For example, if the question was “What percentage is B of A?”, B (the first bar mentioned) was shorter than A (the second bar mentioned). So much for my hypothesis. What the hell, then, was causing greater errors in the “relative judgment” experiment?
Before diving into the data, it occurred to me that I should first confirm that greater errors actually did exist in the “relative judgment” experiment compared to the “absolute judgment” experiment. They certainly seemed to when using the statistical mean as a measure of average error. However, when the mean is used in this way, we need to confirm that it’s based on a normal distribution of values; otherwise, it’s not a useful measure of center. Looking at the distributions of errors, I discovered that there were many huge outliers. Correct answers could never exceed 100%, which was the case when bars were equal in height, but I found values as large as 54,654%. These many outliers wreaked havoc on the results when based on the mean, especially in the “relative judgment” experiment. When I switched from the mean to the median as a measure of central tendency, the differences between the two experiments vanished. Discovering this was a useful reminder that researchers often misuse statistics.
Observation #3: Even experienced infovis researchers sometimes base their results on inappropriate statistics.
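For readers who want to see how much damage a single outlier can do, here is a minimal sketch in Python. The error values are invented for illustration and are not taken from the study’s data; the only number borrowed from the text above is the response of 54,654%, which I have paired with a hypothetical correct answer of 50%.

```python
from statistics import mean, median

# Hypothetical absolute errors, in percentage points, for a handful of responses.
# These values are invented; they are not from the study.
errors = [2, 3, 4, 5, 6, 7, 8]

# The same responses plus one wild outlier: a response of 54,654% where the
# (hypothetical) correct answer was 50%, for an error of 54,604 points.
errors_with_outlier = errors + [54654 - 50]

print("mean without outlier:  ", mean(errors))
print("median without outlier:", median(errors))
print("mean with outlier:     ", mean(errors_with_outlier))
print("median with outlier:   ", median(errors_with_outlier))
```

With these made-up numbers, the one outlier moves the mean from 5 to well over 6,000, while the median shifts only from 5 to 5.5. That is the sense in which the mean stops being a useful measure of center when the distribution is far from normal.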
Having switched from the mean to the median, I spent some time exploring the data from this new perspective. In the process, I stumbled onto an observation that makes perfect sense, but which I’d never consciously considered. Our errors in assessing the relative heights of bars are related to the difference between the heights: the greater the difference, the greater the error. Furthermore, this relationship appears to be logarithmic.
In the two graphs below, the intervals along the X axis represent the proportions of one bar’s height to the other, expressed as a percentage. For example, if the first bar is half the height of the second to which it is compared, the proportion would be 50%. If the two bars were the same height, the proportion would be 100%. In the upper graph the scale along the Y axis represents the median percentage of error that test subjects committed when comparing bars with proportions that fell within each interval along the X axis. The lower graph is the same except that it displays the mean rather than the median percentage of error in proportional judgments.
As you can see, when the first bar is less than 10% of the second bar’s height, errors in judgment are greatest. As you progress from one interval to the next along the X axis, errors in judgment consistently decrease and do so logarithmically. I might not be the first person to notice this, but I’ve never run across it. This is a case where data generated in this study produced a finding that wasn’t intended and wasn’t noticed by the authors. Had I only examined the errors expressed as means rather than medians, I might have never made this observation.
Observation #4: Errors in proportional bar height comparisons appear to decrease logarithmically as the difference in their relative heights decreases.
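The kind of summary that sits behind graphs like these can be sketched in a few lines of Python. This is only an illustration of the approach, not the procedure that I or the authors actually ran: the function name, the interval width of 10 percentage points, the choice of absolute error in percentage points, and every value in the sample data are assumptions of mine.

```python
from collections import defaultdict
from statistics import median


def median_error_by_interval(trials, width=10):
    """Return {interval_lower_bound: median absolute error}.

    trials: iterable of (true_pct, response_pct) pairs, with true_pct between 0 and 100.
    """
    buckets = defaultdict(list)
    for true_pct, response_pct in trials:
        # Lower bound of the interval that holds the true proportion.
        # A proportion of exactly 100% is folded into the top interval.
        lower = min(int(true_pct // width) * width, 100 - width)
        buckets[lower].append(abs(response_pct - true_pct))
    return {lower: median(errs) for lower, errs in sorted(buckets.items())}


# Invented trials for illustration only: (true proportion %, subject's answer %).
trials = [(5, 9), (5, 2), (8, 15), (50, 48), (55, 60), (52, 52), (95, 96), (100, 99), (92, 90)]
result = median_error_by_interval(trials)
print(result)
```

Swapping `median` for `statistics.mean` in the last line of the function would produce the mean-based version of the same summary, which is where a few extreme responses can hide the pattern.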
At this point in my review, I was still left wondering why the vast majority of outliers occurred in the “relative judgment” experiment. Tracking this down took a bit more detective work, this time using a magnifying glass to look at the details. What I found were errors of various types that could have been prevented by more careful experimental design. Test subjects were recruited using Mechanical Turk. Using Mechanical Turk as a pool for test subjects requires that you vet subjects with care. Unlike a direct interaction between test subjects and experimenters, anonymous subjects that participate in Mechanical Turk can more easily exhibit one of the following problems: 1) they can fail to take the experiment seriously, responding randomly or with little effort, and 2) they can fail to understand the directions with no way of determining this without a pre-test. Given the fact that the study was designed to test perception only, the ability of test subjects to correctly express relative proportions as percentages was required. Unfortunately, this ability was taken for granted. One common error that I found was a reversal of the percentage, such as expressing 10% (one tenth of the value) as 1000% (ten times the value). This problem could have been alleviated by providing subjects with the correct answers for a few examples in preparation for the experimental tasks. An even more common error resulted from the fact that graphs contained three bars and subjects were asked to compare a specific set of two bars in a specific order. Many subjects made the mistake of comparing the wrong bars, which can be easily detected by examining their responses in light of the bars they were shown.
[Note: After posting this blog, I investigated this observation further and discovered that it was flawed. See my comment below, posted on April 5, 2016 at 4:13pm, to find out what I discovered.]
Observation #5: When test subjects cannot be directly observed, greater care must be taken to eliminate extraneous differences in experiments if the results are meant to be compared.
I could have easily skipped to the end of this paper to read its conclusions. Having confirmed that the authors found increases in errors when bars have embellishments, I could have gone on my merry way, content that my assumptions were correct. Had I done this, I would have learned little. Reviewing the work of others, especially in the thorough manner that is needed to publish a critique, is fertile ground for insights and intellectual growth. Everyone in the infovis research community would be enriched by this activity, not to mention how much more useful peer reviews would be if they were done with this level of care.
March 31st, 2016
A recent article titled “The Sleeper Future of Data Visualization? Photography” extends the definition of data visualization to a new extreme. Proposing photography as the future of data visualization is an example of the slippery slope down which we descend when we allow the meanings of important terms to morph without constraint. Not long ago I expressed my concern that a necklace made of various ornaments, designed to represent daily weather conditions, was being promoted as an example of data visualization. The term “data visualization” was initially coined to describe something in particular: the visual display of quantitative data. Although one may argue that data of any type (including individual pixels of a digital photograph) and anything that can be seen (including a necklace) qualify as data visualization, by allowing the term to morph in this manner we reduce its usefulness. Photographs can serve as a powerful form of communication, but do they belong in the same category as statistical graphs? A necklace with a string of beads and bangles that represent the last few days of weather might delight, but no one with any sense would argue that it will ever be used for the analysis or communication of data. Yes, this is an issue of semantics. I cringe, however, whenever I hear someone say, “This disagreement is merely semantic.” Merely semantic?! There is nothing mere about differences contained in conflicting meanings.
When I warn against the promiscuous morphing of terms, I’m often accused of a purist’s rigidness, but that’s a red herring. When I argue for clear definitions, I am fighting to prevent something meaningful and important from degenerating into confusion. Data visualization exists to clarify information. Let’s not allow its definition to contribute to the very murkiness that it emerged to combat. We already have a term for the images that we capture with cameras: they’re called photographs. We have a term for a finely crafted necklace: it’s a piece of art. If that necklace in some manner conveys data, call it data art if you wish, but please don’t create confusion by calling it data visualization.
Aside from the danger of describing photography as data visualization, the article exhibits other sloppy thinking. It promotes a new book titled “Photo Viz” by Nicholas Felton. Here’s a bit of the article, including a few words from Felton himself:
Every data visualization you’ve ever seen is a lie. At least in part. Any graph or chart represents layers and layers of abstraction…Which is why data-viz guru Nicholas Felton…is suddenly so interested in photography. And what started as a collection of seemingly random photos he saved in a desktop folder has become a curated photography book.
“Photo viz for me, in its briefest terms, is visualization done with photography or based on photography,” Felton says. And that means it’s visualization created without layers of abstraction, because every data point in an image is really just a photon hitting your camera sensor.
Abstraction is not a problem that should be eliminated from graphs. Even though millions of photonic data points might be recorded in a digital photograph, they do not represent millions of useful facts. Photos and graphs are apples and oranges. By definition, an individual item of data is a fact. Photos do not contain data in the same sense as graphs do. A fact that appears in a graph, such as a sales value of $382,304, is quite different from an individual pixel in a photo. Graphs are abstractions for a very good reason. We don’t want millions of data points in a graph; we only want the data that’s needed for the task at hand.
In the following example of photography as data visualization from Felton’s book, the image is wonderfully illustrative and potentially informative.
Although useful, this montage of photographs that illustrates a surfing maneuver is not an example of data visualization. We can applaud such uses of photography without blurring the lines between photographic illustration and data visualization.
A graph is abstract in another sense as well—one that is even more fundamental: a graph is a visual representation of abstract data. Unlike a photo, which represents physical data, graphs give visual form to something that lacks physical form and is in that sense abstract. Financial data is abstract; a flower is physical. I wouldn’t use a photo to represent quantitative data, nor would I use a graph to represent a flower.
How we classify things, each with its kin, matters. Just because a gorilla sometimes stands on two legs, we don’t call him a man.
March 22nd, 2016
Science is the best method that we’ve found for seeking truth. I trust science, but I don’t trust scientists. Science itself demands that we doubt and therefore scrutinize the work of scientists. This is fundamental to the scientific method. Science is too important to allow scientists to turn it into an enterprise that primarily serves the interests of scientists. Many have sounded the alarm in recent years that this tendency exists and must be corrected. BBC Radio 4 recently aired a two-episode series by science journalist Alok Jha titled “Saving Science from the Scientists.” Jha does an incredible job of exposing some of the ways in which science is currently failing us, not because its methods are flawed, but because scientists often fail to follow them.
This system can’t just rely on trust. Transparency and openness have to be implicit. In speaking with scientists it became clear to me that the culture and incentives within the modern scientific world itself are pushing bad behavior.
We all have a stake in this. Science has and will continue to form a big part in modern life, but we seem to have given scientists a free pass in society. Perhaps it’s time to knock scientists off their pedestal, bring them down to our level, and really scrutinize what they’re up to. Let’s acknowledge and account for the humans in science. It will be good for them and it will be good for us.
Marc Edwards, the Virginia Tech professor who exposed the high levels of lead in the water of Flint, Michigan, expresses grave concerns about our modern scientific enterprise. Bear in mind that the toxins that he discovered and exposed had been denied by government scientists. Here’s a bit of Jha’s interview with him:
My fear is that someday science will become like professional cycling, where, if you don’t cheat you can’t compete…The beans that are being counted for success have almost nothing to do with quality. It has to do with getting your number of papers, getting your research funding, inflating your h-index, and frankly, there are games that people play to make these things happen.
The h-index is a ranking system for scientists that is based on the number of publications and citations by others of those publications. Science is a career. To advance, you must publish and be cited. This perverts the natural incentives of science from a pursuit of knowledge to a pursuit of professional advancement and security.
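For those unfamiliar with how the number is derived: a scientist’s h-index is the largest number h such that h of his or her papers have each been cited at least h times. Here is a minimal sketch in Python; the function name and the citation counts are my own, made up for illustration.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)  # most-cited paper first
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Hypothetical citation counts for two researchers with five papers each.
steady = [10, 8, 5, 4, 3]
one_hit = [25, 8, 5, 3, 3]
print(h_index(steady), h_index(one_hit))
```

Notice that the measure counts papers and citations and nothing else. Nothing in the calculation knows whether the work was any good, which is precisely the concern that Edwards raises.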
Even the much praised process of peer review is often dysfunctional. Reviewers are often unqualified. Even more of a problem, however, is the fact that they are busy and therefore take little time in their reviews, glossing over the surface of studies that cannot be understood without greater time and thought. How can we address problems in the peer review process? Jha suggests a few thoughts on the matter.
There is a way to tackle these issues, and that’s by opening up more of the scientific process to outside scrutiny. Peer review reports could be published alongside the research papers. Even more importantly, scientists could be releasing their raw data too. It’s an approach that’s already revolutionized the quality of work in one field.
The field that he was referring to in the final sentence was genetics. There was a time when the peer review process in genetics was severely flawed, but steps were taken to put this right.
Dysfunction in the scientific process varies in degree among disciplines. Some are more mature in their efforts to enforce good practices than others. Some, such as infovis research, have barely begun the process of implementing the practices that are needed to promote good science. It is not encouraging, however, that this fledgling field of research has already erected the protections against scrutiny that we have come to expect only from long-term and entrenched institutionalization. The responses that I’ve received from officials in the IEEE InfoVis community to my extensive and thoughtful critiques of its published studies are in direct conflict with the openness that those leaders should be encouraging. When they deny that problems exist or insist that they are addressing them successfully behind closed doors, I can’t help but think of the Vatican’s response for many years to the problem of child molestation. No, I am neither comparing the gravity of bad research to child molestation nor am I comparing researchers to malign priests, but am instead comparing the absurd protectionism of the infovis research community’s leaders to that of Catholic leadership. Systemic problems do exist in the infovis research community and they are definitely not being acknowledged and addressed successfully. Just as in other scientific disciplines, infovis researchers are trapped in a dysfunctional system of their own making, yet they defend and maintain it rather than correcting it for fear of recrimination. They’re concerned that to speak up would result in professional suicide. By remaining silent, however, they are guaranteeing the mediocrity of their profession.
Jha sums up his news story with the following frank reminder:
There’s nothing better than science in helping us to see further, and it’s therefore too important to allow it to become just another exercise in chasing interests instead of truths…We need to save scientific research from the business it’s become, and perhaps we need to remind scientists that it’s us, the public, that gives them the license to do their work, and it’s us to whom they owe their primary allegiance.
I’m not interested in revoking anyone’s license to practice science; I just want to jolt them into remembering what science is, which is much more than a career.
March 1st, 2016
Modern science relies heavily on an approach to the assessment of uncertainty that is too narrow. Scientists rely on statistical measures of significance to establish the merits of their findings, often without fully understanding the limitations of those statistics and the original intentions for their use. P-values and even confidence intervals are cited as stamps of approval for studies that are meaningless and of no real value. Researchers strive to reach significance thresholds as if that were the goal, rather than the addition of useful knowledge. In his book Willful Ignorance: The Mismeasure of Uncertainty, Herbert I. Weisberg, PhD, describes this impediment to science and suggests solutions.
This book is for researchers who are dissatisfied with the way that probability theory is being applied to science, especially those who work in the social sciences. Weisberg describes the situation as follows:
To achieve an illusory pseudo-certainty, we dutifully perform the ritual of computing a significance level or confidence interval, having forgotten the original purposes and assumptions underlying such techniques. This “technology” for interpreting evidence and generating conclusions has come to replace expert judgment to a large extent. Scientists no longer trust their own intuition and judgment enough to risk modest failure in the quest for great success. As a result, we are raising a generation of young researchers who are highly adept technically but have, in many cases, forgotten how to think for themselves.
In science, we strive for greater certainty. Probability is a measure of certainty. But what do we mean by certainty? What we experience as uncertainty arises from two distinct sources: doubt and ambiguity. “Probability in our modern mathematical sense is concerned exclusively with the doubt component of uncertainty.” We measure it quantitatively along a scale from 0 for complete uncertainty to 1 for complete certainty. Statistical measures of probability do not address ambiguity. “Ambiguity pertains generally to the clarity with which the situation of interest is being conceptualized.” Ambiguity—a state of confusion, of simply not knowing—does not lend itself as well as doubt to quantitative measure. It is essentially qualitative. When we design scientific studies, we usually strive to decrease ambiguity through various controls (selecting a homogeneous group, randomizing samples, limiting the number of variables, etc.), but this form of reductionism distances the objects of study from the real world in which they operate. Efforts to decrease ambiguity require judgments, which require expertise regarding the object of study that scientists often lack.
A chasm exists in modern science between researchers, who focus on quantitative measures of doubt, and practitioners, who rely on qualitative judgments to do their work. This is clearly seen in the world of medicine, with research scientists on one hand and clinicians on the other. “We have become so reliant on our probability-based technology that we have failed to develop methods for validation that can inform us about what really works and, equally important, why.” Uncertainty reduction in science requires a collaboration between these artificially disconnected perspectives.
Our current methodological orthodoxy plays a major role in deepening the division between scientific researchers and clinical practitioners. Prior to the Industrial Age, research and practice were more closely tied together. Scientific investigation was generally motivated more directly by practical problems and conducted by individuals involved in solving them. As scientific research became more specialized and professionalized, the perspectives of researchers and clinicians began to diverge. In particular, their respective relationships to data and knowledge have become quite different.
As I’ve said through various critiques of research studies and discussions with researchers, this chasm between researchers and expert practitioners is especially wide in the field of information visualization and seems to be getting wider.
To make his case, Weisberg takes his readers through the development of probability theory from its beginnings. He does this in great detail, so be forewarned that this assumes an interest in the history of probability. In fact, this history is quite interesting, but it does make up the bulk of the book. It is necessary, however, to help the reader understand the somewhat arbitrary way in which statistical probability was conceptualized in the context of games of chance, as well as the limitations of that particular framing. Within this conceptual perspective, specific statistics such as correlation coefficients and P-values were developed for specific purposes that should be understood.
In the conduct of scientific research, we have the choice of a half-empty or half-full perspective. We must judge whether we really do understand what is going on to some useful extent, or must defer to quantitative empirical evidence. Statistical reasoning seems completely objective, but can blind us to nuances and subtleties. In the past, the problem was to teach people, especially scientists and clinicians, to apply critical skepticism to their intuitions and judgments. Thinking statistically has been an essential corrective to widespread naiveté and quackery. However, in many fields of endeavor, the volumes of potentially relevant data are growing exponentially…Unfortunately, the capacities for critical judgment and deep insight we need may be starting to atrophy, just as opportunities to apply them more productively are increasing.
Don’t assume that Weisberg wants to dismantle the mechanisms of modern science. Instead, he wants to augment them to advance knowledge more effectively.
Is there a way to avoid the regression of science? The answer is surprisingly simple, in principle. We must recognize that probability theory alone is insufficient to establish scientific validity. There is only one foolproof way to learn whether an observed finding, however statistically significant it may appear, might actually hold up in practice. We must dust off the time-honored principle of replication as the touchstone of validity…Only when the system demands and rewards independent replications of study findings can and should public confidence in the integrity of the scientific enterprise be restored.
In addition to study replication, Weisberg also strongly advocates a merging of the perspectives and skills of researchers and practitioners.
Theoretical knowledge and insight can often be helpful in focusing attention or promoting attention on a promising subject of variables. Understanding causal processes will often improve the chances of success, and of identifying factors that are interpretable by clinicians. Clinical insight applied to individual cases will depend on understanding causal mechanisms, not just blind acceptance of black-box statistical models.
Weisberg goes on to suggest ways in which current computer technologies and rapidly expanding data collections create new opportunities for the conduct of science, in many respects similar to Ben Shneiderman’s vision of Science 2.0. Opportunities abound, but they will remain untapped if we fail to correct glaring flaws in our current approach to scientific research. Weisberg knows that this won’t be easy, but he exhibits a balance between concern for systemic dysfunction and optimism for progress. Even more, he offers specific suggestions for setting this progress in motion.
This is a marvelous book—well-written and the product of exceptional thinking. If the role of statistics in research does not interest or concern you, don’t buy this book, for you won’t stick with it. If you share my concerns, however, that science must be renovated and augmented to address the challenges of today and that our understanding and use of probability theory is central to this effort, this book is worth your time.
February 24th, 2016
A few days ago, I received an email from a professor who teaches information visualization about a recent research study titled “Hypothetical Outcome Plots Outperform Error Bars and Violin Plots for Inferences About Reliability of Variable Ordering.” The study, done by Jessica Hullman, Paul Resnick, and Eytan Adar, was published by the journal PLOS ONE on November 15, 2015. The professor asked if I was familiar with it, and, if so, what I thought of it. I wasn’t aware of it, but that soon changed. This study is nonsense—another representation of dysfunction within the infovis research community. Like many infovis researchers, the authors appear to be naive about the ways that people use information visualization in the real world and what actually works for the human brain. In this blog post I’ll highlight the study’s problems and describe the conditions that, I suspect, gave rise to them.
Hypothetical Outcome Plots (HOPs) were created by the authors of this study to display one or more sets of quantitative values so that people can see how those values are distributed and, when multiple sets are displayed, compare the distributions. HOPs do this, not as a static display, such as a histogram or box plot, but as an animated series of values that appear sequentially, 400ms per frame. The following example shows a single HOPs frame (i.e., one of many values).
When animated, the blue line, which represents a single value, changes position to display several values in a data set, one at a time. In the figure below, the animated HOPs on the right represents the same two normal distributions that are displayed on the left with blue lines to mark the means and error bars to represent 95% confidence intervals.
The authors make the following claim about the merits of their study: “Our primary contribution is to provide empirical evidence that untrained users can interpret and benefit from animated HOPs.” When I came across this claim early in the paper, I could not imagine how HOPs could ever serve as a viable substitute for graphs that summarize distributions, or how untrained users would find them more informative than a simple descriptive sentence. What was I missing that allowed the authors to make this claim? Upon further review, I discovered that the authors devised experiments that 1) restricted the usefulness of the static distribution graphs that they pitted against HOPs, and 2) asked subjects to perform useless tasks that were customized to match the abilities of HOPs. Simply put, the authors stacked the deck in favor of HOPs, yet were still unable to back their claims.
Before diving into the study, let’s remind ourselves of what data visualization is. Here’s a definition that I’ve been presenting recently in lectures:
Data visualization is technology-augmented visual thinking and communication about quantitative data.
Data visualization involves human-computer collaboration. We use visual perception and cognition to do what they do well and we allow computers to assist us by performing tasks that they can do better than we can. Not every task is performed by the human visual system. Only those tasks that the visual system can handle better than cognition or a computer are performed in this way. Skilled data visualizers distribute the work of data sensemaking appropriately between perception, cognition, and the computer in ways that leverage the strengths and avoid the weaknesses of each.
Now, back to the study. The authors were inspired by the fact that people think about proportions more naturally in terms of frequencies (counts) rather than percentages. For example, those who have not learned to think comfortably in terms of percentages often find it easier to understand the expression “57 out of 100 people” than the equivalent expression “57% of the people.” With this in mind, it apparently occurred to the authors that, if they represented distributions as a randomly selected set of 100 values and presented those values one at a time as an animation, people could potentially engage in counting to examine and compare distributions.
Let’s think about the characteristics that describe distributions and therefore typically need to be represented by distribution displays. In general, we describe the nature of a distribution in terms of the following three characteristics:
- Spread (the range across which the values are distributed)
- Central Tendency (a measure of the distribution’s center)
- Shape (the pattern that is formed by the set of values when arranged from lowest to highest)
Each of these characteristics answers specific questions that we typically ask about distributions. For example, the central tendency answers such questions as, “What value is most typical?” and “What single value is most representative of the set as a whole?” The shape answers questions such as, “Where are most of the values located?” and “Is the distribution normal, skewed, uniform, bimodal, etc.?” Several graphs have been developed to display these characteristics of distributions, including histograms, frequency polygons, strip plots, quantile plots, violin plots, box plots, and Q-Q plots. They vary in the characteristics that they feature and therefore in the purposes for which they are used. None of the displays that have been developed for this purpose rely on counting. Anyone who needs to examine and compare distributions can easily learn to use these graphs. I know this, because I teach people to use them.
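To make the three characteristics concrete, here is a minimal sketch of how a computer might summarize them. The function name `describe_distribution` and the sample values are my own inventions for illustration, and the skewness measure (Pearson's second coefficient) is only one crude stand-in for "shape." A graph conveys shape far more richly than any single number.

```python
import statistics

def describe_distribution(values):
    """Summarize spread, central tendency, and a rough indicator of shape."""
    ordered = sorted(values)
    mean = statistics.fmean(ordered)
    median = statistics.median(ordered)
    stdev = statistics.pstdev(ordered)
    # Pearson's second skewness coefficient: near 0 for symmetric data,
    # positive when the tail extends toward higher values.
    skew = 3 * (mean - median) / stdev if stdev else 0.0
    return {
        "spread": (ordered[0], ordered[-1]),  # lowest and highest values
        "mean": mean,
        "median": median,
        "skew": skew,
    }

# Hypothetical values, invented for illustration only.
sample = [12, 15, 15, 18, 21, 22, 24, 30, 41, 62]
summary = describe_distribution(sample)
print(summary)
```

In this invented sample the mean sits above the median, so the skew indicator is positive: a few high values stretch the distribution to the right.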
But what about those occasions when we need to explain something about one or more distributions to a lay audience? If that information is best described visually, we use a simple distribution graph and explain how to read it. If that information can be communicated just as well in words and numbers, we take that approach. For a lay audience, anything that HOPs could possibly display could be more clearly presented in a simple sentence. Counting values as they watch an animated display is never a viable solution.
In this study, only one of the tasks that subjects were asked to perform was typical of the questions that we ask when examining distributions. The others were contrived to rely on counting to suggest a use of HOPs. Even if counting could answer some questions that we might ask about distributions, should that lead us to invent a form of data visualization to support the task of counting? Think about it. Is it humans or computers that excel at counting? Clearly, it isn’t humans. We count slowly and are prone to error, but counting is a task that computers were specifically designed to do at great speed and accuracy. What I’m pointing out is that HOPs are an attempt to use the human visual and cognitive systems to do something that is handled far better by a computer. The authors’ attempt to create a counting visualization was fundamentally misguided.
Let’s review the study to see how the authors fallaciously ascribed benefits to an ineffective form of display.
The Study’s Design
The study tested the ability of subjects to perform various tasks while exclusively examining normal distributions using one of the three following displays:
- A short horizontal blue line to mark the mean and error bars to show 95% of the spread around the mean.
- A violin plot in which the widest blue area marks the mean, the top and bottom show the spread, and the varying width of the blue area shows the shape.
- HOPs, in which the lowest and highest positions where values appear during the animation suggest the spread, the position in the middle of that range suggests the mean, and the frequency of values appearing in particular ranges suggests the shape.
Each data set was created by randomly selecting 5,000 values from a larger, normally distributed data set. When displayed as HOPs, however, not all of the values were included. Instead, a random sample of approximately 100 values was selected, varying somewhat from task to task, with a median of 89 values and a mean of 101. These values were then displayed individually, in random order, as an animation. Subjects were given the ability to pause the sequence and to manually advance it forward or backward, frame by frame. The frames were numbered and the numbers were visible to subjects so they could tell where they were in the series at any time, along with the total number of frames.
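The sampling procedure can be sketched roughly as follows. This is my own approximation, not the authors' code: the function name, the seed, the mean of 50, and the standard deviation of 10 are all assumptions, and I draw the 5,000 values directly from a normal distribution rather than from a larger data set.

```python
import random

def make_hops_frames(mean, sd, population_size=5000, n_frames=100, seed=0):
    """Build a data set and the subset of values shown as animation frames."""
    rng = random.Random(seed)
    # Stand-in for the study's data set: values drawn from a normal distribution.
    dataset = [rng.gauss(mean, sd) for _ in range(population_size)]
    # sample() draws without replacement and returns the values in random
    # order, which matches one value per frame, shown in random order.
    frames = rng.sample(dataset, n_frames)
    return dataset, frames

# Hypothetical parameters, chosen only for illustration.
dataset, frames = make_hops_frames(mean=50, sd=10)
for number, value in enumerate(frames[:3], start=1):
    # Each frame is numbered and the total is shown, as in the study's display.
    print(f"frame {number} of {len(frames)}: {value:.1f}")
```

Notice how little of the data a viewer sees: in this sketch, 100 of 5,000 values, one at a time.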
The study was divided into three major sections based on the number of distributions that were shown: 1) four tasks involving one-distribution displays, 2) four tasks involving two-distribution displays, and 3) one task involving a three-distribution display. Sample displays for each of these three sections are illustrated below using violin plots.
The tasks that subjects were asked to perform differed in each of the three sections. Let’s examine each section in turn.
While viewing each single-distribution display, subjects were asked to perform three tasks: 1) identify the mean, 2) estimate what proportion of values were located above a particular point, and 3) estimate what proportion of values fell within a specified range (always multiples of 10, such as from 20 to 50).
As you can probably imagine, when asked to identify the mean, subjects could easily do this when viewing the error bar and violin plot displays. With HOPs, subjects performed less well when the values were distributed across a large spread, as you would expect, but almost as well when the values were distributed across a small spread. This makes sense. In HOPs, when the line that marks the value hops around within a narrow region, it is easy to estimate a position near the center of that region. This was the only task that subjects were asked to perform that was typical of actual tasks that are done with distributions in the real world and wasn’t devised to match the abilities of HOPs.
When subjects were asked to estimate the proportion of values that fell above a particular point or within a particular range, they performed better when using HOPs, which isn’t surprising when you consider the study’s design. The number of values that were shown using HOPs was always relatively close to 100. HOPs supported these specific tasks by inviting subjects to count the number of times that the line appeared above the specified threshold or within the specified range. On the other hand, the error bar display provided no information about the shape of the distribution, so it could not support this task at all. The violin plot provided information about the shape of the distribution, but it is difficult to estimate the percentage of values that fall above a particular position or within a particular range based on the varying width of the blue shaded area. These are not tasks that we would perform using violin plots. Using HOPs to perform these tasks took some time and effort, but it provided the easiest way to answer the questions. It would be ludicrous to conclude from this, however, that HOPs would ever provide the best way to examine distributions. If we ever needed to perform these particular tasks, we would use a different form of display. For example, a histogram with binned intervals of 10 (0-9, 10-19, etc.) would make it easy to determine the proportion of values in a specified range. Nevertheless, we would not ordinarily rely on our visual system to handle this task; we would rely on the computer to respond to a specific query. Queries, which can be generated in various ways, provide precise and efficient answers. For example, “What percentage of values fall above the threshold of 76?” Expressed in computer terms, we would request a count of the rows where the value of some measure is greater than 76.
Virtually all analytical tools support queries such as this, and good visualization tools allow graphs to be brushed to select particular ranges of values to retrieve the precise number or percentage of values associated with those ranges. In light of these efficient and precise options, who would choose to count items while watching a HOPs animation?
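For readers who want to see how little effort such a query requires, here is a minimal sketch. The function names and the synthetic data (a mean of 60 and a standard deviation of 15) are assumptions of mine for illustration; any analytical tool would offer an equivalent.

```python
import random
from collections import Counter

def share_above(values, threshold):
    """Proportion of values greater than the threshold."""
    return sum(1 for v in values if v > threshold) / len(values)

def share_within(values, low, high):
    """Proportion of values from low to high, inclusive."""
    return sum(1 for v in values if low <= v <= high) / len(values)

def bin_counts(values, width=10):
    """Histogram counts keyed by the lower edge of each bin (0, 10, 20, ...)."""
    return Counter(int(v // width) * width for v in values)

# Synthetic data, invented for illustration only.
rng = random.Random(1)
values = [rng.gauss(60, 15) for _ in range(5000)]

print(f"{share_above(values, 76):.1%} of values fall above 76")
print(f"{share_within(values, 20, 50):.1%} of values fall from 20 to 50")
for lower, count in sorted(bin_counts(values).items()):
    print(f"{lower:>4}-{lower + 9:<4} {count}")
```

The computer examines all 5,000 values instantly and without error, which no one counting frames in an animation could hope to match.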
In the two-distribution section of the study, the means of two independent distributions, A and B, were deliberately set to differ such that B was greater in value on average than A. Subjects were asked to compare the distributions. Typically, when comparing two independent distributions, we would ask questions such as:
- “On average, which is greater, A or B, and by how much?”
- “Which exhibits greater variation, A or B?”
- “How do distributions A and B differ in shape?”
Instead of a question along these lines, however, subjects were asked to determine “how often” B was larger than A out of 100. This is a strange question. Imagine looking at the violin plot below and being asked to determine how often values of B are larger than A.
The question doesn’t make sense, does it? The closest question that makes sense is probably “On average, how much greater are values of B than A?” With two normal distributions, this could be determined by comparing their means. The strange question that subjects were asked was clearly designed to demonstrate a use of HOPs. Subjects were directed to count the number of times, frame by frame, that the value of B was higher than A. Those subjects who viewed HOPs, rather than the error bar or violin plot displays, supposedly succeeded in demonstrating the benefits of HOPs if they could count. But what did the HOPs display actually tell them about the two distributions? Remember, the authors are proposing HOPs as a useful form of display for people who don’t understand how to read conventional distribution displays. Imagine that, by counting, the untrained viewer determines that B is greater than A 62 out of 100 times. Does this mean that B is 62% greater than A? It does not. What has the viewer learned that is meaningful about the two distributions by viewing HOPs? Unfortunately, the authors don’t tell us.
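A small calculation shows why “how often” says so little about “how much.” The sketch below assumes two independent normal distributions, in which case the difference B − A is itself normal, with a mean equal to the difference of the means and a variance equal to the sum of the variances. The specific means and standard deviations are hypothetical, chosen by me so that both pairs land near 62 out of 100.

```python
from statistics import NormalDist

def prob_b_exceeds_a(a, b):
    """P(B > A) for independent, normally distributed A and B."""
    # B - A is normal: the means subtract and the variances add.
    difference = NormalDist(b.mean - a.mean, (a.stdev**2 + b.stdev**2) ** 0.5)
    return 1 - difference.cdf(0)

# Hypothetical pair 1: wide distributions, B about 8.6% greater on average.
a1, b1 = NormalDist(50, 10), NormalDist(54.3, 10)
# Hypothetical pair 2: narrow distributions, B about 1.7% greater on average.
a2, b2 = NormalDist(50, 2), NormalDist(50.86, 2)

p1 = prob_b_exceeds_a(a1, b1)
p2 = prob_b_exceeds_a(a2, b2)
print(f"Pair 1: B exceeds A about {p1:.0%} of the time")
print(f"Pair 2: B exceeds A about {p2:.0%} of the time")
```

Both pairs produce roughly the same “how often” answer, even though the average difference between A and B is five times larger in the first pair than in the second. A count of this kind mixes together the difference in means and the amount of variation, and the viewer has no way to pull them apart.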
What if, rather than comparing two datasets as separate distributions, we want to compare pairs of values to see how they relate to one another? For instance, imagine that we want to see how women’s salaries relate to their male spouse’s salaries to see if one tends to be higher than the other. We can’t see how two sets of values are paired using error bars or violin plots. Are HOPs appropriate for this? They are not. If we wish to examine relationships between two sets of paired values, we’ve moved from distribution analysis to correlation analysis, so we need different types of graphs, such as scatterplots. Watching HOPs animations to examine correlations would never match the usefulness of a simple scatterplot.
In this final section of the study, consisting of distributions A, B, and C, subjects were asked to determine “how often” B was larger than both A and C. Fundamentally, this is the same task as the one in the two-distribution displays section, only complicated a bit by the addition of a third variable. Even if it were appropriate to compare three independent distributions by randomly selecting a sample of 100 values from each, arbitrarily arranging them in groups of three (one value per variable), and counting the number of times B was greater than A and C, this is not a task that we would typically perform by viewing a visualization of any type. Instead, if we were examining the data ourselves, we would simply query the data to determine the number or percentage of instances in which B > A and B > C. Watching a time-consuming animation would be absurd. Or, if we were reporting our findings to untrained users, we would do so with a simple sentence, such as “B is greater than both A and C in 32 out of 100 instances.”
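Such a query might look something like the following sketch. The function name and the synthetic data are assumptions of mine, and the values are invented, so the count it prints has nothing to do with the study's results.

```python
import random

def count_b_largest(a_values, b_values, c_values):
    """Count the triples in which B exceeds both A and C."""
    return sum(
        1 for a, b, c in zip(a_values, b_values, c_values) if b > a and b > c
    )

# Synthetic samples of 100 values each, invented for illustration only.
rng = random.Random(7)
a_values = [rng.gauss(40, 10) for _ in range(100)]
b_values = [rng.gauss(50, 10) for _ in range(100)]
c_values = [rng.gauss(45, 10) for _ in range(100)]

hits = count_b_largest(a_values, b_values, c_values)
print(f"B is greater than both A and C in {hits} out of {len(b_values)} instances.")

# The grouping into triples is arbitrary, so rearranging the same values
# can produce a somewhat different count.
rng.shuffle(b_values)
print(count_b_largest(a_values, b_values, c_values))
```

The second count is a reminder that, with independent distributions, the answer depends in part on how the values happen to be grouped, which is one more reason to doubt that the count tells us anything worth an animation.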
The flaws in this study should be obvious to anyone with expertise in data visualization, so how is it that this study was performed by academics who specialize in infovis research, and how did it pass the peer review process, resulting in publication? In part, I think people are inclined to embrace this study because it exhibits two qualities that are attractive to the infovis research community: 1) it proposes something new (innovation is valued above effectiveness by many in the field), and 2) it features animation, which is fun. Who can resist the natural appeal of “dancing data”? Those of us who rely on data visualization to do real work in the world, however, don’t find it difficult to resist inappropriate animations. Those who approach data visualization merely as the subject matter of research publications that will earn them notoriety and tenure are more susceptible to silly, ineffective visualizations. As long as the research community embraces this nonsense, it will remain of little value to the world. If you’re involved in the research community, this should concern you.
How am I able to find flaws in studies like this when researchers, including professors, miss them? It isn’t because I’m smarter than the average researcher. What sets me apart is my perspective. Unlike most researchers, I’m deeply involved in the actual use of data visualization and have been for many years. Because I work closely with others who use data visualization to solve real-world problems, I’m also painfully aware of the cost—the downright harm—of doing it poorly. These perspectives are foreign to many in the infovis research community. You cannot do good infovis research without first developing some expertise in the actual use of data visualization. This should be obvious, without needing to be said, but sadly, it is not.
P.S. I realize that this critique will likely ignite another shit storm of angry responses from the infovis research community. I will be accused of excessive harshness. Rather than responding to the substance of my critique, many will focus on my tone. To the degree that my critiques are sometimes harsh in tone, rest assured that I’ve crafted a tone that I believe is appropriate and necessary. I’m attempting to cut through the complacency of the infovis research community. If you believe that there is a kinder, gentler way to bring poor and potentially harmful research to light, I invite you to make the attempt. If your approach succeeds where mine fails, I will embrace you with gratitude. The best way to address the problem of poor research, of course, is to nip it in the bud before it is published, but that clearly isn’t happening.