Thanks for taking the time to read my thoughts about Visual Business
Intelligence. This blog provides me (and others on occasion) with a venue for ideas and opinions
that are either too urgent to wait for a full-blown article or too
limited in length, scope, or development to require the larger venue.
For a selection of articles, white papers, and books, please visit
May 6th, 2016
This is my response to a recent blog article by Robert Kosara of Tableau Software titled “3D Bar Charts Considered Not that Harmful.” Kosara has worked in the field of data visualization as a college professor and a researcher for many years, first at the University of North Carolina and for the last several years at Tableau. He’s not a fly-by-night blogger. But even the advice of genuine experts must be scrutinized, for gaps in their experience and biases, such as loyalties to their employers, can render their advice unreliable.
It has become a favorite tactic of information visualization (infovis) researchers to seek notoriety by discrediting long-held beliefs about data visualization that have been derived from the work of respected pioneers. For example, poking holes in Edward Tufte’s work in particular now qualifies as a competitive sport. Tufte’s claims are certainly not without fault. Many of his principles emerged as expert judgments rather than from empirical evidence. Most of his expert judgments, however, are reliable. While we should not accept anyone’s judgment as gospel without subjecting it to empirical tests, when we test them, we should do so with scientific rigor. Most attempts by researchers to discredit Tufte’s work have been based on sloppy, unreliable pseudo-science.
Back to Kosara’s recent blog article. Here’s the opening paragraph:
We’ve turned the understanding of charts into formulas instead of encouraging people to think and ask questions. That doesn’t produce better charts, it just gives people ways of feeling superior by parroting something about chart junk or 3D being bad. There is little to no research to back these things up.
We should certainly encourage people to use charts in ways that lead them to think and ask questions. Have you ever come across anyone who disagrees with this? Apparently the formulaic understanding of charts that “we” have been promoting produces a sense of superiority, evidenced by the use of terms such as “chart junk,” coined by Tufte. Kosara’s blog entry was written in response to Twitter-based comments about the following 3-D bar graph:
As you can see, this is not an ordinary 3-D bar graph. It starts out as a fairly standard 2-D bar graph on the left and then takes a sudden turn in perspective to reveal an added dimension of depth to the bars that shoot out from the page. Kosara describes the graph as follows:
At first glance, it’s one of those bad charts. It’s 3D, and at a fairly extreme angle. The perspective projection clearly distorts the values, making the red bar look longer in comparison to its real value difference. The bars are also cut off at the base, at least unless you consider the parts with the labels to be the bottoms of the bars (and even then, they’re not the full length to start at 0).
But then, what is this supposed to show? It’s about the fact that a fungicide names [sic] Trivapro produces more yield than the two other products or no treatment. There is no confusion here about which bar is longer. And the values are right there on the chart. You can do some quick math to figure out that a gain of 32 over the base of 146 is an increase of a bit over 20%…
Based on Kosara’s own description, this graph does not communicate clearly and certainly isn’t easy or efficient to read. He goes on to admit this fact more directly.
Is this a great chart? No. It’s not even a good chart. Is this an accurate chart? No. Though it has the numbers on it, so it’s less bad than without.
Lest we rashly render the judgment that this graph deserves, Kosara cautions, “It is much less bad than the usual knee-jerk reaction would have you think, though.” Damn, it’s too late. My knee already jerked with abandon.
The gist of Kosara’s article is two-fold: 1) 3-D graphs are not all that bad, and 2) we should only be concerned with problems that researchers have confirmed as real. It would be great if we could rely on infovis researchers to serve as high priests of graphical Truth, but relatively few of them have been doing their jobs. His own recent studies and the others that he cites in the article are fundamentally flawed. This includes the respected study on which Kosara bases his claim that 3-D effects in graphs are “not that harmful,” titled “Reading Bar Graphs: Effects of Extraneous Depth Cues and Graphical Context” by Jeff Zacks, Ellen Levy, Barbara Tversky, and Diane Schiano. This paper, published in the Journal of Experimental Psychology: Applied in 1998, missed the mark.
The 1998 study consisted of five experiments. The first two experiments contain the findings on which Kosara bases his defense of 3-D bar graphs. In the first experiment, test subjects were shown a test bar in a rectangular frame, which was rendered in either 2-D or 3-D. My reproductions of both versions are illustrated below.
The perspective of the 3-D bar was manipulated only slightly, keeping it as simple as possible and giving it the best possible chance of causing no harm. Subjects were then asked to match the test bar to one of the bars in a separate five-bar array. The bars in the array ranged in height from 20 millimeters to 100 millimeters in 20-millimeter increments. Two versions of the five-bar array were provided—one with 2-D bars and one with 3-D bars—one on each side of a separate sheet of paper. Half of the time the test bar was shown alone, as illustrated above, and the other half a second context bar appeared next to the test bar, but the test bar was always marked as being the one that should be matched to a bar in the five-bar array. The purpose of the context bar was to determine if the presence of another bar of a different height from the test bar had an effect on height judgments. This experiment found that subjects made greater errors in matching the heights of 3-D vs. 2-D bars, as expected. It also found that the presence of context bars had no effect on height judgments.
It was the second experiment that led Kosara to claim that 3-D effects in bars ought not to concern us. The second experiment was exactly like the first, except for one difference that Kosara described as the addition of a “delay after showing people the bars.” He went on to explain that this delay eliminated differences in height judgments when viewing 2-D vs. 3-D bars, and further remarked, “That is pretty interesting, because we don’t typically have to make snap judgments based on charts.” Even on the surface this comment is wrong. When we view bar graphs, the perceptual activity of reading and comparing the heights of the bars is in fact a snap judgment. It is handled rapidly and pre-attentively by the visual cortex of the brain, involving no conscious effort. The bigger error in Kosara’s comment, however, is his description of the second experiment as the same as the first except for a “delay” after showing people the bars. The significant difference was not the delay itself, but the cause of the delay. After viewing the test bar, subjects were asked to remove it from view by turning the sheet of paper over, placing it on the desk, and then retrieving a second sheet on which the test bar no longer appeared before looking at the five-bar array to select the matching bar. In other words, when they made their selection the test bar was no longer visible, which meant that they were forced to rely on working memory as their only means of matching the test bar to the bar of corresponding height in the five-bar array.
When subjects were forced to rely on working memory rather than using their eyes to match the bars, errors in judgment increased significantly overall. In fact, errors increased so dramatically that the difference seen in the first experiment between 2-D and 3-D bars disappeared. Put differently, the errors introduced by relying on working memory were so large that the lesser differences based on 2-D vs. 3-D rendering became negligible in comparison.
Another difference surfaced in the second experiment, which Kosara interpreted as further evidence that 3-D effects shouldn’t concern us when compared to greater problems.
The other effect is much more troubling, though: neighboring bars had a significant effect on people’s perception. This makes sense, as we’re quite susceptible to relative size illusions like the Ebinghaus [sic] Effect (in case you haven’t seen it, the orange circles below are the same size).
What this means is that the data itself causes us to misjudge the sizes of the bars!
Where to begin? The Ebbinghaus Illusion pertains specifically to the areas of circles, not the lengths of bars. Something similar, called the Parallel Lines Illusion, was what concerned the authors of the 1998 study (see below).
Most people perceive the right-hand line in the frame on the left as longer than the right-hand line in the frame on the right, even though they are the same length. As you can see in my illustration below, however, this illusion does not apply to lines that share a common baseline and a common frame, as bars do. The second and fourth lines appear equal in height.
Also, if the presence of context bars caused subjects to make errors in height judgments, why wasn’t this effect found in the first experiment? Let’s think about this. Could the fact that subjects had to rely on working memory explain the increase in errors when context bars were present? You bet it could. The presence of two visual chunks of information (the test bar and the context bar) in working memory rather than one (the test bar only) increased the cognitive load, making the task more difficult. The second experiment revealed absolutely nothing about 2-D vs. 3-D bars. Instead, it confirmed what was already known: working memory is limited and reliance on it can have an effect on performance.
In the last paragraph of his article, Kosara reiterates his basic argument:
It’s also important to realize just how little of what is often taken as data visualization gospel is based on hearsay and opinion rather than research. There are huge gaps in our knowledge, even when it comes to seemingly obvious things. We need to acknowledge those and strive to close them.
Let me make an observation of my own. It is important to realize that what is often claimed by infovis researchers is just plain wrong, due to bad science. I wholeheartedly agree with Kosara that we should not accept any data visualization principles or practices as gospel without confirming them empirically. However, we should not throw them out in the meantime if they make sense and work, and we certainly shouldn’t reject them based on flawed research. The only reliable finding in the 1998 study regarding 2-D vs. 3-D bars was that people make more errors when reading 3-D bars. Until and unless credible research tells us differently, I’ll continue to avoid 3-D bar graphs.
(P.S. I hope that Kosara’s defense of 3-D effects is not a harbinger of things to come in Tableau. That would bring even more pain than those silly packed bubbles and word clouds that were introduced in version 8.)
May 3rd, 2016
Adam Grant of the Wharton School of Business has written a marvelous new book titled Originals: How Non-Conformists Move the World.
Similar to Malcolm Gladwell’s book Outliers, Grant’s book shows that originality is not something we’re born with but something that we can learn. We heap high praise on original thinkers who manage to make their mark on the world, yet our default response is to discourage original thinking. Being a non-conformist takes courage. Without useful originality, there is no progress. How do we foster greater originality in our world and in ourselves? Grant does a wonderful job of telling us how.
According to Grant, “Originality involves introducing and advancing an idea that’s relatively unusual within a particular domain, and that has the potential to improve it.” “Originals” are more than idea generators. “Originals are people who take the initiative to make their visions a reality.”
Allow me to whet your appetite for this book by sharing a few excerpts that spoke to me.
“The hallmark of originality is rejecting the default and exploring whether a better option exists…The starting point is curiosity…”
They [originals] feel the same fear, the same doubt, as the rest of us. What sets them apart is that they take action anyway. They know in their hearts that failing would yield less regret than failing to try.
Broad and deep experience is critical for creativity. In a recent study comparing every Nobel Prize-winning scientist from 1901 to 2005 with typical scientists of the same era, both groups attained deep expertise in their respective field of study. But the Nobel Prize winners were dramatically more likely to be involved in the arts than less accomplished scientists.
Procrastination may be the enemy of productivity, but it can be a resource for creativity. Long before the modern obsession with efficiency precipitated by the Industrial Revolution and the Protestant work ethic, civilizations recognized the benefits of procrastination. In ancient Egypt, there were two different verbs for procrastination: one denoted laziness; the other meant waiting for the right time.
“Dissenting for the sake of dissenting is not useful. It is also not useful if it is ‘pretend dissent’—for example, if role-played,” [Charlan] Nemeth explains. “It is not useful if motivated by considerations other than searching for the truth or the best solutions. But when it is authentic, it stimulates thought; it clarifies and emboldens.”
“Shapers” are independent thinkers, curious, non-conforming, and rebellious. They practice brutal, nonhierarchical honesty. And they act in the face of risk, because their fear of not succeeding exceeds their fear of failing.
The easiest way to encourage non-conformity is to introduce a single dissenter…Merely knowing that you’re not the only resister makes it substantially easier to reject the crowd.
If you want people to take risks, you need first to show what’s wrong with the present. To drive people out of their comfort zones, you have to cultivate dissatisfaction, frustration, or anger at the current state of affairs, making it a guaranteed loss.
To channel anger productively, instead of venting about the harm a perpetrator has done, we need to reflect on the victims who have suffered from it…Focusing on the victim activates what psychologists call empathetic anger—the desire to right wrongs done to another.
I hope that this brief glimpse into Originals is enough to convince you of its worth. We need more originals to solve the many, often complex problems that threaten us today. This book doesn’t just make this case, it outlines a plan for making it happen.
April 29th, 2016
Todd Rose, director of the “Mind, Brain, and Education” program at the Harvard Graduate School of Education, has written a brilliant and important new book titled The End of Average.
In it he argues that our notion of average, when applied to human beings, is terribly misguided. The belief that variation can be summarized using measures of center is often erroneous, especially when describing people. The “average person” does not exist, but the notion of the “Average Man” is deeply rooted in our culture and social institutions.
Sometimes variation—individuality—is the norm, with no meaningful measure of average. Consider the wonderful advances that have been made in neuroscience over the past 20 years or so. We now know so much more about the average brain and how it functions. Or do we? Some of what we think we know is a fabrication based on averaging the data.
In 2002, Michael Miller, a neuroscientist at UC Santa Barbara, did a study of verbal memory using brain scans. Rose describes this study as follows:
One by one, sixteen participants lay down in an fMRI brain scanner and were shown a set of words. After a rest period, a second series of words was presented and they pressed a button whenever they recognized a word from the first series. As each participant decided whether he had seen a particular word before, the machine scanned his brain and created a digital “map” of his brain’s activity. When Miller finished his experiment, he reported his findings the same way every neuroscientist does: by averaging together all the individual brain maps from his subjects to create a map of the Average Brain. Miller’s expectation was that this average map would reveal the neural circuits involved in verbal memory in the typical human brain…
There would be nothing strange about Miller reporting the findings of his study by publishing a map of the Average Brain. What was strange was the fact that when Miller sat down to analyze his results, something made him decide to look more carefully at the individual maps of his research participants’ brains… “It was pretty startling,” Miller told me. “Maybe if you scrunched up your eyes real tight, a couple of the individual maps looked like the average map. But most didn’t look like the average map at all.”
The following set of brain scans from Miller’s study illustrates the problem:
As you can see, averaging variation in cases like this does not accurately or usefully represent the data or the underlying phenomena. Unfortunately, this sort of averaging remains common practice in biology and social sciences. As Rose says, “Every discipline that studies human beings has long relied on the same core method of research: put a group of people into some experimental condition, determine their average response to the condition, then use this average to formulate a general conclusion about all people.”
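A toy example (with invented numbers, not Miller’s data) illustrates how an average can fail to resemble any individual: if each person activates a different region, the average pattern is weak and uniform, matching no one.

```python
# Toy illustration (invented data, not Miller's scans): averaging several
# distinct "activation patterns" yields a pattern resembling none of them.
patterns = [
    [1, 0, 0, 0],  # person 1 activates only region 1
    [0, 1, 0, 0],  # person 2 activates only region 2
    [0, 0, 1, 0],  # person 3 activates only region 3
    [0, 0, 0, 1],  # person 4 activates only region 4
]

n = len(patterns)
average = [sum(region) / n for region in zip(*patterns)]
print(average)  # [0.25, 0.25, 0.25, 0.25]
# The "Average Brain" shows weak, uniform activity everywhere, while
# every individual shows strong, focal activity in a single region.
```

The average here is not merely imprecise; it describes a brain that belongs to no one in the sample.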
This problem can be traced back to Belgian astronomer turned social scientist Adolphe Quetelet in the early 19th century. Quetelet (pronounced “kettle-lay”) took the statistical mean down a dark path that has since become a deep and dangerous rut. Sciences that study human beings have fallen into this rut and remained trapped there ever since. Many of the erroneous findings in these fields of research can be traced to this fundamental misunderstanding and misuse of averages. It’s time to build a ladder and climb out of this hole.
When Quetelet began his career as an astronomer in the early 19th century, the telescope had recently revolutionized the science. Astronomers were producing a deluge of measurements about heavenly bodies. It was soon observed, however, that multiple measurements of the same things differed somewhat, which became known as the margin of error. These minor differences in measurements of physical phenomena almost always varied symmetrically around the arithmetic mean. Recognition of the “normal distribution” emerged in large part as a result of these observations. When Quetelet’s ambition to build a world-class observatory in Belgium was dashed because the country became embroiled in revolution, he began to wonder if it might be possible to develop a science for managing society. Could the methods of science that he learned as an astronomer be applied to the study of human behavior? The timing of his speculation was fortunate, for it coincided with the 19th century’s version of so-called “Big Data” as a tsunami of printed numbers. The development of large-scale bureaucracies and militaries led to the publication of huge collections of social data. Quetelet surfed this tsunami with great skill and managed to construct a methodology for social science that was firmly built on the use of averages.
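The pattern those early astronomers noticed can be mimicked with a quick simulation (hypothetical numbers, not historical data): repeated measurements of a single true quantity, each perturbed by independent error, scatter symmetrically around their arithmetic mean.

```python
import random

# Simulated sketch (hypothetical values): repeated measurements of one
# true quantity, each with independent random error, cluster
# symmetrically around the arithmetic mean.
random.seed(0)
true_value = 100.0
measurements = [true_value + random.gauss(0, 2) for _ in range(10_000)]

mean = sum(measurements) / len(measurements)
below = sum(m < mean for m in measurements)
above = sum(m > mean for m in measurements)

# The mean lands very near the true value, with measurements split
# roughly evenly above and below it.
print(round(mean, 2), below, above)
```

For measurements of a single physical object, the mean really is the best estimate of the true value; Quetelet’s misstep, described below, was transplanting this logic to populations of distinct individuals.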
Quetelet thought of the average as the ideal. When he calculated the average chest circumference of Scottish soldiers, he thought of it as the chest size of the “true” soldier and all deviations from that ideal as instances of error. As he extended his work to describe humanity in general, he coined the term the “Average Man.”
This notion of average as ideal, however, was later revised by one of Quetelet’s followers—Sir Francis Galton—into our modern notion of average as mediocre, which he associated with the lower classes. He believed that we should strive to improve on the average. Galton developed a ranking system for human beings consisting of fourteen distinct classes with “Imbeciles” at the bottom and “Eminent” members of society at the top. Further, he believed that the measure of any one human characteristic or ability could serve as a proxy for all other measures. For example, if you were wealthy, you must also be intelligent and morally superior. In 1909 Galton argued, “As statistics have shown, the best qualities are largely correlated.” To provide evidence for his belief, Galton developed statistical methods for measuring correlation, which we still use today.
Out of this work, first by Quetelet and later by Galton, the notion of the Average Man and the appropriateness of comparing people based on rankings became unconscious assumptions on which the industrial age was built. Our schools were reformed to produce people with the standardized set of basic skills that was needed in the industrial workplace. In the beginning of the 20th century, this effort was indeed an educational reform, for only six percent of Americans graduated from high school. Students were given grades to rank them in ability and intelligence. In the workplace, hiring practices and performance evaluations soon became based on a system of rankings as well. The role of “manager” emerged to distinguish above-average workers who were needed to direct the efforts of less capable, average workers.
I could go on, but I don’t want to spoil this marvelous book for you. I’ll let an excerpt from the book’s dust cover suffice to give you a more complete sense of the book’s scope:
In The End of Average, Rose shows that no one is average. Not you. Not your kids. Not your employees or students. This isn’t hollow sloganeering—it’s a mathematical fact with enormous practical consequences. But while we know people learn and develop in distinctive ways, these unique patterns of behaviors are lost in our schools and businesses which have been designed around the mythical “average person.” For more than a century, this average-size-fits-all model has ignored our individuality and failed at recognizing talent. It’s time to change that.
Weaving science, history, and his experience as a high school dropout, Rose brings to life the untold story of how we came to embrace the scientifically flawed idea that averages can be used to understand individuals and offers a powerful alternative.
I heartily recommend this book.
April 19th, 2016
This blog entry was written by Nick Desbarats of Perceptual Edge.
In recent decades, one of the most well-supported findings from research in various sub-disciplines of psychology, philosophy, and economics is that we all commit elementary reasoning errors on an alarmingly regular basis. We attribute the actions of others to their fundamental personalities and values, but our own actions to the circumstances in which we find ourselves in the moment. We draw highly confident conclusions based on tiny scraps of information. We conflate correlation with causation. We see patterns where none exist, and miss very obvious ones that don’t fit with our assumptions about how the world works.
Even “expert reasoners” such as trained statisticians, logicians, and economists routinely make basic logical missteps, particularly when confronted with problems that were rare or non-existent until a few centuries ago, such as those involving statistics, evidence, and quantified probabilities. Our brains simply haven’t had time to evolve to think about these new types of problems intuitively, and we’re paying a high price for this evolutionary lag. The consequences of mistakes, such as placing anecdotal experience above the results of controlled experiments, range from annoying to horrific. In fields such as medicine and foreign policy, such mistakes have certainly cost millions of lives and, when reasoning about contemporary problems such as climate change, the stakes may be even higher.
As people who analyze data as part of our jobs or passions (or, ideally, both), we have perhaps more opportunities than most to make such reasoning errors, since we so frequently work with large data sets, statistics, quantitative relationships, and other concepts and entities that our brains haven’t yet evolved to process intuitively.
In his wonderful 2015 book, Mindware: Tools for Smart Thinking, Richard Nisbett uses more reserved language, pitching this “thinking manual” mainly as a guide to help individuals make better decisions or, at least, fewer reasoning errors in their day-to-day lives. I think that this undersells the importance of the concepts in this book, but this more personal appeal probably means that this crucial book will be read by more people, so Nisbett’s misplaced humility can be forgiven.
Mindware consists of roughly 100 “smart thinking” concepts, drawn from a variety of disciplines. Nisbett includes only concepts that can be easily taught and understood, and that are useful in situations that arise frequently in modern, everyday life. “Summing up” sections at the end of each chapter usefully summarize key concepts to increase retention. Although Nisbett is a psychologist, he draws heavily on fields such as statistics, microeconomics, epistemology, and Eastern dialectical reasoning, in addition to psychological research areas such as cognitive biases, behavioral economics, and positive psychology.
The resulting “greatest hits” of reasoning tools is an eclectic but extremely practical collection, covering concepts as varied as the sunk cost fallacy, confirmation bias, the law of large numbers, the endowment effect, and multiple regression analysis, among many others. For anyone who’s not yet familiar with most of these terms, however, Mindware may not be the gentlest way to be introduced to them, and first tackling a few books by Malcolm Gladwell, the Heath brothers, or Jonah Lehrer (despite the unfortunate plagiarism infractions) may serve as a more accessible introduction. Readers of Daniel Kahneman, Dan Ariely, or Gerd Gigerenzer will find themselves in familiar territory fairly often, but will still almost certainly come away with valuable new “tools for smart thinking,” as I did.
Being aware of the nature and prevalence of reasoning mistakes doesn’t guarantee that we won’t make them ourselves, however, and Nisbett admits that he catches himself making them with disquieting regularity. He cites research that suggests, however, that knowledge of thinking errors does reduce the risk of committing them. Possibly more importantly, it seems clear that knowledge of these errors makes it considerably more likely that we’ll spot them when they’re committed by others, and that we’ll be better equipped to discuss and address them when we see them. Because those others are so often high-profile journalists, politicians, domain experts, and captains of industry, this knowledge has the potential to make a big difference in the world, and Mindware should be on as many personal and academic reading lists as possible.
April 4th, 2016
We review published research studies for several reasons. One is to become familiar with the authors’ findings. Another is to provide useful feedback to the authors. I review infovis research papers for several other reasons as well. My primary reason is to learn, and this goal is always satisfied—I always learn something—but the insights are often unintended by the authors. By reviewing research papers, I sharpen my ability to think critically. I’d like to illustrate the richness of this experience by sharing the observations that I made when I recently reviewed a study by Drew Skau, Lane Harrison, and Robert Kosara titled “An Evaluation of the Impact of Visual Embellishments in Bar Charts,” published in the Eurographics Conference on Visualization (EuroVis). My primary purpose here is not to reveal flaws in this study, but to show how a close review can lead to new ways of thinking and to thinking about new things.
This research study sought to compare the effectiveness of bar graphs that have been visually embellished in various ways to those of normal design to see if the embellishments led to perceptual difficulties, resulting in errors. The following figure from the paper illustrates a graph of normal design (baseline) and six types of embellishments (rounded tops, triangular bars, capped bars, overlapping triangular bars, quadratically increasing bars, and bars that extend below the baseline).
The study consisted of two experiments. The first involved “absolute judgments” (i.e., decoding the value of a single bar) and the second involved “relative judgments” (i.e., determining the percentage of one bar’s height relative to another). Here’s an example question that test subjects were asked in the “absolute judgments” experiment: “In the chart below, what is the value of C?”
As you can see, the Y axis and scale only include two values: 0 at the baseline and 100 at the top. More about this later. Here’s an example question in the “relative judgments” experiment: “In the chart below, what percentage is B of A?”
As you can see, when relative judgments were tested, the charts did not include a Y axis with a quantitative scale.
Let’s consider one of the first concerns that I encountered when reviewing this study. Is the perceptual task that subjects performed in the “absolute judgment” experiment actually different from the one they performed in the “relative judgment” experiment? By absolute judgment, the authors meant that subjects would use the quantitative scale along the Y axis to decode the specified bar’s value. Ordinarily, we read values in a bar graph by associating a bar’s height with the nearest value along the quantitative scale and then adjusting it slightly up or down depending on whether it is above or below that value. In this experiment, however, only the value of 100 on the scale is useful for interpreting a bar’s value. Given the fact that the top of the Y axis marked a value of 100, its height represented a value of 100% to which the bar could be compared. In other words, the task involved a relative comparison of a bar’s height to the Y axis’ height of 100%, which is perceptually the same as comparing the height of one bar to another. Although perceptually equal, tasks in the “absolute judgment” experiment were slightly easier cognitively because the height of the Y axis was labeled 100, as in 100%, which provided some assistance that was missing when subjects were asked to compare the relative heights of two bars, neither of which had values associated with them.
Why did the authors design two experiments of perception that they described as different when both involved the same perceptual task? They didn’t notice that they were in fact the same. I suspect that this happened because they designed their graphs in a manner that emulated the design that was used by Cleveland and McGill in their landmark study titled “Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods.” In that original study, the graphs all had a Y axis with a scale that included only the values 0 and 100, but test subjects were only asked to make relative judgments, similar to those that were performed in the “relative judgment” experiment in the new study. The authors of the new study went wrong when they added an experiment to test “absolute judgments” without giving the graphs a normal quantitative scale that consisted of several values between 0 and 100.
Despite the equivalence of the perceptual tasks that subjects performed in both experiments, the authors went on to report significant differences between the results of these experiments. Knowing that the perceptual tasks were essentially the same, I speculated about the causes of these differences. This speculation led me to a realization that I’d never previously considered. It occurred to me that in the “relative judgment” experiment, subjects might have been asked at times to determine “What percentage is A of B?” when A was larger than B. Think about it. Relative comparisons between two values (i.e., what is the percentage of bar A compared to bar B) are more difficult when A is larger than B. For example, it is relatively easy to assess a relative proportion when bar A is four-fifths the height of bar B (i.e., 80%), but more difficult to assess a relative proportion when bar A is five-fourths the height of bar B (i.e., 125%). The former operation can be performed as a single perceptual task, but the latter requires a multi-step process. Comparing A to B when A is 25% greater in value than B requires one to perceptually isolate the portion of bar A that extends above the height of bar B, compare that portion only to the height of bar B, and then add the result of 25% to 100% to get the full relative value of 125%. This is cognitively more complex, involving a mathematical operation, and more perceptually difficult because the portion of bar A that exceeds the height of bar B is not aligned with the base of bar B.
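The arithmetic behind the two cases can be sketched as follows (the bar heights are illustrative values, not data from the study):

```python
# Illustrative sketch: the two kinds of relative judgments described above.
# Bar heights are invented values, not data from the study.

def relative_percentage(a: float, b: float) -> float:
    """What percentage is bar A of bar B?"""
    return a / b * 100

# Easy case: A is shorter than B, so the judgment is a single
# perceptual comparison of A against B.
print(relative_percentage(80, 100))   # 80.0

# Harder case: A is taller than B. Perceptually, a viewer must isolate
# the portion of A that extends above B (25 units), judge that portion
# against B's height (25%), and then add it to 100%.
excess = 125 - 100                    # portion of A above B's height
print(100 + excess / 100 * 100)       # 125.0, same as the direct ratio
print(relative_percentage(125, 100))  # 125.0
```

The math is trivial either way; the point is that the second case forces the viewer through an extra perceptual isolation step and an extra addition that the first case does not require.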
Observation #1: Relative comparisons of one bar to another are more difficult when you must express the proportional difference of the greater bar.
Equipped with this possible explanation for the differences in the two experiments’ results, I emailed the authors to request a copy of their data so I could confirm my hypothesis. This led to the next useful insight. Although receptive to my request, only one author had access to the data and it was not readily accessible. The one author with the data was buried in activity. I finally received it after waiting for six weeks. I understand that people get busy and my request was certainly not this fellow’s priority. What surprised me, however, is that the data file wasn’t already prepared for easy distribution. A similar request to a different team of authors also resulted in a bit of a delay, but in that case only about half of the data that I requested was ever provided because the remainder was missing, even though the paper had only been recently published. These two experiences have reinforced my suspicion that data sets associated with published studies are not routinely prepared for distribution and might not even exist. This seems like a glaring hole in the process of publishing research. Data must be made available for review. Checking the data can reveal errors in the work and sometimes even intentional fabrication of the results. In fact, I’ll cause the infovis research community to gasp in dismay by arguing that peer reviews should routinely involve a review of the data. Peer reviewers are not paid for their time and many of them review several papers each year. As a result, many peer reviews are done at lightning speed with little attention, resulting in poor quality. To most reviewers, a requirement that they review the data would make participation in the process unattractive and impractical. Without this, however, the peer review process is incomplete.
Observation #2: Data sets associated with research studies are not routinely made available.
When I first got my hands on the data, I quickly checked to see if greater errors in relative judgments were related to comparing bars when the first bar was greater in value than the second, as I hypothesized. What I soon discovered, however, was something that the authors didn’t mention in their paper: in all cases the first bar was shorter than the second. For example, if the question was “What percentage is B of A?”, B (the first bar mentioned) was shorter than A (the second bar mentioned). So much for my hypothesis. What the hell, then, was causing greater errors in the “relative judgment” experiment?
Before diving into the data, it occurred to me that I should first confirm that greater errors actually did exist in the “relative judgment” experiment compared to the “absolute judgment” experiment. They certainly seemed to exist when using the statistical mean as a measure of average error. When the mean is used in this way, however, we must confirm that it’s based on a roughly normal distribution of values; otherwise it’s not a useful measure of center. Looking at the distributions of errors, I discovered many huge outliers. Correct answers could never exceed 100% (the correct answer when two bars were equal in height), yet I found responses as large as 54,654%. These many outliers wreaked havoc on the results when based on the mean, especially in the “relative judgment” experiment. When I switched from the mean to the median as a measure of central tendency, the differences between the two experiments vanished. Discovering this was a useful reminder that researchers often misuse statistics.
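A simple illustration of the problem, using made-up numbers rather than the study’s data, shows how a single huge outlier distorts the mean while leaving the median nearly untouched:

```python
import statistics

# Hypothetical error values (in percentage points); one absurd outlier
# of the sort I found in the "relative judgment" responses.
errors = [5, 8, 10, 12, 15, 20, 54654]

print(statistics.mean(errors))    # roughly 7818, dominated by the outlier
print(statistics.median(errors))  # 12, a faithful measure of the typical error
```

With the outlier present, the mean suggests catastrophic error; the median tells the true story of typical performance, which is why switching measures made the apparent difference between the experiments vanish.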
Observation #3: Even experienced infovis researchers sometimes base their results on inappropriate statistics.
Having switched from the mean to the median, I spent some time exploring the data from this new perspective. In the process, I stumbled onto an observation that makes perfect sense, but which I’d never consciously considered. Our errors in assessing the relative heights of bars are related to the difference between the heights: the greater the difference, the greater the error. Furthermore, this relationship appears to be logarithmic.
In the two graphs below, the intervals along the X axis represent the proportions of one bar’s height to the other, expressed as a percentage. For example, if the first bar is half the height of the second to which it is compared, the proportion would be 50%. If the two bars were the same height, the proportion would be 100%. In the upper graph the scale along the Y axis represents the median percentage of error that test subjects committed when comparing bars with proportions that fell within each interval along the X axis. The lower graph is the same except that it displays the mean rather than the median percentage of error in proportional judgments.
As you can see, when the first bar is less than 10% of the second bar’s height, errors in judgment are greatest. As you progress from one interval to the next along the X axis, errors in judgment consistently decrease and do so logarithmically. I might not be the first person to notice this, but I’ve never run across it. This is a case where data generated in this study produced a finding that wasn’t intended and wasn’t noticed by the authors. Had I only examined the errors expressed as means rather than medians, I might have never made this observation.
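To show the kind of summary behind these graphs, here is a rough sketch, with invented data and column structure (nothing here comes from the study’s actual file), of binning trials by their true proportion and taking the median error within each interval:

```python
import statistics
from collections import defaultdict

def median_error_by_interval(trials, width=10):
    """trials: iterable of (true_proportion_pct, error_pct) pairs.
    Groups trials into intervals of the given width along the proportion
    axis and returns the median error within each interval."""
    bins = defaultdict(list)
    for proportion, error in trials:
        bins[int(proportion // width) * width].append(error)
    return {lo: statistics.median(errs) for lo, errs in sorted(bins.items())}

# Invented example: large errors at small proportions, small errors near 100%.
trials = [(7, 40), (8, 35), (45, 12), (48, 10), (95, 3), (98, 2)]
print(median_error_by_interval(trials))  # {0: 37.5, 40: 11, 90: 2.5}
```

Plotting the resulting medians against the interval midpoints is what reveals the roughly logarithmic decline in error described above.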
Observation #4: Errors in proportional bar height comparisons appear to decrease logarithmically as the difference in their relative heights decreases.
At this point in my review, I was still left wondering why the vast majority of outliers occurred in the “relative judgment” experiment. Tracking this down took a bit more detective work, this time using a magnifying glass to look at the details. What I found were errors of various types that could have been prevented by more careful experimental design. Test subjects were recruited using Mechanical Turk. Using Mechanical Turk as a pool of test subjects requires that you vet subjects with care. Unlike a direct interaction between test subjects and experimenters, anonymous subjects recruited through Mechanical Turk can more easily exhibit one of the following problems: 1) they can fail to take the experiment seriously, responding randomly or with little effort, and 2) they can fail to understand the directions, with no way of determining this without a pre-test. Because the study was designed to test perception only, the ability of test subjects to correctly express relative proportions as percentages was required. Unfortunately, this ability was taken for granted. One common error that I found was a reversal of the percentage, such as expressing 10% (one tenth of the value) as 1000% (ten times the value). This problem could have been alleviated by providing subjects with the correct answers for a few examples in preparation for the experimental tasks. An even more common error resulted from the fact that graphs contained three bars and subjects were asked to compare a specific set of two bars in a specific order. Many subjects made the mistake of comparing the wrong bars, which can be easily detected by examining their responses in light of the bars they were shown.
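Screening responses for these two error types is straightforward. The following sketch is entirely my own (the function name, tolerance, and logic are assumptions for illustration, not anything the authors did):

```python
def flag_response(response_pct, true_pct, other_pcts=(), tol=5.0):
    """Return a label for a suspect response, or None if it looks fine.

    response_pct: the subject's answer, as a percentage.
    true_pct:     the correct proportion for the requested pair of bars.
    other_pcts:   correct proportions for the other bar pairs in the chart.
    tol:          how close (in percentage points) counts as a match.
    """
    if abs(response_pct - true_pct) <= tol:
        return None
    # Reversal: the subject answered B/A instead of A/B,
    # e.g. 1000% where 10% was correct (since 100/10 * 100 = 1000).
    if true_pct > 0 and abs(response_pct - 10000 / true_pct) <= tol:
        return "reversed"
    # Wrong bars: the answer matches a comparison of some other pair.
    for p in other_pcts:
        if abs(response_pct - p) <= tol:
            return "wrong bars"
    return "other"

print(flag_response(1000, 10))                      # reversed
print(flag_response(62, 30, other_pcts=(60, 45)))   # wrong bars
```

A pre-test with worked examples, plus a screening pass like this one, would have caught most of the outliers before they ever reached the analysis.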
[Note: After posting this blog, I investigated this observation further and discovered that it was flawed. See my comment below, posted on April 5, 2016 at 4:13pm, to find out what I discovered.]
Observation #5: When test subjects cannot be directly observed, greater care must be taken to eliminate extraneous differences in experiments if the results are meant to be compared.
I could have easily skipped to the end of this paper to read its conclusions. Having confirmed that the authors found increases in errors when bars have embellishments, I could have gone on my merry way, content that my assumptions were correct. Had I done this, I would have learned little. Reviewing the work of others, especially in the thorough manner that is needed to publish a critique, is fertile ground for insights and intellectual growth. Everyone in the infovis research community would be enriched by this activity, not to mention how much more useful peer reviews would be if they were done with this level of care.