## Critique to Learn

We review published research studies for several reasons. One is to become familiar with the authors’ findings. Another is to provide useful feedback to the authors. I review infovis research papers for several other reasons as well. My primary reason is to learn, and this goal is always satisfied—I always learn something—but the insights are often unintended by the authors. By reviewing research papers, I sharpen my ability to think critically. I’d like to illustrate the richness of this experience by sharing the observations that I made when I recently reviewed a study by Drew Skau, Lane Harrison, and Robert Kosara titled “An Evaluation of the Impact of Visual Embellishments in Bar Charts,” published in the Eurographics Conference on Visualization (EuroVis). My primary purpose here is not to reveal flaws in this study, but to show how a close review can lead to new ways of thinking and to thinking about new things.

This research study sought to compare the effectiveness of bar graphs that have been visually embellished in various ways to that of bar graphs of normal design, to see whether the embellishments led to perceptual difficulties, resulting in errors. The following figure from the paper illustrates a graph of normal design (baseline) and six types of embellishments (rounded tops, triangular bars, capped bars, overlapping triangular bars, quadratically increasing bars, and bars that extend below the baseline).

The study consisted of two experiments. The first involved “absolute judgments” (i.e., decoding the value of a single bar) and the second involved “relative judgments” (i.e., determining the percentage of one bar’s height relative to another). Here’s an example question that test subjects were asked in the “absolute judgments” experiment: “In the chart below, what is the value of C?”

As you can see, the Y axis and scale only include two values: 0 at the baseline and 100 at the top. More about this later. Here’s an example question in the “relative judgments” experiment: “In the chart below, what percentage is B of A?”

As you can see, when relative judgments were tested, the charts did not include a Y axis with a quantitative scale.

Let’s consider one of the first concerns that I encountered when reviewing this study. Is the perceptual task that subjects performed in the “absolute judgment” experiment actually different from the one they performed in the “relative judgment” experiment? By absolute judgment, the authors meant that subjects would use the quantitative scale along the Y axis to decode the specified bar’s value. Ordinarily, we read values in a bar graph by associating a bar’s height with the nearest value along the quantitative scale and then adjusting slightly up or down depending on whether the bar falls above or below that value. In this experiment, however, only the value of 100 on the scale was useful for interpreting a bar’s value. Because the top of the Y axis marked a value of 100, its height represented a value of 100% to which the bar could be compared. In other words, the task involved a relative comparison of a bar’s height to the Y axis’ height of 100%, which is perceptually the same as comparing the height of one bar to another. Although perceptually equivalent, tasks in the “absolute judgment” experiment were slightly easier cognitively because the height of the Y axis was labeled 100, as in 100%, which provided some assistance that was missing when subjects were asked to compare the relative heights of two bars, neither of which had values associated with them.

Why did the authors design two experiments of perception that they described as different when both involved the same perceptual task? They didn’t notice that they were in fact the same. I suspect that this happened because they designed their graphs in a manner that emulated the design that was used by Cleveland and McGill in their landmark study titled “Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods.” In that original study, the graphs all had a Y axis with a scale that included only the values 0 and 100, but test subjects were only asked to make relative judgments, similar to those that were performed in the “relative judgment” experiment in the new study. The authors of the new study went wrong when they added an experiment to test “absolute judgments” without giving the graphs a normal quantitative scale that consisted of several values between 0 and 100.

Despite the equivalence of the perceptual tasks that subjects performed in both experiments, the authors went on to report significant differences between the results of these experiments. Knowing that the perceptual tasks were essentially the same, I speculated about the causes of these differences. This speculation led me to a realization that I’d never previously considered. It occurred to me that in the “relative judgment” experiment, subjects might have been asked at times to determine “What percentage is A of B?” when A was larger than B. Think about it. Relative comparisons between two values (i.e., what is the percentage of bar A compared to bar B) are more difficult when A is larger than B. For example, it is relatively easy to assess a relative proportion when bar A is four-fifths the height of bar B (i.e., 80%), but more difficult when bar A is five-fourths the height of bar B (i.e., 125%). The former operation can be performed as a single perceptual task, but the latter requires a multi-step process. Comparing A to B when A is 25% greater in value than B requires one to perceptually isolate the portion of bar A that extends above the height of bar B, compare that portion alone to the height of bar B, and then add the resulting 25% to 100% to get the full relative value of 125%. This is cognitively more complex, involving a mathematical operation, and more perceptually difficult because the portion of bar A that exceeds the height of bar B is not aligned with the base of bar B.
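The arithmetic behind these two cases can be made concrete with a brief sketch. The bar heights below are hypothetical values chosen purely for illustration, and `percent_of` is a helper name of my own invention:

```python
def percent_of(a, b):
    """What percentage is a of b?"""
    return 100 * a / b

# Easy case: bar A is four-fifths the height of bar B.
# A single perceptual judgment suffices.
easy = percent_of(80, 100)        # 80.0

# Hard case: bar A is five-fourths the height of bar B.
# Perceptually this is a multi-step estimate:
excess = 125 - 100                # isolate the portion of A above B's height
step = percent_of(excess, 100)    # compare that portion alone to B -> 25.0
hard = 100 + step                 # add it back to 100% -> 125.0

print(easy)   # 80.0
print(hard)   # 125.0
```

The extra subtraction-and-addition in the second case is exactly the mathematical operation that makes the judgment cognitively harder.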

Observation #1: Relative comparisons of one bar to another are more difficult when you must express the proportional difference of the greater bar.

Equipped with this possible explanation for the differences in the two experiments’ results, I emailed the authors to request a copy of their data so I could confirm my hypothesis. This led to the next useful insight. Although receptive to my request, only one author had access to the data and it was not readily accessible. The one author with the data was buried in activity. I finally received it after waiting for six weeks. I understand that people get busy and my request was certainly not this fellow’s priority. What surprised me, however, is that the data file wasn’t already prepared for easy distribution. A similar request to a different team of authors also resulted in a bit of a delay, but in that case only about half of the data that I requested was ever provided because the remainder was missing, even though the paper had only recently been published. These two experiences have reinforced my suspicion that data sets associated with published studies are not routinely prepared for distribution and might not even exist. This seems like a glaring hole in the process of publishing research. Data must be made available for review. Checking the data can reveal errors in the work and sometimes even intentional fabrication of the results. In fact, I’ll cause the infovis research community to gasp in dismay by arguing that peer reviews should routinely involve a review of the data. Peer reviewers are not paid for their time and many of them review several papers each year. As a result, many peer reviews are done at lightning speed with little attention, resulting in poor quality. To most reviewers, a requirement that they review the data would make participation in the process unattractive and impractical. Without this, however, the peer review process is incomplete.

Observation #2: Data sets associated with research studies are not routinely made available.

When I first got my hands on the data, I quickly checked to see if greater errors in relative judgments were related to comparing bars when the first bar was greater in value than the second, as I hypothesized. What I soon discovered, however, was something that the authors didn’t mention in their paper: in all cases the first bar was shorter than the second. For example, if the question was “What percentage is B of A?”, B (the first bar mentioned) was shorter than A (the second bar mentioned). So much for my hypothesis. What the hell, then, was causing greater errors in the “relative judgment” experiment?

Before diving into the data it occurred to me that I should first confirm that greater errors actually did exist in the “relative judgment” experiment compared to the “absolute judgment” experiment. They certainly seemed to when using the statistical mean as a measure of average error. However, when the mean is used in this way, we need to confirm that it’s based on a normal distribution of values, otherwise it’s not a useful measure of center. Looking at the distributions of errors, I discovered that there were many huge outliers. Correct answers could never exceed 100% (a value reached only when the bars were equal in height), yet I found responses as large as 54,654%. These many outliers wreaked havoc on the results when based on the mean, especially in the “relative judgment” experiment. When I switched from the mean to the median as a measure of central tendency the differences between the two experiments vanished. Discovering this was a useful reminder that researchers often misuse statistics.
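The effect of a single extreme outlier on the mean versus the median is easy to demonstrate. The error values below are hypothetical, apart from the one outlier of 54,654%, which is the largest response I actually found in the data:

```python
import statistics

# Hypothetical error percentages: mostly small errors,
# plus one extreme outlier of the kind found in the data.
errors = [2, 3, 5, 4, 6, 3, 2, 54654]

print(statistics.mean(errors))    # 6834.875 -- dragged far upward by the outlier
print(statistics.median(errors))  # 3.5 -- robust to the outlier
```

With even one such outlier, the mean tells us almost nothing about typical performance, while the median barely moves.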

Observation #3: Even experienced infovis researchers sometimes base their results on inappropriate statistics.

Having switched from the mean to the median, I spent some time exploring the data from this new perspective. In the process, I stumbled onto an observation that makes perfect sense, but which I’d never consciously considered. Our errors in assessing the relative heights of bars are related to the difference between the heights: the greater the difference, the greater the error. Furthermore, this relationship appears to be logarithmic.

In the two graphs below, the intervals along the X axis represent the proportions of one bar’s height to the other, expressed as a percentage. For example, if the first bar is half the height of the second to which it is compared, the proportion would be 50%. If the two bars were the same height, the proportion would be 100%. In the upper graph the scale along the Y axis represents the median percentage of error that test subjects committed when comparing bars with proportions that fell within each interval along the X axis. The lower graph is the same except that it displays the mean rather than the median percentage of error in proportional judgments.

As you can see, when the first bar is less than 10% of the second bar’s height, errors in judgment are greatest. As you progress from one interval to the next along the X axis, errors in judgment consistently decrease and do so logarithmically. I might not be the first person to notice this, but I’ve never run across it. This is a case where data generated in this study produced a finding that wasn’t intended and wasn’t noticed by the authors. Had I only examined the errors expressed as means rather than medians, I might have never made this observation.

Observation #4: Errors in proportional bar height comparisons appear to decrease logarithmically as the difference in their relative heights decreases.

At this point in my review, I was still left wondering why the vast majority of outliers occurred in the “relative judgment” experiment. Tracking this down took a bit more detective work, this time using a magnifying glass to look at the details. What I found were errors of various types that could have been prevented by more careful experimental design. Test subjects were recruited using Mechanical Turk. Using Mechanical Turk as a pool for test subjects requires that you vet subjects with care. Unlike a direct interaction between test subjects and experimenters, anonymous subjects who participate through Mechanical Turk can more easily exhibit one of the following problems: 1) they can fail to take the experiment seriously, responding randomly or with little effort, and 2) they can fail to understand the directions, with no way of determining this without a pre-test. Because the study was designed to test perception only, the ability of test subjects to correctly express relative proportions as percentages was required. Unfortunately, this ability was taken for granted. One common error that I found was a reversal of the percentage, such as expressing 10% (one tenth of the value) as 1000% (ten times the value). This problem could have been alleviated by providing subjects with the correct answers for a few examples in preparation for the experimental tasks. An even more common error resulted from the fact that graphs contained three bars and subjects were asked to compare a specific set of two bars in a specific order. Many subjects made the mistake of comparing the wrong bars, which can be easily detected by examining their responses in light of the bars they were shown.
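A reversal of this kind is detectable after the fact, because a reversed answer approximates the reciprocal of the correct proportion. Here is a minimal sketch of such a check; the function name, tolerance, and test values are all hypothetical, not taken from the study’s actual analysis:

```python
def looks_reversed(response, correct, tol=0.1):
    """Flag answers near the reciprocal of the correct proportion,
    e.g. a response of 1000% when the truth is 10%."""
    if correct == 0 or response == 0:
        return False
    reciprocal = 100 * 100 / correct   # 10% -> 1000%, 25% -> 400%, etc.
    return abs(response - reciprocal) / reciprocal < tol

print(looks_reversed(1000, 10))  # True  -- a likely reversal
print(looks_reversed(12, 10))    # False -- an ordinary small error
```

A similar check against the proportions of the other bar pairs in each chart could flag subjects who compared the wrong bars.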

[Note: After posting this blog, I investigated this observation further and discovered that it was flawed. See my comment below, posted on April 5, 2016 at 4:13pm, to find out what I discovered.]

Observation #5: When test subjects cannot be directly observed, greater care must be taken to eliminate extraneous differences in experiments if the results are meant to be compared.

I could have easily skipped to the end of this paper to read its conclusions. Having confirmed that the authors found increases in errors when bars have embellishments, I could have gone on my merry way, content that my assumptions were correct. Had I done this, I would have learned little. Reviewing the work of others, especially in the thorough manner that is needed to publish a critique, is fertile ground for insights and intellectual growth. Everyone in the infovis research community would be enriched by this activity, not to mention how much more useful peer reviews would be if they were done with this level of care.

Take care,

### 8 Comments on “Critique to Learn”

By Jason Mack. April 4th, 2016 at 11:43 am

Steve- I really liked the way you presented your analysis of this paper in terms of what you learned by taking a critical approach. I think the perpetual seeking of endorsements (e.g., likes and retweets) leads people to take an “everything is great” approach and never question anything. On the flip side, providing a constructive critique does take a lot of work, and if delivered in an appropriate context of learning and betterment, it should always be welcomed.

By Stephen Few. April 4th, 2016 at 1:24 pm

Jason,

It’s nice to have someone recognize the work that goes into a thorough critique of a research study. I was thinking earlier today that I have probably become more familiar with some of these studies than many of their authors. It is fairly common for senior researchers to affix their names to studies but contribute little to the actual work. This is a result of the incentives that are built into academia to publish a great deal rather than to publish great work.

Even though “constructive critique…should always be welcomed,” as you said, it often isn’t welcomed by those who stand to benefit the most from it. The infovis research community came to mind a few days ago while I was reading Kathryn Schulz’s wonderful book Being Wrong. Here’s a particularly relevant excerpt:

“In 1972, the psychologist Irving Janis defined groupthink as, ‘a mode of thinking that people engage in when they are deeply involved in a cohesive in-group, when the members’ strivings for unanimity override their motivation to realistically appraise alternative courses of action.’ Groupthink most commonly affects homogenous, close-knit communities that are overly insulated from internal and external criticism, and that perceive themselves as different from or under attack by outsiders. Its symptoms include censorship of dissent, rejection or rationalization of criticisms, the conviction of moral superiority, and the demonization of those who hold opposing beliefs. It typically leads to the incomplete or inaccurate assessment of information, the failure to seriously consider other possible options, a tendency to make rash decisions, and the refusal to reevaluate or alter those decisions once they’ve been made.”

This describes the response of the infovis research community to criticism from practitioners precisely.

By Lane. April 4th, 2016 at 6:26 pm

Hi Steve,

2nd author here– thanks for the re-analysis and the new insight!
I want to add that it’s a growing trend in InfoVis to release data alongside papers.
Here’s one story:

In InfoVis 2014 I published “Ranking Visualizations of Correlation using Weber’s Law” [1], and released the data alongside the experiment materials and analysis scripts [2].
Matthew Kay and Jeff Heer did a solid re-analysis of that data, published a nice followup “Beyond Weber’s Law: A Second Look at Ranking Visualizations of Correlation” [3], and released their updated analysis scripts [4].

While meta-analyses are somewhat new ground for InfoVis, positive critiques (like yours and my own) will help ensure this trend continues.

(By the way, you might find the methodologies in these papers interesting, as they are designed to entirely avoid the response biases you identified in your analysis.)

By Stephen Few. April 5th, 2016 at 1:39 pm

Hi Lane,

I’m encouraged to hear that there is a growing trend in InfoVis to provide data with research papers. Jeff Heer made the same point to me earlier today in an email. A requirement that researchers should provide their data is a no-brainer, in my opinion, that would be difficult to oppose. As I mentioned in my critique, however, I think that it will be difficult to get the community to agree that an examination of the data should be made a requirement of peer review. How do you think the community will respond to this?

Thanks for the links to the other articles. I’ve probably read most of them, but I’ll definitely take a look.

By Stephen Few. April 5th, 2016 at 4:13 pm

One of the benefits of publishing a critique as I’ve done here is that it provides an opportunity to discover errors in your own observations. I received an email from Xan Gregg of SAS that asked several questions about my critique. One was whether the errors in proportional estimates that I displayed on the Y axis of my bar graph were based on the percentage difference between the correct answers and the estimates or on the actual difference. In other words, if the correct proportion of one bar to another is 10% and the estimate given was 15%, was the error expressed as 50% or 5%? I thought it was based on the actual difference, but upon double-checking I discovered that it was based on the percentage difference. Remember that one of my observations, which I thought might be novel, was that there was a logarithmic relationship between the proportion and the amount of error, with a decrease in error as the proportion increases. Once I realized that this was based on percentage differences rather than the actual difference between the correct and estimated values, that logarithmic relationship took on a whole new meaning. Of course errors related to small proportions would be greater than those related to large proportions when expressed as a percentage difference, because even small differences between small values can be huge when expressed as a percentage difference. For example, an estimate of 5% when the correct answer is 4% represents a percentage difference not of 1% but of 25%. Silly me. What I observed as an interesting pattern in proportional estimates was nothing more than a result of differences between small values always yielding much greater percentage differences than equal differences between large values.
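The distinction between the two error measures can be sketched in a few lines, using the example values above (the function names are my own, chosen for illustration):

```python
def actual_difference(estimate, correct):
    """Error in percentage points."""
    return estimate - correct

def percentage_difference(estimate, correct):
    """Error as a percentage of the correct value."""
    return 100 * (estimate - correct) / correct

# True proportion 10%, estimate 15%:
print(actual_difference(15, 10))      # 5   percentage points
print(percentage_difference(15, 10))  # 50.0 percent

# True proportion 4%, estimate 5%: a 1-point error becomes 25%.
print(percentage_difference(5, 4))    # 25.0
```

The same one-point error looms much larger as a percentage difference when the correct value is small, which is exactly what produced the spurious “logarithmic” pattern.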

After realizing my mistake, I examined the relationship between proportion and estimate error again, this time expressing the error as the actual difference rather than the percentage difference. The pattern that I found was much different; it was bell-shaped, with the least error when the proportions were smallest (less than 10%) or largest (90% or greater). Why? I suspect that this is because at these extremes in proportion you are assisted by heuristics: really short bars will usually represent a proportion that is less than 10%, and when the bars that you’re comparing are close to the same size, they almost always have a value that is near 100%. Estimates of proportions in the middle range, however, can’t rely on a simple heuristic and must therefore be based entirely on our ability to perceive proportional differences in bar heights.

By Drew Skau. April 5th, 2016 at 9:06 pm

Steve, thank you for your attention to detail in your review. I certainly appreciate your willingness to check and double check work, as it leads to better science and a clearer understanding of the mechanisms involved with bar chart perception. I too would like to see a more rigorous review process, especially if it helps catch important insights that would otherwise be overlooked.

To improve access to the data from our study, I have uploaded the resulting data to the study’s GitHub repository: https://github.com/dwskau/bar-chart-embellishment

By Stephen Few. April 6th, 2016 at 9:54 am

Drew and Lane,

I appreciate your willingness to welcome critique and respond to it gracefully. Scientists cultivate this ability if they care more about the quality of their work than the defense of their egos. Long after we’re gone and forgotten, it is our work and its contributions that will live on if we do it well.

By Terry Hayden. April 6th, 2016 at 2:36 pm

I appreciate your critiques and explaining why you reached certain conclusions. It helps me when creating visualizations for my company to ensure that the information being presented is not skewed or misrepresented. I read your blog regularly to learn as much as I can about data visualization and have learned a tremendous amount from you. Thank you!