Thanks for taking the time to read my thoughts about Visual Business Intelligence. This blog provides me (and others on occasion) with a venue for ideas and opinions that are either too urgent to wait for a full-blown article or too limited in length, scope, or development to require the larger venue. For a selection of articles, white papers, and books, please visit my library.


One of Bill Gates’ favorite graphs redesigned

January 10th, 2014

This blog entry was written by Bryan Pierce of Perceptual Edge.

The following 3-D treemap was brought to our attention by a participant in our discussion forum.

This graph was selected by Bill Gates to be included in a recent edition of Wired Magazine that he guest edited. He explained why he included the graph as follows:

I love this graph because it shows that while the number of people dying from communicable diseases is still far too high, those numbers continue to come down. In fact, fewer kids are dying, more kids are going to school and more diseases are on their way to being eliminated. But there remains much to do to cut down the deaths in that yellow block even more dramatically. We have the solutions. But we need to keep up the support where they’re being deployed, and pressure to get them into places where they’re desperately needed.

This is an important message and a noble goal. But how well does the graph above tell this story? Not very well, actually.

A treemap is a space-filling graph that uses the size of rectangles to encode one quantitative variable and color intensity to encode a second. This treemap was created by Thomas Porostocky to display worldwide years of life lost by cause using data from the University of Washington’s Institute for Health Metrics and Evaluation database.

Let’s see what we can learn from this graph. First, we notice that the green section representing injuries is significantly smaller than the other two, but the relative sizes of those two sections are difficult to judge. Next, we see that the rectangles in the yellow section are mostly light yellow. If we check the color scale at the bottom, it shows us that most of the diseases in that section are decreasing at an annual rate of between -2% and -3%. We can also see the names on the larger rectangles that represent the causes responsible for more years of life lost (e.g., Malaria), and get a sense of their relative sizes based on their areas, but again, we can’t compare them with any accuracy.

Treemaps were invented by Ben Shneiderman as a means to display part-to-whole relationships among huge numbers of values: data sets that are too large to display using graphs that can be read more easily and accurately, such as bar graphs. Only with a huge set of values would it make sense to rely on the areas of rectangles and the intensities of their colors to represent values, given that our brains cannot interpret these attributes of visual perception easily and accurately.

The 3-D effect that’s been added to the treemap doesn’t provide us any information and makes the treemap harder to decode. One problem introduced by this effect involves the darkened colors that appear on the sides of the treemap to represent shadows, which are meaningless and misleading. 3-D graphs are rarely a good idea, but this 3-D is completely gratuitous.

If a treemap had been the best way to show this information, it would have been better to separate the three major sections using borders rather than different colors. Then a single diverging color scale could have been used for the whole treemap. For instance, negative values could have been varying shades of red, values near zero could have been gray, and positive values could have been varying shades of blue. This would have made it significantly easier to decode the values—especially the values near zero, representing little change—than the current design that uses three different sequential color scales.
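The single diverging scale described above is easy to sketch in code. The following Python function is only an illustration of the idea; the specific colors and the ±3% endpoints are our own choices for the sketch, not values taken from the original graphic:

```python
def diverging_color(pct_change, vmin=-3.0, vmax=3.0):
    """Map an annual % change to an (r, g, b) tuple on a diverging scale.

    Decreases shade toward red, values near zero stay gray, and increases
    shade toward blue -- one scale for the entire treemap instead of three
    separate sequential scales.
    """
    gray = (0.75, 0.75, 0.75)
    red = (0.80, 0.10, 0.10)
    blue = (0.10, 0.30, 0.80)
    if pct_change < 0:
        t, target = min(pct_change / vmin, 1.0), red
    else:
        t, target = min(pct_change / vmax, 1.0), blue
    # Linear interpolation from neutral gray toward the endpoint color.
    return tuple(g + t * (c - g) for g, c in zip(gray, target))
```

A change of 0% returns pure gray, so the rectangles representing little change recede visually, while values at or beyond the scale's endpoints return fully saturated red or blue.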

There is another problem with the treemap, though it’s not apparent unless you look at the underlying data. The color scale in the treemap shows annual percentage changes ranging from -3% to +3%. However, some of the items in the treemap changed by larger amounts than this. For instance, between 2005 and 2010 the years of life lost per 100,000 people to malaria decreased by 23.80%, which is an annual percentage reduction of 4.76%. This is a great improvement, but this outlier is completely lost when viewing the treemap, which shows malaria as one of the many infectious diseases that decreased annually between -2% and -3%.
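The malaria figure above can be checked with a few lines of Python. Note that the 4.76% comes from simple division of the total change by five years; a compound annualization, which treats the change as multiplicative from year to year, gives a slightly larger rate:

```python
total_decline = 0.2380  # 23.80% drop in years of life lost per 100,000, 2005-2010

# Simple annualization: divide the total change by the number of years.
# This reproduces the 4.76% figure quoted above.
simple_annual = total_decline / 5

# Compound annualization: the rate that, applied five times, yields the total.
compound_annual = 1 - (1 - total_decline) ** (1 / 5)

print(f"simple:   {simple_annual:.2%}")    # 4.76%
print(f"compound: {compound_annual:.2%}")  # 5.29%
```

Either way of annualizing it, the decline far exceeds the -3% limit of the treemap's color scale, so the outlier is invisible in the original display.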

The information that appears in the treemap can be easily shown in two side-by-side bar graphs in a way that tells the story clearly and accurately and is just as visually engaging without resorting to gimmickry. In fact, by using a third variable to display information about the death rate for each cause, instead of solely showing the information in terms of years of life lost, the story can be enriched to give a clearer picture of the world. Here is our redesign:

The bar graph on the left shows the years of life lost per 100,000 people in 2010 for each cause, which is the information encoded by the areas of the rectangles in the original treemap. The bars have been ranked and color coded to make it easy to compare causes of death. The years of life lost to each cause as percentages of the whole are also shown in the column of text, just to the left of the bars.

The bar graph in the center shows the percentage change between the years of life lost per 100,000 people in 2005 and 2010 for each of the causes. Unlike the original graph, we’re showing the total percentage change between those years, rather than an annualized version.

The bar graph on the right displays information that’s not shown in the original treemap: the death rate per 100,000 people for each cause. The fact that this information can be viewed together with the years of life lost information is useful and we’ll examine it in more detail a little later.

You might notice that our bar graphs include fewer items than the original treemap. The original treemap contains a little over 100 rectangles, many of which are unlabeled. We had access to the original dataset, so we could have made bar graphs that included items for each individual disease, but including dozens of tiny bars would have undermined the core story, so we aggregated the data into useful categories. For instance, we aggregated all of the different types of cancer into a single “Cancer” bar and all of the different types of heart disease into a single “Heart disease” bar. Also, for items that contributed less than 1% of total deaths, if they couldn’t already be aggregated into an obvious category such as cancer, we moved them into an “Other” category. For instance, deaths from diphtheria are included in the “Other communicable diseases (including meningitis and hepatitis)” bar. In cases where access to these lower-level details is important, a table containing all individual causes of death could be included to provide this information.

Notice how much easier it is to interpret the values represented by the bars than it was to decode the rectangle sizes and color intensities in the treemap. The fact that fewer years of life are being lost to communicable, maternal, neonatal, and nutritional disorders, represented by the gray bars, is immediately obvious, because all of the gray bars show decreases (negative values) in the center graph.

By placing the years of life lost rate and the death rate for each cause in close proximity to one another, it’s easy to find discrepancies between their patterns, which can be informative. For instance, most of the gray bars have relatively short death rate bars compared to the bars that represent years of life lost. This is because many of the gray bars represent diseases or issues that tend to kill children, so each death results in many years of life lost. For instance, on average, each death from malaria robs someone of 67.2 years of estimated life. Conversely, the three largest brown bars, “Cancer,” “Heart disease,” and “Stroke,” all represent causes that tend to kill older people, so each death has a relatively lower impact on years of life lost. For instance, each death from heart disease, on average, is responsible for an estimated 17.5 years of life lost.
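The years-per-death figures above fall out of a simple ratio of the two displayed rates, since both are expressed per 100,000 people. A quick sketch in Python (the input rates below are hypothetical, chosen only so the ratios match the averages quoted above; they are not the actual IHME figures):

```python
def avg_years_lost_per_death(yll_per_100k, deaths_per_100k):
    """Average years of life lost per death; the per-100,000 factor cancels."""
    return yll_per_100k / deaths_per_100k

# Hypothetical rates for illustration only -- not the actual IHME figures.
print(avg_years_lost_per_death(6720.0, 100.0))  # malaria-like cause: 67.2
print(avg_years_lost_per_death(3500.0, 200.0))  # heart-disease-like cause: 17.5
```

A large ratio flags a cause that disproportionately kills the young, which is exactly the discrepancy the side-by-side bars make visible.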

By using bar graphs, we’ve made it easier to interpret and compare the data, so that it’s easy to focus on the stories contained in the data, rather than struggling to decode an inappropriate and ineffectively designed display.

-Bryan

Driving and Walking: Time Enough for Thinking

January 2nd, 2014

I was first introduced to science fiction literature as a young man in undergraduate school. The first science fiction novel that I read was Stranger in a Strange Land by Robert Heinlein. This book was not typical of the genre in many respects—some called it the Bible of the ’60s counterculture movement—but I loved it and eventually came to love science fiction in most of its sub-genres. Heinlein became one of my early favorites among science fiction authors. It is the memory of his novel Time Enough for Love that links my love of the genre to the topic of this blog article. The protagonist of the novel, Lazarus Long, who possesses genetics for longevity, was born on Earth in the 20th century and then lived for hundreds of years on many planets as a space traveler. The gift of a long life gave him time enough for love. The moral of the tale is simple: some things take time. Unlike Lazarus Long, none of us have been granted lives that will span many centuries, so what we learn to value in life must come, not through an extension of time, but through a better use of the time that we’re allotted.

Much of what fulfills me, both personally and professionally, exists in the realm of ideas. Thinking and what it produces are important to me, and as such, it deserves time. Some of my best thinking takes place while I’m driving long distances or taking long walks. Why these occasions in particular? Because they provide extended periods of time of activities that demand little attention, so my mind can concentrate on thinking. As I drive or walk, after a few minutes my mind has time enough for thinking. Insights in the form of connections, useful metaphors, changes in the ways ideas are organized, or simpler explanations of complex concepts rise to the surface of my consciousness long enough to be noticed and examined. I sometimes record these thoughts on my phone’s digital recorder app or I pull off to the side of the road and write them down.

There are many different ways to clear the stage of our busy lives for thoughts to emerge from the wings of consciousness and dance for us. Contemplation can be enabled in various ways, but the time and attention that it requires don’t occur naturally in our modern lives. In a time of perpetual connection to computer-mediated data, we must create an oasis of opportunity for thinking by occasionally disconnecting. This takes conscious effort. The futurist Alex Soojung-Kim Pang refers to this practice as contemplative computing in his book The Distraction Addiction.

Like me, Pang is a technologist who realizes its limitations and its potential for both good and ill. He promotes relationships with technologies that are thoughtfully constrained and suggests specific techniques for carving out time and space in the midst of a noisy world for contemplation. Here’s a sample of his advice:

The ability to pay attention, to control the content of your consciousness, is critical to a good life. This explains why perpetual distraction is such a big problem. When you’re constantly interrupted by external things—the phone, texts, people with “just one quick question,” clients, children—by self-generated interruptions, or by your own efforts to multitask and juggle several tasks at once, the chronic distractions erode your sense of having control in your life. They don’t just derail your train of thought. They make you lose yourself.

He not only teaches ways to reduce distractions caused by technologies, but also how to use technologies in ways that enable greater focus on the tasks that they’re meant to support. Much of what he suggests falls into the category of uncommon common sense: ideas that seem obvious and sensible once they’re pointed out, but are seldom noticed until they are. If you’re looking for help in making time enough for thinking—something our world desperately needs and increasingly lacks—I suggest that you read The Distraction Addiction by Alex Soojung-Kim Pang.

Take care,

The Scourge of Unnecessary Complexity

December 16th, 2013

One of the mottos of my work is “eloquence through simplicity”: eloquence of communication through simplicity of design. Simple should not be confused with simplistic (overly simplified). Simplicity’s goal is to find the simplest way to represent something, stripping away all that isn’t essential and expressing what’s left in the clearest possible way. It is the happy medium between too much and too little.

While I professionally strive for simplicity in data visualization, I care about it in all aspects of life. Our world is overly complicated by unnecessary and poorly expressed information and choices, and the problem is getting worse in our so-called age of Big Data. Throughout history great thinkers have campaigned for simplicity. Steve Jobs was fond of quoting Leonardo da Vinci: “Simplicity is the ultimate sophistication.” Never has the need for such a campaign been greater than today.

A new book, Simple: Conquering the Crisis of Complexity, by Alan Siegel and Irene Etzkorn, lives up to its title by providing a simple overview of the need for simplicity, examples of simplifications that have already enriched our lives (e.g., the 1040EZ single-page tax form that the authors worked with the IRS to design), and suggestions for what we can all do to simplify the world. This is a wonderful book, filled with information that’s desperately needed.

In 1980, a typical credit card contract was about a page and a half long. Today, it is 31 pages. Have you ever actually read one? Do you know what you’ve agreed to? Or more directly related to the focus of this blog, you probably waste a great deal of time each day wading through endless choices in software to find the few things that you actually need. For example, how many different types of graphs do you need to build an effective dashboard? Rather than the libraries of a hundred graphs or more that you’ll find in most products, I advocate a library of eight that satisfies almost every case. Software vendors fill their products with more and more features, whether they’re useful or not. Why? Because those long lists of features appeal to buyers. The designer John Maeda, however, observes what I’ve also found to be true: “At the point of desire, you want more, but at the point of daily use, you want less.”

That incomprehensible 31-page credit card contract can be distilled down to a page of clear information. The authors of Simple have done it. We could be spending more of our lives actually living and less time navigating endless confusion.

Not everyone wants simplicity, however. Some organizations and people thrive on confusion and use it to take advantage of others. We shouldn’t put up with this, but we do because we don’t think it’s possible to fix. It’s not impossible, but it won’t be easy.

We can’t all become crusaders for simplicity, taking on the challenges of improving government, the legal system, banking, and healthcare, but we can all do our part to simplify our own lives and organizations. To find out why this matters and how it can be done, read Simple. “Any fool can make things bigger, more complex, and more violent. It takes a touch of genius—and a lot of courage—to move in the opposite direction.” (E. F. Schumacher)

Take care,

To Err is Academic

November 20th, 2013

Errors in scientific research are all too common, and the problem has been getting worse. We’ve been led to believe that the methods of science are self-correcting, which is true, but only if they’re understood and followed, which is seldom the case. Ignorance of robust scientific methodology varies among disciplines, but it’s hard to imagine that any discipline can do worse than the errors that I’ve encountered in the field of information visualization.

An alarming article, “Trouble at the Lab,” in the October 19, 2013 edition of The Economist provides keen insight into the breadth, depth, and causes of this problem in academic research as a whole.

Academic scientists readily acknowledge that they often get things wrong. But they also hold fast to the idea that these errors get corrected over time as other scientists try to take the work further. Evidence that many more dodgy results are published than are subsequently corrected or withdrawn calls that much-vaunted capacity for self-correction into question. There are errors in a lot more of the scientific papers being published, written about and acted on than anyone would normally suppose, or like to think.

Various factors contribute to the problem. Statistical mistakes are widespread. The peer reviewers who evaluate papers before journals commit to publishing them are much worse at spotting mistakes than they or others appreciate. Professional pressure, competition and ambition push scientists to publish more quickly than would be wise. A career structure which lays great stress on publishing copious papers exacerbates all these problems. “There is no cost to getting things wrong,” says Brian Nosek, a psychologist at the University of Virginia who has taken an interest in his discipline’s persistent errors. “The cost is not getting them published.”

Graduate students are strongly encouraged by professors to get published, in part because the professor’s name will appear on the published study, even if they’ve contributed little, and professors don’t remain employed without long and continually growing lists of publications. In the field of information visualization, most of the students who do these studies have never been trained in research methodology, and it appears that most of their professors have skipped this training as well. It might surprise you to hear that most of these students and many of the professors also lack training in the fundamental principles and practices of information visualization, which leads to naïve mistakes. This is because most information visualization programs reside in computer science departments, and most of what’s done in computer science regarding information visualization, however useful, does not qualify as scientific research and does not involve scientific methods. There are exceptions, of course, but overall the current state of information visualization research is dismal.

The peer review system is not working. Most reviewers aren’t qualified to spot the flaws that typically plague information visualization research papers. Those who are qualified are often unwilling to expose errors because they want to be liked, and definitely don’t want to set themselves up as a target for a tit-for-tat response against their own work. On several occasions when I’ve written negative reviews of published papers, friends of mine in the academic community have written to thank me privately, but have never been willing to air their concerns publicly—not once. Without a culture of constructive critique, bad research will continue to dominate our field.

Papers with fundamental flaws often live on. Some may develop a bad reputation among those in the know, who will warn colleagues. But to outsiders they will appear part of the scientific canon.

Some of the worst information visualization papers published in the last few years have become some of the most cited. If you say something (or cite something) often enough, it becomes truth. We’ve all heard how people only use 10% of their brains. This is common knowledge, but it is pure drivel. Once the media latched onto this absurd notion, the voices of concerned neuroscientists couldn’t cut through the confusion.

How do we fix this? Here are a few suggestions:

  1. Researchers must be trained in scientific research methods. This goes for their professors as well. Central to scientific method is a diligent attempt to disprove one’s hypotheses. Skepticism of this type is rarely practiced in information visualization research.
  2. Researchers must be trained in statistics. Learning to get their software to spit out a p-value is not enough. Learning what a p-value means and when it should be used is more important than learning to produce one.
  3. Rigid standards must be established and enforced for publication. The respected scientific journal Nature has recently established an 18-point guideline for authors. Most of the guidelines that exist for information visualization papers are meager and in many cases counter-productive. For example, giving high scores for innovation encourages researchers to prioritize novelty over usefulness and effectiveness.
  4. Peer reviewers must be carefully vetted to confirm that they possess the required expertise.
  5. Rigid guidelines must be established for the peer review process.
  6. Peer review should not be done anonymously. I no longer review papers for most publications because they require reviewers to remain anonymous, which I refuse to do. No one whose judgment affects the work of others should be allowed to remain anonymous. Also, anyone who accepts poorly done research for publication should be held responsible for that flawed judgment.
  7. Researchers should be encouraged to publish their work even when it fails to establish what they expected. The only failure in research is research done poorly. Findings that conflict with expectations are still valuable findings. Even poorly done research is valuable if the authors admit their mistakes and learn from them.
  8. Researchers should be encouraged to replicate the studies of others. Even in the “hard sciences,” most published research cannot be successfully replicated. One of the primary self-correcting practices of science is replication. How many information visualization papers that attempt to replicate research done by others have you seen? I’ve seen none.

I’m sure that other suggestions belong on this list, but these are the ones that come to mind immediately. Many leaders in the information visualization community have for years discussed the question, “Is data visualization science?” My position is that it could be and it should be, but it won’t be until we begin to enforce scientific standards. It isn’t easy to whip a sloppy, fledgling discipline into shape, and you won’t win a popularity contest by trying, but the potential of information visualization is too great to waste.

Take care,

A Template for Creating Cycle Plots in Excel

October 23rd, 2013

This blog entry was written by Bryan Pierce of Perceptual Edge.

A cycle plot is a type of line graph that is useful for displaying cyclical patterns across time. Cycle plots were first created in 1978 by William Cleveland and his colleagues at Bell Labs. We published an article about them in 2008, written by Naomi Robbins, titled Introduction to Cycle Plots. Here is an example of a cycle plot that displays monthly patterns across five years:


In this cycle plot, the gray lines each represent the values for a particular month across the five-year period from 2009 through 2013. For instance, the first gray line from the left represents January. Looking at it we can see that little changed in January between 2009 and 2010, then values dipped in 2011 and then increased again in 2012 and 2013. The overlapping horizontal blue line represents the mean for the five years of January values.

The strength of the cycle plot is that it allows us to see a cyclical pattern (in this case the pattern formed by the means across the months of a year) and how the individual values from which that pattern was derived have changed during the entire period. For instance, by comparing the blue horizontal lines, we can see that June is the month with the second highest values on average, following December. We can also see that the values steadily trended upwards from January through June before dropping off in July. This much we could also see by looking at a line graph of the monthly means. However, using the cycle plot, we can also see how the values for individual months have changed across the years by looking at the gray lines. If you look at the gray line for June, you can see that we’ve had a steady decline from one June to the next across all five years, to the point that the values for May have surpassed the values for June in the last two years. Unless something changes, this steady decline could mean that June will no longer have the second highest average in the future. The decline in June is not something that we could easily spot if we were looking at this data in another way.

Despite their usefulness, one of the reasons I think we don’t see cycle plots more often is that they’re not supported directly by Excel. They can be made in Excel, but it’s a nuisance. To help with this problem, we’ve put together an Excel template for creating cycle plots using a method that we learned about from Ulrik Willemoes, who attended one of Stephen’s public workshops. It contains a cycle plot for displaying months and years, as shown above, and also a cycle plot for displaying days and weeks. All you need to do is plug in your own data and make some minor changes if you want to display a different number of years or weeks. Step-by-step instructions are included in the Excel file. Enjoy!
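For readers who work in Python rather than Excel, the data reshaping behind a cycle plot is straightforward. This sketch uses synthetic data (not the values plotted above) to show how a flat chronological series of monthly values regroups into the per-month sub-series and means that a cycle plot draws:

```python
from statistics import mean

# Synthetic example: five years (2009-2013) of monthly values in
# chronological order -- not the data behind the chart above.
values = [100 + 5 * month + 2 * year_idx
          for year_idx in range(5)
          for month in range(12)]

# Regroup into one sub-series per month: cycle[0] is every January value
# (drawn as the first gray line), cycle[1] every February, and so on.
cycle = [values[m::12] for m in range(12)]

# The horizontal blue line overlapping each gray line is that month's
# five-year mean.
monthly_means = [mean(series) for series in cycle]

print(cycle[0])          # [100, 102, 104, 106, 108]
print(monthly_means[0])  # 104
```

Once the data is in this shape, each `cycle[m]` can be plotted as a short line segment in its own horizontal slot, with `monthly_means[m]` overlaid as a flat reference line—the same structure the Excel template builds.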

-Bryan