Visual Business Intelligence

New Edition of “Now You See It”

Stephen Few — Mon, 22 Feb 2021 16:48:35 +0000

On April 15, 2021, my book Now You See It (2009) will become available in its second edition with the revised subtitle An Introduction to Visual Data Sensemaking.

Now You See It: An Introduction to Visual Data Sensemaking

This is more than a mere update. Essentially, this new edition combines the contents of the first edition with the contents of my book Signal: Understanding What Matters in a World of Noise. I wrote Signal in 2015 to complement Now You See It by covering more advanced data sensemaking techniques, including Statistical Process Control. And, in case you’re concerned that this new edition will be huge and heavy enough to serve as a doorstop, you’ll be pleased to hear that I’ve combined and refined the best of the two books into a single publication that is roughly the same size as the original version of Now You See It. This new edition will make the process of learning visual data sensemaking skills more fluid, efficient, and comprehensive.

Before you can present information to others, you must first make sense of it. Now You See It teaches the concepts, principles, and practices of visual data sensemaking. The skills taught in this book rely primarily on something that most of us possess—vision—interactively using graphs to find and examine the meaningful patterns and relationships that reside in quantitative data.

Although some questions about quantitative data can only be answered using sophisticated statistical techniques, most can be answered using relatively simple visual data sensemaking skills. No other book teaches these basic skills as comprehensively and in a way that is accessible to a broad audience. Even though these skills can be developed by anyone with eyes to see, they are not intuitive—they must be learned. Without these skills, even the best data visualization tools are of little use, and data will remain nothing but noise.

To Tell the Story Clearly, Omit Nothing Significant

Stephen Few — Fri, 07 Aug 2020 15:11:43 +0000

We’re in the midst of a worldwide COVID-19 pandemic. Our understanding of this novel pandemic and our efforts to combat it are determined in large part by the information that we consider. It’s critically important that information in news stories is presented clearly and accurately. Unfortunately, sources that we rely on for the news, including ordinarily reliable sources, sometimes present COVID-19 data in misleading ways. This is sometimes done by omitting relevant data. Even one of my favorite news sources, NPR, was recently guilty of this. The charts that were included in an NPR article titled “Charts: How the U.S. Ranks On COVID-19 Deaths Per Capita – And By Case Count,” published on August 5, 2020, by Jessica Craig, illustrate this.

The article is a response to the recent Axios interview with Donald Trump by Jonathan Swan, which aired on August 3, 2020 on HBO. In particular, the article addresses the exchange that occurred during that interview about COVID-19 deaths in the United States. Trump suggested that we were doing better than any other country, which is not the case by any measure of COVID-19-related mortality. When Swan countered that we’re doing much worse than many countries, Trump handed him a chart that apparently referred to “case fatality ratio,” which is the ratio of deaths per infections. Swan then explained that he was referring to deaths in proportion to the population (i.e., deaths per capita), to which Trump responded, “You can’t do that.” You certainly can do that. There are several ways in which COVID-19 deaths per country can be compared, and per capita deaths is one of the most useful. Case fatality rate is useful as well. Both measures contribute to our understanding when they’re presented clearly.
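The difference between the two measures discussed above can be sketched in a few lines. This is only an illustration: the function names and the two countries, "A" and "B," are hypothetical, and the numbers are invented to show that the same death count can look quite different depending on the denominator.

```python
def case_fatality_ratio(deaths: int, cases: int) -> float:
    """Deaths divided by reported cases."""
    return deaths / cases


def deaths_per_million(deaths: int, population: int) -> float:
    """Deaths in proportion to the population, per 1 million people."""
    return deaths * 1_000_000 / population


# Hypothetical figures, invented purely for illustration.
countries = {
    "A": {"deaths": 1_000, "cases": 50_000, "population": 1_000_000},
    "B": {"deaths": 1_000, "cases": 20_000, "population": 10_000_000},
}

for name, c in countries.items():
    cfr = case_fatality_ratio(c["deaths"], c["cases"])
    per_capita = deaths_per_million(c["deaths"], c["population"])
    print(f"{name}: case fatality ratio {cfr:.1%}, deaths per million {per_capita:,.0f}")
```

In this made-up example, country A has the lower case fatality ratio but the higher per capita death rate, which is why neither measure alone tells the whole story.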

My primary concern with the NPR article has to do with the two charts that appeared in it: one regarding deaths per capita and one regarding case fatality ratios, both of which compared the U.S. to other countries with “50,000 or more reported cases.” Let’s begin with the chart that shows per capita deaths:

It appears that only two countries—Brazil and France—are doing better than the U.S., but this is hardly the case. As of August 5th, 45 countries reported more than 50,000 COVID-19 cases to date. In other words, 37 out of 44 other countries with 50,000 or more cases are doing better than the U.S., and this doesn’t even count the much larger number of countries with fewer than 50,000 cases. The chart suffers from the “curse of the top 10.” There’s nothing magical, or in this case relevant, about the number 10. Arbitrarily limiting this chart to 10 countries presents a misleading message. At a minimum, this chart should show all 45 countries with more than 50,000 cases. We should be grateful, I suppose, that the designer of this chart did not limit it to 8 countries rather than 10, which would have made the U.S. appear best of all. Now that’s a chart Trump could love. In fact, I wouldn’t be surprised if his staff prepares that exact chart for his next interview.
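The arithmetic behind the “curse of the top 10” can be checked with a short sketch. The count of 45 countries comes from the text above; the U.S. position (eighth worst) is an inference from the statement that only two countries in the 10-country chart appear to be doing better. The function names are hypothetical.

```python
TOTAL_COUNTRIES = 45     # countries with more than 50,000 cases (from the text)
US_RANK_FROM_WORST = 8   # inferred: seven charted countries rank worse than the U.S.


def countries_doing_better(total: int, rank_from_worst: int) -> int:
    """Number of countries with a lower per capita death rate in the full list."""
    return total - rank_from_worst


def better_shown_in_top_n(n: int, rank_from_worst: int) -> int:
    """Number of better-performing countries visible when only the n worst are charted."""
    if rank_from_worst > n:
        raise ValueError("the country would not appear in a chart of the n worst")
    return n - rank_from_worst


print(countries_doing_better(TOTAL_COUNTRIES, US_RANK_FROM_WORST))  # full list of 45
print(better_shown_in_top_n(10, US_RANK_FROM_WORST))                # chart of 10
print(better_shown_in_top_n(8, US_RANK_FROM_WORST))                 # chart of 8
```

The cutoff alone determines whether the U.S. appears near the best or in the company of 37 better-performing countries.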

As an aside, I’ll respond to one more point that was made in the article about deaths per capita:

The per capita death rate is primarily an indication of the overall disease burden in a country, according to Justin Lessler, an associate professor of epidemiology at Johns Hopkins University. (Disease burden is the term used to describe the impact of a particular disease in terms of years of life lost and years lived with disability.)

Both sentences above exhibit problems. The first suggests that the per capita death rate is primarily an indication of the “overall disease burden in a country.” That’s not entirely accurate. It indicates the proportional rather than the overall disease burden. The overall disease burden is better represented by the total case count. The second sentence, which provides a technical definition of “disease burden,” is entirely irrelevant. When defined as the “impact of a particular disease in terms of years of life lost and years lived with disability,” the disease burden cannot be determined by the per capita death rate.

Let’s move on to the second chart:

Wow, the U.S. is looking particularly great in this chart, isn’t it? We’re at the bottom of the chart, which is ordinarily a good place to be when countries are ranked from worst to best, but once again, the story isn’t clearly told without showing all 45 countries. In truth, we’re in the middle of the pack, not the best.

The article did go on to provide useful information by identifying some of the other factors that influence COVID-19 mortality measures when comparing countries, such as age (with COVID-19, because it causes deaths more frequently among older folks, the median age of a country is a factor) and “access to ventilators and ICU care if needed.” Unfortunately, however, it also went on to give Trump some credit for his positive spin on mortality statistics, largely ignoring the fact that Trump’s fundamental claim was a bald-faced lie. By no COVID-19 mortality measure is the U.S. doing better than all other countries, which is what Trump has been saying all along and clearly suggested in the Axios interview.

I suspect that NPR was trying to differentiate itself from most other news outlets by designing charts and including statements that appeared more even-handed in its assessment of Trump’s performance. Whatever the motivation, NPR, ordinarily a reliable news source, failed to tell this story clearly.

Data Visualization Is Data Visualization Is Data Visualization

Stephen Few — Sun, 12 Jul 2020 15:27:01 +0000

The principles and practices of data visualization do not vary from one domain to another. They are the same. Data visualization applied to business differs from data visualization applied to education (or healthcare, or government, or various branches of science, or any other domain you can imagine) only in that each domain has its own data that must be understood before it can be visualized effectively. How the data is visualized, however, does not vary from one domain to another. All domains pull from the same repository of visual representations and, to work effectively, follow the same design principles and practices. While it is certainly true that some data domains might routinely rely more heavily on particular charts than other domains, that difference does not constitute a separate branch of data visualization. If you’ve developed expertise in data visualization while working in finance and you suddenly take a job working in healthcare, you will need to learn about healthcare, but not anything new about data visualization that is unique to that domain.

Over the years, as a data visualization practitioner, author, consultant, and teacher, I’ve applied my skills to many domains. To do this effectively, I had to learn enough about those domains to make sense of the data, but what I then did to visualize the data didn’t vary from one domain to another. From time to time, people who worked in a specific domain asked if I would write a book or teach a course about data visualization for their domain in particular. Would it make sense for me to write a new version of my book Show Me the Numbers that is specific to the needs of education, healthcare, or marketing organizations? The lessons that the book teaches about chart selection and design for communicating quantitative data effectively are illustrated throughout with examples drawn from multiple domains; examples that can be easily understood by everyone. A separate version of the book for each domain isn’t needed. You could certainly argue that marketing professionals might prefer to only see data visualizations that are based on marketing data when learning the skills, but would that provide them with any real benefit compared to familiar examples from multiple domains? I don’t think it would. In fact, using examples from multiple domains reinforces the fact that data visualization applies to all domains equally and in the same manner—the skills are transferable—which is a useful reminder. During the early stages of the learning process, focusing on the concepts and skills of data visualization rather than on the data domain is appropriate, even if you only plan to apply the skills to a single domain.

The first edition of my book Show Me the Numbers almost exclusively featured business examples. I chose to do this initially because the business examples that I created (e.g., graphs that featured revenues or expenses) were easy for any reader to understand. As a consequence, however, every once in a while someone would describe Show Me the Numbers as “data visualization for business,” which drove me nuts, because it artificially and unnecessarily limited the book’s audience. For this reason, when I wrote the second edition, I was careful to mix in examples drawn from multiple domains.

As a data visualization professional, it is perfectly reasonable for you to focus on a particular data domain if you wish because increasing your expertise in that domain will make you a better visualizer of its data. Just bear in mind that your visualization skills in particular, as opposed to your data domain expertise, are entirely transferable. When I first started teaching public data visualization workshops many years ago, I quickly observed that classrooms filled with people from various domains, rather than workshops that I taught privately for individual organizations, offered a real advantage to my students. Sharing experiences, discussing the material, working together in exercises, and even commiserating about the challenges that they faced when visualizing data were richer in diverse groups drawn from various domains.

Data visualization is data visualization is data visualization. If you learn the skills well, you can apply them broadly.

Comparing COVID-19 Mortality Rates Over Time By Country

Stephen Few — Mon, 13 Apr 2020 18:34:54 +0000

As COVID-19 spreads its deadly effects around the world, many data analysts are struggling to track these effects in useful ways. Some attempts work better than others, however. Comparing these effects among various countries is particularly challenging. Some attempts that I’ve seen are confusing and difficult to read, even for statisticians. Here’s an example that was brought to my attention recently by a statistician who found it less than ideal:

I believe that the objectives of displays like this can be achieved in simpler, more accessible ways.

Before proposing an approach that works better, let’s acknowledge that country comparisons of deaths from COVID-19 are fraught with data problems that will never be remedied by any form of display. Even here in the United States, many deaths due to COVID-19 are never recorded. If someone with COVID-19 suffers from pneumonia as a result and then dies, what gets recorded as the cause on the death certificate: COVID-19 or pneumonia? Clear procedures aren’t currently in place. Medical personnel are focused on saving lives more than recording data in a particular way, which is understandable. This problem is no doubt occurring in every country. The integrity of the data from country to country differs to a significant degree and does so for many reasons. It’s important to recognize whenever we display this data that country comparisons will never be entirely reliable. Nevertheless, working with the best data that’s available, we must do what we can to make sense of it.

If we want to compare the number of deaths due to COVID-19 per country, both in terms of magnitudes and patterns of change over time, the following design choices seem appropriate:

Assuming that we want to understand the proportional impact on countries, use a ratio such as the number of deaths per 1 million people rather than the raw number of deaths, to adjust for population differences.
Aggregate the data to weekly values to eliminate the noise of day-to-day variation.
Use rolling time (i.e., week 1 consists of days 1 through 7, week 2 consists of days 8 through 14, etc.) rather than calendar time, beginning with the date on which the first death occurred in each country.
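The three design choices above can be sketched as a small function. This is a minimal illustration rather than the method actually used to build the graph: the function name is hypothetical, the input is assumed to be a list of daily death counts in date order, and only complete 7-day weeks are kept.

```python
from typing import List


def weekly_deaths_per_million(daily_deaths: List[int], population: int) -> List[float]:
    """Aggregate daily deaths into rolling weeks that begin on the day of the
    first recorded death, expressed per 1 million people."""
    # Rolling time: find the first day on which a death was recorded.
    first = next((i for i, d in enumerate(daily_deaths) if d > 0), None)
    if first is None:
        return []  # no deaths recorded yet
    since_first = daily_deaths[first:]

    weeks = []
    # Weekly aggregation: days 1-7 form week 1, days 8-14 form week 2, etc.
    # A trailing partial week is dropped.
    for start in range(0, len(since_first) - 6, 7):
        total = sum(since_first[start:start + 7])
        # Ratio: adjust for population differences.
        weeks.append(total * 1_000_000 / population)
    return weeks


# Hypothetical daily counts for a country of 1 million people.
daily = [0, 0] + [1] * 7 + [2] * 7 + [5]
print(weekly_deaths_per_million(daily, 1_000_000))
```

Dropping the incomplete final week avoids showing an artificial decline at the end of each line.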

The following line graph exhibits these design choices. To keep things simple for the purpose of illustrating this approach, I’ve included four countries only: the U.S., China, Italy, and Canada. Also, for the sake of convenience, I’ve relied on the most readily available data that I could find, which comes from www.ourworldindata.org.

Most people in the general public could make sense of this graph with only a little explanation. It’s important to recognize, however, that no single graph can represent the data in all the ways that are needed to make sense of the situation. Perhaps the biggest problem with this graph is that the number of weekly deaths per 1 million people per country varies so much in magnitude, ranging from over 90 at the high end in Italy to less than 1 at its peak in China. As a result, the blue line representing China appears almost flat as it hugs the bottom of the graph, which makes its pattern of change unreadable. Assuming that the number of deaths in China is accurate (not a valid assumption for any country), this tells us that COVID-19 has had relatively little effect on China overall. The immensity of China in both population and geographical space is reflected in this low mortality rate. The picture would look much different if we considered the city of Wuhan alone.

Obviously, if we want to compare the patterns of change among these countries more easily, regardless of magnitude, we must solve this scaling problem. Some data analysts attempt to do this by using a logarithmic scale, but this isn’t appropriate for the general public because few people understand logarithmic scales and their effects on data. Another approach is to complement the graph above with a series of separate graphs, one per country, that have been independently scaled to more clearly feature the patterns of change. Here’s the same graph above, complemented in this manner:

With this combination of graphs, there is now more that we can see. For instance, the pattern of change in China is now clearly represented. Notice how similar the patterns in China and Italy are. From weeks 1 through 7, which is all that’s reflected in Italy so far, the patterns are almost identical. Will their trajectories continue to match as time goes on? Time will tell. Notice also the subtle differences in the patterns of change in the U.S. versus Canada. In the beginning, mortality increased in Canada at a faster rate but started to decrease from the fourth to fifth week while the pattern in the U.S. does not yet exhibit a decrease as of the sixth week. Will mortality in the U.S. exhibit a decline by week 7 similar to China and Italy? When another complete week’s worth of data is added to the U.S. graph, we’ll be able to tell.

Clearly, there are many valid and useful ways to display this data. I propose this simple set of graphs as one of them.

Display New Daily Cases of COVID-19 with Care

Stephen Few — Thu, 09 Apr 2020 17:06:28 +0000

Statistics are playing a major role during the COVID-19 pandemic. The ways that we collect, analyze, and report them greatly influence the degree to which they inform a meaningful response. An article in the Investor’s Business Daily titled “Dow Jones Futures Jump As Virus Cases Slow; Why This Stock Market Rally Is More Dangerous Than The Coronavirus Market Crash” (April 6, 2020, by Ed Carson) brought this concern to mind when I read the following table of numbers and the accompanying commentary:

U.S. coronavirus cases jumped 25,316 on Sunday [April 5th] to 336,673, with new cases declining from Saturday’s record 34,196. It was the first drop since March 21.

The purpose of the Investor’s Business Daily article was to examine how the pandemic was affecting the stock market. After the decline in the number of reported new COVID-19 cases on Sunday, April 5th, on Monday, April 6, 2020, the stock market surged (Dow Jones gained 1,627.46 points, or 7.73%). This was perhaps a response to hope that the pandemic was easing. This brings a question to mind. Can we trust this apparent decline as a sign that the pandemic has turned the corner in the United States? I wish we could, but we dare not, for several reasons. The purpose of this blog post is not to critique the news article and certainly not to point out the inappropriateness of this data’s effects on the stock market, but merely to argue that we should not read too much into the daily ups and downs of newly reported COVID-19 case counts.

How accurate should we consider daily new case counts based on the date when those counts are recorded? Not at all accurate and of limited relevance. I’ll explain, but first let me show you the data displayed graphically. Because the article did not identify its data source, I chose to base the graph below on official CDC data, so the numbers are a little different. I also chose to begin the period with March 1st rather than 2nd, which seems more natural.

What feature most catches your eye? For most of us, I suspect, it is the steep increase in new cases on April 3rd, followed by a seemingly significant decline on April 4th and 5th.

A seemingly significant rise or fall in new cases on any single day, however, is not a clear sign that something significant has occurred. Most day-to-day volatility in reported new case counts is noise—it’s influenced by several factors other than actual new infections that developed. There is a great deal of difference between the actual number of new infections and the number of new infections that were reported as well as a significant difference between the date on which infections began and the date on which they were reported. We currently have no means to count the number of infections that occurred, and even if we tested everyone for the virus’s antibodies at some point, we would still have no way of knowing the date on which those infections began. Reported new COVID-19 cases is a proxy for the measure that concerns us.

Given the fact that reported new cases is probably the best proxy that’s currently available to us, we could remove much of the noise related to the specific date on which infections began by expressing new case counts as a moving average. A moving average would provide us with a better overview of the pandemic’s trajectory. Here’s the same data as above, this time expressed as a 5-day moving average. With a 5-day moving average, the new case count for any particular day is averaged along with the counts for the four preceding days (i.e., five days’ worth of new case counts are averaged together), which smooths away most of the daily volatility.
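A trailing moving average of this kind is simple to compute. The sketch below is one way to do it, with a hypothetical function name and made-up daily counts; the first four days are omitted because they lack four preceding days to average with.

```python
from typing import List, Sequence


def trailing_moving_average(values: Sequence[float], window: int = 5) -> List[float]:
    """Average each day's value with the values for the (window - 1) preceding days.

    The result begins on day `window`, the first day with a full window."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical daily new case counts, for illustration only.
daily_new_cases = [10, 20, 30, 40, 50, 60]
print(trailing_moving_average(daily_new_cases))
```

A longer window smooths more aggressively but also delays the appearance of a genuine change in direction, so the window length is a judgment call.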

While it still looks as if the new case count is beginning to increase at a lesser rate near the end of this period, this trend no longer appears as dramatic.

Daily volatility in reported new case counts is caused by many factors. We know that the number of new cases that are reported on any particular day does not accurately reflect the number of new infections. It’s likely that most people who have been infected have never been tested. Two prominent reasons for this are 1) the fact that most cases are mild to moderate and therefore never involve medical intervention, and 2) the fact that many people who would like to be tested cannot because tests are still not readily available. Of those who are tested and found to have the virus, not all of those cases are recorded or, if recorded, are forwarded to an official national database. And finally, of those new cases that are recorded and do make it into an official national database, the dates on which they are recorded are not the dates on which the infections actually occurred. Several factors determine the specific day on which cases are recorded, including the following:

When the patient chooses or is able to visit a medical facility.
The availability of medical staff to collect the sample. Staff might not be available on particular days.
The availability of lab staff to perform the test. The sample might sit in a queue for days.
The speed at which the test can be completed. Some tests can be completed in a single day and some take several days.
When medical staff has the time to record the case.
When medical staff gets around to forwarding the new case record to an official national database.

There’s a lot that must come together for a new case to be counted and to be counted on a particular day. As the pandemic continues, this challenge will likely increase because, as medical professionals become increasingly overtaxed, both delays in testing and errors in reporting the results will no doubt increase to a corresponding degree.

Now, back to my warning that we shouldn’t read too much into daily case counts as events are unfolding. Here are the same daily values as before with one additional day, April 6th, included at the end.

Now what catches your eye? It’s different, isn’t it? As it turns out, by waiting one day we can see that reported new cases did not peak on April 3rd followed by a clear turnaround. New cases are still on the rise. Here’s the same data expressed as a 5-day moving average:

The trajectory is still heading upwards at the end of this period. We can all hope that expert projections that the curve will flatten out in the next few days will come to pass, but we should not draw that conclusion from the newly reported case count for any particular day. The statistical models that we’re using are just educated guesses based on approximate data. The true trajectory of this pandemic will only be known in retrospect, if ever, not in advance. Patience in interpreting the data will be rewarded with greater understanding, and ultimately, that will serve our needs better than hasty conclusions.

Ordinal Malpractice

Stephen Few — Thu, 19 Mar 2020 15:49:04 +0000

We love to put things in order. “Which college is best, second best, third best, etc., and how can I get my kid into one near the top of the list?” “I love God, Mom, America, and apple pie, in that order.” “Formal education consists of elementary school, middle school, high school, undergraduate school, and finally graduate school, if you’re lucky.” “Our best salesperson is John, second best is Mary, Sally is third, and poor Harold is at the bottom of the list.” We sometimes forget, however, that when we sequence things, even when that sequence is based on a quantitative measure (e.g., salespeople ranked by sales revenues), the order itself is merely ordinal, not quantitative. The company’s top salesperson, John, might be mediocre at best, and the second-best salesperson, Mary, might sell but a smidgen less than John or perhaps only half as much. A #2 ranking merely reveals that Mary sells less than John, not how much less.

An ordered list that appears along the axis of a graph, such as the ranked list of salespeople below, is called an ordinal scale.

An interval scale, like the one below, is quite different.

An interval scale subdivides a continuous range of quantitative values into equal intervals, in this case a range extending from 0 to 500, subdivided into intervals of 100 each. One of the most common interval scales that we use in quantitative data analysis involves ranges of time (e.g., from years 1950 through 2020) subdivided into equal periods (e.g., the 1950s, 1960s, etc.). Interval scales, by definition, are quantitative in nature; ordinal scales are not. In general, all we can say about ordinal scales is that the items have a meaningful order, nothing more. Along an interval scale, quantitative distances between adjacent items are always equal, but along an ordinal scale, distances between items typically vary.

A Likert scale is an example of an ordinal scale that is often used in social science research and surveys. Likert scales allow people to respond to questions, such as “How often do you drink more than a single serving of an alcoholic beverage in a day?”, by selecting from an ordered list such as the following:

Notice that the items have a meaningful order, but the scale itself is not quantitative. The difference in the frequency of occurrence between “Never” and “Seldom” is not necessarily the same as the difference between “Seldom” and “Occasionally.” Item #5 (“Always”) is not five times greater than item #1 (“Never”). Even though items in ordinal scales are often labeled with numbers (e.g., “1 – Never” and “5 – Always”), the numbers only indicate a sequence (item #1, item #2, etc.), not quantities.

Ordinal scales often provide meaningful and useful ways to arrange items in a list. I often arrange the values that appear in a graph in order from low to high or high to low because it is easier to compare values when those that are close to one another in magnitude are near one another in the graph. You can see this by comparing the two graphs below: one ordered by SAT scores from high to low and the other arranged alphabetically by student names.

As I’ve already mentioned, when values are ranked, the ranking itself is not and shouldn’t be treated as quantitative. Shouldn’t, but often is.

Here are three examples of salespeople ranked by sales revenues, displayed graphically. Notice how differently the salespeople vary in sales performance among these three graphs even though the rankings are the same.

For most purposes, it is the sales revenues themselves in these examples that are important. They deserve far more attention than the rankings.
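To make the point concrete, here is a small sketch in which two sets of sales revenues produce the same ranking despite very different magnitudes. The salespeople’s names come from the example earlier in this post; the revenue figures and the function name are hypothetical.

```python
from typing import Dict, List


def rank_by_value(values: Dict[str, float]) -> List[str]:
    """Return names ordered from highest to lowest value."""
    return sorted(values, key=values.get, reverse=True)


# Hypothetical revenues (in thousands), invented for illustration.
nearly_equal = {"John": 500, "Mary": 495, "Sally": 490, "Harold": 485}
steep_decline = {"John": 500, "Mary": 250, "Sally": 125, "Harold": 60}

for label, revenues in [("nearly equal", nearly_equal), ("steep decline", steep_decline)]:
    ranking = rank_by_value(revenues)
    mary_vs_john = revenues["Mary"] / revenues["John"]
    print(f"{label}: ranking {ranking}, Mary sells {mary_vs_john:.0%} of John's revenue")
```

The rankings are identical, yet in one case Mary sells a smidgen less than John and in the other only half as much. The ranking alone cannot distinguish the two.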

Let’s get back to Likert scales for a moment. Assigning numeric values to items in a Likert scale is not appropriate, in my opinion, but it is routinely done. For example, social science research that uses a questionnaire to measure depression among a population, based on the following Likert scale, could simply add up the numbers associated with the items to produce an overall 5-point depression score.

Measuring people’s depression in this manner, however, does not qualify as truly quantitative.
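One way to see the problem is to notice that any numbering that preserves the order of the items is as defensible as any other, yet different numberings can reverse a comparison of summed scores. The sketch below assumes a generic 5-item scale; the two respondent groups, the alternative coding, and the function name are all hypothetical.

```python
from typing import Dict, List


def summed_score(responses: List[int], coding: Dict[int, int]) -> int:
    """Add up the numbers that a given coding assigns to each selected item."""
    return sum(coding[item] for item in responses)


# Two codings that both preserve the order of items #1 through #5.
equal_steps = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
unequal_steps = {1: 0, 2: 1, 3: 2, 4: 5, 5: 10}

# Hypothetical responses, recorded as item positions on the scale.
group_x = [3, 3, 3]
group_y = [1, 1, 5]

for name, coding in [("equal steps", equal_steps), ("unequal steps", unequal_steps)]:
    x = summed_score(group_x, coding)
    y = summed_score(group_y, coding)
    print(f"{name}: group X = {x}, group Y = {y}")
```

With equal steps, group X scores higher; with unequal steps, group Y does. Because the scale itself offers no basis for choosing between the codings, the sums describe the coding as much as the respondents.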

If ordinal quantification is misleading, why is it done? Mostly, for convenience. It allows people to represent something that is difficult to measure with a simple number, but that number is inherently misleading. Sometimes this is also done for another reason: to lend Likert scales an inflated sense of accuracy, precision, significance, and objectivity when making research claims. A great deal of social science and survey-based performance reporting (e.g., customer satisfaction surveys) is based on quantified Likert scales. In my opinion, this renders any claims that are based on them suspect.

In science and data sensemaking (including data visualization), it is important to understand the difference between interval scales and ordinal scales: the former are quantitative; the latter are not. Both play a role, but their roles should not be conflated.

Data Visualization Is Not a Panacea

Stephen Few — Mon, 09 Mar 2020 17:59:01 +0000

It galls me when people oversell data visualization. Data visualization combines technologies (visual representations of quantitative data) with specific skills (techniques for creating and interacting with those visual representations) to make sense of and communicate quantitative data. It does not replace the other technologies and skillsets that are also needed to derive value from quantitative data; it complements them. It contributes to solutions; it is not “the solution.”

As an expert in data visualization, I’ve never oversold it. Data visualization is extremely useful and at times essential, but it is only one of many technologies and skillsets that are needed to understand and communicate quantitative data. It enables us to see patterns and relationships in quantitative data that would be difficult or impossible to see in any other way, but it does not stand alone. To understand and present quantitative data effectively, we also need domain knowledge, statistical thinking, critical thinking, scientific thinking, systems thinking, and computer skills.

Marketing has infected us with a propensity for hype. It is no longer enough to truthfully describe what something is and does; we must exaggerate it to the point of absurdity in an effort to sell it. When we create expectations that can never be satisfied, however, we pave the road to confusion, frustration, and failure. When data visualization fails to deliver on hyperbolic promises, in what direction do the fingers of blame point? They point to data visualization rather than to the fools who misrepresent it.

If you value data visualization, you’ll do it no favors by exaggerating its role and worth. An honest assessment is all that’s needed. Data visualization is incredibly useful, but it’s not the “second coming.” Rather than tooting data visualization’s horn with the bombastic grandeur of a Wagnerian opera, demonstrate its worth by doing it well. Combine it with complementary skills and technologies to solve real problems. That’s enough. That will always be enough.

A Thinker’s Guide to Artificial Intelligence

Stephen Few — Thu, 05 Mar 2020 21:56:58 +0000

I just finished reading the book about Artificial Intelligence (AI) that I’ve been craving for years: Artificial Intelligence: A Guide for Thinking Humans, by Melanie Mitchell. More than any other book on this hot but largely misunderstood topic, this book describes AI in clear and accessible terms. It cuts through the hype to present a sane assessment with no agenda apart from a desire to inform. Reading this book, you’ll likely discover that AI is quite different from what you imagined.

Melanie Mitchell qualifies as a second-generation pioneer in the field of AI. Beginning in the mid-1980s she earned her Ph.D. in the field under the supervision of Douglas Hofstadter, the previous generation pioneer whose book Gödel, Escher, Bach: an Eternal Golden Braid inspired many to pursue AI. She continues to do research and development in AI as a professor at Portland State University and at the Santa Fe Institute. I’d wager that few people, if any, understand AI in general better than she does. In this book she explains what AI is, covers its history from inception through today, describes the approaches that have been pursued (symbolic AI, neural networks, machine learning, etc., including explanations for how these approaches work), and presents the strengths and limitations of AI in unvarnished terms. She does all of this with a practical eloquence that is rare among technology writers.

Should we be concerned about AI? You bet, but probably not for the reasons that you imagine. AI has never exhibited anything that qualifies as general intelligence (i.e., thinking as humans do), despite years of diligent effort. Will it ever? Nobody knows. In the meantime, however, we do know that computers can perform “narrow AI” tasks that are quite helpful. We should make sure that AI is only applied in ways that are truly useful and understood. If we can’t understand how AI’s results are produced, we can’t trust those results. We must also make sure that AI applications are designed in ways that are both effective and ethical. Current applications exhibit worrisome flaws. As AI researcher Pedro Domingos has said: “People worry that computers will get too smart and take over the world, but the real problem is that they’re too stupid and they’ve already taken over the world.” I agree. We can and should produce better, more useful AI technologies. Knowing that people like Melanie Mitchell are involved in the effort gives me hope—a glimmer, at least—that we just might head in that direction.

Proportionally Speaking

Stephen Few — Sun, 01 Mar 2020 21:40:29 +0000

As data sensemakers, we spend a great deal of time examining quantitative relationships. Along with distribution, correlation, and time-series relationships, proportion is the other quantitative relationship that plays a significant role in data sensemaking. A proportion is just a relationship between two quantities. If we compare the number of friends that Sally and John each have, Sally’s 20 friends compared to John’s 10 friends is a proportion. It’s really that simple, but confusion often occurs when we communicate proportions.

Much of the confusion probably stems from the fact that proportions can be expressed in several ways: as ratios, fractions, rates, and percentages. The proportion of Sally’s 20 friends compared to John’s 10 can be written as a ratio in either of the following ways: “20 to 10” or “20:10”. This same proportion can also be expressed as “2 to 1” or “2:1”, for these ratios represent the same proportion in which the first value is double the second. This proportion can also be expressed as the fraction “20/10”. The symbol for division (i.e., /) that appears in the fraction indicates that a proportion can also be expressed as the result of division, which is called the rate. In this case, the rate of Sally’s friends to John’s is 2, because 20 divided by 10 equals 2. A rate of 1, expressed as a percentage, is 100%, so the proportion of Sally’s friends to John’s, expressed as a percentage, is 200% (i.e., the rate of 2 multiplied by 100%).
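For readers who like to see the arithmetic spelled out, here is a minimal Python sketch of the same proportion expressed in each of these ways. The function name describe_proportion and the dictionary keys are my own illustrative choices, not standard terminology.

```python
from fractions import Fraction


def describe_proportion(value, reference):
    """Express value relative to reference as a ratio, fraction, rate, and percentage."""
    reduced = Fraction(value, reference)  # Fraction reduces 20/10 to 2/1 automatically
    return {
        "ratio": f"{value}:{reference}",
        "reduced_ratio": f"{reduced.numerator}:{reduced.denominator}",
        "fraction": f"{value}/{reference}",
        "rate": value / reference,
        "percentage": value * 100 / reference,
    }


# Sally's 20 friends compared to John's 10
sally_vs_john = describe_proportion(20, 10)
for name, expression in sally_vs_john.items():
    print(f"{name}: {expression}")
```

Every entry describes the same relationship: Sally has twice as many friends as John.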

All of these expressions of the proportion reveal that Sally has twice as many friends as John. Expressed as a percentage, we could also say that Sally has 100% more friends than John, for Sally’s 200% minus John’s 100% results in a 100% difference. We should express it this way cautiously, however, for people often find “less than” or “greater than” expressions of proportions confusing. When we express a greater than or less than proportion, we must remember to express only the difference between the two values.

Each expression of the proportion above treats Sally’s friends as the point of reference. To get the rate of 2, we began with Sally’s number of friends and divided that by John’s number of friends (i.e., 20 / 10 = 2). The order matters. If we instead treated John’s number of friends as the point of reference, we could express the proportion of John’s 10 friends to Sally’s 20 in each of the following ways: the ratio 10 to 20, 10:20, 1 to 2, or 1:2; the fraction 10/20 or 1/2; the rate of 0.5; the percentage 50%. We could also say that John has 50% fewer friends than Sally (i.e., John’s 50% minus Sally’s 100% equals -50%).

If John lost 8 of his friends, leaving only 2, while Sally maintained all 20 of her friends, we could say that John has 0.1 times, or 10% of, the number of friends that Sally has. This is fairly straightforward and clear to anyone who understands the basic concepts of rates and percentages. On the other hand, we could also say that John has 90% fewer friends than Sally (10% minus 100% equals -90%), but, as I warned previously, this isn’t nearly as straightforward and clear for many people.
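The distinction between "percent of" and "percent more or fewer than" can be sketched in a few lines of Python. The helper names percent_of and percent_difference are hypothetical; the point is only that the second is the first minus the reference's own 100%.

```python
def percent_of(value, reference):
    """The value expressed as a percentage of the reference."""
    return value * 100 / reference


def percent_difference(value, reference):
    """How many percent more (positive) or fewer (negative) than the reference."""
    return percent_of(value, reference) - 100


print(percent_of(20, 10), percent_difference(20, 10))  # Sally relative to John
print(percent_of(10, 20), percent_difference(10, 20))  # John relative to Sally
print(percent_of(2, 20), percent_difference(2, 20))    # John after losing 8 friends
```

Notice that the order of the arguments matters, just as the point of reference matters in the prose above.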

In the Oxford English Dictionary (OED), the first two definitions of “proportion” are:

1. A portion, a part, a share, especially in relation to a whole; a relative amount or number.
2. A comparative relation or ratio between things in size, quantity, number, etc.

Both of these definitions fit what we’re discussing here. The sixth definition that appears in the OED, however, which is particular to mathematics, can lead to confusion.

6. MATH. A relationship of equivalence between two pairs of quantities, such that the first bears the same relation to the second as the third does to the fourth.

In other words, when comparing the ratio 1:2 to the ratio 10:20, mathematicians would not just say that they are in proportion but that they actually are a proportion. I mention this only to point out that, if you’re talking to a mathematician about proportions, you might be using the term differently, so be careful. I’ve encountered this problem myself a few times.

A few months ago, I ran across an example of a proportion gone wrong. It appeared on the PBS NewsHour broadcast on September 23, 2019, in a segment titled “Judges weigh Trump’s family planning funding rule.” Near the end of the broadcast the following text appeared on the screen:

This is an example of a proportion that has been expressed as the difference between two values (i.e., the average income of Title X patients minus the income that’s defined as the poverty level) rather than more clearly as the relationship between them, but that’s not the only problem here. An income that is 150% below the poverty line makes no sense. An income can’t be more than 100% below the poverty line, for that would produce a negative value and negative income doesn’t make sense in this context. The person who wrote this text must not understand proportions, and apparently the program’s hosts were confused as well, which is all too common. I suspect that they meant to say that 78% of Title X patients have incomes that fall below 150% of the poverty line. In 2019, the U.S. federal poverty level for a family of one was $12,140, so 150% of that is $18,210. It seems reasonable that 78% of people who take advantage of Title X—people who tend to have low incomes and thus are in need of Title X assistance—made less than $18,210 for a family of one, $24,690 for a family of two, $31,170 for a family of three, and so on.
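A quick calculation shows why "150% below" cannot be right while "below 150% of" is plausible. This is only a sketch using the single-person poverty figure quoted above; the constant and function names are my own.

```python
POVERTY_LINE_ONE_PERSON = 12_140  # dollar figure quoted in the text above


def income_at_percent_of_line(percent, line=POVERTY_LINE_ONE_PERSON):
    """Income that equals the given percentage of the poverty line."""
    return line * percent / 100


def income_at_percent_below_line(percent, line=POVERTY_LINE_ONE_PERSON):
    """Income that sits the given percentage below the poverty line."""
    return line * (100 - percent) / 100


print(income_at_percent_of_line(150))     # 18210.0, a sensible income threshold
print(income_at_percent_below_line(150))  # -6070.0, a negative income
print(income_at_percent_below_line(100))  # 0.0, the lowest that makes sense
```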

When dealing with proportions, a rate of 1 and a percentage of 100% play an important role. They both express the same equal, one-to-one proportion. In other words, the two values that are being compared are the same. For this reason, we tend to think of proportions as being less than, equal to, or greater than 1 when expressed as a rate or less than, equal to, or greater than 100% when expressed as a percentage. Consistent with the importance of 100%, there is a special type of proportional relationship that is based on 100% of something: the part-to-whole relationship. The whole is 100% of some measure (e.g., total sales revenues) and the parts are lesser percentages into which the whole has been divided (e.g., sales revenues in separate geographical regions, consisting of North, South, East, and West), which add up to 100%. When examining parts of a whole, we spend much of our time comparing the parts to one another. As such, graphical displays of part-to-whole relationships are only effective if they make it easy to compare the parts. Unfortunately, the most popular part-to-whole display—the pie chart—does this job poorly. It is difficult to compare slices of a pie. If you don’t already know why this is so, I recommend that you read my article “Save the Pies for Dessert.” As it turns out, this problem with pie charts is well understood but routinely ignored.

The ways that changes in proportions are expressed are another common source of confusion. Let me illustrate. According to a recent survey, the obesity rate among U.S. adults is now 42%. If we’re told that the obesity rate has increased 40% in the last 20 years, what was the rate in the year 2000? Think about this for a moment. In the year 2000, was the obesity rate 2% (i.e., 2% + 40% = 42%) or was it 30% (i.e., 30% * 140% = 42%)? It depends on how you interpret the words “increased 40%.” People sometimes mistakenly interpret this as a percentage point increase rather than a percentage increase. In this particular case, common sense suggests that the obesity rate must have been 30% in the year 2000, for there’s no way that only 2% of U.S. adults were obese just 20 years ago. Without this context, however, people might be confused.

This increase may be expressed in any of the following ways: “From the year 2000 to the year 2020, the obesity rate among U.S. adults…”

“…increased from 30% to 42%.”
“…increased 12 percentage points to 42%.”
“…increased 40% to 42%.”

Which of the three expressions above would least likely result in confusion? I suspect that the first, “…increased from 30% to 42%”, is the clearest. We could, of course, state the change more thoroughly by saying “…increased 12 percentage points from 30% to 42%” or “increased 40% from 30% to 42%.” When communicating with the general public, extra care in expressing changes in proportions works best.
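The difference between a percentage point change and a percentage change is easy to verify with a short Python sketch. The function names below are illustrative only.

```python
def describe_change(old_pct, new_pct):
    """Return the change in percentage points and as a relative percentage."""
    point_change = new_pct - old_pct
    relative_change = (new_pct - old_pct) * 100 / old_pct
    return point_change, relative_change


def previous_rate_if_points(new_pct, increase):
    """Earlier rate if 'increased 40%' is read as percentage points."""
    return new_pct - increase


def previous_rate_if_percent(new_pct, increase):
    """Earlier rate if 'increased 40%' is read as a relative percentage."""
    return new_pct * 100 / (100 + increase)


points, relative = describe_change(30, 42)
print(points, relative)                  # 12 percentage points, a 40% increase
print(previous_rate_if_points(42, 40))   # 2, the mistaken reading
print(previous_rate_if_percent(42, 40))  # 30.0, the intended reading
```

The same words, "increased 40%," lead to an earlier rate of either 2% or 30% depending on the reading, which is why stating both endpoints is the safest expression.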

Communicating proportions isn’t terribly difficult if we’re aware of how people might misinterpret them and take care to express them clearly. If you’re proportionally challenged, I hope this helps.

Logarithms Unmuddled

Stephen Few — Fri, 21 Feb 2020 16:15:58 +0000

I often write about topics that I myself have struggled to understand. If I’ve struggled, I assume that many others have struggled as well. Over the years, I’ve found several mathematical concepts confusing, not because I’m mathematically disinclined or disinterested, but because my formal training in mathematics was rather limited and, in some cases, poorly taught. My formal training consisted solely of basic arithmetic in elementary school, basic algebra in middle school, basic geometry in high school, and an introductory statistics course in undergraduate school. When I was in school, I didn’t recognize the value of mathematics—at least not for my life. Later, once I became a data professional, a career that I stumbled into without much planning or preparation, I learned mathematical concepts on my own and on the run whenever the need arose. That wasn’t always easy, and it occasionally led to confusion. Like many mathematical topics, logarithms can be confusing, and they’re rarely explained in clear and accessible terms. How logarithms relate to logarithmic scales and logarithmic growth isn’t at all obvious. In this article, I’ll do my best to cut through the confusion.

Until recently, my understanding (and misunderstanding) of logarithms stemmed from limited encounters with the concept in my work. As a data professional who specialized in data visualization, my knowledge of logarithms consisted primarily of three facts:

Along logarithmic scales, each labeled value that typically appears along the scale is a consistent multiple of the previous value (e.g., multiples of 10 resulting in a scale such as 1, 10, 100, 1,000, 10,000, etc.).
Logarithmic scales make it easy to compare rates of change in line graphs because equal slopes represent equal rates of change.
Logarithmic growth exhibits a pattern that goes up by a constantly decreasing amount.

If you, like me, became involved in data sensemaking (a.k.a., data analysis, business intelligence, analytics, data science, so-called Big Data, etc.) with a meagre foundation in mathematics, your understanding of logarithms might be similar to mine—similarly limited and confused. For example, if you think that the sequence of values 1, 10, 100, 1,000, 10,000, and so on is a sequence of logarithms, you’re mistaken, and should definitely read on.

Before reading on, however, I invite you to take a few minutes to write a definition for each of the following concepts:

Logarithm
Logarithmic scale
Logarithmic growth

In addition to definitions, take some time to describe how these concepts relate to one another. For example, how does a logarithmic scale relate to logarithmic growth? Give it a shot now before reading any further.

…

Regardless of how much you struggle to define these concepts and their relationships to one another, it’s useful to prime your brain for the topic. Now that you have, let’s dive in.

Logarithms

The logarithm (a.k.a., log) of a number is the power to which the log’s base must be raised to equal that number. I realize this definition might not seem clear, but hang in there with me. I promise that greater clarity will emerge. A logarithm always has a base (i.e., the number that is raised to a power). The most common base is 10, expressed as log₁₀, but any positive number other than 1 may serve as the base. To determine the log₁₀ value of the number 100, we must determine the power of 10 that equals 100. What this means will become clear in a moment through an example, but before getting to that, let’s review what raising a number to a power means in mathematics.

Raising a number to a power involves multiplying the number by itself a specific number of times. The power indicates how many instances of the number are multiplied. For example, 10 to the power of 3, written as 10³ (the 3 in this case is called the exponent), involves multiplying 10 * 10 * 10, which equals 1,000. Raising a number to the power of 1 involves only one instance of that number (there is nothing to multiply), so the number remains unchanged. For example, 10¹ remains 10. Raising a number to the power of 2 involves multiplying two instances of that number, so 10² is 10 * 10, which equals 100. In these examples so far, the only time multiplication wasn’t involved was with the power of 1. Multiplication is also not involved whenever the exponent is zero or negative. In those cases, raising a number to that power involves division. For example, with the power of 0, rather than multiplying instances of the number by itself, we divide the number by itself, so 10⁰ is equal to 1, for 10 / 10 = 1. Here’s a list of values that result from raising the number 10 to the powers of 0 through 6:
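The list of powers of 10 described above can be regenerated with a couple of lines of Python. The variable name powers_of_ten is simply my label for the result.

```python
# Map each exponent from 0 through 6 to the value of 10 raised to that power.
powers_of_ten = {exponent: 10 ** exponent for exponent in range(7)}

for exponent, value in powers_of_ten.items():
    print(f"10^{exponent} = {value:,}")
```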

Now that we’ve reviewed what it means to raise a number to a particular power, we can get back to logarithms. Remember that the log of a number is the power to which the log’s base must be raised to equal that number. So, to find the log₂ value of the number 8, we must determine the power of 2 (the log’s base) that is equal to 8. In other words, we must determine how many times 2 must be multiplied by itself to equal 8. Since 2¹ = 2 and 2² = 4 (i.e., 2 * 2 = 4) and 2³ = 8 (i.e., 2 * 2 * 2 = 8), we know that log₂ of 8 is 3. Given this procedure, what is the log₁₀ value of 100? It is 2, for 10 must be raised to the power of 2 (i.e., 10 * 10) to equal 100. What’s the log₂ of 64? It is 6, for 2 must be raised to the power of 6 (i.e., 2 * 2 * 2 * 2 * 2 * 2) to equal 64.

So far, we’ve only dealt with logs that result in nice, round numbers, but that isn’t always the case. For example, what is the log₂ of 100? The log₂ of 64 is 6 and the log₂ of 128 is 7, so the log₂ of 100 is somewhere between 6 and 7. Rounded to eight decimal places, the log₂ of 100 is 6.64385619. What is the log₁₀ of 5? It must be between 0 and 1, because 5 is greater than 1 but less than 10. Rounded to nine decimal places, the answer is 0.698970004.
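If you want to check these values yourself, Python's standard math module provides log2 and log10 functions. The variable names in this small sketch are my own.

```python
import math

# Logs that come out as whole numbers
log2_of_8 = math.log2(8)        # 3.0, because 2 * 2 * 2 = 8
log10_of_100 = math.log10(100)  # 2.0, because 10 * 10 = 100
log2_of_64 = math.log2(64)      # 6.0

# Logs that do not come out as whole numbers
log2_of_100 = round(math.log2(100), 8)  # 6.64385619
log10_of_5 = round(math.log10(5), 9)    # 0.698970004

print(log2_of_8, log10_of_100, log2_of_64)
print(log2_of_100, log10_of_5)
```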

Have you ever examined a list of the logarithms associated with an incremental sequence of numbers? Doing this is revealing. Here’s a list of the log₂ values for the numbers 1 through 32, with an additional column that shows the proportional relationship between log₂ values and the numbers on which they’re based:

Notice that, once we get past the number 3 (whose log₂ value of 1.584963 is 52.832% of 3, the highest percentage in the list), as we read down the list, each log is a decreasing percentage of the number on which it is based. Keep this fact in mind. It will come in handy as we examine logarithmic scales and logarithmic growth.
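A list like the one described above can be rebuilt with a short Python sketch. The column layout is my own guess at a reasonable presentation, not a reproduction of the original table.

```python
import math

# For each number 1 through 32: the number, its log2 value,
# and that log expressed as a percentage of the number.
rows = [(n, math.log2(n), math.log2(n) * 100 / n) for n in range(1, 33)]

print("Number  log2 value  log as % of number")
for number, log_value, percentage in rows:
    print(f"{number:>6}  {log_value:>10.6f}  {percentage:>17.3f}%")
```

Reading down the last column, the percentage peaks at the number 3 and shrinks steadily from there.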

Logarithmic Scales

A logarithmic scale (a.k.a., log scale) is one in which equal distances along the scale correspond to equal logarithmic distances. Because of the nature of logarithms, each number that typically appears along the scale is a consistent multiple of the previous number. The example below includes a log₁₀ scale along the Y axis.

Along a log₁₀ scale, because the base is 10, each number is 10 times the previous number. The example above begins at 1, but it could begin at any number. For example, a log₁₀ scale could begin at 40 and continue with 400, 4,000, 40,000, and so on, each ten times the previous. A log₂ scale that begins with 1 would continue with 2, 4, 8, 16, and so on, each two times the previous. Unlike a linear scale in which the intervals from one number to the next are always equal in value, such as 0, 10, 20, 30, 40, etc., along a log scale the intervals (i.e., the quantitative distances between the labeled values) consistently increase in value, each time multiplied by the base.

The numbers 1, 10, 100, 1,000, 10,000, 100,000, and 1,000,000 in the graph above correspond to logarithms with a base of 10, but those numbers are not themselves logarithms. Instead, they are the numbers from which the logarithms were derived. Here’s the scale that appears along the Y axis of the graph above, this time with the actual log₁₀ values 0 through 6 labeled in addition to the numbers 1 through 1,000,000 from which those logarithms were derived.

We usually label the log scales with the numbers from which the logarithms were derived rather than the logarithms themselves because the former are typically more familiar and useful.
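Here is a minimal sketch of how the labels along a log scale relate to the logarithms behind them. The function log_scale_labels is a hypothetical helper, not part of any charting library.

```python
import math


def log_scale_labels(start, base, count):
    """Labels along a log scale: each one is the previous label times the base."""
    return [start * base ** step for step in range(count)]


labels = log_scale_labels(1, 10, 7)
log_values = [round(math.log10(label)) for label in labels]

for log_value, label in zip(log_values, labels):
    print(f"log10 = {log_value}  label = {label:,}")

print(log_scale_labels(40, 10, 4))  # [40, 400, 4000, 40000]
print(log_scale_labels(1, 2, 5))    # [1, 2, 4, 8, 16]
```

The labels grow by multiplication while the underlying logarithms grow by addition, one whole unit at a time.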

Another characteristic of a log scale that reinforces its nature bears mentioning, which I’ll illustrate below by featuring a single interval only along the Y axis of the graph shown previously.

Notice that the minor tick marks between 1,000 and 10,000 get closer and closer together from bottom to top. This is easier to see if the scale is enlarged and the minor tick marks are labeled, as I’ve done below.

Each interval from one tick mark to the next (1,000 to 2,000, 2,000 to 3,000, etc.) consistently covers a numeric range of 1,000, but the spaces between the marks get smaller and smaller because the differences in the logarithms corresponding to those numbers get smaller and smaller. To illustrate this, I included a column of the log₁₀ values that correspond to each tick mark in the example above. The decreasing distances between the tick marks correspond precisely to the decreasing differences between the log values.
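The shrinking gaps between the minor tick marks can be confirmed numerically. This sketch computes the log₁₀ value of each tick from 1,000 to 10,000 and the difference from the previous tick; the variable names are my own.

```python
import math

ticks = list(range(1000, 10001, 1000))
tick_logs = [math.log10(tick) for tick in ticks]
# Distance on the scale from each tick mark to the next one
gaps = [upper - lower for lower, upper in zip(tick_logs, tick_logs[1:])]

print(f"{ticks[0]:>6,}  log10 = {tick_logs[0]:.6f}")
for tick, log_value, gap in zip(ticks[1:], tick_logs[1:], gaps):
    print(f"{tick:>6,}  log10 = {log_value:.6f}  gap = {gap:.6f}")
```

Each gap is smaller than the one before it, and together they add up to one whole unit of log₁₀.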

Logarithmic Growth

Because the numbers that typically appear as labels on a log scale are each a consistent multiple of the previous number, if you didn’t already understand logarithms, you might assume that logarithmic growth involves a series of values that are each a consistent multiple of the previous value. Here’s an example of how that might look as a series of values:

In this example, each daily value doubles the previous value. This, however, is not an example of logarithmic growth. It is instead an example of exponential growth (a.k.a., compound growth). With exponential growth, the amount of increase from one value to the next is always greater. Compound interest earned on money in a savings account is an example of exponential growth. As the balance grows, even though the rate of interest remains constant, the amount of growth in dollars consistently increases because of the growing balance. For example, 10% interest on $100 (i.e., $10) would increase the balance to $110 during the first period, and then during the next period, it would be based on $110 resulting in $11 of interest, a dollar more. Even though the interest rate remains constant, because the balance grows from one period to the next, the amount of increase grows as well.
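The compound interest example can be sketched in Python as follows. The function name and its parameters are illustrative assumptions; the $100 balance and 10% rate come from the example above.

```python
def compound_balances(principal, rate_percent, periods):
    """Balance at the start and after each period of compound interest."""
    balances = [principal]
    for _ in range(periods):
        balances.append(balances[-1] * (100 + rate_percent) / 100)
    return balances


balances = compound_balances(100, 10, 3)
# Dollar growth in each period: the rate is constant, the amount is not
increases = [later - earlier for earlier, later in zip(balances, balances[1:])]

for period, (balance, increase) in enumerate(zip(balances[1:], increases), start=1):
    print(f"Period {period}: balance ${balance:.2f}, interest ${increase:.2f}")
```

The first two periods earn $10 and $11 of interest, matching the example, and each later period earns more than the one before.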

Contrary to exponential growth, logarithmic growth (a.k.a., log growth) exhibits a continual decrease in the amount of growth from one value to the next. In other words, it always grows but it does so to a decreasing degree over time. A rough physical analogy, though not strictly logarithmic because the climb eventually stops, is the height of a bullet that is shot straight up into the air, from the moment it leaves the gun until it reaches its apex, before beginning its descent. The height of the bullet starts off by increasing quickly but those increases constantly decrease in amount from one interval of time to the next due to the pull of gravity.

So, how does log growth relate to log scales? It’s not at all obvious, is it? Good luck finding an explanation on the web that’s understandable if you’re not fluent in mathematics. Here’s a graphical example of log growth, based on the log₂ values for the numbers 1 through 64:

I’ve annotated this graph with lines connected to points in time when the logarithm has increased by a whole unit (i.e., from 0 to 1, 1 to 2, etc.). Starting on day 1 the log value is zero and whole-unit increases are subsequently reached on days 2, 4, 8, 16, 32, and 64. Do you recognize this pattern of days along the X axis? It matches the numbers that would appear along a log₂ scale that begins with 1. In other words, the interval between the days on which the logarithm increased by a whole unit doubled each time (1 day, then 2, 4, 8, 16, and 32 days).
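The pattern of days can be checked with a brief Python sketch that finds every day from 1 through 64 on which the log₂ value is a whole number. The variable names are my own.

```python
import math

# Days on which log2(day) lands exactly on a whole number
whole_unit_days = [day for day in range(1, 65) if math.log2(day).is_integer()]
# Number of days between one whole-unit milestone and the next
intervals = [later - earlier for earlier, later in zip(whole_unit_days, whole_unit_days[1:])]

print(whole_unit_days)  # [1, 2, 4, 8, 16, 32, 64]
print(intervals)        # [1, 2, 4, 8, 16, 32]
```

Each whole-unit increase takes twice as long to arrive as the one before it, which is what log growth looks like in numbers.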

Have you noticed that the pattern formed by log growth is the inverse (i.e., the mirror image across the diagonal, with the X and Y axes trading places) of the pattern formed by exponential growth? To illustrate this, the graph below displays three different patterns of growth: logarithmic, linear, and exponential.

This inverted relationship between patterns of logarithmic and exponential growth visually confirms the inverted relationship that exists between logarithms and the exponential powers that are used to produce them.

Given the nature of logarithms, what do you think would happen to the shape of the blue exponential line above if I changed the scale along the Y axis from linear to logarithmic? If your answer is that the blue line would now take on the shape of logarithmic growth similar to the orange line above, you’re thinking in the right direction, but you went too far. The nature of logarithms to progressively decrease in the amount that they grow from one to the next would cancel out the nature of exponents to progressively increase in the amount that they grow from one to the next, resulting in a linear pattern similar to the gray line in the graph above.
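This cancelling effect is easy to demonstrate. The sketch below uses a hypothetical series that doubles each day (not the data behind the graph above) and compares the steps between values on a linear scale with the steps between their positions on a log₂ scale.

```python
import math

# A hypothetical exponential series: the value doubles each day for 10 days
exponential_values = [2 ** day for day in range(1, 11)]

# On a linear scale, each step is larger than the last
linear_steps = [b - a for a, b in zip(exponential_values, exponential_values[1:])]

# On a log2 scale, each value is plotted at its logarithm
log_positions = [math.log2(value) for value in exponential_values]
log_steps = [b - a for a, b in zip(log_positions, log_positions[1:])]

print(linear_steps)  # steps keep doubling: 2, 4, 8, ...
print(log_steps)     # every step is 1.0, which plots as a straight line
```

Equal steps from one point to the next are exactly what a straight line is, so exponential growth plotted on a log scale looks linear.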

…

I hope you agree that these concepts actually make sense when they’re explained with clear words and examples. You still might not have much use for logarithms unless your work involves advanced mathematics, but now you’re less likely to embarrass yourself by saying something dumb about them, as I’ve done on occasion.