Thanks for taking the time to read my thoughts about Visual Business Intelligence. This blog provides me (and others on occasion) with a venue for ideas and opinions that are either too urgent to wait for a full-blown article or too limited in length, scope, or development to warrant one. For a selection of articles, white papers, and books, please visit
December 16th, 2014
This blog entry was written by Bryan Pierce of Perceptual Edge.
In 2015, Stephen Few will offer different combinations of five data visualization courses at public workshops around the world. He’ll teach his three introductory courses, Show Me the Numbers: Table and Graph Design (now as a two-day course, with additional content and several more small-group exercises and discussions), Information Dashboard Design, and Now You See It: Visual Data Analysis. He’s also introducing two new advanced courses for people who have already attended the prerequisite introductory courses or read the associated books and are looking to hone their skills further: Signal: Understanding What Matters in a World of Noise and Advanced Dashboard Design.
Stephen will teach the following public workshops in 2015:
- Berkeley, California on Jan 13 – 14: Signal: Understanding What Matters in a World of Noise
- Berkeley, California on Jan 27 – 29: Advanced Dashboard Design (Almost full!)
- Copenhagen, Denmark on Feb 24 – 26: Show Me the Numbers: Table and Graph Design and Now You See It: Visual Data Analysis
- London, U.K. on Mar 2 – 3: Show Me the Numbers: Table and Graph Design
- London, U.K. on Mar 4 – 6: Advanced Dashboard Design
- Sydney, Australia on Mar 23 – 24: Signal: Understanding What Matters in a World of Noise
- Sydney, Australia on Mar 25 – 27: Advanced Dashboard Design
- Stockholm, Sweden on Apr 21 – 23: Show Me the Numbers: Table and Graph Design and Information Dashboard Design
- Portsmouth, Virginia on Apr 28 – 30: Show Me the Numbers: Table and Graph Design and Now You See It: Visual Data Analysis
- Soest, Netherlands on May 6 – 8: Show Me the Numbers: Table and Graph Design and Information Dashboard Design
- Soest, Netherlands on May 11 – 12: Signal: Understanding What Matters in a World of Noise
- Minneapolis, Minnesota on Jun 2 – 4: Show Me the Numbers: Table and Graph Design and Information Dashboard Design
- Portland, Oregon on Sep 22 – 24: Show Me the Numbers: Table and Graph Design and Information Dashboard Design
- Dublin, Ireland on Oct 6 – 8: Show Me the Numbers: Table and Graph Design and Now You See It: Visual Data Analysis
- Wellington, New Zealand on Nov 4 – 6: Show Me the Numbers: Table and Graph Design and Information Dashboard Design (Registration not yet open)
- Sydney, Australia on Nov 11 – 13: Show Me the Numbers: Table and Graph Design and Now You See It: Visual Data Analysis (Registration not yet open)
These workshops are a great way to learn the data visualization principles that Stephen teaches in his books.
December 11th, 2014
If you’re familiar with my work, you know that I am an iconoclast within the business intelligence (BI) and analytics communities, refusing to join the drunkard’s party of hyperbolic praise for information technologies. The “clast” portion of the term “iconoclast” means “to break.” I often break away from the herd, and break the mold of convention, to say what I believe is true in the face of misinformation. My opinion that so-called Big Data is nothing more than marketing hype is a prime example of this. Despite my opinion of Big Data, I approached Christian Rudder’s book Dataclysm with great interest. The following excerpt from the book’s opening paragraph gave me hope that this was not just another example of marketing hype.
You have by now heard a lot about Big Data: the vast potential, the ominous consequences, the paradigm-destroying new paradigm it portends for mankind and his ever-loving websites. The mind reels, as if struck by a very dull object. So I don’t come here with more hype or reportage on the data phenomenon. I come with the thing itself: the data, phenomenon stripped away. I come with a large store of the actual information that’s being collected, which luck, work, wheedling, and more luck have put me in the unique position to possess and analyze.
What can this large store of actual information reveal?
Practically by accident, digital data can now show us how we fight, how we love, how we age, who we are, and how we’re changing. All we have to do is look: from just a very slight remove, the data reveals how people behave when they think no one is watching.
You can imagine my excitement upon reading these words. Actual insights gleaned from large data stores; a demonstration of data’s merits rather than confusing, hyperbolic claims. What is this “phenomenon,” however, that Rudder has “stripped away” from the data? To my great disappointment, I found that the context that’s required to gain real insights was often stripped away. Only a few pages into the book I already found myself stumbling over some of Rudder’s assumptions and practices. I enjoyed his casual, hip voice, and I greatly appreciated the excellent, no-nonsense design of his graphs (excluding the silly Voronoi tessellation treemap), but some of his beliefs about data sensemaking gave me pause.
Rudder explains that he is trying to solve a real problem with many of behavioral science’s findings. Experimental research studies typically involve groups of test subjects that are not only too small for trustworthy results but also yield findings that cannot be generalized, because the subjects consist almost entirely of homogeneous sets of college students.
I understand how it happens: in person, getting a real representative data set is often more difficult than the actual experiment you’d like to perform. You’re a professor or postdoc who wants to push forward, so you take what’s called a “convenience sample”—and that means the students at your university. But it’s a big problem, especially when you’re researching belief and behavior. It even has a name: It’s called WEIRD research: white, educated, industrialized, rich, and democratic. And most published social research papers are WEIRD.
Rudder’s concern is legitimate. His solution, however, is lacking.
Rudder is a co-founder of the online dating service OKCupid. As such, he has access to an enormous amount of data that is generated by the choices that customers make while seeking romantic connections. Add to this the additional data that he’s collected from other social media sites, such as Facebook and Twitter, and he has a huge data set. Even though the people who use these social media sites are more demographically diverse than WEIRD college students, they don’t represent society as a whole. Derek Ruths of McGill University and Jürgen Pfeffer of Carnegie Mellon University recently expressed this concern in an article titled “Social Media for Large Studies of Behavior,” published in the November 28, 2014 issue of Science. Also, the conditions under which the data was collected exercise a great deal of influence, but Rudder has “stripped away” most of this context.
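The representativeness problem that Ruths and Pfeffer raise can be illustrated with a small simulation. In the sketch below, the age distribution and the self-selection rule are invented assumptions; the point is only that statistics computed from a self-selected user base diverge systematically from the population they are drawn from.

```python
import random

# Toy simulation of sampling bias on a social media platform.
# All numbers here are invented for illustration.
random.seed(0)

# A notional adult population with a mean age of about 47.
population = [random.gauss(47, 18) for _ in range(100_000)]

# Assume younger people are far more likely to join the platform:
# the probability of joining falls linearly with age, floored at 5%.
def joins(age):
    return random.random() < max(0.05, 1 - (age - 18) / 40)

users = [age for age in population if joins(age)]

pop_mean = sum(population) / len(population)
sample_mean = sum(users) / len(users)
# sample_mean comes out systematically lower than pop_mean: the
# platform's users are younger than the population, so statistics
# computed from them don't generalize to society as a whole.
```

No amount of additional data from the same platform corrects this; the bias is in who shows up, not in how many of them we observe.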
Contrary to his disclaimers about Big Data hype, Rudder expresses some hype of his own. Social media Big Data opens the door to a “poetry…of understanding. We are at the cusp of momentous change in the study of human communication.” He believes that the words people write on these sites provide the best source of information to date about the state and nature of human communication. I believe, however, that this data source reveals less than Rudder’s optimistic assessment. I suspect that it mostly reveals what people tend to say and how they tend to communicate on these particular social media sites, which support specific purposes and tend to be influenced by technological limitations—some imposed (e.g., Twitter’s 140 character limit) and others a by-product of the input device (e.g., the tiny keyboard of a smartphone). We can certainly study the effects that these technological limitations have on language, or the way in which anonymity invites offensive behavior, but are we really on the “cusp of momentous change in the study of human communication”? To derive useful insights from social media data, we’ll need to apply the rigor of science to our analyses just as we do with other data sources.
Behind every bit in my data, there are two people, the actor and the acted upon, and the fact that we can see each as equals in the process is new. If there is a “-clysm” part of the whole data thing, if this book’s title isn’t more than just a semi-clever pun or accident of the alphabet—then this is it. It allows us to see the full human experience at once, not just whatever side we happen to be paying attention to at a given time.
Having read the book, I found that the book’s title is only a “semi-clever pun.” Contrary to Rudder’s claim, his data does not “allow us to see the full human experience at once.” In fact, it provides a narrow window through which we can observe anonymous interactions between strangers in particular contexts that are designed for specific purposes (e.g., getting a date). Many of the findings that Rudder presents are fun and interesting, but we should take them with a grain of salt.
Fairly early in the book, Rudder presents his findings about women’s preferences in men and also about men’s preferences in women, but it isn’t clear what the data actually means because he’s stripped it of context. For example, when describing women’s age preferences—“the age of men she finds most attractive”—he fails to mention the specific data he based his findings on and the context in which it was collected. Were women being shown a series of photographs, two men at a time, and asked to select the one they found more attractive of the two; was the data based solely on the ages of the men that women contacted in hope of a date; or was the data drawn from some other context? Scientists must describe the designs of their studies, including the specific conditions under which they collected their data. Without understanding this context, we can’t understand the findings and certainly can’t rely on them.
Later in the book, Rudder presents findings about the number of messages people receive on OKCupid as a function of their physical attractiveness. His graph of his findings displays a positive correlation between the two variables that remains steady through most of the series but suddenly increases to an exponential relationship beginning around the 90th percentile of attractiveness. He sliced and diced his findings in several ways, but never provided a fundamental and important piece of information: how was physical attractiveness measured? Important insights might reside in this data, but we can’t trust them without the missing context.
Rudder seems to be endorsing a typical tenet of Big Data hype that concerns me, which I’ll paraphrase as, “With Big Data we no longer need to adhere to the basic principles of science.” I applaud Rudder’s efforts to expose bad science in the form of small, demographically homogeneous groups of test subjects, but his alternative introduces its own set of problems, which are just as harmful. I suspect that Rudder endorses this particular alternative because it is convenient. He’s a co-founder of a social media site that collects tons of data. It’s in his interest to enthusiastically endorse this Big Data approach. I trust that Rudder’s conscious intentions are pure, but I believe that his perspective is biased by his role, experience, and interests.
Sourcing data from the wild rather than from controlled experiments in the lab has always been an important avenue of scientific study. These studies are observational rather than experimental. When we do this, we must carefully consider the many conditions that might affect the behavior that we’re observing. From these observations, we carefully form hypotheses, and then we test them, if possible, in controlled experiments. Large social media data sets don’t alleviate the need for this careful approach. I’m not saying that large stores of social media data are useless. Rather, I’m saying that if we’re going to call what we do with it data science, let’s make sure that we adhere to the principles and practices of science. How many of the people who call themselves “data scientists” on resumes today have actually been trained in science? I don’t know the answer, but I suspect that it’s relatively few, just as most of those who call themselves “data analysts” of some type or other have not been trained in data analysis. No matter how large the data source, scientific study requires rigor. This need is not diminished in the least by data volume. Social media data may be able to reveal aspects of human behavior that would be difficult to observe in any other way. We should take advantage of this. However, we mustn’t treat social media data as magical, nor analyze it with less rigor than other sources of data. It is just data. It is abundantly available, but it’s still just data.
In one example of insights drawn from social media data, Rudder lists words and phrases that are commonly used by particular groups of people but aren’t commonly used by other groups. He also lists the opposite: words that are least often used by particular groups but are commonly used by other groups. His primary example featured the words and comments of men among the following four ethnic groups: white, black, Latino, and Asian. Here are the top ten words and phrases that Rudder listed for white men:
my blue eyes
hunting and fishing
I’m a white man and I must admit that this list does not seem to capture the essence of white men in general. I found the lists interesting, when considered in context, but far less revealing than Rudder claimed. Here’s an example of the “broad trends” that Rudder discerned from this approach to data analysis:
White people differentiate themselves mostly by their hair and eyes. Asians by their country of origin. Latinos by their music.
One aspect of the data that Rudder should have emphasized more is that it was derived from people’s responses to OKCupid profile questions. This means that we shouldn’t claim anything meaningful about these words and phrases apart from the context of self-description when trying to get a date. Another more fundamental problem is that by limiting the list to words and phrases that were used uniquely by particular groups, the list fails to represent the ways in which these groups view themselves overall. In other words, if the top words and phrases used by each of these groups were listed without filtering them on the basis of uniqueness (i.e., little use by other ethnic groups), very different self-descriptions of these groups would emerge.
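The distortion described above is easy to demonstrate. The toy sketch below uses invented word counts and a simplified stand-in for Rudder’s filter (keeping only words the other groups never use) to show how ranking a group’s words by raw frequency and ranking them by uniqueness produce very different pictures of the same group.

```python
from collections import Counter

# Invented word counts for one group and for everyone else combined.
group = Counter({"music": 50, "family": 40, "travel": 30, "surfing": 5})
others = Counter({"music": 45, "family": 50, "travel": 35})

def top_words(counts, n=3):
    """Rank a group's words by raw frequency."""
    return [word for word, _ in counts.most_common(n)]

def distinctive_words(group_counts, other_counts, n=3):
    """Keep only words the other groups never use — a simplified
    stand-in for a uniqueness filter like Rudder's."""
    unique = Counter({word: count for word, count in group_counts.items()
                      if other_counts[word] == 0})
    return [word for word, _ in unique.most_common(n)]

# By raw frequency the group talks about the same things as everyone
# else ("music", "family", "travel"); the uniqueness filter surfaces
# only the rare outlier "surfing", which hardly characterizes the
# group overall.
```

The filtered list is real data, but it answers the question “what do these people say that others don’t?” rather than “what do these people say about themselves?”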
I appreciate Rudder’s attempts to mine the relatively new and rich data resources that are available to him. Just as others have done before, he has an opportunity to reveal unknown and interesting aspects of human behavior that are specific to the contexts from which he has collected data. If he had stayed within these natural boundaries, I would have enjoyed his observations thoroughly. Unfortunately, Rudder strayed beyond these boundaries into the realm of claims that exceed his data and methods of analysis. This is an expression of Big Data hyperbole and technological solutionism. The only “data science” that is worthy of the name is, above all, rooted in the principles and practices of science. We dare not forget this in our enthusiasm.
December 2nd, 2014
In the opening chapter of Risk Savvy: How to Make Good Decisions by Gerd Gigerenzer, he writes:
When something goes wrong, we are told that the way to prevent further crisis is better technology, more laws, and bigger bureaucracy. How to protect ourselves from the next financial crisis? Stricter regulations, more and better advisers. How to protect ourselves from the threat of terrorism? Homeland security, full body scanners, further sacrifice of individual freedom. How to counteract exploding costs in health care? Tax hikes, rationalization, better genetic markers. One idea is absent from these lists: risk-savvy citizens.
In this wonderful book, Gigerenzer shows how to assess and deal with risk (i.e., known risks) versus uncertainty (i.e., unknown risks). “If risks are known, good decisions require logic and statistical thinking…If some risks are unknown, good decisions also require intuition and smart rules of thumb.” He provides practical guidelines for applying the right types of thinking to many real problems that we face in life, such as those related to finance and health care, leading to better decisions. He exposes the folly of trusting statistical algorithms when they’re applied to high levels of uncertainty, such as the behavior of the stock market. He also proposes ways to simplify the understanding of probabilities associated with quantifiable risks, such as the risk of dying from prostate cancer if you test positive on a prostate-specific antigen (PSA) test, which many medical doctors misunderstand.
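One of Gigerenzer’s simplifications is to restate conditional probabilities as natural frequencies — counts of people rather than percentages. The sketch below applies that idea to a generic screening test; the prevalence, sensitivity, and false-positive figures are invented for illustration, not taken from the book.

```python
# Natural-frequency reasoning for a screening test, in the spirit of
# Gigerenzer's approach. All rates below are invented assumptions.

def positive_predictive_value(prevalence, sensitivity, false_positive_rate,
                              population=1000):
    """Chance that a positive result actually means disease,
    reasoned as counts of people rather than probabilities."""
    sick = prevalence * population                    # e.g., 10 of 1,000 men
    healthy = population - sick                       # the other 990
    true_positives = sensitivity * sick               # 9 sick men test positive
    false_positives = false_positive_rate * healthy   # ~89 healthy men do too
    return true_positives / (true_positives + false_positives)

# With 1% prevalence, 90% sensitivity, and a 9% false-positive rate,
# only about 9 of the ~98 men who test positive have the disease.
ppv = positive_predictive_value(0.01, 0.90, 0.09)
```

Framed as counts, it is obvious that most positives are false positives; framed as conditional probabilities, the same arithmetic routinely trips up trained physicians.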
He cares deeply about health care and is intimately acquainted with its failures in decision making.
The major cause is the unbelievable failure of medical schools to provide efficient training in risk literacy. Medical progress has become associated with better technologies, not with better doctors who understand these technologies. Medical students have to memorize tons of facts about common and rare diseases. What they rarely learn is statistical thinking and critical evaluation of scientific articles in their own field. Learning is geared toward performance in the final exam, which shows little correlation with clinical experience.
He argues that society as a whole must become risk literate, which involves a basic level of statistical thinking that few of us learn in school. Of all branches of mathematics, statistics is the most practical and broadly applicable to life, yet it isn’t routinely taught in school along with algebra, geometry, and calculus. When it is taught properly, children can easily learn the basics of statistical thinking. Gigerenzer and I both believe that this training is essential and is becoming more so each day. “As a general policy, coercing and nudging people like a herd of sheep instead of making them competent is not a promising vision for a democracy.”
Here, in his own words, is what he argues in the book:
- Everyone can learn to deal with risk and uncertainty. In this book, I will explain principles that are easily understood by everyone who dares to know.
- Experts are part of the problem rather than the solution. Many experts themselves struggle with understanding risks, lack skills in communicating them, and pursue interests not aligned with yours. Giant banks go bust for exactly these reasons. Little is gained when risk-illiterate authorities are placed in charge of guiding the public.
- Less is more. When we face a complex problem, we look for a complex solution. And when it doesn’t work, we seek an even more complex one. In an uncertain world, that’s a big error. Complex problems do not always require complex solutions. Overly complicated systems, from financial derivatives to tax systems, are difficult to comprehend, easy to exploit, and possibly dangerous. And they do not increase the trust of the people. Simple rules, in contrast, can make us smart and create a safer world.
His arguments are convincing and the principles that he teaches are practical and accessible. I recommend this book.
December 1st, 2014
I believe that no single quality better equips data analysts for success than curiosity. Ian Leslie’s book Curious: The Desire to Know and Why Your Future Depends on It isn’t about data analysis, but is instead about curiosity in general: what it is, how it develops, how caregivers can encourage its development in children, how societies view it, what it enables us to do, how modern technologies affect it, and how it enriches our lives.
Cognitive scientists have identified three types of curiosity:
- Diversive curiosity: the attraction to everything novel, which is with us from early childhood
- Epistemic curiosity: a desire to learn and understand
- Empathic curiosity: an interest in the thoughts and feelings of others
We’re born with diversive curiosity, but it can mature into epistemic and empathic curiosity. These deeper, “more disciplined, and effortful” types of curiosity are the subject of this book. Let me apply these distinctions to data visualization. Someone who is engaged by superficial aspects of an infographic’s design—what I often refer to as flash and dazzle—is exercising diversive curiosity. Many infographics are designed to appeal solely to curiosity of this type. Someone who is drawn to an infographic to learn and understand its story is exercising epistemic curiosity. This person is drawn in by the clarity and importance of the information. If the infographic reveals something about a people or culture, empathic curiosity might draw the viewer in.
Here are a few quotes to whet your appetite:
We romanticize the curiosity of children because we love their innocence. But creativity doesn’t happen in a void. Successful innovators and artists effortfully amass vast stores of knowledge, which they can then draw on effortlessly. Having mastered the rules of their domain, they can concentrate on rewriting them.
A major concern of this book is that digital technologies are severing the link between effort and mental exploration. By making it easier for us to find answers, the Web threatens habits of deeper inquiry—habits that require patience and focused application.
Chris Anderson, the former editor of Wired, has made the extreme case for the potential of…[Big Data]. “Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is that they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.” Anderson thinks that when you amass Big Data, you no longer have to bother with the Big Why.
Curious? If so, I recommend that you read this book.
October 21st, 2014
In the past few years, several fine books have been written by neuroscientists. In this blog I’ve reviewed those that are most useful and placed Daniel Kahneman’s Thinking, Fast and Slow at the top of the heap. I’ve now found its worthy companion: The Organized Mind: Thinking Straight in the Age of Information Overload.
This new book by Daniel J. Levitin explains how our brains have evolved to process information and he applies this knowledge to several of the most important realms of life: our homes, our social connections, our time, our businesses, our decisions, and the education of our children. Knowing how our minds manage attention and memory, especially their limitations and the ways that we can offload and organize information to work around these limitations, is essential for anyone who works with data.
This excerpt from the introduction provides a sense of Levitin’s intention:
We humans have a long history of pursuing neural enhancement—ways to improve the brains that evolution gave us. We train them to become more dependable and efficient allies in helping us to achieve our goals…Through the sheer force of human ingenuity, we have devised systems to free our brains of clutter, to help us keep track of details that we can’t trust ourselves to remember. All of these innovations are designed either to improve the brain we have, or to off-load some of its functions to external sources…It’s helpful to understand that our modes of thinking and decision-making evolved over the tens of thousands of years that humans lived as hunter-gatherers. Our genes haven’t fully caught up with the demands of modern civilization, but fortunately human knowledge has—we now better understand how to overcome evolutionary limitations. This is the story of how humans have coped with information and organization from the beginning of civilization. It’s also the story of how the most successful members of society…have learned to maximize their creativity, and efficiency, by organizing their lives so that they spend less time on the mundane, and more time on the inspiring, comforting, and rewarding things in life.
Levitin describes the nature of our so-called information age, including the many ways that work done by information specialists in the past has been transferred to us (for example, making our own travel arrangements rather than relying on the services of a travel agent), resulting in overwhelming cognitive demands. He shows how the coping strategy of multi-tasking is in fact an inefficient form of serial tasking—attentional switching—that provides an illusory sense of productivity. Many of life’s challenges require focus—extended periods of uninterrupted attention. We also need the replenishment of neural energy that the mind-wandering mode provides, where associations and insights can form while the mind soars freely.
How we sort through incoming data in rapid triage to separate urgent matters from others, how we categorize and store it for later retrieval, how we protect ourselves from the din of distracting noise, and how we assess facts when making decisions, are all skills that we can learn. Levitin’s advice is practical, lucid, and firmly rooted in an understanding of the brain.
When organizations are chasing the latest so-called Big Data technologies, it’s important to recognize that more and faster isn’t necessarily better, and in fact is often worse, because of the ways that our brains are designed. If you’re involved in business intelligence, analytics, data visualization, data science, data storytelling, decision support, or whatever you choose to call the work of data sensemaking and communication, books like The Organized Mind are more useful to your success than a hundred books about specific software tools. Do yourself a favor and read this book.