Thanks for taking the time to read my thoughts about Visual Business
Intelligence. This blog provides me (and others on occasion) with a venue for ideas and opinions
that are either too urgent to wait for a full-blown article or too
limited in length, scope, or development to require the larger venue.
For a selection of articles, white papers, and books, please visit
April 30th, 2014
Yesterday, I read an article on the website of Scientific American titled “Saving Big Data from Big Mouths” by Cesar A. Hidalgo. As you know if you read this blog regularly, I have grave concerns about the hyperbolic claims of Big Data and believe that it is little more than a marketing campaign to sell expensive technology products and services. In his article, Dr. Hidalgo, who teaches in the MIT Media Lab, challenges several recent articles in prominent publications that criticize the claims of Big Data. The fact that he characterizes naysayers as “big mouths” clues you in to his perspective on the matter. In reading his article, I discovered that Dr. Hidalgo’s understanding of Big Data is limited, as is often the case with academics, and his position suffers from a fundamental problem with Big Data, which is that Big Data, as he defines it, doesn’t actually exist. When I say that it doesn’t exist, I’m arguing that Big Data, as he’s defined it, isn’t new or qualitatively different from data in the past. Big Data is just data.
I expressed my concerns to Dr. Hidalgo by posting the following comments in response to his article:
I’m one of the naysayers in response to the claims of so-called Big Data. I’m concerned primarily with the hype that leads organizations to waste money chasing new technologies rather than developing the skills that are needed to glean value from data. One of the fundamental problems with Big Data is the fact that no two people define it in the same way, so it is difficult to discuss it intelligently. In this article, you praised the benefits of Big Data, but did not define it. What do you mean by Big Data? How is Big Data different from other data? When did data become big? Are the means of gleaning value from so-called Big Data different from the means of gleaning value from data in general?
He was kind enough to respond with the following.
Dear Stephen Few,
These are all very good questions.
First, regarding the definition of big data:
As you probably know well, the term big data is used colloquially to refer mostly to digital traces of human activities. These include cell phone data, credit card records and social media activity. Big data is also used occasionally to refer to data generated by some scientific experiments (like CERN or genomic data), although this is not the most common use of the phrase so I will stick to the “digital traces of human activity” definition for now.
Beyond the colloquial definition, I have a working definition of big data that I use on occasion. To keep things simple I say that big data needs to be three times big, meaning that it needs to be big in size, resolution and scope. The size dimension is relatively obvious (data on 20 or 30 individuals is not the same as data on hundreds of thousands of them). The resolution dimension is better explained by an example. Consider having credit card data on a million people. A low resolution version of this dataset would consist only of the total yearly expenditure of each individual. A high resolution version would include information on when and where the purchases were made. In this example, it is the resolution of the data that allows us to use it to study, for instance, the mobility of this particular group of people (notice that I am not generalizing to the general population, since this subpopulation might be worthy of study on its own). Finally, I require data to be big in scope. By this, I require data to be useful for applications other than the ones for which it was originally collected. For instance, mobile phone records are used by operators for billing purposes, but could be used to forecast traffic or to identify the location of mobile phone users prior to a natural disaster (and use this information to help speed up search and rescue operations). When data is big in size, resolution and scope, I am comfortable saying that it is big.
Second: Your question about when big data became big.
This is an interesting question because it points to the evolution of language. During the last decade the word data has begot at least two children: “metadata” and “big data”. The word metadata grew in popularity in the wake of the NSA scandal, as people needed to differentiate between the content of messages and their metadata. Big data, on the other hand, emerged as people searched for a short way to refer to the digital traces of human activity that were collected for operational purposes by service providers serving large populations, and that could be used for purposes that were beyond those for which the data was originally collected. Certainly, the phrase “big data” provides an economy of language, and as someone that enjoys writing I always appreciate that.
Regarding the time at which this transition happened, I remember that when I started working with mobile phone data (in 2004) people were not using the term big data. As more people entered the field, the term began to gain force (around 2008). With the financial crisis, the hype of big data entered full swing, as many framed big data as the new asset, or technology, that could save the economy. I guess at that time, everyone wanted to believe them :-).
Third and Final: You ask whether the means of gleaning value from big data include the methods used to glean value from data in general.
In short, the answer is yes (it is data after all). Multivariate regressions in all of their forms and specifications are still useful and welcome (I use them often). Yet, these new datasets have also stimulated the proliferation of some additional techniques. For instance, visualizations have progressed enormously during recent years since exploring these datasets is not easy, and large datasets involve more exploration. As an example, check dataviva.info. This site makes available more than 100 million visualizations to help people explore Brazil’s formal sector economy. By taking different combinations of visualizations it is possible to weave stories about industries and locations. An example of these stories for a related project, The Observatory of Economic Complexity (atlas.media.mit.edu), can be seen in this video (http://vimeo.com/40565955). Here you will be able to see how these visualization techniques allow people to quickly compose stories about a topic.
Finally, it is worth noting that different people might mean different things when they refer to gleaning value from data. For some people, this might involve explaining the mechanisms that gave rise to the observed patterns, or using the data to learn about an aspect of the world. This is a common approach in the social sciences. For other people value might emerge from predictions that are not cognitively penetrable but nevertheless accurate, such as the ones people obtain with different machine learning techniques, such as neural networks or those based on abstract features. The latter of these approaches, which is often used by computer scientists, can be very useful for sites that require accurate predictions, such as Netflix or Amazon. Here, the value is certainly more commercial, but is also a valid answer to the question.
I hope these answers help clarify your questions.
All the best
I wanted to respond in kind, but for some unknown reason the Scientific American website is rejecting my comments, so I’ll continue the discussion here in my own blog.
Your response regarding the definition of Big Data demonstrates the problem that I’m trying to expose: Big Data has not been defined in a manner that lends itself to intelligent discussion. Your definition does not at all represent a generally accepted definition of Big Data. It is possible that the naysayers with whom you disagree define Big Data differently than you do. I’ve observed a great many false promises and much wasted effort in the name of Big Data. Unless you’re involved with a broad audience of people who work with data in organizations of all sorts (not just academia), you might not be aware of some of the problems that exist with Big Data.
Your working definition of Big Data is somewhat similar to the popular definition involving the 3 Vs (volume, velocity, and variety) that is often cited. The problem with the 3 Vs and your “size, resolution, and scope” definition is that they define Big Data in a way that could be applied to the data that I worked with when I began my career 30 years ago. Back then I routinely worked with data that was big in size (a.k.a., volume), detailed in resolution, and useful for purposes other than that for which it was originally generated. By defining Big Data as you have, you are supporting the case that I’ve been making for years that Big Data has always existed and therefore doesn’t deserve a new name.
I don’t agree that the term Big Data emerged as a “way to refer to digital traces of human activity that were collected for operational purposes by service providers serving large populations, and that could be used for purposes that were beyond those for which the data was originally collected.” What you’ve described has been going on for many years. In the past we called it data, with no need for the new term “Big Data.” What I’ve observed is that the term Big Data emerged as a marketing campaign by technology vendors and those who support them (e.g., large analyst firms such as Gartner) to promote sales. Every few years vendors come up with a new name for the same thing. Thirty years ago, we called it decision support. Not long after that we called it data warehousing. Later, the term business intelligence came into vogue. Since then we’ve been subjected to marketing campaigns associated with analytics and data science. These campaigns keep organizations chasing the latest technologies, believing that they’re new and necessary, which is rarely the case. All the while, they never slow down long enough to develop the basic skills of data sensemaking.
When you talk about data visualization, you’re venturing into territory that I know well. It is definitely not true that data visualization has “progressed enormously during recent years.” As a leading practitioner in the field, I am painfully aware that progress in data visualization has been slow and, in actual practice, is taking two steps backwards, repeating past mistakes, for every useful step forwards.
What various people and organizations value from data certainly differs, as you’ve said. The question that I asked, however, is whether or not the means of gleaning value from data, regardless of what we deem valuable, are significantly different from the past. I believe that the answer is “No.” While it is true that we are always making gradual progress in the development of analytical techniques and technologies, what we do today is largely the same as what we did when I first began my work in the field 30 years ago. Little has changed, and what has changed is an extension of the past, not a revolutionary or qualitative departure.
I hope that Dr. Hidalgo will continue our discussion here and that many of you will contribute as well.
March 28th, 2014
All men are designers. All that we do, almost all the time, is design, for design is basic to all human activity. The planning and patterning of any act towards a desired, foreseeable end constitutes the design process. Any attempt to separate design, to make it a thing-by-itself, works counter to the inherent value of design as the primary underlying matrix of life. Design is the conscious effort to impose meaningful order.
Mankind is unique among animals in its relationship to the environment. All other animals adapt themselves to a changing environment (by growing thicker fur in the winter, or evolving into a totally new species over a half-million-year cycle); only mankind transforms earth itself to suit its needs and wants. This job of form-giving and reshaping has become the designer’s responsibility. A hundred years ago, if a new chair, carriage, kettle, or pair of shoes was needed, the consumer went to the craftsman, stated his wants, and the article was made for him. Today the myriad objects of daily use are mass-produced to a utilitarian and aesthetic standard often completely unrelated to the consumer’s need. At this point Madison Avenue must be brought in to make these objects desirable or even palatable to the mass consumer.
In an age of mass production when everything must be planned and designed, design has become the most powerful tool with which man shapes his tools and environments (and, by extension, society and himself). This demands high social and moral responsibility from the designer. It also demands greater understanding of the people by those who practice design and more insight into the design process by the public.
Design must become an innovative, highly creative, cross-disciplinary tool responsive to the true needs of men. It must be more research-oriented, and we must stop defiling the earth itself with poorly-designed objects and structures.
“Should I design it to be functional or to be aesthetically pleasing?” This is the most heard, the most understandable, and the most mixed-up question in design today. “Do you want it to look good, or to work?” Barricades erected between what are really just two of the many aspects of function. It is all quite simple: aesthetic value is an inherent part of function.
The response of many designers has been like that so unsuccessfully practiced by Hollywood: the public has been pictured as totally unsophisticated, possessed of neither taste nor discrimination. A picture emerges of a moral weakling with an IQ of about 70, ready to accept whatever specious values the unholy trinity of Motivation Research, Market Analysis, and Sales have decided is good for him.
The cancerous growth of the creative individual expressing himself egocentrically at the expense of spectator and/or consumer has spread from the arts, overrun most of the crafts, and finally reached even into design. No longer does the artist, craftsman, or in some cases the designer operate with the good of the consumer in mind; rather, many creative statements have become highly individualistic, auto-therapeutic little comments by the artist to himself. With new processes and an endless list of new materials at his disposal, the artist, craftsman, and designer now suffers from the tyranny of absolute choice. When everything becomes possible, when all the limitations are gone, design and art can easily become a never-ending search for novelty, and the desire for novelty on the part of the artist becomes an equally strong desire for novelty on the part of the spectator and consumer, until newness-for-the-sake-of-newness becomes the only measure.
To “sex-up” objects (designers’ jargon for making things more attractive to mythical consumers) makes no sense in a world in which basic need for design is very real. In an age that seems to be mastering aspects of form, a return to content is long overdue. Designing for the people’s needs rather than for their wants, or artificially created wants, is the only meaningful direction now.
None of the words above are mine, despite the fact that they reflect my thinking and values perfectly. These words were written in 1971 by the designer/teacher Victor Papanek, whose work I only recently discovered. You can read them yourself in Papanek’s important and thoughtful book entitled Design for the Real World (Academy Chicago Publishers). This is a true classic that all designers should read, especially those of us who design information displays.
Our designs affect the world for good or ill. We choose to either take responsibility by presenting information effectively or to do harm by presenting information in ways that lead to impoverished and erroneous thinking. If you choose the former, you owe it to yourself (and others) to read Design for the Real World for inspiration, direction, and an invitation to make a difference.
February 11th, 2014
I have an extensive library of books about data visualization. I purchase most of the books that are published on the subject, but only occasionally do I find one that exhibits sound thinking and deep understanding. The new book titled Data Insights: New Ways to Visualize and Make Sense of Data, by Hunter Whitney, is one of those rare exceptions.
Unlike most authors of books about data visualization or the broader topic of business intelligence, Whitney thinks clearly and writes beautifully. The two go hand in hand. Because data visualization has soared to great popularity in the last 10 years, publishers now give book contracts to anyone they believe will command an audience, with no concern whatsoever for the author’s expertise or the soundness of their content. If you have a blog with an adequate number of monthly visitors, you can find a publisher willing to sign you as an author. (The recently released book by Wiley Press, Data Visualization for Dummies, written by the independent SAP consultant Mico Yuk, is an example of this.) Gone are the days when publishers had standards for content and they approved book proposals on the basis of more than revenue projections. Fortunately, in the midst of the growing noise, signals in the form of a few good books still get published from time to time.
Whitney has developed deep expertise in data visualization in particular and user interface design in general over many years of experience. He’s smart, informed, and a skilled communicator. His knowledge is interdisciplinary, including neuroscience, which enables the broad perspective that’s needed for insightful data sensemaking. Unlike many who claim expertise in this arena, he understands the proper uses, limitations, and opportunities of technologies, as well as the needs, weaknesses, and strengths of humans. His knowledge has been forged through the process of building solutions to real problems.
Data Insights is not a “how to” book in the sense of comprehensive instruction in the principles and practices of data visualization, but more of a philosophical grounding in the thinking that a data sensemaker must have to make the journey from data to insights. Here are the two opening paragraphs of the book’s first chapter:
From our latest purchase decisions to global population trends, data of all kinds are increasingly swept up and carried along into ever-expanding streams. The surging flows are often so fast, and the volume so massive they can overwhelm people’s capacities to distill the essential elements, derive meanings, and gain insights. We invent tools to solve problems, accomplish tasks, and augment our abilities. We’ve devised instruments to see distant stars and view subatomic particles; now, people are creating new ways to peer at multiple layers of data that otherwise would be invisible to us. Visualizations offer a way to extend and enhance our innate powers of perception and cognition and get a “grander view” of the world around us.
However, no matter how necessary these visual representations might be or how reliant we’ve become on them, they don’t tell the complete story. The processes that go into making the visualization, the parts you don’t typically see, are still key components of the picture. The more you know about what goes into making a visualization, as well as its relative strengths and weaknesses, the more effective a tool it can be. Technology enables us to interact with data on more levels to accomplish objectives ranging from completing simple day-to-day tasks to solving long-term, seemingly intractable problems. Visualizations help us transcend the jumbles of data, allowing us to see more of the stories life has to tell.
Whitney takes the time that’s necessary to express his ideas clearly and fully. He doesn’t write like a PowerPoint presentation of bullet points, but weaves together the full set of words and images that are necessary to communicate his thoughts effectively. Not only does Whitney write well, he’s also designed a book that supports the principles of design that he advocates. The book is beautiful.
Whitney and I don’t see eye to eye on every single detail, but our disagreements seem inconsequential. If the publisher had asked me to review this book in advance, I would have suggested few improvements. Only two recommendations come to mind offhand. First, Whitney interviewed prominent representatives of a few data visualization software vendors and included their thoughts in the book, which were often insightful, but in a few cases the examples of their products that were shown exhibited design problems that actually conflicted with Whitney’s recommendations. This underscores the fact that few vendors in the space do a good job of supporting effective data visualization practices. Second, many of the images that appear in the book as illustrations don’t add enough value to justify the space that they use. They’re examples of images that one might show on PowerPoint slides during a presentation just to have something relevant on the screen while covering particular content that doesn’t really need a visual, rather than temporarily blanking the screen. This causes little harm in a live presentation, but wastes paper and inflates printing costs in a book. No other recommendations than these two are worth mentioning.
This is a fine book that adds real value to the field of data visualization. I recommend it highly to anyone who wants to become a skilled and effective practitioner.
January 10th, 2014
This blog entry was written by Bryan Pierce of Perceptual Edge.
The following 3-D treemap was brought to our attention by a participant in our discussion forum.
This graph was selected by Bill Gates to be included in a recent edition of Wired Magazine that he guest edited. He explained why he included the graph as follows:
I love this graph because it shows that while the number of people dying from communicable diseases is still far too high, those numbers continue to come down. In fact, fewer kids are dying, more kids are going to school and more diseases are on their way to being eliminated. But there remains much to do to cut down the deaths in that yellow block even more dramatically. We have the solutions. But we need to keep up the support where they’re being deployed, and pressure to get them into places where they’re desperately needed.
This is an important message and a noble goal. But how well does the graph above tell this story? Not very well, actually.
A treemap is a space-filling graph that uses the size of rectangles to encode one quantitative variable and color intensity to encode a second. This treemap was created by Thomas Porostocky to display worldwide years of life lost by cause using data from the University of Washington’s Institute for Health Metrics and Evaluation database.
Let’s see what we can learn from this graph. First, we notice that the green section representing injuries is significantly smaller than the other two, but the relative sizes of the other two sections are difficult to judge. Next, we see that the rectangles in the yellow section are mostly light yellow. If we check the color scale at the bottom, it shows us that most of the diseases in that section are changing at an annual rate of between -2% and -3%. We can also see the names on the larger rectangles that represent the causes responsible for more years of life lost (e.g., Malaria), and get a sense of their relative sizes based on their areas, but again, we can’t compare them with any accuracy. Treemaps were invented by Ben Shneiderman as a means to display part-to-whole relationships between huge numbers of values: data sets that are too large to display using graphs that can be more easily and accurately read, such as bar graphs. Only with a huge set of values would it make sense to rely on the areas of rectangles and the intensities of their colors to represent values, given the fact that our brains cannot interpret these attributes of visual perception easily and accurately.
The 3-D effect that’s been added to the treemap doesn’t provide us any information and makes the treemap harder to decode. One problem introduced by this effect involves the darkened colors that appear on the sides of the treemap to represent shadows, which are meaningless and misleading. 3-D graphs are rarely a good idea, but this 3-D is completely gratuitous.
If a treemap had been the best way to show this information, it would have been better to separate the three major sections using borders rather than different colors. Then a single diverging color scale could have been used for the whole treemap. For instance, negative values could have been varying shades of red, values near zero could have been gray, and positive values could have been varying shades of blue. This would have made it significantly easier to decode the values—especially the values near zero, representing little change—than the current design that uses three different sequential color scales.
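As a rough sketch of the single diverging scale described above, the following Python function maps an annual percentage change to a color: shades of red for negative values, gray near zero, and shades of blue for positive values. The function name, the RGB endpoint colors, and the default limit of 3% are assumptions chosen for illustration; they are not taken from the original treemap.

```python
# Hypothetical endpoint colors (RGB, 0-255); any red/gray/blue triple would do.
RED = (178, 24, 43)
GRAY = (190, 190, 190)
BLUE = (33, 102, 172)


def diverging_color(change_pct, limit=3.0):
    """Map a percentage change to an RGB color on a red-gray-blue scale.

    Values at or beyond -limit are fully red, values at or beyond +limit
    are fully blue, and zero is gray. `limit` is an assumed scale bound.
    """
    # Clamp to the scale so out-of-range values get the endpoint color.
    clamped = max(-limit, min(limit, change_pct))
    # Fraction of the way from gray (0.0) toward the endpoint (1.0).
    t = abs(clamped) / limit
    end = RED if clamped < 0 else BLUE
    # Interpolate each channel linearly between gray and the endpoint.
    return tuple(round(g + (e - g) * t) for g, e in zip(GRAY, end))


print(diverging_color(0.0))   # gray
print(diverging_color(-3.0))  # full red
print(diverging_color(3.0))   # full blue
```

Because values are clamped, any change larger than the limit is drawn in the same color as the limit itself, so a scale like this still needs bounds wide enough to cover the data.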
There is another problem with the treemap, though it’s not apparent unless you look at the underlying data. The color scale in the treemap shows annual percentage changes ranging from -3% to +3%. However, some of the items in the treemap changed by larger amounts than this. For instance, between 2005 and 2010 the years of life lost per 100,000 people to malaria decreased by 23.80%, which is an average annual reduction of 4.76% (23.80% divided by five years). This is a great improvement, but this outlier is completely lost when viewing the treemap, which shows malaria as one of the many infectious diseases that decreased by between 2% and 3% annually.
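The arithmetic behind the annualized malaria figure can be checked with a short Python sketch. It assumes the annual rate was obtained by spreading the five-year change evenly across the years, which is consistent with the numbers above (23.80 / 5 = 4.76). The compound-rate function is included only for comparison; nothing in the original graph indicates that it was used. Both function names are hypothetical.

```python
def simple_annual_rate(total_pct, years):
    """Spread a total percentage change evenly across the years."""
    return total_pct / years


def compound_annual_rate(total_pct, years):
    """Constant yearly percentage change that compounds to the total."""
    growth = 1 + total_pct / 100  # e.g. -23.80% becomes a factor of 0.762
    return (growth ** (1 / years) - 1) * 100


total_change = -23.80  # malaria, 2005 to 2010, from the text above
years = 5

simple = simple_annual_rate(total_change, years)
compound = compound_annual_rate(total_change, years)
print(f"simple: {simple:.2f}%  compound: {compound:.2f}%")

# Either way, the annual change falls outside a -3% to +3% color scale.
print(simple < -3.0 and compound < -3.0)
```

Whichever convention is used, the result lies beyond the lower bound of the color scale, which is why the outlier cannot be distinguished in the treemap.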
The information that appears in the treemap can be easily shown in two side-by-side bar graphs in a way that tells the story clearly and accurately and is just as visually engaging without resorting to gimmickry. In fact, by using a third variable to display information about the death rate for each cause, instead of solely showing the information in terms of years of life lost, the story can be enriched to give a clearer picture of the world. Here is our redesign:
The bar graph on the left shows the years of life lost per 100,000 people in 2010 for each cause, which is the information encoded by the areas of the rectangles in the original treemap. The bars have been ranked and color coded to make it easy to compare causes of death. The years of life lost to each cause as percentages of the whole are also shown in the column of text, just to the left of the bars.
The bar graph in the center shows the percentage change between the years of life lost per 100,000 people in 2005 and 2010 for each of the causes. Unlike the original graph, we’re showing the total percentage change between those years, rather than an annualized version.
The bar graph on the right displays information that’s not shown in the original treemap: the death rate per 100,000 people for each cause. The fact that this information can be viewed together with the years of life lost information is useful and we’ll examine it in more detail a little later.
You might notice that our bar graphs include fewer items than the original treemap. The original treemap contains a little over 100 rectangles, many of which are unlabeled. We had access to the original dataset, so we could have made bar graphs that included items for each individual disease, but dozens of tiny bars would have undermined the core story, so we aggregated the data into useful categories. For instance, we aggregated all types of cancer into a single “Cancer” bar and all types of heart disease into a single “Heart disease” bar. Also, for items that contributed less than 1% of total deaths, if they couldn’t already be aggregated into an obvious category like cancer, we moved them into an “Other” category. For instance, deaths from diphtheria are included in the “Other communicable diseases (including meningitis and hepatitis)” bar. In cases when access to these lower-level details is important, a table containing all individual causes of death could be included to provide this information.
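The aggregation rule described above can be sketched in Python. The cause names, category assignments, and death counts below are invented placeholders, not values from the dataset, and the function name is hypothetical. Only the 1% threshold and the idea of rolling small, uncategorized causes into an “Other” bar come from the description above.

```python
from collections import defaultdict


def aggregate_causes(deaths_by_cause, category_of, threshold=0.01):
    """Roll individual causes up into categories for a bar graph.

    deaths_by_cause: dict of cause -> death count
    category_of: dict of cause -> obvious category (e.g. "Cancer")
    Causes without a category that fall below `threshold` of total
    deaths are moved into "Other"; the rest keep their own bar.
    """
    total = sum(deaths_by_cause.values())
    bars = defaultdict(float)
    for cause, deaths in deaths_by_cause.items():
        if cause in category_of:
            bars[category_of[cause]] += deaths
        elif deaths / total < threshold:
            bars["Other"] += deaths
        else:
            bars[cause] += deaths
    return dict(bars)


# Placeholder numbers for illustration only (they sum to 1,000).
deaths = {"Lung cancer": 150, "Stomach cancer": 80, "Malaria": 120,
          "Diphtheria": 1, "Stroke": 649}
categories = {"Lung cancer": "Cancer", "Stomach cancer": "Cancer"}

bars = aggregate_causes(deaths, categories)
print(bars)
```

In this sketch a single “Other” bucket stands in for the more specific “Other” bars used in the redesign, such as the one for other communicable diseases.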
Notice how much easier it is to interpret the values represented by the bars than it was to decode the rectangle sizes and color intensities in the treemap. The fact that fewer years of life are being lost to communicable, maternal, neonatal, and nutritional disorders, represented by the gray bars, is immediately obvious, because all of the gray bars are showing decreases (negative values) in the center graph. By placing the years of life lost rate and the death rate for each cause in close proximity to one another, it’s easy to find discrepancies between their patterns, which can be informative. For instance, most of the gray bars have relatively short death rate bars, in comparison to the bars that represent years of life lost. This is because many of the gray bars represent diseases or issues that tend to kill children, so each death results in many years of life lost. For instance, on average, each death from malaria robs someone of 67.2 years of estimated life. Conversely, the three largest brown bars, “Cancer,” “Heart disease,” and “Stroke,” all represent things that tend to kill older people, so each death has a relatively lower impact on years of life lost. For instance, each death from heart disease, on average, is responsible for an estimated 17.5 years of life lost.
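A per-death figure like the ones quoted above can be obtained by dividing the two rates shown in the redesign: years of life lost per 100,000 people divided by deaths per 100,000 people. Here is a minimal Python sketch of that calculation. The input rates are placeholders chosen only to reproduce the 67.2 and 17.5 figures; they are not the actual values from the dataset, and the function name is hypothetical.

```python
def years_lost_per_death(yll_rate, death_rate):
    """Average years of life lost per death.

    Both rates must share a denominator (here, per 100,000 people),
    which cancels out in the division.
    """
    if death_rate <= 0:
        raise ValueError("death_rate must be positive")
    return yll_rate / death_rate


# Placeholder rates per 100,000 people, for illustration only.
malaria = years_lost_per_death(yll_rate=1344.0, death_rate=20.0)
heart_disease = years_lost_per_death(yll_rate=1750.0, death_rate=100.0)
print(round(malaria, 1), round(heart_disease, 1))
```

A cause with a short death rate bar and a long years of life lost bar yields a high ratio, which is the pattern visible in the gray bars.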
By using bar graphs, we’ve made it easier to interpret and compare the data, so that it’s easy to focus on the stories contained in the data, rather than struggling to decode an inappropriate and ineffectively designed display.
January 2nd, 2014
I was first introduced to science fiction literature as a young man in undergraduate school. The first science fiction novel that I read was Stranger in a Strange Land by Robert Heinlein. This book was not typical of the genre in many respects—some called it the Bible of the 60’s counter-culture movement—but I loved it and eventually came to love science fiction in most of its sub-genres. Heinlein became one of my early favorites among science fiction authors. It is the memory of his novel Time Enough for Love that links my love of the genre to the topic of this blog article. The protagonist of the novel, Lazarus Long, who possesses genetics for longevity, was born on Earth in the 20th century and then lived for hundreds of years on many planets as a space traveler. The gift of a long life gave him time enough for love. The moral of the tale is simple: some things take time. Unlike Lazarus Long, none of us have been granted lives that will span many centuries, so what we learn to value in life must come, not through an extension of time, but through a better use of the time that we’re allotted.
Much of what fulfills me, both personally and professionally, exists in the realm of ideas. Thinking and what it produces are important to me, and as such, thinking deserves time. Some of my best thinking takes place while I’m driving long distances or taking long walks. Why these occasions in particular? Because they provide extended periods of activity that demands little attention, so my mind can concentrate on thinking. As I drive or walk, after a few minutes my mind has time enough for thinking. Insights in the form of connections, useful metaphors, changes in the ways ideas are organized, or simpler explanations of complex concepts rise to the surface of my consciousness long enough to be noticed and examined. I sometimes record these thoughts on my phone’s digital recorder app or I pull off to the side of the road and write them down.
There are many different ways to clear the stage of our busy lives for thoughts to emerge from the wings of consciousness and dance for us. Contemplation can be enabled in various ways, but the time and attention that it requires doesn’t occur naturally in our modern lives. In a time of perpetual connection to computer-mediated data, we must create an oasis of opportunity for thinking by occasionally disconnecting. This takes conscious effort. The futurist, Alex Soojung-Kim Pang, refers to this practice as contemplative computing in his book The Distraction Addiction.
Like me, Pang is a technologist who recognizes technology’s limitations and its potential for both good and ill. He promotes relationships with technologies that are thoughtfully constrained and suggests specific techniques for carving out time and space in the midst of a noisy world for contemplation. Here’s a sample of his advice:
The ability to pay attention, to control the content of your consciousness, is critical to a good life. This explains why perpetual distraction is such a big problem. When you’re constantly interrupted by external things—the phone, texts, people with “just one quick question,” clients, children—by self-generated interruptions, or by your own efforts to multitask and juggle several tasks at once, the chronic distractions erode your sense of having control in your life. They don’t just derail your train of thought. They make you lose yourself.
He not only teaches ways to reduce distractions caused by technologies, but also how to use technologies in ways that enable greater focus on the tasks that they’re meant to support. Much of what he suggests falls into the category of uncommon common sense: ideas that seem obvious and sensible once they’re pointed out, but are seldom noticed until they are. If you’re looking for help in making time enough for thinking—something our world desperately needs and increasingly lacks—I suggest that you read The Distraction Addiction by Alex Soojung-Kim Pang.