Big Dataclast: My Concerns about Dataclysm
If you’re familiar with my work, you know that I am an iconoclast within the business intelligence (BI) and analytics communities, refusing to join the drunkard’s party of hyperbolic praise for information technologies. The “clast” portion of the term “iconoclast” means “to break.” I often break away from the herd, and break the mold of convention, to say what I believe is true in the face of misinformation. My opinion that so-called Big Data is nothing more than marketing hype is a prime example of this. Despite my opinion of Big Data, I approached Christian Rudder’s book Dataclysm with great interest. The following excerpt from the book’s opening paragraph gave me hope that this was not just another example of marketing hype.
You have by now heard a lot about Big Data: the vast potential, the ominous consequences, the paradigm-destroying new paradigm it portends for mankind and his ever-loving websites. The mind reels, as if struck by a very dull object. So I don’t come here with more hype or reportage on the data phenomenon. I come with the thing itself: the data, phenomenon stripped away. I come with a large store of the actual information that’s being collected, which luck, work, wheedling, and more luck have put me in the unique position to possess and analyze.
What can this large store of actual information reveal?
Practically by accident, digital data can now show us how we fight, how we love, how we age, who we are, and how we’re changing. All we have to do is look: from just a very slight remove, the data reveals how people behave when they think no one is watching.
You can imagine my excitement upon reading these words. Actual insights gleaned from large data stores; a demonstration of data’s merits rather than confusing, hyperbolic claims. What is this “phenomenon,” however, that Rudder has “stripped away” from the data? To my great disappointment, I found that the context that’s required to gain real insights was often stripped away. Only a few pages into the book I already found myself stumbling over some of Rudder’s assumptions and practices. I enjoyed his casual, hip voice, and I greatly appreciated the excellent, no-nonsense design of his graphs (excluding the silly Voronoi tessellation treemap), but some of his beliefs about data sensemaking gave me pause.
Rudder explains that he is trying to solve a real problem with many of behavioral science’s findings. Experimental research studies typically involve groups of test subjects that are not only too small for trustworthy results but also cannot be generalized because they consist almost entirely of homogeneous sets of college students.
I understand how it happens: in person, getting a real representative data set is often more difficult than the actual experiment you’d like to perform. You’re a professor or postdoc who wants to push forward, so you take what’s called a “convenience sample,” and that means the students at your university. But it’s a big problem, especially when you’re researching belief and behavior. It even has a name: It’s called WEIRD research: white, educated, industrialized, rich, and democratic. And most published social research papers are WEIRD.
Rudder’s concern is legitimate. His solution, however, is lacking.
Rudder is a co-founder of the online dating service OKCupid. As such, he has access to an enormous amount of data that is generated by the choices that customers make while seeking romantic connections. Add to this the additional data that he’s collected from other social media sites, such as Facebook and Twitter, and he has a huge data set. Even though the people who use these social media sites are more demographically diverse than WEIRD college students, they don’t represent society as a whole. Derek Ruths of McGill University and Jürgen Pfeffer of Carnegie Mellon University recently expressed this concern in an article titled “Social Media for Large Studies of Behavior,” published in the November 28, 2014 issue of Science. Also, the conditions under which the data was collected exercise a great deal of influence, but Rudder has “stripped away” most of this context.
Contrary to his disclaimers about Big Data hype, Rudder expresses some hype of his own. Social media Big Data opens the door to a “poetry…of understanding. We are at the cusp of momentous change in the study of human communication.” He believes that the words people write on these sites provide the best source of information to date about the state and nature of human communication. I believe, however, that this data source reveals less than Rudder’s optimistic assessment. I suspect that it mostly reveals what people tend to say and how they tend to communicate on these particular social media sites, which support specific purposes and tend to be influenced by technological limitations, some imposed (e.g., Twitter’s 140-character limit) and others a by-product of the input device (e.g., the tiny keyboard of a smartphone). We can certainly study the effects that these technological limitations have on language, or the way in which anonymity invites offensive behavior, but are we really on the “cusp of momentous change in the study of human communication”? To derive useful insights from social media data, we’ll need to apply the rigor of science to our analyses just as we do with other data sources.
Rudder asserts:
Behind every bit in my data, there are two people, the actor and the acted upon, and the fact that we can see each as equals in the process is new. If there is a “-clysm” part of the whole data thing, if this book’s title isn’t more than just a semi-clever pun or accident of the alphabet, then this is it. It allows us to see the full human experience at once, not just whatever side we happen to be paying attention to at a given time.
Having read the book, I found that the book’s title is only a “semi-clever pun.” Contrary to Rudder’s claim, his data does not “allow us to see the full human experience at once.” In fact, it provides a narrow window through which we can observe anonymous interactions between strangers in particular contexts that are designed for specific purposes (e.g., getting a date). Many of the findings that Rudder presents are fun and interesting, but we should take them with a grain of salt.
Fairly early in the book, Rudder presents his findings about women’s preferences in men and also about men’s preferences in women, but it isn’t clear what the data actually means because he’s stripped it of context. For example, when describing women’s age preferences (“the age of men she finds most attractive”), he fails to mention the specific data he based his findings on and the context in which it was collected. Were women being shown a series of photographs, two men at a time, and asked to select the one they found more attractive of the two; was the data based solely on the ages of the men that women contacted in hope of a date; or was the data drawn from some other context? Scientists must describe the designs of their studies, including the specific conditions under which they collected their data. Without understanding this context, we can’t understand the findings and certainly can’t rely on them.
Later in the book, Rudder presents findings about the number of messages people receive on OKCupid in correlation to their physical attractiveness. His graph of his findings displays a positive correlation between the two variables that remains steady through most of the series but suddenly increases to an exponential relationship beginning around the 90th percentile of attractiveness. He sliced and diced his findings in several ways, but never provided a fundamental and important piece of information: how was physical attractiveness measured? Important insights might reside in this data, but we can’t trust them without the missing context.
Rudder seems to be endorsing a typical tenet of Big Data hype that concerns me, which I’ll paraphrase as, “With Big Data we no longer need to adhere to the basic principles of science.” I applaud Rudder’s efforts to expose bad science in the form of small, demographically homogeneous groups of test subjects, but his alternative introduces its own set of problems, which are just as harmful. I suspect that Rudder endorses this particular alternative because it is convenient. He’s a co-founder of a social media site that collects tons of data. It’s in his interest to enthusiastically endorse this Big Data approach. I trust that Rudder’s conscious intentions are pure, but I believe that his perspective is biased by his role, experience, and interests.
Sourcing data from the wild rather than from controlled experiments in the lab has always been an important avenue of scientific study. These studies are observational rather than experimental. When we do this, we must carefully consider the many conditions that might affect the behavior that we’re observing. From these observations, we carefully form hypotheses, and then we test them, if possible, in controlled experiments. Large social media data sets don’t alleviate the need for this careful approach. I’m not saying that large stores of social media data are useless. Rather, I’m saying that if we’re going to call what we do with it data science, let’s make sure that we adhere to the principles and practices of science. How many of the people who call themselves “data scientists” on resumes today have actually been trained in science? I don’t know the answer, but I suspect that it’s relatively few, just as most of those who call themselves “data analysts” of some type or other have not been trained in data analysis. No matter how large the data source, scientific study requires rigor. This need is not diminished in the least by data volume. Social media data may be able to reveal aspects of human behavior that would be difficult to observe in any other way. We should take advantage of this. However, we mustn’t treat social media data as magical, nor analyze it with less rigor than other sources of data. It is just data. It is abundantly available, but it’s still just data.
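The point that data volume does not substitute for rigor can be illustrated with a small simulation. This is only a sketch under invented assumptions: the population sizes, the measured attribute, and the two distributions are made up, and nothing here comes from Rudder's data. It compares a very large sample drawn only from hypothetical site users with a small random sample drawn from a whole hypothetical population.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 site users and 40,000 non-users whose
# values for some measured attribute differ on average. All numbers are made up.
users = [rng.gauss(30, 5) for _ in range(10_000)]
non_users = [rng.gauss(45, 10) for _ in range(40_000)]
population = users + non_users

population_mean = statistics.fmean(population)

# "Big" sample: every site user, but nobody else.
big_biased_mean = statistics.fmean(users)

# Small sample: 200 people drawn at random from the whole population.
small_random_mean = statistics.fmean(rng.sample(population, 200))

# How far each estimate lands from the true population mean.
big_biased_error = abs(big_biased_mean - population_mean)
small_random_error = abs(small_random_mean - population_mean)

print(f"population mean:            {population_mean:.1f}")
print(f"10,000 site users only:     {big_biased_mean:.1f}")
print(f"200 random people:          {small_random_mean:.1f}")
```

Under these assumptions, the 10,000-person sample misses the population mean by a wide margin because of who is in it, while the 200-person random sample lands much closer. Adding more site users would not shrink that gap.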
In one example of insights drawn from social media data, Rudder lists words and phrases that are commonly used by particular groups of people but aren’t commonly used by other groups. He also lists the opposite: words that are least often used by particular groups but are commonly used by other groups. His primary example features the words and comments of men among the following four ethnic groups: white, black, Latino, and Asian. Here are the top ten words and phrases that Rudder listed for white men:
my blue eyes
blonde hair
ween
brown hair
hunting and fishing
Allman brothers
woodworking
campfire
redneck
dropkick murphys
I’m a white man and I must admit that this list does not seem to capture the essence of white men in general. I found the lists interesting, when considered in context, but far less revealing than Rudder claimed. Here’s an example of the “broad trends” that Rudder discerned from this approach to data analysis:
White people differentiate themselves mostly by their hair and eyes. Asians by their country of origin. Latinos by their music.
One aspect of the data that Rudder should have emphasized more is that it was derived from people’s responses to OKCupid profile questions. This means that we shouldn’t claim anything meaningful about these words and phrases apart from the context of self-description when trying to get a date. Another more fundamental problem is that by limiting the list to words and phrases that were used uniquely by particular groups, the list fails to represent the ways in which these groups view themselves overall. In other words, if the top words and phrases used by each of these groups were listed without filtering them on the basis of uniqueness (i.e., little use by other ethnic groups), very different self-descriptions of these groups would emerge.
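To make the effect of the uniqueness filter concrete, here is a minimal sketch in Python. The words, the group names, and the scoring rule (a group's count for a word divided by one plus its count in all other groups) are assumptions invented for illustration. They are not Rudder's data or his actual method.

```python
from collections import Counter

# Hypothetical toy data: words from made-up profiles for two made-up groups.
profiles = {
    "group_a": ["music"] * 4 + ["travel"] * 3 + ["movies"] * 2 + ["campfire"],
    "group_b": ["music"] * 4 + ["travel"] * 3 + ["movies"] * 2 + ["soccer"],
}

def top_by_frequency(group, n):
    """Most-used words within one group, ignoring every other group."""
    counts = Counter(profiles[group])
    # Highest count first; ties broken alphabetically so the order is stable.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:n]

def top_by_distinctiveness(group, n):
    """Words this group uses far more than the other groups do.

    Score = count in this group / (1 + count in all other groups).
    This scoring rule is an assumption made for the sketch.
    """
    own = Counter(profiles[group])
    others = Counter(
        w for g, words in profiles.items() if g != group for w in words
    )
    # Highest score first; ties broken alphabetically.
    ranked = sorted(own, key=lambda w: (-own[w] / (1 + others[w]), w))
    return ranked[:n]

for group in profiles:
    print(group, "most frequent:", top_by_frequency(group, 3))
    print(group, "most distinctive:", top_by_distinctiveness(group, 1))
```

In this toy data the two groups share the same three most frequent words, yet the distinctiveness ranking puts "campfire" first for one group and "soccer" first for the other, each of which appears only once. A list built this way highlights what separates groups, not how they most often describe themselves.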
I appreciate Rudder’s attempts to mine the relatively new and rich data resources that are available to him. Just as others have done before, he has an opportunity to reveal unknown and interesting aspects of human behavior that are specific to the contexts from which he has collected data. If he had stayed within these natural boundaries, I would have enjoyed his observations thoroughly. Unfortunately, Rudder strayed beyond these boundaries into the realm of claims that exceed his data and methods of analysis. This is an expression of Big Data hyperbole and technological solutionism. The only “data science” that is worthy of the name is, above all, rooted in the principles and practices of science. We dare not forget this in our enthusiasm.
Take care,
4 Comments on “Big Dataclast: My Concerns about Dataclysm”
Hi Stephen,
Thanks for raising the issue of Data Science trying to appropriate Science without learning key tenets of experimental design and statistics. While I have a small amount of science training in my history, my background is more on the engineering side. I’m really excited to be learning the science aspect in my new role and have been sharing my discovery of higher-quality quasi-experimental designs like Interrupted Time Series and Regression Discontinuity with my coworkers. I was dumbfounded when one stated that they didn’t care about causality and it was a waste of time. I was quite taken aback since a few days before they had shown a time series chart in which they explained how implemented feature X had outcome Y.
That said, I also have many coworkers deeply interested in learning more. All my reference books for interesting discussion in this area are from social science and epidemiology. Do you know of any that are more product design focussed but cover more than just A/B testing? I have “Testing 1 – 2 – 3” but it is too mathy for most of my product people.
Thanks,
Chris
An interesting review Stephen.
When the OK Cupid blog surfaced in 2009 I remember reading it with interest. I thought it was fun (especially as I was a user of the site at that time), but I was always a little dubious about the scientific claims that were made – in particular to the “so that tells us X about society” commentary.
Undoubtedly there are some entertaining articles on there. It was revived earlier this year, no doubt in connection with this new book, and I'd encourage anyone interested in OK Cupid's data set to have a read of the articles. There are some interesting tidbits of information on there (it might help to understand explicitly how some features of the site work, however), and you can judge the merits of the "here's our conclusions but no you certainly can't see our data, our workings, our sampling, our assumptions etc…" approach.
As an aside, all the contributors seem to have read your books Stephen – as the graphs presented are usually excellent.
Cheers,
Mike
While some of your criticisms of Rudder are fair, I think you missed some of his caveats.
I too read the book as a skeptic, but I picked up far more caution in his analysis than you have (perhaps some is only in the footnotes/endnotes).
For example, take the list of distinct words by race, which you describe as unrepresentative. Well, I don't think Rudder claims what you say he claims. He explains the methodology behind these lists and it is clear from his explanation that the lists are not and should not be interpreted as "the things different races talk about the most". He is very clear that they are just the things that most differentiate the groups being analysed (many phrases are common to all the groups so they don't distinguish among them). So, if you pick up his explanation of how the phrases were generated, you shouldn't expect them to "capture the essence of xx in general".
Overall I don’t think he is guilty of overclaiming too much. I came away with a clear idea of the limitations of his data. So I rate him on the scientific side of your divide, not on the side of hype.
Steve — I quoted specific examples of Rudder’s claims that I found over-reaching. I would appreciate it if you would explain how these claims are not excessive.