There Is No Science of Data

“Data Science” is a misnomer. Science, in general, is a set of methods for learning about the world. Specific sciences are the application of these methods to particular areas of study. Physics is a science: it is the study of physical phenomena. Psychology is a science: it is the study of the psyche (i.e., the human mind). There is no science of data.

Data is a collection of facts. Data, in general, is not the subject of study. Data about something in particular, such as physical phenomena or the human mind, provide the content of study. To call oneself a “data scientist” makes no sense. One cannot study data in general. One can only study data about something in particular.

People who call themselves data scientists are rarely involved in science at all. Instead, their work primarily involves mathematics, and usually the branch of mathematics called statistics. They are statisticians or mathematicians, not data scientists. A few years ago, Hal Varian of Google declared that “statistician” had become the sexy job of our data-focused age. Apparently, Varian’s invitation to hold up their heads in pride was not enough for some statisticians, so they coined a new term. When something loses its luster, what do you do? Some choose to give it a new name. Thus, statisticians become data scientists and data becomes “Big Data.” New names, in and of themselves, change nothing but perception; nothing of substance is gained. Only by learning to engage in data sensemaking well will we do good for the world. Only by doing actual good for the world will we find contentment.

So, you might be wondering why anyone should care if statisticians choose to call themselves data scientists, a nonsensical name. I care because people who strive to make sense of data should, more than most, be sensitive to the deafening noise that currently makes the knowledge that resides in data so difficult to find. The term “data scientist” is just another example of noise. It adds confusion to an overly and increasingly complicated world.


P.S. I realize that the term “data science” is only one of many misnomers that confuse the realm of data sensemaking. I myself am guilty of using another: “business intelligence.” This term is a misnomer (and an oxymoron as well) in that, as with data science, when it is practiced effectively, business intelligence is little more than another name for statistics. It has rarely been practiced effectively, however. Most of the work and products that bear the name business intelligence have delivered overwhelming mounds of data that is almost entirely noise.

35 Comments on “There Is No Science of Data”


By Philip Mair. January 23rd, 2017 at 2:18 pm

This looks like a minor issue in the new world of alternative facts, don’t you think, Stephen?

By Stephen Few. January 23rd, 2017 at 3:05 pm

Philip,

Should we ignore what is perhaps minor compared to Trump? His emergence on the scene should not “trump” all other concerns.

By Fabian. January 23rd, 2017 at 5:31 pm

What about metadata? The study of the data that describe the data. What about the meta metadata?

By Stephen Few. January 23rd, 2017 at 5:56 pm

Fabian,

Are you aware of any scientific studies of data about data? Metadata is a rather simple concept that doesn’t seem to require scientific study.

By Fabian. January 23rd, 2017 at 6:46 pm

Yes, I was thinking about query optimizers.

But I agree: when words take the stage, I like to put them where they belong, in the dictionary, and use other terms and definitions to avoid more confusion.

By Konrad. January 24th, 2017 at 1:11 am

Actually there is a whole academic discipline dedicated to the study of information: https://en.m.wikipedia.org/wiki/Information_science

And what is data but information?

By Jason. January 24th, 2017 at 1:18 am

“One can only study data about something in particular” – but there’s a common toolbox that applies across the something-in-particulars, and that toolbox extends beyond what’s traditionally understood (and taught) as statistics.

Data science is a poor name for it, I agree, but that’s the one that’s stuck.

By Stephen Few. January 24th, 2017 at 2:41 am

Konrad,

Information science is also a misnomer. There is no science of information.

By Stephen Few. January 24th, 2017 at 2:47 am

Jason,

Even if there is a “common toolbox” that extends beyond statistics, it is not a science. Regardless, of what does this common toolbox consist besides statistics?

By Ezra Lee. January 24th, 2017 at 4:41 am

I always thought of the term “data science” as akin to “experimental science” or “theoretical science.”

By Stephen Few. January 24th, 2017 at 5:06 am

Ezra,

All science is both experimental and theoretical. So-called data science is neither.

By Nick. January 24th, 2017 at 6:23 am

Science is about collecting information (data) to interpret and draw inferences about our world. Data scientists are the epitome of scientists because they work with observations of what happens in our world. These observations extend across all areas of life.

By Stephen Few. January 24th, 2017 at 7:49 am

Nick,

When someone who calls himself a data scientist works with observations of what happens in our world, are those observations about data? They are not. Data is not the subject matter. As you yourself stated, the observations are about something that is happening in our world. That something is the subject matter of science, not the data that informs us about it. In other words, the science that so-called data scientists are engaging in is not about data. Someone who calls himself a data scientist can indeed engage in science, just not science about data.

By Nick Desbarats. January 24th, 2017 at 8:04 am

I’d never really thought about the actual term “data science” before but, yes, it clearly doesn’t hold up well to a parsing of its component terms.

While the current work that people who call themselves data scientists do includes statistics, I think that it also includes other important skills that don’t fall under that rubric but that are required in order to work with very large datasets. For example, people who call themselves data scientists also need to have a good understanding of tools such as Hadoop, MongoDB, Kafka, Redshift and the like. More importantly, they also usually require a solid grounding in the branch of computer science that deals with algorithms and, specifically, machine learning. Because these are entire specializations unto themselves (though they may or may not be “sciences”) that require a long time to master and they’re not required when working with small datasets, I’m not sure I’d classify these additional skills as “skills required to do statistics”.

If a data scientist were to only know about statistics but have limited technical or computer science knowledge, they’d be prone to designing analysis methods that may be statistically sound but that would require more memory than exists on Earth or more time than the age of the Universe to execute. I know some such people; they’re good statisticians but they’d be terrible data scientists.

“Data scientist” is obviously not the right term for people who do this work, but I think “statistician” might be a misnomer as well unless we update the definition of statistician to also include the technical and computer science knowledge that is required to derive statistically valid knowledge from very large datasets within the limitations of currently available technologies.

By Jeffrey Heer. January 24th, 2017 at 10:36 am

To provide some historical context: the term “data science” dates back to at least 2001, when it was proposed by Bill Cleveland (yes, that Bill Cleveland) as an antidote to a myopic focus of academic statistics, and as an expansion of what academic statistics typically covers. Note that this was roughly a decade before Hal Varian’s now-famous quote.

Here is a link to that original paper: https://utexas.instructure.com/files/35465950/download

More recently the term has been popularized as a position in industry, no doubt partly to generate enthusiasm but also to signify a position that often requires (for example) a stronger computational fluency than traditional statistics, such as the ability to work with large-scale data management technology. That said, in recent years this job role has become differentiated in some companies; it is not unusual to have data analyst / scientist roles alongside data engineer roles.

In academia today, data science typically refers to bridging data-driven methods (e.g., from statistics and computer science) with fundamental research questions in the natural and social sciences. For concrete examples, see the corresponding institutes at the University of Washington (eScience Institute), UC Berkeley (BIDS), and NYU (Center for Data Science).

While I would not wish to defend the choice of name, in practice data science is not synonymous with statistics.

By Stephen Few. January 24th, 2017 at 10:42 am

Nick Desbarats,

Professional titles describe the fundamental nature of one’s work. A furniture maker might use various tools during the course of his career, but he doesn’t change his title with each new tool, nor does he try to come up with a new title that describes the specific combination of tools that he uses. When statisticians started using calculators, they remained statisticians. When they started using computers, they remained statisticians. As new software tools became available for doing the work of statistics, they remained statisticians.

Algorithms are just computer programs that perform particular functions. When a statistician writes a statistical algorithm, she’s fundamentally doing statistics. Machine learning is also an application of statistics. The size of the data set does not change the fundamental nature of the work.

We don’t need a new professional title to describe statisticians who use modern tools. If you want to hire a statistician who is skilled in using particular tools, you list those tools in the job description; you don’t invent a new job title that attempts to describe that specific list.

By Stephen Few. January 24th, 2017 at 11:00 am

Jeff,

Data science perhaps has as many different meanings as Big Data. William Cleveland’s original use of the term was different from the meaning that D.J. Patil (formerly of LinkedIn) and Jeff Hammerbacher (formerly of Facebook), who are credited with coining the term “data scientist,” assigned to it. The point that I’m making is that it is a misnomer. There is no science of data. Even if it were not a misnomer, because it is not clearly defined, it creates confusion. The fundamental work that’s associated with so-called data science is statistics. Do we need a new term to describe statisticians who use modern computer-based tools? I don’t think we do. If we need to differentiate statisticians who don’t use these tools from those who do, isn’t it more straightforward to simply describe such people as statisticians who are capable of writing statistical algorithms, etc.?

By Stephen Few. January 24th, 2017 at 11:17 am

This discussion has brought something to my attention that I never thought much about before, which is that fields of study that include “science” in their names tend not to be sciences (e.g., library science). It is as if they are attempting to lend gravity to themselves that isn’t entirely legitimate and is entirely unnecessary. The names of those fields of study that actually are sciences tend not to include “science” in their names, such as physics, psychology, sociology, etc. A more common and older misnomer than data science is computer science. The study of computers involves various sciences and computers support the work of various sciences, but is the study of computers itself a science? I don’t think so. What we call “computer science” is largely an engineering practice.

By Jason. January 24th, 2017 at 1:31 pm

Yes, computer science is also arguably poorly named, and in fact you can go further. Your argument about data science applies equally well to computer science – which is, after all, just a branch of applied maths. Why do we talk about “computer scientists” when they’re just mathematicians who happen to study the design and implementation of computer systems?

By Stephen Few. January 24th, 2017 at 1:36 pm

Jeff,

I’d like to say a bit more about the unnecessary and potentially harmful introduction of new terms. William Cleveland’s introduction of the term “data science” was not his first attempt to propose a new term to correct a problem. Long before this he proposed the term “data analyst” in an attempt to correct a problem in the statistical community. At the time and to some degree still, many statisticians believed that they could successfully ply their trade with little knowledge of the domain to which they applied it. For example, a statistician might get involved in the analysis of healthcare with little or no knowledge of healthcare. Cleveland’s concern was legitimate, but I believe that his solution not only failed to correct the problem but also introduced unnecessary confusion. An entire generation of professionals who called themselves data analysts arose who believed that they could make sense of data without any training in statistics. Cleveland didn’t intend or foresee this, but the term data analyst took on a life of its own.

I believe that attempts to correct problems in professions or fields of study by introducing new terms will often not only fail but will create new problems as well. I believe that Cleveland could have addressed his legitimate concern among statisticians by making a clear case that statisticians must become familiar with the domains to which they apply their work. Unnecessarily introducing new terms complicates the field and language in general.

Something similar happened years ago in the information visualization research community. The term visual analytics was coined and treated as a new field of endeavor somewhat independent from information visualization, when in fact information visualization and visual analytics are one and the same, or at least ought to be. Running independent tracks at visualization conferences for information visualization versus visual analytics has created confusion.

By Jeffrey Heer. January 24th, 2017 at 2:15 pm

Thanks Stephen. To be clear, I appreciate your concerns regarding naming. Do not get me started on visual analytics vs. visualization, or “information” visualization vs. “scientific” visualization for that matter.

The point of my earlier post was not to argue in favor of the name, but rather to share (heretofore missing) context for where the term came from and how it is being applied. Whether we’re stuck with a term or trying to change it, we can benefit from trying to understand it and its contexts of use.

Regarding this statement: “I believe that Cleveland could have addressed his legitimate concern among statisticians by making a clear case that statisticians must become familiar with the domains to which they apply their work.” That case was arguably made long before by Tukey and others. It fell on all too many deaf ears. Whether or not new terms are used, creating new communities of practice is often a necessary step.

By Stephen Few. January 24th, 2017 at 3:08 pm

Jeff,

I appreciate your clarification.

I’m familiar with Tukey’s earlier attempt to correct what Cleveland also later attempted to correct within the statistical community. The fact that neither attempt succeeded does not argue in favor of coining new names for existing professions. Corrections can be made and even new communities of practice can be started without giving the profession a new name. Fixing problems in professions, disciplines, etc., is incredibly difficult. Trust me when I say that I know this firsthand and intimately. My attempts to correct problems in the practice of data visualization have almost always felt like two steps backward for every step forward. Despite the difficulty, however, we should work to correct problems without creating new and sometimes greater problems in the process.

Regarding the distinction between scientific visualization and information visualization, even though these fields of study and practice are intimately related, they also differ in some very real and meaningful ways. Despite the unnecessary degree of separation between these communities, which are more alike than different, I find the distinction useful.

By Dale Lehman. January 25th, 2017 at 5:43 am

Forgive me for going off in a different direction – the discussion about naming and the use of the word “science” is quite interesting and important, but there is (I think) an even more important topic that has been raised and is worthy of further discussion.

The (misused) term “data science” is really quite distinct from “statistics” in practice. The job market for “data scientists” is much larger than for “statisticians” and the former often does not even require training in statistics. It probably should require that but it does not. And there is a reason for this. Statisticians have, to a great extent, failed to address the rapid growth of tools available for analysis of data (not all of them, of course – but a survey from the Journal of Statistics Education from a couple of years ago that discussed statistics education in the 21st century did not even mention machine learning – this is but one example). Statisticians are often overly preoccupied with whether strict assumptions have been satisfied while the data scientists have been producing useful predictive models based on vastly available data. These models are not perfect but they are useful. And, they require virtually no assumptions since they are essentially mindless searching for patterns in data. If there is sufficient data available then these patterns are not likely to be spurious – and if they are, that is likely to be discovered rather quickly. Too many statisticians have ignored the growth of tools available, hiding behind the veil of theoretical validity.

There are a number of debatable points I raised in the previous paragraph – and they touch on many issues you have raised in the past, Stephen. These include automated analysis, the hard work of data sense-making, the need for subject matter expertise, etc. While there are a number of enlightened statisticians and data scientists, I find much of the discussion to be people talking past each other. In the best of all worlds, people would be adequately trained in statistics, computer “science,” philosophy of science, and have subject matter expertise. Nobody (I think) really disagrees with this, but as a practical matter, most practitioners will not have all of this training. Yet they will still work with data, some usefully and some not so much.

What I want to conjecture is that with ever more data available and ever more powerful tools to analyze it, statistics needs to change far more rapidly than it has thus far. The vast majority of statistics courses still spend the majority of their time on NHST (null hypothesis significance testing) and virtually no time on random forests, neural networks, etc. The job market is recognizing the value of the latter and is increasingly skeptical of the value of the former. So, while it is easy to defend the view that data scientists should be trained in statistics, the reality is increasingly that they don’t need to be. You can argue that they should be (and I would likely agree), but I think that argument is rapidly being lost.

Is this a bad thing? (I’d really like to hear thoughtful responses to that question).

By Stephen Few. January 25th, 2017 at 10:48 am

Dale,

It’s difficult to have a productive discussion about data science, because, similar to other terms that have been coined to breathe new life into long-existing professions and technologies (e.g., business intelligence, data analytics, and Big Data), it lacks a commonly accepted definition. For this reason, we can’t really say one way or the other what it is. At best we can say that most people who use the term seem to suggest that its meaning includes this or that. Activities that people tend to associate with data science, which you believe extend beyond the activities of statistics, only do so if you define statistics in a manner that is stuck in the past. You make a good point, however, that the statistical community itself has contributed to this problem by failing to evolve. This claim about the statistical community is based largely on its official academic publications, which are often stuck in the past, as many academic disciplines tend to be, but many practicing statisticians use the full spectrum of tools available to them.

The one claim that you made, which I find startling, is “While it is easy to defend the view that data scientists should be trained in statistics, the reality is increasingly that they don’t need to be.” Huh?! If so-called data science is about data sensemaking, then it is firmly rooted in statistics and must be. Making sense of quantitative data relies primarily on the tools of statistics.

By Dale Lehman. January 25th, 2017 at 11:06 am

I guess it may depend on how you and I are defining “statistics.” In the wider sense of the term (which I think you are using), my statement would be startling and I would not have said it. I was thinking in terms of what you refer to as academic statistics – and I do think data sensemaking can (not should, but can) take place without that type of statistics.

By Stephen Few. January 25th, 2017 at 11:10 am

Dale,

I’m not aware of any reasonable definition that would render statistics less than central to data sensemaking. If you’re aware of one, please share it.

By Dale Lehman. January 25th, 2017 at 12:22 pm

I just did a job search on Kaggle: 753 out of the 1130 total jobs have the word “statistics” in the description. I suspect this may overstate the importance of statistics in the job market given the nature of Kaggle and the type of people and firms that frequent that site. But I will grant you that statistical reasoning should be central to data sensemaking and I certainly find it to be.

What I have noticed over the past several years is the increasing trend for the job market to focus on programming and computer skills and the relative decline of the necessity of statistical training (sorry but I don’t have evidence to cite, just my perception).

By Stephen Few. January 25th, 2017 at 12:39 pm

Dale,

The number of times that job postings mention statistics is not a reliable way to determine the applicability or usefulness of statistics. Most job postings for work related to data sensemaking haven’t a clue what to require of applicants. Those who write the job descriptions rarely understand the work. Unfortunately, the workplace is impoverished in its understanding of data sensemaking. We live in a data-ignorant world, despite the fact that we (but not I) like to call our time the “information age.”

By Lloyd Houghton. January 25th, 2017 at 1:17 pm

Dear Stephen,

It’s hard to avoid using new terms when they’re all around you in the air, despite one’s initial distrust of them. I notice just down the page, in your positive review of “Weapons of Math Destruction,” that you yourself refer (without additional comment) to the author as having worked as a data scientist for many years! I wouldn’t consider this usage as contributing noise and confusion to anything. And it is probably useful to have different job titles for a statistician as found in the academy (i.e. working in a branch of pure mathematics, on general theorems with no interest in applications) and a “data scientist” whose job is to understand statistical methods and apply them to help businesses analyse data.

By Stephen Few. January 25th, 2017 at 1:47 pm

Lloyd,

When I recently reviewed “Weapons of Math Destruction,” it was with trepidation that I used the term “data scientist,” which I did only out of respect for the author because she chose to use this term in describing herself. I considered taking a moment in that review to mention my concern with the term, but doing so in that context would have created more noise than it resolved.

In response to your statement that the term “data scientist” is useful to describe statisticians “whose job is to understand statistical methods and apply them to help businesses analyse data,” it’s important to point out that statisticians have always done this and also that the term data science, as it’s generally used, is not limited to business applications.

By Mark. January 29th, 2017 at 12:24 am

Dear Stephen,

You raise a very important and serious point. As someone labelled a “statistician” working in a field with “data engineers”, “data analysts”, “machine learning experts” and now “data scientists” amongst others, the noise has hit a crescendo. At the heart of the problem is not whether the practitioner can use “statistical” tools, which many of these labels can do adequately, but whether the practitioner fully understands the scientific method (at the heart of which I place statistics). Sadly, not all these labels equally understand (or even care about) experimental design, measurement theory, confounding, cause and effect, etc.

A common anecdote I often see is that a statistician along with domain experts may painstakingly spend years on small data to fully understand the limitations of their knowledge of it. A data scientist will discover something and report a finding in big data which is not fully understood, only to later find the discovery is an artifact of the data, i.e., noise. The damage of this can be serious, especially in the life sciences.

So back to your point. It is a serious one which I don’t have the answer to but I believe it will have big implications in the future.

https://www.ft.com/content/21a6e7d8-b479-11e3-a09a-00144feabdc0

By Stephen Few. January 29th, 2017 at 10:37 am

Mark,

Thanks for providing the link to the article about “Big Data.” It’s excellent.

By Stephen Few. January 31st, 2017 at 12:05 pm

Now that this discussion has wound down, let me summarize my position. There is no science of data. Data is pertinent to all sciences, but it does not constitute the subject matter of a science that studies data in general. The argument that, despite the nonsensical nature of the term “data science,” a name is needed for the activities of so-called data scientists, does not appear to be valid. Is there really a common set of quantitative data sensemaking activities that extends beyond the realm of statistics? No agreed-upon set of activities for so-called data scientists exists. There is no consensus for the meaning of this term. Also, to my knowledge, none of the activities that anyone has proposed for data science extend beyond the realm of statistics. Those who argue that statistics is narrower in scope than the work of so-called data scientists arbitrarily limit the scope of statistics. While it is no doubt true that the statistical community has often shot itself in the foot by arbitrarily narrowing its scope or by failing to adopt the full range of activities and tools available to it, the realm of statistics is broad and many statisticians pursue it as such. Nate Silver is an example of a statistician who embraces the full range of activities and tools that are pertinent to modern statistics. As far as I know, Silver does not feel obliged to call himself a data scientist.

By Mike Modiano. February 21st, 2017 at 3:37 pm

Steve, how do you feel about “Library Science”? Is it not–at core–ultimately about information and data?

By Stephen Few. February 21st, 2017 at 8:02 pm

Mike,

Do we apply the methods of science to the study of collecting, preserving, cataloging, and making available books and other documents in libraries? Perhaps you know the answer to this question. I don’t, but I suspect that this does not constitute a scientific domain. Like many other areas of study, it probably does, however, benefit from the findings of other domains that are scientific, such as psychology.
