I’ve written a great deal during the past few months about Big Data, which is the most annoying, constantly-in-your-face information technology term of recent history. Another term has arisen in connection with Big Data that has generated its own share of hype and confusion: Data Science. I haven’t written about data science until now, but the following email from a data analyst named Kelly Martin has spurred me into action.
I am writing to ask you to please compose a post specifically addressing the new ‘data scientist’ hype, as this term is filtering down into organizations and causing all kinds of havoc. As a data analyst with a solid combination of education and experience, I am used to having terms thrown about by management who think ‘data mining’ is using VLOOKUP in Excel and love to present all their metric results as ‘significant’. Usually we can ignore it; they’ll find some poor kid to do what they want, and hopefully someone will point out that the Emperor has no clothes before they present their latest innovative analysis to a VP.
The Data Scientist buzz is powerful right now—perhaps it’s the exposure of Nate Silver’s election predictions. Managers who have zero statistical or data understanding are now expecting their analysts to produce inferential statistics believing this to be the solution to all their problems (or at least a career builder). These same managers don’t even know the difference between descriptive and inferential statistics or that they are asking for a research project that requires methodological rigor and time. (Research design to these guys means searching the Internet.)
Sometimes I wonder if your courses shouldn’t be specifically directed to middle and upper management. Most of the people data analysts are managed by have no real understanding of data, analysis, or statistics. This is why they are such easy marks for BI Vendors and Consultants.
I personally am no longer willing to educate my bosses. Organizations restructure so frequently now, who has time to train 3 managers a year? Besides, they often just grab on to terminology and use it inappropriately in meetings to sound knowledgeable. But I believe if you were to do a post explaining what a data analyst is and what the ‘data scientist’ hype is all about, we analysts could quietly forward it to our bosses in hopes they could learn something other than what they hear from Vendors and Consultants.
By the way, I love the Aptitudes and Attitudes of Effective Analysts section in your book Now You See It—you nailed it. Nowhere have I read a better description—it should be used for job descriptions. I just wish I could have gotten one boss to read it.
Kelly raises a legitimate concern, shared by many. She articulated it so well and with so much feeling, I couldn’t ignore her plea.
Like the term Big Data, which was coined in 1997, the term data science isn’t new. Gil Press did a great job of tracing its historical roots in a recent Forbes article titled “A Very Short History of Data Science.” Precursors of the term can be found in the writings of Princeton statistician John Tukey dating back to the 1960s. The precise term first appeared in 1974 in Peter Naur’s book Concise Survey of Computer Methods, and it was first used in the title of a conference in 1996 (“Data science, classification, and related methods” at the biennial conference of the International Federation of Classification Societies).
It was in a paper written in 2001, however, by statistician William S. Cleveland, one of data visualization’s great pioneers, that the term was first used in a manner that’s fairly consistent with its current use: “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics.”
This document describes a plan “to enlarge the major areas of technical work of the field of statistics. Because the plan is ambitious and implies substantial change, the altered field will be called ‘data science.’ The focus of the plan is the practicing data analyst. A basic premise is that technical areas of data science should be judged by the extent to which they enable the analyst to learn from data.”
Cleveland wanted to expand the perspective and toolkit of statisticians and apply their efforts to a broader range of real-world problems. He spoke of data science as a multidisciplinary effort, rooted in historical precedents.
The single biggest stimulus of new tools and theories of data science is the analysis of data to solve problems posed in terms of the subject matter under investigation. Creative researchers, faced with problems posed by data, will respond with a wealth of new ideas that often apply much more widely than the particular data sets that gave rise to the ideas. If we look back on the history of statistics—for example, R. A. Fisher inventing the design of experiments stimulated by agriculture data, John Tukey inventing numerical spectrum analysis stimulated by physical science and engineering data, and George Box inventing response surface analysis based on chemical process data—we see that the greatest advances have been made by people close to the analysis of data.
[Data science] carries statistical thinking to subject matter disciplines. This is vital. A very limited view of data science is that it is practiced by statisticians. The wide view is that data science is practiced by statisticians and subject matter analysts alike, blurring exactly who is and who is not a statistician.
Cleveland believed that statistics should be integrated more intimately into the real world and embrace a broader range of tools to explore and make sense of data. His perspective echoed that of Tukey, who many years earlier expressed concern that the term statistician reflected a narrow approach to data sensemaking. Tukey encouraged use of the term data analyst to promote a more open-minded statistics that embraced the use of software (still in its infancy at the time, long before personal computers) and data visualization. Tukey’s use of the term data analysis and Cleveland’s use of the term data science encouraged statisticians to embrace all of the technologies, techniques, and skills that were needed to derive greater value from a rapidly growing body of data. In the same spirit, today we can work to free data sensemaking from compartmentalization into separate disciplines of statistics, financial analysis, computer programming, data warehousing, data mining, business intelligence, data storytelling, data visualization, predictive analytics, and any other confined specialty. We can accomplish more by collaborating and by tearing down the boundaries that have traditionally, and in many respects artificially, separated these disciplines, sometimes creating conflict and competition between them.
When defined in this way, the term data science is meaningful and useful. Understood in this way, the work of data science is not new; it is what good data analysts by any name have been doing for quite a while. As the term is used by technology vendors and technology analysts, or on resumes, however, data science is seldom more than marketing hype: a new name for old technologies and skills that rarely rise to the level of science.
When uttered from the lips of technology vendors, technology analysts, and suddenly rebranded data practitioners, like many marketing terms, data science is an ill-defined miasma of confusion. They usually use the term synonymously with Big Data. Here’s the description that appears on the website www.datasciencecentral.com: “Data Science Central is the industry’s online resource for big data practitioners.” New York University’s initiative in data science intimately joins these terms as well: “In order to unlock the powerful potential of this big data, the world needs researchers and professionals skilled in developing and utilizing automated methods of analyzing it. These individuals are called ‘data scientists’…” (http://datascience.nyu.edu/about/). Insight Data Source, which offers a training program that in six weeks equips people to “succeed as data scientists,” enthusiastically describes the term more broadly, but no more meaningfully.
Nowhere has the benefits of analyzing data been felt more strongly than at top technology companies. Silicon Valley companies are not only leading in the production of data, they are also on the cutting edge of using insights from that data to benefit their users. In fact, the role of data scientist, now used throughout industry to describe highly specialized analysts with deep quantitative abilities, was coined by the heads of the early data teams at Facebook and LinkedIn. They realized the process of asking questions about product use cases, taking measurements, verifying hypotheses and building upon those results closely mirrored the process by which science is done. The individuals, therefore, who apply their curiosity, quantitative skills and intellect toward understanding big data are now known as data scientists—a job title that is one of the most in-demand job roles at today’s leading technology companies.
Imagine acquiring the skills of “highly specialized analysts with deep quantitative abilities” in six weeks. Although it has been obvious to data sensemakers all along that their work is scientific in nature and method, the idea of applying the term “data scientist” to those who do this work exploded in the IT industry’s consciousness as a revelation, one that it could harness to make money.
In some cases, vendors offer incoherent descriptions of data science beyond an association with Big Data that seem intentionally designed to keep the term nebulous. It is often to a vendor’s advantage to define what they sell in broad, vague strokes to seemingly fit the diverse expectations of customers. Here’s how data science is described on IBM’s website:
So what does a data scientist do?
A data scientist represents an evolution from the business or data analyst role. The formal training is similar, with a solid foundation typically in computer science and applications, modeling, statistics, analytics and math. What sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings to both business and IT leaders in a way that can influence how an organization approaches a business challenge. Good data scientists will not just address business problems, they will pick the right problems that have the most value to the organization.
The data scientist role has been described as “part analyst, part artist.” Anjul Bhambhri, vice president of big data products at IBM, says, “A data scientist is somebody who is inquisitive, who can stare at data and spot trends. It’s almost like a Renaissance individual who really wants to learn and bring change to an organization.”
Whereas a traditional data analyst may look only at data from a single source—a CRM system, for example—a data scientist will most likely explore and examine data from multiple disparate sources. The data scientist will sift through all incoming data with the goal of discovering a previously hidden insight, which in turn can provide a competitive advantage or address a pressing business problem. A data scientist does not simply collect and report on data, but also looks at it from many angles, determines what it means, then recommends ways to apply the data.
Data scientists are inquisitive: exploring, asking questions, doing “what if” analysis, questioning existing assumptions and processes. Armed with data and analytical results, a top-tier data scientist will then communicate informed conclusions and recommendations across an organization’s leadership structure.
So, according to IBM, a data scientist is someone who exhibits the following characteristics:
- An evolved state of business analysis or data analysis expertise
- Business knowledge
- An ability to communicate to technical and business people alike
- A focus on problems that matter
- Pattern detection
- A Jack of all trades (Renaissance individual)
- A lifelong learner
- A facilitator of change
- A multi-source data integrator
- A seeker of hidden gems
- A multi-perspective observer
- A person of action (insight applier)
- Someone who asks “what if?”
- Someone to whom leaders listen
Have I missed any? These are all great qualities, but they don’t constitute a useful definition of a data scientist. In fact, we could replace the title “data scientist” with “data analyst,” “statistician,” “BI professional,” or any of the other job titles that have been used over the years for expert data sensemakers and the list would work as well.
According to Wikipedia, data scientists are more narrowly defined: they must be trained in a specific scientific discipline and work as members of a team.
Data scientists solve complex data problems through employing deep expertise in some scientific discipline. It is generally expected that data scientists are able to work with various elements of mathematics, statistics and computer science, although expertise in these subjects are not required. However, a data scientist is most likely to be an expert in only one or two of these disciplines and proficient in another two or three. There is probably no living person who is an expert in all of these disciplines—if so they would be extremely rare. This means that data science must be practiced as a team, where across the membership of the team there is expertise and proficiency across all the disciplines.
There is no general agreement about the meaning of the term. Despite the good intentions of some who use it, such as William Cleveland when he first introduced it in 2001, it has led to confusion through freewheeling, self-serving use. When all is said and done, deriving value from data comes down to effective data sensemaking, resulting in greater understanding and better decisions. Regarding those who do the work, it only matters that they’re qualified to do it well. If you’re a hiring manager looking for someone to glean meaningful insights from your data, keep in mind that the title “data scientist” on a resume means nothing. IT titles are notoriously inflated.
In a recent article (“Data scientists don’t scale,” ZDNet, May 22, 2013), Andrew Brust described the situation poignantly:
There’s a risk that many technologists will become “data scientists” in the name of finding a better gig, in exactly the same way that happened with other lofty titles in technology (“architect,” for example). Title inflation happens in any field, but in the tech field, terms and titles are in any case viewed as metaphors, more than literal descriptions. Tech folks tend to take poetic license with titles, and those who don’t do so find themselves at a disadvantage compared to those who do.
If you manage people who work as data sensemakers, take the time to understand what they do and what’s required to do it well. Learn enough to see past the hype that vendors attach to this work to get you to buy their products and hire their consultants. Know enough to support your employees in their work, and then get out of their way. Most of all, never forget that only people can make sense of data. At best, tools can augment the abilities of talented people. There is indeed a science to data sensemaking, but data science by any other name (and there are many) would smell as sweet.