What is data science? At this point, roughly a decade after the general introduction of data science in industry, is there a standard definition in use?
One often-cited tagline comes from a popular 2012 tweet by Josh Wills:
Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.
To be clear, Josh is a friend and I seek out his excellent writing and presentations as top go-to resources to learn more about our field. That said, something about that circa 2012 definition above always struck me the wrong way. It oversimplifies important nuances. The scope of “statistics + software engineering” fails to mention business priorities, collaboration, effective communication, decision making, domain expertise, and other important “team sport” aspects which are so essential. We'll look at how other experts helped define data science to reflect that nuance, and talk about how the space has evolved to be a driver of – and driven by – the scale of processing data.
A brief history in three papers
Data Science gained traction in industry circa 2008, just as tooling for big data was on the rise, and as business use cases for machine learning (ML) became popularized. Those three trends grew together, in contrast to an earlier era of business intelligence (BI), a term initially popularized by Gartner analyst Howard Dresner. Most of BI was defined atop data warehouse (DW) practices, based on work by Barry Devlin and Paul Murphy, Ralph Kimball, Bill Inmon, et al. BI and DW were both introduced in the late 1980s, then became widespread practices throughout the 1990s.
Granted, data science work in enterprise makes much use of data warehouses and it often helps serve needs for business intelligence. These fields are not mutually exclusive. However, data science emerged in response to demand for more advanced techniques and larger scale-out than what the best practices from the prior decade could provide. Cloud resources were becoming popular, and crucial insights could be obtained more quickly and more cost-effectively due to popular open source tools such as Hadoop, Spark, plus a whole range of Python libraries.
While both approaches fit within a larger context of decision support systems, there’s something in the delta between what DW and BI accomplished in the 1990s through the mid–2000s, and what emerged in the field of data science, that points toward a clearer definition for the latter. First let’s look at three important papers that prefigured the adoption of data science.
In 1962, a Bell Labs mathematician named John Tukey wrote a paper called “The Future of Data Analysis”. I highly recommend reading that paper, if you haven’t before. Tukey urged a provocative new stance for applied mathematics which he called data analysis. Consider these section headings:
- “We should seek out wholly new questions to be answered.”
- “We need to tackle old problems in more realistic frameworks.”
- “We should seek out unfamiliar summaries of observational material, and establish their useful properties.”
- “And still more novelty can come from finding, and evading, still deeper lying constraints.”
You may have heard of Tukey before: he invented the word “bit” (a contraction of “binary digit”) while working on early computer designs with John von Neumann. Tukey had deep insights about how applications of mathematics could leverage the new era of digital computing resources. For example, part of his work was to establish the Statistics department at Princeton University. Formerly, statistics had been mostly a footnote within the larger field of mathematics, but Tukey worked to promote “data analysis” as its own discipline, making ample use of computation as a scientific pursuit in itself. He encouraged people to debate how to teach and how to practice data analysis in industry. In particular, he explored the word judgement in that 1962 paper, describing an emerging responsibility for people working in data analysis to help make judgements based on applied math. Tukey had immense impact on the field, and a deep legacy. For example, if you read the excellent books on visualizing data by Ed Tufte, references to Tukey show up throughout nearly all of Tufte’s writing.
A generation later, another Bell Labs researcher named William Cleveland coined the term data science in a 2001 paper citing Tukey among others. Check out that paper, “Data science: An action plan for expanding the technical areas of the field of statistics”. Cleveland proposed an outline for a multi-disciplinary curriculum:
- (25%) Multidisciplinary Investigations: data analysis collaborations in a collection of subject matter areas.
- (20%) Models and Methods for Data: statistical models; methods of model building; methods of estimation and distribution based on probabilistic inference.
- (15%) Computing with Data: hardware systems; software systems; computational algorithms.
- (15%) Pedagogy: curriculum planning and approaches to teaching for elementary school, secondary school, college, graduate school, continuing education, and corporate training.
- (5%) Tool Evaluation: surveys of tools in use in practice, surveys of perceived needs for new tools, and studies of the processes for developing new tools.
- (20%) Theory: foundations of data science; general approaches to models and methods, to computing with data, to teaching, and to tool evaluation; mathematical investigations of models and methods, of computing with data, of teaching, and of evaluation.
This curriculum indicates what Cleveland thought the field required, namely that data science is a space in which statistics and computing needed to interact, to provide the necessary resources and scale.
That same year, a UC Berkeley professor named Leo Breiman wrote “Statistical Modeling: The Two Cultures”. Most definitely recommended! Breiman was trying to document a sea change in the industry, between a previous era which he called data modeling and a new trend emerging which he called algorithmic modeling. That culture of data modeling was what Tukey had argued against, and what Cleveland was trying to push beyond. The newer culture embraced much larger data rates and more computation (does that sound familiar?) and also leveraged machine learning algorithms to help automate decisions at scale.
Note that phrases such as “larger data rates” or “at scale” used here imply scaling computation in multiple dimensions. Work toward distributed file systems and durable object stores made much larger data storage capacities feasible and more widely available – a prerequisite for data collection. Leveraging multiple computing cores to parallelize workloads had been standard practice on mainframes; beginning in the mid-1990s, people working in Big Data began to leverage “commodity” hardware in frameworks that provided parallelism and scale-out. Alongside those two long-term industry trends, we also saw substantial changes in the price/performance curves for large memory spaces. Taken together, these three dimensions of storage, processing, and memory began to scale data pipelines dramatically. Persistent questions about these three dimensions concern whether it’s more effective to “bring the compute to the data”, “cache intermediate processing results in memory”, or “bring the data to the compute” – questions which led to a variety of open source solutions for Big Data. Comparisons of early Apache Hadoop vs. Apache Spark vs. Ray, respectively, illustrate those trade-offs. As system capabilities evolve, those arguments continue today.
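To make the “bring the compute to the data” idea concrete, here is a toy sketch in plain Python (illustrative only – not actual Hadoop or Spark code). Each worker summarizes its own partition locally (the map step), and only the small per-partition summaries travel back to be merged (the reduce step). Real frameworks schedule the map step on the machines that hold the data; here a thread pool stands in for a cluster.

```python
# Toy map-reduce word count: compute goes to each partition,
# and only compact summaries move back for the merge.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(partition):
    """Map step: summarize one partition where it lives."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts

def word_count(partitions):
    """Run the map step per partition in parallel, then reduce."""
    with ThreadPoolExecutor() as pool:
        partial = pool.map(count_words, partitions)
    total = Counter()
    for counts in partial:  # only small summaries cross this boundary
        total.update(counts)
    return total

print(word_count([["big data big compute"], ["big memory"]]))
```

The design point is the data movement, not the word counting: shipping a few `Counter` objects is vastly cheaper than shipping the raw partitions, which is why frameworks in this lineage push computation toward wherever the data already sits.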
Big Data emerges
While not mentioned directly by Breiman, his observations coincided with – and were spot on about – the rise of successful ML applications at scale in the new tech start-ups of the time. The current heyday of data science began when some of these data-hungry applications started to become tractable, reliable, and cost-effective (in that order).
Check out these histories by lead architects at those firms – roughly centered on Q3 1997, which turned out to be a key inflection point for the Dot Com Boom:
The timing for those projects was during the peak of data warehouse and business intelligence adoption. However, a common theme among those four architects’ reflections is that they recognized how they’d need to scale ecommerce applications but could not do so with the available tooling. Instead, they turned to open source tools (such as Linux) for early data science work on proto clouds, leveraging ML at scale for ecommerce. Their timing was impeccable, particularly for Amazon: just in time to monetize the first big wave of ecommerce in the holiday season of Q4 1997. The rest is history.
Figure 1: post first big ecommerce success
Figure 1 and Figure 2 show a “before and after” contrast. The gist is that ecommerce firms split their web apps using a principle of horizontal scale-out, i.e., proto cloud work on server farms. Those many servers generated lots of log files (proto Big Data), which in turn were analyzed using machine learning algorithms, which in turn provided predictive analytics that improved customer experience in the web apps. A virtuous cycle emerged, with data as a product.
Figure 2: data products, a virtuous cycle
Prior to that point, many people working in statistics had not taken Tukey’s admonitions to heart. They had focused on statistics for defending arguments, such as in courtroom hearings. When you read Breiman’s paper, check out the nastygrams from famous statisticians in the appendix, plus Breiman’s humorous rebuttals. However, after Q4 1997 the world of data changed, predictive analytics loomed large, and by 2001 the old guard was thoroughly unhappy about it. Breiman described that sea change quite succinctly:
A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.
Plenty of other people also helped further the cause of “data science” and deserve credit, such as Jeff Wu, who likely coined the phrase (in its contemporary usage) during his University of Michigan appointment lecture, “Statistics = Data Science?”
A new discipline, in a personal context
Somewhere among the admonitions by Tukey, the curriculum from Cleveland, and the industry observations of Breiman, we find a definition for data science which goes well beyond merely “statistics + software engineering”. It’s also well beyond the scope of what DW + BI could deliver. Circa early 2010s, DJ Patil (LinkedIn) and Jeff Hammerbacher (Facebook) had popularized the phrase data scientist and were showing examples of data science teams in industry, working at scale, demonstrating ROI. Those examples led directly to the Strata Data Conference and other popular forums – for example, the IBM Data Science Community here!
Along the way, a group of Stanford professors launched a multi-disciplinary degree in the 1970s which is now called Mathematical and Computational Science, foreshadowing Cleveland’s proposed curriculum. I stumbled into becoming one of their undergraduates in the early 1980s, while the program was led by Brad Efron (see the rebuttals in Breiman’s paper). While many of our classmates became insurance actuaries, I went on to grad studies in AI and took a gamble on several years of R&D work in a niche field called neural networks. Later I moved into early ecommerce projects, eventually leading data teams. Through those projects I became a “guinea pig” for Amazon AWS when it first launched. We were running a large instance of a new open source project called Hadoop for a large commercial recommender system. That was followed by a role as the analytics director for the company where “Lean Startup” was invented. A few years later I joined an open source project called Apache Spark as its community evangelist, became an O’Reilly author, served as co-chair for JupyterCon, plus some other side gigs. A “portfolio career” appears to be the polite term these days.
Key takeaways and a question
I look forward to writing about data science here, exploring many more details about our field. I also look forward to getting to know the people within the IBM Data Science Community, and learning about your projects and interests.
The main takeaway from this article:
Looking at decades of history, data science found its place by applying increasingly advanced mathematics for novel business cases, in response to surges in data rates and compute resources.
I would argue that the term “data rates” used here is about availability driving demand. In the example of Apache Hadoop and Big Data, people found ways to leverage available data first. Potential ROI for those use cases (e.g., advertising network optimization) in turn drove demand for increasingly reliable solutions as apps moved into production for core business. Broader adoption and economies of scale in the marketplace drove those solutions to become more cost-effective.
In the latest wave of AI applications in industry, the term “ABC” has emerged to describe a winning combination of “AI”, “Big Data”, and “Cloud Computing” – the latest embodiment of the takeaway described above.
Even through these waves of innovation, the two most important underlying aspects of data science (and its forebears in data analysis) are the curiosity with which people approach any new surge in data + compute + advanced math, plus the resulting judgements leveraged in business use cases. Tukey pointed specifically at those two aspects, urging that “We should seek out wholly new questions to be answered.” I find that the combination of data, advanced math, compute, curiosity, collaboration, learning, business, and judgement conveys a much better definition for data science than “statistics + software engineering”. We’ll explore those themes here, in these monthly columns within this data science community.
Speaking of which … I need your help. Recently, Ben Lorica and I have been conducting industry surveys about AI Adoption in Enterprise. We have completed three surveys now, and the most recent one turned up a question. Beyond the well-known roles of data scientist and data engineer, there’s another important role emerging which has not yet been named. We found that 23% of the enterprise organizations attempting to leverage data science, machine learning, artificial intelligence, etc., cite “recognize business use case” as a critically missing skill within their teams. What would you call that role? Where and how does a person learn to perform it? We’ll start a thread here to discuss. See you there!