Data-intensive science describes research and engineering efforts that depend on processing large datasets acquired from instruments such as cameras, genome sequencers, super microscopes and other devices. It requires scalable and flexible IT infrastructure to acquire, store and manage huge amounts of data effectively, and to analyze that data quickly. The whole research or engineering effort depends on the availability of these IT resources and will suffer or fail without sizeable compute, storage and network resources that provide the required capacity, performance and availability.
Figure 1 shows a typical growth pattern of organizations that adopt data-intensive science. At the beginning of the journey (phase 1) data volumes are moderate: a few servers are sufficient to process all the data, and NAS filers and scale-out NAS systems are capable of providing the required storage capacity and performance. Over the years, however, the acquired data adds up to tens or hundreds of petabytes, which pushes traditional file storage systems to their limits.
Figure 1. Data growth in data-intensive science projects adds up to tens or hundreds of petabytes.
There is a second, less obvious growth pattern, and organizations that adopt data-intensive science easily miss it until they hit the wall. After data has been collected for some time, there is an inflection point from which onward data-intensive science projects require scalable performance. The orange curve in Figure 2 depicts the required throughput. In the example there is a bend in the year 2014, after which the throughput increases dramatically (phase 2). The red arrow in Figure 2 makes this shifting of gears more visible.
Figure 2. Mature data-intensive science projects require scalable performance.
The scalable performance required in phase 2 is typically triggered by a combination of factors. First, technical progress improves the devices that generate data, so newer devices bring higher data volumes and higher data rates. In genomics, for instance, new sequencing systems create larger data sets in a shorter runtime than older sequencing systems. Second, in modelling, each new version of a model processes all existing data for validation or training. For instance, test car fleets acquire massive amounts of new data each day to improve AI models for autonomous driving, and new training runs use both recently acquired data and older data that was acquired months or years ago. Third, AI engineering teams improve their development pipelines over time, which lets them create new models faster and shortens the time between successive training or validation runs.
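To make the interplay of these factors concrete, here is a minimal back-of-the-envelope sketch in Python. It only applies the relation throughput = data volume / time window. Every number in it (2 PB acquired in the first year, 50% yearly growth, a run window that shrinks from 30 to 7 days) is a hypothetical assumption for illustration and is not taken from Figure 2 or from any real project. The function names are likewise made up for this example.

```python
# Back-of-the-envelope sketch; all figures are hypothetical assumptions.

PB = 10**15               # bytes per petabyte (decimal)
GB = 10**9                # bytes per gigabyte (decimal)
SECONDS_PER_DAY = 24 * 3600


def required_throughput_gb_per_s(dataset_pb: float, window_days: float) -> float:
    """Sustained read rate (GB/s) needed to read dataset_pb once in window_days."""
    return dataset_pb * PB / (window_days * SECONDS_PER_DAY) / GB


def archive_size_pb(years: int, first_year_pb: float, yearly_growth: float) -> float:
    """Total data kept after `years` if each year's intake grows by yearly_growth."""
    return sum(first_year_pb * (1 + yearly_growth) ** y for y in range(years))


if __name__ == "__main__":
    # Hypothetical: the time allowed for one full pass over the archive
    # shrinks as the development pipeline matures.
    window_days_by_year = {1: 30, 2: 30, 3: 21, 4: 14, 5: 7, 6: 7}
    for year, window in window_days_by_year.items():
        archive = archive_size_pb(year, first_year_pb=2.0, yearly_growth=0.5)
        rate = required_throughput_gb_per_s(archive, window)
        print(f"year {year}: archive {archive:5.1f} PB, "
              f"window {window:2d} days, needs {rate:5.1f} GB/s")
```

Under these assumptions the archive grows steadily, while the required throughput grows much faster once the run window starts to shrink, because a larger archive has to be read in less time. That compounding is one plausible reading of the bend in the throughput curve.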
In phase 2 traditional storage systems such as NAS filers and scale-out NAS systems hit their limits. A set of NAS filers or scale-out NAS systems may be able to meet the capacity requirements, but typically they cannot provide the required scalable performance. At this point organizations therefore need to start using a different kind of IT infrastructure to support the increasing capacity and performance requirements of maturing data-intensive science projects.
High-performance computing (HPC) and cloud computing provide proven approaches to meet the capacity and performance requirements of mature data-intensive science projects. CIOs need to understand the self-reinforcing spiral illustrated in Figure 3. Changing business requirements make it necessary to develop and run new applications (e.g., autonomous driving). These new applications determine the workload (e.g., development, training and validation of AI models to autonomously drive a car) and therefore the requirements for the underlying IT infrastructure. Data-intensive science requires scalable infrastructure to acquire and process huge amounts of data. This infrastructure is different from the infrastructure for traditional enterprise IT such as database systems. The infrastructure in turn determines the required skills, so organizations that are adopting data-intensive science need to plan to acquire new skill sets. In the end, the available infrastructure and skills determine the capability to support the business and therefore the success in adopting data-intensive science.
Figure 3. Data-intensive science requires new kinds of applications and workloads. Organizations that adopt data-intensive science need to plan for a new kind of infrastructure and the corresponding new skill sets to enable the new applications and workloads.
In my experience, organizations that adopt data-intensive science are already overwhelmed in phase 1. They face data volumes that they have never seen before. At the same time, they must plan for the infrastructure and skills to support the business in phase 2, although they are not yet aware that they require a new kind of infrastructure and new skills.
Table 4 contrasts traditional IT, traditional HPC and scalable infrastructure for data-intensive science. CIOs need to understand the spiral above and the differences in IT infrastructure to get their IT organization ready to effectively support the lines of business on their journey to data-intensive science.
Table 4. Differences between traditional IT, traditional HPC and data-intensive science
||Traditional IT||Traditional HPC||Data Intensive Science
||Enable core business processes||Enable compute and data intensive research||Enable core business processes
||Different teams for network, servers, storage, applications, infrastructure services, etc.||Typically, a dedicated team for the HPC system in addition to the team for Traditional IT||Integrated team for all aspects of the IT solution
||Each team is responsible for their component / application||A separate team is responsible for the HPC system||One team is responsible for software-defined infrastructure including the underlying hardware and the integration in the corporate IT landscape
||Component-specific skill. Reaches out to other teams for matters related to other components.||End-to-end skill for HPC system. Reaches out to Traditional IT for selected infrastructure services and remote access to HPC system.||End-to-end skill for corporate-wide software-defined infrastructure. Reaches out to Traditional IT for selected infrastructure services and external networks.
I started my professional career in storage in 1997. In 2014 I got involved in the design and implementation of a new compute and storage platform for data-intensive science. My deep experience in first-of-a-kind projects enabled me to lead this project in close collaboration with the customer. I was good at gathering and documenting requirements, workflows and workloads, and at driving the development of the high-level solution architecture. But I struggled when it came to the deployment and validation of the designed solution. We resolved this situation by involving experts with an HPC background. In hindsight, I was not aware that infrastructure for data-intensive science requires different skills than infrastructure for traditional enterprise IT.
Today the largest supercomputers in the world comprise thousands of compute nodes and hundreds of petabytes of file storage. They are architected, operated and used by teams with long and deep HPC experience. For almost all organizations that adopt data-intensive science, small or medium-size HPC-like systems are enough. Small HPC systems comprise up to a few tens of compute nodes and up to a few petabytes of file storage. Medium-size HPC systems comprise up to a few hundred compute nodes and up to a few tens of petabytes of file storage. Small and medium-size HPC systems are not rocket science, but they can become a problem if your business depends on them and you have no experience with HPC.
The case studies of DESY, Continental, University of Queensland and University of Birmingham show that software-defined infrastructure can change how scientists do data-intensive science. Based on what I learned in my first infrastructure project for data-intensive science, I have advised several research institutes and commercial customers in life sciences and automotive on how to master their journey to data-intensive science. Planning to acquire new skill sets is key to success and as relevant as choosing the right infrastructure. My advice to others would be to involve an experienced partner such as IBM to accompany you on your journey to data-intensive science.
Acknowledgement: Many of the concepts described in this article are the result of long discussions with my friend Tomer Perry.
What do you think? What is your infrastructure for data-intensive science? What have been your challenges and solutions on your journey to data-intensive science? Please share your thoughts and insights in the comments section below.