According to the World Economic Forum <link>, 463 exabytes of data will be created every day by 2025, much of it generated by Internet of Things (IoT) devices, wearables and social media. Traditional data processing techniques cannot make sense of data at this scale. Artificial Intelligence (AI) can help extract the signal from the noise in these huge data volumes, allowing businesses to take advantage of the insights they contain.
Many of the problems that AI is now being called upon to solve have previously been solved using High Performance Computing (HPC) platforms. This is not an either/or choice: each approach brings capabilities that the other lacks, but the two have more in common than sets them apart. For example, current HPC systems have many attributes that make them a good fit for the needs of AI:
- Large numbers of compute nodes enable highly scalable parallel processing of workloads. These compute nodes may include specialised hardware such as GPUs that can accelerate AI workloads.
- High performance data storage capabilities such as parallel filesystems can hold large volumes of data, either for training AI models or as inputs for AI inference using these models.
- High speed, low-latency networks interconnect compute nodes and storage so that compute nodes can access the data they need for computation when they need it. This both enables AI model training and makes AI inference systems more responsive.
So, if there is inherent commonality between HPC and AI, how might the two best be brought together?
AI as a precursor to HPC
AI can be used as a mechanism to “pre-process” a larger set of possibilities before handing this reduced set off to an HPC system to perform more in-depth analysis.
Consider a pharmaceutical company looking to design a new range of medicines. It might use AI to narrow down a large field of potential molecules to the most promising candidates. AI analysis of research papers, as well as genetic and epidemiological studies, might help reduce a wide range of potential candidates and their biological mechanisms to a more focused target list. This reduced target list can then be processed by an HPC system, which will simulate the interactions between each drug molecule and a potential biological receptor to identify whether any of these candidates are likely to work in practice.
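The screen-then-simulate pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not a real drug discovery pipeline: the functions `cheap_score` and `expensive_simulation`, the candidate fields and the randomly generated data are all hypothetical stand-ins for an AI model and an HPC simulation job.

```python
import heapq
import random


def cheap_score(candidate: dict) -> float:
    """Hypothetical stand-in for a fast AI model's 'promise' score."""
    return candidate["predicted_affinity"]


def expensive_simulation(candidate: dict) -> float:
    """Hypothetical stand-in for a costly HPC simulation of one candidate.

    In practice this step would be submitted to an HPC workload scheduler.
    """
    return 0.9 * candidate["predicted_affinity"] + candidate["noise"]


def screen_then_simulate(candidates: list, shortlist_size: int) -> list:
    """Use the cheap score to build a shortlist, then simulate only that shortlist."""
    shortlist = heapq.nlargest(shortlist_size, candidates, key=cheap_score)
    results = [(c["name"], expensive_simulation(c)) for c in shortlist]
    # Rank the shortlist by the simulated result, best first.
    return sorted(results, key=lambda item: item[1], reverse=True)


# Made-up candidate data, seeded so the run is repeatable.
rng = random.Random(0)
candidates = [
    {
        "name": f"mol-{i}",
        "predicted_affinity": rng.random(),
        "noise": rng.uniform(-0.05, 0.05),
    }
    for i in range(1000)
]

ranked = screen_then_simulate(candidates, shortlist_size=10)
print(ranked[:3])
```

Only 10 of the 1,000 candidates reach the expensive step here, which is the point of the pattern: the HPC system spends its time on the candidates most likely to matter.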
AI and HPC together
Rather than treating AI as a precursor to HPC, the next examples look at where HPC and AI overlap.
There are a couple of ways this might occur. The first comes in the use of the AI models themselves.
AI models need to be trained, and training large models typically requires an HPC-like environment. Consider the Vela supercomputer, built by IBM Research and running on IBM Cloud <link>. This platform resembles many HPC environments: lots of compute nodes containing GPUs to accelerate computation, a high speed, low-latency network and high-performance parallel storage. The use of traditional HPC workload schedulers optimises use of the underlying compute resources to ensure the model is trained efficiently.
Once you have your AI model, it is likely to need some ‘adjustment’ to better fit the business problem that you need it to solve. This is where prompt engineering and fine tuning come in. While these are less computationally intense than the original model training work, they will still need access to similar GPU-accelerated HPC infrastructures.
Assuming that your model is now ready for production use, you will analyse your input data through a process called inference. While some of this could be offloaded to specialised processors such as the IBM AIU <link> or the AI engine in the z16 Telum processor <link>, in many cases inference may also be performed in an HPC environment, once again using an HPC workload scheduler to deliver the compute resources needed to the requesting users or applications. Different uses of AI have different compute needs. AI for transactional workloads such as fraud detection in payments systems can run without GPU acceleration, whereas text generation AI models, such as those used by digital assistants, have much more demanding computational needs.
The second major overlap between AI and HPC comes in the use of AI to ‘replace’ massive computational tasks. For certain problems, an AI algorithm can provide a “good-enough” approximation of the results, avoiding the need to perform the full HPC calculation and freeing those HPC resources to undertake other tasks.
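The idea of a “good-enough” approximation can be illustrated with a small surrogate model. In this sketch, a slow numerical integration plays the part of the HPC calculation, and a simple interpolation table plays the part of the trained AI model; a real surrogate would more likely be a neural network or similar, so treat this only as an illustration of the workflow. All names and parameter values are assumptions made for the example.

```python
import bisect
import math


def expensive_solver(x: float, steps: int = 20_000) -> float:
    """Stand-in for a costly HPC calculation.

    Computes the integral of exp(-t^2) over [0, x] with the midpoint rule.
    """
    h = x / steps
    return sum(math.exp(-(((i + 0.5) * h) ** 2)) for i in range(steps)) * h


class Surrogate:
    """Cheap approximation built from a handful of expensive solver runs.

    Linear interpolation stands in for a trained AI model here.
    """

    def __init__(self, xs, ys):
        self.xs = list(xs)  # training inputs, in ascending order
        self.ys = list(ys)  # solver outputs at those inputs

    def predict(self, x: float) -> float:
        # Find the pair of training points that bracket x.
        # Outside the training range this extrapolates, which is unreliable.
        i = bisect.bisect_left(self.xs, x)
        i = min(max(i, 1), len(self.xs) - 1)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# "Training": run the expensive solver once per grid point.
train_xs = [i * 0.1 for i in range(21)]  # 0.0, 0.1, ..., 2.0
train_ys = [expensive_solver(x) for x in train_xs]
surrogate = Surrogate(train_xs, train_ys)

# "Inference": answer a new query without calling the solver again.
approx = surrogate.predict(0.55)
exact = expensive_solver(0.55)
print(f"surrogate={approx:.5f} solver={exact:.5f} error={abs(approx - exact):.5f}")
```

Whether the approximation is good enough depends on the problem; a cautious design keeps the full solver available for queries where the surrogate's error would be unacceptable.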
HPC as a precursor to AI
Just as we might use AI to identify a more focused set of possibilities that can then be processed by an HPC system, we can similarly see HPC being used to create the necessary input datasets that will subsequently be processed by an AI model.
The use of HPC systems by financial firms to run Monte Carlo simulations is well understood. As the number of these simulations grows over ever-increasing portfolios of financial assets and instruments, interpreting the results as a whole becomes a growing challenge. The use of AI to assist a professional in their assessment of these simulations can both improve accuracy and deliver more timely insights. In a business where time-to-market is a key differentiator, this has direct business value.
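As a rough sketch of this hand-off, the Python below runs a small Monte Carlo simulation per asset and then condenses the results into a ranked summary for a human reviewer. The geometric Brownian motion model, the portfolio and its parameters are illustrative assumptions, and the ranking step is a plain statistical stand-in for the place an AI model would occupy in a real workflow.

```python
import math
import random
import statistics


def simulate_returns(mu, sigma, horizon_years, n_paths, rng):
    """Monte Carlo stage: sample end-of-horizon returns under geometric Brownian motion."""
    drift = (mu - 0.5 * sigma ** 2) * horizon_years
    vol = sigma * math.sqrt(horizon_years)
    return [math.exp(drift + vol * rng.gauss(0.0, 1.0)) - 1.0 for _ in range(n_paths)]


def value_at_risk(returns, confidence=0.95):
    """Loss level exceeded in roughly (1 - confidence) of the simulated paths."""
    ordered = sorted(returns)
    index = int((1.0 - confidence) * len(ordered))
    return -ordered[index]


def triage(portfolio, n_paths=5000, seed=42):
    """Summarise each asset's simulations and rank by tail risk, worst first."""
    rng = random.Random(seed)
    summary = []
    for name, (mu, sigma) in portfolio.items():
        returns = simulate_returns(mu, sigma, 1.0, n_paths, rng)
        summary.append(
            {
                "asset": name,
                "mean_return": statistics.fmean(returns),
                "var_95": value_at_risk(returns),
            }
        )
    return sorted(summary, key=lambda row: row["var_95"], reverse=True)


# Hypothetical portfolio: asset name -> (annual drift, annual volatility).
portfolio = {
    "bond_fund": (0.03, 0.05),
    "equity_fund": (0.07, 0.20),
    "crypto": (0.10, 0.80),
}

report = triage(portfolio)
for row in report:
    print(f"{row['asset']:12s} mean={row['mean_return']:+.3f} VaR95={row['var_95']:.3f}")
```

The reviewer sees three ranked rows instead of 15,000 simulated paths; at production scale, that reduction step is where AI assistance would be applied.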
Where does the cloud fit in all this?
In all the above scenarios, the AI and HPC workloads could run on the same physical infrastructure (HPC cluster) or on different infrastructures if an organisation requires this separation of concerns. Being able to run HPC and AI in different environments if required delivers resilience and greater business flexibility.
Like it or not, the significant hardware footprint required to both train and use AI models may be beyond what most organisations can run on-premises. The skills required to both set up and operate an AI platform are in short supply. Similarly, constraints in the availability of GPUs are further driving many companies to meet their HPC and AI needs from the cloud.
While there are good reasons why hybrid arrangements make the most sense for HPC <link>, the same cannot be said for AI. An organisation consuming AI services from the cloud should also be considering a cloud-based or hybrid HPC environment to optimise business flexibility and to minimise the movement of data between cloud and on-premises systems. Integration between AI and HPC is greatly enabled when both are available as cloud services. This also allows AI services to be consumed by organisations that lack the resources to run them on-premises in a cost-effective manner.
Looking forward
One might anticipate that future HPC infrastructures will need to become better suited to supporting the needs of AI. One might also predict that such infrastructures will leverage the best of both worlds, with greater use of AI acceleration devices such as GPUs and inference accelerators within HPC compute nodes, and with these devices being directly managed by HPC workload schedulers.
There is also the case where AI is used to help create and tune HPC applications. A lot of current AI interest is in the area of large language models (LLMs). These are AI algorithms designed to perform various natural language processing tasks. Readers are likely familiar with LLMs such as OpenAI’s GPT, Google’s Gemini and Meta’s LLaMA. The creation of parallelised code has long been a challenge holding back the use of HPC for many organisations. The ability of LLMs to help create new programs will make this simpler in the future. Similarly, tuning and optimisation of these codes to better exploit the hardware on which they run will become easier with AI assistance.
Conclusions
HPC and AI have more in common than sets them apart. AI needs HPC-like systems to better support the training and deployment of models. HPC can work together with AI to better focus available resources on the most important business problems. We expect to see the line between AI and HPC becoming more blurred as we move forward. Working together, AI and HPC can solve the largest and most complex problems faced by organisations today, with HPC providing large scale data processing and AI delivering new or more accurate insights from that data.