High Performance Computing


Achieving the impossible – Rightsizing your HPC

By Terry Fisher posted Wed February 28, 2024 03:33 PM


“We love our cluster – it even has a name”
We all know that science, engineering and research are massively dependent on access to large high performance computing (HPC) cluster systems to support simulations, calculations and, ultimately, new discoveries and insights. These systems are cherished by their user communities (almost every one has a name) and are usually housed in a local data centre, providing a resource shared amongst the community of users. Systems range from small clusters with a few hundred CPU cores to extreme exascale systems with hundreds of thousands of CPU cores.

To make this processing power available to users, a workload manager or scheduler accepts workloads, allocates compute resources and dispatches jobs to some or all of the cores in the cluster. As the scheduler's queues fill up, workloads run on the cluster until the system reaches full utilisation. For HPC system owners and administrators, this is the ideal situation: a system that is busy all the time delivers maximum return on investment. A slightly over-subscribed HPC cluster with a short wait time in the queue is what they typically aim for.
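The accept/allocate/dispatch cycle described above can be sketched in a few lines of Python. This is a toy FIFO scheduler, not any real workload manager; the class and job names are invented for illustration.

```python
from collections import deque

class MiniScheduler:
    """Toy FIFO workload manager: accepts jobs, allocates cores, dispatches."""

    def __init__(self, total_cores):
        self.total_cores = total_cores
        self.free_cores = total_cores
        self.queue = deque()   # jobs waiting for resources: (name, cores)
        self.running = []      # jobs currently dispatched: (name, cores)

    def submit(self, name, cores):
        self.queue.append((name, cores))
        self.dispatch()

    def dispatch(self):
        # Dispatch queued jobs in order while enough cores remain free.
        while self.queue and self.queue[0][1] <= self.free_cores:
            name, cores = self.queue.popleft()
            self.free_cores -= cores
            self.running.append((name, cores))

    def finish(self, name):
        # Release a finished job's cores, then try to dispatch waiting jobs.
        for job in self.running:
            if job[0] == name:
                self.running.remove(job)
                self.free_cores += job[1]
                break
        self.dispatch()

sched = MiniScheduler(total_cores=128)
sched.submit("cfd_run", 96)   # dispatched: 32 cores left free
sched.submit("fea_run", 64)   # must wait in the queue
```

Real schedulers add priorities, backfill and fair-share policies on top of this basic loop, but the queue-until-cores-are-free behaviour is the same.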

Oversubscribed systems hamper progress
But what happens when the system is so busy that wait times stretch well beyond a short period? Users now face substantial delays in getting their workloads completed for lack of sufficient compute resources, putting deadlines and completion dates at significant risk. For researchers, this could delay the release of a scientific paper; for commercial users, it could mean delayed product launches, missed milestones or lost opportunities. These delays can translate into lost revenue, extra expense, unhappy clients and a damaged reputation.

The cost of being late
As an example, if a major car manufacturer missed the launch date for a new vehicle, the revenue impact could be measured in millions of dollars for every day of delay. Whilst that may be an extreme case, even smaller delays can have a major impact on business outcomes and financial results.

So, what can be done to address the “cost of being late”? We believe that a hybrid cloud solution for HPC could be the answer, complementing and augmenting your on-premises investments rather than migrating entire HPC systems to the cloud. This delivers additional compute resources on demand, providing the extra temporary capacity needed to serve users who are waiting too long for access to the local systems. A fully automated hybrid cloud solution manages the policy-based orchestration of cloud HPC resources on demand, ensuring compute capacity is deployed only when the right conditions have been met.

Policy-based cloud bursting
The burst decision can be based on a number of factors, formulated into a policy that determines when it is time to burst to the cloud. That way, the system makes the decision automatically rather than relying on a manual process. Once additional compute resources are provisioned, the queueing system pushes workloads to the cloud for execution. As burst workloads complete, if the capacity is no longer required, the system can automatically shut down the cloud servers so that users do not pay for idle compute time.
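A burst policy of the kind described above might look like the following sketch. The thresholds, field names and functions here are illustrative assumptions, not the configuration of any particular product; real policies would come from site configuration and the scheduler's own APIs.

```python
import time

# Illustrative thresholds -- real values would come from site policy.
MAX_WAIT_SECONDS = 15 * 60        # burst if any job has waited over 15 minutes
IDLE_SHUTDOWN_SECONDS = 10 * 60   # release cloud nodes idle for 10 minutes

def should_burst(queued_jobs, now):
    """Return True when the policy says to provision cloud capacity."""
    return any(now - job["submitted"] > MAX_WAIT_SECONDS for job in queued_jobs)

def nodes_to_release(cloud_nodes, now):
    """Return the cloud nodes that have been idle long enough to shut down."""
    return [n for n in cloud_nodes
            if n["idle_since"] is not None
            and now - n["idle_since"] > IDLE_SHUTDOWN_SECONDS]

now = time.time()
queue = [{"name": "cfd_run", "submitted": now - 20 * 60}]  # waited 20 minutes
if should_burst(queue, now):
    print("policy met: provisioning cloud servers")
```

The key point is that both provisioning and teardown are driven by the same automated policy loop, so idle cloud capacity is released without anyone having to remember to switch it off.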

GPUs or other special requirements
Another challenge users sometimes face is gaining access to resources that don’t exist within the local HPC facility, such as GPUs or large-memory machines, or needing more capacity or time than the system can provide. Users then face a dilemma: adjust or scale back their science, engineering or research to fit within the local system’s boundaries. This could mean running simulations with less fidelity or granularity, or not running certain workloads at all. Temporary access to the right resources for the workload can make a massive difference, removing constraints and opening the door to larger and more detailed simulations, or to a greater number of jobs, broadening the scope of what’s possible.

Using the scale of cloud to reduce time to results
Another way the cloud can improve HPC productivity and remove barriers for users is by giving access to a large amount of resources for a short period of time. Because of the sheer capacity of compute available in the cloud, large-scale workloads that were impossible to run locally can now be run there. By provisioning a large number of CPU cores, users can massively shorten the time to results compared to running locally, often without spending any more money. With the economics of cloud, it costs the same to run one cloud server for 100 hours as it does to run 100 servers for one hour, so for workloads that scale well it makes sense to use as many machines as budgets allow to reduce the time to results.
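The cost arithmetic above is worth making explicit. Assuming a flat per-server hourly rate (the rate below is invented for illustration; actual prices vary by provider and instance type) and a workload that parallelises well, the total spend is identical while the elapsed time drops by the scale-out factor:

```python
# Hypothetical hourly rate per cloud server -- illustrative only.
RATE_PER_SERVER_HOUR = 2.50

def run_cost(servers, hours):
    """Total spend for running `servers` machines for `hours` each."""
    return servers * hours * RATE_PER_SERVER_HOUR

serial = run_cost(servers=1, hours=100)     # 1 server for 100 hours
parallel = run_cost(servers=100, hours=1)   # 100 servers for 1 hour

# Same cost, but results arrive 100x sooner (for well-scaling workloads).
print(serial, parallel)
```

The caveat, of course, is that not every workload scales perfectly; tightly coupled simulations lose some efficiency as they spread across more machines, so the real speed-up per dollar depends on how well the job parallelises.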

Cloud capacity changes time to results and changes the scale of problems that can be solved

Science and engineering unleashed with cloud capacity
As well as reducing the time to results as previously mentioned, another dimension that HPC on cloud brings is the ability to tackle problems that were previously impossible due to the capacity constraints of locally available resources. By running many simulations in parallel, increasing the fidelity of the simulation or optimisation being attempted, or both, new dimensions of discovery become possible. Using large cloud capacity for a short period could significantly change the approach to certain engineering, scientific and mathematical problems.

To achieve the impossible – rightsize your HPC

When sizing a new HPC system, it’s hard to know how big it needs to be to meet the increasing demands users will place on it over time. Is it best to size it for today’s needs, for those in two years’ time, or somewhere in between? However big the system is, there will be very few occasions when the resources exactly match the demands placed on them. For most of the life of the cluster, there will be either too much capacity or not enough.

We firmly believe that a hybrid approach to HPC is the best way to achieve what’s impossible with a fixed pool of compute resources: rightsizing. Having the right amount of computing power to meet workload needs, regardless of demand and local constraints, could transform the approach to science and engineering across many different industries and research groups, ensuring users have access to the HPC capacity to unleash their full potential for discovery.

To learn more about IBM Cloud HPC, please visit: https://www.ibm.com/high-performance-computing 

And/or you can contact me to discuss further: terry.hpc@ibm.com