Written with Brian Belgodere, STSM IBM Research.
With the explosion of interest and investment in generative AI, IBM has been busy training and open-sourcing its own series of large language models, collectively known as the Granite family. See: https://huggingface.co/ibm-granite and https://arxiv.org/abs/2405.04324.
Recently, IBM gave the world a peek behind the curtain, releasing a paper on the infrastructure used to train this new family of models. The paper covers two separate compute clusters, Vela and Blue Vela, both built to support IBM's LLM training mission. See: https://arxiv.org/abs/2407.05467.
Vela came online in 2022 with the goal of building a cloud-like HPC cluster that emphasizes flexibility and open-source technologies, integrated with IBM Cloud's IaaS.
IBM Research and IBM Cloud teams manage multiple OpenShift clusters on Vela, partitioned by workload. The VM-based stack was tuned to approach bare-metal performance for training workloads, and the use of VMs allows the system to be rapidly reprovisioned with different software stacks, for example when switching nodes between training and inference.
Blue Vela was built in parallel with Vela and completed in the summer of 2024. The goal with Blue Vela was to create a large, dedicated AI training cluster within an extremely short time frame, with every part of the stack chosen and optimized for that task. Given their proven track record at scale, Red Hat Enterprise Linux, IBM's LSF scheduler, and Storage Scale form the core of the Blue Vela stack.
While the primary workload of Blue Vela is large distributed training of language models, there is also a substantial amount of work involved in fine-tuning, testing, context extensions, and quantization.
The IBM Research and LSF Development teams have had an amazing partnership over the years, working on countless projects, the most significant of which are the Summit and Sierra supercomputer systems.
Given the cost, scale, and complexity of these systems, one of the critical scheduling challenges is crafting policies that keep GPU utilization high while ensuring that each of the cluster's workloads receives its needed share of resources and no job starves. The LSF Simulator has been an invaluable tool, letting us explore scheduling policies outside the production environment and implement reservations, guaranteed resource pools, and resource limits to maximize utilization.
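To make the policy mix concrete, here is a rough sketch of what a guaranteed resource pool and a resource limit can look like in LSF's lsb.resources file. All names, hosts, shares, and values below are illustrative assumptions, not Blue Vela's actual configuration; consult the IBM Spectrum LSF documentation for the exact syntax supported by your version.

```
# Hypothetical lsb.resources fragment (illustrative values only).
# A guaranteed pool reserves slots for specific consumers while a loan
# policy lends idle capacity to other work to keep utilization high.
Begin GuaranteedResourcePool
NAME          = pretrain_guarantee
TYPE          = slots
HOSTS         = gpu_hosts
DISTRIBUTION  = ([pretrain_sla, 70%] [tuning_sla, 30%])
LOAN_POLICIES = QUEUES[all] DURATION[30]
DESCRIPTION   = Guaranteed capacity for large training runs
End GuaranteedResourcePool

# A per-user slot cap helps prevent any one user from starving others.
Begin Limit
NAME     = per_user_slot_cap
PER_USER = all
SLOTS    = 128
End Limit
```

Policies like these are exactly what the LSF Simulator lets you evaluate against recorded workloads before rolling them out to production.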
IBM Research developed a custom observability stack for Blue Vela using several open-source technologies. We created a Prometheus exporter for LSF metrics to correlate job information with the metrics collected from the compute hosts, network, and Storage Scale. We built persona-specific Grafana dashboards to support operators, researchers, and executives, incorporating data from LSF Explorer. This LSF Prometheus exporter was later adapted to support IBM Cloud's HPC offerings.
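The core of such a bridge is translating scheduler state into Prometheus metrics. The sketch below is not IBM's actual exporter; it is a minimal illustration, assuming `bjobs -u all -o "queue stat" -json` style output (a real LSF option) and a hypothetical `lsf_jobs` metric name, that converts per-queue job counts into the Prometheus text exposition format.

```python
"""Minimal sketch of an LSF-to-Prometheus bridge (illustrative only)."""
import json


def bjobs_to_metrics(bjobs_json: str) -> str:
    """Turn `bjobs ... -json` output into Prometheus exposition text.

    Counts jobs per (queue, state) pair and emits one gauge sample per
    pair. The metric name `lsf_jobs` is a made-up example.
    """
    counts: dict[tuple[str, str], int] = {}
    for rec in json.loads(bjobs_json).get("RECORDS", []):
        key = (rec["QUEUE"], rec["STAT"])
        counts[key] = counts.get(key, 0) + 1

    lines = [
        "# HELP lsf_jobs Number of LSF jobs by queue and state.",
        "# TYPE lsf_jobs gauge",
    ]
    for (queue, stat), n in sorted(counts.items()):
        lines.append(f'lsf_jobs{{queue="{queue}",stat="{stat}"}} {n}')
    return "\n".join(lines) + "\n"
```

In a real exporter this text would be served on an HTTP `/metrics` endpoint (for example via the `prometheus_client` library) and scraped by Prometheus alongside node, fabric, and storage exporters, so Grafana can join job-level context with hardware metrics.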
We encourage you to read the paper: https://arxiv.org/abs/2407.05467.