I am putting this entry together to describe a project our IBM Storage SWAT team for the Americas has been tasked with: assisting with the setup of a Kubernetes-based infrastructure leveraging IBM ESS at a GPU cloud provider.
Context:
- Kubernetes community version
- Ubuntu Linux distribution
- Customer already uses a different storage vendor
- Customer was provided with IBM ESS
- About 1000 nodes
- 5 control plane x86 nodes
- ARM nodes with 4 of the latest NVIDIA GPUs each
As IBM does not yet have CNSA available for this environment, it was decided to deploy the IBM Scale CSI driver, which is available for the environment, until IBM engineering delivers CNSA for the platform later this year.
Our goal for this project is to demonstrate that IBM Scale is a superior storage platform for AI/ML workloads and, hopefully, to encourage the GPU cloud provider to grow its fleet of IBM ESS over time.
Challenges:
- Automate the configuration of Kubernetes nodes
  - Deploy SSH keys
  - Deploy IBM Scale packages for both x86 and ARM nodes
- Automate the configuration of IBM Scale CSI
  - Add a node as a client when it is added to the Kubernetes cluster
  - Remove the node from the configuration when it leaves the Kubernetes cluster
- Automate the configuration of the IBM Scale cluster
  - Add each node as an IBM Scale client when it joins the Kubernetes cluster
  - Add the node to the correct node class
  - Start IBM Scale services on the node
  - Mount the IBM Scale filesystem
To achieve these goals, we created the following tools:
- CSI Reserve component (one pod per node, controlled via a Kubernetes Deployment)
  - Uses a custom Python base container image built on Red Hat UBI 9
  - One init container to deploy the correct SSH keys
  - One init container to deploy the IBM Scale packages (x86 or ARM)
  - One main container that intercepts the request to take the node out of the cluster (see the first sketch after this list):
    - Sets a label when the main container starts (`csireserve=RUNNING`)
    - Sets a label when the termination signal is received (`csireserve=TERMINATING`)
    - Sets a label when the pod is terminated past the grace period (`csireserve=TERMINATED`)
  - The grace period for the pod to effectively terminate is set via an environment variable.
- CSI Auto component (a single pod, controlled via a Kubernetes Deployment)
  - Uses a custom Python base container image built on Red Hat UBI 9
  - Main loop controls the creation and deletion of the CSI Reserve deployments
  - One thread per node to monitor label changes on that node (see the second sketch below)
  - Node addition: when the CSI Reserve pod reaches the running state (`csireserve=RUNNING`), each thread performs these tasks in sequence:
    - Adds the node to the IBM Scale cluster, as a client node if ARM or as a quorum node if x86
    - Adds the node to the correct node class
    - Starts the IBM Scale services on the node
    - Mounts the IBM Scale filesystem on the node
    - Adds the node to the IBM Scale CSI driver operator CR configuration (`spec.nodeMappings` section)
    - Sets the `scale=true` label on the node (this label controls the IBM Scale CSI daemonset)
  - Node removal: when the CSI Reserve pod reaches the terminating state (`csireserve=TERMINATING`), each thread performs these tasks in sequence:
    - Sets the `scale=false` label on the node
    - Removes the node from the IBM Scale CSI driver operator CR configuration (`spec.nodeMappings` section)
    - Unmounts the IBM Scale filesystem from the node
    - Stops the IBM Scale services on the node
    - Removes the node from the IBM Scale configuration
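To make the CSI Reserve lifecycle concrete, here is a minimal sketch of what its main container could look like. This is not the actual code from the repository; the `NODE_NAME` and `GRACE_PERIOD` variable names and the overall flow are assumptions based on the description above.

```python
import os
import signal
import time

from kubernetes import client, config

# Assumptions: the pod runs with a service account allowed to patch nodes,
# NODE_NAME is injected via the downward API, and GRACE_PERIOD is the
# environment variable mentioned above (both names are illustrative).
NODE_NAME = os.environ["NODE_NAME"]
GRACE_PERIOD = int(os.environ.get("GRACE_PERIOD", "300"))

config.load_incluster_config()
v1 = client.CoreV1Api()

def set_label(state: str) -> None:
    """Patch the csireserve label on this pod's node."""
    v1.patch_node(NODE_NAME, {"metadata": {"labels": {"csireserve": state}}})

terminating = False

def on_sigterm(signum, frame):
    global terminating
    terminating = True

signal.signal(signal.SIGTERM, on_sigterm)

# Main container started: advertise the node as ready for IBM Scale setup.
set_label("RUNNING")
while not terminating:
    time.sleep(1)

# Termination signal received: give CSI Auto time to run the removal sequence.
set_label("TERMINATING")
time.sleep(GRACE_PERIOD)

# Grace period elapsed: mark the node as effectively terminated.
set_label("TERMINATED")
```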
All this code is available on the IBM GitHub under my account (jclopez@ibm.com) for those who want to take a look.
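To give a flavor of the CSI Auto side, here is a similarly simplified sketch of a per-node worker reacting to `csireserve` label transitions. The helper functions are placeholders for the IBM Scale and CSI operator calls listed above, not real API names, and polling is used here only for brevity.

```python
import threading
import time

from kubernetes import client, config

config.load_incluster_config()
v1 = client.CoreV1Api()

def add_node_to_scale(node_name: str) -> None:
    # Placeholder for the addition sequence described above: join the IBM
    # Scale cluster, set the node class, start services, mount the
    # filesystem, update spec.nodeMappings, and set scale=true.
    ...

def remove_node_from_scale(node_name: str) -> None:
    # Placeholder for the removal sequence described above, in reverse order.
    ...

def watch_node(node_name: str) -> None:
    """One thread per node: track the csireserve label and react to changes."""
    last_state = None
    while True:
        node = v1.read_node(node_name)
        state = (node.metadata.labels or {}).get("csireserve")
        if state != last_state:
            if state == "RUNNING":
                add_node_to_scale(node_name)
            elif state == "TERMINATING":
                remove_node_from_scale(node_name)
            last_state = state
        time.sleep(5)  # polling for brevity; a watch would be more efficient

# The main loop (not shown) creates one CSI Reserve deployment and one such
# thread per Kubernetes node, for example:
# threading.Thread(target=watch_node, args=("node-001",), daemon=True).start()
```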
How did we make all of this happen, you may ask? First, we created a test lab on IBM premises so we could test the concepts and identify the first bugs and potential pitfalls: deployment configuration and management, IBM Scale REST API calls and their parameters, DNS and network configuration, host naming conventions, and so on.
Once we had tested the basics in this lab, known to us as the SWAT lab, we hopped over to the final production environment, which was to be used by IBM Research for the next generation of Granite models. Testing in the SWAT lab took about a week, and we made a few changes to the CSI Auto code until we were confident we could give it a try in the real production environment. During the expansion, we made a few improvements to the CSI Auto pod so that it offers a REST API to modify the log level and the `hostFilter` dynamically.
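As an illustration of the dynamic log-level endpoint, here is a minimal sketch using Python's standard library. The route, port, and payload format are assumptions, not the actual CSI Auto API.

```python
import json
import logging
from http.server import BaseHTTPRequestHandler, HTTPServer

logger = logging.getLogger("csi-auto")  # illustrative logger name

class AdminHandler(BaseHTTPRequestHandler):
    def do_PUT(self):
        # Hypothetical route: PUT /loglevel with a body like {"level": "DEBUG"}
        if self.path == "/loglevel":
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or "{}")
            level = payload.get("level", "INFO").upper()
            logger.setLevel(getattr(logging, level, logging.INFO))
            self.send_response(200)
            self.end_headers()
            self.wfile.write(f"log level set to {level}\n".encode())
        else:
            self.send_response(404)
            self.end_headers()

# Runs alongside the main loop, typically in its own thread.
HTTPServer(("0.0.0.0", 8080), AdminHandler).serve_forever()
```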
Once we landed in the final production environment, we did have to make a few changes to the CSI Auto code to account for a few corner cases, but most of our problems were caused by DNS name resolution and DNS configuration on the Kubernetes nodes (we had to create a complete DNS service usable by the pods above, but more importantly by the IBM Scale nodes).
After a week of fine-tuning and adding one node at a time (using a special `hostFilter` function built into the CSI Auto pod), we started adding nodes in bulk, one rack at a time. It took us about a week, and dedication from Matt Klos on our team, to monitor the addition of those racks and make sure they were put under the control of the tool.
We are now at the point of benchmarking the actual IBM ESS performance and comparing it to the customer's existing storage environment. Initial testing suggests IBM ESS is about 30% faster, even though it remains largely untuned.
I would like to extend many thanks to all the members of our team who helped make this possible, in alphabetical order:
- Brian Belgodere
- Shawn Houston
- Matt Klos
- Abdoulaye Traore
- Brent Wolfe
- Joanna Wong
#LLM