You can deploy an IBM Spectrum Conductor cluster or an IBM Watson Machine Learning Accelerator (WML Accelerator) cluster within an IBM Spectrum LSF (LSF) cluster.
In this architecture, LSF is the base scheduling system, and IBM Spectrum Conductor and WML Accelerator are integrated as add-ons to LSF. This architecture adds AI capabilities to LSF natively. The solution provides the benefits of these products within an LSF cluster, including optimized resource utilization and sharing, security, user management, high availability, graphics processing unit (GPU) scheduling, and more. The integration supports existing LSF clusters (LSF 10.1.0.6 or later) and WML Accelerator 1.2.3 or later.
The integration enables the products to share cluster resources between AI workloads and high-performance computing (HPC) workloads by using fine-grained, slot-level resource sharing in the LSF cluster. It lets you consolidate resources for optimal utilization and sharing, and avoid resource silos. In addition, resource planning, workload tracking, and accounting for IBM Spectrum Conductor and WML Accelerator workloads are done through LSF.
To start an IBM Spectrum Conductor or WML Accelerator cluster within an LSF cluster, you can either use the LSF bsub command to submit an IBM Spectrum Conductor controller job, or run an independent start script from the CLI. Either method starts an IBM Spectrum Conductor or WML Accelerator management cluster within the LSF cluster. The management cluster runs system and instance group management services, and supports automatic cluster resizing. Hosts are acquired exclusively from LSF for use in the management cluster.
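As a rough illustration of the bsub route, the following Python sketch assembles a submission command for a controller job. The script path, job name, and output file name are hypothetical placeholders, not names taken from the product; `normal` is only an example queue. Only the standard bsub options `-J` (job name), `-q` (queue), and `-o` (output file) are used. The actual controller script and any required options are described in the product documentation.

```python
import shlex
from typing import List


def build_bsub_command(controller_script: str, job_name: str,
                       queue: str, output_file: str) -> List[str]:
    """Build a bsub argument list that submits a controller script as an LSF job.

    All argument values are supplied by the caller; nothing here is a
    product default.
    """
    return [
        "bsub",
        "-J", job_name,      # name shown for the job in LSF
        "-q", queue,         # LSF queue to submit to
        "-o", output_file,   # file that receives the job's standard output
        controller_script,   # hypothetical path to the controller job script
    ]


# Hypothetical values, for illustration only.
cmd = build_bsub_command(
    controller_script="/path/to/controller_job.sh",
    job_name="conductor-controller",
    queue="normal",
    output_file="controller.%J.out",  # LSF replaces %J with the job ID
)

# Print the command line instead of running it, so it can be reviewed first.
print(shlex.join(cmd))
```

On a host with LSF installed, the same list could be passed to `subprocess.run(cmd, check=True)` once the placeholders are replaced with real values.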
After an IBM Spectrum Conductor or WML Accelerator management cluster is started, you can access the management cluster and the instance groups by using the IBM Spectrum Conductor or WML Accelerator GUI (cluster management console), CLI commands, and RESTful APIs. IBM Spectrum Conductor and WML Accelerator workloads run as jobs on compute hosts in the LSF cluster, and use fine-grained resource sharing with the other workloads that run in the LSF cluster.
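Because these workloads appear as ordinary LSF jobs, they can be inspected with standard LSF tooling. The sketch below builds a `bjobs` query and summarizes job states from its JSON output. It rests on several assumptions: that the jobs can be selected by a job name pattern (the pattern `conductor*` is hypothetical), that your LSF version supports `bjobs -o ... -json`, and that the output has a top-level `RECORDS` list with a `STAT` field per job. The sample data is made up so that the example runs without an LSF cluster.

```python
import json
from collections import Counter
from typing import Dict, List


def build_bjobs_command(job_name_pattern: str) -> List[str]:
    """Build a bjobs command that lists matching jobs of all users as JSON."""
    return [
        "bjobs",
        "-u", "all",                    # include jobs from all users
        "-J", job_name_pattern,         # filter by job name (wildcard allowed)
        "-o", "jobid stat job_name",    # fields to include in the output
        "-json",                        # JSON output; used together with -o
    ]


def count_by_status(bjobs_json: str) -> Dict[str, int]:
    """Count jobs per status, assuming a top-level RECORDS list with STAT fields."""
    records = json.loads(bjobs_json).get("RECORDS", [])
    return dict(Counter(rec.get("STAT", "UNKNOWN") for rec in records))


# Made-up sample output; not captured from a real cluster.
SAMPLE_OUTPUT = """
{"COMMAND": "bjobs", "JOBS": 3, "RECORDS": [
  {"JOBID": "101", "STAT": "RUN",  "JOB_NAME": "conductor-a"},
  {"JOBID": "102", "STAT": "RUN",  "JOB_NAME": "conductor-b"},
  {"JOBID": "103", "STAT": "PEND", "JOB_NAME": "conductor-c"}
]}
"""

query = build_bjobs_command("conductor*")  # hypothetical naming pattern
summary = count_by_status(SAMPLE_OUTPUT)
print(query)
print(summary)
```

On a real cluster, the output of `subprocess.run(query, capture_output=True, text=True).stdout` would take the place of `SAMPLE_OUTPUT`.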
IBM Spectrum Conductor 2.5.0 includes the following new features and enhancements in the LSF integration.
The following features were enabled:
- Configuration modification.
- Data connectors.
The following features were enhanced:
- Automatic GPU allocation, sharing and reclaim.
- Monitoring of the LSF jobs that are generated by IBM Spectrum Conductor and WML Accelerator.
- Running elk-shippers on compute hosts.
- Cleanup of slots and allocation jobs.
- Logging improvements.
The following new features were implemented:
- Start and manage the IBM Spectrum Conductor or WML Accelerator cluster as a non-root user.
- Start and stop an IBM Spectrum Conductor or WML Accelerator cluster directly, without the LSF bsub or bkill commands.
- Revert an IBM Spectrum Conductor or WML Accelerator cluster from LSF mode to standard mode.
The following documentation enhancements were made:
- New Swagger RESTful API documentation for the LSF parameters configured for instance groups.
- Enhancements in the documentation of the LSF integration, including new features, and more elaboration and clarifications based on deployment experience.
For more information about the LSF integration, see IBM Spectrum LSF integration.
Give it a try and tell us what you think by downloading IBM Spectrum Conductor 2.5.0 on Passport Advantage or the evaluation version! We hope you are as excited as we are about this new release!
Log in to this community page to comment on this blog post. We look forward to hearing from you on the integration and feature enhancements, and what you would like to see in future releases.