Git'n Down with LSF

By Bill McMillan posted Fri February 28, 2020 04:47 AM

  
IBM Spectrum LSF is no stranger to version control; after all, the product is more than 25 years old and still going strong.

LSF’s configuration is held in several text files and clients have used CVS, SVN and more recently git to manage changes to those files. While these tools provide excellent management of the configuration, there are still challenges in using them with production clusters, especially in a multi-cluster or cloud environment – you’ve committed changes to the configuration but applying those changes to the running environment is a separate manual step.

Can we automate this?

Git and GitHub provide APIs that allow changes and commits to be monitored. Using these, we can detect new configurations being committed and automatically trigger LSF to read and apply those new configurations without interrupting operations. This methodology can also be applied to LSF Process Manager computational workflows, where a commit of a new workflow to git will trigger the cloning and import of the workflow to the Process Manager server (Figure 1).

Figure 1: LSF Git Integration
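
The detection step can be sketched as a simple polling check. This is a minimal sketch and not the prototype's actual implementation: it assumes the watcher keeps a local clone whose branch tracks an upstream branch, and the names `git_runner` and `has_new_commits` are hypothetical, introduced only for illustration.

```python
import subprocess
from typing import Callable, List

# A runner takes git arguments (a list of strings) and returns stdout.
Runner = Callable[[List[str]], str]


def git_runner(repo: str) -> Runner:
    """Build a runner that executes git commands inside the clone at `repo`."""
    def run_cmd(args: List[str]) -> str:
        result = subprocess.run(
            ["git", "-C", repo, *args],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    return run_cmd


def has_new_commits(run_cmd: Runner) -> bool:
    """Return True when the upstream branch tip differs from the local HEAD."""
    run_cmd(["fetch", "--quiet"])             # refresh remote-tracking refs
    local = run_cmd(["rev-parse", "HEAD"])    # commit currently checked out
    remote = run_cmd(["rev-parse", "@{u}"])   # tip of the upstream branch
    return local != remote
```

A watcher could call `has_new_commits(git_runner("/path/to/clone"))` on a timer and start the pull, check and apply steps only when it returns True. Passing the runner in as a parameter keeps the comparison logic testable without a real repository.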


This same technique can be extended to support the automated processing of data such as automatically launching a workflow when new files are committed to a repository, or when files are updated. Git can be used as the trigger for initiating computational workflows and pipelines.

Managing LSF Configuration with Git

As previously mentioned, LSF’s configuration is held in several text files, and it seems a trivial matter to make a change and apply the new configuration. In many cases it is, but our Support tickets show that an Administrator is sometimes interrupted by a higher priority issue and the changes don’t get applied. Even worse, only partial changes may have been made, leaving the configuration with errors.

With the git integration, the configuration is checked and applied automatically when those changes are committed.

Single Cluster Configuration Management

Before using the integration, you first need to commit a copy of the LSF_ENVDIR directory structure to an instance of GitHub. With the files in git, you can then start the “lsf-git-configure” Python script on the LSF master host, which watches the GitHub repository for changes. Any changes to the repository are pulled, checked and, if there are no errors, applied to the cluster. For example:

$ ./lsf-git-configure.py &
$ badmin showconf mbd | grep HPC
LSB_ENABLE_HPC_ALLOCATION = Y

The Administrator clones a copy of the configuration where they can make changes without impacting production. For example, turning off the HPC_ALLOCATION algorithm.

$ git clone lsf.master.host:/lsf/cluster/conf

Edit lsf.conf and set “LSB_ENABLE_HPC_ALLOCATION=N”

$ git commit -m 'added parameter LSB_ENABLE_HPC_ALLOCATION=N' -a
$ git push

When the change is committed, the script detects the change, clones it from the repository, checks it for correctness and applies the change – in this case it would execute “badmin mbdrestart”. The result of this change can be seen in the active configuration:

$ badmin showconf mbd | grep HPC
LSB_ENABLE_HPC_ALLOCATION = N
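
The check-then-apply step described above can be sketched as follows. This is an illustration under stated assumptions rather than the code in lsf-git-configure: it assumes the pulled files are already in LSF_ENVDIR, that `lsadmin ckconfig` and `badmin ckconfig` exit with a non-zero status when they find errors, and the function and runner names are hypothetical.

```python
import subprocess
from typing import Callable, List

# A runner executes a command and returns its exit status.
Runner = Callable[[List[str]], int]


def shell_runner(cmd: List[str]) -> int:
    """Run a command and return its exit status."""
    return subprocess.run(cmd).returncode


def check_and_apply(run_cmd: Runner = shell_runner) -> str:
    """Validate the pulled configuration, then restart mbatchd if it is clean."""
    # Assumption: both checkers exit non-zero when they find errors.
    for check in (["lsadmin", "ckconfig"], ["badmin", "ckconfig"]):
        if run_cmd(check) != 0:
            return "rejected"      # leave the running cluster untouched
    # The example above applies an lsf.conf change with "badmin mbdrestart";
    # an unattended script may also need to handle its confirmation prompt.
    if run_cmd(["badmin", "mbdrestart"]) != 0:
        return "apply-failed"
    return "applied"
```

Returning a status string instead of raising lets the watcher log a rejected commit and keep running with the last good configuration.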

Multi-cluster Configuration Management

LSF’s multi-cluster capability allows work to be sent from one cluster to another – this may be on-prem to on-prem, or on-prem to cloud. When using multi-cluster it is important that all the clusters have the same common understanding of resource definitions. This is supported by using common #include files (Figure 2). As in the previous example, if a change is applied in one cluster, and not in another, there may be unexpected consequences.

Figure 2


This is where the integration with git has significant value. Each cluster is monitoring the common repository for changes. This means that when a change is committed, every cluster participating in the multi-cluster environment will automatically clone and apply the change, ensuring that all clusters have a common view of the world (Figure 3).

Figure 3


This can be done in the same manner as a single cluster, where all clusters share the same LSF_ENVDIR repository, or the configuration can be split into cluster-specific files and shared files as shown in Figure 3. To manage a shared configuration directory, you only need to pass its path with the `--shared_envdir` option:

$ ./lsf-git-configure.py --shared_envdir /path/to/shared/conf/

Managing LSF Process Manager with Git

Process Manager allows a user to graphically define a computational workflow which can then be shared with other users, enabling consistent and repeatable execution of key computational operations. These workflows are stored as a set of XML files.

As with multi-cluster, multiple Process Manager instances may be running in the same cluster or in different clusters. The Process Manager workflow definitions can be managed in a similar manner to LSF’s configuration, such that when new flow definitions are committed to the repository they are automatically applied to all instances.

$ ./ppm-git-trigger.py --path /path/to/ppm/repo/
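
Since the workflows are stored as XML files, one way to decide what needs importing is to list the XML files that changed between the last applied commit and the new one. The sketch below is an assumption about how that could be done, not a description of ppm-git-trigger: `git_in` and `changed_flow_files` are hypothetical names, and the import into Process Manager is left to the caller.

```python
import subprocess
from pathlib import PurePosixPath
from typing import Callable, List

# A runner takes git arguments and returns the command's stdout.
GitRunner = Callable[[List[str]], str]


def git_in(repo: str) -> GitRunner:
    """Build a runner that executes git commands inside the clone at `repo`."""
    def run_git(args: List[str]) -> str:
        return subprocess.run(
            ["git", "-C", repo, *args],
            capture_output=True, text=True, check=True,
        ).stdout
    return run_git


def changed_flow_files(old_rev: str, new_rev: str, run_git: GitRunner) -> List[str]:
    """List XML files added or modified between two commits."""
    # --diff-filter=AM keeps added and modified paths, so deleted flows are skipped.
    output = run_git(["diff", "--name-only", "--diff-filter=AM", old_rev, new_rev])
    return [path for path in output.splitlines()
            if PurePosixPath(path).suffix.lower() == ".xml"]
```

For example, `changed_flow_files(last_applied, "HEAD", git_in("/path/to/ppm/repo/"))` would return only the flow definitions that need to be re-imported, where `last_applied` is whatever commit the watcher recorded after its previous run.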

A workflow can be defined with a precondition that it waits until a file exists in a directory. For example, a file must be present in the directory `workflow/my-workflow/data/` before the workflow runs with that data. This makes it possible to have file-triggered workflows managed by a git pull request that adds the file.
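
The file precondition can be pictured as a small polling loop. This is only a sketch of the idea in plain Python; Process Manager evaluates such conditions itself, and `wait_for_data` and its parameters are hypothetical names used for illustration.

```python
import time
from pathlib import Path
from typing import List


def wait_for_data(data_dir: str, timeout: float = 60.0,
                  interval: float = 1.0) -> List[Path]:
    """Wait until `data_dir` holds at least one file, or the timeout expires.

    Returns the files found, or an empty list if none appeared in time.
    """
    directory = Path(data_dir)
    deadline = time.monotonic() + timeout
    while True:
        files = []
        if directory.is_dir():
            # Skip dotfiles such as .gitkeep, which often sit in empty git directories.
            files = sorted(p for p in directory.iterdir()
                           if p.is_file() and not p.name.startswith("."))
        if files or time.monotonic() >= deadline:
            return files
        time.sleep(interval)
```

A caller would run `wait_for_data("workflow/my-workflow/data/")` after each pull and start the workflow only when the returned list is not empty.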

Benefits

By integrating LSF and git we are bringing many of the benefits of DevOps to the management of LSF:
  • Version Control – rather than maintaining lots of copies of your configuration you have a single version-controlled repository, which can also be used to easily roll back changes.
  • Data provenance – clearly see what changes have been made and who made them, making maintenance and audit much simpler.
  • Access anywhere, by everyone, with git(hub).
  • Git can be operated from anywhere you have a connection to the remote repository.
  • Git has secure transport and detailed ACL control, so anyone can submit a pull request to GitHub and an administrator can accept or reject it. This can help decrease the workload of the primary administrator.

If you are interested in trying this prototype, the source code can be found in the Spectrum Computing GitHub repository. We'd love to hear your feedback. We're also looking at extending this methodology to cover additional files that need to be synced between clusters, such as application templates and submission scripts.


Finally, I'd like to thank Xun Pan, Yong Wang & Gang Pu from our Xi'An Development Lab for creating this LSF integration with git.


GitOps details in https://www.weave.works/technologies/gitops/
LSF configuration file list in https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/lsf_config_ref/part_files.html
LSF git script source code in https://github.com/IBMSpectrumComputing/lsf-git-ops
