How to Deploy IBM Spectrum LSF on IBM Cloud for HPC Workloads

By Andy Morris posted Thu October 24, 2019 11:00 AM

  

By: Xiao Peng Chen and Suraksha Vidyarthi

IBM Spectrum LSF Suite for Enterprise is HPC scheduling software that redefines HPC cluster virtualization and workload management. It does this by providing a tightly integrated solution that can increase both user productivity and hardware utilization while decreasing system management costs.

Historically, customers have used IBM Spectrum LSF Suite for on-premises deployments, but with the advent of cloud computing, the software has evolved to support a cloud deployment model. The resource connector function, which allows an LSF deployment to scale up and down automatically, is one such capability. It lets customers configure and optimally manage the cost of their cloud-based HPC clusters.

IBM Cloud provides compute, networking, and storage configurations that can easily replace the on-premises configurations typically used to build HPC clusters. With the resource connector function running on IBM Cloud, this becomes a true IBM-software-on-IBM-Cloud solution, with full support and coverage of the entire HPC stack managed by a single vendor.

Overview

In this step-by-step tutorial, users will learn how to do the following:

  1. Order virtual servers from IBM Cloud.
  2. Install IBM Spectrum LSF Suite for Enterprise 10.2.0.8 on virtual servers in IBM Cloud.
  3. Configure the IBM Spectrum LSF master node to enable the LSF resource connector to autoscale the LSF Cluster based on configured resource utilization rules.

Prerequisites

Before starting the tutorial, please ensure that the following prerequisites are met:

  • Download the appropriate package from IBM Passport Advantage: IBM Spectrum LSF Suite for Enterprise 10.2.0.8 Installation Package for Linux on x86-64 English (CC2Q0EN) — lsfsent10.2.0.8-x86_64.bin
  • Paid account on IBM Cloud
  • Familiarity with IBM Spectrum LSF Suite

Step-by-step instructions

Step 1: Provision virtual server from IBM Cloud

In this step, the user will create two virtual servers from IBM Cloud:

  • lsf-master.hpc.ibmcloud: Configured as LSF_Master node, GUI_Host, and DB_Host
  • lsf-slave.hpc.ibmcloud: Configured as LSF_Server slave node

Step 1.1: Order LSF Master

  • Log in to your IBM Cloud account, navigate to the Catalog, and select the Virtual Server service under the Compute category.

  • Choose to order Public Virtual Server and then click Continue.

  • On the Configuration page, specify the following and then click Create:
    • Type of virtual server: Public
    • Hostname: lsf-master 
    • Domain: hpc.ibmcloud 
    • Popular profiles: Memory M1.4x32: 4 vCPUs 32 GB RAM. The user will install the IBM Spectrum LSF Suite for Enterprise LSF master host, GUI server host, database host, and LSF suite installation repository on a single host. Check the host prerequisites for LSF Suite.
    • Image: Redhat 7.x Minimal (64 bit) -HVM
    • Attached Storage disks: 100 GB (SAN). Check the host prerequisites for LSF Suite.
    • Uplink port speeds: 1 Gbps Public/Private Network Uplink

    • Private VLAN: Choose a private VLAN. Make sure the LSF Master and LSF Server are in the same private VLAN.
    • For other parameters, you can keep the default values.

Step 1.2: Order LSF Server

Refer to Step 1.1 and follow the same steps to order the LSF Server. On the Configuration page, specify the following:

  • Type of virtual server: Public
  • Hostname: lsf-slave
  • Domain: hpc.ibmcloud
  • Popular profiles: Memory M1.4x32: 4 vCPUs 32 GB RAM. LSF server hosts require at least 2000 MB of free memory.
  • Image: Redhat 7.x Minimal (64 bit) -HVM
  • Attached Storage disks: 25 GB (SAN).
  • Uplink port speeds: 1 Gbps Private Network Uplink

  • Private VLAN: Choose a private VLAN. Make sure the LSF Master and LSF Server are in the same private VLAN.
  • For other parameters, you can keep the default values.

Step 2: Install IBM Spectrum LSF Suite for Enterprise 10.2.0.8 on lsf-master.hpc.ibmcloud node

Step 2.1: Log in to lsf-master.hpc.ibmcloud and lsf-slave.hpc.ibmcloud

  • Log in as root on lsf-master.hpc.ibmcloud with an SSH client:
    # ssh root@<lsf_master_public_ip_addr> 
  • The details for lsf-master.hpc.ibmcloud—such as public IP address, private IP address, etc.—can be found in the Device List (Navigation Menu > Classic Infrastructure > Devices > Device List).

  • The password for root can be found using the Passwords tab on the left navigation bar.

  • Log in to lsf-slave.hpc.ibmcloud with an SSH client; since lsf-slave.hpc.ibmcloud is available only on the private network, it can be reached from lsf-master.hpc.ibmcloud only:
    # ssh root@<lsf_slave_private_ip_addr>

Step 2.2: Configure lsf-master.hpc.ibmcloud and lsf-slave.hpc.ibmcloud

  • Make sure that the hosts are reachable from the deployer host (in this demo, the deployer host will be the LSF Master host), and make sure the LSF Master and LSF Server can ping each other successfully:
    • On lsf-master.hpc.ibmcloud, edit /etc/hosts to make sure it has the following entries:
      <lsf_master_private_ip_addr> lsf-master.hpc.ibmcloud lsf-master
      <lsf_slave_private_ip_addr> lsf-slave.hpc.ibmcloud lsf-slave
    • On lsf-slave.hpc.ibmcloud, edit /etc/hosts to make sure it has the same entries:
      <lsf_master_private_ip_addr> lsf-master.hpc.ibmcloud lsf-master
      <lsf_slave_private_ip_addr> lsf-slave.hpc.ibmcloud lsf-slave
  • Set up passwordless SSH for root in both directions between the master node and the slave node (an ssh-copy-id alternative is sketched at the end of this step):
    • Generate an SSH key pair on lsf-master.hpc.ibmcloud (press Enter to accept the default values):
      # ssh-keygen
    • Get the public key:
      # cat ~/.ssh/id_rsa.pub
    • Copy the public key into /root/.ssh/authorized_keys in lsf-master.hpc.ibmcloud.
    • Copy the public key into /root/.ssh/authorized_keys in lsf-slave.hpc.ibmcloud. Make sure the permissions on the authorized_keys file are restricted to root:
      # chmod 600 authorized_keys
    • Repeat the first and fourth steps for lsf-slave.hpc.ibmcloud.
    • After completing the steps above, you should be able to log in to each host from the other without being prompted for a password. To test from the master node, try logging in to both servers via SSH:
      # ssh root@lsf-master.hpc.ibmcloud
      # ssh root@lsf-slave.hpc.ibmcloud

  • Install yum-utils on lsf-master.hpc.ibmcloud:
    # yum -y install yum-utils
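As an alternative to copying keys by hand, the same passwordless setup can be done with ssh-copy-id (a sketch, assuming password login is still enabled on both hosts):

On lsf-master.hpc.ibmcloud:
# ssh-keygen
# ssh-copy-id root@lsf-slave.hpc.ibmcloud

On lsf-slave.hpc.ibmcloud:
# ssh-keygen
# ssh-copy-id root@lsf-master.hpc.ibmcloud

ssh-copy-id appends the public key to the remote authorized_keys file and sets its permissions for you.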

Step 2.3: Copy and extract the install package

  • Copy the install package lsfsent10.2.0.8-x86_64.bin to the deployer host (that is, lsf-master.hpc.ibmcloud).
  • For example, copy the package from your local machine to the home folder on the remote host. The transfer can take some time depending on the speed of your network connection:
    $ scp lsfsent10.2.0.8-x86_64.bin root@<lsf_master_public_ip_addr>:~/

  • Make the package executable:
    # chmod 744 lsfsent10.2.0.8-x86_64.bin
  • Run it:
    # ./lsfsent10.2.0.8-x86_64.bin
  • Follow the prompts to accept the license terms and start the installation; the complete process can take a few minutes.

  • On successful deployment, the installer prints a completion message at the end.

Step 2.4: Edit and test configuration files

  • Change directory to /opt/ibm/lsf_installer/playbook on the deployer host (lsf-master.hpc.ibmcloud).
  • Edit the lsf-inventory file to list the hosts in the cluster and their roles:
    [LSF_Masters]
    lsf-master.hpc.ibmcloud

    [LSF_Servers]
    lsf-slave.hpc.ibmcloud

    [GUI_Hosts]
    lsf-master.hpc.ibmcloud

    [DB_Host]
    lsf-master.hpc.ibmcloud

  • Edit the lsf-config.yml file to specify a cluster name (keep the default values for the other cluster properties):
    my_cluster_name: lsf-demo

  • Test the configuration and host access by running the following commands:
    # ansible-playbook -i lsf-inventory lsf-config-test.yml
    # ansible-playbook -i lsf-inventory lsf-predeploy-test.yml
  • Make sure these tests pass successfully before moving to the next step. Each playbook ends with an Ansible PLAY RECAP summary, which should show failed=0 for every host.

Step 2.5: Install IBM Spectrum LSF Suite

  • Run the command below. The process can take several minutes:
    # ansible-playbook -i lsf-inventory lsf-deploy.yml
  • On successful completion, the playbook prints a summary with no failed tasks.

Step 2.6: Verify the installation

  • On lsf-master.hpc.ibmcloud, source the LSF environment from the command line:
    # source /opt/ibm/lsfsuite/lsf/conf/profile.lsf
  • Run the lsid command to see your cluster name and master host name:
    # lsid

  • Run the lshosts command to see the hosts that belong to your cluster:
    # lshosts

  • Run the bhosts command to check that the status of the hosts is ok and the cluster is ready to accept work:
    # bhosts

  • Run the lsclusters command to view cluster status and size:
    # lsclusters

  • Log in to the GUI as the lsfadmin user. If the lsfadmin user was created by the installation and did not previously exist on your system, you might need to create a password for it. Use the following command to set the password for lsfadmin:
    [root@lsf-master playbook]# passwd lsfadmin
  • Open your browser and enter the GUI portal URL: http://<lsf-master_public_ip>:8080
  • Log in with the lsfadmin user.
  • On the Resources > Dashboard page, you can see a summary of the deployed environment.

  • On the Resources > Hosts page, you can see a list of the two registered servers.

Step 3: Configure the LSF Resource Connector

In this step, we will configure the LSF Resource Connector to enable autoscaling (i.e., dynamically adding virtual servers to the LSF cluster based on configured resource utilization rules).

Step 3.1: Build a customized image template for virtual compute host

A customized image template will be created based on lsf-slave. This template will be used to add new virtual servers to the LSF cluster:

  • Connect to lsf-slave via SSH from lsf-master.hpc.ibmcloud node:
    ssh root@lsf-slave.hpc.ibmcloud 
  • Modify /opt/ibm/lsfsuite/lsf/conf/ego/lsf-demo/kernel/ego.conf:
    • Remove or comment out the following line:
      EGO_GETCONF=lim
    • Add or uncomment the following line (a sed one-liner for this edit is sketched at the end of this step):
      EGO_GET_CONF=lim

  • Modify /root/.bash_profile to add the rc_account tag:
    export rc_account=lsf-demo-dynamic-host

  • Log in to IBM Cloud, navigate to the Dashboard, and select View resources.
  • Click Devices to show devices.
  • Find and click device lsf-slave.hpc.ibmcloud to show Device Details.
  • Click Actions > Create Image Template, fill in the Image Name (such as LSFSlaveHostImage), check "I agree to have my computing instance powered off," and click the Create Template button. It takes several minutes to create the image template; lsf-slave.hpc.ibmcloud will be powered on again after the creation completes.

  • Click Devices > Manage > Images to show the Image Templates.
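As a convenience, the ego.conf change at the start of this step can be made with a single sed command (a sketch; the path assumes the lsf-demo cluster name used in this guide):

# sed -i 's/^EGO_GETCONF=lim/EGO_GET_CONF=lim/' /opt/ibm/lsfsuite/lsf/conf/ego/lsf-demo/kernel/ego.conf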

Step 3.2: Create a provisioning script for virtual compute host

IBM Cloud can download the provisioning script from an HTTPS server and run it automatically when the virtual server is started and all its virtual devices are connected. The provisioning script is required to configure the LSF worker node to join the correct cluster and present correct shared resources. You can host your provisioning script on another virtual server in IBM Cloud and configure it to use only the IBM Cloud internal network for increased security. You can also host your script on any HTTPS server available on the network.  

In this guide, simply place the provisioning script in the default HTTP server document root, /var/www/html, on the deployer host so that the script can be accessed at http://<lsf_master_private_ip_addr>/provisioning.sh. A minimal setup is sketched below.
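One minimal way to do that on the deployer host (a sketch, assuming the stock RHEL httpd package is acceptable in your environment):

# yum -y install httpd
# systemctl enable httpd
# systemctl start httpd
# cp provisioning.sh /var/www/html/provisioning.sh
# chmod 644 /var/www/html/provisioning.sh
# curl -s http://<lsf_master_private_ip_addr>/provisioning.sh | head -3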

An example provisioning script is shown below. It reads from the IBM Cloud getUserMetadata API to get the configuration variables set by the resource connector during provisioning.

#!/bin/bash
logfile=/var/log/postprovisionscripts.log
echo START `date '+%Y-%m-%d %H:%M:%S'` >> $logfile
  
  
#Do not remove this part of the script to support passing LSF user data to VM run time environment
STARTTIME=`date +%s`
NOWTIME=$STARTTIME
TIMEOUT=60
URL="https://api.service.softlayer.com/rest/v3/SoftLayer_Resource_Metadata/getUserMetadata.txt"
USERDATA=`curl -s $URL 2>>$logfile`
#
while [[ "$USERDATA" == [Nn]"o user data"* ]] && [[ `expr $NOWTIME - $STARTTIME` -lt $TIMEOUT ]]; do
    sleep 5
    NOWTIME=`date +%s`
    USERDATA=`curl -s $URL 2>>$logfile`
done
  
  
# check if we got user data eventually
if [[ "$USERDATA" != [Nn]"o user data"* ]]; then
    # user data is expected to be a semicolon-separated key=value list
    # like environment variables; split them into an array
    IFS=\; read -ra ARR <<<"$USERDATA"
    for VAR in ${ARR[@]}; do
    eval "export $VAR"
    done
else
    echo "USERDATA: $USERDATA" >>$logfile
    echo EXIT AT `date '+%Y-%m-%d %H:%M:%S'` >>$logfile
    exit 1
fi
echo "CURRENT ENVIRONMENT:" >>$logfile
env >> $logfile
  
  
#Set the correct path for LSF_TOP, where LSF is installed on the VM host
LSF_TOP=/opt/ibm/lsfsuite/lsf
LSF_CONF_FILE=$LSF_TOP/conf/lsf.conf
source $LSF_TOP/conf/profile.lsf
  
  
#Add softlayer boolean resource for slave host
sed -i '$ a LSF_LOCAL_RESOURCES=\"[resource softlayerhost]\" '  $LSF_CONF_FILE
  
  
###Disable ego in the slave host
#sed -i "s/LSF_ENABLE_EGO=Y/LSF_ENABLE_EGO=N/" $LSF_CONF_FILE
###if ego enabled need to create a soft link
ln -s /opt/ibm/lsfsuite/lsf/conf/ego/lsf-demo/kernel/ego.conf /etc/ego.conf
  
  
#Do not remove this part of the script to support rc_account resource for SoftLayer
#You can similarly set additional local resources if needed
if [ -n "${rc_account}" ]; then
    sed -i "s/\(LSF_LOCAL_RESOURCES=.*\)\"/\1 [resourcemap ${rc_account}*rc_account]\"/" $LSF_CONF_FILE
    echo "update LSF_LOCAL_RESOURCES lsf.conf successfully, add [resourcemap ${rc_account}*rc_account]" >> $logfile
fi
  
  
#If there is no DNS server to resolve host names and IPs between the master host and VMs,
#keep the following part and set the correct master LSF host name and IP address
master_host='lsf-master.hpc.ibmcloud'
master_host_ip='<lsf_master_private_ip_addr>'
echo ${master_host_ip} ${master_host} >> /etc/hosts
echo $master_host > $LSF_ENVDIR/hostregsetup
lsreghost -s $LSF_ENVDIR/hostregsetup
  
  
#Create a script to start LSF daemons as cron job
  
  
cat > $LSF_TOP/start_lsf.sh << "EOF"
#!/bin/sh
# This script checks and starts the LSF daemons; it is run as a cron job
logfile=/tmp/lsf_daemons_status
LIMSTOP="lim is stopped..."
LIMSTATUS=`/opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/lsf_daemons status 2>>$logfile | grep lim`
if [[ $LIMSTATUS = $LIMSTOP ]]; then
        nohup /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/lsf_daemons start <&- >&- 2>&- & disown
else
        echo ${LIMSTATUS} >>$logfile
fi
EOF
chmod +x $LSF_TOP/start_lsf.sh
  
  
crontab -l > /tmp/mycron
echo "* * * * * /opt/ibm/lsfsuite/lsf/start_lsf.sh" >> /tmp/mycron
crontab /tmp/mycron
rm -f /tmp/mycron
crontab -l >>  $logfile
  
  
echo END AT `date '+%Y-%m-%d %H:%M:%S'` >> $logfile

Step 3.3: Modify configuration files

  • Connect to lsf-master via SSH:
    # ssh root@lsf-master.hpc.ibmcloud

Step 3.3.1 Modify $LSF_ENVDIR/lsf.cluster.lsf-demo

  • Remove or comment out the following line:
    lsf-master  !        !        1    (mg)
  • Check that the following line exists:
    lsf-master.hpc.ibmcloud ! ! 1 (mg)
  • Add the following line between Begin Parameters and End Parameters; this allows dynamically provisioned hosts with any IP address to join the cluster:
    LSF_HOST_ADDR_RANGE=*.*.*.*

Step 3.3.2: Modify $LSF_ENVDIR/resource_connector/softlayer/conf/credentials

  • Log in and go to this page: https://cloud.ibm.com/iam/apikeys.
  • Click Classic infrastructure API key > Action > Details.
  • Make a note of the API user name and API key from the popup dialog and update the credentials file's softlayer_access_user_name and softlayer_secret_api_key values, respectively.
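After the update, the credentials file contains the two values along these lines (key names here follow this article; confirm the exact placeholder syntax in the file shipped with your installation):

softlayer_access_user_name=<your_api_user_name>
softlayer_secret_api_key=<your_api_key>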

Step 3.3.3: Modify $LSF_ENVDIR/resource_connector/softlayer/conf/softlayerprov_config.json

Set the correct credentials file path as the value of SOFTLAYER_CREDENTIAL_FILE. For example: "/opt/ibm/lsfsuite/lsf/conf/resource_connector/softlayer/conf/credentials".

Step 3.3.4: Modify $LSF_ENVDIR/resource_connector/softlayer/conf/softlayerprov_templates.json

  • Change the value of maxNumber to 10.
  • Change the value of ncpus to ["Numeric", "4"].
  • Change the value of mem to ["Numeric", "8192"].
  • Change the key "softlayercomp" to "softlayerhost".
  • Change the value of imageId to your image template name, "LSFSlaveHostImage".
  • Change the value of datacenter to the data center of the master host, such as "dal10".
  • Change the value of vlanNumber to the master host private vlan number:
    • Log in to IBM Cloud.
    • Click View resources > Devices.
    • Locate the master host "lsf-master.hpc.ibmcloud" and click the device name to show the Device Detail.
    • Locate Network details > Private interface > VLAN and click the VLAN name to show the Private VLAN Detail.
    • Locate the VLAN Number.
  • Change the value of privateNetworkOnlyFlag to false.
  • Change the value of postProvisionURL to the provisioning script url that is referenced in Step 3.2. For example: http://<lsf_master_private_ip_addr>/provisioning.sh

The softlayerprov_templates.json will look like the following:

{
    "templates": [
        {
            "templateId": "Template-1",
            "maxNumber": 10,
            "attributes": {
                "type": ["String", "X86_64"],
                "ncpus": ["Numeric", "4"],
                "ncores": ["Numeric", "1"],
                "mem": ["Numeric", "8192"],
                "softlayerhost": ["Boolean", "1"],
                "customattr": ["String", "somedata"]
            },
            "imageId": "LSFSlaveHostImage",
            "datacenter": "dal10",
            "vlanNumber": "VLAN_NUMBER",
            "useHourlyPricing": true,
            "localDiskFlag": false,
            "privateNetworkOnlyFlag": false,
            "dedicatedAccountHostOnlyFlag": false,
            "postProvisionURL": "http://<lsf_master_private_ip_addr>/provisioning.sh",
            "userData": "customattr=somedata"
        }
    ]
}
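After editing, it is worth confirming that the file is still valid JSON. Any stock Python on the host can do this (a generic sanity check, not an LSF tool):

# python -m json.tool $LSF_ENVDIR/resource_connector/softlayer/conf/softlayerprov_templates.json

The command prints the parsed document on success, or a parse error with the offending position on failure.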

Step 3.3.5: Modify $LSF_ENVDIR/lsbatch/lsf-demo/configdir/lsb.modules

Add or uncomment the following line:

schmod_demand  ()   ()

Step 3.3.6: Modify $LSF_ENVDIR/lsbatch/lsf-demo/configdir/lsb.queues

Add the following lines in the normal queue section (the queue with QUEUE_NAME = normal), after the #USERS row:

RC_HOSTS        = softlayerhost
RC_ACCOUNT  = lsf-demo-dynamic-host


Step 3.3.7: Modify $LSF_ENVDIR/lsf.conf

Add the following lines to enable the LSF resource connector feature. Among other things, these settings flag dynamically provisioned hosts with the softlayerhost resource and relinquish a dynamic host after it has been idle for 10 minutes (LSB_RC_EXTERNAL_HOST_IDLE_TIME):

LSB_RC_EXTERNAL_HOST_FLAG="softlayerhost"
LSF_REG_FLOAT_HOSTS=Y
LSF_DYNAMIC_HOST_WAIT_TIME=60
LSF_DYNAMIC_HOST_TIMEOUT=10m
LSB_RC_EXTERNAL_HOST_IDLE_TIME=10

Step 3.3.8: Modify $LSF_ENVDIR/lsf.shared

Add or uncomment the following line:

softlayerhost Boolean ()       ()       (instances from SoftLayer)

Step 3.3.9: Disable the master host as a compute node

Run the following command to disable lsf-master.hpc.ibmcloud as a compute node:

# badmin hclose lsf-master.hpc.ibmcloud
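If you later want the master host to accept jobs again, it can be reopened with the counterpart command:

# badmin hopen lsf-master.hpc.ibmcloud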

Step 3.4: Restart the LSF daemons on the master host for the changes to take effect

# lsadmin limrestart
# lsadmin resrestart
# badmin mbdrestart
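After the restarts, a quick sanity check with the same commands used in Step 2.6 confirms that the daemons came back up cleanly:

# lsid
# bhosts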

 


Step 3.5: Submit jobs

The following steps validate that the resource connector configuration and autoscaling are working as expected.

  • Connect to master host as lsfadmin:
    # ssh lsfadmin@lsf-master.hpc.ibmcloud 
  • Submit jobs that require instances launched from IBM Cloud as the resource provider:
    • In this demo, lsf-slave.hpc.ibmcloud is installed as a static compute node that can run four jobs at the same time because it has four CPUs, so we need to "fill" it before the resource connector will request a new dynamic worker node:
      $ bsub "sleep 2000" 
      $ bsub "sleep 2000"
      $ bsub "sleep 2000"
      $ bsub "sleep 2000"
      $ bsub "sleep 10"
    • Alternatively, we can simply disable lsf-slave.hpc.ibmcloud (with badmin hclose, as in Step 3.3.9); then any submitted job will run on a newly added dynamic compute node.
  • Check the job status:
    $ bjobs -a
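Once the pending jobs trigger provisioning, the new virtual server typically joins the cluster within several minutes. Standard LSF commands are enough to watch the progress:

$ bhosts     # the dynamic host appears with status ok once it joins
$ lshosts    # shows the new host and its resources

Jobs then transition from PEND to RUN on the newly added host.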

Conclusion

This recipe allows customers looking for ways to move their on-premises IBM Spectrum LSF-based deployments to IBM Cloud to take advantage of new hardware configurations and manage their cost by only spinning up additional machines when needed—a truly consumption-based approach.

The steps here specifically describe LSF Suite for Enterprise 10.2.0.8. Some of the names used (cluster name, host names, image template name) can be changed to suit your needs; if you choose different names, make sure you update all the dependent steps to match.

Learn more by checking out the IBM Spectrum LSF Suite for Enterprise V10.2 documentation.

Xiao Peng Chen
Developer - HPC on IBM Cloud

Suraksha Vidyarthi
Product Manager, IBM Cloud