IBM Spectrum Computing Group


Optimizing the speed of deployment on cloud with LSF’s support for Amazon EC2 Fleet

By Martin Gao posted Fri August 12, 2022 03:13 PM

  

Many enterprises are leveraging the agility, cost advantages, and convenience offered by the public cloud services model. High performance computing (HPC) is no exception, with organizations looking to provide cost-effective resources, with the ability to scale capacity to meet workload demands. Moreover, the use of cloud for HPC is growing rapidly: according to the Worldwide HPC Cloud Forecast, 2020-2026 (June 2022) from Hyperion Research (Hyperion Research #HR12.0035.06.10.2022 at https://hyperionresearch.com/wp-content/uploads/2022/06/Hyperion-Research-HPC-Cloud-Forecast-June-2022.pdf), “the market for HPC spending in the cloud will outpace the on-premises market (17.6% compared to 6.9% five-year CAGR, respectively), yet will remain a smaller portion of the overall HPC market over the next five years.”

Keeping pace with the growth of cloud use in HPC, IBM Spectrum LSF provides dynamic hybrid HPC cloud support, which enables organizations to intelligently use cloud resources based on workload demand, supporting all major cloud providers, including Amazon Web Services (AWS). LSF continues to improve its dynamic hybrid cloud support to deliver the performance and scale you need, while providing the tools to get the most from your cloud spend.

IBM Spectrum LSF 10.1 Fix Pack 13, with an LSF fix to support the Amazon EC2 Fleet API (see the download locations later on), enables organizations to optimize their LSF cloud bursting configuration and to spin up large numbers of cloud resources of multiple types more rapidly, across multiple availability zones. This Amazon EC2 Fleet support for LSF offers faster access to available cluster resources in the cloud, and greater resiliency from multiple zones.

Let’s look more closely at the capabilities provided by Amazon EC2 Fleet and how to configure support for it in LSF.

What is Amazon EC2 Fleet?

Firstly, why use Amazon EC2 Fleet? An EC2 Fleet contains the configuration information to launch a fleet or group of instances. From a single EC2 Fleet API call, a fleet can launch multiple instance types across multiple availability zones, using the on-demand instance, reserved instance, and spot instance purchasing options together. It’s versatile and flexible, and now can also be integrated with your LSF environment.
For more information about Amazon EC2 Fleet, refer to the AWS site: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-fleet.html

Extending the LSF resource connector with EC2 Fleet

How does EC2 Fleet work together with LSF to improve your deployment? The LSF resource connector already provides support for spot fleet and on-demand instances, with customers creating 100,000 or more instances. Although the LSF resource connector is useful for certain scenarios on its own, extending your environment with EC2 Fleet helps you avoid the following limitations:

  • Management of different APIs.
  • Template limitations (for example, different methods cannot be used in the same template).
  • Support for only a single availability zone for all on-demand instances.
  • Significant time needed to explicitly handle and process capacity errors.

The EC2 Fleet API helps avoid or alleviate these limitations. Extending the LSF resource connector with EC2 Fleet provides these additional benefits:

  • Combines on-demand, spot, and even reserved instances together in one configuration file in an LSF resource connector fleet template.
  • Splits allocation between on-demand and spot instances, using the onDemandTargetCapacityRatio setting in the LSF fleet template.
  • Supports multiple availability zones.
  • Supports instance weighting, which defines the capacity units that each instance type contributes to your application workload.
  • Supports instance priority, which defines the priority of instances based on instance type (vmtypes).
  • Handles capacity errors through an EC2 Fleet API call.

Enabling Amazon EC2 Fleet support with LSF

1) Prerequisites to use the Amazon EC2 Fleet API:

  1. Download and install LSF 10.1 Fix Pack 13. For detailed installation steps, see IBM Documentation.
  2. Download and install the LSF fix for supporting Amazon EC2 Fleet.
  3. Create a launch template in AWS, which includes information about the instances to launch, such as AMI, instance type, network, availability zone, and so on.
  4. Ensure you have the AWSServiceRoleForEC2Fleet role, which grants the EC2 Fleet permission to request, launch, terminate, and tag instances.
  5. Ensure that your IAM users have the permissions required to use EC2 Fleet.

For more information about how to create AWS roles and grant permissions, refer to the Amazon EC2 Fleet prerequisites.

2) Configure AWS for LSF resource connector:

  1. Configure AWS for the LSF resource connector according to the guidance in IBM Documentation.
  2. Create an EC2 Fleet configuration file and save it under the $LSF_TOP/conf/resource_connector/aws/conf/ directory.
  3. Create a new LSF template with new EC2 Fleet parameters.

Example usage: mixing on-demand and spot instances

Now that you have EC2 Fleet support enabled in LSF, let’s look at an example of how to use it. Consider the scenario of leveraging a mix of on-demand and spot instances for your workload.

First, create an EC2 Fleet JSON configuration file. Here is an example (the text after each >> marker is an explanatory note and is not part of the file):

{
   "LaunchTemplateConfigs":[
      {
         "LaunchTemplateSpecification":{
            "LaunchTemplateId": "lt-0c40f9718d18e61ba",  >> The launch template must already exist in AWS
            "Version":"1"
         },
         "Overrides":[    >> Overrides launch template parameters, such as instance type and subnet
            {
               "InstanceType":"c3.large",
               "SubnetId":"subnet-0fe69d290ae026155",
               "WeightedCapacity":1,   >> Define based on the table below
               "Priority": 3        >> Priority of this instance type; a lower number means higher priority
            },
            {
               "InstanceType":"c3.2xlarge",
               "SubnetId":"subnet-0dfee843e19bfeb52",
               "WeightedCapacity":2,
               "Priority": 2
            }
         ]
      }
   ],
   "TargetCapacitySpecification":{
      "TotalTargetCapacity": $LSF_TOTAL_TARGET_CAPACITY,
      "OnDemandTargetCapacity": $LSF_ONDEMAND_TARGET_CAPACITY,
      "SpotTargetCapacity": $LSF_SPOT_TARGET_CAPACITY,
      "DefaultTargetCapacityType": "spot"   >> Default type of the fleet pool, either "spot" or "on-demand"
   },
   "SpotOptions": {
      "AllocationStrategy": "capacity-optimized-prioritized"   >> Allocation strategy; the default is lowest-price
   },
   "Type":"instant"      >> The type is either "instant" or "request"
}
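
The three $LSF_* placeholders make this file a template rather than valid JSON: LSF fills them in before it calls AWS. As a rough illustration of that substitution step (not LSF's actual implementation), the following Python sketch applies the standard library's string.Template to a trimmed copy of the configuration. The function name render_fleet_config and the trimmed template are assumptions made for this example.

```python
import json
from string import Template

# Trimmed copy of the fleet configuration, without the ">>" annotations.
FLEET_TEMPLATE = """
{
  "TargetCapacitySpecification": {
    "TotalTargetCapacity": $LSF_TOTAL_TARGET_CAPACITY,
    "OnDemandTargetCapacity": $LSF_ONDEMAND_TARGET_CAPACITY,
    "SpotTargetCapacity": $LSF_SPOT_TARGET_CAPACITY,
    "DefaultTargetCapacityType": "spot"
  },
  "Type": "instant"
}
"""


def render_fleet_config(template_text, total, on_demand, spot):
    """Replace the three $LSF_* placeholders and parse the result as JSON."""
    rendered = Template(template_text).substitute(
        LSF_TOTAL_TARGET_CAPACITY=total,
        LSF_ONDEMAND_TARGET_CAPACITY=on_demand,
        LSF_SPOT_TARGET_CAPACITY=spot,
    )
    # json.loads raises ValueError if the rendered text is not valid JSON.
    return json.loads(rendered)


config = render_fleet_config(FLEET_TEMPLATE, total=100, on_demand=50, spot=50)
print(config["TargetCapacitySpecification"])
```

Parsing the rendered text is a cheap way to catch a stray annotation or a missing comma before the file is used.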

In the EC2 Fleet JSON configuration file, the WeightedCapacity parameter must be defined. Its value should equal the number of slots of the corresponding vmtype in LSF, which depends on the EGO_DEFINE_NCPUS parameter in your lsf.conf file:

vmtype        cores   vcpus   WeightedCapacity            WeightedCapacity
                              (EGO_DEFINE_NCPUS=cores)    (EGO_DEFINE_NCPUS=threads)
c3.large      1       2       1                           2
c3.2xlarge    2       4       2                           4
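
The table boils down to a simple rule: use the core count when EGO_DEFINE_NCPUS=cores and the vCPU count when EGO_DEFINE_NCPUS=threads. Here is a minimal Python sketch of that rule, using only the two rows from the table (the second row is the c3.2xlarge type used in the fleet configuration). The VM_TYPES dictionary and the weighted_capacity function are illustrative names, not part of LSF, and the sketch covers only the two settings shown in the table.

```python
# (cores, vcpus) per vmtype, taken from the table above.
VM_TYPES = {
    "c3.large": (1, 2),
    "c3.2xlarge": (2, 4),
}


def weighted_capacity(vmtype, ego_define_ncpus):
    """Return the WeightedCapacity (number of LSF slots) for a vmtype."""
    cores, vcpus = VM_TYPES[vmtype]
    if ego_define_ncpus == "cores":
        return cores
    if ego_define_ncpus == "threads":
        return vcpus
    raise ValueError(f"setting not covered by this sketch: {ego_define_ncpus}")


for vmtype in VM_TYPES:
    print(vmtype, weighted_capacity(vmtype, "cores"), weighted_capacity(vmtype, "threads"))
```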

 
Here is an example of an LSF resource connector template:

{
    "templates": [
        {
            "templateId": "template-1",
            "maxNumber": 100,
            "attributes": {
                "type": ["String", "X86_64"],
                "ncores": ["Numeric", "1"],
                "ncpus": ["Numeric", "2"],
                "mem": ["Numeric", "512"],
                "awshost": ["Boolean", "1"]
            },
            "priority": "121",  >> LSF template priority; a larger number means higher priority
            "onDemandTargetCapacityRatio":"0.5",   >> Ratio that splits the target capacity between on-demand and spot instances
            "ec2FleetConfig": "simple_fleet.json"  >> The fleet JSON file from the example above
        }
    ]
}

The maximum capacity is slot based, so this LSF resource connector template provides 200 slots in total (maxNumber × ncpus = 100 × 2 = 200).

To better illustrate this scenario, let’s review a use case: submit a workload request of 100 slots, with workflow as follows:

  1. LSF calculates $LSF_TOTAL_TARGET_CAPACITY=100, $LSF_ONDEMAND_TARGET_CAPACITY=50, and $LSF_SPOT_TARGET_CAPACITY=50, and passes the values to the JSON file, replacing the three variables with real numbers.
  2. The LSF resource connector initiates a fleet API call with the resulting JSON file.
  3. AWS determines which instance types to launch for the on-demand and spot capacity, based on attributes specified in the fleet configuration file, such as WeightedCapacity, Priority, and AllocationStrategy.
  4. The LSF resource connector tracks this fleet request with a JSON file, and waits until the fleet request is fulfilled.
  5. Fulfilled instances are recorded in a JSON file; they join the LSF cluster as dynamic hosts and wait for the LSF scheduler to dispatch jobs. When all instances have joined the cluster, the job is dispatched.
  6. After the job is complete, LSF relinquishes hosts based on IDLE or TTL timeout settings.
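
Step 1 is plain arithmetic. The Python sketch below reproduces it for the example template: it checks the request against the template's slot capacity (maxNumber × ncpus) and splits it using onDemandTargetCapacityRatio. How LSF rounds a fractional split, and what it does with a request that exceeds the template's capacity, is not described here, so the rounding rule and the rejection in this sketch are assumptions, as is the function name split_capacity.

```python
def split_capacity(requested_slots, max_number, ncpus, on_demand_ratio):
    """Return the (total, on-demand, spot) target capacities for a slot request."""
    max_slots = max_number * ncpus  # capacity is slot based
    if requested_slots > max_slots:
        # Assumed behavior for this sketch only.
        raise ValueError(f"request exceeds template capacity of {max_slots} slots")
    on_demand = round(requested_slots * on_demand_ratio)  # assumed rounding rule
    spot = requested_slots - on_demand
    return requested_slots, on_demand, spot


total, on_demand, spot = split_capacity(100, max_number=100, ncpus=2, on_demand_ratio=0.5)
print(total, on_demand, spot)  # 100 50 50
```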

In the above use case scenario, the LSF cluster may get either 50 “c3.large” or 25 “c3.2xlarge” instances for on-demand instances (based on lowest on-demand unit price), and 25 “c3.2xlarge” instances for spot instances (based on allocation strategy, priority, and WeightedCapacity). If the fleet API call encounters capacity errors for any instance type, it automatically tries another available instance type to bypass the error and fulfill the fleet request.
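
These instance counts follow from dividing each target capacity by the WeightedCapacity of the chosen instance type. Here is a small Python check of the numbers, assuming that a pool is filled with a single instance type and that the count is rounded up to whole instances (instances_needed is a name made up for this example):

```python
import math

# WeightedCapacity values from the fleet configuration file.
WEIGHTED_CAPACITY = {"c3.large": 1, "c3.2xlarge": 2}


def instances_needed(target_capacity, instance_type):
    """Number of instances of one type needed to reach a target capacity."""
    return math.ceil(target_capacity / WEIGHTED_CAPACITY[instance_type])


print(instances_needed(50, "c3.large"))    # 50
print(instances_needed(50, "c3.2xlarge"))  # 25
```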

Optimization with Amazon EC2 Fleet

Before LSF supported EC2 Fleet, making the same use case work in a similar way required multiple templates (with EC2 Fleet, a single template can replace them):

{
    "templates": [
        {
            "templateId": "aws-ondemand-template1",
            "maxNumber": 50,
            "attributes": {
                "type": ["String", "X86_64"],
                "ncores": ["Numeric", "1"],
                "ncpus": ["Numeric", "1"],
                "mem": ["Numeric", "512"],
                "awshost": ["Boolean", "1"],
                "pricing": ["String", "ondemand"]
            },
            "imageId": "ami-040d1258d11f3f3cd",
            "subnetId": "subnet-0fe69d290ae026155",
            "vmType": "c3.large",
            "keyName": "ib19b07",
            "securityGroupIds": ["sg-08f1a36be62fe02a4"],
            "priority": "10",
            "userData": "pricing=ondemand"
        },
        {
            "templateId": "aws-ondemand-template2",
            "maxNumber": 25,
            "attributes": {
                "type": ["String", "X86_64"],
                "ncores": ["Numeric", "1"],
                "ncpus": ["Numeric", "2"],
                "mem": ["Numeric", "512"],
                "awshost": ["Boolean", "1"],
                "pricing": ["String", "ondemand"]
            },
            "imageId": "ami-040d1258d11f3f3cd",
            "subnetId": "subnet-0fe69d290ae026155",
            "vmType": "c3.2xlarge",
            "keyName": "ib19b07",
            "securityGroupIds": ["sg-08f1a36be62fe02a4"],
            "priority": "40",
            "userData": "pricing=ondemand"
        },
        {
            "templateId": "aws-spot-template1",
            "maxNumber": 50,
            "attributes": {
                "type": ["String", "X86_64"],
                "ncores": ["Numeric", "1"],
                "ncpus": ["Numeric", "1"],
                "mem": ["Numeric", "512"],
                "awshost": ["Boolean", "1"],
                "pricing": ["String", "spot"]
            },
            "imageId": "ami-040d1258d11f3f3cd",
            "subnetId": "subnet-0dfee843e19bfeb52",
            "keyName": "ib19b07",
            "vmType": "c3.large",
            "fleetRole": "arn:aws:iam::700071821657:role/EC2-Spot-Fleet-role",
            "securityGroupIds": ["sg-08f1a36be62fe02a4"],
            "spotPrice": "0.3",
            "allocationStrategy":"lowestPrice",
            "priority": "20",
            "userData": "pricing=spot"
        },
        {
            "templateId": "aws-spot-template2",
            "maxNumber": 25,
            "attributes": {
                "type": ["String", "X86_64"],
                "ncores": ["Numeric", "1"],
                "ncpus": ["Numeric", "2"],
                "mem": ["Numeric", "512"],
                "awshost": ["Boolean", "1"],
                "pricing": ["String", "spot"]
            },
            "imageId": "ami-040d1258d11f3f3cd",
            "subnetId": "subnet-0dfee843e19bfeb52",
            "keyName": "ib19b07",
            "vmType": "c3.2xlarge",
            "fleetRole": "arn:aws:iam::700071821657:role/EC2-Spot-Fleet-role",
            "securityGroupIds": ["sg-08f1a36be62fe02a4"],
            "spotPrice": "0.8",
            "allocationStrategy":"lowestPrice",
            "priority": "30",
            "userData": "pricing=spot"
        }
    ]
}

For the same use case of a workload with 100 slots, this old configuration (that is, without LSF supporting EC2 Fleet) progresses as follows:

  1. The LSF scheduler sends a request for 25 “c3.2xlarge” on-demand instances and 25 “c3.2xlarge” spot instances.
  2. The LSF resource connector keeps tracking the requests for fulfillment.
  3. After all instances are fulfilled, the job will be started.
  4. After the job is done, LSF relinquishes the instances based on IDLE or TTL timeout settings.
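
The request in step 1 is what you get if the templates are filled in descending priority order until the 100 slots are covered. The following Python sketch reproduces that selection for the four templates above. The greedy rule is an assumption that happens to match the numbers in this example; it is not a description of the LSF scheduler's actual algorithm, and plan_requests is a hypothetical name.

```python
import math

# (templateId, maxNumber, ncpus, priority) from the four templates above.
TEMPLATES = [
    ("aws-ondemand-template1", 50, 1, 10),
    ("aws-ondemand-template2", 25, 2, 40),
    ("aws-spot-template1", 50, 1, 20),
    ("aws-spot-template2", 25, 2, 30),
]


def plan_requests(slots_needed):
    """Greedily request instances from the highest-priority templates first."""
    plan = {}
    by_priority = sorted(TEMPLATES, key=lambda t: t[3], reverse=True)
    for template_id, max_number, ncpus, _priority in by_priority:
        if slots_needed <= 0:
            break
        # Take as many instances as needed, capped by the template's maxNumber.
        count = min(max_number, math.ceil(slots_needed / ncpus))
        plan[template_id] = count
        slots_needed -= count * ncpus
    return plan


print(plan_requests(100))
```

For 100 slots, the sketch selects 25 instances from aws-ondemand-template2 (priority 40) and 25 from aws-spot-template2 (priority 30), both of type c3.2xlarge, which matches step 1.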

If there are capacity errors, LSF must wait for a couple of timeout durations before it can use the next available template to replace the unfulfilled instances. In this configuration, an error from the AWS side has to be handled on the LSF side; with EC2 Fleet, the same error is handled inside the AWS fleet request.

Clearly, the same use case scenario run using Amazon EC2 Fleet with the LSF resource connector is advantageous.

Try it today and leave us your feedback

Give it a try! Optimize your LSF cluster and deployment on cloud with LSF’s support for Amazon EC2 Fleet. It’s efficient and makes an impact on your productivity!
As always, we value any feedback. Share your experience with this new feature and let us know if there’s anything we can improve in a future release.
