High Performance Computing

 View Only

Dask in IBM® Spectrum Conductor 2.5.0

By Jenna Lau Caruso posted Fri February 05, 2021 09:39 AM


In the release of IBM Spectrum Conductor 2.5.0, we revamp the instance group by extending it to run new components with new types of workloads. The debut component for this new-and-improved instance group is Dask. The component enables Dask workload to run in the cluster alongside existing Spark workload, and includes support for:

  • Authentication
  • Impersonation
  • SSL
  • CPU or GPU workload
  • Fair share scheduling between clients
  • Smart reclaim for Dask workers
  • Jupyter notebook integration

Preparing to run Dask applications on your cluster begins in the conda environment. The Dask instance group component is built on top of familiar open source packages in python, including the dask, distributed, and dask-jobqueue packages. GPU workload is supported through the dask-cuda and dask-cudf packages that are developed by RAPIDS.

Sample environments are packaged in IBM Spectrum Conductor 2.5.0 to get you going quickly. Select one of the samples for your workload type to create a conda environment with the packages required to run a Dask application on the cluster. You can install this sample on top of a conda environment that contains the Jupyter notebook packages if you intend to use Dask in a notebook.

Select the sample conda environment for Dask from the Add or Update Conda Environment dialog.

Now you are ready to create an instance group with Dask, optionally including the Jupyter notebook, Spark, or both according to your workload requirements. Here you can configure the detailed specifications for Dask schedulers and workers within the instance group. By default, the Dask component is configured to run adaptive CPU workers with a 1:1 slot-to-worker ratio.

Select a consumer and resource group for the Dask manager service, and the Dask workers. All instance groups that are deployed with the Dask component include a Dask manager service instance, which acts as a proxy between Dask schedulers and the IBM Spectrum Conductor cluster. Workers are started on-demand as Dask workload is submitted.

You are prompted to select a resource group for Dask schedulers. While Dask schedulers run in the process space of the application, the component plug-in and scheduler configuration file is deployed to the hosts where the schedulers are run. If you run applications from a notebook, select the same resource group that is used for your notebook here in the following example screen capture:

Select Dask consumers and resource groups for the instance group.

Now, the fun part. Let’s start a Dask application from the Jupyter notebook!

Open the Python kernel in your Jupyter notebook, or the Spark Python kernel if you plan to work with both Dask and Spark. Initializing a Dask scheduler and connecting to the IBM Spectrum Conductor cluster is simplified, create a ConductorCluster, connect the client, and you’re good to go.

Run Dask on the Spectrum Conductor cluster from the Jupyter notebook

You might notice that no parameters are required when the ConductorCluster is created from a Jupyter notebook in the same instance group. By default, the configuration from the Dask component in the instance group is used, and authentication and SSL are handled automatically through the packaged notebook integration.

In some cases, an application might need to override the scheduler or worker specifications. For example, whether the application requires a fixed number of workers, or a higher worker memory limit. This specification can be achieved by passing more arguments when the ConductorCluster is created. For details on which arguments can be overridden by the application, run the help(ConductorCluster) method.

After you run the previous cell, you now have a Dask scheduler that runs in your notebook kernel, which is registered with the Dask manager service in your instance group. The scheduler in the example (see the screen capture) uses the default configuration with adaptive workers, so initially the scheduler has zero workers and remains this way until a cell that requires Dask workers is run.

The specified client is imported directly from the distributed package in your conda environment and can be used to interact with the Dask scheduler and workers as usual. The standard Dask dashboard is also available, giving you a view into task status, workers, and more.

Now you are ready to start coding!

Give it a try and tell us what you think by downloading IBM Spectrum Conductor 2.5.0 on Passport Advantage or the evaluation version! To get the most out of Dask, apply the sc-2.5-build600176 interim fix, which contains several enhancements on top of the base integration. We hope you are as excited as we are about this new release!

Log in to this community page to comment on this blog post. We look forward to hearing from you on the new features, and what you would like to see in future releases.