High Performance Computing


Looking for actionable insight into your cluster usage?

By Tina Langridge posted Fri February 19, 2021 10:33 AM

  

Oh, look … another visualization that aggregates a single resource usage metric over time. Been there, done that. That’s not what this blog is about. This blog is about providing a solution to the following questions:

  • Do you want more insight?
  • Do you want to know how many Spark applications failed or are waiting in the last hour?
  • For failed or waiting jobs, are these jobs isolated to a particular resource group, instance group, or user?
  • Are the resources of the cluster over-allocated, or unknowingly under-allocated?

    IBM® Spectrum Conductor 2.5.0 introduces a new Elasticsearch index that contains Spark application resource usage information, combined with details on the resource plan and Spark priority information. With this extra information, you can learn valuable things about your cluster, such as why slot allocations shift across Spark applications when new ones are submitted, or when priorities or resource plans are changed.

    This blog explores the new Elasticsearch index through a dashboard and visualizations created with Kibana, available as a sample download. You can integrate Kibana into your IBM Spectrum Conductor cluster for data exploration. The sample also includes the IBM Spectrum Conductor dashboard explored in this blog, providing the same visualizations fed with real-time Spark application resource usage metrics directly from your IBM Spectrum Conductor cluster.

    Introductions

    The IBM Spectrum Conductor dashboard is created with Kibana, which is fed by the Spark application resource usage metrics from the IBM Spectrum Conductor cluster and stored in the new Elasticsearch index.

    The dashboard consists of a mixture of visualizations. All visualizations refer to a specified time range, which is initially set to the last 4 hours.

    The upper row contains a mixture of metric and pie charts to provide quick stats that include the following details:

    • Total Spark applications – Total Spark applications that are submitted.
    • FINISHED – Total Spark applications that completed successfully.
    • KILLED, FAILED, RECLAIMED – Total Spark applications that completed unsuccessfully with the respective state.
    • Total Spark applications for top 5 instance groups – Total Spark applications that are submitted for the top five instance groups, ordered by descending count.
    • CPU slots allocated for a Spark application – Median, average, and max CPU slots allocated for a Spark application in a 30-second interval.
    • Total CPU slots allocated for Spark applications for top 5 instance groups – Total CPU slots that are allocated for Spark applications for the top five instance groups, ordered by descending total CPU slots allocated.
    • CPU memory used by a Spark application – Median, average, and max CPU memory used by a Spark application in a 30-second interval.
    • Total CPU memory used by Spark applications for top 5 instance groups – Total CPU memory that is used by Spark applications for the top five instance groups, ordered by descending total CPU memory used.
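
    Behind each of these "top 5" charts sits an Elasticsearch terms aggregation over the new index (the index pattern is introduced later in this blog). The following is only a hypothetical sketch: the timestamp, instance group, and application ID field names are placeholders, so check the index mapping in your cluster for the actual names.

```python
# Hypothetical sketch: top 5 instance groups by distinct Spark applications
# in the last 4 hours. "timestamp", "instancegroupname", and "applicationid"
# are placeholder field names.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://your-conductor-host:9200")  # add auth/TLS options for your cluster

resp = es.search(
    index="ibm-cws-reporting-spark-application-metric-*",
    body={
        "size": 0,
        "query": {"range": {"timestamp": {"gte": "now-4h"}}},  # same 4-hour window as the dashboard
        "aggs": {
            "top_instance_groups": {
                "terms": {"field": "instancegroupname", "size": 5},
                "aggs": {
                    # each application is sampled every 30 seconds, so count
                    # distinct application IDs rather than raw documents
                    "applications": {"cardinality": {"field": "applicationid"}},
                },
            }
        },
    },
)
for bucket in resp["aggregations"]["top_instance_groups"]["buckets"]:
    print(bucket["key"], bucket["applications"]["value"])
```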

    The left section consists of stacked, vertical bar charts over time that include the following details:

    • Total Spark applications by state – Total Spark applications in a particular state.
    • Total Spark applications by state for top 5 instance groups – Total Spark applications by state, grouped by the top five instance groups, ordered by descending average CPU slots allocated.
    • Total Spark applications by state for top 5 users – Total Spark applications by state, grouped by the top five users, ordered by descending average CPU slots allocated.

    The right section consists of heat maps over time that include the following details:

    • Total CPU memory used by Spark applications – Total CPU memory that is used by Spark applications.
    • Total CPU slots allocated for Spark applications – Total CPU slots that are allocated for Spark applications.
    • Total CPU slots allocated for Spark applications for top 5 instance groups in top 5 executor resource groups – Total CPU slots that are allocated for Spark applications that are grouped by the top five instance groups in the top five executor resource groups, ordered by descending total CPU slots allocated.
    • Total CPU slots allocated for Spark applications for top 5 users – Total CPU slots that are allocated for Spark applications that are grouped by the top five users, ordered by descending total CPU slots allocated.

    The final row includes the following details:

    • Spark application state for 40 Spark applications – Visualization that displays the state of 40 Spark applications over time, ordered by descending submit time.
    • Resource Usage for 40 Spark applications – Table that displays the average resource usage metrics for 40 Spark applications in a 30-second interval, ordered by descending submit time.

    Now, let's dive in

    From the dashboard, we can see 105 Spark applications submitted (or previously submitted and active) in the last 4 hours. It's clear that no Spark applications finished unsuccessfully. Good.


    We can see that a few jobs are in the waiting state, and only for less than 1 minute. No concerns here.


    But, wait a minute. The average CPU memory used by a Spark application is more than double the median and much lower than the max. What does this usage mean (no pun intended)? The median is the middle value: it separates the upper half of the CPU memory used from the lower half. The average is the arithmetic mean of the CPU memory used and is strongly influenced by outliers. For a fully utilized cluster, we would expect the median, average, and max CPU memory used to be similarly large.
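
    A quick, self-contained example with made-up numbers (not taken from this cluster) shows why a few memory-hungry samples can pull the average well above the median:

```python
# Made-up CPU memory samples (MB) for 10 applications in one 30-second interval.
from statistics import mean, median

samples = [200, 220, 250, 260, 300, 320, 350, 400, 9000, 11000]

print(median(samples))  # 310.0  -> the "typical" application
print(mean(samples))    # 2230.0 -> pulled up by the two outliers
print(max(samples))     # 11000  -> the outlier itself
```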


    The Total CPU memory used by Spark applications heat map provides visual clues that CPU memory is under-utilized at several periods in the 4-hour time interval. We can see periodic spikes in CPU memory usage, which can be highlighted by hovering over the largest bucket in the histogram (21,000-28,000). With 105 Spark applications and these CPU memory statistics, the cluster can easily handle the workload.


    Out of curiosity, we can isolate the Spark application or group of Spark applications causing these CPU memory spikes by adding a filter for intervals where the CPU memory used by a Spark application is greater than 11,000 in a 30-second interval.
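
    Outside of Kibana, the same filter can be expressed as an Elasticsearch range query. This is only a sketch: "memoryused_cpu", "timestamp", "applicationid", and "user" are placeholder field names, so substitute the actual fields from the index mapping.

```python
# Hypothetical sketch of the Kibana filter as a query: samples where the
# CPU memory used by a Spark application in a 30-second interval exceeds 11,000.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://your-conductor-host:9200")  # add auth/TLS options for your cluster

resp = es.search(
    index="ibm-cws-reporting-spark-application-metric-*",
    body={
        "size": 50,
        "_source": ["applicationid", "user", "timestamp"],  # placeholder field names
        "query": {
            "bool": {
                "filter": [
                    {"range": {"timestamp": {"gte": "now-4h"}}},
                    {"range": {"memoryused_cpu": {"gt": 11000}}},
                ]
            }
        },
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```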


    From the filter, we can narrow the two spikes down to the Spark application app-20210205100332-0003-faab4aae-3ada-48d0-820d-9cadf9d785d8, submitted by user5. The CPU memory spikes line up nicely with an increase in CPU slots allocated for the Spark application in the following visualization:


    Next, remove the CPU memory used filter and add a filter on user5.


    Here, we can see that user5 has three long-running Spark applications that display infrequent, memory-intensive blips, likely related to a stage with a high number of tasks. The allocated CPU slots scale up and down as needed, and the jobs never go into the waiting state.

    Show me more!

    Wouldn't it be great if we could easily plot the number of running CPU executors to see whether the CPU memory spikes line up with an increase in running CPU executors for the Spark application?

    You can. With just a few clicks.

    Clone the Total CPU slots allocated for Spark applications for top 5 users visualization from the dashboard. Modify the aggregation field to ExecutorsRunning_CPU. Optionally, add a filter for the Spark application ID, and then click Update > Save and return.
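
    If you prefer to pull the same numbers programmatically, a rough equivalent of the cloned visualization is a date histogram with an average of ExecutorsRunning_CPU, filtered to one application. Only ExecutorsRunning_CPU comes from the product; the other field names in this sketch are placeholders.

```python
# Hypothetical sketch: running CPU executors over time for one Spark application,
# bucketed into the same 30-second intervals that the data loader uses.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://your-conductor-host:9200")  # add auth/TLS options for your cluster

resp = es.search(
    index="ibm-cws-reporting-spark-application-metric-*",
    body={
        "size": 0,
        "query": {"term": {"applicationid": "<your Spark application ID>"}},  # placeholder field name
        "aggs": {
            "over_time": {
                "date_histogram": {"field": "timestamp", "fixed_interval": "30s"},
                "aggs": {"executors": {"avg": {"field": "ExecutorsRunning_CPU"}}},
            }
        },
    },
)
for bucket in resp["aggregations"]["over_time"]["buckets"]:
    print(bucket["key_as_string"], bucket["executors"]["value"])
```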

    How data is collected

    IBM Spectrum Conductor uses a combination of data loaders to collect operational data from data sources at regular intervals, and the Elastic Stack to extract, transform, and load that data into Elasticsearch. The Spark resource usage data loader (sparkresusageloader) in IBM Spectrum Conductor 2.5.0 collects details for all running or finished Spark applications during a 30-second interval and writes a unique document for each application to this index. The data is stored historically and includes the following (a hypothetical document shape is sketched after this list):

    • Spark application metadata: instance group UUID or name, application ID, user, and more.
    • Resource usage metrics: allocated slots, memory, cores, and executor counts.
    • Instance group consumer: share ratios, resource groups, job priority.
    • Timestamps: submit, start, and end.
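
    For orientation, one document might look roughly like the following. This is purely a hypothetical sketch: except for ExecutorsRunning_CPU, the field names and values are invented, so consult the index mapping in your cluster for the real ones.

```python
# Hypothetical shape of one document written per application per 30-second
# interval; field names (other than ExecutorsRunning_CPU) are invented.
sample_document = {
    "instancegroupname": "ig-analytics",        # Spark application metadata
    "applicationid": "app-20210205100332-0003-xxxx",
    "user": "user5",
    "state": "RUNNING",
    "SlotsAllocated_CPU": 12,                   # resource usage metrics
    "MemoryUsed_CPU": 24576,
    "ExecutorsRunning_CPU": 6,
    "resourcegroup": "ComputeHosts",            # instance group consumer details
    "priority": 5000,
    "submittime": "2021-02-05T10:03:32Z",       # timestamps
    "starttime": "2021-02-05T10:03:40Z",
    "timestamp": "2021-02-05T10:05:30Z",        # 30-second sample time
}
```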

    New Elasticsearch index

    The Spark resource usage data loader (sparkresusageloader) in IBM Spectrum Conductor 2.5.0 writes to the ibm-cws-reporting-spark-application-metric-* indexes.
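
    As a quick sanity check that the data loader is writing documents, you can count recent entries in that index pattern. This sketch assumes a "timestamp" field name; adjust it and the connection details for your cluster.

```python
# Hypothetical sketch: count documents written to the Spark application
# metric index in the last 4 hours.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://your-conductor-host:9200")  # add auth/TLS options for your cluster

result = es.count(
    index="ibm-cws-reporting-spark-application-metric-*",
    body={"query": {"range": {"timestamp": {"gte": "now-4h"}}}},
)
print(result["count"], "documents in the last 4 hours")
```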

    Now, download and try the sample

    To use Kibana to access data collected by IBM Spectrum Conductor and stored in an Elasticsearch cluster, you first need to download Kibana and connect it to the Elasticsearch cluster configured in IBM Spectrum Conductor. You can get full instructions, sample scripts, and dashboards here.

    The sample includes an IBM Spectrum Conductor dashboard to provide visualizations fed by the Spark resource usage metrics from your IBM Spectrum Conductor cluster. You can modify all visualizations, including the dashboard, to fit your business requirements.

    System requirements

    Kibana utilizes the Elasticsearch cluster in your IBM Spectrum Conductor cluster. Before you complete the Kibana integration, ensure that the prerequisites listed in the sample's instructions are met.

    Give it a try and tell us what you think: download IBM Spectrum Conductor 2.5.0 from Passport Advantage, or try the evaluation version! We hope you are as excited as we are about this new release!

    Log in to this community page to comment on this blog post. We look forward to hearing from you on the new features, and what you would like to see in future releases.


    #SpectrumComputingGroup