
Nvidia GPU Infiniband Monitoring – because it’s not all about crypto

By Raul Gonzalez posted Fri April 05, 2024 05:20 AM


A couple of weeks ago, one of my dear friends, Emily, asked me what I knew about GPUs. I thought to myself, ‘another cryptobro who wants to talk about Bitcoin’, but to my surprise, that was not the case.

She had a meeting with a financial services company in the UK that was having issues with its low-latency trading system and couldn’t work out where the problem was. Hence her question: she wanted to know what else they could do to get a better view of what was happening on their platform and finally find the problem.

Let’s start the process

The first thing I did was dig a little deeper to understand what kind of equipment they were using, and the answer was Nvidia Infiniband.

Of course, the first step for a SevOne user like me was to check whether there is a device certification for this type of device… and bingo! Nvidia Infiniband is a type of network device that SevOne has already certified, and we can get a lot of information from it.

Object Types Available in Nvidia Infiniband

A typical question I get from customers: can we monitor CRC errors? Well, that’s a clear yes for Nvidia Infiniband devices.

Indicators Available in Ethernet Nvidia Infiniband

This was a promising start: we can monitor all their switches and routers and get all the relevant data from day one, with no configuration effort required from the end customer. Just let SevOne do the discovery and it will start monitoring all the KPIs that are important to you.

Next Step -> GPU

Our job wasn’t finished yet. We had solved the network visibility problem, but we still needed to monitor the actual GPUs in the clusters running important workloads. As I said at the beginning, it’s not only crypto miners who need GPUs!

How do we monitor these GPUs? My first idea was to go to the OS running on the cluster but, unfortunately, that option didn’t give us the detailed information we needed.

After some digging online, I found Nvidia DCGM (Data Center GPU Manager), which, among its features, includes a tool that lets us collect detailed GPU metrics. And to top it all off, this technology can share the data using Prometheus, a technology that we know very well in SevOne. It looked like it was time to configure SevOne to start collecting data from Nvidia DCGM!
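To give an idea of what this data looks like on the wire, here is a minimal Python sketch that parses metrics in the Prometheus text exposition format. It is only an illustration, not what SevOne does internally: the sample payload, its values and its label sets are made up, and only the metric names follow DCGM's field naming (`DCGM_FI_DEV_GPU_UTIL`, `DCGM_FI_DEV_GPU_TEMP`). The `parse_metrics` helper is a hypothetical name.

```python
import re

# Illustrative payload in the Prometheus text format. Values and labels are
# invented for this example; only the metric names follow DCGM field naming.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 87
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 71
"""

# metric_name{optional labels} value
LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# label="value" pairs inside the braces
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels.get("gpu"), value)
```

Each sample keeps its labels, so one GPU on one host can be mapped to one monitored object with one indicator per metric name.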

In this case, I wanted to run some quick tests (fail fast!) on the data available in Nvidia DCGM, so I decided to use IBM RNA (Rapid Network Automation) to build a workflow that collects data from Prometheus and ingests it into SevOne.

Prometheus Data Collection using IBM RNA
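The RNA workflow itself is built graphically, but the collection half of it can be sketched in a few lines of Python. This is an illustration under assumptions, not the actual workflow: it uses the standard Prometheus HTTP API instant-query endpoint (`/api/v1/query`), the host `prometheus.example` is a placeholder, and the function names are hypothetical. The ingestion into SevOne is deliberately left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query_string = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


def extract_samples(payload):
    """Flatten an instant-query response into a list of plain dicts."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed: %r" % payload.get("error"))
    rows = []
    for series in payload["data"]["result"]:
        timestamp, value = series["value"]  # value arrives as a string
        rows.append({
            "labels": series["metric"],
            "timestamp": float(timestamp),
            "value": float(value),
        })
    return rows


def fetch_samples(base_url, promql, timeout=10):
    """Query Prometheus and return the flattened samples."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return extract_samples(json.load(resp))


# Example (needs a reachable Prometheus server, hence commented out):
# rows = fetch_samples("http://prometheus.example:9090", "DCGM_FI_DEV_GPU_UTIL")
```

Keeping the parsing in `extract_samples` separate from the HTTP call makes it easy to test the mapping against a saved response before pointing it at a live server.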

A few minutes later GPU data started flowing into SevOne.

Example of GPU DCGM Data Collection

Collection is not enough, we need advanced analytics!

This was a great step: managing to ingest the GPU performance data into SevOne was definitely a positive thing. However, as we say in SevOne, collecting data is only half of the job; you still need to analyse it. And when you are collecting thousands or millions of KPIs, you can’t analyse all this data by yourself and on the fly! You need help from a platform that, at the very least, is able to understand what is normal and what is not.

This is the main reason Emily’s customer was asking for help: they were having intermittent issues on their low-latency trading system but couldn’t put their finger on the cause.

After one week of data collection, SevOne learned the normal behaviour of each of the KPIs gathered, and we started receiving anomaly detection notifications…and finally we found the issue.

The Solution

Thanks to the anomaly notification from SevOne, we went back to the SevOne report. Normally, after a reset of the workloads running on that GPU cluster, GPU utilization should drop slightly and then return to normal levels. In this case, however, something went wrong and the GPU started running higher than expected.

Comparison Between Current Behaviour vs Expected Behaviour (Baseline)

NOTE: anomaly detection is not the same as static thresholds; it may be normal for a metric to be over 90% at a specific time of day. Furthermore, learning the normal behaviour of a metric requires taking seasonality into account: you can’t learn it by considering only the last X polls.
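To make the difference with a static threshold concrete, here is a minimal Python sketch of a seasonal baseline. It is not SevOne's algorithm, just one simple way to illustrate the idea: it assumes weekly seasonality, groups history into (weekday, hour) slots, and flags a value that strays too far from what is usual for its slot. The function names and the `tolerance` and `min_spread` parameters are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


def build_baseline(history):
    """history: iterable of (datetime, value) pairs.

    Returns {(weekday, hour): (mean, standard deviation)} so that each
    hour of the week gets its own notion of "normal".
    """
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {slot: (mean(vals), pstdev(vals)) for slot, vals in buckets.items()}


def is_anomalous(baseline, ts, value, tolerance=3.0, min_spread=1.0):
    """True if value deviates from its slot's mean by more than
    tolerance * spread. min_spread avoids a zero-width band when the
    history for a slot is perfectly flat."""
    slot = (ts.weekday(), ts.hour)
    if slot not in baseline:
        return False  # nothing learned for this slot yet
    expected, spread = baseline[slot]
    return abs(value - expected) > tolerance * max(spread, min_spread)


# Synthetic history: four weeks of hourly polls, 95% every day at 09:00
# and 40% the rest of the time.
start = datetime(2024, 3, 4)
history = []
for h in range(24 * 28):
    ts = start + timedelta(hours=h)
    history.append((ts, 95.0 if ts.hour == 9 else 40.0))

baseline = build_baseline(history)
print(is_anomalous(baseline, datetime(2024, 4, 1, 9), 95.0))  # usual at 09:00
print(is_anomalous(baseline, datetime(2024, 4, 1, 3), 95.0))  # unusual at 03:00
```

The same 95% reading is acceptable at 09:00 and suspicious at 03:00, which is exactly what a static 90% threshold cannot express.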

Correlating this data with other anomalies that we received from SevOne, we also saw that the GPU temperature was affected, and as a result the GPU started working more slowly.

Data Correlation in a Single Report
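One simple way to think about this kind of correlation is to look for anomaly windows on different metrics that overlap in time. The Python sketch below does just that. It is an illustration only, not how SevOne correlates data, and the function name and the sample windows are made up.

```python
from datetime import datetime


def overlapping_anomalies(windows_a, windows_b):
    """Return pairs of (start, end) windows that overlap in time.

    Each window is a (start, end) tuple with start < end. Two windows
    overlap when each one starts before the other one ends.
    """
    pairs = []
    for a_start, a_end in windows_a:
        for b_start, b_end in windows_b:
            if a_start < b_end and b_start < a_end:
                pairs.append(((a_start, a_end), (b_start, b_end)))
    return pairs


# Invented anomaly windows for two metrics on the same GPU.
utilization = [(datetime(2024, 3, 20, 10, 0), datetime(2024, 3, 20, 11, 30))]
temperature = [
    (datetime(2024, 3, 19, 8, 0), datetime(2024, 3, 19, 8, 15)),
    (datetime(2024, 3, 20, 10, 20), datetime(2024, 3, 20, 12, 0)),
]

for util_window, temp_window in overlapping_anomalies(utilization, temperature):
    print("utilization", util_window, "overlaps temperature", temp_window)
```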

The end

Less than 20 days after we started working on this issue, which the bank had been unable to diagnose on their GPU clusters, we managed to:

  • Extensively monitor all their Nvidia switches and routers
  • Monitor their GPU clusters
  • Figure out the issue using anomaly detection

As we say in IBM, SevOne allows you to monitor all the network data (including GPU data), and, on top of that, analyse the data to pinpoint where the issue is.