Hi,
Back in 2021/22, there was a document that served as a kind of LSF/DCGM support matrix. Is there an updated version?
We run the latest LSF Service Pack 15 and DCGM 4.2.3, but now we see messages like this:
May 25 05:35:58 2025 87712 3 10.1 checkGPUStatus: Fail to get GPU healthy status: API version mismatch
and this (which might not be related, though):
May 25 05:35:59 2025 87712 3 10.1 controlDevicesForJobCgroup: The bpf processing for job <1234567> doesn't exist any more.
In the bjobs -l output for the job, we also get this:
PENDING REASONS:
Failed to send fan-out information to other SBDs;
Note: the job is a multi-node GPU job.
Any hints? Thanks!
------------------------------
Bernd Dammann
------------------------------