Hi,
thanks for the hint, but I don't think it helps! I can recognize some of the information from 'struct gpuJobData' and the other data structures nested in it, but the "some more information" part is exactly the problem: how can I decode this extra information if there is no documentation of how it was created? Example: this is what I get from "struct jobFinishLog" when parsing the lsb.acct file, for two different GPU jobs. The first one was dispatched to a GPU in MIG mode, the second to a GPU in non-MIG mode:
GPU_ALLOC_COMPAT: 480:1,1,370:10:hostA,5,0,0,4,79:0,0,0,1,68:1,4,513,513,-1,-1,0,1,43:NVIDIAA100_PCIE_40GB,40960,8,0,40239,0,1,1,79:1,0,0,1,68:1,4,513,513,-1,-1,0,1,43:NVIDIAA100_PCIE_40GB,40960,8,0,40239,0,1,1,79:2,0,0,1,68:1,4,513,513,-1,-1,0,1,43:NVIDIAA100_PCIE_40GB,40960,8,0,40239,0,1,1,79:3,0,0,1,68:1,4,513,513,-1,-1,0,1,43:NVIDIAA100_PCIE_40GB,40960,8,0,40239,0,1,1,1,HOST_NVIDIA_CNT:2,1,EFFECTIVE_GPU_REQ:num=1:mode=exclusive_process:mps=no:j_exclusive=yes:aff=no:gvendor=nvidia:mig=1/1,
GPU_ALLOC_COMPAT: 454:0,1,352:9:hostB,1,4,0,4,75:0,1,50:0,1,41:NVIDIAA10080GBPCIe,81920,8,0,81038,0,0,0,0,1,GPU_MEM_RSV:0,0,75:1,1,50:0,1,41:NVIDIAA10080GBPCIe,81920,8,0,81038,0,0,0,0,1,GPU_MEM_RSV:0,0,75:2,1,50:0,1,41:NVIDIAA10080GBPCIe,81920,8,0,81038,0,0,0,0,1,GPU_MEM_RSV:0,0,75:3,1,50:0,1,41:NVIDIAA10080GBPCIe,81920,8,0,81038,0,0,0,0,1,GPU_MEM_RSV:0,0,1,HOST_NVIDIA_CNT:2,1,EFFECTIVE_GPU_REQ:num=1:mode=exclusive_process:mps=no:j_exclusive=yes:aff=no:gvendor=nvidia,
How should I interpret these two very different lines?
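One pattern can be read off the two samples even without documentation: several of the numbers in front of a colon equal the character count of the run that follows. For example, "43:" is followed by exactly 43 characters ("NVIDIAA100_PCIE_40GB,40960,8,0,40239,0,1,1,"), and "41:" in the second line by exactly 41. The key=value list after "EFFECTIVE_GPU_REQ:" is readable as it stands. Below is a minimal Python sketch built only on these two observations. It is a guess at the layout, not the documented format; the function names gpu_models and effective_gpu_req are made up for illustration, and the rule "a model record is a name followed by at least three integer fields" is an assumption that happens to fit both samples.

```python
import re

# "<len>:<name>," where <name> starts with a letter (assumed shape of a model token).
LEN_PREFIXED = re.compile(r"(\d+):([A-Za-z][A-Za-z0-9_]*),")
# Assumed shape of a whole model record: a name, then >= 3 integer fields,
# each terminated by a comma. In the posted samples the host-name run
# ("10:hostA,5,0,") also fits the length check, so the field count is the guard.
MODEL_BODY = re.compile(r"[A-Za-z][A-Za-z0-9_]*,(?:-?\d+,){3,}")


def gpu_models(alloc):
    """Return the model name of every GPU record found in the string."""
    found = []
    for m in LEN_PREFIXED.finditer(alloc):
        length = int(m.group(1))
        start = m.start(2)
        body = alloc[start:start + length]
        # Keep the token only if the length prefix really spans a full record.
        if len(body) == length and MODEL_BODY.fullmatch(body):
            found.append(m.group(2))
    return found


def effective_gpu_req(alloc):
    """Return the EFFECTIVE_GPU_REQ key=value pairs as a dict.

    Assumes the list ends at the next comma, as it does in both samples.
    """
    m = re.search(r"EFFECTIVE_GPU_REQ:([^,]*)", alloc)
    if m is None:
        return {}
    items = m.group(1).split(":")
    return dict(item.split("=", 1) for item in items if "=" in item)
```

Applied to the two lines above, gpu_models returns one entry per GPU record in the string (four per line in these samples, not only the GPU the job ran on), and effective_gpu_req(line).get("mig") gives "1/1" for the first line and None for the second. This does not explain the remaining fields.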
------------------------------
Bernd Dammann
------------------------------
Original Message:
Sent: Wed July 17, 2024 10:04 PM
From: Ji Shan Xing
Subject: LSF: decoding GPU information in lsb.acct
Please refer to the structure "struct gpuJobData" in lsbatch.h. This string is generated from gpuJobData, but adds some more information.
------------------------------
Ji Shan Xing
Original Message:
Sent: Wed July 17, 2024 07:15 AM
From: Bernd Dammann
Subject: LSF: decoding GPU information in lsb.acct
Hi,
is there any documentation, or even better, an API call, to decode the GPU_ALLOC_COMPAT string in the lsb.acct file? We have our own internal version of 'bacct' that provides, e.g., JSON output which we can feed into our internal accounting/billing tool. With different models in the GPU queues, we also need to get at least the GPU model, and possibly the 'mig' value, from this string.
Thanks!
------------------------------
Bernd Dammann
------------------------------