Hi,
thanks for the hint, but I don't think it helps! I can recognize some of the information from 'struct gpuJobData' and the data structures nested inside it, but the "some more information" part is exactly the problem. How can I decode this extra information if there is no documentation of how it was created? Example: this is what I get from 'struct jobFinishLog' when parsing the lsb.acct file, for two different GPU jobs. The first one was dispatched to a GPU in MIG mode, the second to a GPU in non-MIG mode:
GPU_ALLOC_COMPAT: 480:1,1,370:10:hostA,5,0,0,4,79:0,0,0,1,68:1,4,513,513,-1,-1,0,1,43:NVIDIAA100_PCIE_40GB,40960,8,0,40239,0,1,1,79:1,0,0,1,68:1,4,513,513,-1,-1,0,1,43:NVIDIAA100_PCIE_40GB,40960,8,0,40239,0,1,1,79:2,0,0,1,68:1,4,513,513,-1,-1,0,1,43:NVIDIAA100_PCIE_40GB,40960,8,0,40239,0,1,1,79:3,0,0,1,68:1,4,513,513,-1,-1,0,1,43:NVIDIAA100_PCIE_40GB,40960,8,0,40239,0,1,1,1,HOST_NVIDIA_CNT:2,1,EFFECTIVE_GPU_REQ:num=1:mode=exclusive_process:mps=no:j_exclusive=yes:aff=no:gvendor=nvidia:mig=1/1,
GPU_ALLOC_COMPAT: 454:0,1,352:9:hostB,1,4,0,4,75:0,1,50:0,1,41:NVIDIAA10080GBPCIe,81920,8,0,81038,0,0,0,0,1,GPU_MEM_RSV:0,0,75:1,1,50:0,1,41:NVIDIAA10080GBPCIe,81920,8,0,81038,0,0,0,0,1,GPU_MEM_RSV:0,0,75:2,1,50:0,1,41:NVIDIAA10080GBPCIe,81920,8,0,81038,0,0,0,0,1,GPU_MEM_RSV:0,0,75:3,1,50:0,1,41:NVIDIAA10080GBPCIe,81920,8,0,81038,0,0,0,0,1,GPU_MEM_RSV:0,0,1,HOST_NVIDIA_CNT:2,1,EFFECTIVE_GPU_REQ:num=1:mode=exclusive_process:mps=no:j_exclusive=yes:aff=no:gvendor=nvidia,
How should I interpret these two very different lines?
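One regularity can at least be checked directly in the two samples: every number followed by a colon seems to be the character count of the record that follows it, and these records nest. For example, "43:" is followed by exactly 43 characters ("NVIDIAA100_PCIE_40GB,...,1,1,"), and that block sits inside a "68:" block, which sits inside a "79:" block. The per-GPU blocks of the second line ("41:", "50:", "75:") add up the same way. The outer counts (480/370 and 454/352) only add up if the real host names were 10 and 9 characters long and their trailing comma is not counted, so they cannot be verified against the anonymized text above. The Python sketch below only splits a string along these counts. It is based purely on this observation, not on any LSF documentation; the function name parse_fields is made up for illustration, and it says nothing about what the individual fields mean.

```python
import re

# Assumption inferred from the samples: "<digits>:" announces a record of
# exactly that many characters. This is not a documented LSF format.
LENGTH_PREFIX = re.compile(r"(\d+):")


def parse_fields(text):
    """Split text into comma-separated fields, nesting length-prefixed records."""
    fields = []
    pos = 0
    while pos < len(text):
        match = LENGTH_PREFIX.match(text, pos)
        if match:
            size = int(match.group(1))
            start = match.end()
            body = text[start:start + size]
            if len(body) < size:
                raise ValueError(
                    f"record at offset {pos} claims {size} characters, "
                    f"but only {len(body)} are left"
                )
            pos = start + size
            if body.endswith(","):
                # Body is itself a list of fields: parse it recursively.
                fields.append(parse_fields(body))
            else:
                # Body looks like a plain string (assumed: the host name),
                # followed by a separator comma that is not counted.
                fields.append(body)
                if text.startswith(",", pos):
                    pos += 1
        else:
            # Ordinary field, including "KEY:value" items such as GPU_MEM_RSV:0.
            end = text.find(",", pos)
            if end == -1:
                end = len(text)
            fields.append(text[pos:end])
            pos = end + 1
    return fields


# First per-GPU block of the MIG job, copied from the line above.
mig_gpu = ("79:0,0,0,1,68:1,4,513,513,-1,-1,0,1,"
           "43:NVIDIAA100_PCIE_40GB,40960,8,0,40239,0,1,1,")
print(parse_fields(mig_gpu))
```

Run on the first per-GPU block of the non-MIG job, the same function produces a differently shaped nesting (three levels with extra trailing fields such as GPU_MEM_RSV:0), which at least makes the structural difference between the two lines visible, even though the meaning of each field is still unknown.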
------------------------------
Bernd Dammann
------------------------------
Original Message:
Sent: Wed July 17, 2024 10:04 PM
From: Ji Shan Xing
Subject: LSF: decoding GPU information in lsb.acct
Please refer to the structure "struct gpuJobData" in lsbatch.h. This string is generated from gpuJobData but adds some more information.
------------------------------
Ji Shan Xing
------------------------------