I can confirm what Larry writes: the default setting in LSF was (and maybe still is?) to include cached files as part of the memory usage. We had ML workloads where the application itself used only a few GB of memory, but because the training data consisted of several hundred GB (and thousands of files), the OS's I/O caching easily pushed those jobs over their memory limit! The magic setting is LSB_CGROUP_MEM_INCLUDE_CACHE=N in lsf.conf.
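For reference, the relevant lines in our lsf.conf look roughly like this (a sketch from memory; LSF_LINUX_CGROUP_ACCT is the parameter that turns on the cgroup-based accounting Larry mentions, and you should confirm the exact parameter names against the documentation for your LSF version):

    # lsf.conf (illustrative sketch -- verify against your LSF version's docs)
    LSF_LINUX_CGROUP_ACCT=Y           # collect per-job memory/CPU from cgroups
    LSB_CGROUP_MEM_INCLUDE_CACHE=N    # do NOT count OS page cache toward job memory

Keep in mind that lsf.conf changes generally only take effect after reconfiguring the daemons (e.g. lsadmin reconfig followed by badmin mbdrestart).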
Original Message:
Sent: Wed April 26, 2023 11:59 AM
From: Larry Adams
Subject: LSF max memory limit overrun
Cory,
As a rule, on more modern OSes starting with RHEL7+, you should always use CGROUP accounting in LSF. It's much more accurate. It's also imperative that you do NOT include disk cache in the CGROUP metric, which is an LSF setting. If you include disk cache, jobs will be killed even though their RSS would otherwise allow them to keep running. Many years ago that was the design, and they later introduced a setting to ignore cache memory in the memory total for the job.
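To make the distinction concrete: with cgroup accounting the kernel keeps RSS and page cache as separate counters for the job's cgroup, so you can see directly how much of the "memory" total is really just cache. A rough way to check on a cgroup-v1 host (the exact path LSF uses for its per-job cgroups varies by installation, so treat this only as a sketch):

    # Locate the job's memory cgroup (path layout below is an assumption -- check your hosts)
    JOB_CGROUP=/sys/fs/cgroup/memory/lsf/<clustername>/job.<jobid>
    grep -E '^(rss|cache) ' $JOB_CGROUP/memory.stat
    # rss    -> anonymous memory the job's processes actually allocated
    # cache  -> page cache from reading input files; reclaimable by the kernel

If "cache" dwarfs "rss", the job is not really over its limit; it is just reading a lot of data.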
------------------------------
Larry Adams
Original Message:
Sent: Wed April 26, 2023 08:29 AM
From: Cory Engebretson
Subject: LSF max memory limit overrun
Thanks Yi and Larry for the replies. We have RTM tracking in place, and when this problem arises again I will compare the RTM stats with the memory usage reported directly by the Linux VM. It's still possible that the problem HPC jobs really are using the reported amount of memory and LSF is doing what it should. I posted here to see if anyone had experienced errors in LSF's measurement of job memory usage; so far, I'm not hearing of anyone seeing that problem.
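One way to do that comparison, in case it helps anyone else following along (a sketch; substitute a real job ID and user):

    # What LSF thinks the job is using (MAX MEM / AVG MEM in the MEMORY USAGE section)
    bjobs -l <jobid>
    bhist -l <jobid>                    # same info after the job has finished or been killed

    # What the host itself reports for the job's processes while it is running
    ps -o pid,rss,vsz,comm -u <user>

If the bjobs/RTM number is far above the summed RSS from ps, that points at cache or an accounting quirk rather than the application itself.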
Cory
------------------------------
Cory Engebretson
Original Message:
Sent: Tue April 25, 2023 10:48 AM
From: Larry Adams
Subject: LSF max memory limit overrun
Cory,
This is always a challenge, not only the reservation piece but the memory limit part, which should always be higher. We track such things in RTM pretty heavily at various levels, even organizationally. With those time series charts, management can push their teams to make better-quality reservations. Of course there is also the ML approach; I believe IBM sells a product to assist with this, called Predictor I think. It uses machine learning to 'estimate' what a job's reservation should be and, when activated, takes over the memory reservations. Kind of a nice topic. We do the same where we are, but wrote our own system.
When you tighten the screws too much, you can then run into OOM issues. IBM has a nice feature where it will automatically kill the job that is exceeding its memory reservation the most in order to save the host. As Sun Yi mentioned, you can always call the IBM support organization to press for additional details on such things.
Larry
------------------------------
Larry Adams
Original Message:
Sent: Mon April 24, 2023 10:03 AM
From: Cory Engebretson
Subject: LSF max memory limit overrun
A colleague of mine has a job they are running on our Linux HPC cluster using LSF. Our LSF environment is configured to require a memory resource request for each bsub, and jobs that go over their requested memory by 10% or more are killed. LSF is killing about 25% of their recent jobs for overrunning the requested memory. Those killed jobs can be rerun and easily stay under the requested memory. These jobs typically use 8-16 GB of memory, but the killed ones are reported as using 128-512 GB. We cannot explain the memory spike, and I'm beginning to wonder if LSF is wrongly detecting the memory usage.
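For context, submissions here look roughly like this (illustrative values and script name; the unit of the mem rusage and of -M depends on LSF_UNIT_FOR_LIMITS in lsf.conf):

    # Reserve 16 GB and set a matching memory limit (units assumed to be MB here)
    bsub -R "rusage[mem=16384]" -M 16384 -o job.%J.out ./run_job.sh

With the 10% overrun policy, a job reserved at 16 GB would be killed once LSF believes it has gone past roughly 17.6 GB.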
Has anyone else experienced behavior like this? Any suggestions on what to monitor?
Thanks,
Cory Engebretson
------------------------------
Cory Engebretson
------------------------------