High Performance Computing Group

Application integrations that leverage bkill to terminate workers

    Posted Tue April 13, 2021 10:34 AM

    I am sometimes surprised by how many people have LSF but don't use RTM. For us it has been the killer feature of the stack. So this may not be affecting as many people, but it has been a growing nuisance for us.

    Recently we have had a significant increase in Dask usage, and our Spark usage has shifted from a very rigid model, where each worker is a single job occupying a whole node, to arbitrary and fairly small individual workers, but many more of them. This has led to our average daily run rate going from around 200k jobs to 300k+. Because these integrations are not tight, they tend to submit jobs whose processes just sit around waiting for work to hit a message queue of some type, and to close them out they are simply bkilled. This was a bit ugly but manageable when it was primarily Spark and a big cluster might have 20 or 40 workers that would run for an extended time. Now we are seeing much wider jobs that turn over faster: a Dask cluster can easily have thousands of workers and last for just a few minutes. This is great from a utilization standpoint, because users can slip into all the little cracks where hosts have a few leftover slots, but it is making RTM's grid_bjobs view of EXIT jobs nearly unusable most of the time. Tens of thousands of these jobs can turn over every hour. From the user's standpoint they did exactly what they were supposed to do, yet they end up in the EXIT bucket, and with all of these effectively successful jobs in there it is hard to find the jobs that really failed.
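    As a rough illustration of the triage problem, here is a minimal Python sketch that splits EXIT jobs into "killed by owner" and "everything else". It assumes the job records have already been exported into plain dictionaries. The field names `status` and `term_reason` are hypothetical and are not taken from RTM's schema or any LSF API; only the TERM_OWNER reason itself is something LSF records for jobs killed by their owner.

```python
def split_exit_jobs(jobs):
    """Separate EXIT jobs killed by their owner from other EXIT jobs.

    `jobs` is an iterable of dicts. The keys "status" and "term_reason"
    are hypothetical names used only for this illustration.
    """
    owner_killed = []
    other_exits = []
    for job in jobs:
        if job.get("status") != "EXIT":
            continue  # DONE, RUN, PEND, etc. are not part of this triage
        if job.get("term_reason") == "TERM_OWNER":
            owner_killed.append(job)
        else:
            other_exits.append(job)
    return owner_killed, other_exits


# Made-up sample records, for illustration only.
sample = [
    {"job_id": 1, "status": "EXIT", "term_reason": "TERM_OWNER"},
    {"job_id": 2, "status": "EXIT", "term_reason": "TERM_MEMLIMIT"},
    {"job_id": 3, "status": "DONE", "term_reason": None},
]
killed, failed = split_exit_jobs(sample)
print(len(killed), "owner-killed;", len(failed), "other EXIT jobs")
```

    A filter like this is only a partial answer, since a user can also bkill a job that really is hung or broken, and that job carries the same TERM_OWNER reason.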

    I did work with support, and there is currently no direct option for bkill to say "treat this job as successful even though I am killing it." Locally we have experimented with adding wrappers that catch something like a SIGUSR1 and exit 0, but all the various methods have felt like a house of cards. We also received some pushback from the users who were helping with those tests: they really preferred the escalating behavior of bkill's default, which makes sure the job is killed if something goes wrong. So at this point we have submitted an RFE to address it. In a perfect world it would result in something as simple as a flag that forces the job to be logged as DONE rather than EXIT while still recording that it was TERM_OWNER.
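    For anyone curious what such a wrapper might look like, here is a minimal Python sketch of the general idea, not the wrapper we actually tested. It assumes the integration sends SIGUSR1 to the job (for example through bkill's -s option) instead of a plain bkill. The function name `run_with_graceful_exit` and the `grace_seconds` parameter are made up for this example. On SIGUSR1 the wrapper asks the worker to stop with SIGTERM, escalates to SIGKILL if the worker is still alive after the grace period, and then reports exit code 0.

```python
import signal
import subprocess
import sys
import time


def run_with_graceful_exit(cmd, graceful_signal=signal.SIGUSR1, grace_seconds=10.0):
    """Run cmd as a child process; if graceful_signal arrives, stop it and return 0.

    Without the signal, the child's own exit status is passed through.
    """
    stop = {"at": None}  # time at which the graceful stop was requested

    def on_graceful(signum, frame):
        # Keep the handler minimal: only record that a stop was requested.
        if stop["at"] is None:
            stop["at"] = time.monotonic()

    previous = signal.signal(graceful_signal, on_graceful)
    try:
        child = subprocess.Popen(cmd)
        term_sent = False
        while True:
            try:
                returncode = child.wait(timeout=0.2)
                break
            except subprocess.TimeoutExpired:
                pass
            if stop["at"] is None:
                continue  # no stop requested, keep waiting
            if not term_sent:
                child.terminate()  # polite request first (SIGTERM)
                term_sent = True
            elif time.monotonic() - stop["at"] > grace_seconds:
                child.kill()  # escalate if the worker ignores SIGTERM
    finally:
        signal.signal(graceful_signal, previous)

    if stop["at"] is not None:
        return 0  # shutdown was requested, so report success
    if returncode < 0:
        return 128 - returncode  # child died from a signal: shell-style code
    return returncode


# Only runs when a worker command is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_with_graceful_exit(sys.argv[1:]))
```

    Even in this small form you can see the house of cards: the wrapper has to reimplement its own escalation, it reports success even if the worker was already failing when the signal arrived, and a plain bkill still bypasses it and leaves the job in EXIT.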

    So if this is affecting you, or could affect you as you scale up, I encourage you to take a look at this RFE: http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=149868

    If you don't want to follow the link you can find it on the RFE list as:

    Add bkill option that will set the job status to DONE rather than EXIT


    Best regards,
    Rob Lines
    Janelia Research Campus
    Howard Hughes Medical Institute