Decision Optimization

docplex.cp.model out-of-memory running on a cluster

  • 1.  docplex.cp.model out-of-memory running on a cluster

    Posted Thu July 29, 2021 08:04 AM
    I am trying to run the following docplex.cp.model on a large dataset. Here it is with randomly generated sample data:

    import numpy as np
    from docplex.cp.model import CpoModel
    N = 180000
    S = 10
    k = 2
    
    u_i = np.random.rand(N)[:,np.newaxis]
    u_ij = np.random.rand(N*S).reshape(N, S)
    beta = np.random.rand(N)[:,np.newaxis]
    
    m = CpoModel(name = 'model')
    R = range(1, S)
    
    idx = [(j) for j in R]
    I = m.binary_var_dict(idx)
    m.add_constraint(m.sum(I[j] for j in R)<= k)
    
    total_rev = m.sum(beta[i,0] / ( 1 + u_i[i,0]/sum(I[j] * u_ij[i-1,j]  for j in R) ) for i in range(N) )
    
    m.maximize(total_rev)
    
    sol=m.solve(agent='local',execfile='/Users/Mine/Python/tf2_4_env/bin/cpoptimizer')
    
    print(sol.get_solver_log())

    I have tried to run this on a cluster with the following settings:

    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=4
    #SBATCH --mem-per-cpu=4571
    The run stops with an out-of-memory error, as shown in the output:

    ! --------------------------------------------------- CP Optimizer 20.1.0.0 --
     ! Maximization problem - 9 variables, 1 constraint
     ! Presolve      : 360001 extractables eliminated
     ! Initial process time : 28.95s (28.77s extraction + 0.19s propagation)
     !  . Log search space  : 9.0 (before), 9.0 (after)
     !  . Memory usage      : 623.2 MB (before), 623.2 MB (after)
     ! Using parallel search with 28 workers.
     ! ----------------------------------------------------------------------------
     !          Best Branches  Non-fixed    W       Branch decision
                            0          9                 -
     + New bound is 80920.82
    Traceback (most recent call last):
      File "sample.py", line 22, in <module>
        sol=m.solve(agent='local',execfile='/home/wbs/bstqhc/.local/bin/cpoptimizer') #agent='local',execfile='/Users/Mine/Python/tf2_4_env/bin/cpoptimizer')
      File "/home/wbs/bstqhc/.local/lib/python3.7/site-packages/docplex/cp/model.py", line 1222, in solve
        msol = solver.solve()
      File "/home/wbs/bstqhc/.local/lib/python3.7/site-packages/docplex/cp/solver/solver.py", line 775, in solve
        raise e
      File "/home/wbs/bstqhc/.local/lib/python3.7/site-packages/docplex/cp/solver/solver.py", line 768, in solve
        msol = self.agent.solve()
      File "/home/wbs/bstqhc/.local/lib/python3.7/site-packages/docplex/cp/solver/solver_local.py", line 209, in solve
        jsol = self._wait_json_result(EVT_SOLVE_RESULT)
      File "/home/wbs/bstqhc/.local/lib/python3.7/site-packages/docplex/cp/solver/solver_local.py", line 545, in _wait_json_result
        data = self._wait_event(evt)
      File "/home/wbs/bstqhc/.local/lib/python3.7/site-packages/docplex/cp/solver/solver_local.py", line 448, in _wait_event
        evt, data = self._read_message()
      File "/home/wbs/bstqhc/.local/lib/python3.7/site-packages/docplex/cp/solver/solver_local.py", line 604, in _read_message
        frame = self._read_frame(6)
      File "/home/wbs/bstqhc/.local/lib/python3.7/site-packages/docplex/cp/solver/solver_local.py", line 664, in _read_frame
        raise CpoSolverException("Nothing to read from local solver process. Process seems to have been stopped (rc={}).".format(rc))
    docplex.cp.solver.solver.CpoSolverException: Nothing to read from local solver process. Process seems to have been stopped (rc=-9).
    slurmstepd: error: Detected 2 oom-kill event(s) in step 379869.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

    What I observed is that the optimisation runs in parallel: the log says "Using parallel search with 28 workers", and there are 28 cores per node. However, it looks like it is only using 1 node.

    Can you please help me to overcome the out-of-memory issue?





    ------------------------------
    SHanaka Perera
    ------------------------------


  • 2.  RE: docplex.cp.model out-of-memory running on a cluster

    Posted Thu July 29, 2021 11:50 AM
    Dear SHanaka,

    I understand that the SBATCH parameters are configuration parameters of the SLURM workload manager that you are using to launch your jobs.
    I guess it launches one job that runs the solver, but since each of your nodes has 28 cores, the solver attempts to use them all. If your model is too large, it is better to reduce the number of workers the solver uses by setting its 'Workers' parameter to a smaller value. If the model consumes a lot of memory compared to what your node has to offer, please consider using Workers=1. Here is the list of parameters: http://ibmdecisionoptimization.github.io/docplex-doc/cp/docplex.cp.parameters.py.html
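    A rough back-of-the-envelope check of that reasoning, using the figures from your batch script and solver log. It assumes that each worker keeps roughly its own copy of the model, so that memory grows about linearly with the worker count, and that SLURM allocates one CPU per task. The linear model and the helper name `estimated_solver_memory_mb` are illustrative assumptions, not measurements.

```python
# Figures taken from the SBATCH header and the solver log in the first post.
MEM_PER_CPU_MB = 4571      # --mem-per-cpu
TASKS_PER_NODE = 4         # --ntasks-per-node
MODEL_MB = 623.2           # "Memory usage" reported after presolve
WORKERS = 28               # "Using parallel search with 28 workers"


def estimated_solver_memory_mb(model_mb, workers):
    """Assumed linear model: roughly one copy of the model per worker."""
    return model_mb * workers


# Memory SLURM grants on one node (assumption: one CPU per task).
node_budget_mb = MEM_PER_CPU_MB * TASKS_PER_NODE
estimate_mb = estimated_solver_memory_mb(MODEL_MB, WORKERS)

print(f"node budget : {node_budget_mb} MB")
print(f"28 workers  : {estimate_mb:.1f} MB")
print(f"headroom    : {node_budget_mb - estimate_mb:.1f} MB")
```

    Under these assumptions the solver alone would take most of the node's allocation, before counting the Python process that holds the NumPy arrays and the model, or any memory the search itself allocates.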
    I hope this helps,

                Renaud

    ------------------------------
    Renaud Dumeur
    ------------------------------



  • 3.  RE: docplex.cp.model out-of-memory running on a cluster

    Posted Thu July 29, 2021 11:57 AM
    Hi Renaud Dumeur,

    Thanks for your response. I have two follow-up questions:

    1. Is there a way to use more than 1 cluster node when running the optimisation? 

    2. Yes, I already tried setting Workers=1 with:
    from docplex.cp import parameters
    parameters.Workers = 1

    but the solver still uses all the cores. Can you please help me set this parameter correctly?

    Thanks,
    Shanaka 



    ------------------------------
    SHanaka Perera
    ------------------------------



  • 4.  RE: docplex.cp.model out-of-memory running on a cluster

    Posted Thu July 29, 2021 12:18 PM
    Dear SHanaka,

    The documentation at https://ibmdecisionoptimization.github.io/docplex-doc/cp/docplex.cp.model.py.html states that solve( ... ) accepts solver parameters as keyword arguments,
    so calling mdl.solve( ... , Workers=1) should work.
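    As a minimal sketch of how the worker count could be chosen before it is passed to solve( ... ): the helper below reads SLURM_CPUS_ON_NODE, the environment variable in which SLURM reports the CPUs available to the batch step on the node, and falls back to 1. The helper name `workers_for_allocation` is hypothetical, and it does not check whether the node has enough memory for that many workers, so Workers=1 remains the safest choice for a large model.

```python
import os


def workers_for_allocation(env=os.environ, default=1):
    """Return the CPU count SLURM reports for this node, or `default`.

    Hypothetical helper: falls back to `default` when SLURM_CPUS_ON_NODE
    is missing or not a positive integer (for example outside SLURM).
    """
    value = env.get("SLURM_CPUS_ON_NODE", "")
    if value.isdigit() and int(value) > 0:
        return int(value)
    return default


# Keyword arguments to forward to solve(); Workers is the solver
# parameter discussed above.
solve_kwargs = {"Workers": workers_for_allocation()}
print(solve_kwargs)

# With the model `m` from the first post (left commented out because it
# needs docplex and a local cpoptimizer executable):
# sol = m.solve(agent='local',
#               execfile='/home/wbs/bstqhc/.local/bin/cpoptimizer',
#               **solve_kwargs)
```

    Passing the value through solve( ... ) matters because assigning `parameters.Workers = 1` on the imported module only sets a module attribute and is not handed to the solver.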
    I hope this helps.

            Renaud

    ------------------------------
    Renaud Dumeur
    ------------------------------



  • 5.  RE: docplex.cp.model out-of-memory running on a cluster

    Posted Thu July 29, 2021 02:21 PM
    Great, yes, that gets the Workers parameter working.

    But can someone please help me with my original question: is there a way to use more than 1 cluster node when running the optimisation?

    ------------------------------
    SHanaka Perera
    ------------------------------



  • 6.  RE: docplex.cp.model out-of-memory running on a cluster

    Posted Fri July 30, 2021 03:48 AM
    Dear SHanaka,

    Your question is unrelated to our product, but having read https://slurm.schedmd.com/quickstart_admin.html, I think you should check that your partition (the set of machines available when launching jobs with SLURM) is set up properly.
    I hope this helps.

    Cheers,

    ------------------------------
    Renaud Dumeur
    ------------------------------