Hi,
I am trying to get LSF working with conda environments on a single node.
When I run the job standalone, it works fine, and nvidia-smi shows that it is running.
But when I submit the same job with bsub, it fails with "no CUDA-capable device is detected":
-------------
(pytorch1.10) [root@newell1 benchmark-dso]# more out.txt
THCudaCheck FAIL file=../aten/src/THC/THCGeneral.cpp line=52 error=100 : no CUDA-capable device is detected
Traceback (most recent call last):
  File "./EDSR/src/main.py", line 35, in <module>
    main()
  File "./EDSR/src/main.py", line 25, in main
    _model = model.Model(args, checkpoint)
  File "/opt/benchmark-dso/EDSR/src/model/__init__.py", line 26, in __init__
    self.model = module.make_model(args).to(self.device)
  File "/opt/anaconda3/envs/pytorch1.10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 386, in to
    return self._apply(convert)
  File "/opt/anaconda3/envs/pytorch1.10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/opt/anaconda3/envs/pytorch1.10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 199, in _apply
    param.data = fn(param.data)
  File "/opt/anaconda3/envs/pytorch1.10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 384, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/opt/anaconda3/envs/pytorch1.10/lib/python3.6/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
    torch._C._cuda_init()
----------------------------
What am I doing wrong?
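
In case it helps narrow this down, here is a minimal diagnostic sketch that could be run once standalone and once through bsub to compare what each process sees. It rests on an assumption that is not confirmed by the output above: that the difference lies in the environment the job inherits (for example, CUDA_VISIBLE_DEVICES being unset in one case and empty in the other). The script name gpu_env_check.py, the helper names, and the list of variables are all hypothetical choices for illustration.

```python
import os

# Variables that commonly affect GPU visibility or library lookup.
# Assumption: this list is illustrative, not exhaustive.
GPU_ENV_VARS = (
    "CUDA_VISIBLE_DEVICES",
    "LD_LIBRARY_PATH",
    "CONDA_DEFAULT_ENV",
    "PATH",
)


def gpu_env_report(environ=None):
    """Return the selected variables, marking missing ones as None."""
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in GPU_ENV_VARS}


def torch_cuda_status():
    """Report whether torch can see a CUDA device, if torch is importable."""
    try:
        import torch
    except ImportError:
        return "torch is not importable in this environment"
    return "torch.cuda.is_available() = {}".format(torch.cuda.is_available())


if __name__ == "__main__":
    for name, value in gpu_env_report().items():
        # repr() distinguishes an unset variable (None) from an empty string.
        print("{} = {!r}".format(name, value))
    print(torch_cuda_status())
```

Running it directly (python gpu_env_check.py) and then via bsub with the output redirected (for example, bsub -o env_bsub.txt python gpu_env_check.py) should make any difference between the two environments visible side by side.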
------------------------------
GILBERT THOMAS
------------------------------