Thanks for your reply.

The problem is that users are running commands on the submission node, e.g.:

    module load tensorflow
    srun myprogram

So they get the submission node's version of tensorflow (and its
PATH/PYTHONPATH), plus any additional default modules. There is never a
chance to run the "module add ${SLURM_CONSTRAINT}", or to remove the
unwanted modules that were loaded (maybe automatically) on the submission
node and don't work on the execution node.

Thanks,
    Yair.

On Tue, Dec 19 2017, "Loris Bennett" <loris.benn...@fu-berlin.de> wrote:

> Hi Yair,
>
> Yair Yarom <ir...@cs.huji.ac.il> writes:
>
>> Hi list,
>>
>> We use here lmod[1] for some software/version management. There are two
>> issues encountered (so far):
>>
>> 1. The submission node can have different software than the execution
>>    nodes - different cpu, different gpu (if any), infiniband, etc. When
>>    a user runs 'module load something' on the submission node, it will
>>    pass the wrong environment to the task in the execution
>>    node. e.g. "module load tensorflow" can load a different version
>>    depending on the nodes.
>>
>> 2. There are some modules we want to load by default, and again this can
>>    be different between nodes (we do this by source'ing /etc/lmod/lmodrc
>>    and ~/.lmodrc).
>>
>> For issue 1, we instruct users to run the "module load" in their batch
>> script and not before running sbatch, but issue 2 is more problematic.
>>
>> My current solution is to write a TaskProlog script that runs "module
>> purge" and "module load" and export/unset the changed environment
>> variables. I was wondering if anyone encountered this issue and have a
>> less cumbersome solution.
>>
>> Thanks in advance,
>>     Yair.
>>
>> [1] https://www.tacc.utexas.edu/research-development/tacc-projects/lmod
>
> I don't fully understand your use-case, but, assuming you can divide
> your nodes up by some feature, could you define a module per feature
> which just loads the specific modules needed for that category, e.g. in
> the batch file you would have
>
>   #SBATCH --constraint=shiny_and_new
>
>   module add ${SLURM_CONSTRAINT}
>
> and would have a module file 'shiny_and_new', with contents like, say,
>
>   module add tensorflow/2.0
>   module add cuda/9.0
>
> whereas the module 'rusty_and_old' would contain
>
>   module add tensorflow/0.1
>   module add cuda/0.2
>
> Would that help?
>
> Cheers,
>
> Loris
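
For reference, the TaskProlog approach mentioned in the quoted message could
be sketched roughly as below. This is only an illustration, not the script
actually in use. It relies on Slurm reading the TaskProlog's standard output,
where "export NAME=value" sets a variable for the task and "unset NAME"
removes one. The helper name emit_env_diff and the Lmod init path
/etc/profile.d/lmod.sh are assumptions; the two lmodrc files are the ones
named in the quoted message and are assumed to be shell snippets that can be
sourced.

```shell
#!/bin/bash
# Sketch of a Slurm TaskProlog. Slurm parses the prolog's stdout:
# "export NAME=value" sets a variable for the task, "unset NAME" removes it.

# emit_env_diff BEFORE AFTER   (hypothetical helper)
# BEFORE and AFTER are files of NAME=value lines, as printed by `env`.
# Limitation: values containing newlines (e.g. exported shell functions)
# are not handled correctly by this sketch.
emit_env_diff() {
    local before="$1" after="$2" line name
    # Variables that are new, or whose value changed.
    while IFS= read -r line; do
        grep -Fxq -- "$line" "$before" || printf 'export %s\n' "$line"
    done < "$after"
    # Variables that existed before but are gone now.
    while IFS= read -r line; do
        name="${line%%=*}"
        grep -q -- "^${name}=" "$after" || printf 'unset %s\n' "$name"
    done < "$before"
}

# Assumed location of the Lmod shell init file; adjust for the local install.
lmod_init=/etc/profile.d/lmod.sh

# Run only inside a Slurm task, and only if Lmod is present on this node.
if [ -n "${SLURM_JOB_ID:-}" ] && [ -r "$lmod_init" ]; then
    before=$(mktemp) after=$(mktemp)
    env | sort > "$before"
    # Keep stdout clean: anything printed there is interpreted by Slurm.
    . "$lmod_init" >&2
    module purge >&2
    # Reload this node's default modules.
    for rc in /etc/lmod/lmodrc "$HOME/.lmodrc"; do
        [ -r "$rc" ] && . "$rc" >&2
    done
    env | sort > "$after"
    emit_env_diff "$before" "$after"
    rm -f "$before" "$after"
fi
```

This only resets the environment to the execution node's defaults. Modules
the user loaded deliberately on the submission node are purged too, so they
would still need to be loaded again inside the batch script.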