Thanks for your reply,

The problem is that users run commands directly on the submission node, e.g.:

module load tensorflow
srun myprogram

So the job gets the submission node's version of tensorflow (and its
PATH/PYTHONPATH), along with any additional default modules loaded
there.

With a bare srun there is no batch script, so there is never a chance to
run "module add ${SLURM_CONSTRAINT}" or to remove the unwanted modules
that were loaded (perhaps automatically) on the submission node and don't
work on the execution node.
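
To make the TaskProlog idea from my original message more concrete, here
is a rough sketch of the part that reports the changed variables. It
relies on Slurm reading the TaskProlog's stdout and applying lines of the
form "export NAME=value" and "unset NAME" to the task's environment. The
function name emit_env_changes and the temporary-file handling are
hypothetical, and it assumes lmod has already been initialised inside the
prolog so that "module" is available:

```shell
#!/bin/bash
# Sketch only: compare two `env` snapshots and print the differences in
# the "export NAME=value" / "unset NAME" form that TaskProlog expects.
# Assumption: values do not contain newlines (multi-line values are not handled).

# emit_env_changes BEFORE_FILE AFTER_FILE
emit_env_changes() {
    local before="$1" after="$2" line name

    # New or changed variables: the exact NAME=value line is absent from BEFORE.
    while IFS= read -r line; do
        if ! grep -qxF -- "$line" "$before"; then
            echo "export $line"
        fi
    done < "$after"

    # Removed variables: the name no longer appears in AFTER.
    while IFS= read -r line; do
        name="${line%%=*}"
        if ! grep -q -- "^${name}=" "$after"; then
            echo "unset $name"
        fi
    done < "$before"
}

# Hypothetical use inside the TaskProlog script itself:
#   before=$(mktemp); after=$(mktemp)
#   env > "$before"
#   module purge
#   module load tensorflow      # or whatever the node's defaults are
#   env > "$after"
#   emit_env_changes "$before" "$after"
#   rm -f "$before" "$after"
```

This is the cumbersome part: every variable that "module purge" and
"module load" touch has to be echoed back to Slurm by hand.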

Thanks,
    Yair.

On Tue, Dec 19 2017, "Loris Bennett" <loris.benn...@fu-berlin.de> wrote:

> Hi Yair,
>
> Yair Yarom <ir...@cs.huji.ac.il> writes:
>
>> Hi list,
>>
>> We use lmod[1] here for software/version management. We have run into
>> two issues (so far):
>>
>> 1. The submission node can have different software than the execution
>>    nodes - different cpu, different gpu (if any), infiniband, etc. When
>>    a user runs 'module load something' on the submission node, the
>>    wrong environment is passed to the task on the execution
>>    node, e.g. "module load tensorflow" can load a different version
>>    depending on the node.
>>
>> 2. There are some modules we want to load by default, and again this can
>>    be different between nodes (we do this by source'ing /etc/lmod/lmodrc
>>    and ~/.lmodrc).
>>
>> For issue 1, we instruct users to run the "module load" in their batch
>> script and not before running sbatch, but issue 2 is more problematic.
>>
>> My current solution is to write a TaskProlog script that runs "module
>> purge" and "module load" and exports/unsets the changed environment
>> variables. I was wondering if anyone has encountered this issue and has a
>> less cumbersome solution.
>>
>> Thanks in advance,
>>     Yair.
>>
>> [1] https://www.tacc.utexas.edu/research-development/tacc-projects/lmod
>
> I don't fully understand your use-case, but, assuming you can divide
> your nodes up by some feature, could you define a module per feature
> which just loads the specific modules needed for that category, e.g. in
> the batch file you would have
>
>    #SBATCH --constraint=shiny_and_new
>
>    module add ${SLURM_CONSTRAINT}
>
> and would have a module file 'shiny_and_new', with contents like, say,
>
>   module add tensorflow/2.0
>   module add cuda/9.0
>
> whereas the module 'rusty_and_old' would contain
>
>   module add tensorflow/0.1
>   module add cuda/0.2
>
> Would that help?
>
> Cheers,
>
> Loris
