Dear SLURM users,

I would like some suggestions on how to stagger the start of multiple
parallel tasks launched with srun.
I have a very basic script which specifies the number of nodes and tasks and
runs just one command: srun myjob. The problem is that 10-20 tasks start
accessing files at the same time, and that causes some tasks to quit.
What I would like to do is somehow tell SLURM to start each task with a delay,
e.g. each task 5 seconds after the previous one. What I have tried so far:
1) Using a random number generator to pick a delay helps, but it is not 100%
safe.
2) If tasks run one per node, I can derive the delay from the node hostname,
but that doesn't help if I run all tasks on one node.
3) The Parallel module has an option to delay the start, but we don't have it
available.
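
For reference, attempt 1 amounts to something like the sketch below. The
wrapper name delayed_start.sh and the MAX_DELAY variable are placeholders I
made up for illustration; neither is anything SLURM provides. Two tasks can
still draw the same delay, which is why this is not fully safe.

```shell
#!/bin/bash
# Hypothetical wrapper (delayed_start.sh), launched as:
#   srun ./delayed_start.sh myjob
# MAX_DELAY is an assumed upper bound in seconds, not a SLURM setting.
MAX_DELAY=${MAX_DELAY:-60}

# Print a pseudo-random delay in [0, MAX_DELAY) using bash's $RANDOM.
random_delay() {
    echo $(( RANDOM % MAX_DELAY ))
}

# If a command was passed, sleep for the random delay and then run it.
if [ "$#" -gt 0 ]; then
    sleep "$(random_delay)"
    exec "$@"
fi
```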
Is there a way to get a task number? I know there is the SLURM_ARRAY_TASK_ID
variable, but none of the job-array-related variables work for me. I guess job
array capabilities aren't enabled on our SLURM.
Any other suggestions?
Thanks in advance!

Best,
Renat.
