Re: [slurm-users] Reproducible irreproducible problem (timeout?)

2023-12-20 Thread Laurence Marks
In terms of dependencies, please think about timing. Currently one loop takes ~70 minutes, and say there is a queue time T for any job. If you split the slow part to run serially, one loop takes ~190 minutes + 2T. The time for N iterations would be ~190N + 570T versus 70N + T. --- Professor Laurence M
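
The timing argument above can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, times are in minutes, and it assumes the split workflow pays the queue wait T twice in every loop while the single large job waits in the queue once. Under that assumption the quoted 570T term corresponds to N = 285 iterations, which is an inference from the numbers, not something stated in the message.

```python
def split_total(n_iter: int, queue_wait: float) -> float:
    """Total minutes if the slow part is split out and run serially.

    Assumes each loop costs ~190 minutes of run time plus two queue
    waits (2T), as described in the message above.
    """
    return n_iter * (190 + 2 * queue_wait)


def single_job_total(n_iter: int, queue_wait: float) -> float:
    """Total minutes for one large job: ~70 minutes per loop, one queue wait."""
    return 70 * n_iter + queue_wait


if __name__ == "__main__":
    n, t = 285, 10.0  # hypothetical values; 2 * 285 = 570 matches the 570T term
    print("split into dependent jobs:", split_total(n, t), "minutes")
    print("single job:", single_job_total(n, t), "minutes")
```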

Re: [slurm-users] Reproducible irreproducible problem (timeout?)

2023-12-20 Thread Laurence Marks
Dependencies are not an appropriate approach. --- Professor Laurence Marks (Laurie) www.numis.northwestern.edu "Research is to see what everybody else has seen, and to think what nobody else has thought" Albert Szent-Györgyi On Wed, Dec 20, 2023, 14:40 Renfro, Michael wrote: > Is this Northweste

Re: [slurm-users] Reproducible irreproducible problem (timeout?)

2023-12-20 Thread Renfro, Michael
Is this Northwestern’s Quest HPC or another one? I know at least a few of the people involved with Quest, and I wouldn’t have thought they’d be in dire need of coaching. And to follow on with Davide’s point, this really sounds like a case for submitting multiple jobs with dependencies between t
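
For readers unfamiliar with the suggestion above, job dependencies in Slurm are expressed with `sbatch --dependency=afterok:<jobid>`, and `--parsable` makes `sbatch` print the job ID so it can be passed to the next submission. The helper below is a hypothetical sketch that only builds the command lines; the script names are invented and nothing is submitted.

```python
from typing import List, Optional


def sbatch_argv(script: str, after_ok: Optional[int] = None) -> List[str]:
    """Build an sbatch command line, optionally chained after another job.

    `script` is a hypothetical batch script name. `after_ok` is the job ID
    that must finish successfully before this job becomes eligible to run.
    """
    argv = ["sbatch", "--parsable"]
    if after_ok is not None:
        argv.append(f"--dependency=afterok:{after_ok}")
    argv.append(script)
    return argv


if __name__ == "__main__":
    # The job ID 1234 is a placeholder for whatever the first submission prints.
    print(" ".join(sbatch_argv("fast_steps.sh")))
    print(" ".join(sbatch_argv("slow_step.sh", after_ok=1234)))
```

Each dependent job re-enters the queue, which is the cost raised elsewhere in this thread.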

Re: [slurm-users] Reproducible irreproducible problem (timeout?)

2023-12-20 Thread Laurence Marks
It is a University "supercomputer", not a national facility. Hence they are not that expert, which is why I am asking here. I am pretty certain that it is some form of communication issue, but beyond that it is not clear. If I get suggestions such as "why don't they look for ABC in XYZ" then I may

Re: [slurm-users] Reproducible irreproducible problem (timeout?)

2023-12-20 Thread Gerhard Strangar
Laurence Marks wrote: > After some (irreproducible) time, often one of the three slow tasks hangs. > A symptom is that if I try and ssh into the main node of the subtask (which > is running 128 mpi on the 4 nodes) I get "Authentication failed". How about asking an admin to check why it hangs?

Re: [slurm-users] Slurm compute node with Intel 12th gen CPU

2023-12-20 Thread Ole Holm Nielsen
On 20-12-2023 15:59, Michael Bernasconi wrote: I'm trying to get slurm working on an Intel 12th gen CPU. slurmd instantly fails with the error message "Thread count (24) not multiple of core count (16)". I have tried adding "SlurmdParameters=config_overrides" to slurm.conf, and I have experimen

Re: [slurm-users] Reproducible irreproducible problem (timeout?)

2023-12-20 Thread Davide DelVento
Not an answer to your question, but if the jobs need to be subdivided, why not submit smaller jobs? Also, this does not sound like a slurm problem, but rather a code or infrastructure issue. Finally, are you typically able to ssh into the main node of each subtask? In many places that is not allo

Re: [slurm-users] Slurm compute node with Intel 12th gen CPU

2023-12-20 Thread Chip Seraphine
Probably not the answer you’re looking for, but in my environment I simply disabled hyperthreading in the BIOS. This avoided situations where we had things like “2 processes running on different threads on the same core while another core is sitting idle”. If you are more often constrained b

[slurm-users] Slurm compute node with Intel 12th gen CPU

2023-12-20 Thread Michael Bernasconi
I'm trying to get slurm working on an Intel 12th gen CPU. slurmd instantly fails with the error message "Thread count (24) not multiple of core count (16)". I have tried adding "SlurmdParameters=config_overrides" to slurm.conf, and I have experimented with various combinations of "Sockets", "Coresp
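
The numbers in the error message are consistent with a hybrid CPU in which only some cores run two hardware threads. The sketch below is an illustration, not Slurm's own check: it assumes performance cores have 2 threads each and efficiency cores have 1, and the function names are invented. The exact CPU model is not given in the message.

```python
from typing import Optional, Tuple


def uniform_threads_per_core(threads: int, cores: int) -> bool:
    """True if every core could carry the same number of threads."""
    return threads % cores == 0


def hybrid_split(threads: int, cores: int, smt: int = 2) -> Optional[Tuple[int, int]]:
    """Solve p * smt + e = threads and p + e = cores for (p, e).

    p = cores with `smt` threads each, e = single-threaded cores.
    Returns None if no non-negative integer solution exists.
    """
    if smt < 2 or (threads - cores) % (smt - 1) != 0:
        return None
    p = (threads - cores) // (smt - 1)
    e = cores - p
    if p < 0 or e < 0:
        return None
    return p, e


if __name__ == "__main__":
    print(uniform_threads_per_core(24, 16))  # 24 is not a multiple of 16
    print(hybrid_split(24, 16))              # split into 2-thread and 1-thread cores
```

With 24 threads on 16 cores this gives 8 two-thread cores and 8 single-thread cores, which is why no single threads-per-core value describes the node.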

Re: [slurm-users] Adding an association to a different account

2023-12-20 Thread Chip Seraphine
Thank you, I struggled with that. Very unintuitive to use “create user” on an existing user! I think I was actually looking at the answer a few times but assumed they were doing something else, given the syntax. From: slurm-users on behalf of Michael Gutteridge Reply-To: Slurm User Communi

[slurm-users] Reproducible irreproducible problem (timeout?)

2023-12-20 Thread Laurence Marks
I know that sounds improbable, but please read on. I am running a reasonably large job on a University supercomputer (not a national facility) using 12 nodes, each with 64 cores. The job loops through a sequence of commands, some of which are single-CPU, but with a slow step where 3 tasks each with 4 no