On Wed, 29 Mar 2023 15:51:51 +0200
Ole Holm Nielsen wrote:
> As for job scheduling, slurmctld may allocate a job to some powered-off
> nodes and then call the ResumeProgram defined in slurm.conf. From this
> point it may indeed take 2-3 minutes before a node is up and running
> slurmd, dur
Hi Thomas,
I think the Slurm power_save is not problematic for us with bare-metal
on-premise nodes, in contrast to the problems you're having.
We use power_save with on-premise nodes where we control the power down/up
by means of IPMI commands as provided in the scripts which you will find
i
On Wed, 29 Mar 2023 14:42:33 +0200
Ben Polman wrote:
> I'd be interested in your kludge; we face a similar situation where the
> slurmctld node does not have access to the IPMI network and cannot ssh
> to machines that have access.
> We are thinking of creating a REST interface to a contro
I'd be interested in your kludge; we face a similar situation where the
slurmctld node does not have access to the IPMI network and cannot ssh
to machines that have access.
We are thinking of creating a REST interface to a control server which
would run the IPMI commands.
Ben
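
A minimal sketch of what the slurmctld side of such a REST setup might look like, assuming a ResumeProgram written in Python. The control server URL, the JSON fields `node` and `action`, and the function names are all hypothetical and do not come from this thread; the only Slurm behaviour relied on is that slurmctld passes the nodes as a single hostlist argument and that `scontrol show hostnames` expands it.

```python
import json
import subprocess
import urllib.request

# Hypothetical endpoint of the control server that can reach the IPMI network.
CONTROL_URL = "http://ipmi-control.example.org:8080/power"


def expand_hostlist(hostlist):
    """Expand a Slurm hostlist expression such as 'node[01-03]' into node names."""
    result = subprocess.run(
        ["scontrol", "show", "hostnames", hostlist],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def build_power_request(node, action, url=CONTROL_URL):
    """Build (but do not send) a POST asking the control server to power a node on or off."""
    if action not in ("on", "off"):
        raise ValueError(f"unsupported action: {action!r}")
    # The payload layout is an assumption made for this sketch.
    body = json.dumps({"node": node, "action": action}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )


def main(argv):
    """Entry point for use as ResumeProgram: argv[1] is the hostlist from slurmctld."""
    for node in expand_hostlist(argv[1]):
        request = build_power_request(node, "on")
        with urllib.request.urlopen(request, timeout=30) as response:
            response.read()
```

To use this as a script one would call `main(sys.argv)` at the bottom and point ResumeProgram at it; a SuspendProgram counterpart would send `"off"` instead. Authentication and retries are left out on purpose and would be needed in practice.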
On 29-
On Mon, 27 Mar 2023 13:17:01 +0200
Ole Holm Nielsen wrote:
> FYI: Slurm power_save works very well for us without the issues that you
> describe below. We run Slurm 22.05.8, what's your version?
I'm sure that there are setups where it works nicely;-) For us, it
didn't, and I was faced with h
Hi Thomas,
FYI: Slurm power_save works very well for us without the issues that you
describe below. We run Slurm 22.05.8, what's your version?
I've documented our setup in this Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving
T
On Mon, 06 Mar 2023 13:35:38 +0100
Stefan Staeglich wrote:
> This did not fix the main error, but it might have reduced how often it
> occurs. Has anyone observed similar issues? We will try a higher
> SuspendTimeout.
We had issues with power saving. We powered the idle nodes off, caus
Hi,
For half a year we have been using the suspend/resume support in Slurm. This
works quite well, but sometimes it breaks and no nodes are suspended or resumed
anymore.
In this case we see the following message in the log:
error: power_save module disabled, NULL SuspendProgram
A restart of slurmctld