Re: [slurm-users] After reboot nodes are in state = down

2019-09-26 Thread Henkel, Andreas
Hi Rafal, how do you restart the nodes? If you don't use scontrol reboot, Slurm doesn't expect the nodes to reboot, which is why you see that reason in those cases. Best, Andreas. On 27.09.2019 at 07:53, Rafał Kędziorski <rafal.kedzior...@gmail.com> wrote: Hi, I'm working with slurm-wlm 18.08.
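A minimal sketch of the two approaches, assuming a hypothetical node name pi-4-node1 (the real node names are truncated in the quoted message):

    # Reboot through Slurm so slurmctld expects the node to go down and come back:
    scontrol reboot pi-4-node1

    # If a node was already rebooted outside of Slurm and is now marked down,
    # return it to service manually:
    scontrol update NodeName=pi-4-node1 State=RESUME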

[slurm-users] After reboot nodes are in state = down

2019-09-26 Thread Rafał Kędziorski
Hi, I'm working with slurm-wlm 18.08.5-2 on a Raspberry Pi cluster: 1 Pi 4 as manager, 4 Pi 4 nodes. This works fine, but after every restart of the nodes I get this: cluster@pi-manager:~ $ sinfo PARTITION AVAIL TIMELIMIT NODES STATE NODELIST devcluster* up infinite 4 down pi-4-n
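One slurm.conf setting that is relevant here, though it is not mentioned in the thread, is ReturnToService; the sketch below assumes you want down nodes to rejoin automatically once slurmd registers again after a reboot:

    # slurm.conf (same file on the controller and all nodes)
    # 2 = a DOWN node becomes available again as soon as slurmd registers with a valid config
    ReturnToService=2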

Re: [slurm-users] Running multiple jobs simultaneously

2019-09-26 Thread Matt Jay
Matt, Depending on other parameters for the job, your '--ntasks=30' is likely having the effect of requesting 30 (or more) cores for that individual job, which likely is not "fitting" on an individual node (oversubscribe allows multiple jobs to share a resource, but doesn't impact resource requ
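If the intent is many independent single-core jobs rather than one 30-task job, a job array keeps each element small enough to schedule alongside others; a sketch (the program name and array size are illustrative):

    #!/bin/bash
    #SBATCH --job-name=many-small-jobs
    #SBATCH --ntasks=1          # one task per array element
    #SBATCH --cpus-per-task=1   # one core per task
    #SBATCH --array=1-30        # 30 independent elements
    srun ./my_program "$SLURM_ARRAY_TASK_ID"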

Re: [slurm-users] Running multiple jobs simultaneously

2019-09-26 Thread Matt Hohmeister
I just did that...beautiful...thanks! The "default" let me run 48 jobs concurrently across two nodes. I've noticed that, still, when I have "#SBATCH --ntasks=30" in my .sbatch file, the job still refuses to run, and I'm back at the below. Should I just ask my users to not use --ntasks in their .

Re: [slurm-users] Running multiple jobs simultaneously

2019-09-26 Thread Matt Jay
Hi Matt, Check out the "OverSubscribe" partition parameter. Try setting your partition to "OverSubscribe=YES" and then submitting the jobs with the "--oversubscribe" option (or OverSubscribe=FORCE if you want this to happen for all jobs submitted to the partition). Either oversubscribe option
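A sketch of the two pieces, assuming a hypothetical partition and node naming (the real values are not shown in the quoted message):

    # slurm.conf: allow jobs to share the nodes in this partition
    PartitionName=compute Nodes=node[1-2] OverSubscribe=YES Default=YES State=UP

    # Submission side: the job must also opt in, unless OverSubscribe=FORCE is set
    sbatch --oversubscribe job.sbatch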

[slurm-users] Running multiple jobs simultaneously

2019-09-26 Thread Matt Hohmeister
I have a two-node cluster running Slurm, and I'm being asked about allowing multiple jobs (hundreds of jobs) to run simultaneously. Following is my scheduling part of slurm.conf, which I changed to allow multiple jobs to run on each node: # SCHEDULING #DefMemPerCPU=0 FastSchedule=1 #MaxMemPerCP
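For comparison, letting multiple jobs share a node usually also requires a consumable-resource select plugin; this is a sketch, not the poster's actual file:

    # SCHEDULING (sketch)
    SchedulerType=sched/backfill
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory   # schedule by cores (and memory), not whole nodes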

Re: [slurm-users] How to modify the normal QOS

2019-09-26 Thread David Baker
Dear Jurgen, Thank you for that. That does the expected job. It looks like the weirdness that I saw in the serial partition has now gone away and so that is good. Best regards, David From: slurm-users on behalf of Juergen Salk Sent: 26 September 2019 16:18 To:

Re: [slurm-users] Monitoring with Telegraf

2019-09-26 Thread Tina Friedrich
I second that question - I'm using the same combination :) I know there are some efforts - see https://slurm.schedmd.com/SLUG16/monitoring_influxdb_slug.pdf - but I don't know exactly what the state of that is at the moment. (I resorted to telegraf's 'execute script' plugin to pump some informat
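A sketch of that approach using Telegraf's exec input and a small wrapper script (the script path and measurement name are made up for illustration):

    # telegraf.conf
    [[inputs.exec]]
      commands = ["/usr/local/bin/slurm_metrics.sh"]
      timeout = "10s"
      data_format = "influx"

    # /usr/local/bin/slurm_metrics.sh
    #!/bin/bash
    # Emit job counts in InfluxDB line protocol
    pending=$(squeue -h -t PENDING | wc -l)
    running=$(squeue -h -t RUNNING | wc -l)
    echo "slurm_jobs pending=${pending}i,running=${running}i"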

Re: [slurm-users] How to modify the normal QOS

2019-09-26 Thread Juergen Salk
* David Baker [190926 14:12]: > Currently my normal QOS specifies MaxTRESPU=cpu=1280,nodes=32. I've tried a number of edits, however I haven't yet found a way of redefining the MaxTRESPU to be "cpu=1280". In the past I have resorted to deleting a QOS completely and redefining the whole
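The usual way to do this without recreating the QOS is to modify the limit in place and clear the unwanted TRES by setting it to -1; a sketch:

    # Change the per-user CPU limit and drop the per-user node limit
    sacctmgr modify qos normal set MaxTRESPerUser=cpu=1280,nodes=-1

    # Verify the result
    sacctmgr show qos normal format=Name,MaxTRESPU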

[slurm-users] Monitoring with Telegraf

2019-09-26 Thread Marcus Boden
Hey everyone, I am using Telegraf and InfluxDB to monitor our hardware and I'd like to include some slurm metrics into this. Is there already a telegraf plugin for monitoring slurm I don't know about, or do I have to start from scratch? Best, Marcus -- Marcus Vincent Boden, M.Sc. Arbeitsgruppe e

[slurm-users] How to modify the normal QOS

2019-09-26 Thread David Baker
Hello, Currently my normal QOS specifies MaxTRESPU=cpu=1280,nodes=32. I've tried a number of edits, however I haven't yet found a way of redefining the MaxTRESPU to be "cpu=1280". In the past I have resorted to deleting a QOS completely and redefining the whole thing, but in this case I'm not s