QOS Group TRES limits apply to associations.
If I recall correctly, an association is a (user, account, partition, cluster) tuple.
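For reference, a group TRES limit on a QOS is set with sacctmgr; a minimal sketch (the QOS name and value are made up):

    sacctmgr modify qos normal set GrpTRES=cpu=256
    sacctmgr show qos format=Name,GrpTRES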
On Fri, Oct 21, 2022 at 9:46 AM Matthew R. Baney wrote:
>
> Hello,
>
> I have noticed that jobs submitted to non-preemptable partitions (PreemptType
> = preempt/partition_prio
You can set MinTRESPerJob in a QOS and then only allow that QOS in
that partition.
Or have a set of QOSes for that partition that have that limit set...
I'm not sure if a partition QOS would help here, but it could,
basically forcing that QOS on all jobs in the partition.
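Roughly what I mean, with made-up QOS, node, and partition names:

    # a QOS that requires every job to ask for at least 16 CPUs
    sacctmgr add qos bigjobs
    sacctmgr modify qos bigjobs set MinTRESPerJob=cpu=16

    # in slurm.conf: only allow that QOS there, and/or attach it as the partition QOS
    PartitionName=bignodes Nodes=node[01-10] AllowQos=bigjobs QOS=bigjobs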
I've found that debugging lua job
On Wed, Sep 23, 2020 at 12:37 PM Renfro, Michael wrote:
> Not having a separate test environment, I put logic into my job_submit.lua to
> use either the production settings or the ones under development or testing,
> based off the UID of the user submitting the job:
I've also done it that way,
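A minimal sketch of that kind of switch in job_submit.lua (the UID and the field I touch are made up):

    function slurm_job_submit(job_desc, part_list, submit_uid)
       if submit_uid == 12345 then
          -- test user: settings under development go here
          job_desc.comment = "test submit logic"
       else
          -- everyone else gets the production settings
       end
       return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
       return slurm.SUCCESS
    end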
I think there are at least two possible ways to do what you want.
You can make a reservation on the node and mark it as a maintenance
reservation. I don't know if slurm will shut down the node if it is
idle while it has a maintenance reservation, but it certainly won't if
you also run a job as r
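Something like this, if you go the reservation route (node name and duration are made up):

    scontrol create reservation reservationname=fix_c01 nodes=c01 \
        starttime=now duration=7-00:00:00 users=root flags=maint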
also state=resume should work
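i.e. something like (node name assumed):

    scontrol update nodename=c01 state=resume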
On Fri, Aug 7, 2020 at 12:25 PM Hanby, Mike wrote:
>
> This is what's in /var/log/slurmctld
> Invalid node state transition requested for node c01 from=DRAINING
> to=CANCEL_REBOOT
>
>
>
> So it looks like, for version 18.08 at least, you have to first undrain, then
Both
See man sbatch, --requeue
The default is to not requeue (unless that was changed in slurm.conf).
Your job can check $SLURM_RESTART_COUNT to see if it has been restarted.
This is handy if your job can checkpoint / restart.
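A rough sketch of a batch script that uses it (the application and checkpoint file are made up):

    #!/bin/bash
    #SBATCH --requeue
    # on a requeued run, SLURM_RESTART_COUNT is > 0, so resume from the checkpoint
    if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
        ./myapp --restore checkpoint.dat
    else
        ./myapp
    fi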
On Fri, Jul 24, 2020 at 3:33 PM Saikat Roy wrote:
> Hello,
>
> I
I have collectd running on my gpu nodes with the collectd_nvidianvml
plugin from pip.
I have a collectd frontend that displays that data along with slurm
data for the whole cluster for users to see.
Some of my users watch that carefully and tune their jobs to maximize
utilization.
When I spot jobs
Condor's original premise was to have long running compute jobs on
distributed nodes with no shared filesystem.
Of course, they played all kinds of dirty tricks to make this work
including intercepted libc and system calls.
I see no reason cleverly wrapped slurm jobs couldn't do the same,
either p
I got configless DNS SRV records working by putting the record in the
default domain in the search path for the cluster.
In other words, the cluster has its own domain, and all the nodes are
in it along with the SRV records. This is the first domain in the
DNS search path.
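The record itself looks roughly like this (the host and domain are placeholders):

    ; in the zone for the cluster's own domain, which is first in the search path
    _slurmctld._tcp 3600 IN SRV 10 0 6817 ctldhost.cluster.example.com.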
On Tue, Jun 9, 2020 a
I've had slurm power off a few nodes I was working on...
My normal solution is to just power them back on without slurm's help.
Then it brings the node up in state "down / unexpectedly booted" and
it doesn't seem to mess with them until I use scontrol to change the
state again. (I like scontrol re
Hmm, works for me. Maybe they added that in more recent versions of slurm.
I'm using version 18+
On Wed, May 13, 2020 at 5:12 PM Alastair Neil wrote:
>
> invalid field requested: "reason"
>
> On Tue, 12 May 2020 at 16:47, Steven Dick wrote:
>>
>> Wh
iled, Run time 04:32:51, FAILED
>> [2020-05-10T00:26:05.215] _job_complete: JobId=533900 done
>
>
> it is curious, that all the jobs were running on the same processor, perhaps
> this is a cgroup related failure?
>
> On Tue, 12 May 2020 at 10:10, Steven Dick wrote:
>>
I see one job cancelled and two jobs failed.
Your slurmd log is incomplete -- it doesn't show the two failed jobs
exiting/failing, so the real error is not here.
It might also be helpful to look through slurmctld's log starting from
when the first job was canceled, looking at any messages mentioni
Previous versions of mysql are supposed to have nasty security issues.
I'm not sure why I had mysql instead of mariadb anyway.
On Mon, May 11, 2020 at 9:29 AM Relu Patrascu wrote:
>
> We've experienced the same problem on several versions of slurmdbd
> (18, 19) so we downgraded mysql and put a ho
Latest releases of slurm (17-20) don't work with mysql 5.7.30;
the latest version of mariadb works fine.
On Tue, May 5, 2020 at 3:41 PM Dustin Lang wrote:
>
> I tried upgrading Slurm to 18.08.9 and I am still getting this Segmentation
> Fault!
>
>
>
> On Tue, May 5, 2020 at 2:39 PM Dustin Lang wro
Have you looked at sreport?
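Something along these lines (account names and dates are made up); divide the reported core-hours by the hours in the window to get an average core count:

    sreport cluster AccountUtilizationByUser accounts=group1 start=2020-03-01 end=2020-04-01 -t hours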
On Fri, Apr 3, 2020 at 1:09 AM Sudeep Narayan Banerjee
wrote:
>
> How to get the Average number of CPU cores used by jobs per day by a
> particular group?
>
> By group means: say faculty group1, group2 etc. all those groups are having a
> certain number of students
>
When I changed this on a running system, no jobs were killed, but
slurm lost track of jobs on nodes and was unable to kill them or tell
when they were finished until slurmd on each node was restarted. I
let running jobs complete and monitored them manually, and restarted
slurmd on each node as the
lmod can mark modules as deprecated, so users are warned. I think you
might also be able to get it to collect statistics on module usage or
something.
lmod also has the advantage of being much more complicated and much less
efficient if set up incorrectly.
On Sun, Nov 24, 2019 at 9:20 PM Brian A
I don't think it shows up until the job completes.
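e.g., roughly (job id made up):

    sstat -j 12345 --format=JobID,MaxRSS        # while the job is running
    sacct -j 12345 --format=JobID,MaxRSS,State  # once it has completed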
On Sat, Sep 14, 2019 at 2:25 AM Brian Andrus wrote:
>
> Quick question?
> When I use sacct to show job stats, it always has a blank entry for the
> MaxRSS field. Is there something that needs enabled to get that in?
> I do see it if I use sstat w
> Thanks for your help.
>
> Looks like QOS is the way to go if I want both job arrays + user limits on
> jobs/resources (in the context of a regression-test).
>
> Regards,
> Guillaume.
>
> On Fri, Aug 30, 2019 at 6:11 PM Steven Dick wrote:
>>
>> On Fri, Aug 30, 201
On Fri, Aug 30, 2019 at 2:58 PM Guillaume Perrault Archambault
wrote:
> My problem with that though, is what if each script (the 9 scripts in my
> earlier example) each require different requirements? For example, run on a
> different partition, or set a different time limit? My understanding is
ask.
>
>
> I'd be happy if you corrected my misunderstandings, especially if you could
> show me how to set a job limit that takes effect over multiple job arrays.
>
> I may have very glaring oversights as I don't necessarily have a big picture
> view of things (I
This makes no sense and seems backwards to me.
When you submit an array job, you can specify how many jobs from the
array you want to run at once.
So, an administrator can create a QOS that explicitly limits the user.
However, you keep saying that they probably won't modify the system
for just you
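Roughly, with made-up names and numbers:

    # user side: run at most 10 tasks of the array at a time
    sbatch --array=1-100%10 test.sh

    # admin side: a QOS that caps running jobs per user
    sacctmgr modify qos regress set MaxJobsPerUser=10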
What operating system are you running?
Modern versions of systemd automatically put login sessions into their
own cgroups, which are themselves under a "user" cgroup.
When slurm is running parallel to this, it makes its own slurm cgroup.
It should be possible to have something at boot modify the systemd
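One possibility would be setting properties on the user slice at boot; a sketch (values are made up, and the memory cap assumes cgroup v2):

    systemctl set-property user.slice CPUQuota=400% MemoryMax=16G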
I'm looking for a tool that will tell me why a specific job in the
queue is still waiting to run. squeue doesn't give enough detail. If
the job is held up on QOS, it's pretty obvious. But if it's
resources, it's difficult to tell.
If a job is not running because of resources, how can I identify
It is documented that you need to create the cluster in the database.
It is not documented that the accounting system won't work until you
restart slurmdbd multiple times before it starts collecting accounting
records.
Also, none of the necessary restarts are needed on an upgrade -- only
when slu
I've found that when creating a new cluster, slurmdbd does not
function correctly right away. It may be necessary to restart
slurmdbd at several points during the slurm installation process to
get everything working correctly.
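Roughly the sequence I mean (the cluster name is made up):

    sacctmgr add cluster mycluster
    systemctl restart slurmdbd
    # and later, once slurmctld is configured to talk to it:
    systemctl restart slurmctld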
Also, slurmctld will buffer the accounting data until slurmdbd starts