Dear Mahmood,
could you please show the output of
scontrol show -d job 119
Best
Marcus
On 12/16/19 5:41 PM, Mahmood Naderan wrote:
Excuse me, I still have a problem. Although I freed memory on the nodes
as below
RealMemory=64259 AllocMem=1024 FreeMem=61882 Sockets=32 Boards=1
RealMemory
Hi Dean,
first make sure the munge.key is really the same on all systems. The
users must also be the same on all systems, since the submission itself
is done on the controller. Please also make sure the systems have the
same date and time.
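For example, a quick way to check all three points (node01 and someuser
are only placeholders):

# the key must be byte-identical everywhere
md5sum /etc/munge/munge.key
ssh node01 md5sum /etc/munge/munge.key

# the submitting user must exist with the same uid on both sides
id someuser; ssh node01 id someuser

# a credential created on the controller must decode on the node
munge -n | ssh node01 unmunge

# clocks should agree
date; ssh node01 date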
After that, restart the munge service and then the sl
Hi All,
I am in the process of switching to Slurm from a Torque/Moab setup. Could you
please advise me on the following questions?
1) We use Moab's MAXPE and MAXPS limits per accounting group (which is not the
same as the Unix group for us; users' group membership does not match their
allocations mem
There are numerous ways to get this functionality.
The simplest is probably to just have a separate partition that will
only allow job times of 1 hour or less.
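For that first option, a minimal slurm.conf sketch could look like this
(partition and node names are only illustrative):

# short partition: jobs of at most 1 hour
PartitionName=short Nodes=node[01-16] MaxTime=01:00:00 State=UP
# regular partition for everything else
PartitionName=batch Nodes=node[01-16] MaxTime=7-00:00:00 Default=YES State=UP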
There are also options that would involve preemption of the longer jobs
so the quicker ones could run, priorities, etc.
It all depe
Resolved now. On older versions of Slurm, I could have queues without default
times specified (just an upper limit, in my case). As of Slurm 18 or 19, I had
to add a default time to all my queues to avoid the AssocGrpCPURunMinutesLimit
flag.
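For reference, the fix amounts to something along these lines in
slurm.conf (partition name, nodes and times are only illustrative):

# DefaultTime applies to jobs that do not request --time explicitly
PartitionName=batch Nodes=node[01-16] DefaultTime=04:00:00 MaxTime=30-00:00:00 State=UP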
> On Dec 16, 2019, at 2:00 PM, Renfro, Michael wrote
Hello
I am looking into switching from Univa (SGE) to Slurm and am figuring out
how to implement some of our usage policies in Slurm.
We have a Univa queue which uses job classes and RQSes to limit jobs with a run
time over 4 hours to only half the available slots (CPU cores) so some slots
ar
I have my controller running (slurmctld and slurmdbd), and my controller and
node host can ping each other by name, so they resolve via /etc/hosts
settings. When I try to start the slurmd.service, it shows that it is
active (running), but gives these errors:
Unable to register: Zero Bytes were trans
When I try to start a node, it fails with this message:
fatal: Unable to find slurmstepd file at
/storage/slurm-build/sbin/slurmstepd
The location /storage/slurm-build/sbin/slurmstepd is where the binaries
were built by make (I used ./configure --prefix=/storage/slurm-build).
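As a quick sanity check on the compute node itself (a sketch; it assumes
the same prefix is meant to exist there):

# slurmd looks for slurmstepd under the configured prefix on this node
ls -l /storage/slurm-build/sbin/slurmstepd

# if it is missing, install the build into that prefix from the build
# directory (or make the installed tree available to the node some other way)
make install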
After I created the
Thanks, Ole. I forgot I had that tool already. I’m not seeing where the limits
are getting enforced, but now I’ve narrowed it down to some of my partitions or
my job routing Lua plugin:
=
[renfro@login ~]$ hpcshell --reservation=slurm-upgrade --partition=interactive
srun: job 232423 queued and
Hi Mike,
My showuserlimits tool nicely prints user limits from the Slurm database:
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserlimits
Maybe this can give you further insights into the source of problems.
/Ole
On 16-12-2019 17:27, Renfro, Michael wrote:
Hey, folks. I’ve j
Excuse me, I still have a problem. Although I freed memory on the nodes as
below
RealMemory=64259 AllocMem=1024 FreeMem=61882 Sockets=32 Boards=1
RealMemory=120705 AllocMem=1024 FreeMem=115257 Sockets=32 Boards=1
RealMemory=64259 AllocMem=26624 FreeMem=61795 Sockets=32 Boards=1
RealMemor
OK. It takes some time for scontrol to update the values.
I can now see more free memory as below
RealMemory=120705 AllocMem=1024 FreeMem=115290 Sockets=32 Boards=1
Thank you William.
Regards,
Mahmood
On Mon, Dec 16, 2019 at 7:55 PM Mahmood Naderan
wrote:
> >Memory may be in use by
>Memory may be in use by running jobs, by tasks running outside of Slurm's
>control, or possibly by NFS buffer cache or similar. You may need to
>start an ssh session on the node and look.
I checked that. For example, on compute-0-1, I see
RealMemory=120705 AllocMem=1024 FreeMem=8442 Sock
Hey, folks. I’ve just upgraded from Slurm 17.02 (way behind schedule, I know)
to 19.05. The only thing I’ve noticed going wrong is that my user resource
limits aren’t being applied correctly.
My typical user has a GrpTRESRunMin limit of cpu=144 (1000 CPU-days), and
after the upgrade, it app
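One way to double-check what limit actually landed in the database (the
user name is only a placeholder):

sacctmgr show assoc where user=someuser format=User,Account,Partition,GrpTRESRunMins%40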
Memory may be in use by running jobs, by tasks running outside of Slurm's
control, or possibly by NFS buffer cache or similar. You may need to
start an ssh session on the node and look.
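For instance, something along these lines on one of the nodes (just a
sketch; the node name is an example from this thread):

# overall memory picture, including buffers/cache
ssh compute-0-1 free -m
# processes holding the most resident memory
ssh compute-0-1 'ps aux --sort=-rss | head -n 15'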
William
On Mon, 16 Dec 2019 at 15:38, Mahmood Naderan wrote:
> Hi,
> With the following output
>
>Rea
Hi,
With the following output
RealMemory=64259 AllocMem=1024 FreeMem=38620 Sockets=32 Boards=1
RealMemory=120705 AllocMem=1024 FreeMem=309 Sockets=32 Boards=1
RealMemory=64259 AllocMem=1024 FreeMem=59334 Sockets=32 Boards=1
RealMemory=64259 AllocMem=1024 FreeMem=282 Sockets=10 Boards=1
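For reference, lines like these typically come from something along the
lines of the following (the exact command is an assumption):

scontrol show node | grep RealMemory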
Hi,
I built a small testing cluster before putting it into production. I was
testing the sreport capabilities, and it is showing some inconsistencies
(or maybe/probably I misunderstood something).
This Friday our virtual machines were offline, so I would expect
sreport to give 0 for all user
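For example, a per-day check might look like this (dates and report type
are only illustrative):

sreport cluster AccountUtilizationByUser Start=2019-12-13 End=2019-12-14 -t Hours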
Okay... obviously an auto-complete error that I failed to check. Please
ignore and accept my apologies.
> On Dec 16, 2019, at 7:03 AM, Wiegand, Paul wrote:
>
> unlock stokes-arcc
> get stokes-arcc
>
unlock stokes-arcc
get stokes-arcc
Hi Marcus and Bjørn-Helge,
Thank you for your answers.
We don’t use Slurm billing. We use system acct billing.
I also confirm that with --exclusive, there is a difference between ReqCPUS and
AllocCPUS, but --mem-per-cpu was more a --mem-per-task than a --mem-per-cpu:
it was associated with ReqCPU
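For anyone comparing the two values, sacct shows them side by side (the
job ID is a placeholder):

sacct -j 12345 --format=JobID,ReqCPUS,AllocCPUS,ReqMem,Elapsed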