Dear all,
I must say, I'm a bit puzzled, since this configuration should not be
valid. This is something I have observed myself. According to the
slurm.conf manpage, CPUs and Boards are mutually exclusive:
*Boards* Number of Baseboards in nodes with a baseboard controller.
Note that when
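To illustrate the manpage's point, here is a sketch of the two forms as I
read them; the node name and counts below are made up:

    # either let Slurm derive the CPU count from the topology...
    NodeName=node01 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2
    # ...or state CPUs directly, but (per the manpage) not both at once:
    NodeName=node01 CPUs=40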
Hi Matt,
Matt Hohmeister writes:
> I’m a sysadmin, brand new to Slurm, and just got it running across two nodes:
> slurmctld and slurmd on one node; slurmd on the other.
>
> In the next few days, I’m going to meet with a researcher who has a sizable
> MATLAB job I can submit to give this a s
Hi,
Sometimes when jobs are cancelled I see a spike in system load and hung
task errors. It appears to be related to NFS and cgroups.
The slurmstepd process gets hung cleaning up cgroups:
INFO: task slurmstepd:11222 blocked for more than 120 seconds.
Not tainted 4.4.0-119-generic #143-Ubun
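In case anyone wants to check for the same symptom, these are the standard
Linux commands (nothing Slurm-specific) I use to spot the stuck step daemons:

    # kernel hung-task reports, with human-readable timestamps
    dmesg -T | grep -B1 -A2 'blocked for more than'
    # slurmstepd processes stuck in uninterruptible sleep (state D)
    ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/ && /slurmstepd/'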
Hi Caleb
I noticed the same thing. If you configure a host with more memory than
it really has, Slurm will think that the host has something wrong with it
and put it in drain status. At least, that is my theory. The vendor can
likely give you a better, more detailed answer.
-jfk
On Wed, May 2
Hi all,
Out of curiosity, what causes that? It'd be good to know for the future --
I ran into the same issue and just edited the memory down, and it works fine
now, but I'd like to know why/what causes that error. I'm assuming low
resources, i.e. memory or CPU or whatever. Mind clarifying?
On Wed, M
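In the meantime, here is what I checked on my side; the node name is just
an example:

    # the configured memory, node state, and drain reason as Slurm sees them
    scontrol show node odin | grep -E 'RealMemory|State|Reason'
    # what the hardware actually reports
    slurmd -C
    free -m

My guess is that if RealMemory in slurm.conf is larger than what slurmd
detects, the node gets drained, but I'd also like a definitive answer.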
Hi matt
scontrol update nodename=odin state=resume
scontrol update nodename=odin state=idle
-jfk
On Wed, May 2, 2018 at 5:28 PM, Matt Hohmeister
wrote:
> I have a two-node cluster: the server/compute node is a Dell PowerEdge
> R730; the compute node, a Dell PowerEdge R630. On both of these n
I have a two-node cluster: the server/compute node is a Dell PowerEdge R730;
the compute node, a Dell PowerEdge R630. On both of these nodes, slurmd -C
gives me the exact same line:
[me@odin slurm]$ slurmd -C
NodeName=odin CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10
ThreadsPerCore=2 Re
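As I understand it, that slurmd -C line is meant to be pasted into
slurm.conf as the node definition, something like the following (the
RealMemory value is a placeholder, since my output above is cut off):

    NodeName=odin CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128000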
I'm a sysadmin, brand new to Slurm, and just got it running across two nodes:
slurmctld and slurmd on one node; slurmd on the other.
In the next few days, I'm going to meet with a researcher who has a sizable
MATLAB job I can submit to give this a spin.
Here's
Hi All,
I've encountered what I think is a bug with srun's exit status when a
timeout occurs, but perhaps my expectation is off. My expectation is for
srun to have a non-zero exit status when a timeout occurs before all tasks
can complete.
This behaves as expected when all tasks are timed out:
>
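For reference, a minimal sketch of that case (the time limit, task count,
and command are arbitrary):

    # both tasks exceed the one-minute limit; I expect a non-zero exit status
    srun -n2 -t 00:01:00 sleep 120
    echo "exit status: $?"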
Hello all,
I'm just wondering if anyone is able to reproduce the behavior I'm
seeing with `sacct`, or if anyone has experienced it previously.
In a nutshell, I usually can query jobs from specified nodes, similar
to the following:
`sacct -o $OPTIO
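In full, that is something along these lines (the node name, start date,
and option list here are illustrative):

    sacct -N node01 -S 2018-05-01 -o JobID,JobName,State,NodeList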
So there is a patch?
-- Original message--
From: Fulcomer, Samuel
Date: Wed, May 2, 2018 11:14
To: Slurm User Community List;
Cc:
Subject: Re: [slurm-users] GPU / cgroup challenges
This came up around 12/17, I think, and as I recall the fixes were added to the
src repo then; however, they
This came up around 12/17, I think, and as I recall the fixes were added to
the src repo then; however, they weren't added to any of the 17.x releases.
On Wed, May 2, 2018 at 6:04 AM, R. Paul Wiegand wrote:
> I dug into the logs on both the slurmctld side and the slurmd side.
> For the record, I h
I dug into the logs on both the slurmctld side and the slurmd side.
For the record, I have debug2 set for both and
DebugFlags=CPU_BIND,Gres.
I cannot see much that is terribly relevant in the logs. There's a
known parameter error reported with the memory cgroup specifications,
but I don't think t
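For reference, the relevant slurm.conf lines on my side look like this
(the values are mine; adjust as needed):

    SlurmctldDebug=debug2
    SlurmdDebug=debug2
    DebugFlags=CPU_BIND,Gres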
On Wednesday, 2 May 2018 8:50:12 PM AEST John Hearns wrote:
> One learning point: grep -i is a good default option. This ignores the
> case of the search, so you would have found WCKey a bit faster.
Also if you need to search recursively below a point then:
git grep --no-index -i ${PATTERN}
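For example, to hunt for WCKey anywhere under the current directory,
inside a git checkout or not:

    git grep --no-index -i wckey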
Mahmood, good to hear you have a solution.
One learning point: grep -i is a good default option. This ignores the
case of the search, so you would have found WCKey a bit faster.
On 2 May 2018 at 04:26, Mahmood Naderan wrote:
> Thanks Trevor for pointing out that there is an option for suc
Hi,
I have Slurm 17.02.10 installed in a test environment. When I use sacct
-o "JobID,JobName,AllocCPUs,ReqMem,Elapsed" and AccountingStorageType =
accounting_storage/filetxt, the fields AllocCPUS and ReqMem are empty.
JobID    JobName    AllocCPUS    ReqMem    Elapsed
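My suspicion is that the filetxt plugin simply does not record those
fields, and that moving to slurmdbd would fill them in; if so, the
slurm.conf side would look roughly like this (the host name is a
placeholder):

    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=dbhost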