Re: [slurm-users] "Low socket*core*thre" - solution?

2018-05-02 Thread Marcus Wagner
Dear all, I must say I'm a bit dazzled, since this configuration should not be valid. This is something I myself observed. According to the manpage of slurm.conf, CPUs and Boards are mutually exclusive: *Boards* Number of Baseboards in nodes with a baseboard controller.  Note that when

Re: [slurm-users] Permissions to administer slurm

2018-05-02 Thread Loris Bennett
Hi Matt, Matt Hohmeister writes: > I’m a sysadmin, brand new to Slurm, and just got it running across two nodes: > slurmctld and slurmd on one node; slurmd on the other. > > In the next few days, I’m going to meet with a researcher who has a sizable > MATLAB job I can submit to give this a s

[slurm-users] Hung tasks and high load when cancelling jobs

2018-05-02 Thread Brendan Moloney
Hi, Sometimes when jobs are cancelled I see a spike in system load and hung task errors. It appears to be related to NFS and cgroups. The slurmstepd process gets hung cleaning up cgroups: INFO: task slurmstepd:11222 blocked for more than 120 seconds. Not tainted 4.4.0-119-generic #143-Ubun

Re: [slurm-users] "Low socket*core*thre" - solution?

2018-05-02 Thread John Kelly
Hi Caleb, I noticed the same thing. If you configure a host with more memory than it really has, slurm will think that the host has something wrong with it and put it in drain status. At least that is my theory. The vendor can likely give you a better, more detailed answer. -jfk On Wed, May 2

Re: [slurm-users] "Low socket*core*thre" - solution?

2018-05-02 Thread Caleb Smith
Hi all, Out of curiosity, what causes that? It'd be good to know for the future -- I ran into the same issue and just edited the memory down and it works fine now, but I'd like to know why/what causes that error. I'm assuming low resources, i.e. memory or CPU or whatever. Mind clarifying? On Wed, M

Re: [slurm-users] "Low socket*core*thre" - solution?

2018-05-02 Thread John Kelly
Hi matt scontrol update nodename=odin state=resume scontrol update nodename=odin state=idle -jfk On Wed, May 2, 2018 at 5:28 PM, Matt Hohmeister wrote: > I have a two-node cluster: the server/compute node is a Dell PowerEdge > R730; the compute node, a Dell PowerEdge R630. On both of these n

[slurm-users] "Low socket*core*thre" - solution?

2018-05-02 Thread Matt Hohmeister
I have a two-node cluster: the server/compute node is a Dell PowerEdge R730; the compute node, a Dell PowerEdge R630. On both of these nodes, slurmd -C gives me the exact same line: [me@odin slurm]$ slurmd -C NodeName=odin CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 Re
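A minimal shell sketch of the arithmetic behind the line quoted above. It only checks that CPUs equals Boards * SocketsPerBoard * CoresPerSocket * ThreadsPerCore for one line in that format. The function name `check_topology` is hypothetical, and this is not a description of what Slurm itself validates; a missing Boards value is assumed to mean 1.

```shell
# Hypothetical helper: takes one line in the format quoted from `slurmd -C`
# and compares CPUs with the product of the topology counts.
check_topology() {
    ct_cpus="" ; ct_boards=1 ; ct_sockets="" ; ct_cores="" ; ct_threads=""
    # Split the line on whitespace into Key=Value fields.
    for ct_field in $1; do
        ct_key="${ct_field%%=*}"
        ct_value="${ct_field#*=}"
        case "$ct_key" in
            CPUs)            ct_cpus="$ct_value" ;;
            Boards)          ct_boards="$ct_value" ;;
            SocketsPerBoard) ct_sockets="$ct_value" ;;
            CoresPerSocket)  ct_cores="$ct_value" ;;
            ThreadsPerCore)  ct_threads="$ct_value" ;;
        esac
    done
    # Refuse to guess when a required count is absent.
    if [ -z "$ct_cpus" ] || [ -z "$ct_sockets" ] || [ -z "$ct_cores" ] || [ -z "$ct_threads" ]; then
        echo "incomplete"
        return 2
    fi
    ct_product=$(( ct_boards * ct_sockets * ct_cores * ct_threads ))
    if [ "$ct_product" -eq "$ct_cpus" ]; then
        echo "consistent: CPUs=$ct_cpus"
    else
        echo "mismatch: CPUs=$ct_cpus product=$ct_product"
        return 1
    fi
}

# The line from the message: 1 * 2 * 10 * 2 = 40.
check_topology "NodeName=odin CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2"
# prints "consistent: CPUs=40"
```

The same function could be fed live output with `check_topology "$(slurmd -C | head -n 1)"`, assuming the node line is the first line printed.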

[slurm-users] Permissions to administer slurm

2018-05-02 Thread Matt Hohmeister
I'm a sysadmin, brand new to Slurm, and just got it running across two nodes: slurmctld and slurmd on one node; slurmd on the other. In the next few days, I'm going to meet with a researcher who has a sizable MATLAB job I can submit to give this a spin. Here's

[slurm-users] srun timeout exit status bug?

2018-05-02 Thread Dan Boorstein
Hi All, I've encountered what I think is a bug with srun's exit status when a timeout occurs, but perhaps my expectation is off. My expectation is for srun to have a non-zero exit status when a timeout occurs before all tasks can complete. This behaves as expected when all tasks are timed out: >

[slurm-users] Odd sacct behavior?

2018-05-02 Thread John DeSantis
Hello all, I'm just wondering if anyone is able to reproduce the behavior I'm seeing with `sacct`, or if anyone has experienced it previously. In a nutshell, I usually can query jobs from specified nodes, similar to the following: `sacct -o $OPTIO

Re: [slurm-users] GPU / cgroup challenges

2018-05-02 Thread Wiegand, Paul
So there is a patch? -- Original message-- From: Fulcomer, Samuel Date: Wed, May 2, 2018 11:14 To: Slurm User Community List; Cc: Subject:Re: [slurm-users] GPU / cgroup challenges This came up around 12/17, I think, and as I recall the fixes were added to the src repo then; however, they

Re: [slurm-users] GPU / cgroup challenges

2018-05-02 Thread Fulcomer, Samuel
This came up around 12/17, I think, and as I recall the fixes were added to the src repo then; however, they weren't added to any of the 17.releases. On Wed, May 2, 2018 at 6:04 AM, R. Paul Wiegand wrote: > I dug into the logs on both the slurmctld side and the slurmd side. > For the record, I h

Re: [slurm-users] GPU / cgroup challenges

2018-05-02 Thread R. Paul Wiegand
I dug into the logs on both the slurmctld side and the slurmd side. For the record, I have debug2 set for both and DebugFlags=CPU_BIND,Gres. I cannot see much that is terribly relevant in the logs. There's a known parameter error reported with the memory cgroup specifications, but I don't think t

Re: [slurm-users] wckey specification error

2018-05-02 Thread Chris Samuel
On Wednesday, 2 May 2018 8:50:12 PM AEST John Hearns wrote: > One learning point: grep -i is a good default option. This ignores the > case of the search, so you would have found WCKey a bit faster. Also if you need to search recursively below a point then: git grep --no-index -i ${PATTERN}
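A small sketch of the difference the `-i` flag makes, using a made-up two-line sample file (its contents are an assumption for illustration, not taken from the thread). A lowercase pattern misses the mixed-case spelling unless case is ignored.

```shell
# Made-up sample file for illustration; not taken from the thread.
sample=$(mktemp)
printf 'TrackWCKey=yes\n# wckey notes\n' > "$sample"

# Case-sensitive search matches only the all-lowercase line.
sensitive=$(grep -c 'wckey' "$sample")
# With -i the case is ignored, so both lines match.
insensitive=$(grep -ci 'wckey' "$sample")

echo "case-sensitive: $sensitive, case-insensitive: $insensitive"
# prints "case-sensitive: 1, case-insensitive: 2"
rm -f "$sample"
```

For a recursive search, the same pattern can be passed to the `git grep --no-index -i` command quoted in the message.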

Re: [slurm-users] wckey specification error

2018-05-02 Thread John Hearns
Mahmood, good to hear you have a solution. One learning point: grep -i is a good default option. This ignores the case of the search, so you would have found WCKey a bit faster. On 2 May 2018 at 04:26, Mahmood Naderan wrote: > Thanks Trevor for pointing out that there is an option for suc

[slurm-users] sacct fields AllocCPUS and ReqMem are empty

2018-05-02 Thread marcelsommer...@gmail.com
Hi, I have Slurm 17.02.10 installed in a test environment. When I use sacct -o "JobID,JobName,AllocCPUs,ReqMem,Elapsed" and AccountingStorageType = accounting_storage/filetxt, the fields AllocCPUS and ReqMem are empty. JobID JobName AllocCPUS ReqMem Elapsed

[slurm-users] sacct fields AllocCPUS and ReqMem are empty

2018-05-02 Thread Marcel Sommer
Hi, I have Slurm 17.02.10 installed in a test environment. When I use sacct -o "JobID,JobName,AllocCPUs,ReqMem,Elapsed" and AccountingStorageType = accounting_storage/filetxt, the fields AllocCPUS and ReqMem are empty. JobID JobName AllocCPUS ReqMem Elapsed