Re: [slurm-users] "Low socket*core*thre" - solution?

2018-05-05 Thread Werner Saar
Hi, what is the output of the command: slurmd -C rocks7 Best regards Werner On 05/05/2018 06:56 PM, Mahmood Naderan wrote: Quick follow up. I see the Sockets for the head node is 1 while for the compute nodes is 32. And I think that is the reason, why slurm only see one cpu (CPUTot=1). M

Re: [slurm-users] "Low socket*core*thre" - solution?

2018-05-05 Thread Chris Samuel
On Sunday, 6 May 2018 2:56:51 AM AEST Mahmood Naderan wrote: > May I ask what is the difference between CPUs and Sockets in slurm.conf? CPUs are processor cores (on a package) and the socket is the package that goes into the motherboard (which these days usually has its own memory controller wh

Re: [slurm-users] sacct: error

2018-05-05 Thread Chris Samuel
On Sunday, 6 May 2018 2:00:44 AM AEST Eric F. Alemany wrote: > Working on weekends - hey ? [...] This isn't my work. ;-) > It seems as the commands give different result (?) - What do you think ? Very very interesting - both slurmd and lscpu report 32 cores, but with differing interpretations

Re: [slurm-users] After Each slurm Run, I Need to Reinstall slurm

2018-05-05 Thread Chris Samuel
On Sunday, 6 May 2018 11:02:25 AM AEST Kenneth Russell wrote: > If you find that you have the same problem as me , you can use the script > file below to automate the reinstall process. As I said in my original > note, this is a very inefficient way to run slurm That script looks very very weird.

Re: [slurm-users] After Each slurm Run, I Need to Reinstall slurm

2018-05-05 Thread Christopher Samuel
On 06/05/18 11:26, Will Dennis wrote: 1) I am not sure Slurm can run “all-in-one” with controller/worker/acctg-db all on one host… If anyone else know if this is doable, please chime in (I actually have a request to do this for a single machine at work, where the researchers want to have many fo

Re: [slurm-users] After Each slurm Run, I Need to Reinstall slurm

2018-05-05 Thread Will Dennis
A few thoughts… 1) I am not sure Slurm can run “all-in-one” with controller/worker/acctg-db all on one host… If anyone else know if this is doable, please chime in (I actually have a request to do this for a single machine at work, where the researchers want to have many folks share a single GP

[slurm-users] Fwd: Fwd: After Each slurm Run, I Need to Reinstall slurm

2018-05-05 Thread Kenneth Russell
Sent from Mailspring, the best free email app for work -- Forwarded message

[slurm-users] Fwd: After Each slurm Run, I Need to Reinstall slurm

2018-05-05 Thread Kenneth Russell
Sent from Mailspring, the best free email app for work -- Forwarded message

Re: [slurm-users] After Each slurm Run, I Need to Reinstall slurm

2018-05-05 Thread Kenneth Russell
Eric, I had already installed the latest version of slurm (V17.11.5). I followed your advice and upgraded Ubuntu server to V 18.04. That didn't solve the problem. To install slurm I used the instructions in the web site: https://github.com//mknoxnv/ubuntu-slurm/blob/master/REEADME.md

Re: [slurm-users] After Each slurm Run, I Need to Reinstall slurm

2018-05-05 Thread Geert Geurts
Hi Kenneth, The pidfile is just a record that says what is the pid of slurmctld or slurmdbd or whatever daemon. It is used by systemd and gets created automatically. The only thing you could worry about is the parent directory of the pidfile, but not having a pidfile doesn't block the daemon fro

Re: [slurm-users] After Each slurm Run, I Need to Reinstall slurm

2018-05-05 Thread Eric F. Alemany
Hi Ken I am in the same boat as you are meaning that I am also new to SLURM. This is what I've done from good recommendation. Install Ubuntu 18.04 on your servers which just got released last week. Apparently the ubuntu 16.04 package of SLURM is outdated. Install slurm-llnl on headnode/master Ins

[slurm-users] After Each slurm Run, I Need to Reinstall slurm

2018-05-05 Thread Kenneth Russell
I am a new slurm user and am trying to set up a single node test system. I have spent endless hours trying to get slurm services to start. I am running Ubuntu Server V16.04 and slurm 17.11.5. My MB has an AMD 8 core processor. When I try to start slurmdbd or slurmctld services I get messages say

Re: [slurm-users] "Low socket*core*thre" - solution?

2018-05-05 Thread Mahmood Naderan
Quick follow up. I see the Sockets for the head node is 1 while for the compute nodes is 32. And I think that is the reason, why slurm only see one cpu (CPUTot=1). May I ask what is the difference between CPUs and Sockets in slurm.conf? Regards, Mahmood On Sat, May 5, 2018 at 9:24 PM, Mahmood

Re: [slurm-users] "Low socket*core*thre" - solution?

2018-05-05 Thread Mahmood Naderan
Hi, I also have the same problem. I think by default, slurm won't add the head node as a compute node. I manually set the state to resume, However, the number of cores is still low (1) and not what I specified in slurm.conf [root@rocks7 mahmood]# scontrol show node rocks7 NodeName=rocks7 Arch=x86

Re: [slurm-users] sacct: error

2018-05-05 Thread Eric F. Alemany
Hi Chris, Working on weekends - hey ? When I do "slurmd -C" on one of my execute nodes, I get: eric@radonc01:~$ slurmd -C NodeName=radonc01 slurmd: Considering each NUMA node as a socket CPUs=32 Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64402 UpTime=2-17:35:12 Al
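For anyone comparing this output across several nodes, a minimal Python sketch of pulling the KEY=VALUE fields out of the "slurmd -C" text quoted above. The helper name parse_slurmd_c is hypothetical and not part of Slurm; the sample string is copied from the message, and the sketch simply assumes every field of interest is a whitespace-separated KEY=VALUE token.

```python
def parse_slurmd_c(text: str) -> dict[str, str]:
    """Collect KEY=VALUE tokens from `slurmd -C` style output.

    Hypothetical helper, not part of Slurm. Tokens without '=' (such as
    the informational "slurmd: Considering each NUMA node..." message)
    are skipped.
    """
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep and key and value:
            fields[key] = value
    return fields


# Sample text copied from the message above.
sample = (
    "NodeName=radonc01 slurmd: Considering each NUMA node as a socket "
    "CPUs=32 Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1 "
    "RealMemory=64402 UpTime=2-17:35:12"
)

info = parse_slurmd_c(sample)
print(info["CPUs"], info["SocketsPerBoard"], info["CoresPerSocket"])
```

The values stay as strings; convert with int() only for the fields that are known to be numeric.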

Re: [slurm-users] Memory oversubscription and sheduling

2018-05-05 Thread Chris Samuel
On Thursday, 26 April 2018 3:28:19 AM AEST Cory Holcomb wrote: > It appears that I have a configuration that only takes into account the > allocated memory before dispatching. With batch systems the idea is for the users to set constraints for their jobs so the scheduler can backfill other jobs

Re: [slurm-users] GPU / cgroup challenges

2018-05-05 Thread Chris Samuel
On Wednesday, 2 May 2018 11:04:34 PM AEST R. Paul Wiegand wrote: > When I set "--gres=gpu:1", the slurmd log does have encouraging lines such > as: > > [2018-05-02T08:47:04.916] [203.0] debug: Allowing access to device > /dev/nvidia0 for job > [2018-05-02T08:47:04.916] [203.0] debug: Not allowi

Re: [slurm-users] Hung tasks and high load when cancelling jobs

2018-05-05 Thread Chris Samuel
On Thursday, 3 May 2018 1:23:44 PM AEST Brendan Moloney wrote: > I upgraded somewhat recently from 17.02 to 17.11, but I am not positive if > this bug is new or just went unnoticed previously. There is a known deadlock bug in 17.11.x which can happen for certain workloads, hopefully fixed in 17.

Re: [slurm-users] Repost: Odd sacct behavior?

2018-05-05 Thread Chris Samuel
On Thursday, 3 May 2018 10:59:50 PM AEST John DeSantis wrote: > So, has anyone else run into a similar issue? No, but... > I'm using slurm 16.05.10-2 and slurmdbd 16.05.10-2. ...you're on a very old version of Slurm with a known security problem in its slurmdbd, and you can't even download tha

Re: [slurm-users] sacct: error

2018-05-05 Thread Chris Samuel
On Saturday, 5 May 2018 2:45:19 AM AEST Eric F. Alemany wrote: > With Ray suggestion i have a error message for each nodes. Here i am giving > you only one error message from a node. > sacct: error: NodeNames=radonc01 CPUs=32 doesn't match > Sockets*CoresPerSocket*ThreadsPerCore (16), resetting CP
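The error quoted above compares CPUs against the product Sockets*CoresPerSocket*ThreadsPerCore. A minimal Python sketch of that same arithmetic, useful for checking a node line before putting it in slurm.conf. The function name topology_matches is hypothetical, and the 2-socket layout in the second call is an assumed example that happens to give 16; the actual values in the poster's slurm.conf are not shown in the message.

```python
def topology_matches(cpus: int, sockets: int, cores_per_socket: int,
                     threads_per_core: int) -> tuple[bool, int]:
    """Return (ok, product) for the check named in the error message:
    CPUs should equal Sockets * CoresPerSocket * ThreadsPerCore.

    Hypothetical helper, not part of Slurm.
    """
    product = sockets * cores_per_socket * threads_per_core
    return cpus == product, product


# Layout reported by `slurmd -C` elsewhere in this thread:
# 4 sockets, 8 cores per socket, 1 thread per core.
ok_reported, product_reported = topology_matches(32, 4, 8, 1)

# Assumed (hypothetical) config whose product is 16, as in the error text.
ok_assumed, product_assumed = topology_matches(32, 2, 8, 1)

print(ok_reported, product_reported)
print(ok_assumed, product_assumed)
```

When the check fails, the usual approach is to make the node definition agree with what slurmd reports on that node.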

Re: [slurm-users] "Low socket*core*thre" - solution?

2018-05-05 Thread Chris Samuel
On Thursday, 3 May 2018 10:28:46 AM AEST Matt Hohmeister wrote: > …and it looks good, except for the drain on my server/compute node: I think if you've had the config wrong at some point in the past then slurmctld will remember the error and you'll need to manually clear it with: scontrol updat

Re: [slurm-users] sacct fields AllocCPUS and ReqMem are empty

2018-05-05 Thread Chris Samuel
On Saturday, 5 May 2018 12:43:32 AM AEST Benjamin Rampe wrote: > I haven't found anything in the documentation that talks about > limitations regarding job accounting. Yeah, the documentation is pretty poor on this. :-( The best I can find is this email to the old slurm-dev list from 6 years ago