Re: [slurm-users] Slurm overhead

2018-04-27 Thread Mahmood Naderan
The CPUs are Opteron and don't have HT. I have also found that for long run times (2-3 days) with a high number of threads, e.g. 32 threads, the run time without Slurm is nearly the same, with only small differences. For shared-memory runs there is no problem and the difference is negligible. So, I will try to u

Re: [slurm-users] Slurm overhead

2018-04-26 Thread John Hearns
Mahmood, do you have hyperthreading enabled? That may be the root cause of your problem. If you have hyperthreading, then when you start to run more threads than the number of PHYSICAL cores you will get over-subscription. Now, with certain workloads that is fine - that is what hyperthreading is all abou
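A quick way to check whether hyperthreading is enabled on a node, and how Slurm sees its cores, is something along these lines (the node name is just an example):

    # Physical vs. logical core layout on the compute node
    lscpu | grep -E 'Socket|Core|Thread'

    # What Slurm believes the node layout is
    scontrol show node compute-0-0 | grep -E 'CPUTot|CoresPerSocket|ThreadsPerCore'

If ThreadsPerCore is 2 but the job asks for as many ranks as logical CPUs, the over-subscription John describes can occur.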

Re: [slurm-users] Slurm overhead

2018-04-26 Thread Mahmood Naderan
It seems that the number of threads has some effect on performance. Maybe a configuration issue exists in openmpi. I will investigate that further. Thanks guys for the tips. Regards, Mahmood On Tue, Apr 24, 2018 at 9:18 PM, Ryan Novosielski wrote: > I would likely crank up the debugg
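One way to see whether the ranks are landing on the cores you expect, both with a direct Open MPI launch and under Slurm, is to ask each launcher to report its binding (the executable name below is a placeholder):

    # Direct launch: Open MPI reports where each rank is bound
    mpirun -np 32 --report-bindings ./my_app

    # Under Slurm: have srun report its CPU binding
    # (older Slurm versions spell the option --cpu_bind)
    srun -n 32 --cpu-bind=verbose ./my_app

Comparing the two binding reports is a common way to spot a launcher or affinity misconfiguration.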

Re: [slurm-users] Slurm overhead

2018-04-24 Thread Ryan Novosielski
I would likely crank up the debugging on the slurmd process and look at the log files to see what's going on during that time. You could also watch the job via top or other means (on Linux, you can press "1" in top to see a separate line for each CPU core), or use strace on the process itself. Presumably some
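In practice that could look roughly like the following; the log path depends on SlurmdLogFile in slurm.conf, and the process name used for the PID lookup is only an example:

    # In slurm.conf, raise the slurmd log level, then reread the config
    #   SlurmdDebug=debug3
    scontrol reconfigure
    tail -f /var/log/slurmd.log

    # Watch per-core usage interactively: run top, then press "1"
    top

    # Trace system calls of the running process (PID lookup is an example)
    strace -f -p $(pgrep -f my_app | head -1)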

Re: [slurm-users] Slurm overhead

2018-04-24 Thread Bill Barth
How do you start it? If you use SysV-style startup scripts, then likely /etc/init.d/slurm stop, but if you're using systemd, then probably systemctl stop slurm.service (but I don't do systemd). Best, Bill. Sent from my phone > On Apr 24, 2018, at 11:15 AM, Mahmood Naderan wrote: > > Hi Bi
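In concrete terms, the two variants are roughly the following; note that depending on how Slurm was packaged the systemd unit may be named slurmd.service rather than slurm.service:

    # SysV-style init script (e.g. older Rocks/CentOS installs)
    /etc/init.d/slurm stop

    # systemd
    systemctl stop slurmd        # or: systemctl stop slurm.service

Either is preferable to killing the daemon directly, since the service scripts stop it cleanly.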

Re: [slurm-users] Slurm overhead

2018-04-24 Thread Mahmood Naderan
Hi Bill, In order to shut down the Slurm process on the compute node, is it fine to kill /usr/sbin/slurm? Or is there a better and safer way to do that? Regards, Mahmood On Sun, Apr 22, 2018 at 5:44 PM, Bill Barth wrote: > Mahmood, > > If you have exclusive control of this system and can afford

Re: [slurm-users] Slurm overhead

2018-04-23 Thread Chris Samuel
On Sunday, 22 April 2018 4:06:56 PM AEST Mahmood Naderan wrote: > I ran some other tests and got nearly the same results. That 4 > minutes in my previous post means about 50% overhead. So, 24000 > minutes on direct run is about 35000 minutes via slurm. That sounds like there's really somethin

Re: [slurm-users] Slurm overhead

2018-04-22 Thread Bill Barth
Mahmood, If you have exclusive control of this system and can afford to have compute-0-0 out of production for a while, you can do a simple test: shut Slurm down on compute-0-0, log in directly to compute-0-0, run the timing experiment there, and compare the results to both of the other experiments you h
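A rough sketch of that test, assuming SSH access to the node; the solver command, input file, and core count are placeholders:

    # 1. Take the node out of Slurm's hands
    ssh compute-0-0 'systemctl stop slurmd'      # or /etc/init.d/slurm stop

    # 2. Log in and run the same job directly, timing it
    ssh compute-0-0
    /usr/bin/time -v mpirun -np 32 /usr/local/program/bin/solver input.dat

    # 3. Compare the elapsed time against the Slurm-launched run

If the direct run on the idle node matches the Slurm run, the overhead is elsewhere (e.g. binding or I/O); if it is much faster, something in the Slurm launch path is worth investigating.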

Re: [slurm-users] Slurm overhead

2018-04-21 Thread Mahmood Naderan
I ran some other tests and got nearly the same results. The 4 minutes in my previous post means about 50% overhead. So, 24000 minutes on a direct run becomes about 35000 minutes via slurm. I will post the details later. The methodology I used is: 1- Submit a job to a specific node (compute-0-0) via
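For the Slurm side of such a comparison, pinning a job to one node and reading back its timing could be done along these lines (the batch script name is a placeholder):

    # Submit to a specific node and keep it exclusive for the test
    sbatch --nodelist=compute-0-0 --exclusive -n 32 job.sh

    # After it finishes, check the elapsed and CPU time
    sacct -j <jobid> --format=JobID,Elapsed,TotalCPU,NodeList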

Re: [slurm-users] Slurm overhead

2018-04-19 Thread Loris Bennett
Hi Mahmood, Mahmood Naderan writes: > Hi, > I have installed a program on all nodes since it is an rpm. Therefore, > when the program is running, it won't use the shared file system and > it just uses its own /usr/local/program files. > > I also set a scratch path in the bashrc which is actually

[slurm-users] Slurm overhead

2018-04-19 Thread Mahmood Naderan
Hi, I have installed a program on all nodes since it is an rpm. Therefore, when the program is running, it won't use the shared file system and just uses its own /usr/local/program files. I also set a scratch path in the bashrc, which is actually a path on the running node. For example, I set T
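A scratch-path setting of that kind might look something like the following in ~/.bashrc; the variable name and directory are only an illustration, since the original message is truncated before the actual setting:

    # Hypothetical example - point temporary files at node-local storage
    export TMPDIR=/scratch/$USER
    mkdir -p "$TMPDIR"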