The CPUs are Opteron and don't have HT.
I have also found that for long runs (2-3 days) with a high number
of threads, e.g. 32 threads, the run time without Slurm is nearly the same,
with only small differences. For shared-memory runs there is no problem
and the difference is negligible. So, I will try to u
Mahmood, do you have Hyperthreading enabled?
That may be the root cause of your problem. If you have hyperthreading,
then when you start to run more threads than the number of PHYSICAL cores you
will get over-subscription. Now, with certain workloads that is fine - that
is what hyperthreading is all abou
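A quick way to check (assuming a Linux node where lscpu is available) is to look at the reported topology:

    # "Thread(s) per core: 1" means no hyperthreading; 2 means HT/SMT is enabled
    lscpu | egrep '^CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'

    # Or count logical CPUs vs. physical cores directly
    lscpu -p=CPU,CORE,SOCKET | grep -v '^#' | wc -l         # logical CPUs
    lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l   # physical cores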
It seems that the number of threads has some effect on
performance. Maybe a configuration issue exists in openmpi. I will
investigate that further. Thanks, guys, for the tips.
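One place to start that investigation (a suggestion, assuming Open MPI is the MPI in use here, and with ./your_app as a placeholder name) is to have it report how ranks are bound to cores, since a binding mismatch between a Slurm-launched run and a direct run would show up as exactly this kind of thread-count-dependent difference:

    # Print where each rank ends up bound; run once under Slurm and once directly,
    # then compare the binding reports
    mpirun --report-bindings --bind-to core -np 32 ./your_app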
Regards,
Mahmood
On Tue, Apr 24, 2018 at 9:18 PM, Ryan Novosielski wrote:
> I would likely crank up the debugg
I would likely crank up the debugging on the slurmd process and look at the log
files to see what’s going on during that time. You could also watch the job via top
or other means (on Linux, you can press “1” to see a line for each CPU
core), or use strace on the process itself. Presumably some
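For reference, that kind of digging might look roughly like the following; log paths, debug levels, and the <pid> are placeholders that depend on the local setup, so treat this as a sketch:

    # Run slurmd in the foreground with extra verbosity (stop the regular service first)
    slurmd -D -vvvv

    # Or raise SlurmdDebug in slurm.conf (e.g. SlurmdDebug=debug3) and apply it
    scontrol reconfigure

    # From another shell: watch per-core load (press "1" inside top), or trace syscalls
    top
    strace -f -tt -p <pid>     # <pid> is the application's process ID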
How do you start it?
If you use SysV-style startup scripts, then likely /etc/init.d/slurm stop, but
if you're using systemd, then probably systemctl stop slurm.service (but I
don’t do systemd).
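If it's unclear which of the two applies, systemd can be asked directly. Note that on many installations the compute-node daemon's unit is named slurmd.service rather than slurm.service, so it's worth checking what is actually installed:

    # See whether systemd knows about any Slurm units
    systemctl list-unit-files | grep -i slurm

    # Then stop whichever form the node actually uses
    systemctl stop slurmd.service      # systemd-managed daemon (unit name may vary)
    /etc/init.d/slurm stop             # SysV-style init script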
Best,
Bill.
Sent from my phone
> On Apr 24, 2018, at 11:15 AM, Mahmood Naderan wrote:
>
> Hi Bi
Hi Bill,
In order to shut down the Slurm process on the compute node, is it fine
to kill /usr/sbin/slurm? Or is there a better and safer way to do that?
Regards,
Mahmood
On Sun, Apr 22, 2018 at 5:44 PM, Bill Barth wrote:
> Mahmood,
>
> If you have exclusive control of this system and can afford
On Sunday, 22 April 2018 4:06:56 PM AEST Mahmood Naderan wrote:
> I ran some other tests and got nearly the same results. That 4
> minutes in my previous post means about 50% overhead. So, 24000
> minutes for a direct run becomes about 35000 minutes via Slurm.
That sounds like there's really somethin
Mahmood,
If you have exclusive control of this system and can afford to have compute-0-0
out of production for a while, you can do a simple test:
Shut Slurm down on compute-0-0
Login directly to compute-0-0
Run the timing experiment there
Compare the results to both of the other experiments you h
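As a concrete sketch of that comparison (the application name, thread count, and job ID below are placeholders, not details from this thread):

    # On compute-0-0 with slurmd stopped: time the same run outside Slurm
    /usr/bin/time -v mpirun -np 32 ./your_app > direct.log 2>&1

    # Afterwards, compare against what Slurm recorded for the batch job
    # (requires accounting to be enabled)
    sacct -j <jobid> --format=JobID,Elapsed,NCPUS,State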
I ran some other tests and got nearly the same results. That 4
minutes in my previous post means about 50% overhead. So, 24000
minutes for a direct run becomes about 35000 minutes via Slurm. I will post
more details later. The methodology I used is:
1- Submit a job to a specific node (compute-0-0) via
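That first step would presumably look something like the following; the script name is only an example:

    # Pin the batch job to compute-0-0 (long form: --nodelist=compute-0-0)
    sbatch -w compute-0-0 job.sh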
Hi Mahmood,
Mahmood Naderan writes:
> Hi,
> I have installed a program on all nodes since it is an RPM. Therefore,
> when the program is running, it won't use the shared file system and
> it just uses its own /usr/local/program files.
>
> I also set a scratch path in the .bashrc, which is actually
Hi,
I have installed a program on all nodes since it is an RPM. Therefore,
when the program is running, it won't use the shared file system and
it just uses its own /usr/local/program files.
I also set a scratch path in the .bashrc, which is actually a path on
the running node. For example, I set T
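For what it's worth, a node-local scratch setting in .bashrc usually looks something like the lines below; the variable name and directory are purely illustrative, not Mahmood's actual values:

    # Hypothetical example only: point the program at node-local scratch
    export SCRATCH_DIR=/scratch/$USER
    mkdir -p "$SCRATCH_DIR"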