Hi,
I want to confirm: is hostname resolution in Slurm case sensitive?
Many thanks,
Bill
a nodelist for, say, those 28-core nodes and then
those 64-core nodes.
But going back to the original answer: --exclusive is the answer here.
You DO know how many cores you need, right? (A scaling study should give
you that.) And you DO know the memory footprint from past jobs with
similar inputs.
We've done this, though, with job_submit.lua, mostly with OS updates. We
add a feature to everything and then proceed, telling users that adding
the feature gets them onto the "new" nodes.
I can send you the snippet if you're using the job_submit.lua script.
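Roughly, the shape of it is this (a sketch rather than our exact file; the
feature name "newos" is made up, and it assumes the usual slurm.SUCCESS
constant from the Lua job_submit plugin):

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- Add the feature to the job's constraints, appending if the user
    -- already asked for features of their own.
    if job_desc.features == nil then
        job_desc.features = "newos"
    else
        job_desc.features = job_desc.features .. ",newos"
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    -- Nothing special on modify.
    return slurm.SUCCESS
end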
Bill
On 6/14/24 2:18
Does anything like that already exist in Slurm?
Thanks!
- Bill
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
Bill Benedetto bbenede...@goodyear.com
The Goodyear Tire & Rubber Co.
I don't speak for Goodyear.
MEMORY
end
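In full it is roughly this (a sketch, not our exact plugin; it assumes --mem=0
arrives as job_desc.pn_min_memory == 0 in the Lua job_submit API, which is
worth verifying on your Slurm version):

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- A request of --mem=0 would hand the user all of the node's memory,
    -- so refuse it and ask for an explicit number instead.
    if job_desc.pn_min_memory == 0 then
        slurm.log_user("--mem=0 is not allowed here; please request an explicit amount of memory")
        return slurm.ERROR
    end
    return slurm.SUCCESS
end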
Bill
On 12/7/22 12:03 PM, Felho, Sandor wrote:
TransUnion is running a ten-node site using Slurm with multiple queues.
We have an issue with the --mem parameter. There is one user who has read
the Slurm manual and found --mem=0, which grants the maximum memory on
the node (500 GiB).
Indeed. We use this and BELIEVE that it works, lol!
Bill
function slurm_job_modify ( job_desc, job_rec, part_list, modify_uid )
    -- Modifications made by root (uid 0) go through untouched.
    if modify_uid == 0 then
        return 0
    end
    -- Anyone else trying to change the QOS on a job is refused
    -- (a non-zero return rejects the modification).
    if job_desc.qos ~= nil then
        return 1
    end
    return 0
end
I usually add "withassoc" when doing a show user:
sacctmgr show user loris withassoc
Bill
On 11/23/21 9:07 AM, Loris Bennett wrote:
sacctmgr show user loris accounts
Dammit! Completely forgot that I have these right here in my home
directory! And I probably used your tools last year when I generated
the report.
Thank you Ole for making me remember!
Bill
On 11/10/21 3:08 PM, Ole Holm Nielsen wrote:
On 10-11-2021 16:56, Bill Wichser wrote:
I can't
Thanks. You are right, now that I understand the heading in the man
page. Not quite what I was hoping for here! Oh well, back to the
drawing board.
Thanks all.
Bill
On 11/10/21 12:04 PM, Michael Gutteridge wrote:
My read of the sreport manpage on our currently installed version
(21.08
I can't seem to figure out how to do a query against a partition.
sreport cluster AccountUtilizationByUser user=bill cluster=della, no
issues. Works as expected.
sreport cluster AccountUtilizationByUser Partitions=cpu cluster=della
gives me
Unknown condition: Partitions=cpu
and
The cluster doesn't exist though. This was what I tried first.
[root@della5 bill]# sacctmgr show RunawayJobs cluster=tukey
sacctmgr: error: Slurmctld running on cluster tukey is not up, can't
check running jobs
Bill
On 7/27/21 4:59 PM, Carlos Fenoy wrote:
Hi,
You can cleanup
[root@della5 bill]# sacctmgr -i delete user mable
Error with request: Job(s) active, cancel job(s) before remove
JobID = 602995 C = tukey A = politics U = mable
Yup, when a user has an active job they cannot be deleted from the
database. The thing is, this cluster tukey has been
'd give a heads up. I don't
think our user was being malicious, and their actual -J was
#SBATCH -J sd-PBEpvw9040%x
Probably a hash and probably machine-generated/unlucky.
I hope this helps and is actually a problem report. We're on 18.08.5, so I hope
we don't have to go back
On 5/6/20 11:30 AM, Dustin Lang wrote:
Hi,
Ubuntu has made mysql 5.7.30 the default version. At least with Ubuntu 16.04,
this causes severe problems with Slurm dbd (v 17.x, 18.x, and 19.x; not sure
about 20).
I can confirm that this kills slurmdbd on Ubuntu 18.04 as well. I had compiled Slurm
s
something we have not had to deal with, as CPU time per 30-day sliding
window has been accepted, can be quantitatively shown, and is just a
much easier way to schedule when ALL resources can be used.
Bill
On 10/28/19 11:11 AM, Tina Friedrich wrote:
Hello,
is there a possibility to tie a r
Anyone know if the new GPU support allows having a different number of GPUs per
node?
I found:
https://www.ch.cam.ac.uk/computing/slurm-usage
Which mentions "SLURM does not support having varying numbers of GPUs per node
in a job yet."
I have a user with a particularly flexible code that would
Thanks. Had no problem setting the individual element of the array.
Just thought that it worked differently in the past! Memory apparently
isn't what it used to be!
Thanks again,
Bill
On 6/13/19 10:25 AM, Jacob Jenson wrote:
scontrol show job
# scontrol update jobid=3136818 timelimit+=30-00:00:00
scontrol: error: TimeLimit increment/decrement not supported for job arrays
This is new to 18.08.7 it appears. Am I just missing something here?
Bill
On 5/15/19 12:34 AM, Barbara Krašovec wrote:
> It could be a problem with ARP cache.
>
> If the number of devices approaches 512, there is a kernel limitation in
> dynamic
> ARP-cache size and it can result in the loss of connectivity between nodes.
We have 162 compute nodes, a dozen or so file
My latest addition to a cluster results in a group of the same nodes
periodically getting listed as
"not-responding" and usually (but not always) recovering.
I increased logging up to debug3 and see messages like:
[2019-05-14T17:09:25.247] debug: Spawning ping agent for
bigmem[1-9],bm[1,7,9-13
deployment. Danny says he has heard of no problems but that
doesn't mean the folks in the trenches haven't seen issues!
Thanks,
Bill
h the others who think that the
environment inside the script is likely screwed up. Throwing in a printenv and
saving that can't hurt.
Bill.
--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445
Yes. We use something like this:
-- Force the "special" feature onto the job, appending if the user
-- already asked for features of their own.
if job_desc.features == nil then
    job_desc.features = "special"
else
    job_desc.features = job_desc.features .. ",special"
end
Bill
On 12/19/2018 09:27 AM, Kevin Manal
On 11/13/18 9:39 PM, Kilian Cavalotti wrote:
> Hi Bill,
> There are a couple mentions of the same backtrace on the bugtracker,
> but that was a long time ago (namely
> https://bugs.schedmd.com/show_bug.cgi?id=1557 and
> https://bugs.schedmd.com/show_bug.cgi?id=1660, for Slurm 14.
After being up since the second week in Oct or so, yesterday our slurm
controller started segfaulting. It was compiled and run on Ubuntu 16.04.1.
Nov 12 14:31:48 nas-11-1 kernel: [2838306.311552] srvcn[9111]: segfault at 58 ip
004b51fa sp 7fbe270efb70 error 4 in slurmctld[40+eb000
On 10/16/18 3:38 AM, Bjørn-Helge Mevik wrote:
> Just a tip: Make sure that the kernel has support for constraining swap
> space. I believe we once had to reinstall one of our clusters
> because we had forgotten to check that.
I tried starting slurmd with -D -v -v -v and got:
slurmd: debug:
a 3GB process with --mem=1000:
$ ps acux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
bill 17698 11.1 1.5 2817020 1015392 ? D 20:40 0:13 stream\
$ smem
User Count Swap USS PSS RSS
bill 1 1795552 1017048 1017076
Hi Alex,
Try running nvidia-smi before starting slurmd; I ran into this issue too. I have to run
nvidia-smi before slurmd when I reboot the system.
Regards,
Bill
-- Original --
From: Alex Chekholko
Date: Tue,Jul 24,2018 6:10 AM
To: Slurm User Community List
Subject: Re
All I can suggest is to check that all the paths you have provided to SIESTA are
correct (the path to the executable is clearly fine b/c SIESTA starts, but can
it find prime.fdf?). Otherwise start with your local support team.
Best,
Bill.
--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.
Thank you Peter,
Bill
-- Original --
From: Peter Kjellström
Date: Thu,Jul 19,2018 9:51 PM
To: Bill
Cc: Slurm User Community List
Subject: Re: [slurm-users] default memory request
On Thu, 19 Jul 2018 18:57:09 +0800
"Bill" wrote:
> Hi ,
>
>
Hi,
I just found the way: set "DefMemPerCPU=4096" for the partition in slurm.conf.
Jobs will then default to a 4 GB memory request (per CPU).
Regards,
Bill
-- Original --
From: "Bill";
Date: Thu, Jul 19, 2018 06:39 PM
To: "Slurm User Community&q
Thanks,
Bill
Thank you, Brian.
Another question: can one node have different weights for different partitions?
E.g., node1 at 0.8 in partition high but 0.5 in partition low?
Best regards,
Bill
-- Original --
From: Brian Andrus
Date: Wed,Jun 27,2018 0:44 PM
To: slurm-users
Subject
advance,
Bill
Greetings all,
Just wanted to mention that I've been building the newest Slurm on Ubuntu 18.04.
GCC 7.3 is the default compiler, which means that the various dependencies
(munge, libevent, hwloc, netloc, pmix, etc.) are already available and built with
GCC 7.3.
I carefully built slurm-17.11.6 + openmpi
2T06:00:00
StartTime=2018-06-12T06:00:00
You'd need more code around that, obviously, to determine whether this
StartTime might hold up the job.
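For instance, the comparison itself is just this (a standalone Lua sketch,
nothing Slurm ships; the reservation time and walltime below are made-up
examples): the job is held only if "now + requested walltime" runs past the
reservation's StartTime.

-- Given a reservation StartTime in scontrol's format and a requested
-- walltime in hours, decide whether a job submitted now would finish
-- before the maintenance window opens.
local function parse_scontrol_time(ts)
    local y, mo, d, h, mi, s = ts:match("(%d+)-(%d+)-(%d+)T(%d+):(%d+):(%d+)")
    return os.time{ year = tonumber(y), month = tonumber(mo), day = tonumber(d),
                    hour = tonumber(h), min = tonumber(mi), sec = tonumber(s) }
end

local function fits_before_reservation(resv_start, walltime_hours)
    return os.time() + walltime_hours * 3600 <= parse_scontrol_time(resv_start)
end

-- Made-up example: a 48-hour job against the StartTime shown above.
print(fits_before_reservation("2018-06-12T06:00:00", 48))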
Bill
On 05/10/2018 04:23 PM, Prentice Bisbal wrote:
Dear Slurm Users,
We've started using maintenance reservations. As you would expect, this
c
On 05/08/2018 05:33 PM, Christopher Samuel wrote:
> On 09/05/18 10:23, Bill Broadley wrote:
>
>> It's possible of course that it's entirely an openmpi problem, I'll
>> be investigating and posting there if I can't find a solution.
>
> One of the cha
Greetings all,
I have slurm-17.11.5, pmix-1.2.4, and openmpi-3.0.1 working on several clusters.
I find srun handy for things like:
bill@headnode:~/src/relay$ srun -N 2 -n 2 -t 1 ./relay 1
c7-18 c7-19
size= 1, 16384 hops, 2 nodes in 0.03 sec ( 2.00 us/hop) 1953 KB/sec
Building was
How do you start it?
If you use Sys V style startup scripts, then likely /etc/init.d/slurm stop, but
if you're using systemd, then probably systemctl stop slurm.service (but I
don't do systemd).
Best,
Bill.
Sent from my phone
> On Apr 24, 2018, at 11:15 AM, Mahmood Naderan wrote:
>
handle it for them. Maybe you should look into
that after you eliminate direct interference from Slurm.
Best,
Bill.
--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445
On 4/22/18, 1:06 AM, "
memory that
the node has (minus some padding for the OS, etc.). Is UsePAM enabled in your
slurm.conf? Maybe that's doing it.
Best,
Bill.
--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445
On 4/15
wants (cgroups, perhaps?).
Best,
Bill.
--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445
On 4/15/18, 1:41 PM, "slurm-users on behalf of Mahmood Naderan"
wrote:
Excuse me... I
/pam.d/sshd file
has pam_limits.so in it, that’s probably where the unlimited setting for root
is coming from.
Best,
Bill.
--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445
On 4/15/18, 1:26 PM
better forms of these, but they’re working for us. I guess this
counts now as being documented in a public place!
Best,
Bill.
--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445
On 3/21/18, 7:49 AM
Best,
Bill.
--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445
On 3/21/18, 6:08 AM, "slurm-users on behalf of Ole Holm Nielsen"
wrote:
We experience problems with MPI jobs dumping lots
going to depend on the shebang line (as to what’s
being invoked): bash? csh? python? perl? /usr/bin/env X? So, I’d be surprised if
there was a mode for this. Also, would you expect Slurm to delete any options
it used from your command line or leave them?
Best,
Bill.
--
Bill Barth, Ph.D., Director
We do the same at TACC in our base module (which happens to be called “TACC”),
and then we document it.
Best,
Bill.
--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445
On 3/6/18, 5:13 PM, "
ThatParameter=100’ or whatever you like to change it.
--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445
On 2/23/18, 11:13 PM, "slurm-users on behalf of ~Stack~"
wrote:
Greetings,
We kick them off and lock them out until they respond. Disconnections are
common enough that it doesn’t always get their attention. Inability to log back
in always does.
Best,
Bill.
Sent from my phone.
> On Feb 15, 2018, at 9:25 AM, Patrick Goetz wrote:
>
> The simple solution i
e probably other ways to
do this, but the infrastructure is now historical and set in some stone.
Best,
Bill.
--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445
On 2/7/18, 12:28 AM, "slu
file with job records which our local accounting system
consumes to decrement allocation balances, if you care to know).
Best,
Bill.
--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445
On 2/6/18
use Lmod to make it available and visible to our
users. There are more strategies for this than you can imagine, so settle on a
few and keep it simple for you!
Best,
Bill.
--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435
://sourceforge.net/p/lmod/mailman/) which is very active and
monitored by the author and a very knowledgeable community.
Best,
Bill.
--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445
On 12/19/17, 8:43 AM
install is
much more recent and does support them) for internal reasons, so we provide the
Launcher for folks who have similar needs to you.
Best,
Bill.
--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475