Hi Bjørn-Helge,
On 26/09/2024 09:50, Bjørn-Helge Mevik via slurm-users wrote:
Ward Poelmans via slurm-users writes:
We hit a snag when updating our clusters from Slurm 23.02 to
24.05. After updating the slurmdbd, our multi cluster setup was broken
until everything was updated to 24.05. We
Hi all,
We hit a snag when updating our clusters from Slurm 23.02 to 24.05. After
updating the slurmdbd, our multi cluster setup was broken until everything was
updated to 24.05. We had not anticipated this.
SchedMD says that fixing it would be a very complex operation.
Hence, this warning to
Hi Kevin,
On 19/08/2024 08:15, Kevin Buckley via slurm-users wrote:
If I supply a
--constraint=
option to an sbatch/salloc/srun, does the arg appear inside
any object that a Lua CLI Filter could access?
Have a look and see if you can spot them in:
function slurm_cli_pre_submit(options, pack_offse
Hi Arnuld,
On 5/07/2024 13:56, Arnuld via slurm-users wrote:
It should show up like this:
Gres=gpu:gtx_1080_ti:4(S:0-1),shard:gtx_1080_ti:16(S:0-1)
What's the meaning of (S:0-1) here?
The sockets to which the GPUs are associated:
If GRES are associated with specific sockets, t
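Those socket numbers come from the core binding in gres.conf (or from autodetection). Spelled out by hand it would look something like this; the core ranges are hypothetical, just to illustrate the socket association:
# gres.conf
Name=gpu Type=gtx_1080_ti File=/dev/nvidia0 Cores=0-15
Name=gpu Type=gtx_1080_ti File=/dev/nvidia1 Cores=0-15
Name=gpu Type=gtx_1080_ti File=/dev/nvidia2 Cores=16-31
Name=gpu Type=gtx_1080_ti File=/dev/nvidia3 Cores=16-31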
Hi Ricardo,
It should show up like this:
Gres=gpu:gtx_1080_ti:4(S:0-1),shard:gtx_1080_ti:16(S:0-1)
CfgTRES=cpu=32,mem=515000M,billing=130,gres/gpu=4,gres/shard=16
AllocTRES=cpu=8,mem=31200M,gres/shard=1
I can't directly spot any error however. Our gres.conf is simply
`AutoDetect=nvm
Hi,
This is systemd, not slurm. We've also seen it being created and removed. As
far as I understood, it's something about session cleanup that systemd does. We've
worked around it by adding this to the prolog:
MY_XDG_RUNTIME_DIR=/dev/shm/${USER}
mkdir -p $MY_XDG_RUNTIME_DIR
echo "export XDG_RUNTIME_DI
Hi,
On 26/02/2024 09:27, Josef Dvoracek via slurm-users wrote:
Is anybody using something more advanced and still understandable by
a casual user of HPC?
I'm not sure it qualifies but:
sbatch --wrap 'screen -D -m'
srun --jobid --pty screen -rd
Or:
sbatch -J screen --wrap 'screen -D -m'
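Pieced together, the flow could look roughly like this (--parsable is just one way to capture the job id):
# start a detached screen session inside a batch allocation
jobid=$(sbatch --parsable -J screen --wrap 'screen -D -m')
# once the job is running, attach to it from the login node
srun --jobid "${jobid}" --pty screen -rd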
Hi,
On 21/11/2023 13:52, Arsene Marian Alain wrote:
But how can the user write to or access the hidden directory .1809 if he doesn't have
read/write permission on the main directory 1809?
Because it works as a namespace. On my side:
$ ls -alh /local/6000523/
total 0
drwx------ 3 root root 33 Nov
Hi Arsene,
On 21/11/2023 10:58, Arsene Marian Alain wrote:
I just give my Basepath=/scratch (a local directory for each node that is already mounted
with 1777 permissions) in job_container.conf. The plugin automatically generates for each
job a directory with the "JOB_ID", for example: /scrat
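For reference, a minimal job_container.conf along those lines could be as follows (slurm.conf additionally needs JobContainerType=job_container/tmpfs):
# job_container.conf
BasePath=/scratch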
Hi Ole,
On 10/11/2023 15:04, Ole Holm Nielsen wrote:
On 11/5/23 21:32, Ward Poelmans wrote:
Yes, it's very similar. I've put our systemd unit file also online on
https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11
This might disturb the logic in waitforib.sh, or at l
https://github.com/maxlxl/network.target_wait-for-interfaces ?
Thanks,
Ole
On 11/1/23 20:09, Ward Poelmans wrote:
We have a slightly different script to do the same. It only relies on /sys:
# Search for infiniband devices and wait until
# at least one reports that it is ACTIVE
if [[ ! -d /sys
Hi,
We have a slightly different script to do the same. It only relies on /sys:
# Search for infiniband devices and wait until
# at least one reports that it is ACTIVE
if [[ ! -d /sys/class/infiniband ]]
then
logger "No infiniband found"
exit 0
fi
ports=$(ls /sys/class/infiniba
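The preview cuts off here; a self-contained sketch of the same idea, assuming the usual /sys/class/infiniband/<device>/ports/<port>/state layout and adding a bounded wait, might look like:
#!/bin/bash
if [[ ! -d /sys/class/infiniband ]]; then
    logger "No infiniband found"
    exit 0
fi
# wait up to 5 minutes for at least one port to report ACTIVE
for i in $(seq 1 300); do
    if grep -q ACTIVE /sys/class/infiniband/*/ports/*/state 2>/dev/null; then
        logger "Infiniband port is ACTIVE"
        exit 0
    fi
    sleep 1
done
logger "Timed out waiting for infiniband"
exit 1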
Hi Steven,
On 27/08/2023 08:17, Steven Swanson wrote:
I'm trying to set up slurm as the backend for a system with Jupyter
Notebook-based front end.
The jupyter notebooks are running in containers managed by Jupyter Hub, which
is a mostly turnkey system for providing docker containers that us
Hi Gary,
On 29/06/2023 22:35, Jackson, Gary L. wrote:
A follow-up:
As the slurmctld code is written now, it seems that all job submission paths
through Slurm get the `job_submit` plugin callback invoked on their behalf,
which is great! However, if this is a promise that the API is making, t
Hi Xaver,
On 17/04/2023 11:36, Xaver Stiensmeier wrote:
let's say I want to submit a large batch job that should run on 8 nodes.
I have two partitions, each holding 4 nodes. Slurm will now tell me that
"Requested node configuration is not available". However, my desired
output would be that sl
Hi,
We have dedicated partitions for GPUs (their names end with _gpu) and simply
forbid jobs that do not request GPU resources from using these partitions:
local function job_total_gpus(job_desc)
-- return total number of GPUs allocated to the job
-- there are many ways to request a GPU
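For illustration, the filter has to catch all of these equivalent ways of asking for a GPU (partition and script names are hypothetical):
sbatch -p gpu_partition --gres=gpu:1 job.sh
sbatch -p gpu_partition --gpus=1 job.sh
sbatch -p gpu_partition --gpus-per-node=1 job.sh
sbatch -p gpu_partition --ntasks=1 --gpus-per-task=1 job.sh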
On 24/02/2023 18:34, David Laehnemann wrote:
Those queries then should not have to happen too often, although do you
have any indication of a range for when you say "you still wouldn't
want to query the status too frequently." Because I don't really, and
would probably opt for some compromise of
Hi Xaver,
On 6/02/2023 08:39, Xaver Stiensmeier wrote:
How would you schedule a job (let's say using srun) to work on these
nodes? Of course this would be interesting in a dynamic case, too
(assuming that the database is downloaded to nodes during job
execution), but for now I would be happy wi
Hi,
Slurm 22.05 has a new thing called GPU sharding that allows a single GPU to be
used by multiple jobs at once. As far as I understood, the major difference with
the MPS approach is that this should be generic (not tied to NVidia technology).
Has anyone tried it out? Does it work well? Any cavea
On 18/01/2023 15:22, Ohlerich, Martin wrote:
But Magnus (Thanks for the Link!) is right. This is still far away from a
feature rich job- or task-farming concept, where at least some overview of the
passed/failed/missing task statistics is available etc.
GNU parallel has log output and options
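For example, parallel's --joblog (together with --resume) gives exactly that kind of per-task bookkeeping; task.sh and the input list are placeholders:
parallel --joblog tasks.log --resume -j $SLURM_NTASKS \
    srun -N 1 -n 1 -c 1 --exact ./task.sh ::: *.input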
Hi Martin,
Just a tip: use gnu parallel instead of a for loop. Much easier and more
powerful.
Like:
parallel -j $SLURM_NTASKS srun -N 1 -n 1 -c 1 --exact ::: *.input
Ward
Hi Michael,
On 30/11/2022 07:29, Michael Milton wrote:
Considering this, my question is about which APIs (ABI, CLI, other?) are
considered stable and worth targeting from a third party application. In
addition, is there any initiative to making the ABI stable, because it seems
like it would
On 24/10/2022 09:32, Ole Holm Nielsen wrote:
On 10/24/22 06:12, Richard Chang wrote:
I have a two node Slurmctld setup and both will mount an NFS exported directory
as the state save location.
It is definitely a BAD idea to store Slurm StateSaveLocation on a slow NFS
directory! SchedMD reco
Hi William,
On 14/10/2022 11:41, William Zhang wrote:
How can we realize this function?
For example:
A job requires 6 CPUs with 1 GPU, and it runs on GPU ID 0, CPU IDs 0-5.
The second job requires 8 CPUs with 1 GPU. If it runs on GPU ID 1, we hope the
CPU IDs are 16-23.
The third job requires 6
Hi Loris,
On 29/09/2022 09:26, Loris Bennett wrote:
I can see that this is potentially not easy, since an MPI job might
still have phases where only one core is actually being used.
Slurm will create the needed cgroups on all the nodes that are part of the job
when the job starts. So yo
Hi Guillaume,
On 15/06/2022 16:59, Guillaume De Nayer wrote:
Perhaps I misunderstand the Slurm documentation...
I thought that the --exclusive option used in combination with sbatch
would reserve the whole node (40 cores) for the job (submitted with
sbatch). This part is working fine. I can c
Hi,
We're using a CLI filter for doing this, but it's more tricky than just
`--export=NONE`. For an srun inside an sbatch, you want `--export=ALL` again
because MPI will break otherwise.
We have this in our cli filter:
function slurm_cli_pre_submit(options, pack_offset)
local default_e
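The truncated Lua aside, the user-visible behaviour the filter aims for boils down to something like this (mpi_program is a placeholder):
sbatch --export=NONE job.sh        # batch job starts from a clean environment
# inside job.sh:
srun --export=ALL ./mpi_program    # steps get the full job environment back, so MPI keeps working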
Hi Steven,
I think truly dynamic adding and removing of nodes is something that's on the
roadmap for slurm 23.02?
Ward
On 5/05/2022 15:28, Steven Varga wrote:
Hi Tina,
Thank you for sharing. This matches my observations when I checked if slurm
could do what I am upto: manage AWS EC2 dynamic(
Hi PVD,
On 25/03/2022 01:55, pankajd wrote:
We have slurm 21.08.6 and GPUs in our compute nodes. We want to restrict / disable the
use of "exclusive" flag in srun for users. How should we do it?
You can check for the flag in the job_submit.lua plugin and reject it if it's
used while also r
Hi Jeff,
On 17/03/2022 15:39, Jeffrey R. Lang wrote:
I want to look into the new feature of saving job scripts in the Slurm
database but have been unable to find documentation on how to do it. Can
someone please point me in the right direction for the documentation or slurm
configurati
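From memory, the relevant bits are roughly the following (double-check the slurm.conf and sacct man pages for your version):
# slurm.conf
AccountingStoreFlags=job_script
# retrieve a stored script later with:
sacct -j <jobid> --batch-script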
Hi Paul,
On 10/02/2022 14:33, Paul Brunk wrote:
Now we see a problem in which the OOM killer is in some cases
predictably killing job steps who don't seem to deserve it. In some
cases these are job scripts and input files which ran fine before our
Slurm upgrade. More details follow, but th
Hi,
On 20/07/2021 16:01, Durai Arasan wrote:
>
> This is limited to this one node only. Do you know how to fix this? I already
> tried restarting the slurmd service on this node.
Is the node properly defined in the slurm.conf and does the DNS hostname work?
scontrol show node slurm-bm-70
Ward
On 6/07/2021 14:59, Emre Brookes wrote:
> I'm using slurm 20.02.7 & have the same issue (except I am running batch
> jobs).
> Does MinJobAge work to keep completed jobs around for the specified duration
> in squeue output?
It does for me if I do 'squeue -t all'. This is slurm 20.11.7.
Ward
Hi Tina,
On 2/07/2021 13:42, Tina Friedrich wrote:
> We did think about having 'hidden' GPU partitions instead of wrangling it
> with features, but there didn't seem to be any benefit to that that we could
> see.
The benefit with partitions is that you can set a bunch of options that are not
p
On 8/06/2021 00:27, Sid Young wrote:
> Is there a tool that will extract the job counts in JSON format? Such as
> #running, #in pending #onhold etc
>
> I am trying to build some custom dashboards for our new cluster and this
> would be a really useful set of metrics to gather and display.
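The answer is cut off here; one simple way to get such counts (the awk/jq pipeline is only an illustration, not from the original reply) is:
squeue -h -o '%T' | sort | uniq -c \
  | awk '{printf("{\"state\":\"%s\",\"count\":%s}\n", $2, $1)}' | jq -s .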
Hi,
On 7/06/2021 04:33, David Schanzenbach wrote:
> In our .rpmmacros file we use, the following option is set:
> %_with_slurmrestd 1
You also need libjwt: https://bugs.schedmd.com/show_bug.cgi?id=4
Ward
On 27/05/2021 08:19, Loris Bennett wrote:
> Thanks for the detailed explanations. I was obviously completely
> confused about what MUNGE does. Would it be possible to say, in very
> hand-waving terms, that MUNGE performs a similar role for the access of
> processes to nodes as SSH does for the ac
Hi Ole,
On 16/04/2021 14:23, Ole Holm Nielsen wrote:
> Question: Does anyone have experiences with this type of scenario? Any
> good ideas or suggestions for other methods for data migration?
We once did something like that.
Basically, it worked like this:
- Process is kicked off per use
Hi Simone,
On 9/04/2021 18:03, Simone Riggi wrote:
> All of them are working.
> So in this case the only requirement for a user is having the read/write
> permission on the socket?
Correct. The authentication is done via the socket, as it tells you which user is connecting.
> My goal at the end would be to let a Do
Hi Simone,
On 8/04/2021 23:23, Simone Riggi wrote:
> $ scontrol token lifespan=7200 username=riggi
>
> How can I configure and test the other auth method (local)? I am using
> jwt at the moment.
> I would like a user to be always authorized to use the rest API.
local means socket (so you don't
Hi Simone,
On 8/04/2021 15:53, Simone Riggi wrote:
> - I see effectively that --with jwt is not listed. I wonder how to build
> (using rpmbuild) slurm auth plugins?
> In general I didn't understand from the doc what plugins slurmrestd
> expects by default and where it searches it. From -a opti
Hi Ole,
On 8/04/2021 10:09, Ole Holm Nielsen wrote:
> On 4/8/21 9:50 AM, Simone Riggi wrote:
>>
>> rpmbuild -ta slurm-20.11.5.tar.bz2 --with mysql --with slurmrestd
>> --with jwt
>
> I don't see this "--with jwt" in the slurm.spec file:
It's not yet there: https://bugs.schedmd.com/show_bug.cgi?i
Hi Simone,
On 8/04/2021 09:50, Simone Riggi wrote:
> where /etc/slurm/slurmrestd.conf
>
> include /etc/slurm/slurm.conf
> AuthType=auth/jwt
Did you add a key?
AuthAltParameters=jwt_key=/etc/slurm/jwt.key
It needs to be present on the slurmdbd and slurmctld nodes.
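Generating the key itself is a one-liner (path as above; ownership and permissions are my suggestion, following the JWT docs):
dd if=/dev/random of=/etc/slurm/jwt.key bs=32 count=1
chown slurm:slurm /etc/slurm/jwt.key
chmod 0600 /etc/slurm/jwt.key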
Ward
Hi Prentice,
On 8/03/2021 22:02, Prentice Bisbal wrote:
> I have a very hetergeneous cluster with several different generations of
> AMD and Intel processors, we use this method quite effectively.
Could you elaborate a bit more on how you manage that? Do you force your
users to pick a feature? W
Hi,
On 5/03/2021 11:29, Alberto Morillas, Angelines wrote:
> I know that when I send a job, with scontrol I can get the path and the
> name of the script used to send this job, but normally the users change
> their scripts and sometimes everything goes wrong after that, so is there any
> possibility to rep
On 15/12/2020 17:48, Olaf Gellert wrote:
So munge seems to work as far as I can say. What else does
slurm use munge for? Are hostnames part of the authentication?
Do I have to wonder about the time "Thu Jan 01 01:00:00 1970"
I'm not an expert but I know that hostnames are part of munge
authentic