[slurm-users] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?
Hi,

On 26/02/2024 09:27, Josef Dvoracek via slurm-users wrote:
> Is anybody using something more advanced that is still understandable by a
> casual user of HPC?

I'm not sure it qualifies, but:

    sbatch --wrap 'screen -D -m'
    srun --jobid <jobid> --pty screen -rd

Or:

    sbatch -J screen --wrap 'screen -D -m'
    srun --jobid $(squeue -n screen -h -o '%A') --pty screen -rd

Ward
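PS: a minimal sketch of how one could wrap the two steps in a helper script; the job name, resources and polling interval below are made-up placeholders to adapt to your site:

    #!/bin/bash
    # Submit a detached screen session as a batch job, then attach to it.
    # Assumes 'screen' is installed on the compute nodes.
    set -euo pipefail

    JOBNAME=interactive_screen

    # Start the screen session inside an allocation (resources are examples).
    jobid=$(sbatch --parsable -J "$JOBNAME" -c 4 --mem=8G -t 4:00:00 \
                   --wrap 'screen -D -m')

    # Wait until the job is actually running before trying to attach.
    while [ "$(squeue -h -j "$jobid" -o %T)" != "RUNNING" ]; do
        sleep 5
    done

    # Attach to the screen session inside the job's allocation.
    srun --jobid "$jobid" --pty screen -rd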
[slurm-users] Re: Slurm Cleaning Up $XDG_RUNTIME_DIR Before It Should?
Hi,

This is systemd, not Slurm. We've also seen it being created and removed; as far as I understood, it is something about the session that systemd cleans up. We've worked around it by adding this to the prolog (in combination with a private tmpfs per job):

    MY_XDG_RUNTIME_DIR=/dev/shm/${USER}
    mkdir -p $MY_XDG_RUNTIME_DIR
    echo "export XDG_RUNTIME_DIR=$MY_XDG_RUNTIME_DIR"

Ward

On 15/05/2024 10:14, Arnuld via slurm-users wrote:
> I am using the latest Slurm. It runs fine for scripts, but if I give it a
> container then it kills it as soon as I submit the job. Is Slurm cleaning up
> $XDG_RUNTIME_DIR before it should? This is the log:
>
> [2024-05-15T08:00:35.143] [90.0] debug2: _generate_patterns: StepId=90.0 TaskId=-1
> [2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command argv[0]=/bin/sh
> [2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command argv[1]=-c
> [2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command argv[2]=crun --rootless=true --root=/run/user/1000/ state slurm2.acog.90.0.-1
> [2024-05-15T08:00:35.167] [90.0] debug: _get_container_state: RunTimeQuery rc:256 output:error opening file `/run/user/1000/slurm2.acog.90.0.-1/status`: No such file or directory
> [2024-05-15T08:00:35.167] [90.0] error: _get_container_state: RunTimeQuery failed rc:256 output:error opening file `/run/user/1000/slurm2.acog.90.0.-1/status`: No such file or directory
> [2024-05-15T08:00:35.167] [90.0] debug: container already dead
> [2024-05-15T08:00:35.167] [90.0] debug3: _generate_spooldir: task:0 pattern:%m/oci-job%j-%s/task-%t/ path:/var/spool/slurmd/oci-job90-0/task-0/
> [2024-05-15T08:00:35.167] [90.0] debug2: _generate_patterns: StepId=90.0 TaskId=0
> [2024-05-15T08:00:35.168] [90.0] debug3: _generate_spooldir: task:-1 pattern:%m/oci-job%j-%s/ path:/var/spool/slurmd/oci-job90-0/
> [2024-05-15T08:00:35.168] [90.0] stepd_cleanup: done with step (rc[0x100]:Unknown error 256, cleanup_rc[0x0]:No error)
> [2024-05-15T08:00:35.275] debug3: in the service_connection
> [2024-05-15T08:00:35.278] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
> [2024-05-15T08:00:35.278] debug2: Processing RPC: REQUEST_TERMINATE_JOB
> [2024-05-15T08:00:35.278] debug: _rpc_terminate_job: uid = 64030 JobId=90
> [2024-05-15T08:00:35.278] debug: credential for job 90 revoked
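PS: for completeness, a sketch of what the full prolog script could look like. This assumes it runs as a TaskProlog (whose stdout lines of the form "export VAR=value" Slurm injects into the task environment); the chmod is an extra precaution we've added here:

    #!/bin/bash
    # Point XDG_RUNTIME_DIR at a per-user directory under /dev/shm instead of
    # the systemd-managed /run/user/<uid>, which may be cleaned up while the
    # job is still running.
    MY_XDG_RUNTIME_DIR=/dev/shm/${USER}
    mkdir -p "$MY_XDG_RUNTIME_DIR"
    chmod 700 "$MY_XDG_RUNTIME_DIR"
    # TaskProlog stdout lines of the form "export VAR=value" end up in the
    # task's environment.
    echo "export XDG_RUNTIME_DIR=$MY_XDG_RUNTIME_DIR"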
[slurm-users] Re: Using sharding
Hi Ricardo,

It should show up like this:

    Gres=gpu:gtx_1080_ti:4(S:0-1),shard:gtx_1080_ti:16(S:0-1)
    CfgTRES=cpu=32,mem=515000M,billing=130,gres/gpu=4,gres/shard=16
    AllocTRES=cpu=8,mem=31200M,gres/shard=1

I can't directly spot any error, however. Our gres.conf is simply:

    AutoDetect=nvml

and in slurm.conf we have:

    AccountingStorageTRES=gres/gpu,gres/shard
    GresTypes=gpu,shard

Did you try restarting Slurm?

Ward
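PS: in case it helps with debugging, this is roughly how one could check the shard setup from the command line; node and partition names are placeholders:

    # Does the node advertise shards in its Gres/TRES lines?
    scontrol show node node001 | grep -Ei 'gres|tres'

    # Request a single shard and inspect what the job actually gets.
    srun -p gpu --gres=shard:1 env | grep -Ei 'cuda|gres|shard'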
[slurm-users] Re: Using sharding
Hi Arnuld,

On 5/07/2024 13:56, Arnuld via slurm-users wrote:
>> It should show up like this:
>> Gres=gpu:gtx_1080_ti:4(S:0-1),shard:gtx_1080_ti:16(S:0-1)
>
> What's the meaning of (S:0-1) here?

The sockets to which the GPUs are associated:

    If GRES are associated with specific sockets, that information will be
    reported. For example if all 4 GPUs on a node are associated with socket
    zero, then "Gres=gpu:4(S:0)". If associated with sockets 0 and 1 then
    "Gres=gpu:4(S:0-1)".

Ward
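PS: if you want to double-check which sockets the GPUs actually sit on (nvml autodetection normally records this for you; the node name is a placeholder):

    # Socket/core binding Slurm has recorded for the GRES:
    scontrol show node node001 | grep -i gres

    # CPU/NUMA affinity of each GPU as reported by the driver:
    nvidia-smi topo -m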
[slurm-users] Re: Access to --constraint= in Lua cli_filter?
Hi Kevin,

On 19/08/2024 08:15, Kevin Buckley via slurm-users wrote:
> If I supply a --constraint= option to an sbatch/salloc/srun, does the arg
> appear inside any object that a Lua CLI Filter could access?

Have a look if you can spot it in:

    function slurm_cli_pre_submit(options, pack_offset)
        env_json = slurm.json_env()
        slurm.log_info("ENV: %s", env_json)
        opt_json = slurm.json_cli_options(options)
        slurm.log_info("OPTIONS: %s", opt_json)
    end

I thought all options could be accessed in the cli filter.

Ward
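PS: to see whether --constraint makes it into that JSON, something along these lines may work; where exactly the log lines end up (assumed here to be on the submitting command's stderr) depends on your cli_filter and logging setup:

    # Quick check: submit a throwaway job with a constraint and grep the
    # cli_filter's OPTIONS dump for it.
    sbatch --constraint=intel --wrap hostname 2>&1 | grep -i constraint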
[slurm-users] Re: A note on updating Slurm from 23.02 to 24.05 & multi-cluster
Hi Bjørn-Helge,

On 26/09/2024 09:50, Bjørn-Helge Mevik via slurm-users wrote:
> Ward Poelmans via slurm-users writes:
>> We hit a snag when updating our clusters from Slurm 23.02 to 24.05. After
>> updating the slurmdbd, our multi-cluster setup was broken until everything
>> was updated to 24.05. We had not anticipated this.
>
> When you say "everything", do you mean all the slurmctlds, or also all
> slurmds?

Yes, the issue was gone only after *everything* was upgraded: the slurmctld, the slurmd daemons and the login nodes.

Ward
[slurm-users] A note on updating Slurm from 23.02 to 24.05 & multi-cluster
Hi all,

We hit a snag when updating our clusters from Slurm 23.02 to 24.05: after updating the slurmdbd, our multi-cluster setup was broken until everything was updated to 24.05. We had not anticipated this, and SchedMD says that fixing it would be a very complex operation.

Hence this warning to everybody planning to update: make sure to quickly update everything else once you've updated the slurmdbd daemon.

Reference: https://support.schedmd.com/show_bug.cgi?id=20931

Ward
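PS: when planning such an update it can help to keep track of which component is already on the new version; a rough sketch (the host name is a placeholder):

    # Version of the client commands on the login node:
    sinfo --version

    # Version the controller reports:
    scontrol show config | grep -i SLURM_VERSION

    # Version of slurmd on a given compute node:
    ssh node001 slurmd -V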