[slurm-users] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-26 Thread Ward Poelmans via slurm-users

Hi,

On 26/02/2024 09:27, Josef Dvoracek via slurm-users wrote:


Is anybody using something more advanced that is still understandable by a 
casual HPC user?


I'm not sure it qualifies but:

sbatch --wrap 'screen -D -m'
srun --jobid <jobid> --pty screen -rd

Or:
sbatch -J screen --wrap 'screen -D -m'
srun --jobid $(squeue -n screen -h -o '%A') --pty screen -rd
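
If it helps, here is a rough helper wrapping the two steps in one script (just 
a sketch; the job name, resource limits and the polling loop are made up, 
adjust to taste):

#!/bin/bash
# Submit a detached screen session as a batch job, wait for it to start,
# then attach to it with srun. The resource options are only examples.
set -euo pipefail

JOBID=$(sbatch --parsable -J screen --time=8:00:00 --cpus-per-task=4 \
        --wrap 'screen -D -m')

# Wait until the job is actually running before trying to attach
while [ "$(squeue -j "$JOBID" -h -o '%T')" != "RUNNING" ]; do
    sleep 5
done

srun --jobid "$JOBID" --pty screen -rd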

Ward




[slurm-users] Re: Slurm Cleaning Up $XDG_RUNTIME_DIR Before It Should?

2024-05-15 Thread Ward Poelmans via slurm-users

Hi,

This is systemd, not Slurm. We've also seen it being created and removed; as 
far as I understood, it is systemd cleaning up the user session. We've worked 
around it by adding this to the prolog:

# Point XDG_RUNTIME_DIR at a per-user directory that systemd does not manage
MY_XDG_RUNTIME_DIR=/dev/shm/${USER}
mkdir -p $MY_XDG_RUNTIME_DIR
echo "export XDG_RUNTIME_DIR=$MY_XDG_RUNTIME_DIR"

(in combination with private tmpfs per job).
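
(For anyone reproducing this: a minimal sketch of the wiring, assuming the 
snippet above runs as a TaskProlog — slurmd picks up "export VAR=value" lines 
printed by the task prolog and adds them to the task environment. The script 
path below is just an example.)

# slurm.conf
TaskProlog=/etc/slurm/task_prolog.sh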

Ward

On 15/05/2024 10:14, Arnuld via slurm-users wrote:

I am using the latest Slurm. It runs fine for scripts, but if I give it a 
container it kills it as soon as I submit the job. Is Slurm cleaning up 
$XDG_RUNTIME_DIR before it should? This is the log:

[2024-05-15T08:00:35.143] [90.0] debug2: _generate_patterns: StepId=90.0 
TaskId=-1
[2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command 
argv[0]=/bin/sh
[2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command 
argv[1]=-c
[2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command 
argv[2]=crun --rootless=true --root=/run/user/1000/ state slurm2.acog.90.0.-1
[2024-05-15T08:00:35.167] [90.0] debug:  _get_container_state: RunTimeQuery 
rc:256 output:error opening file `/run/user/1000/slurm2.acog.90.0.-1/status`: 
No such file or directory

[2024-05-15T08:00:35.167] [90.0] error: _get_container_state: RunTimeQuery 
failed rc:256 output:error opening file 
`/run/user/1000/slurm2.acog.90.0.-1/status`: No such file or directory

[2024-05-15T08:00:35.167] [90.0] debug:  container already dead
[2024-05-15T08:00:35.167] [90.0] debug3: _generate_spooldir: task:0 
pattern:%m/oci-job%j-%s/task-%t/ path:/var/spool/slurmd/oci-job90-0/task-0/
[2024-05-15T08:00:35.167] [90.0] debug2: _generate_patterns: StepId=90.0 
TaskId=0
[2024-05-15T08:00:35.168] [90.0] debug3: _generate_spooldir: task:-1 
pattern:%m/oci-job%j-%s/ path:/var/spool/slurmd/oci-job90-0/
[2024-05-15T08:00:35.168] [90.0] stepd_cleanup: done with step 
(rc[0x100]:Unknown error 256, cleanup_rc[0x0]:No error)
[2024-05-15T08:00:35.275] debug3: in the service_connection
[2024-05-15T08:00:35.278] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-05-15T08:00:35.278] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-05-15T08:00:35.278] debug:  _rpc_terminate_job: uid = 64030 JobId=90
[2024-05-15T08:00:35.278] debug:  credential for job 90 revoked



[slurm-users] Re: Using sharding

2024-07-04 Thread Ward Poelmans via slurm-users

Hi Ricardo,

It should show up like this:

   Gres=gpu:gtx_1080_ti:4(S:0-1),shard:gtx_1080_ti:16(S:0-1)

   CfgTRES=cpu=32,mem=515000M,billing=130,gres/gpu=4,gres/shard=16
   AllocTRES=cpu=8,mem=31200M,gres/shard=1


I can't directly spot any error, however. Our gres.conf is simply 
`AutoDetect=nvml`.

AccountingStorageTRES=gres/gpu,gres/shard
GresTypes=gpu,shard

Did you try restarting slurm?
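
For completeness, a minimal sketch of the remaining pieces (node name and 
counts are just examples; the shard count is declared on the node's Gres line 
in slurm.conf):

NodeName=node001 Gres=gpu:gtx_1080_ti:4,shard:16

A job then requests a slice of a GPU with something like:

srun --gres=shard:1 --cpus-per-task=8 --mem=31G --pty bash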

Ward




[slurm-users] Re: Using sharding

2024-07-05 Thread Ward Poelmans via slurm-users

Hi Arnuld,

On 5/07/2024 13:56, Arnuld via slurm-users wrote:

It should show up like this:

     Gres=gpu:gtx_1080_ti:4(S:0-1),shard:gtx_1080_ti:16(S:0-1)


What's the meaning of (S:0-1) here?


The sockets to which the GPUs are associated:

If GRES are associated with specific sockets, that information will be reported. For example if all 
4 GPUs on a node are all associated with socket zero, then "Gres=gpu:4(S:0)". If 
associated with sockets 0 and 1 then "Gres=gpu:4(S:0-1)".
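
You can check the layout on a node with something like (node name is just an 
example):

scontrol show node node001 | grep Gres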


Ward




[slurm-users] Re: Access to --constraint= in Lua cli_filter?

2024-08-19 Thread Ward Poelmans via slurm-users

Hi Kevin,

On 19/08/2024 08:15, Kevin Buckley via slurm-users wrote:

If I supply a

   --constraint=

option to an sbatch/salloc/srun, does the arg appear inside
any object that a Lua CLI Filter could access?


Have a look whether you can spot it in the output of something like this:

function slurm_cli_pre_submit(options, pack_offset)
  -- Dump the environment and all parsed CLI options to the log so you
  -- can check whether --constraint shows up.
  local env_json = slurm.json_env()
  slurm.log_info("ENV: %s", env_json)
  local opt_json = slurm.json_cli_options(options)
  slurm.log_info("OPTIONS: %s", opt_json)
  return slurm.SUCCESS
end

I thought all options could be accessed in the cli filter.
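
A quick way to check is something along these lines (rough sketch; I'm 
assuming the log_info output ends up on the client's stderr, and 'intel' is 
just a placeholder feature name):

sbatch --constraint=intel --wrap 'hostname' 2>&1 | grep 'OPTIONS:'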

Ward




[slurm-users] Re: A note on updating Slurm from 23.02 to 24.05 & multi-cluster

2024-09-26 Thread Ward Poelmans via slurm-users

Hi Bjørn-Helge,

On 26/09/2024 09:50, Bjørn-Helge Mevik via slurm-users wrote:

Ward Poelmans via slurm-users  writes:


We hit a snag when updating our clusters from Slurm 23.02 to
24.05. After updating the slurmdbd, our multi cluster setup was broken
until everything was updated to 24.05. We had not anticipated this.


When you say "everything", do you mean all the slurmctlds, or also all slurmds?


Yes, the issue was gone after *everything* was upgraded: the slurmctld, slurmd 
and login nodes.


Ward




[slurm-users] A note on updating Slurm from 23.02 to 24.05 & multi-cluster

2024-09-25 Thread Ward Poelmans via slurm-users

Hi all,

We hit a snag when updating our clusters from Slurm 23.02 to 24.05. After 
updating the slurmdbd, our multi cluster setup was broken until everything was 
updated to 24.05. We had not anticipated this.

SchedMD says that fixing it would be a very complex operation.

Hence this warning to everybody planning to update: make sure to update 
everything else quickly once you've updated the slurmdbd daemon.

Reference: https://support.schedmd.com/show_bug.cgi?id=20931



Ward

