[slurm-users] error no error

2025-02-12 Thread Ricardo Román-Brenes via slurm-users
Hello. Could someone enlighten me as to what this error message is? Feb 13 10:02:00 gpu1 slurmd[573705]: slurmd: error: slurm_msg_sendto: address:port=192.168.9.1:36698 msg_type=8001: No error

[slurm-users] Re: [EXTERNAL] avoid using same GPU by the interactive job

2025-02-12 Thread navin srivastava via slurm-users
Thank you Jesse. I am using Enterprise SLES15SP6 as the OS. I have not introduced the cgroup functionality in my environment. I can think about it and will see if this solution works out. But is there any other way to achieve the same without cgroups? Batch job requests are fine 2 jobs wit

[slurm-users] Re: /etc/passwd sync?

2025-02-12 Thread Mark W. Moorcroft via slurm-users
I made some progress without need for the /etc/passwd sync. Sbatch is working fine on multi-node jobs it appears. Now only salloc runs fail, which I guess is expected behavior without user account sync and ssh key setup. On bare metal OpenHPC, warewulf handles all that for me. So presumably I h

[slurm-users] Re: /etc/passwd sync?

2025-02-12 Thread Cutts, Tim via slurm-users
ParallelCluster isn’t an AWS service. It’s a solution they release as open source which deploys other standard services (EC2, FSx for Lustre etc) using CloudFormation. I have no experience of gov cloud, so I wouldn’t know, but I’d be surprised if it doesn’t allow the use of awscli (which you c

[slurm-users] Re: /etc/passwd sync?

2025-02-12 Thread Mark W. Moorcroft via slurm-users
The only service found in gov cloud with a search for “parallel” is Batch. And the awscli commands are absent there as well. In commercial cloud “parallel” yields many services. Mark Moorcroft Senior Linux Administrator Analytical Mechanics Associates e. mark.w.moorcr...@ama-inc.com

[slurm-users] Re: [EXTERNAL] avoid using same GPU by the interactive job

2025-02-12 Thread Chintanadilok, Jesse via slurm-users
Navin, You can isolate GPUs per job if you have cgroups set up properly. What OS are you using? Newer OSes will support cgroups v2 out of the box, but if necessary you can continue using v1; this workflow should be applicable for both. Add ConstrainDevices=yes to your cgroup.conf. This is what t
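A minimal sketch of the kind of setup this reply points at, not the poster's actual configuration. Only ConstrainDevices=yes comes from the message itself. The slurm.conf and gres.conf lines, the NVIDIA device paths, and the two-GPU count (taken from the original question in this thread) are assumptions for illustration.

```
# cgroup.conf
# From the reply: restrict each job to the devices it was allocated.
ConstrainDevices=yes

# slurm.conf (assumed: cgroup plugins must be enabled for the constraint to apply)
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
GresTypes=gpu

# gres.conf (assumed: two NVIDIA GPUs per node; device paths are hypothetical)
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
```

With device constraint active, a job only sees a GPU if it requested one, for example with --gres=gpu:1 on both sbatch and salloc/srun. An interactive job started without a GPU request would then have no access to the devices rather than sharing one already in use.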

[slurm-users] Re: /etc/passwd sync?

2025-02-12 Thread Cutts, Tim via slurm-users
Can you not use Parallel Cluster, rather than the Parallel Computing Service? Parallel Cluster is just EC2 autoscaling and some shared storage through CloudFormation/CDK and a command line interface. I don’t think there are any secret special services? Tim -- Tim Cutts Senior Director, R&D I

[slurm-users] avoid using same GPU by the interactive job

2025-02-12 Thread navin srivastava via slurm-users
Hi, I am facing an issue in my environment where the batch job and the interactive job use the same GPU. Each server has 2 GPUs. When 2 batch jobs are running it works fine and they use the 2 different GPUs. But if one batch job is running and another job is submitted interactively then it uses the same GP