[slurm-users] Re: [EXT] Re: slurm_pam_adopt module not working

2025-06-16 Thread William Brown via slurm-users
You say that you modified the file in a different way. It may be worth checking file permissions, as some security functions ignore files that don't have the required permissions. That said, that would show in the journal/logs. William On Tue, 17 Jun 2025, 06:24 Ratnasamy, Fritz v

[slurm-users] Re: SLURM configuration for LDAP users

2024-02-04 Thread William Brown via slurm-users
We use Active Directory and NFSv4 and I think that we have some instructions for setting it up on CentOS 7. It was quite involved and does require that the directory service returns UID and GID information, so we have populated the RFC2307 fields in AD. This is required for munge to work. W

Re: [slurm-users] cpus-per-task behaviour of srun after 22.05

2023-10-22 Thread William Brown
probably no 'right way' as it depends so much on the program being run. William Brown On Sun, 22 Oct 2023, 17:51 Jason Simms, wrote: > Hello Michael, > > I don't have an elegant solution, but I'm writing mostly to +1 this. I > didn't catch this in the release n

Re: [slurm-users] Submitting jobs from machines outside the cluster

2023-08-27 Thread William Brown
could submit jobs to various job runners including Slurm. The galaxy node definitely didn't run any slurm daemons. I think you do need a common authentication system between the submitting node and the cluster, but that may just be what I'm used to. William Brown On Sun, 27 Aug 202

Re: [slurm-users] Get Job Array information in Epilog script

2023-03-17 Thread William Brown
We create the temporary directories using SLURM_JOB_ID, and that works fine with Job Arrays so far as I can see. Don't you have a problem if a user has multiple jobs on the same node? William On Fri, 17 Mar 2023 at 11:17, Timo Rothenpieler wrote: > > Hello! > > I'm currently facing a bit of an
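The per-job temporary directory approach described above can be sketched as a pair of prolog/epilog helpers. This is only an illustration: the `SCRATCH_BASE` variable, the `job_` prefix and the function names are assumptions, not taken from the post. Each array task has its own `SLURM_JOB_ID`, which is why keying the directory on it also works for job arrays.

```shell
#!/bin/bash
# Sketch only: SCRATCH_BASE, the "job_" prefix and these function names
# are hypothetical, chosen for illustration.

# Build the per-job directory path from a job ID.
job_tmpdir() {
    printf '%s/job_%s\n' "${SCRATCH_BASE:-/tmp}" "$1"
}

# Prolog side: create the directory for the job that is starting.
prolog_create() {
    mkdir -p "$(job_tmpdir "$1")"
}

# Epilog side: remove only this job's directory, so other jobs of the
# same user on the same node are left untouched.
epilog_remove() {
    rm -rf -- "$(job_tmpdir "$1")"
}
```

In a real prolog or epilog script the argument would be `"$SLURM_JOB_ID"`, which Slurm sets in the script's environment.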

Re: [slurm-users] slurm_persist_conn_open_without_init: failed to open persistent connection to host

2022-11-30 Thread William Brown
If this is a single host machine I suggest checking the /etc/hosts file to make sure that ‘mannose’ is listed as you expect. It is generally advised to use FQDNs for host names; the fact that the message “connection to host:mannose:6819: Connection refused” used a short name may mean that in a

Re: [slurm-users] Managing partition resources

2022-08-31 Thread William Brown
that cannot be exclusive such as IO to storage. We have used the --spread-job option with some success but I think it spreads the jobs of a single sbatch file rather than causing a new job to scale horizontally. I'm sure others know better. William Brown On Wed, 31 Aug 2022, 18:31 Aleja

Re: [slurm-users] nodes lingering in completion

2022-04-01 Thread William Brown
To process the epilog a Bash process must be created so perhaps look at .bashrc. Try timing running the epilog yourself on a compute node. I presume it is owned by an account local to the compute nodes, not a directory service account? William On Fri, 1 Apr 2022, 17:25 Henderson, Brent, wrote:

Re: [slurm-users] work with sensitive data

2021-12-17 Thread William Brown
I realise not helpful with Lustre but we are using NFSv4 with krb5p mounts to encrypt in flight. Also AUKS to make the Kerberos tickets available to the compute nodes, an idea from CERN. All our nodes are AD integrated, so if the user is authenticated by AD they can access the data, and not other

Re: [slurm-users] SLURM on AWS via Terraform

2021-04-19 Thread William Brown
Try https://github.com/clusterinthecloud William On Mon, 19 Apr 2021, 17:24 Nicholas Yue, wrote: > Hi, > > I am looking for information on how it might be possible to spin up an > AWS SLURM cluster via Terraform. > > Thank you in advance. > > Cheers > -- > Nicholas Yue > Graphics - Arnold,

Re: [slurm-users] R jobs crashing when run in parallel

2021-03-29 Thread William Brown
Maybe you have run out of file handles. William On Mon, 29 Mar 2021, 17:36 Patrick Goetz, wrote: > Could this be a function of the R script you're trying to run, or are > you saying you get this error running the same script which works at > other times? > > On 3/29/21 7:47 AM, Simon Andrews wr
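A quick way to test the file-handle theory is to look at the limits from inside the job environment, since they can differ from those of a login shell. The sketch below assumes a Linux node; the variable names are illustrative only.

```shell
#!/bin/bash
# Soft limit on open file descriptors for this shell and its children.
soft_limit=$(ulimit -n)
echo "soft open-file limit: $soft_limit"

# System-wide view on Linux: allocated handles, unused handles, maximum.
if [ -r /proc/sys/fs/file-nr ]; then
    read -r allocated unused maximum < /proc/sys/fs/file-nr
    echo "system file handles: allocated=$allocated unused=$unused max=$maximum"
fi
```

Running this as a job step on the affected node shows the limit the R processes actually inherit.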

Re: [slurm-users] Slurm version 20.11.5 is now available

2021-03-19 Thread William Brown
We build with CSI hardened nodes and /tmp is marked to block execution. It causes occasional frustration but it would be important to be able to redirect to a file system that allowed execution. William On Fri, 19 Mar 2021, 13:28 Paul Edmon, wrote: > I was about to ask this as well as we use /s

Re: [slurm-users] SLURM slurmctld error on Ubuntu20.04 starting through systemctl

2021-03-17 Thread William Brown
I can't immediately check what I do with Slurm but in several systemd files I create sub folders of /var/run and set their ownership the same as the service will run under. I use CentOS (for now!). I can post an actual service startup file in daylight if useful. William On Wed, 17 Mar 2021,

Re: [slurm-users] Cluster nodes on multiple cluster networks

2021-01-22 Thread William Brown
I think there would be no reason why a slurm node will care about traffic on multiple interfaces as long as your configuration is set to listen on them, e.g. no firewalld rules in the way restricting traffic to the private network. William From: slurm-users On Behalf Of Sajesh Singh Sen

Re: [slurm-users] pam_slurm_adopt always claims now active jobs even when they do

2021-01-15 Thread William Brown
I encountered the same problem, and as with munge I created a .te file that can be built to create a policy to add to the compute nodes to fix this: my-pam_slurm_adopt.te: --- module my-pam_slurm_adopt 1.0; require {

Re: [slurm-users] pam_slurm_adopt always claims now active jobs even when they do

2020-10-29 Thread William Brown
That is interesting as I run with SElinux enforcing. I will do some more testing of attaching by ssh to nodes with running jobs. William On Thu, 29 Oct 2020, 11:58 Paul Raines, wrote: > The debugging was useful. The problem turned out to be that I am running > with SELINUX enabled due to corp

Re: [slurm-users] unable to run on all the logical cores

2020-10-11 Thread William Brown
I use the SelectTypeParameters=CR_CPU. So, is there a config to tune, an option to use in "sbatch" to achieve the same result, or should I rather launch 20 jobs per node and have each job split in two internally (using "parallel" or "future" for example)? On Th

Re: [slurm-users] unable to run on all the logical cores

2020-10-08 Thread William Brown
R is single threaded. On Thu, 8 Oct 2020, 07:44 Diego Zuccato, wrote: > Il 08/10/20 08:19, David Bellot ha scritto: > > > good spot. At least, scontrol show job is now saying that each job only > > requires one "CPU", so it seems all the cores are treated the same way > now. > > Though I still h

Re: [slurm-users] IPv6 for slurmd and slurmctld

2020-05-01 Thread William Brown
For some services that display of 0.0.0.0 does include IPv6, although it is counter-intuitive. Try to see if you can connect to it using the IPv6 address. William On Fri, 1 May 2020 at 16:35, Thomas Schäfer wrote: > Hi, > > is there an switch, option, environment variable, configurable key wo

Re: [slurm-users] Correct way to do sbcast with sbatch

2020-04-18 Thread William Brown
I will admit that I have not used sbcast but from reading the man pages I think that it does not do what you hope. The sbcast command will indeed run on the first allocated node, so the source file must be accessible from there. The man page does say that shared file systems are a better so

Re: [slurm-users] Error buildind rpm on Centos 7

2020-04-07 Thread William Brown
Search the list archive, I had the same and it was because I had MariaDB installed but as the packaging of MariaDB changed I was missing a required RPM. They split it differently and there is another RPM prerequisite. Can't recall the name just now, but search the archive. William On Tue, 7 Apr

Re: [slurm-users] Meaning of --cpus-per-task and --mem-per-cpu when SMT processors are used

2020-03-04 Thread William Brown
What Marcus reports is quite correct. It can be confusing, and Slurm uses 'CPU' I think as a non-specific term to mean 'the smallest assignable compute object'. With SMT enabled that is the thread, and with it disabled it is the core. We were told by the company that installed the cluster at m

Re: [slurm-users] Srun not setting DISPLAY with --x11 for one account

2020-01-24 Thread William Brown
There are differences for X11 between Slurm versions so it may help to know which version you have. I tried some of your commands on our slurm 19.05.3-2 cluster, and interestingly on the session on the compute node I don't see the cookie for the login node: This was with MobaXterm: [user@prdubrv

Re: [slurm-users] sbatch sending the working directory from the controller to the node

2020-01-21 Thread William Brown
The srun man page says: When initiating remote processes srun will propagate the current working directory, unless --chdir= is specified, in which case path will become the working directory for the remote processes. William From: slurm-users On Behalf Of Dean Schulze Sent: 21 Janua

Re: [slurm-users] Slurm 19-05-4-1 and Centos8

2020-01-10 Thread William Brown
be owned by the user and group specified > in User= and Group=." > > Best > Marcus > > On 1/10/20 12:20 PM, William Brown wrote: > > Here is an example of a modified system service file which uses > ExecStartPre to create the directory under /var/run on the fly. T

Re: [slurm-users] Slurm 19-05-4-1 and Centos8

2020-01-10 Thread William Brown
Here is an example of a modified system service file which uses ExecStartPre to create the directory under /var/run on the fly. This is for slurmctld. As /var/run is I think in RAM this creates the folder when the service starts. There are other customisations for our environment in here, bu
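The `ExecStartPre` idea can be sketched as a systemd drop-in rather than a full unit file. This is not the poster's actual file: the directory `/var/run/slurm`, the `slurm:slurm` owner and the drop-in file name are assumptions, and the sketch assumes the unit starts as root (no `User=` line), so that the pre-start commands are allowed to create and chown the directory. To stay harmless it writes the fragment to a temporary directory unless `DROPIN_DIR` is set.

```shell
#!/bin/bash
# Sketch only: paths, owner and file name below are hypothetical.
dropin_dir=${DROPIN_DIR:-$(mktemp -d)}

cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
# Recreate the runtime directory each time the service starts,
# because /var/run does not survive a reboot.
ExecStartPre=/usr/bin/mkdir -p /var/run/slurm
ExecStartPre=/usr/bin/chown slurm:slurm /var/run/slurm
EOF

echo "wrote $dropin_dir/override.conf"
```

To try it for real, the file would go under `/etc/systemd/system/slurmctld.service.d/`, followed by `systemctl daemon-reload` and a restart of the service.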

Re: [slurm-users] Need to execute a binary with arguments on a node

2019-12-18 Thread William Brown
Sometimes the way is to make the shell the binary, e.g. bash -c 'ls -lsh' On Wed, 18 Dec 2019, 18:25 Dean Schulze, wrote: > This is a rookie question. I can use the srun command to execute a simple > command like "ls" or "hostname" on a node. But I haven't found a way to > add arguments lik
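A minimal sketch of the `bash -c` approach follows. It is most useful when the command needs shell features such as pipes or redirection, which must be interpreted on the compute node. The helper name `run_in_shell` is hypothetical, and the local fallback exists only so the example can be tried on a machine without Slurm.

```shell
#!/bin/bash
# Sketch only: run_in_shell is a hypothetical helper name.
run_in_shell() {
    if command -v srun >/dev/null 2>&1; then
        # The quoted string is handed to bash on the allocated node.
        srun --ntasks=1 bash -c "$1"
    else
        # No Slurm here: run the same string locally for illustration.
        bash -c "$1"
    fi
}

run_in_shell 'ls -lsh / | head -n 3'
```

The single quotes matter: they stop the submitting shell from interpreting the pipe, so the whole pipeline runs inside the remote `bash`.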

Re: [slurm-users] slurmd.service fails to register

2019-12-17 Thread William Brown
These are the tests that we use: The following steps can be performed to verify that the software has been properly installed and configured. These should be done as a non-privileged user: • Generate a credential on stdout: $ munge -n • Check if a credential can be loca
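The local part of these checks can be wrapped in a small script. This is a sketch, not the poster's procedure: `check_munge` is a hypothetical name, and it only covers generating a credential and decoding it on the same host.

```shell
#!/bin/bash
# Sketch only: check_munge is a hypothetical wrapper around munge/unmunge.
check_munge() {
    if ! command -v munge >/dev/null 2>&1; then
        echo "munge not found in PATH"
        return 0
    fi
    # Encode a credential and decode it again on the same host.
    if munge -n | unmunge >/dev/null; then
        echo "local encode/decode OK"
    else
        echo "local encode/decode FAILED (is munged running?)"
    fi
}

munge_report=$(check_munge)
echo "$munge_report"
```

To check that two hosts share the same key, the usual variant is `munge -n | ssh somehost unmunge`, where `somehost` is a placeholder for another node in the cluster.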

Re: [slurm-users] Small FreeMem is reported by scontrol

2019-12-16 Thread William Brown
Memory may be in use by running jobs, by tasks running outside the control of Slurm, or possibly by NFS buffer cache or similar. You may need to start an ssh session on the node and look. William On Mon, 16 Dec 2019 at 15:38, Mahmood Naderan wrote: > Hi, > With the following output > >Rea

Re: [slurm-users] slurmdbd.service gives: Unable to initialize auth/munge authentication plugin

2019-12-15 Thread William Brown
That will depend where the rest of the cluster is. If they were in the VPN such as inside a corporate network that you used the VPN to connect to, they might. But if they are elsewhere in your home network, they will not. I think some VPN clients can be configured to be quite open but usually they

[slurm-users] pkgconfig conflict

2019-12-12 Thread William Brown
Version 19.05.3-2 CentOS 7.7 I was wanting to install the slurm-devel RPM that I had built, but I get this transaction check error: $ sudo yum localinstall /home/apps/slurm/19.05/RPMS/slurm-devel-19.05.3-2.el7.x86_64.rpm . . Transaction check error: file /usr/lib64/pkgconfig from install of slu

Re: [slurm-users] Need help with controller issues

2019-12-12 Thread William Brown
I looked back in the list to November when I had the same problem building with MariaDB: >>>> On 11-11-2019 21:23, William Brown wrote: >>>>> I have in fact found the answer by looking harder. >>>>> >>>>> The config.log clearly sho

Re: [slurm-users] Need help with controller issues

2019-12-10 Thread William Brown
The latest MariaDB packaging is different, there is a 3rd RPM needed, as well as the client and developer. Away from my desk but the info is on the MariaDB site. William On Wed, 11 Dec 2019, 05:23 Chris Samuel, wrote: > On Tuesday, 10 December 2019 1:57:59 PM PST Dean Schulze wrote: > > > This

Re: [slurm-users] Environment modules

2019-11-23 Thread William Brown
Agreed, I have just been setting up Lmod on a national compute cluster where I am a non-privileged user and on an internal cluster where I have full rights. It works very well, and Lmod can read the Tcl module files also. The most recent version has some extra features specially for Slurm. An

Re: [slurm-users] Replace SGE by Slurm on running cluster

2019-11-12 Thread William Brown
In my last role we moved from SGE to Slurm. However we did this by using VMs for all the control, login, slurmdbd and MariaDB nodes, so it was easy enough to build a Slurm cluster up to the point where it needed compute nodes. We then removed compute nodes in groups from SGE, reinstalled w

Re: [slurm-users] RPM build error - accounting_storage_mysql.so

2019-11-12 Thread William Brown
>>> Hi William, > >>>> > >>>> Interesting experiences with MariaDB 10.4! I tried to collect the > >>>> instructions from the MariaDB page, but I'm unsure about how to get > >>>> the galera-4 RPM. > >>>>

Re: [slurm-users] RPM build error - accounting_storage_mysql.so

2019-11-11 Thread William Brown
sik.dtu.dk/niflheim/Slurm_installation#build-slurm-rpms Note in particular: > Important: Install the MariaDB (a replacement for MySQL) packages before you > build Slurm RPMs (otherwise some libraries will be missing): > > yum install mariadb-server mariadb-devel /Ole On 11-11-201

[slurm-users] RPM build error - accounting_storage_mysql.so

2019-11-11 Thread William Brown
(pkglib_LTLIBRARIES) pkglib_LTLIBRARIES = accounting_storage_slurmdbd.la So I think that the problem is that the definition of pkglib_LTLIBRARIES is commented out in the accounting_storage_mysql Makefile, hence nothing to build. Is that intended? Is it a consequence of something in my environment? William Brown

Re: [slurm-users] SLURM in Virtual Machine

2019-09-12 Thread William Brown
I built a cluster with Login Node, slurmctld, slurmdbd and MariaDB all on VMs, and the compute nodes all physical. Works fine. Having a VM as login node has the added benefit that anyone who tries to run an application there interactively soon finds that it will not run in small RAM, and in fact

Re: [slurm-users] Accounting: Default Associations for Unknown Accounts

2018-12-20 Thread William Brown
inverse script but that is just a problem of having time. I am looking at using keytab to solve the Kerberos ticket but I haven’t cracked it yet. William Brown Rothamsted Research From: slurm-users On Behalf Of Sam Hawarden Sent: 20 December 2018 23:36 To: Slurm User Community List