Re: [slurm-users] Cleanup of job_container/tmpfs

2023-03-06 Thread Michael Jennings
On Monday, 06 March 2023, at 10:15:22 (+0100), Niels Carl W. Hansen wrote: Seems there still are some issues with the autofs - job_container/tmpfs functionality in Slurm 23.02. If the required directories aren't mounted on the allocated node(s) before jobstart, we get: slurmstepd: error: coul

Re: [slurm-users] Cleanup of job_container/tmpfs

2023-03-01 Thread Michael Jennings
On Wednesday, 01 March 2023, at 10:28:24 (+0100), Ole Holm Nielsen wrote: but there may be some significant improvements included in 23.02 TL;DR: I can vouch for this. The primary problem with the interaction between the new namespace code and the automounter daemon was simply that the shar

Re: [slurm-users] Plugins failing on Slurm v22 build Rocky Linux 9

2022-11-29 Thread Michael Jennings
On Tuesday, 29 November 2022, at 08:44:48 (+), Mark Holliman wrote: I mentioned Fedora 9 and CentOS 9 (Stream) simply because they tend to be compatible, and something that works on them is likely to work on Rocky9. RHEL 8.x is based on Fedora 28. RHEL 9.x is based on Fedora 34 via CentOS

Re: [slurm-users] Is sacct not handling quotes properly?

2022-05-04 Thread Michael Jennings
On Wednesday, 04 May 2022, at 10:00:57 (-0700), David Henkemeyer wrote: I am seeing what I think might be a bug with sacct. When I do the following: *> sbatch --export=NONE --wrap='uname -a' --exclusive* *Submitted batch job 2869585* Then, I ask sacct for the SubmitLine, as such: *> sacc

Re: [slurm-users] what is the elegant way to drain node from epilog with self-defined reason?

2022-05-03 Thread Michael Jennings
On Tuesday, 03 May 2022, at 15:46:38 (+0800), taleinterve...@sjtu.edu.cn wrote: We need to detect some problem at job end timepoint, so we write some detection script in slurm epilog, which should drain the node if check is not passed. I know exit epilog with non-zero code will make slurm autom

Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-27 Thread Michael Jennings
On Thursday, 27 May 2021, at 08:19:14 (+0200), Loris Bennett wrote: Thanks for the detailed explanations. I was obviously completely confused about what MUNGE does. Would it be possible to say, in very hand-waving terms, that MUNGE performs a similar role for the access of processes to nodes a

Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-25 Thread Michael Jennings
On Tuesday, 25 May 2021, at 14:09:54 (+0200), Loris Bennett wrote: > I think my main problem is that I expect logging in to a node with a job > to work with pam_slurm_adopt but without any SSH keys. My assumption > was that MUNGE takes care of the authentication, since users' jobs start > on node

Re: [slurm-users] NHC and slurm

2021-04-20 Thread Michael Jennings
On Thursday, 15 April 2021, at 10:58:31 (-0300), Heitor wrote: > I'm trying to setup NHC[0] for our Slurm cluster, but I'm not > getting it to work properly. Just for future reference, NHC has its own mailing lists, and even though your question does relate to Slurm tangentially, it's really an N

Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-02-03 Thread Michael Jennings
On Wednesday, 03 February 2021, at 18:06:27 (+), Philip Kovacs wrote: > I am familiar with the package rename process and it would not have > the effect you might think it would. If I provide an upgrade path to > a new package name, e.g. slurm-xxx, the net effect would be to tell > yum or dnf-ma

Re: [slurm-users] SLES 15 rpmbuild from 20.02.5 tarball wants munge-libs: system munge RPMs don't provide it

2020-10-20 Thread Michael Jennings
On Tuesday, 20 October 2020, at 15:49:25 (+0800), Kevin Buckley wrote: > On 2020/10/20 11:50, Christopher Samuel wrote: > > > > I forgot I do have access to a SLES15 SP1 system, that has: > > > > # rpm -q libmunge2 --provides > > libmunge.so.2()(64bit) > > libmunge2 = 0.5.14-4.9.1 > > libmunge2(

Re: [slurm-users] slurm password -what is impact when changing it

2020-09-15 Thread Michael Jennings
On Monday, 14 September 2020, at 13:46:27 (+), Braun, Ruth A wrote: > Is there any issue if I set/change the slurm account password? I'm running > 19.05.x > > Current state is locked but I have to reset it periodically: > # passwd --status slurm > slurm LK 2014-02-03 -1 -1 -1 -1 (Password

Re: [slurm-users] ssh-keys on compute nodes?

2020-06-10 Thread Michael Jennings
On Tuesday, 09 June 2020, at 15:26:36 (-0400), Prentice Bisbal wrote: > Host-based security is not considered as safe as user-based security, so > should only be used in special cases. That's a pretty significant claim, and certainly one that would need to be backed up with evidence, references,

Re: [slurm-users] ssh-keys on compute nodes?

2020-06-09 Thread Michael Jennings
On Tuesday, 09 June 2020, at 21:27:27 (+0200), Ole Holm Nielsen wrote: > Thanks very much, this is really cool! I need to look into the > HostbasedAuthentication for intra-cluster MPI tasks spawned by SSH (not > using srun). > > Presumably external access still needs to use SSH authorized keys?

Re: [slurm-users] ssh-keys on compute nodes?

2020-06-09 Thread Michael Jennings
On Tuesday, 09 June 2020, at 12:43:34 (+0200), Ole Holm Nielsen wrote: > in which case you need to set up SSH authorized_keys files for such > users. I'll admit that I didn't know about this until I came to LANL, but there's actually a much better alternative than having to create user key pairs

Re: [slurm-users] slurm-20.02.1-1 failed rpmbuild with error File not found

2020-04-21 Thread Michael Jennings
They do something even better: They allow the user/customer to make the choice in the spec file! :-) And to be clear, they don't expect users to be experts in building packages; that's why their Quick-Start Guide (https://slurm.schedmd.com/quickstart_admin.html) is as thorough as it is; it even h

Re: [slurm-users] RHEL8 support - Missing Symbols in SelectType libraries

2019-11-01 Thread Michael Jennings
On Friday, 01 November 2019, at 10:41:26 (-0700), Brian Andrus wrote: > That's pretty much how I did it too. > > But... > > When you try to run slurmd, it chokes on the missing symbols issue. I don't yet have a full RHEL8 cluster to test on, and this isn't really my area of expertise, but have

Re: [slurm-users] RHEL8 support - Missing Symbols in SelectType libraries

2019-11-01 Thread Michael Jennings
On Friday, 01 November 2019, at 11:37:37 (-0600), Michael Jennings wrote: > I build with Mezzanine, but the equivalent would roughly be this: > > rpmbuild -ts slurm-19.05.3-2.tar.bz2 > cat the_above_diff.patch | (cd ~/rpmbuild/SPECS ; patch -p0) > rpmbuild --with x11 --with

Re: [slurm-users] RHEL8 support - Missing Symbols in SelectType libraries

2019-11-01 Thread Michael Jennings
On Tuesday, 29 October 2019, at 15:11:38 (+), Christopher Benjamin Coffey wrote: > Brian, I've actually just started attempting to build slurm 19 on > centos 8 yesterday. As you say, there are packages missing now from > repos like: They're not missing; they're just harder to get at now, for

Re: [slurm-users] Execute scripts on suspend and cancel

2019-10-17 Thread Michael Jennings
On Thursday, 17 October 2019, at 16:50:29 (+), Goetz, Patrick G wrote: > Are applications even aware when they've been hit by a SIGSTP? This > idea of a license being released under these circumstances just > seems very unlikely. No, which is why SIGSTOP cannot be caught. The action is carr

Re: [slurm-users] Heterogeneous HPC

2019-09-19 Thread Michael Jennings
On Thursday, 19 September 2019, at 19:27:38 (-0400), Fulcomer, Samuel wrote: > I obviously haven't been keeping up with any security concerns over the use > of Singularity. In a 2-3 sentence nutshell, what are they? So before I do that, if you have a few minutes, I do think you'll find it worth y

Re: [slurm-users] Heterogeneous HPC

2019-09-19 Thread Michael Jennings
On Thursday, 19 September 2019, at 20:00:40 (+), Goetz, Patrick G wrote: > On 9/19/19 8:22 AM, Thomas M. Payerle wrote: > > one of our clusters > > is still running RHEL6, and while containers based on Ubuntu 16, > > Debian 8, or RHEL7 all appear to work properly, > > containers based on Ubunt

Re: [slurm-users] Heterogeneous HPC

2019-09-19 Thread Michael Jennings
On Friday, 20 September 2019, at 00:03:28 (+0430), Mahmood Naderan wrote: > For the replies. Matlab was an example. I would also like to create > to containers for OpenFoam with different versions. Then a user can > choose what he actually wants. All modern container runtimes support the OCI stan

Re: [slurm-users] Heterogeneous HPC

2019-09-19 Thread Michael Jennings
On Thursday, 19 September 2019, at 12:38:43 (+0430), Mahmood Naderan wrote: > The question is not directly related to Slurm, but is actually related to > the people in this community. > > For heterogeneous environments, where different operating systems, > application and library versions are nee

Re: [slurm-users] How can jobs request a minimum available (free) TmpFS disk space?

2019-09-10 Thread Michael Jennings
On Monday, 02 September 2019, at 20:02:57 (+0200), Ole Holm Nielsen wrote: > We have some users requesting that a certain minimum size of the > *Available* (i.e., free) TmpFS disk space should be present on nodes > before a job should be considered by the scheduler for a set of > nodes. > > I bel

Re: [slurm-users] X11 forwarding and VNC?

2019-03-25 Thread Michael Jennings
On Monday, 25 March 2019, at 12:57:46 (+), Ryan Novosielski wrote: > If the error message is accurate, the fix may be having the VNC > server not set DISPLAY equal to localhost:10.0 or similar as SSH > normally does these days, but to configure it to set DISPLAY to > fqdn:10.0. We had to do so

Re: [slurm-users] x11 forwarding not available?

2018-10-16 Thread Michael Jennings
On Tuesday, 16 October 2018, at 09:30:13 (-0400), Dave Botsch wrote: > Hrm... it looks like the default install of OHPC went with DHA keys > instead: > > .ssh]$ cat config > # Added by Warewulf 2018-10-08 > Host * >IdentityFile ~/.ssh/cluster >StrictHostKeyChecking=no > $ file cluster >

Re: [slurm-users] How do you orchestrate SLURM operations, what tools do you use?

2018-08-15 Thread Michael Jennings
On Wednesday, 15 August 2018, at 10:01:19 (-0400), Paul Edmon wrote: > On 08/14/2018 05:16 AM, Pablo Llopis wrote: > > > >Integration with a possible built-in healthcheck is also something > >to consider, as the orchestration logic would need to take care of > >disabling the healthcheck funcionali

Re: [slurm-users] How to access environment variables in submit script?

2018-05-10 Thread Michael Jennings
On Thursday, 10 May 2018, at 10:09:22 (-0400), Paul Edmon wrote: > Not that I am aware of.  Since the header isn't really part of the > script bash doesn't evaluate them as far as I know. > > On 05/10/2018 09:19 AM, Dmitri Chebotarov wrote: > > > >Is it possible to access environment variables in

Re: [slurm-users] Memory oversubscription and sheduling

2018-05-10 Thread Michael Jennings
On Thursday, 10 May 2018, at 20:02:37 (+1000), Chris Samuel wrote: > For instance there's the LBNL Node Health Check (NHC) system that plugs into > both Slurm and Torque. > > https://slurm.schedmd.com/SUG14/node_health_check.pdf > > https://github.com/mej/nhc > > At ${JOB-1} we would run our i

Re: [slurm-users] scancel a list of jobs

2018-05-08 Thread Michael Jennings
On Tuesday, 08 May 2018, at 17:00:33 (+), Chester Langin wrote: > Is there no way to scancel a list of jobs? Like from job 120 to job > 150? I see cancelling by user, by pending, and by job name. --Chet If you're using BASH, you can just do: scancel {120..150} In other POSIX-compatible s
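
For reference, the brace expansion in the reply above is done by Bash itself: the shell turns {120..150} into the individual job IDs before scancel ever runs. The rest of the reply is cut off in this listing, so the following is only a sketch of one way to get the same list in a shell without brace expansion, not necessarily what the poster went on to suggest. The helper name job_range is hypothetical, and echo is prepended so the command is printed rather than executed.

```shell
# job_range is a hypothetical helper, not part of Slurm.
# It prints the integers from $1 to $2 (inclusive) on one line,
# separated by spaces, using only POSIX shell features.
job_range() {
    i=$1
    out=""
    while [ "$i" -le "$2" ]; do
        # Append the next ID, adding a space only between IDs.
        out="${out:+$out }$i"
        i=$((i + 1))
    done
    printf '%s\n' "$out"
}

# In Bash, the form from the post needs no helper:
#   scancel {120..150}
# Elsewhere, build the list first. Remove "echo" to actually cancel the jobs.
echo scancel $(job_range 120 150)
```

The command substitution is left unquoted on purpose so that each job ID becomes a separate argument to scancel.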

Re: [slurm-users] What's the best way to suppress core dump files from jobs?

2018-03-21 Thread Michael Jennings
On Wednesday, 21 March 2018, at 20:14:22 (+0100), Ole Holm Nielsen wrote: > Thanks for your friendly advice! I keep forgetting about Systemd > details, and your suggestions are really detailed and useful for > others! Do you mind if I add your advice to my Slurm Wiki page? Of course not! Espec

Re: [slurm-users] What's the best way to suppress core dump files from jobs?

2018-03-21 Thread Michael Jennings
On Wednesday, 21 March 2018, at 12:08:00 (+0100), Ole Holm Nielsen wrote: > One working solution is to modify the slurmd Systemd service file > /usr/lib/systemd/system/slurmd.service to add a line: > LimitCORE=0 This is a bit off-topic, but I see this a lot, so I thought I'd provide a friendly

Re: [slurm-users] What's the best way to suppress core dump files from jobs?

2018-03-21 Thread Michael Jennings
On Wednesday, 21 March 2018, at 08:40:32 (-0600), Ryan Cox wrote: > UsePAM has to do with how jobs are launched when controlled by > Slurm.  Basically, it sends jobs launched under Slurm through the > PAM stack.  UsePAM is not required by pam_slurm_adopt because it is > *sshd* and not *slurmd or s

Re: [slurm-users] fast way for a node to determine its own state?

2018-03-21 Thread Michael Jennings
On Wednesday, 21 March 2018, at 12:05:49 (+0100), Alexis Huxley wrote: > > >Depending on the load on the scheduler, this can be slow. Is there > > >faster way? Perhaps one that doesn't involve communicating with > > >the scheduler node? Thanks! > > Thanks for the suggestion Ole, but we have somet

Re: [slurm-users] How to deal with user running stuff in frontend node?

2018-02-15 Thread Michael Jennings
On Thursday, 15 February 2018, at 16:11:29 (+0100), Manuel Rodríguez Pascual wrote: > Although this is not strictly related to Slurm, maybe you can recommend me > some actions to deal with a particular user. > > On our small cluster, currently there are no limits to run applications in > the fron

Re: [slurm-users] Remote submission hosts and security

2017-12-06 Thread Michael Jennings
On Wednesday, 06 December 2017, at 08:23:10 (-0800), Jeff White wrote: > A Web portal is exactly why I am doing this.  The remote server is a > Web server running some software that expects to pass a script to > sbatch directly.  So the SSH stuff you mention doesn't apply. I'm not sure I agree wi