[slurm-users] Nodes shown by sinfo in partitions
Hello,
I have a question related to the cloud feature, or any feature that can solve an issue I have with my cluster. To keep it simple, let's say I have a set of nodes (say 10), and when needed I move nodes from cluster A to cluster B. In each cluster's slurm.conf I define all the possible nodes:

Cluster A
NodeName=clusterA-[001-010]

Cluster B
NodeName=clusterB-[001-010]

In normal operation I have 5 nodes in cluster A and 5 in cluster B, but in case of need I reboot a node of cluster B into cluster A, and the result will be 4 nodes in cluster B and 6 in cluster A.

The "issue" is that since I specified all possible nodes in slurm.conf, when I run sinfo what I see is:

Cluster A
Normal up 1-00:00:00 5 up clusterA-[01-05]
Normal up 1-00:00:00 5 down* clusterA-[06-10]

Cluster B
Normal up 1-00:00:00 5 up clusterB-[06-10]
Normal up 1-00:00:00 5 down* clusterB-[01-05]

And in both slurmctld.log I have messages like:

error: Unable to resolve "clusterA-006": Unknown host
error: Unable to resolve "clusterB-001": Unknown host

Since I have a lot of partitions and a lot of nodes, the sinfo output is much harder to read because of the DOWN nodes that are not actually present in the system. Is there a way/feature/option that won't display in sinfo the nodes that are not present and not reachable by slurmctld (the ones producing error: Unable to resolve "clusterA-006": Unknown host)?

Basically I'd like to keep all the possible nodes in both slurm.conf files, but sinfo should show:

Cluster A
Normal up 1-00:00:00 5 up clusterA-[01-05]

Cluster B
Normal up 1-00:00:00 5 up clusterB-[06-10]

And if I move a node, once the node is actually reachable:

Cluster A
Normal up 1-00:00:00 6 up clusterA-[01-06]

Cluster B
Normal up 1-00:00:00 4 up clusterB-[07-10]

Thanks
Fabio

--
- Fabio Verzelloni - CSCS - Swiss National Supercomputing Centre
via Trevano 131 - 6900 Lugano, Switzerland
Tel: +41 (0)91 610 82 04
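One configuration-level approach that may fit here is to define the currently absent nodes with State=FUTURE: such nodes need not resolve in DNS and are hidden from sinfo by default. A minimal sketch for cluster A (the CPU count and partition line are placeholders, and note that on older releases such as 18.08 moving a node out of FUTURE may require a slurmctld restart rather than a runtime scontrol update):

-
# slurm.conf (cluster A) - nodes 006-010 are not physically present here yet
NodeName=clusterA-[001-005] CPUs=36 State=UNKNOWN   # CPUs=36 is a placeholder value
NodeName=clusterA-[006-010] CPUs=36 State=FUTURE    # defined but absent: not resolved, not shown by sinfo
PartitionName=Normal Nodes=clusterA-[001-010] MaxTime=1-00:00:00 State=UP
-

Without changing the configuration at all, sinfo -r (--responding) also limits the display to responding nodes only, which at least keeps the interactive output readable.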
[slurm-users] RPM build error - accounting_storage_mysql.so
Hi Everyone,
I'm trying to rebuild slurm rpms for 18.08.7 on RHEL7.6 with this command:

rpmbuild -tb --define "_prefix /opt/slurm/18.08.07" --define "_sysconfdir /etc/slurm" --define "_slurm_sysconfdir /etc/slurm" slurm-18.08.7.tar.bz2

But no matter what, I keep getting this error:

...
Provides: slurm-slurmd = 18.08.7-1.el7 slurm-slurmd(x86-64) = 18.08.7-1.el7
Requires(interp): /bin/sh /bin/sh /bin/sh
Requires(rpmlib): rpmlib(FileDigests) <= 4.6.0-1 rpmlib(PayloadFilesHavePrefix) <= 4.0-1 rpmlib(CompressedFileNames) <= 3.0.4-1
Requires(post): /bin/sh
Requires(preun): /bin/sh
Requires(postun): /bin/sh
Requires: libc.so.6()(64bit) libc.so.6(GLIBC_2.14)(64bit) libc.so.6(GLIBC_2.2.5)(64bit) libc.so.6(GLIBC_2.3.2)(64bit) libc.so.6(GLIBC_2.3.4)(64bit) libc.so.6(GLIBC_2.3)(64bit) libc.so.6(GLIBC_2.4)(64bit) libc.so.6(GLIBC_2.7)(64bit) libdl.so.2()(64bit) libpam_misc.so.0()(64bit) libpam_misc.so.0(LIBPAM_MISC_1.0)(64bit) libpam.so.0()(64bit) libpam.so.0(LIBPAM_1.0)(64bit) libpthread.so.0()(64bit) libpthread.so.0(GLIBC_2.2.5)(64bit) libpthread.so.0(GLIBC_2.3.2)(64bit) libslurmfull.so()(64bit) libutil.so.1()(64bit) libutil.so.1(GLIBC_2.2.5)(64bit) libz.so.1()(64bit) libz.so.1(ZLIB_1.2.0)(64bit)
Processing files: slurm-slurmdbd-18.08.7-1.el7.x86_64
error: File not found: /root/rpmbuild/BUILDROOT/slurm-18.08.7-1.el7.x86_64/opt/slurm/18.08.07/lib64/slurm/accounting_storage_mysql.so

RPM build errors:
    File not found: /root/rpmbuild/BUILDROOT/slurm-18.08.7-1.el7.x86_64/opt/slurm/18.08.07/lib64/slurm/accounting_storage_mysql.so

It seems that I have all dependencies in place, and on RHEL7.5 I did not get this issue; suggestions are appreciated. Moreover, is my rpmbuild command OK to also get sview, or do I need to pass some other option?

Thanks
Fabio

--
- Fabio Verzelloni - CSCS - Swiss National Supercomputing Centre
via Trevano 131 - 6900 Lugano, Switzerland
Tel: +41 (0)91 610 82 04
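A likely cause, for anyone hitting the same error: the accounting_storage/mysql plugin is only compiled when configure finds the MySQL/MariaDB client development headers on the build host, so without them the .so never lands in the buildroot and packaging slurm-slurmdbd fails with exactly this "File not found". A sketch of the usual fix on RHEL 7, assuming the stock repository package names:

yum install -y mariadb-devel gtk2-devel   # mariadb-devel enables the mysql plugin; gtk2-devel is what configure looks for to build sview

rpmbuild -tb --define "_prefix /opt/slurm/18.08.07" \
    --define "_sysconfdir /etc/slurm" \
    --define "_slurm_sysconfdir /etc/slurm" \
    slurm-18.08.7.tar.bz2

That would also answer the sview question: no extra rpmbuild option should be needed once the GTK2 headers are present at build time.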
[slurm-users] Job error when using --job-name=`basename $PWD`
Hi Everyone,
I'm experiencing a weird issue when submitting a job like this:
-
#!/bin/bash
#SBATCH --job-name=`basename $PWD`
#SBATCH --ntasks=2
srun -n 2 hostname
-
Output:
srun: error: Unable to create step for job 15387: More processors requested than permitted

If I submit a job like this:
-
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --ntasks=2
srun -n 2 hostname
-
Output:
Mynode-001
Mynode-001

If I decrease the number of tasks it works fine:
-
#!/bin/bash
#SBATCH --job-name=`basename $PWD`
#SBATCH --ntasks=1
srun -n 1 hostname
-
Output:
Mynode-001

The slurm version is 18.08.8, is that a bug in slurm?

Thanks
Fabio

--
- Fabio Verzelloni - CSCS - Swiss National Supercomputing Centre
via Trevano 131 - 6900 Lugano, Switzerland
Tel: +41 (0)91 610 82 04
Re: [slurm-users] Job error when using --job-name=`basename $PWD`
Dear Sven,
thanks for the clarification.
Fabio

On 29.07.19, 11:44, "slurm-users on behalf of Sven Hansen" wrote:

Hi Fabio,

SLURM does not support the usual shell expression evaluation or arbitrary variable substitution within #SBATCH arguments. If the option parser comes across an option it does not understand, it tends to abort and assumes default settings for all remaining unset options. This can be nasty for newcomers because it happens silently.

If you need to set job arguments dynamically, rewriting a basic job-script template with a programming language of your choice is the go-to option. As so often, sed does the trick very well.

Best,
Sven

On 29.07.2019 at 08:22, Marcus Boden wrote:
> Hi Fabio,
>
> are you sure that command substitution works in the #SBATCH part of the
> jobscript? I don't think that slurm actually evaluates that, though I
> might be wrong.
>
> It seems like the #SBATCH lines after the --job-name line are not evaluated
> anymore, therefore you can't start srun with two tasks (since slurm only
> allocates one).
>
> Best regards,
> Marcus
>
> On 19-07-29 05:51, Verzelloni Fabio wrote:
>> Hi Everyone,
>> I'm experiencing a weird issue when submitting a job like this:
>> -
>> #!/bin/bash
>> #SBATCH --job-name=`basename $PWD`
>> #SBATCH --ntasks=2
>> srun -n 2 hostname
>> -
>> Output:
>> srun: error: Unable to create step for job 15387: More processors requested than permitted
>>
>> If I submit a job like this:
>> -
>> #!/bin/bash
>> #SBATCH --job-name=myjob
>> #SBATCH --ntasks=2
>> srun -n 2 hostname
>> -
>> Output:
>> Mynode-001
>> Mynode-001
>>
>> If I decrease the number of tasks it works fine:
>> -
>> #!/bin/bash
>> #SBATCH --job-name=`basename $PWD`
>> #SBATCH --ntasks=1
>> srun -n 1 hostname
>> -
>> Output:
>> Mynode-001
>>
>> The slurm version is 18.08.8, is that a bug in slurm?
>>
>> Thanks
>> Fabio
>>
>> --
>> - Fabio Verzelloni - CSCS - Swiss National Supercomputing Centre
>> via Trevano 131 - 6900 Lugano, Switzerland
>> Tel: +41 (0)91 610 82 04
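Two concrete workarounds along those lines, as a sketch (the file names job.sh and job.sh.in are hypothetical): pass the dynamic value on the sbatch command line, where the calling shell performs the expansion and command-line options override #SBATCH directives in the script, or render a template with sed as suggested above:
-
# Option 1: the calling shell expands the substitution, not sbatch:
sbatch --job-name="$(basename "$PWD")" job.sh

# Option 2: keep a template job.sh.in containing the token @JOBNAME@
# and substitute it before submitting:
sed "s/@JOBNAME@/$(basename "$PWD")/" job.sh.in > job.sh
sbatch job.sh
-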