[slurm-users] Re: Slurm webhooks

2025-04-23 Thread Davide DelVento via slurm-users
> Davide DelVento via slurm-users writes: > > > I've gotten a request to have Slurm notify users for the typical email > > things (job started, completed, failed, etc) with a REST API instead of > > email. This would allow notifications

[slurm-users] Slurm webhooks

2025-04-21 Thread Davide DelVento via slurm-users
Happy Monday everybody, I've gotten a request to have Slurm notify users for the typical email things (job started, completed, failed, etc) with a REST API instead of email. This would allow notifications in MS Teams, Slack, or log stuff in some internal websites and things like that. As far as I
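One way to approach this (a sketch, not something from the thread): slurm.conf's MailProg can point at a script that posts the would-be mail subject to a chat webhook instead of sending email. The script name, webhook URL, and payload shape below are all assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical MailProg replacement. Slurm invokes MailProg roughly like
# the mail command: prog -s "subject" recipient
# URL and payload format are placeholders for your chat service.
WEBHOOK_URL="${SLURM_WEBHOOK_URL:-https://hooks.example.com/slurm}"

# Build a minimal Slack/Teams-style JSON payload from the mail subject,
# escaping any embedded double quotes.
build_payload() {
  local subject="$1"
  printf '{"text": "%s"}' "${subject//\"/\\\"}"
}

subject="(no subject)"
while getopts "s:" opt; do
  [ "$opt" = "s" ] && subject="$OPTARG"
done
shift $((OPTIND - 1))

# Only post when a recipient argument was given, i.e. a real MailProg call.
if [ -n "$1" ]; then
  curl -s -X POST -H 'Content-Type: application/json' \
       -d "$(build_payload "$subject")" "$WEBHOOK_URL"
fi
```

The subject Slurm passes already carries the job id and state, so forwarding it as-is is often enough for a first cut.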

[slurm-users] Re: cpus and gpus partitions and how to optimize the resource usage

2025-04-04 Thread Davide DelVento via slurm-users
Ciao Massimo, How about creating another queue cpus_in_the_gpu_nodes (or something less silly) which targets the GPU nodes but does not allow the allocation of the GPUs with gres and allocates 96-8 (or whatever other number you deem appropriate) of the CPUs (and similarly with memory)? Actually it

[slurm-users] Re: cpus and gpus partitions and how to optimize the resource usage

2025-04-01 Thread Davide DelVento via slurm-users
Yes, I think so, but that should be no problem. I think that requires your Slurm was built using the --enable-multiple-slurmd configure option, so you might need to rebuild Slurm, if you didn't use that option in the first place. On Mon, Mar 31, 2025 at 7:32 AM Massimo Sgaravatto < massimo.sgarava
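For reference, a from-source rebuild with that flag might look like the following (prefix and paths are illustrative, not from the thread):

```
./configure --prefix=/opt/slurm --sysconfdir=/etc/slurm --enable-multiple-slurmd
make -j$(nproc)
sudo make install
```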

[slurm-users] Re: Preemption question

2025-03-30 Thread Davide DelVento via slurm-users
Hi Kamil, I don't use QoS, so I don't have a direct answer to your question; however, I use preemption for a queue/partition and that is extremely easy to set up and maintain. In case your plan with QoS won't work, you can set up a preemptable queue and force this user to submit only to this queue a

[slurm-users] Re: [EXTERNAL] Re: [EXTERN] Re: Slurm 24.05 and OpenMPI

2025-03-28 Thread Davide DelVento via slurm-users
{ "emoji": "👍", "version": 1 } -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Re: [EXTERNAL] Re: [EXTERN] Re: Slurm 24.05 and OpenMPI

2025-03-27 Thread Davide DelVento via slurm-users
{ "emoji": "♥️", "version": 1 }

[slurm-users] Re: [EXTERN] Re: Slurm 24.05 and OpenMPI

2025-03-27 Thread Davide DelVento via slurm-users
Hi Matthias, I see. It does not freak me out. Unfortunately I have very little experience working with MPI-in-containers, so I don't know the best way to debug this. What I do know is that some ABIs in Slurm change with Slurm major versions and dependencies need to be recompiled with newer versions

[slurm-users] Re: Slurm 24.05 and OpenMPI

2025-03-26 Thread Davide DelVento via slurm-users
Hi Matthias, Let's take the simplest things out first: have you compiled OpenMPI yourself, separately on both clusters, using the specific drivers for whatever network you have on each? In my experience OpenMPI is quite finicky about working correctly, unless you do that. And when I don't, I see ex

[slurm-users] Re: slurmrestd equivalent to "srun -n 10 echo HELLO"

2025-03-24 Thread Davide DelVento via slurm-users
If you submit the command as a script, the output and the error stream end up in files, because you may log out, or have a gazillion other things going on, or other reasons, and therefore the stream to tty/console does not make sense anymore On Mon, Mar 24, 2025 at 8:29 AM Dan Healy via slurm-users < slurm-

[slurm-users] Re: SLURM_JOB_ACCOUNT var missing in prolog

2025-03-13 Thread Davide DelVento via slurm-users
I am not sure about that one variable; however, I gave up on using environment variables in the prolog for the reasons described in an earlier thread at the following link https://groups.google.com/g/slurm-users/c/R9adbpdZ22E/m/cZAkDIS5AAAJ On Wed, Mar 12, 2025 at 3:36 AM Jonás Arce via slurm-us

[slurm-users] Re: Limit CPUs per job (but not per user, partition or node)

2025-02-26 Thread Davide DelVento via slurm-users
Hi Herbert, I believe the limit is per node (not per partition) whereas you want it per job. In other words, your users will be able to run jobs on other nodes. There is no MaxCPUsPerJob option in the partition definition, but I believe you can make that restriction in other ways (at worst with a
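One such "other way" (a sketch assuming accounting is enabled; the QOS and partition names are placeholders) is a QOS with a per-job TRES limit, attached to the partition:

```
# create a QOS that caps each job at 16 CPUs
sacctmgr add qos maxcpu set MaxTRESPerJob=cpu=16

# slurm.conf: attach it to the partition
PartitionName=compute Nodes=node[01-10] QOS=maxcpu Default=YES
```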

[slurm-users] Re: Create filenames based on slurm hosts

2025-02-14 Thread Davide DelVento via slurm-users
Actually I hit send too quickly; what I meant (assuming bash) is for a in $(scontrol show hostname whatever_list); do touch $a; done with the same whatever_list being $SLURM_JOB_NODELIST On Fri, Feb 14, 2025 at 1:18 PM Davide DelVento wrote: > Not sure I completely understand what you need, bu

[slurm-users] Re: Create filenames based on slurm hosts

2025-02-14 Thread Davide DelVento via slurm-users
Not sure I completely understand what you need, but if I do... How about touch whatever_prefix_$(scontrol show hostname whatever_list) where whatever_list could be your $SLURM_JOB_NODELIST ? On Fri, Feb 14, 2025 at 9:42 AM John Hearns via slurm-users < slurm-users@lists.schedmd.com> wrote: > I

[slurm-users] Re: Unexpected node got allocation

2025-01-09 Thread Davide DelVento via slurm-users
I believe that, in the absence of other reasons, Slurm assigns nodes to jobs in the order they are listed in the partition definitions of slurm.conf -- perhaps for whatever reason node 41 appears first there, rather than 01? On Thu, Jan 9, 2025 at 7:24 AM Dan Healy via slurm-users < slurm-users@lists.sc

[slurm-users] Re: formatting node names

2025-01-07 Thread Davide DelVento via slurm-users
Wonderful. Thanks Ole for the reminder! I had bookmarked your wiki (of course!) but forgot to check it out in this case. I'll add a more prominent reminder to self in my notes to always check it! Happy new year everybody once again On Tue, Jan 7, 2025 at 1:58 AM Ole Holm Nielsen via slurm-users <

[slurm-users] Re: formatting node names

2025-01-06 Thread Davide DelVento via slurm-users
Found it, I should have asked my Puppet, as it's mandatory in some places :-D It is simply scontrol show hostname gpu[01-02],node[03-04,12-22,27-32,36] Sorry for the noise On Mon, Jan 6, 2025 at 12:55 PM Davide DelVento wrote: > Hi all, > I remember seeing on this list a slurm command to cha

[slurm-users] formatting node names

2025-01-06 Thread Davide DelVento via slurm-users
Hi all, I remember seeing on this list a slurm command to change a slurm-friendly list such as gpu[01-02],node[03-04,12-22,27-32,36] into a bash friendly list such as gpu01 gpu02 node03 node04 node12 etc I made a note about it but I can't find my note anymore, nor the relevant message. Can some

[slurm-users] Re: Job not starting

2024-12-10 Thread Davide DelVento via slurm-users
Good sleuthing. It would be nice if Slurm said something like Reason=Priority_Lower_Than_Job_ so people would immediately find the culprit in such situations. Has anybody with a SchedMD subscription ever asked for something like that, or are there some reasons for which it'd be impossible (or t

[slurm-users] Re: error and output files

2024-12-09 Thread Davide DelVento via slurm-users
Mmmm, from https://slurm.schedmd.com/sbatch.html > By default both standard output and standard error are directed to a file of the name "slurm-%j.out", where the "%j" is replaced with the job allocation number. Perhaps at your site there's a configuration which uses separate error files? See the
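If separate files are wanted, the job script can request them explicitly; a minimal sketch (filenames are illustrative):

```
#!/bin/bash
#SBATCH --output=myjob-%j.out   # stdout; the default is slurm-%j.out with both streams merged
#SBATCH --error=myjob-%j.err    # direct stderr to its own file
```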

[slurm-users] Re: Job not starting

2024-12-06 Thread Davide DelVento via slurm-users
Ciao Diego, I find it extremely hard to understand situations like this. I wish Slurm were clearer in reporting what it is doing, but I digress... I suspect that there are other job(s) which have higher priority than this one which are supposed to run on that node but cannot start because

[slurm-users] Re: Change primary alloc node

2024-10-31 Thread Davide DelVento via slurm-users
Another possible use case of this is a regular MPI job where the first/controller task often uses more memory than the workers and may need to be scheduled on a higher memory node than them. I think I saw this happening in the past, but I'm not 100% sure it was in Slurm or some other scheduling sys

[slurm-users] Re: Job pre / post submit scripts

2024-10-28 Thread Davide DelVento via slurm-users
Not sure if I understand your use case, but if I do, I am not sure Slurm provides that functionality. If it doesn't (and if my understanding is correct), you can still achieve your goal by: 1) removing sbatch and salloc from the users' path 2) writing your own custom scripts named sbatch (and hard/s
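That wrapper idea can be sketched like this (the real sbatch path and the hook bodies are placeholders for site-specific logic, not anything prescribed by Slurm):

```shell
#!/bin/bash
# Hypothetical wrapper installed as "sbatch" ahead of the real binary in PATH.
REAL_SBATCH="${REAL_SBATCH:-/opt/slurm/bin/sbatch}"

pre_submit() {
  # Site policy checks on the submission arguments; return non-zero to reject.
  return 0
}

post_submit() {
  # E.g. record the submission somewhere for accounting.
  logger -t sbatch-wrapper "user=$USER args: $*" 2>/dev/null || true
}

if [ "$#" -gt 0 ]; then
  pre_submit "$@" || { echo "submission rejected by site policy" >&2; exit 1; }
  "$REAL_SBATCH" "$@"
  rc=$?
  post_submit "$@"
  exit "$rc"
fi
```

The same pattern works for salloc; the wrapper stays transparent because all arguments are forwarded unchanged and the real command's exit code is preserved.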

[slurm-users] Re: errors compiling Slurm 18 on RHEL 9: [Makefile:577: scancel] Error 1 & It's not recommended to have unversioned Obsoletes

2024-09-27 Thread Davide DelVento via slurm-users
Slurm 18? Isn't that a bit outdated? On Fri, Sep 27, 2024 at 9:41 AM Robert Kudyba via slurm-users < slurm-users@lists.schedmd.com> wrote: > We're in the process of upgrading but first we're moving to RHEL 9. My > attempt to compile using rpmbuild -v -ta --define "_lto_cflags %{nil}" > slurm-18.

[slurm-users] Re: Print Slurm Stats on Login

2024-08-28 Thread Davide DelVento via slurm-users
Thanks everybody once again and especially Paul: your job_summary script was exactly what I needed, served on a golden plate. I just had to modify/customize the date range and change the following line (I can make a PR if you want, but it's such a small change that it'd take more time to deal with

[slurm-users] Re: Spread a multistep job across clusters

2024-08-26 Thread Davide DelVento via slurm-users
Ciao Fabio, That is for sure syntactically incorrect, because of the way sbatch parsing works: as soon as it finds a non-empty non-comment line (your first srun) it will stop parsing for #SBATCH directives. So assuming this is a single file, as it looks from the formatting, the second hetjob and the cl

[slurm-users] Re: Slurmdbd purge and reported downtime

2024-08-23 Thread Davide DelVento via slurm-users
owing that the problem won't happen again in the future. Thanks and have a great weekend On Fri, Aug 23, 2024 at 8:00 AM Ole Holm Nielsen via slurm-users < slurm-users@lists.schedmd.com> wrote: > Hi Davide, > > On 8/22/24 21:30, Davide DelVento via slurm-users wrote: > >

[slurm-users] Slurmdbd purge and reported downtime

2024-08-22 Thread Davide DelVento via slurm-users
I am confused by the amounts of Down and PLND Down time reported by sreport. According to it, our cluster would have had a significant amount of downtime, which I know didn't happen (or, according to the documentation, "time that slurmctld was not responding"; see https://slurm.schedmd.com/sreport.html)

[slurm-users] Re: Print Slurm Stats on Login

2024-08-21 Thread Davide DelVento via slurm-users
Hi Ole, On Wed, Aug 21, 2024 at 1:06 PM Ole Holm Nielsen via slurm-users < slurm-users@lists.schedmd.com> wrote: > The slurmacct script can actually break down statistics by partition, > which I guess is what you're asking for? The usage of the command is: > Yes, this is almost what I was askin

[slurm-users] Re: Print Slurm Stats on Login

2024-08-21 Thread Davide DelVento via slurm-users
> inside jobs to emulate a login session, causing a heavy load on your > servers. > > /Ole > > On 8/21/24 01:13, Davide DelVento via slurm-users wrote: > > Thanks Kevin and Simon, > > > > The full thing that you do is indeed overkill, however I was able to > l

[slurm-users] Re: Print Slurm Stats on Login

2024-08-20 Thread Davide DelVento via slurm-users
Thanks Kevin and Simon, The full thing that you do is indeed overkill, however I was able to learn how to collect/parse some of the information I need. What I am still unable to get is: - utilization by queue (or list of node names), to track actual use of expensive resources such as GPUs, high

[slurm-users] Re: Unable to run sequential jobs simultaneously on the same node

2024-08-19 Thread Davide DelVento via slurm-users
Since each instance of the program is independent and you are using one core for each, it'd be better to let Slurm deal with that and schedule them concurrently as it sees fit. Maybe you simply need to add some directive to allow shared jobs on the same node. Alternatively (if at your site jobs m
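Whether independent single-core jobs may share a node depends on the select plugin configuration; a common slurm.conf sketch that schedules by core (values are illustrative, check against your site's config):

```
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
```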

[slurm-users] Re: Print Slurm Stats on Login

2024-08-14 Thread Davide DelVento via slurm-users
g text output of squeue command) > > cheers > > josef

[slurm-users] Re: Print Slurm Stats on Login

2024-08-13 Thread Davide DelVento via slurm-users
I too would be interested in some lightweight scripts. XDMOD in my experience has been very intense in workload to install, maintain and learn. It's great if one needs that level of interactivity, granularity and detail, but for some "quick and dirty" summary in a small dept it's not only overkill,

[slurm-users] Re: Seeking Commercial SLURM Subscription Provider

2024-08-13 Thread Davide DelVento via slurm-users
How about SchedMD itself? They are the ones doing most (if not all) of the development, and they are great. In my experience, the best options are either SchedMD or the vendor of your hardware. On Mon, Aug 12, 2024 at 11:17 PM John Joseph via slurm-users < slurm-users@lists.schedmd.com> wrote: >

[slurm-users] Re: With slurm, how to allocate a whole node for a single multi-threaded process?

2024-08-02 Thread Davide DelVento via slurm-users
I am pretty sure that with vanilla Slurm it is impossible. What might (maybe) be possible is submitting 5-core jobs and using some pre/post scripts which, immediately before the job starts, change the requested number of cores to "however many are currently available on the node where it is scheduled to run".

[slurm-users] Re: With slurm, how to allocate a whole node for a single multi-threaded process?

2024-08-01 Thread Davide DelVento via slurm-users
In part, it depends on how it's been configured, but have you tried --exclusive? On Thu, Aug 1, 2024 at 7:39 AM Henrique Almeida via slurm-users < slurm-users@lists.schedmd.com> wrote: > Hello, everyone, with slurm, how to allocate a whole node for a > single multi-threaded process? > > > https:

[slurm-users] Re: Can SLURM queue different jobs to start concurrently?

2024-07-08 Thread Davide DelVento via slurm-users
I think the best way to do it would be to schedule the 10 things to be a single slurm job and then use some of the various MPMD ways (the nitty gritty details depend if each executable is serial, OpenMP, MPI or hybrid). On Mon, Jul 8, 2024 at 2:20 PM Dan Healy via slurm-users < slurm-users@lists.s
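A sketch of that single-job MPMD approach for ten serial executables, using srun's --multi-prog (file and program names are made up):

```
#!/bin/bash
#SBATCH --ntasks=10
# one job step, several different programs, one per task rank
srun --multi-prog ./programs.conf
```

where programs.conf maps task ranks to commands:

```
# rank(s)  command
0    ./prog_a
1-4  ./prog_b
5-9  ./prog_c input_%t.dat   # %t expands to the task rank
```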

[slurm-users] Re: Best practice for jobs resuming from suspended state

2024-05-16 Thread Davide DelVento via slurm-users
I don't really have an answer for you, just responding to make your message pop out in the "flood" of other topics we've got since you posted. On our cluster we configure cancelling of jobs because it makes more sense for our situation, so I have no experience with that resume from being suspende

[slurm-users] Re: memory high water mark reporting

2024-05-16 Thread Davide DelVento via slurm-users
Not exactly the answer to your question (which I don't know) but if you can get to prefix whatever is executed with this https://github.com/NCAR/peak_memusage (which also uses getrusage) or a variant you will be able to do that. On Thu, May 16, 2024 at 4:10 PM Emyr James via slurm-users < slurm-us

[slurm-users] Re: Partition Preemption Configuration Question

2024-05-08 Thread Davide DelVento via slurm-users
{ "emoji": "👍", "version": 1 }

[slurm-users] Re: StateSaveLocation and Slurm HA

2024-05-07 Thread Davide DelVento via slurm-users
Are you seeking something simple rather than sophisticated? If so, you can use the controller local disk for StateSaveLocation and place a cron job (on the same node or somewhere else) to take that data out via e.g. rsync and put it where you need it (NFS?) for the backup control node to use if/whe
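A minimal sketch of that cron-based copy, assuming the controller keeps its state on local disk (paths and interval are placeholders):

```
# crontab on the primary controller:
# mirror the local StateSaveLocation to shared storage every 5 minutes
*/5 * * * * rsync -a --delete /var/spool/slurmctld/ /shared/slurm-state-backup/
```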

[slurm-users] Re: Partition Preemption Configuration Question

2024-05-02 Thread Davide DelVento via slurm-users
Hi Jason, I wanted exactly the same and was confused exactly like you. For a while it did not work, regardless of what I tried, but eventually (with some help) I figured it out. What I set up and it is working fine is this globally PreemptType = preempt/partition_prio PreemptMode=REQUEUE and th
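Fleshed out as a slurm.conf sketch (the two global lines are from the message; the partition names, node list, and priority tiers are illustrative):

```
# global settings
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

# preemptable low-priority partition
PartitionName=scavenger Nodes=node[01-32] PriorityTier=1  PreemptMode=REQUEUE
# high-priority partition whose jobs may preempt scavenger jobs
PartitionName=normal    Nodes=node[01-32] PriorityTier=10 PreemptMode=OFF Default=YES
```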

[slurm-users] Re: Recover Batch Script Error

2024-02-16 Thread Davide DelVento via slurm-users
Yes, that is what we are also doing and it works well. Note that when requesting the batch script of another user's job, one sees nothing (rather than an error message saying that one does not have permissions) On Fri, Feb 16, 2024 at 12:48 PM Paul Edmon via slurm-users < slurm-users@lists.schedmd.com> wrote:

[slurm-users] Re: Need help managing licence

2024-02-16 Thread Davide DelVento via slurm-users
The simple answer is to just add a line such as Licenses=whatever:20 and then request your users to use the -L option as described at https://slurm.schedmd.com/licenses.html This works very well, however it does not do enforcement like Slurm does with other resources. You will find posts in this
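Concretely (the license name and count here are taken from the message; the job submission is an illustration):

```
# slurm.conf
Licenses=whatever:20
```

and users request them at submission time:

```
sbatch -L whatever:2 job.sh
```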

[slurm-users] Re: Compilation question

2024-02-09 Thread Davide DelVento via slurm-users
Hi Sylvain, In the spirit of "better late than never": is this still a problem? If so, is this a new install or an update? What environment/compiler are you using? The error "undefined reference to __nv_init_env" seems to indicate that you are doing something CUDA-related, which I think you should not

[slurm-users] Re: Memory used per node

2024-02-09 Thread Davide DelVento via slurm-users
If you would like the high-watermark memory utilization after the job completes, https://github.com/NCAR/peak_memusage is a great tool. Of course it has the limitation that you need to know that you want that information *before* starting the job, which might or might not be a problem for your use cas