nstances)
L1i: 768 KiB (16 instances)
L2: 14 MiB (10 instances)
L3: 30 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-23
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma
=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:gpu03
Dependency=(null)
Paint me surprised...
Diego
On 07/12/2024 10:03, Diego Zuccato via slurm-users wrote:
Hi Davide.
On 06/12/2024 16:42, Davide DelVento wrote:
I find it extremely hard to understand situations like this. I wish
'long' (10, IIRC).
Diego
On Fri, Dec 6, 2024 at 7:36 AM Diego Zuccato via slurm-users <slurm-users@lists.schedmd.com> wrote:
Hello all.
A user reported that a job wasn't starting, so I tried to replicate the request and
llocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
-8<--
So the node is free, the partition does not impose extra limits (used
only for accounting factors) but the job does not start.
Any hints?
Tks
--
Diego Zuccato
DIFA - Dip.
e their jobs
last longer than the wall time limit by suspending and resuming a job?
Best,
*Fritz Ratnasamy*
Data Scientist
Information Technology
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 4
est it).
Diego
On 20/11/2024 12:37, Ole Holm Nielsen via slurm-users wrote:
On 20-11-2024 08:28, shaobo liu wrote:
DSP (Digital Signal Processing) is a type of hardware accelerator.
Which Linux operating system is your DSP running? Is the DSP device
hosted in a normal Linux server?
/
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
264 idle CPUs.
Not sure if it's a known bug, or an issue with our config? I have tried
various things, like setting the sockets/boards in slurm.conf.
Thanks
Jack
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Bert
urmctldLogFile: "/var/log/slurm/slurmctld.log"
SlurmdLogFile: "/var/log/slurm/slurmd.log"
SlurmdSpoolDir: "/var/spool/slurm/d"
SlurmUser: "{{ slurm_user.name }}"
SrunPortRange: "60000-61000"
StateS
eed to specify the partition at all. Any thoughts?
Dietmar
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
On 06/03/2024 13:49, Gestió Servidors via slurm-users wrote:
And how can I reject the job inside the lua script?
Just use
return slurm.FAILURE
and the job will be refused.
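For reference, a minimal sketch of a complete job_submit.lua built around that return code (the "debug" partition rule is purely a hypothetical example; slurm_job_submit/slurm_job_modify, slurm.log_user, slurm.FAILURE and slurm.SUCCESS are the standard entry points and constants of the Lua job_submit plugin):
-- job_submit.lua (sketch): refuse a submission from Lua
function slurm_job_submit(job_desc, part_list, submit_uid)
   -- hypothetical rule: only root may submit to a partition named "debug"
   if job_desc.partition == "debug" and submit_uid ~= 0 then
      -- message shown to the submitting user
      slurm.log_user("Submissions to the debug partition are restricted")
      return slurm.FAILURE  -- the job is rejected at submit time
   end
   return slurm.SUCCESS
end
function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end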
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
end
end
end
return slurm.SUCCESS
end
However, if I submit a job with a TimeLimit of 5 hours, the lua script doesn't
modify the submission and the job remains "pending"…
What am I doing wrong?
Thanks.
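Without seeing the whole script it's hard to say; for comparison, here is only a hedged sketch of how such a time-based adjustment is usually written (the 300-minute threshold and the "long" partition are made-up values; job_desc.time_limit is expressed in minutes and reads as Slurm's NO_VAL sentinel when the submitter sets no limit):
-- job_submit.lua (sketch): route long jobs to another partition
local NO_VAL = 4294967294  -- sentinel Slurm uses for unset 32-bit fields
function slurm_job_submit(job_desc, part_list, submit_uid)
   -- job_desc.time_limit is in minutes; skip jobs without an explicit limit
   if job_desc.time_limit ~= nil and job_desc.time_limit ~= NO_VAL
      and job_desc.time_limit > 300 then
      -- hypothetical: anything longer than 5 hours goes to partition "long"
      job_desc.partition = "long"
      slurm.log_info("job from uid " .. submit_uid .. " moved to partition long")
   end
   return slurm.SUCCESS
end
function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end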
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
S
I guess NVIDIA
had something in mind when they developed MPS, so our pattern
may not be typical (or at least not universal), and in that case the MPS
plugin may well be what you need.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Univ
SQL replication, then manually switch slurmdbd to a replication
slave if the master goes down? Do you do something else?
Thanks.
Daniel
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bolo
and sets itself to drained. Another possibility is that
slurmctld detects a mismatch between the node and its config: in this
case you'll find the reason in slurmctld.log.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le
all_nodes* drained 32 2:8:2 6 0
1 (null) batch job complete f
You have to RESUME the node so it starts accepting jobs.
scontrol update nodename=compute-0 state=resume
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di
hus penalise everyone who requests
large amounts of memory, whether it is needed or not.
Therefore I would be interested in knowing whether one can take into
account the *requested but unused memory* when calculating usage. Is
this possible?
Cheers,
Loris
--
Dr. Loris Bennett (Herr/Mr)
ZED
and a job array range.
I tried to add "-v" to the sbatch to see if that gives more useful info,
but I couldn't get any more insight. Does anyone have any idea why it's
rejecting my job?
thanks,
Noam
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatic
able.id_resv does have 15 different values
(including 0).
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
at the slurm.conf description is misleading.
Noam
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
e to write that snippet in
job_submit.lua ...
Would you expect that to prevent the job from ever running on
any partition? Currently (and, I think, wrongly) that's exactly what happens.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università
y
the correct account.
On Thu, Sep 21, 2023 at 3:11 AM Diego Zuccato <diego.zucc...@unibo.it> wrote:
Hello all.
We have one partition (b4) that's reserved for an account while the
others are "free for all".
The problem is that
sbatch --partitio
d having to replicate scheduler logic in
job_submit.lua... :)
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
contain enough
nodes to satisfy the request. That seems to also apply to the all_partitions
job_submit plugin, making it nearly useless.
We're using Slurm 22.05.6. On 20.11.4 it worked as expected (excluding
partitions that couldn't satisfy the request).
Any hint?
TIA
--
Diego Zuccat
not be a great
problem if the reservation remained...
A reservation should only get deleted when expired, IMO (but I can
understand that there are cases where the current behaviour is desired).
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università
ght on this topic. Your expertise and
assistance would greatly help me in successfully completing my project.
Thank you in advance for your time and support.
Best regards,
Maysam
Johannes Gutenberg University of Mainz
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Ok, PEBKAC :)
When creating the reservation, I set account=root. Just adding
"account=" to the update fixed both errors.
Sorry for the noise.
Diego
On 04/05/2023 07:51, Diego Zuccato wrote:
Hello all.
I'm trying to define a reservation that only allows users in a
p id
[root@slurmctl ~]# getent group res-TEST
res-TEST:*:1180406822:testuser
The group comes from AD via sssd.
What am I missing?
TIA
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel
e two default partitions? In the best case
in a way that slurm schedules to partition1 by default and only to
partition2 when partition1 can't handle the job right now.
Best regards,
Xaver Stiensmeier
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater
why some of us do so many experimental runs of jobs and
gather timings. We have yet to see a 100% efficient process, but folks
are improving things all the time.
Brian Andrus
On 2/13/2023 9:56 PM, Diego Zuccato wrote:
I think that's incorrect:
> The concept of hyper-threading is not d
,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126)
Has anyone faced this or a similar issue and can give me some
directions?
Best wishes
Sebastian
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
o...@gmail.com>
Webpage: http://www.ph.utexas.edu/~daneel/
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
That's probably not optimal, but could work. I'd go with brutal
preemption: swapping 90+G can be quite time-consuming.
Diego
On 07/02/2023 14:18, Analabha Roy wrote:
On Tue, 7 Feb 2023, 18:12 Diego Zuccato <diego.zucc...@unibo.it> wrote:
RAM used by
ics
<http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: dan...@utexas.edu,
a...@phys.buruniv.ac.in
nd reference to the "default partition" in `JobSubmitPlugins`
and this might be the solution. However, I think this is something so
basic that it probably shouldn't need a plugin so I am unsure.
Can anyone point me towards how setting the default partition is done?
Best regards,
On 21/10/2022 19:14, Rohith Mohan wrote:
IIUC this could be the source of your problem:
SelectTypeParameters=CR_CPU_Memory
Maybe try CR_Core_Memory. CR_CPU* has no notion of
sockets/cores/threads.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma
d between controllers, right?
Possibly use NVME-backed (or even better NVDIMM-backed) NFS share. Or
replica-3 Gluster volume with NVDIMMs for the bricks, for the paranoid :)
Diego
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Ber
too.
Regards,
--
Willy Markuske
HPC Systems Engineer
Research Data Services
P: (619) 519-4435
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
On 26/05/2022 11:48, Diego Zuccato wrote:
ace used by the job). Still can't
export TMPDIR=...
from TaskProlog script. Surely missing something important. Maybe
TaskProlog is called as a subshell? In that case it can't alter caller's
env... But IIUC someone made it work, and that confuses me...
--
Diego Zuccato
DIFA - Dip.
ry when the
calculation is done, but I'm sure there must be a better way to do this.
Thanks in advance for the help.
best regards,
Alain
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
}/usr/mpich-4.0.2
gives an executable that only uses 1 CPU even if sbatch requested 52. :(
Any hint appreciated.
Tks.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20
roduced (on newer versions)?
Can this somehow be avoided by setting a default number of tasks or some
other (partition) parameter? Sorry for asking but I couldn't find
anything in the documentation.
Let me know if you need more information.
Best Regards, Benjamin
--
Diego Zuccato
eed to see the users'
home dirs and/or job script dirs.
==
Paul Brunk, system administrator
Georgia Advanced Resource Computing Center
Enterprise IT Svcs, the University of Georgia
On 2/10/22, 6:26 AM, "slurm-users" wrote:
On Thu, 2022-02-
slurmctld need read access to /home/userA/myjob.sh or does it
receive the job script as a "blob" or as a path? Does it even need to
know userA's GID or will it simply use 'userA' to look up associations in
dbd?
Tks.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servi
MBs less than older
one (been there, done that... :( ).
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
Tks.
Will be useful soon :)
Are there other monitoring plugins you'd suggest?
On 17/12/2021 11:15, Loris Bennett wrote:
Hi Diego,
Diego Zuccato writes:
Hi Loris.
On 14/12/2021 14:16, Loris Bennett wrote:
spectrum, today, via our Zabbix monitoring, I spotted some jobs with
unusually high GPU-efficiencies which turned out to be doing
cryptomining :-/
What are you using to collect data for Zabbix?
--
Diego Zuccato
DIFA - Dip. di Fisica
imulator ~]$ sacct -j 791 -o "jobid,nodelist,user"
JobID         NodeList    User
------------  ----------  ---------
791           smp-1       user01
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
knowingly, not by accident), I'm afraid...
Best,
Steffen
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
only impact autodetection (so it "just" requires manual
config) or GPU jobs won't be able to start at all?
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
l get the error
you saw. Restarting slurmd on the submit node fixes it. This is the
documented behavior (adding nodes needs slurmd restarted everywhere). Could
this be what you're seeing (as opposed to /etc/hosts vs DNS)?
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi In
ith a shorter wallclock
time could be backfilled till the reservation/maintenance starts. You
can put the reservation anytime in the system but at least or before
"<maintenance start> minus <max walltime>", e.g.
scontrol create reservation=<name> starttime=<start time>
duration=<duration> user=root flags=maint nodes=ALL
Hope, that helps a
ages are listed here:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#install-prerequisites
/Ole
On 05-11-2021 15:38, Diego Zuccato wrote:
They aren't using modules so it must be something system-wide :(
But not all jobs are impacted. And it seems it's a bit random (doesn't
happen always).
I'm out of ideas, currently :(
On 05/11/2021 13:10, Ole Holm Nielsen wrote:
On 11/5/21 12:47, Diego Zuccato wr
askAffinity=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
MemorySwappiness=0
MaxSwapPercent=0
AllowedSwapSpace=0
Any ideas?
Tks.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
t
tax errors and the most common errors is already
a big help, especially for noobs :)
[OK]: All nodeweights are correct.
What do you mean by this? How can weights be "incorrect"?
If someone is interested ... Surely I am :)
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Serv
grading-slurm
Yup. That's why I upgraded the whole cluster at once.
Tks for the help.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
d node state specified").
SLURM 20.11.4.
Tks.
Diego
On 01/10/2021 21:32, Paul Brunk wrote:
Hi:
If you mean "why are the nodes still Drained, now that I fixed the
slurm.conf and restarted (never mind whether the RealMem parameter is
correct)?", try 'scontrol update nodena
<--
I also tried lowering RealMemory setting to 6, in case MemSpecLimit
interfered, but the result remains the same.
Any ideas?
TIA!
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bolo
On 20/09/2021 13:49, Diego Zuccato wrote:
Tks. Checked it: it's on the home filesystem, NFS-shared between the
nodes. Well, actually a bit more involved than that: JobCompLoc points
to /var/spool/jobscompleted.txt but /var/spool/slurm is actually a
symlink to /home/conf/slurm_spoo
y.
The explanation below is taken from the Slurm web site:
"The backup controller recovers state information from the
StateSaveLocation directory, which must be readable and writable from
both the primary and backup controllers."
Regards;
Ahmet M.
On 20.09.2021 12:08, Diego Zuccato
ue.
I'm currently in the process of adding some nodes, but I already did it
other times w/ no issues (actually the second slurmctld node has been
installed to catch the race of a job terminating while the main
slurmctld was shut down).
Anything I should double-check?
Tks.
--
Diego Zucca
right now):
RealMemory=257433 AllocMem=0 FreeMem=159610
That's probably due to buffers/caches remaining allocated between jobs.
They're handled by the OS and should be automatically freed when a
program needs memory.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Inform
IIRC we increased SlurmdTimeout to 7200.
On 06/08/2021 13:33, Adrian Sevcenco wrote:
On 8/6/21 1:56 PM, Diego Zuccato wrote:
We had a similar problem some time ago (slow creation of big core
files) and solved it by increasing the Slurm timeouts
oh, i see.. well, in principle i should not
21 12:46, Adrian Sevcenco wrote:
On 8/6/21 1:27 PM, Diego Zuccato wrote:
Hi.
Hi!
Might it be due to a timeout (maybe the killed job is creating a core
file, or caused heavy swap usage)?
i will have to search for culprit ..
the problem is why would the node be put in drain for the reas
Then when I submit one job with 8 GPUs, it will
be pending because of GPU fragmentation: node A has 2 idle GPUs, node B has 6 idle
GPUs
Thanks in advance!
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bolog
task fails and how
can I disable it? (I use cgroups)
moreover, how can the killing of a task fail? (this is on Slurm 19.05)
Thank you!
Adrian
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna
and bash?
GNU Awk 4.2.1, GNU bash, version 5.0.3(1)-release.
The job timings are printed by pestat if you use the -S, -E and -T
options. See the help info with "pestat -h".
I'll have another look on Monday for further testing (I started quite
early this morning :) ).
Tks a lot for n
"when will my job start". But pestat and slurmtop are
different tools for different uses, no need to duplicate all functionality.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
g the
slurmdbd/mariadb support is all right with no problems, but slurmctld
still does not start on boot.
Also, blade01 reported in the log is the hostname of one of the nodes.
You should probably fix /usr/lib/systemd/system/slurmdbd.service as well.
/Ole
--
Diego Zuccato
DIFA - Dip. di Fisica e
--
Diego Zuccato
DIFA - Dip. di Fisica e Astron
e. I don't
quite see how one could integrate pestat itself directly into Zabbix, as
it is more geared to producing a report, but maybe Ole has ideas :-)
How to use the collected data is one of the big open problems in IT :)
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Inform
m not really OK with it yet (for example I still can't understand
how I can exclude some metrics from a host that got 'em added by a
template... When I have enough time I'll find a way :) ). Maybe
pestat can be added to the Zabbix metrics...
--
Diego Zuccato
DIFA - Dip. di
=192, restarted slurmctld and it keeps seeing all CPUs...
What should I think?
But another problem surfaces: slurmtop seems not to handle so many CPUs
gracefully and throws a lot of errors, but that should be something
manageable...
Tks for the help.
BYtE,
Diego
On 21/07/2021 11:01, Diego Zuc
Uff... A bit mangled... Correcting and resending.
On 21/07/2021 08:18, Diego Zuccato wrote:
On 20/07/2021 18:02, mercan wrote:
Hi Ahmet.
Did you check the slurmctld log for a complaint about the host line? If
slurmctld cannot recognize a parameter, maybe it gives up processing the
whole
] _build_node_list: No nodes satisfy JobId=33808
requirements in partition b4
(str957 is the second frontend/login node that I've had to take offline
for an unrelated problem).
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
ems related to a regression in later versions...
Maybe delete Boards=1 SocketsPerBoard=4 and try Sockets=4 instead?
Already tried. Actually, that was the first thing I tried.
The pam_slurm_adopt is very useful :-)
IIUC only if you allow users to connect to the worker nodes. I don't. :)
Wiki notes could be helpful?
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#compute-node-configuration
Tks. Interesting, but I don't see pam_slurm_adopt. Other than that, it
seems very much like what I'm doing.
BYtE,
Diego
On 7/20/21 12:49 PM, Diego Zuccato wrote:
Hello all.
It'
.
I restarted slurmctld after every change in slurm.conf just to be sure.
Any idea?
Tks.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
sibly impacting other
users. Even if you just make users "pay" for the resources used by
applying fairshare, the temptation to game the system could be too big.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pic
. Maybe someone more
experienced can refine it.
No... it doesn't work...
-Original Message-
De: Diego Zuccato
Enviado el: jueves, 10 de junio de 2021 10:37
Para: Slurm User Community List ; Gestió
Servidors
Subject: Re: [slurm-users] Job requesting two different GPUs on two differen
,
--gres=gpu:GeForceRTX2070:1” because line “#SBATCH --gres=” is for each
node and, then, a line containing two “gres” would request a node with 2
different GPUs. So… is it possible to request 2 different GPUs in 2
different nodes?
Thanks.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
submitting your job.
Brian Andrus
On 6/1/2021 4:15 AM, Diego Zuccato wrote:
Hello all.
I just found that if a user tries to specify a nodelist (say
including 2 nodes) and --nodes=1, the job gets rejected with
sbatch: error: invalid number of nodes (-N 2-1)
The expected behaviour is that slurm
found conflicting info about the issue. Is it version-dependent?
If so, we're currently using 18.08.5-2 (from Debian stable). Should we
expect changes when Debian ships a newer version? Is it possible to
have the expected behaviour?
Tks.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronom
nked to
in this page.
Tks.
I upgrade Slurm frequently and have no problems doing so. We're at
20.11.7 now. You should avoid 20.11.{0-2} due to a bug in MPI.
That's really useful info.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - U
dle manually-compiled packages).
As Ole said, it's an old version. I'd love to be able to keep up with
the newest releases, but ... :(
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
ed   Reported
---------  ---------
ophcpu    81.93%    0.00%    0.00%   15.85%    2.22%   100.00%
ophmem    80.60%    0.00%    0.00%   19.40%    0.00%   100.00%
BYtE,
Diego
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informa
On 14/05/2021 08:19, Christopher Samuel wrote:
sreport -t percent -T ALL cluster utilization
"sreport: fatal: No valid TRES given" :(
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 401
t's
out of my depth, but there's a very low-volume mailing list at
ccr-xdmod-l...@listserv.buffalo.edu you could inquire at.
[1] https://github.com/ubccr/xdmod/releases/tag/v9.5.0-rc.4
On 12/05/21 13:30, Diego Zuccato wrote:
Anyway, at first glance it uses a bit too many technologies for my
taste (PHP, Java, JS...) and could be a problem to integrate into a
vhost managed by one of our ISPConfig instances. But I'll try it.
Somehow I'll make it work :)
The m
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
ess to the bare numbers is definitely a no-no :)
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
have to make some changes (re field
width: our usernames are quite long, being from AD), but first I have to
check if it extracts the info our users want to see :)
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2
or at least the data to put in a spreadsheet for
further processing)?
Tks.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
d got propagated (as implied by
PropagateResourceLimits default value of ALL).
And I can confirm that setting it to NONE seems to have solved the
issue: users on the frontend get limited resources, and jobs on the
nodes get the resources they asked for.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astro
so when I tried to limit
the memory users can use on the frontend to 1GB soft / 4GB hard, the jobs
began to fail at startup even if they requested 200G (which is available
on the worker nodes but not on the frontend)...
Tks.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Inform
On 29/03/21 09:35, taleinterve...@sjtu.edu.cn wrote:
> Why can't the loop code get the content in job_desc? And what is the
> correct way to print all its content without manually specifying each key?
I already reported it quite some time ago. Seems pairs() is not working.
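job_desc resolves its fields through an __index accessor instead of storing them as ordinary table entries, so pairs() finds nothing to iterate. A hedged workaround sketch is to keep a hand-maintained list of field names and read them one by one (the list below is only an illustrative subset):
-- job_submit.lua (sketch): log selected job_desc fields explicitly
local dump_fields = { "name", "partition", "account",
                      "time_limit", "min_nodes", "num_tasks" }
function slurm_job_submit(job_desc, part_list, submit_uid)
   for _, f in ipairs(dump_fields) do
      -- string indexing goes through the same accessor as job_desc.field
      slurm.log_info("job_desc." .. f .. " = " .. tostring(job_desc[f]))
   end
   return slurm.SUCCESS
end
function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end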
--
Diego Z
n the partition.
So the definition will have to be reversed: set the partition limit to
the max allowed (1h) and limit all users except one in the assoc.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40
On 29/01/21 08:47, Diego Zuccato wrote:
>> Jobs submitted with sbatch cannot run on multiple partitions. The job
>> will be submitted to the partition where it can start first. (from
>> sbatch reference)
> Did I misunderstand, or can heterogeneous jobs work around this