[slurm-users] Re: CR_CPU used but only cores used

2025-06-11 Thread Adrian Sevcenco via slurm-users
…n one job from being allocated on each core. On 11/06/2025 0:53, Adrian Sevcenco via slurm-users wrote: Hi! i have a weird situation in which only cores are used instead of CPUs this is Alma…

[slurm-users] CR_CPU used but only cores used

2025-06-10 Thread Adrian Sevcenco via slurm-users
Hi! i have a weird situation in which only cores are used instead of CPUs this is Alma9/slurm 22.05.9 (the last one from epel) I have: conf.d/resources.conf 9:SelectType=select/cons_res conf.d/resources.conf 11:SelectTypeParameters=CR_CPU conf.d/resources.conf 5:TaskPluginParam=autobind=thread
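The options quoted in this thread map onto a slurm.conf fragment along these lines (a sketch reconstructed from the snippet; the node line is hypothetical). A commonly reported gotcha with CR_CPU: if the node definition advertises the full Sockets/Cores/Threads topology, allocation tends to happen at core granularity, so to schedule individual hyper-threads the node is usually defined with only CPUs=.

```ini
# conf.d/resources.conf (sketch; the node line is hypothetical)
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
TaskPluginParam=autobind=thread

# Hypothetical node definition: advertising only CPUs= (no
# Sockets/Cores/Threads) lets CR_CPU hand out individual threads.
# NodeName=node[01-10] CPUs=48 RealMemory=193324
```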

[slurm-users] Half of cpus used despite CR_CPU

2022-04-13 Thread Adrian Sevcenco
Hi! I have a weird situation with a cluster that i switched from CR_Core to CR_CPU select/cons_res, TaskPlugin=task/affinity,task/cgroup TaskPluginParam=autobind=threads despite reporting in the jobs that only 1 CPU is needed: NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

Re: [slurm-users] memory limits:: why job is not killed but oom-killer steps up?

2022-01-13 Thread Adrian Sevcenco
problem with my expectation that i should not see oom-killer? or with my configuration? Thank you! Adrian -- Adrian Sevcenco, Ph.D. | Institute of Space Science - ISS, Romania | adrian.sevcenco at {cern.ch,spacescience.ro}

[slurm-users] memory limits:: why job is not killed but oom-killer steps up?

2022-01-12 Thread Adrian Sevcenco
Hi! I have a problem with enforcing the memory limits... I'm using cgroup to enforce the limits and i had expected that when the cgroup memory limit is reached the job is killed .. instead i see in the log a lot of oom-killer reports that act only on a certain process from the cgroup ... Did i miss…
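This matches documented cgroup behavior: with ConstrainRAMSpace the kernel's oom-killer reaps individual processes inside the job's cgroup when the limit is hit; the job as a whole is not necessarily terminated. A cgroup.conf sketch for the setup described (values illustrative):

```ini
# cgroup.conf (sketch; percentages are illustrative)
ConstrainRAMSpace=yes     # cap the job cgroup at its requested memory
ConstrainSwapSpace=yes
AllowedRAMSpace=100       # percent of the allocated memory
AllowedSwapSpace=0
```

If whole-job termination on memory overrun is what is wanted, that historically came from JobAcctGather-based polling (the OverMemoryKill mechanism), not from the cgroup limit itself.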

Re: [slurm-users] initscript poll timeout @ 10000 msec :: what slurm conf var?

2021-12-05 Thread Adrian Sevcenco
On 05.12.2021 20:54, Chris Samuel wrote: On 4/12/21 9:34 am, Adrian Sevcenco wrote: actually is not ... so, once again, does anyone have an idea about customization of the timeout of init script defined in job_container.conf? Looking at the source it's hard-coded in Slurm 21.08, so

Re: [slurm-users] initscript poll timeout @ 10000 msec :: what slurm conf var?

2021-12-04 Thread Adrian Sevcenco
On 03.12.2021 23:50, Adrian Sevcenco wrote: On 03.12.2021 16:31, Adrian Sevcenco wrote: Hi! I have a rather long init script in job_container and while i tried to raise various timeouts i still get: [2021-12-03T16:22:08.070] error: run_command: initscript poll timeout @ 10000 msec [2021-12…

Re: [slurm-users] initscript poll timeout @ 10000 msec :: what slurm conf var?

2021-12-03 Thread Adrian Sevcenco
On 03.12.2021 16:31, Adrian Sevcenco wrote: Hi! I have a rather long init script in job_container and while i tried to raise various timeouts i still get: [2021-12-03T16:22:08.070] error: run_command: initscript poll timeout @ 10000 msec [2021-12-03T16:22:08.080] error: _create_ns: init…

[slurm-users] initscript poll timeout @ 10000 msec :: what slurm conf var?

2021-12-03 Thread Adrian Sevcenco
Hi! I have a rather long init script in job_container and while i tried to raise various timeouts i still get: [2021-12-03T16:22:08.070] error: run_command: initscript poll timeout @ 10000 msec [2021-12-03T16:22:08.080] error: _create_ns: init script: /etc/slurm/cvmfs_makeshared.sh failed what…
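Per the follow-up in this thread, the 10000 ms poll timeout is hard-coded in Slurm 21.08, so the init script has to finish within 10 seconds. A job_container.conf sketch using the script named in the log (BasePath is hypothetical):

```ini
# job_container.conf — tmpfs plugin (sketch; BasePath is hypothetical)
AutoBasePath=true
BasePath=/var/spool/slurm/containers
InitScript=/etc/slurm/cvmfs_makeshared.sh   # must complete within the 10 s hard-coded poll timeout
```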

Re: [slurm-users] job_container.conf:: how to adopt a autofs base mount point

2021-12-02 Thread Adrian Sevcenco
.. let's see how it goes, if autofs will mount stuff if i bind in the job ns the /cvmfs (that i make rshared) Thanks a lot!!! Adrian https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt has more info on the options. Ryan On 12/2/21 15:58, Adrian Sevcenco wrote: Hi! I have a a…

[slurm-users] job_container.conf:: how to adopt a autofs base mount point

2021-12-02 Thread Adrian Sevcenco
Hi! I have an annoying problem with the namespaces and the shared attribute of an autofs mountpoint... so, there is a directory named /cvmfs where autofs will mount various directories depending on the job requests. these directories, named repositories, do not need to be defined, regardles…
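The fix discussed in the reply is to mark the autofs-managed /cvmfs recursively shared before job namespaces are created, so mounts autofs makes afterwards propagate into them. A sketch (run as root on the compute node):

```shell
# Mark /cvmfs and everything under it as a shared mount subtree
mount --make-rshared /cvmfs

# Verify the propagation flag took effect
findmnt -o TARGET,PROPAGATION /cvmfs    # expect: shared
```

See the sharedsubtree.txt kernel document linked in the reply for the propagation semantics.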

Re: [slurm-users] slurmstepd: error: Too many levels of symbolic links

2021-12-02 Thread Adrian Sevcenco
Hi! On 01.12.2021 10:25, Bjørn-Helge Mevik wrote: Adrian Sevcenco writes: Hi! Does anyone know what could be the cause of such an error? I have a shared home, slurm 20.11.8 and i try a simple script in the submit directory which is in the home that is nfs shared... We had the "Too…

[slurm-users] slurmstepd: error: Too many levels of symbolic links

2021-11-30 Thread Adrian Sevcenco
Hi! Does anyone know what could be the cause of such an error? I have a shared home, slurm 20.11.8 and i try a simple script in the submit directory which is in the home that is nfs shared... also i have job_container.conf defined, but i have no idea if this is a problem.. Thank you! Adrian

[slurm-users] slurm free memory reporting: free vs available

2021-11-25 Thread Adrian Sevcenco
Hi! Maybe this was discussed before but i did not have a problem with this until now .. It would seem that slurm sees as "available to allocate to the job" memory the free memory instead of the available memory .. is this by design? can this be changed to available with a flag or something? The ar…

[slurm-users] multi-process/thread jobs:: configuration and job specification

2021-09-29 Thread Adrian Sevcenco
Hi! I'm trying to prepare and test for some jobs that will arrive and that will use multiple processes (i have no control on this, there are multiple executables that are being started in parallel within the job and communicate between them with a customization of zmq) the submitting method i

Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-07 Thread Adrian Sevcenco
Hi! On 8/8/21 3:19 AM, Chris Samuel wrote: On Friday, 6 August 2021 12:02:45 AM PDT Adrian Sevcenco wrote: i was wondering why a node is drained when killing of task fails and how can i disable it? (i use cgroups) moreover, how can the killing of task fails? (this is on slurm 19.05) Slurm

Re: [slurm-users] 19.05->20.11 update:: slurmdbd failure - SOLVED

2021-08-07 Thread Adrian Sevcenco
On 8/7/21 9:50 PM, Adrian Sevcenco wrote: Hi! I just upgraded slurm from 19.05 to 20.11 (all services stopped before) and now, after checking the configuration, slurmdbd does not start anymore: [2021-08-07T21:42:01.890] error: Database settings not recommended values: innodb_buffer_pool_size…

Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-07 Thread Adrian Sevcenco
…ata Services P: (619) 519-4435 On 8/6/21 6:16 AM, Adrian Sevcenco wrote: On 8/6/21 3:19 PM, Diego Zuccato wrote: IIRC we increased SlurmdTimeout to 7200. Thanks a lot! Adrian Il 06/08/2021 13:33, Adrian Sevcenco ha scritto: On 8/6/21 1:56 PM, Diego Zuccato wrote: We had a similar proble…

[slurm-users] 19.05->20.11 update:: slurmdbd failure

2021-08-07 Thread Adrian Sevcenco
Hi! I just upgraded slurm from 19.05 to 20.11 (all services stopped before) and now, after checking the configuration, slurmdbd does not start anymore: [2021-08-07T21:42:01.890] error: Database settings not recommended values: innodb_buffer_pool_size innodb_log_file_size innodb_lock_wait_timeout…
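The three variables named in the error are MariaDB/MySQL server settings; the Slurm accounting documentation suggests raising them in the server configuration before starting slurmdbd. A sketch (file path and sizes are illustrative, tune to the database):

```ini
# /etc/my.cnf.d/innodb.cnf (sketch; values are illustrative)
[mysqld]
innodb_buffer_pool_size = 4096M
innodb_log_file_size = 64M
innodb_lock_wait_timeout = 900
```

Restart mysqld/mariadb after changing these; note that on older MySQL versions changing innodb_log_file_size may require removing the old log files first.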

Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-06 Thread Adrian Sevcenco
On 8/6/21 3:19 PM, Diego Zuccato wrote: IIRC we increased SlurmdTimeout to 7200 . Thanks a lot! Adrian Il 06/08/2021 13:33, Adrian Sevcenco ha scritto: On 8/6/21 1:56 PM, Diego Zuccato wrote: We had a similar problem some time ago (slow creation of big core files) and solved it by

Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-06 Thread Adrian Sevcenco
…wouldn't trigger it. Then, once the need for core files was over, I disabled core files and restored the timeouts. and how much did you increase them? i have SlurmctldTimeout=300 SlurmdTimeout=300 Thank you! Adrian Il 06/08/2021 12:46, Adrian Sevcenco ha scritto: On 8/6/21 1:27 PM, Diego Zu…

Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-06 Thread Adrian Sevcenco
…control/disable this? Thank you! Adrian BYtE, Diego Il 06/08/2021 09:02, Adrian Sevcenco ha scritto: Having just implemented some triggers i just noticed this: NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON alien-0-47 1 alien…

[slurm-users] draining nodes due to failed killing of task?

2021-08-06 Thread Adrian Sevcenco
Having just implemented some triggers i just noticed this: NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON alien-0-47 1 alien* draining 48 48:1:1 193324 214030 1 rack-0,4 Kill task failed alien-0-56 1 alien* drained…
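A "Kill task failed" drain generally means the step's processes could not be reaped within UnkillableStepTimeout (slow NFS writes or large core dumps are typical causes, as discussed in the replies). The relevant slurm.conf knobs, as a sketch:

```ini
# slurm.conf (sketch; 7200 is the SlurmdTimeout value suggested in this thread)
UnkillableStepTimeout=180     # seconds before a step is declared unkillable
SlurmdTimeout=7200
# Optional debug hook, run when an unkillable step is detected (path hypothetical):
# UnkillableStepProgram=/etc/slurm/unkillable.sh
```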

[slurm-users] update in place/db compatibility 19.05 vs 20.11

2021-08-02 Thread Adrian Sevcenco
Hi! can a 19.05 cluster be directly upgraded to 20.11? Thank you! Adrian
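Slurm supports upgrading from the two previous major releases, and 20.11 is two releases after 19.05 (19.05 → 20.02 → 20.11), so a direct upgrade is within the supported window. The usual order, as a sketch (package steps vary by distro; database name and credentials are assumptions):

```shell
# Stop everything, back up the accounting DB, then upgrade slurmdbd FIRST
systemctl stop slurmctld slurmd slurmdbd
mysqldump slurm_acct_db > slurm_acct_db.pre-20.11.sql
# ...install the 20.11 packages...
systemctl start slurmdbd      # performs the DB schema conversion; watch its log
systemctl start slurmctld slurmd
```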

Re: [slurm-users] declare availability of up to 8 cores//job

2021-08-02 Thread Adrian Sevcenco
…Thanks a lot for info! Adrian -Paul Edmon- On 8/2/2021 12:05 PM, Adrian Sevcenco wrote: On 8/2/21 6:26 PM, Paul Edmon wrote: Probably more like MaxTRESPerJob=cpu=8 i see, thanks!! i'm still searching for the definition of GrpTRES :) Thanks a lot! Adrian You would need to specify how…

Re: [slurm-users] declare availability of up to 8 cores//job

2021-08-02 Thread Adrian Sevcenco
…21 11:24 AM, Adrian Sevcenco wrote: On 8/2/21 5:44 PM, Paul Edmon wrote: You can set up a Partition based QoS that can set this limit: https://slurm.schedmd.com/resource_limits.html See the MaxTRESPerJob limit. oh, thanks a lot!! would something like this work/be in line with your indication?…

Re: [slurm-users] declare availability of up to 8 cores//job

2021-08-02 Thread Adrian Sevcenco
…MaxTRESPerJob=8 modify account blah DefaultQOS=8cpu Thanks a lot! Adrian -Paul Edmon- On 8/2/2021 10:40 AM, Adrian Sevcenco wrote: Hi! Is there a way to declare that jobs can request up to 8 cores? Or is it allowed by default (as i see no limit regarding this .. ) .. i just have MaxNodes=1 this…
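The commands quoted above correspond to a QOS-based per-job CPU cap; a fuller sketch ("8cpu" and account "blah" are the names used in the thread):

```shell
# Create a QOS and cap any single job under it at 8 CPUs
sacctmgr add qos 8cpu
sacctmgr modify qos 8cpu set MaxTRESPerJob=cpu=8

# Make it the default for an account...
sacctmgr modify account blah set DefaultQOS=8cpu

# ...or attach it to a partition in slurm.conf instead:
#   PartitionName=... QOS=8cpu
```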

[slurm-users] declare availability of up to 8 cores//job

2021-08-02 Thread Adrian Sevcenco
Hi! Is there a way to declare that jobs can request up to 8 cores? Or is it allowed by default (as i see no limit regarding this .. ) .. i just have MaxNodes=1 this is CR_CPU alocator Thank you! Adrian

[slurm-users] job submit location :: restricted to HOME?

2021-03-03 Thread Adrian Sevcenco
Hi! I just encountered the situation that i cannot submit jobs from a location other than $HOME... i just get an exit code of 1, and in the slurmctld log i just see: _job_complete: JobId=30831 WEXITSTATUS 1 _job_complete: JobId=30831 done both the custom location and HOME are NFS shared and i checked cl…

Re: [slurm-users] [EXT] wrong number of jobs used

2021-01-19 Thread Adrian Sevcenco
…ngineer and HPC Team Lead Research Computing Services | Business Services The University of Melbourne, Victoria 3010 Australia On Wed, 20 Jan 2021 at 06:50, Adrian Sevcenco <adrian.sevce...@spacescience.ro> wrote: UoM notice: External email. Be cautious of links, attachme…

[slurm-users] wrong number of jobs used

2021-01-19 Thread Adrian Sevcenco
Hi! So, i have a very strange situation that i do not even know how to troubleshoot... I'm running with SelectType=select/cons_res SelectTypeParameters=CR_CPU_Memory,CR_LLN TaskPlugin=task/affinity,task/cgroup TaskPluginParam=autobind=threads and a partition defined with: LLN=yes DefMemPerCPU=40
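The settings quoted assemble into a slurm.conf fragment like this (sketch; the partition name and the truncated DefMemPerCPU value are illustrative):

```ini
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory,CR_LLN
TaskPlugin=task/affinity,task/cgroup
TaskPluginParam=autobind=threads
# Partition sketch — name and memory value are illustrative
PartitionName=batch Nodes=ALL LLN=yes DefMemPerCPU=4096 State=UP
```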

Re: [slurm-users] job restart :: how to find the reason

2020-12-02 Thread Adrian Sevcenco
…have continuous monitoring (ganglia) data, but this is beyond the scope of this list. Thanks a lot! Adrian -Paul Edmon- On 12/2/2020 6:27 AM, Adrian Sevcenco wrote: Hi! I encountered a situation when a bunch of jobs were restarted and this is seen from Requeue=1 Restarts=1 BatchFlag=1 Reboot…

Re: [slurm-users] Randomize Slurm Node Allocation

2020-12-02 Thread Adrian Sevcenco
On 12/2/20 1:27 PM, Fabio Moreira wrote: Hi, I would like to know if Slurm has any configuration to enable a randomize node allocation, since we have 256 nodes in our cluster and the first nodes are always allocated at first. Is there any way to allocate them in an aleatory way? We have already

[slurm-users] job restart :: how to find the reason

2020-12-02 Thread Adrian Sevcenco
Hi! I encountered a situation when a bunch of jobs were restarted and this is seen from Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0 So, i would like to know how i can find why there is a Requeue (when there is only one partition defined) and why there is a restart .. Thanks a lot!!

Re: [slurm-users] howto list/get all scripts run by a job?

2020-06-19 Thread Adrian Sevcenco
On 6/19/20 12:35 PM, mercan wrote: Hi; For running jobs, you can get the running script with using: scontrol write batch_script  "$SLURM_JOBID" - wow, thanks a lot!!! Adrian command. the - parameter reqired for screen output. Ahmet M. On 19.06.2020 12:25, Adrian Sevcenco wr
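The command from the reply, spelled out (the trailing "-" writes to the screen instead of a file):

```shell
# Dump the batch script of a running job to stdout
scontrol write batch_script "$SLURM_JOBID" -

# On newer Slurm (21.08+, with AccountingStoreFlags=job_script) the script
# is reportedly also retrievable after the job ends:
#   sacct -j <jobid> --batch-script
```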

Re: [slurm-users] howto list/get all scripts run by a job?

2020-06-19 Thread Adrian Sevcenco
On 6/18/20 9:35 AM, Loris Bennett wrote: Hi Adrian, Adrian Sevcenco writes: Hi! I'm trying to retrieve the actual executable of jobs but i did not find how to do it .. i would like to find this for both cases when the job is started with sbatch or with srun. For running…

[slurm-users] find jobs killed by slurm

2020-06-17 Thread Adrian Sevcenco
Is there a way to query specifically for jobs that were killed by slurm? (so excluding scancels) Thank you! Adrian
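One way to approximate this with sacct: filter on end states that Slurm itself imposes and exclude user cancellations (a sketch; the date range is illustrative):

```shell
# Jobs ended by the scheduler rather than by scancel (sketch)
sacct -X -S 2020-06-01 -E now \
      --state=TIMEOUT,OUT_OF_MEMORY,NODE_FAIL,PREEMPTED \
      --format=JobID,JobName,State,ExitCode,Elapsed
# scancel'ed jobs show as "CANCELLED by <uid>" and are omitted here
```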

[slurm-users] howto list/get all scripts run by a job?

2020-06-17 Thread Adrian Sevcenco
Hi! I'm trying to retrieve the actual executable of jobs but i did not find how to do it .. i would like to find this for both cases when the job is started with sbatch or with srun. Thank you! Adrian

[slurm-users] sbatch : job fail without any output or indication

2020-02-04 Thread Adrian Sevcenco
Hi! How can i debug a job that fails without any output or indication? My job that starts with sbatch has the following form: #!/bin/bash #SBATCH --job-name QCUT_SEV #SBATCH -p CLUSTER # Partition to submit to #SBATCH --output=%x_%j.out # File to which STDOUT will be writ…
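A sketch of the first things to check when a batch job dies with no output (the job id is a placeholder):

```shell
# While the job is still known to the controller:
scontrol show job "$JOBID"        # check Reason, WorkDir, StdOut path

# After it is gone from squeue:
sacct -j "$JOBID" --format=JobID,State,ExitCode,DerivedExitCode,NodeList

# A common cause of "no output at all": the directory of the --output
# path does not exist or is not writable on the compute node.
```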

[slurm-users] slurm config :: set up a workdir for each job

2019-09-19 Thread Adrian Sevcenco
Hi! Is there a method for setting up a work directory unique for each job from a system setting? and then clean that up? can i use somehow the prologue and epilogue sections? Thank you! Adrian
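Prolog/Epilog are indeed the usual hook for this: they run as root on the compute node with SLURM_JOB_ID and SLURM_JOB_USER in the environment. A sketch (the /scratch path is hypothetical):

```shell
#!/bin/bash
# Prolog sketch (slurm.conf: Prolog=/etc/slurm/prolog.sh) — create a
# per-job workdir owned by the job's user. Path is hypothetical.
WORKDIR="/scratch/job_${SLURM_JOB_ID}"
mkdir -p "$WORKDIR"
chown "$SLURM_JOB_USER" "$WORKDIR"

# Matching Epilog sketch (Epilog=/etc/slurm/epilog.sh):
#   rm -rf "/scratch/job_${SLURM_JOB_ID}"
```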

[slurm-users] slurmctld 17.11.2 :: fatal we have less TRES than should be here

2018-01-05 Thread Adrian Sevcenco
Hi! I just upgraded to 17.11.2 and when i try to start slurmctld i get this: slurmctld[12552]: fatal: You are running with a database but for some reason we have less TRES than should be here (4 < 5) and/or the "billing" TRES is missing. This should only happen if the database is down after an…