n one job from being allocated on each core.
On 11/06/2025 0:53, Adrian Sevcenco via slurm-users wrote:
Hi! I have a weird situation in which only cores are used instead of CPUs.
This is Alma9 / Slurm 22.05.9 (the last one from EPEL)
I have:
conf.d/resources.conf:9:SelectType=select/cons_res
conf.d/resources.conf:11:SelectTypeParameters=CR_CPU
conf.d/resources.conf:5:TaskPluginParam=autobind=thread
Hi! I have a weird situation with a cluster that I switched from CR_Core to
CR_CPU:
select/cons_res, TaskPlugin=task/affinity,task/cgroup
TaskPluginParam=autobind=threads
despite the jobs reporting that only 1 CPU is needed:
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
Is the problem with my expectation that I should not see the oom-killer,
or with my configuration?
Thank you!
Adrian
--
------
Adrian Sevcenco, Ph.D. |
Institute of Space Science - ISS, Romania|
adrian.sevcenco at {cern.ch,spacescience.ro} |
--
Hi! I have a problem with enforcing the memory limits...
I'm using cgroups to enforce the limits and I had expected that when the
cgroup memory limit is reached the job is killed ..
instead I see in the log a lot of oom-killer reports that act only on a certain process
from the cgroup ...
Did I miss
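For reference, a minimal cgroup.conf sketch of the memory-constraint knobs usually involved here (a sketch only; the values are illustrative and not taken from this post):
  # /etc/slurm/cgroup.conf -- illustrative values
  CgroupAutomount=yes
  ConstrainRAMSpace=yes      # enforce the job's RAM limit via the memory cgroup
  ConstrainSwapSpace=yes     # also constrain swap
  AllowedSwapSpace=0         # no extra swap beyond the RAM limit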
On 05.12.2021 20:54, Chris Samuel wrote:
On 4/12/21 9:34 am, Adrian Sevcenco wrote:
actually it is not ... so, once again, does anyone have an idea about customizing the timeout of the init script defined
in job_container.conf?
Looking at the source it's hard-coded in Slurm 21.08, so
Hi! I have a rather long init script in job_container.conf
and while I tried to raise various timeouts I still get:
[2021-12-03T16:22:08.070] error: run_command: initscript poll timeout @ 1 msec
[2021-12-03T16:22:08.080] error: _create_ns: init script:
/etc/slurm/cvmfs_makeshared.sh failed
what
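For context, a minimal job_container.conf sketch with an init script hooked in (a sketch only; the BasePath is an assumption, the InitScript path is the one from the log above):
  # /etc/slurm/job_container.conf -- illustrative
  AutoBasePath=true
  BasePath=/var/spool/slurm/containers        # assumed location
  InitScript=/etc/slurm/cvmfs_makeshared.sh   # runs when the per-job namespace is created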
..
let's see how it goes, whether autofs will mount stuff if I bind /cvmfs
(which I make rshared) into the job namespace.
Thanks a lot!!!
Adrian
https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt has more
info on the options.
Ryan
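A one-line sketch of the shared-subtree setup being discussed (the mount point is the one from the thread; this would run on the host, outside the job namespace):
  # mark /cvmfs recursively shared so autofs mounts made on the host
  # propagate into namespaces that bind-mount it
  mount --make-rshared /cvmfs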
On 12/2/21 15:58, Adrian Sevcenco wrote:
Hi! I have an annoying problem with the namespaces and the shared attribute
of an autofs mountpoint...
So, there is a directory named /cvmfs where autofs will mount various
directories
depending on the job requests.
These directories, named repositories, do not need to be defined, regardless
Hi!
On 01.12.2021 10:25, Bjørn-Helge Mevik wrote:
Adrian Sevcenco writes:
Hi! Does anyone know what could be the cause of such an error?
I have a shared home, Slurm 20.11.8, and I try a simple script in the submit
directory
which is in the home that is NFS shared...
We had the "Too
Hi! Does anyone know what could be the cause of such an error?
I have a shared home, Slurm 20.11.8, and I try a simple script in the submit
directory
which is in the home that is NFS shared...
Also I have job_container.conf defined, but I have no idea if this is a
problem..
Thank you!
Adrian
Hi! Maybe this was discussed before but I did not have a problem with this until
now ..
It would seem that Slurm sees as "available to allocate to the job" memory the
free memory instead
of the available memory .. Is this by design? Can this be changed to available with
a flag or something?
The ar
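As a quick illustration of the distinction being asked about (a sketch, not from the original post):
  # MemFree excludes reclaimable page cache; MemAvailable estimates what
  # could actually be handed to new processes
  grep -E 'MemFree|MemAvailable' /proc/meminfo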
Hi! I'm trying to prepare and test for some jobs that will arrive and that will
use multiple processes (I have no control over this; there are multiple
executables
that are started in parallel within the job and communicate between them
with
a customization of zmq)
the submitting method i
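As a sketch of one common way such a multi-process job is requested (everything below is an assumption, including the wrapper script name, and is not from the original post):
  #!/bin/bash
  #SBATCH --job-name=multiproc_test   # illustrative
  #SBATCH --nodes=1
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=8           # reserve CPUs for all the parallel executables
  ./start_all_executables.sh          # hypothetical wrapper that launches them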
Hi!
On 8/8/21 3:19 AM, Chris Samuel wrote:
On Friday, 6 August 2021 12:02:45 AM PDT Adrian Sevcenco wrote:
I was wondering why a node is drained when the killing of a task fails, and how can
I disable it? (I use cgroups.) Moreover, how can the killing of a task fail?
(This is on Slurm 19.05.)
Slurm
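For reference, a hedged sketch of the slurm.conf knobs commonly looked at around such drains (values are illustrative only; the hook path is hypothetical):
  # slurm.conf -- illustrative
  UnkillableStepTimeout=180                    # allow slow cleanup (e.g. large core dumps) before the node is drained
  # UnkillableStepProgram=/path/to/notify.sh   # optional hook run when a step cannot be killed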
On 8/7/21 9:50 PM, Adrian Sevcenco wrote:
Hi! I just upgraded Slurm from 19.05 to 20.11 (all services stopped before)
and now, after checking the configuration, slurmdbd does not start anymore:
[2021-08-07T21:42:01.890] error: Database settings not recommended values: innodb_buffer_pool_size
ata Services
P: (619) 519-4435
Hi! I just upgraded Slurm from 19.05 to 20.11 (all services stopped before)
and now, after checking the configuration, slurmdbd does not start anymore:
[2021-08-07T21:42:01.890] error: Database settings not recommended values: innodb_buffer_pool_size innodb_log_file_size
innodb_lock_wait_timeout
[
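For context, a sketch of the MariaDB/MySQL settings that message refers to (sizes are illustrative starting points only, to be tuned for the host):
  # e.g. /etc/my.cnf.d/innodb.cnf -- illustrative values
  [mysqld]
  innodb_buffer_pool_size=1024M
  innodb_log_file_size=64M
  innodb_lock_wait_timeout=900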
On 8/6/21 3:19 PM, Diego Zuccato wrote:
IIRC we increased SlurmdTimeout to 7200 .
Thanks a lot!
Adrian
On 06/08/2021 13:33, Adrian Sevcenco wrote:
On 8/6/21 1:56 PM, Diego Zuccato wrote:
We had a similar problem some time ago (slow creation of big core files) and
solved it by
wouldn't trigger it. Then, once the need for core files was over, I disabled
core files and restored the timeouts.
and how much did you increase them? I have
SlurmctldTimeout=300
SlurmdTimeout=300
Thank you!
Adrian
On 06/08/2021 12:46, Adrian Sevcenco wrote:
On 8/6/21 1:27 PM, Diego Zu
control/disable this?
Thank you!
Adrian
BYtE,
Diego
On 06/08/2021 09:02, Adrian Sevcenco wrote:
Having just implemented some triggers I just noticed this:
NODELIST    NODES  PARTITION  STATE     CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
alien-0-47  1      alien*     draining  48    48:1:1  193324  214030    1       rack-0,4  Kill task failed
alien-0-56  1      alien*     drained
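For completeness, a hedged sketch of returning such a node to service once the cause has been dealt with (node name taken from the listing above):
  scontrol update NodeName=alien-0-47 State=RESUME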
Hi! Can a 19.05 cluster be directly upgraded to 20.11?
Thank you!
Adrian
.
Thanks a lot for info!
Adrian
-Paul Edmon-
On 8/2/2021 12:05 PM, Adrian Sevcenco wrote:
On 8/2/21 6:26 PM, Paul Edmon wrote:
Probably more like
MaxTRESPerJob=cpu=8
I see, thanks!!
I'm still searching for the definition of GrpTRES :)
Thanks a lot!
Adrian
You would need to specify how
21 11:24 AM, Adrian Sevcenco wrote:
On 8/2/21 5:44 PM, Paul Edmon wrote:
You can set up a Partition-based QOS that can set this limit: https://slurm.schedmd.com/resource_limits.html See the
MaxTRESPerJob limit.
Oh, thanks a lot!!
Would something like this work / be in line with your indication? :
MaxTRESPerJob=8
modify account blah DefaultQOS=8cpu
Thanks a lot!
Adrian
-Paul Edmon-
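Putting the pieces of this exchange together, a hedged sketch of the full setup (the QOS name 8cpu and account blah are from the thread; the partition name is hypothetical):
  # create the QOS and give it the per-job CPU cap Paul suggested
  sacctmgr add qos 8cpu
  sacctmgr modify qos 8cpu set MaxTRESPerJob=cpu=8
  # make it the default QOS for the account
  # (it may also need to be in the account's allowed QOS list: set QOS+=8cpu)
  sacctmgr modify account blah set DefaultQOS=8cpu
  # or reference it from slurm.conf as a partition QOS:
  # PartitionName=somepart QOS=8cpu MaxNodes=1 ...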
Hi! Is there a way to declare that jobs can request up to 8 cores?
Or is it allowed by default (as I see no limit regarding this ..) .. I just
have MaxNodes=1.
This is the CR_CPU allocator.
Thank you!
Adrian
Hi! I just encountered the situation that I cannot submit jobs from
any location other than $HOME... I just get an exit of 1, and in the slurmctld
log I just see:
_job_complete: JobId=30831 WEXITSTATUS 1
_job_complete: JobId=30831 done
both the custom location and HOME are NFS shared and I checked cl
ngineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Wed, 20 Jan 2021 at 06:50, Adrian Sevcenco <adrian.sevce...@spacescience.ro> wrote:
Hi! So, I have a very strange situation that I do not even know how to
troubleshoot...
I'm running with
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory,CR_LLN
TaskPlugin=task/affinity,task/cgroup
TaskPluginParam=autobind=threads
and a partition defined with:
LLN=yes DefMemPerCPU=40
have continuous monitoring (ganglia) data, but this
is beyond the scope of this list.
Thanks a lot!
Adrian
-Paul Edmon-
On 12/2/2020 6:27 AM, Adrian Sevcenco wrote:
Hi! I encountered a situation when a bunch of jobs were restarted
and this is seen from Requeue=1 Restarts=1 BatchFlag=1 Reboot
On 12/2/20 1:27 PM, Fabio Moreira wrote:
Hi,
I would like to know if Slurm has any configuration to enable
randomized node allocation, since we have 256 nodes in our cluster and
the first nodes are always allocated first. Is there any way to
allocate them in a random way? We have already
Hi! I encountered a situation when a bunch of jobs were restarted,
and this is seen from Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0.
So, I would like to know how I can find why there is a Requeue
(when there is only one partition defined) and why there is a restart ..
Thanks a lot!!
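A hedged sketch of one way to dig into this with sacct (the job id is hypothetical; requeued runs show up as separate records when duplicates are requested):
  sacct -j 12345 --duplicates --format=JobID,State,ExitCode,Start,End,NodeList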
On 6/19/20 12:35 PM, mercan wrote:
Hi;
For running jobs, you can get the running script by using:
scontrol write batch_script "$SLURM_JOBID" -
wow, thanks a lot!!!
Adrian
command. The - parameter is required for screen output.
Ahmet M.
On 19.06.2020 12:25, Adrian Sevcenco wr
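A usage sketch of that command outside a job script (the job id is illustrative):
  scontrol write batch_script 12345 -    # prints job 12345's submitted batch script to stdout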
On 6/18/20 9:35 AM, Loris Bennett wrote:
Hi Adrian,
Hi
Adrian Sevcenco writes:
Hi! I'm trying to retrieve the actual executable of jobs but I did not find how
to do it .. I would like to find this for both cases, when the job is started
with sbatch or with srun.
For running
Is there a way to query specifically for jobs that were killed by slurm?
(so excluding scancels)
Thank you!
Adrian
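A hedged sketch of one way to query this with sacct (the start date and the state list are illustrative; CANCELLED, i.e. scancel, is simply left out of the list):
  sacct --allusers -X --starttime=2020-06-01 \
        --state=TIMEOUT,OUT_OF_MEMORY,NODE_FAIL,PREEMPTED \
        --format=JobID,User,State,ExitCode,Elapsed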
Hi! I'm trying to retrieve the actual executable of jobs but I did not
find how to do it .. I would like to find this for both cases, when the
job is started with sbatch or with srun.
Thank you!
Adrian
Hi! How can I debug a job that fails without any output or indication?
My job that starts with sbatch has the following form:
#!/bin/bash
#SBATCH --job-name QCUT_SEV
#SBATCH -p CLUSTER # Partition to submit to
#SBATCH --output=%x_%j.out # File to which STDOUT will be writ
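A hedged sketch of debugging lines often added to such a script (everything below is an assumption, not part of the original job):
  #SBATCH --error=%x_%j.err    # capture STDERR separately from STDOUT
  set -x                       # trace each command so the failing step is visible
  echo "running on $(hostname) in $(pwd)"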
Hi! Is there a method for setting up a work directory unique for each
job from a system setting? And then clean that up?
Can I somehow use the prolog and epilog sections?
Thank you!
Adrian
--
--
Adrian Sevcenco, Ph.D
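For illustration, a minimal prolog/epilog sketch of that idea (the scratch path is an assumption; slurm.conf would point Prolog= and Epilog= at these scripts):
  #!/bin/bash
  # prolog.sh -- create a per-job scratch directory (illustrative path)
  mkdir -p "/scratch/job_${SLURM_JOB_ID}"

  #!/bin/bash
  # epilog.sh -- remove it when the job ends
  rm -rf "/scratch/job_${SLURM_JOB_ID}"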
Hi! I just upgraded to 17.11.2 and when I try to start slurmctld I get
this:
slurmctld[12552]: fatal: You are running with a database but for some
reason we have less TRES than should be here (4 < 5) and/or the
"billing" TRES is missing. This should only happen if the database is
down after an
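An illustrative check of which TRES the accounting database currently knows about:
  sacctmgr show tres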