[slurm-users] memory high water mark reporting

2024-05-16 Thread Emyr James via slurm-users
Hi,

We are trying out Slurm, having run Grid Engine for a long while.
In Grid Engine, the cgroup peak memory and max_rss are captured at the end of a
job and recorded: it logs the information from the cgroup hierarchy and also
makes a getrusage call right at the end on the parent pid of the whole job
"container", before cleaning up.
With Slurm it seems that the only way memory is recorded is via the acct_gather
polling. I am trying to add something to an epilog script to read memory.peak,
but it looks like the cgroup hierarchy has already been destroyed by the time
the epilog runs.
Where in the code is the cgroup hierarchy cleaned up? Is there no way to hook
something in so that the accounting is updated during job cleanup, so that peak
memory usage can be accurately logged?

I can reduce the polling interval from 30s to 5s, but I don't know how much
overhead that adds, and in any case polling does not seem a sensible way to
obtain values that should simply be read once, at the end, in response to an
event.

Many thanks,

Emyr



[slurm-users] Re: memory high water mark reporting

2024-05-16 Thread Emyr James via slurm-users
Hi,

I have a very simple LD_PRELOAD library that can do this. Maybe I should see if
I can force slurmstepd to run with that LD_PRELOAD and check whether that does
it.

Ultimately I am trying to get all the useful accounting metrics into a
ClickHouse database. If the LD_PRELOAD on slurmstepd works, I can extend it to
insert the relevant row into the ClickHouse DB from the C code of the preload
library.
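
For concreteness, this is roughly the kind of preload library I mean - a minimal sketch only; the real one would insert into ClickHouse rather than print to stderr, and the file and function names here are just illustrative:

/* peak_rss_preload.c - sketch of an LD_PRELOAD library that reports the
 * process's peak RSS at exit via getrusage(). Illustrative only.
 * Build: gcc -shared -fPIC -o peak_rss_preload.so peak_rss_preload.c
 * Run:   LD_PRELOAD=./peak_rss_preload.so ./my_program
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>

__attribute__((destructor))
static void report_peak_rss(void)
{
        struct rusage ru;

        /* RUSAGE_SELF covers this process only; RUSAGE_CHILDREN adds
         * already-reaped children, which is closer to a whole-step view. */
        if (getrusage(RUSAGE_SELF, &ru) == 0)
                fprintf(stderr, "[pid %d] peak RSS: %ld kB\n",
                        (int) getpid(), ru.ru_maxrss);
}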

But still... this seems like a very basic thing to do, and I am very surprised
that it appears so difficult with the standard accounting recording out of the
box.

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation


From: Davide DelVento 
Sent: 17 May 2024 01:02
To: Emyr James 
Cc: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] memory high water mark reporting

Not exactly the answer to your question (which I don't know), but if you can
prefix whatever is executed with this https://github.com/NCAR/peak_memusage
(which also uses getrusage), or a variant of it, you will be able to do that.



[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread Emyr James via slurm-users
Looking here:

https://slurm.schedmd.com/spank.html#SECTION_SPANK-PLUGINS

it looks like it's possible to hook something in at the right place using the
slurm_spank_task_exit or slurm_spank_exit plugin callbacks. Does anyone have any
experience or examples of doing this? Is there any more documentation available
on this functionality?
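
To make the idea concrete, something along these lines is what I have in mind - a rough, untested sketch only; in particular, whether slurmstepd's own cgroup (read from /proc/self/cgroup) is the right scope to read memory.peak from depends on how the cgroup plugin lays things out, so treat the path handling as an assumption to verify:

/* spank_peak_mem.c - rough sketch of a SPANK plugin that logs memory.peak as
 * the step is torn down. Untested; the cgroup path handling is an assumption.
 * Build: gcc -shared -fPIC -o spank_peak_mem.so spank_peak_mem.c
 * Enable via plugstack.conf: optional /path/to/spank_peak_mem.so
 */
#include <stdio.h>
#include <string.h>
#include <slurm/spank.h>

SPANK_PLUGIN(peak_mem, 1);

/* Runs in the slurmstepd (remote) context when the step exits. */
int slurm_spank_exit(spank_t sp, int ac, char **av)
{
        FILE *fp;
        char line[4096], path[4200], buf[64], *rel;

        if (!spank_remote(sp))
                return ESPANK_SUCCESS;

        /* On a cgroup v2 host the last line of /proc/self/cgroup is
         * "0::<path>"; take everything after the second colon. */
        fp = fopen("/proc/self/cgroup", "r");
        if (!fp)
                return ESPANK_SUCCESS;
        line[0] = '\0';
        while (fgets(line, sizeof(line), fp))
                ;       /* keep the last line */
        fclose(fp);

        rel = strchr(line, ':');
        rel = rel ? strchr(rel + 1, ':') : NULL;
        if (!rel)
                return ESPANK_SUCCESS;
        rel++;
        rel[strcspn(rel, "\n")] = '\0';

        snprintf(path, sizeof(path), "/sys/fs/cgroup%s/memory.peak", rel);
        fp = fopen(path, "r");
        if (fp && fgets(buf, sizeof(buf), fp))
                slurm_info("peak_mem: %s = %s", path, buf);
        if (fp)
                fclose(fp);
        return ESPANK_SUCCESS;
}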

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation




[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread Emyr James via slurm-users
Hi Thomas,

I grepped for memory.peak in the source and it's not there. memory.current is
there and is used in src/plugins/cgroup/v2/cgroup_v2.c.

Adding the ability to read memory.peak in this source file seems like something
that should be done?

Should extern cgroup_acct_t *cgroup_p_task_get_acct_data(uint32_t task_id) be
modified to also look at memory.peak?

That may mean extending the accounting stats struct (cgroup_acct_t) in
interfaces/cgroup.h to include it, something like:

typedef struct {
        uint64_t usec;
        uint64_t ssec;
        uint64_t total_rss;
        uint64_t max_rss;
        uint64_t total_pgmajfault;
        uint64_t total_vmem;
} cgroup_acct_t;

Presumably, with the polling method, it reads the current value each time and
keeps track of the maximum of those samples. But the actual maximum may occur
between two polls, so it would never see the true peak. By also reading
memory.peak there is at least a chance of getting closer to the real value with
the polling method, even if this is not optimal. Ideally it would read it during
task cleanup as well as at each poll interval.

As an aside, I also grepped for getrusage and it doesn't seem to be used at all.
I see that the code reads /proc/%d/stat, so maybe that is where it gets the RSS
for non-cgroup accounting. Still, getrusage would seem the more obvious choice
for this?
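
For what it's worth, the kernel does keep a per-process high-water mark regardless of any polling - it is visible both as ru_maxrss from getrusage() and as VmHWM in /proc/self/status. A tiny standalone test (illustrative only):

/* hwm_demo.c - show that the kernel tracks peak RSS without any polling.
 * Build: gcc -o hwm_demo hwm_demo.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

int main(void)
{
        size_t sz = 100 * 1024 * 1024;
        char *p = malloc(sz), line[256];
        struct rusage ru;
        FILE *fp;

        if (!p)
                return 1;
        memset(p, 1, sz);       /* touch 100 MB so it counts towards RSS */
        free(p);                /* freed, but the high-water mark remains */

        getrusage(RUSAGE_SELF, &ru);
        printf("ru_maxrss: %ld kB\n", ru.ru_maxrss);

        fp = fopen("/proc/self/status", "r");
        while (fp && fgets(line, sizeof(line), fp))
                if (!strncmp(line, "VmHWM:", 6))
                        printf("%s", line);
        if (fp)
                fclose(fp);
        return 0;
}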

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation


From: Thomas Green - Staff in University IT, Research Technologies / Staff 
Technoleg Gwybodaeth, Technolegau Ymchwil 
Sent: 20 May 2024 13:08
To: Emyr James ; Davide DelVento 
Cc: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Re: memory high water mark reporting


Hi,

We have had similar questions from users about how best to find the peak memory
of a job, since they may run a job and get a not very useful value for fields in
sacct such as MaxRSS because Slurm didn't happen to poll at the moment of
maximum memory usage.

With cgroup v1, from what I can find online, memory.max_usage_in_bytes includes
caches, so it can vary with how much I/O is done, whilst total_rss in
memory.stat looks more useful. Maybe memory.peak is clearer?

It's not clear in the documentation how a user should use the sacct values to
infer the actual usage of their jobs and correct their behaviour in future
submissions.

I would be keen to see improvements in high water mark reporting. I noticed that
the jobacct_gather plugin documentation was deleted back in Slurm 21.08, so a
SPANK plugin does possibly look like the way to go. It also seems to be a common
problem across technologies, e.g.
https://github.com/google/cadvisor/issues/3286

Tom




[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread Emyr James via slurm-users
A bit more digging

The cgroup code seems to communicate back the values it finds in
src/plugins/jobacct_gather/cgroup/jobacct_gather_cgroup.c:

prec->tres_data[TRES_ARRAY_MEM].size_read =
        cgroup_acct_data->total_rss;

I can't find anywhere in the code where it keeps track of the maximum value of
total_rss seen, so I can only conclude that it must be done in the database when
slurmdbd inserts the values, rather than in the Slurm binaries themselves.

So this does seem to suggest that the peak value recorded at the end is just the
maximum of the memory.current values seen across all the polls, even though much
higher transient values may have occurred between polls; those would be captured
by memory.peak, but Slurm never sees them.

Can anyone more familiar with the code than me corroborate this?

Presumably non-cgroup accounting has a similar issue? I.e. it polls RSS and the
accounting DB then reports the highest value seen, even though calling getrusage
and checking ru_maxrss should be done too?

Many thanks,

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation



[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread Emyr James via slurm-users

I changed the following in src/plugins/cgroup/v2/cgroup_v2.c

    if (common_cgroup_get_param(&task_cg_info->task_cg,
                                "memory.current",
                                &memory_current,
                                &tmp_sz) != SLURM_SUCCESS) {
            if (task_id == task_special_id)
                    log_flag(CGROUP, "Cannot read task_special memory.current file");
            else
                    log_flag(CGROUP, "Cannot read task %d memory.current file",
                             task_id);
    }

to

    if (common_cgroup_get_param(&task_cg_info->task_cg,
                                "memory.peak",
                                &memory_current,
                                &tmp_sz) != SLURM_SUCCESS) {
            if (task_id == task_special_id)
                    log_flag(CGROUP, "Cannot read task_special memory.peak file");
            else
                    log_flag(CGROUP, "Cannot read task %d memory.peak file",
                             task_id);
    }

and I am using a polling interval of 5s. The values I get when adding this to
the end of a batch script:

dir=$(awk -F: '{print $NF}' /proc/self/cgroup)
echo [$(date +"%Y-%m-%d %H:%M:%S")] peak memory is `cat /sys/fs/cgroup$dir/memory.peak`
echo [$(date +"%Y-%m-%d %H:%M:%S")] finished on $(hostname)

compared to the MaxRSS reported by sacct seem to be spot on, for my test jobs at
least. I guess this will do for now, but it still feels very unsatisfactory to
use polling for this instead of having the code trigger the relevant reads on
job cleanup.

The downside of this "quick fix" is that during a job run sstat will now report
the maximum memory seen so far rather than the current usage. Personally I don't
think that is particularly useful anyway, and if you really need to track memory
usage while a job is running, the LD_PRELOAD methods mentioned previously are
better.

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation



[slurm-users] Job Step State

2024-07-12 Thread Emyr James via slurm-users
Dear all,

I am working on a script to take completed job accounting data from the Slurm
accounting database and insert the equivalent data into a ClickHouse table for
fast reporting.

I can see that all the information is in the cluster_job_table and
cluster_job_step_table, which seem to be joined on job_db_inx.

To get the CPU usage and peak memory usage etc. I can see that I need to parse
the tres columns in the job steps. I couldn't find any column called MaxRSS in
the database even though the sacct command prints it. I then found some data in
tres_table and assume that sacct is using this. Please correct me if I'm wrong
and if sacct is getting information from somewhere other than the accounting
database?
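
In case it helps anyone doing the same, the tres columns appear to be plain comma-separated id=value strings (with the numeric ids resolved through tres_table), so parsing them is trivial - a sketch, assuming that format and an invented example string:

/* tres_parse.c - sketch of parsing a Slurm DB tres string such as
 * "1=4,2=16384,4=1,5=4" into (id, value) pairs. The id -> name mapping
 * (1=cpu, 2=mem, ...) comes from tres_table. Assumes that format. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void parse_tres(const char *tres)
{
        char *copy = strdup(tres), *save = NULL;

        for (char *tok = strtok_r(copy, ",", &save); tok;
             tok = strtok_r(NULL, ",", &save)) {
                char *eq = strchr(tok, '=');
                if (!eq)
                        continue;
                *eq = '\0';
                printf("tres id %ld -> %llu\n", strtol(tok, NULL, 10),
                       strtoull(eq + 1, NULL, 10));
        }
        free(copy);
}

int main(void)
{
        parse_tres("1=4,2=16384,4=1,5=4");    /* illustrative values only */
        return 0;
}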

For the state column I get this...

select state, count(*) as num from crg_step_table group by state order by num desc limit 10;

+-------+--------+
| state | num    |
+-------+--------+
|     3 | 590635 |
|     5 |  28345 |
|     4 |   4401 |
|    11 |    962 |
|     1 |      8 |
+-------+--------+

When I use sacct I see states such as COMPLETED, OUT_OF_MEMORY etc., so there
must be a mapping somewhere between these state ids and that text. Can someone
provide that mapping or point me to where it's defined in the database or in the
code?
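
For reference, those numeric states appear to line up with enum job_states in slurm/slurm.h (the stored value can also carry flag bits above the base state, so mask those off first); from memory the base values are:

/* From slurm/slurm.h (paraphrased) - the base job states, in order: */
enum job_states {
        JOB_PENDING,            /*  0 */
        JOB_RUNNING,            /*  1 */
        JOB_SUSPENDED,          /*  2 */
        JOB_COMPLETE,           /*  3  -> COMPLETED      */
        JOB_CANCELLED,          /*  4  -> CANCELLED      */
        JOB_FAILED,             /*  5  -> FAILED         */
        JOB_TIMEOUT,            /*  6  -> TIMEOUT        */
        JOB_NODE_FAIL,          /*  7  -> NODE_FAIL      */
        JOB_PREEMPTED,          /*  8  -> PREEMPTED      */
        JOB_BOOT_FAIL,          /*  9  -> BOOT_FAIL      */
        JOB_DEADLINE,           /* 10  -> DEADLINE       */
        JOB_OOM,                /* 11  -> OUT_OF_MEMORY  */
        JOB_END                 /* not a real state      */
};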

Many thanks,


Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation




[slurm-users] Re: Nodes TRES double what is requested

2024-07-12 Thread Emyr James via slurm-users
Not sure if this is correct, but I think you need to leave a bit of RAM for the
OS to use, so it's best not to let Slurm allocate all of it. I usually take 8G
off to allow for that - negligible when our nodes have at least 768GB of RAM.
At least this is my experience when using cgroups.
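
In slurm.conf terms that just means setting the node's RealMemory a bit below what slurmd -C reports; if I remember right there is also a MemSpecLimit node parameter for reserving memory for system/slurmd use, though I haven't used it myself.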

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation


From: Diego Zuccato via slurm-users 
Sent: 11 July 2024 08:06
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Re: Nodes TRES double what is requested

Hint: round down a bit the RAM reported by 'slurmd -C'. Or you risk the
nodes not coming back up after an upgrade that leaves a bit less free
RAM than configured.

Diego

On 10/07/2024 17:29, Brian Andrus via slurm-users wrote:
> Jack,
>
> To make sure things are set right, run 'slurmd -C' on the node and use
> that output in your config.
>
> It can also give you insight as to what is being seen on the node versus
> what you may expect.
>
> Brian Andrus
>
> On 7/10/2024 1:25 AM, jack.mellor--- via slurm-users wrote:
>> Hi,
>>
>> We are running slurm 23.02.6. Our nodes have hyperthreading disabled
>> and we have slurm.conf set to CPUs=32 for each node (each node has 2
>> processors with 16 cores each). When we allocate a job, such as salloc -n
>> 32, it will allocate a whole node, but sinfo shows double the
>> allocation in TRES=64. It also shows in sinfo that the node has
>> 4294967264 idle CPUs.
>>
>> Not sure if it's a known bug, or an issue with our config? I have tried
>> various things, like setting the sockets/boards in slurm.conf.
>>
>> Thanks
>> Jack
>>
>

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



[slurm-users] Re: Job Step State

2024-10-01 Thread Emyr James via slurm-users

Hi,

I am continuing to try to get this ClickHouse integration working.

I have a completed array job which shows up in sacct. The job id is 245385. If I
do a select in the accounting db for id_job=245385 then I get one row. I then
see that the job_db_inx for this job is 497857, but selecting in the step table
for that job_db_inx returns no rows.

So the question is: since the information for the job does not seem to be in the
job or step table, where is sacct getting this info from?

Does anyone have any information on recreating the output of sacct using queries
on the DB?

Regards,

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation




[slurm-users] GPU Accounting

2024-10-02 Thread Emyr James via slurm-users
We have a node with 8 H100 GPUs that are split into MIG instances. We are using
cgroups. This seems to work fine. Users can do something like

sbatch --gres="gpu:1g.10gb:1" ...

and the job starts on the node with the GPUs; CUDA_VISIBLE_DEVICES and the
PyTorch debug output show that the cgroup only gives them the GPU they asked
for.

In the accounting database, jobs in the job table always have an empty
"gres_used" column. I'd expect to see "gpu:1g.10gb:1" there for the job above.

I have this set in slurm.conf

AccountingStorageTRES=gres/gpu

How can I see what gres was requested with the job? At the moment I only see
something like this in AllocTRES:

billing=1,cpu=1,gres/gpu=1,mem=8G,node=1

and I can't see any way to tell which specific MIG profile was asked for. This
is related to the email from Richard Lefebvre dated 7th June 2023 entitled
"Billing/accounting for MIGs is not working". As far as I can see that got no
replies.
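
One thing I still need to try: the slurm.conf documentation suggests that specific gres types can be listed in AccountingStorageTRES (its example is gres/gpu:tesla), so adding something like gres/gpu:1g.10gb there might make the MIG profile show up as its own TRES - I haven't verified whether that works with MIG profiles though.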

We are running Slurm version 23.11.6.

Regards,

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation

