A bit more digging....
the cgroup plugin seems to be communicating back the values it finds in
src/plugins/jobacct_gather/cgroup/jobacct_gather_cgroup.c:

    prec->tres_data[TRES_ARRAY_MEM].size_read = cgroup_acct_data->total_rss;
I can't find anywhere in the code where the maximum value of total_rss seen is
tracked, so I can only conclude that it must be done in the database when
slurmdbd inserts the values, rather than in the Slurm binaries themselves.
So this does seem to suggest that the peak value accounted at the end is just
the maximum of the memory.current values seen over all the polls. Much higher
transient values may have occurred in between polls; those would be captured
by memory.peak, but Slurm never sees them.
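To put it concretely, the behaviour I am describing amounts to nothing more
than the toy loop below (purely illustrative, with made-up sample values; this
is not code from the Slurm source):

    #include <inttypes.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Pretend these are memory.current readings taken every 30 s. A
         * transient spike that rose and fell between two polls is simply
         * never observed, although memory.peak would have recorded it. */
        uint64_t samples[] = { 1200, 4100, 3800, 2600 };
        uint64_t peak = 0;

        for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
            if (samples[i] > peak)
                peak = samples[i];

        printf("peak of polled values: %" PRIu64 "\n", peak);
        return 0;
    }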
Can anyone more familiar with the code than me corroborate this?
Presumably non-cgroup accounting has a similar issue? I.e. it polls RSS and
the accounting DB then reports the highest value seen, even though calling
getrusage and checking ru_maxrss at the end should also be done?
Many thanks,
Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation
________________________________
From: Emyr James via slurm-users <[email protected]>
Sent: 20 May 2024 13:56
To: Thomas Green - Staff in University IT, Research Technologies / Staff
Technoleg Gwybodaeth, Technolegau Ymchwil <[email protected]>; Davide
DelVento <[email protected]>
Cc: [email protected] <[email protected]>
Subject: [slurm-users] Re: memory high water mark reporting
Siwmae Thomas,
I grepped for memory.peak in the source and it's not there. memory.current is
there and is used in src/plugins/cgroup/v2/cgroup_v2.c
Adding the ability to get memory.peak in this source file seems to be something
that should be done?
Should extern cgroup_acct_t *cgroup_p_task_get_acct_data(uint32_t task_id) be
modified to include looking at memory.peak ?
This may mean needing to modify the acct_stat struct in interfaces/cgroup.h to
include it ?
typedef struct {
    uint64_t usec;
    uint64_t ssec;
    uint64_t total_rss;
    uint64_t max_rss;
    uint64_t total_pgmajfault;
    uint64_t total_vmem;
} cgroup_acct_t;
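For what it's worth, reading that file is straightforward. Something along
these lines perhaps (a rough sketch only, not modelled on the existing
cgroup_p_task_get_acct_data code; the error handling and the max_rss field
name above are my assumptions):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Sketch only: read memory.peak from a cgroup v2 directory into *peak.
     * Returns 0 on success, -1 if the file is missing (e.g. older kernels
     * without memory.peak) or unreadable. */
    int read_memory_peak(const char *cgroup_dir, uint64_t *peak)
    {
        char path[4096];
        FILE *fp;
        int rc = -1;

        snprintf(path, sizeof(path), "%s/memory.peak", cgroup_dir);
        fp = fopen(path, "r");
        if (!fp)
            return -1;
        if (fscanf(fp, "%" SCNu64, peak) == 1)
            rc = 0;
        fclose(fp);
        return rc;
    }

The result could then be stored in the proposed max_rss field alongside
total_rss.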
Presumably, with the polling method, it keeps looking at the current value and
keeps track of the maximum of those values. But the actual maximum may occur
in between two polls, so it would never see the true peak. At least by also
reading memory.peak there is a chance to get closer to the real value with the
polling method, even if this is not optimal. Ideally it should be read during
task cleanup as well as at each poll interval.
As an aside, I also grepped for getrusage and it doesn't seem to be used at
all. I see that it is looking at /proc/%d/stat, so maybe this is where it's
getting the maxrss for non-cgroup accounting. Still, getrusage would seem to
be the more obvious choice for this?
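For reference, the getrusage route is a single call; a standalone illustration
(not from the Slurm source; on Linux ru_maxrss is reported in kilobytes):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/resource.h>

    /* Allocate some memory, touch it, then report the process's peak RSS. */
    int main(void)
    {
        struct rusage ru;
        size_t len = 64 * 1024 * 1024;
        char *buf = malloc(len);

        if (buf)
            memset(buf, 1, len);        /* force the pages to be resident */

        if (getrusage(RUSAGE_SELF, &ru) == 0)
            printf("ru_maxrss = %ld kB\n", ru.ru_maxrss);

        free(buf);
        return 0;
    }

The same call with RUSAGE_CHILDREN after waiting on children reports the
largest ru_maxrss among the reaped children.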
Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation
________________________________
From: Thomas Green - Staff in University IT, Research Technologies / Staff
Technoleg Gwybodaeth, Technolegau Ymchwil <[email protected]>
Sent: 20 May 2024 13:08
To: Emyr James <[email protected]>; Davide DelVento <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: [slurm-users] Re: memory high water mark reporting
Hi,
We have had similar questions from users about how best to find out the peak
memory usage of a job, since they may run a job and get a not very useful
value for fields in sacct such as MaxRSS because Slurm didn't happen to poll
at the moment of maximum memory usage.
With cgroup v1, from looking online, it appears that memory.max_usage_in_bytes
takes caches into account, so it can vary with how much I/O is done, whilst
total_rss in memory.stat looks more useful. Maybe memory.peak is clearer?
It's not clear from the documentation how a user should use the sacct values
to infer the actual usage of jobs and correct their behaviour in future
submissions.
I would be keen to see improvements in high water mark reporting. I noticed
that the jobacctgather plugin documentation was deleted back in Slurm 21.08; a
SPANK plugin does possibly look like the way to go. It also seems to be a
common problem across technologies, e.g.
https://github.com/google/cadvisor/issues/3286
Tom
From: Emyr James via slurm-users <[email protected]>
Date: Monday, 20 May 2024 at 10:50
To: Davide DelVento <[email protected]>, Emyr James <[email protected]>
Cc: [email protected] <[email protected]>
Subject: [slurm-users] Re: memory high water mark reporting
Looking here:
https://slurm.schedmd.com/spank.html#SECTION_SPANK-PLUGINS
It looks like it's possible to hook something in at the right place using the
slurm_spank_task_exit or slurm_spank_exit callbacks. Does anyone have any
experience or examples of doing this? Is there any more documentation
available on this functionality?
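To make the question concrete, I imagine the skeleton would look roughly like
this (my own guess from reading spank.h; the cgroup path layout in particular
is an assumption and would need adjusting to the node's actual cgroup v2
hierarchy and cgroup plugin settings):

    /* Hypothetical SPANK plugin: log memory.peak when a task exits.
     * Build with something like:
     *   gcc -shared -fPIC -o peak_mem_log.so peak_mem_log.c */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <slurm/spank.h>

    SPANK_PLUGIN(peak_mem_log, 1);

    int slurm_spank_task_exit(spank_t sp, int ac, char **av)
    {
        uint32_t job_id = 0, step_id = 0;
        uint64_t peak = 0;
        char path[4096];
        FILE *fp;

        if (!spank_remote(sp))      /* only run on the compute node side */
            return ESPANK_SUCCESS;

        spank_get_item(sp, S_JOB_ID, &job_id);
        spank_get_item(sp, S_JOB_STEPID, &step_id);

        /* Assumed path; check the real hierarchy on your nodes. */
        snprintf(path, sizeof(path),
                 "/sys/fs/cgroup/system.slice/slurmstepd.scope/job_%u/step_%u/memory.peak",
                 job_id, step_id);

        fp = fopen(path, "r");
        if (fp) {
            if (fscanf(fp, "%" SCNu64, &peak) == 1)
                slurm_info("job %u step %u memory.peak=%" PRIu64,
                           job_id, step_id, peak);
            fclose(fp);
        }
        return ESPANK_SUCCESS;
    }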
Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation
________________________________
From: Emyr James via slurm-users <[email protected]>
Sent: 17 May 2024 01:15
To: Davide DelVento <[email protected]>
Cc: [email protected] <[email protected]>
Subject: [slurm-users] Re: memory high water mark reporting
Hi,
I have got a very simple LD_PRELOAD that can do this. Maybe I should see if I
can force slurmstepd to be run with that LD_PRELOAD and then see if that does
it.
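The guts of such a preload library can be tiny; roughly along these lines (a
simplified illustration of the idea, not the exact code I am using; build with
gcc -shared -fPIC -o maxrss.so maxrss.c):

    #include <stdio.h>
    #include <sys/resource.h>
    #include <unistd.h>

    /* Runs automatically when the preloaded process exits and logs its peak
     * RSS (and that of its reaped children) as reported by getrusage. */
    __attribute__((destructor))
    static void log_maxrss(void)
    {
        struct rusage self, children;

        if (getrusage(RUSAGE_SELF, &self) == 0 &&
            getrusage(RUSAGE_CHILDREN, &children) == 0)
            fprintf(stderr, "[maxrss] pid %d self=%ld kB children=%ld kB\n",
                    (int)getpid(), self.ru_maxrss, children.ru_maxrss);
    }

It would then be loaded with LD_PRELOAD=/path/to/maxrss.so before the process
of interest starts.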
Ultimately I am trying to get all the useful accounting metrics into a
ClickHouse database. If the LD_PRELOAD on slurmstepd works, then I can expand
it to insert the relevant row into the ClickHouse DB in the C code of the
preload library.
But still... this seems like a very basic thing to do, and I am very surprised
that it seems so difficult to achieve with the standard accounting recording
out of the box.
Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation
________________________________
From: Davide DelVento <[email protected]>
Sent: 17 May 2024 01:02
To: Emyr James <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: [slurm-users] memory high water mark reporting
Not exactly the answer to your question (which I don't know), but if you can
prefix whatever is executed with this
https://github.com/NCAR/peak_memusage
(which also uses getrusage) or a variant, you will be able to do that.
On Thu, May 16, 2024 at 4:10 PM Emyr James via slurm-users
<[email protected]> wrote:
Hi,
We are trying out Slurm, having been running Grid Engine for a long while.
In Grid Engine, the cgroup peak memory and max_rss are captured at the end of
a job and recorded. It logs the information from the cgroup hierarchy as well
as doing a getrusage call right at the end on the parent PID of the whole job
"container" before cleaning up.
With Slurm it seems that the only way memory is recorded is by the acct gather
polling. I am trying to add something in an epilog script to get memory.peak,
but it looks like the cgroup hierarchy has been destroyed by the time the
epilog is run.
Where in the code is the cgroup hierarchy cleaned up? Is there no way to add
something in so that the accounting is updated during the job cleanup process
and peak memory usage can be accurately logged?
I can reduce the polling interval from 30s to 5s, but I don't know whether
this causes a lot of overhead, and in any case it does not seem a sensible way
to obtain values that should really be determined by an event right at the end
rather than by polling.
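The interval in question is the one set via JobAcctGatherFrequency in
slurm.conf, e.g.:

    JobAcctGatherFrequency=task=5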
Many thanks,
Emyr
--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]