[slurm-users] Re: Dependency jobs

2024-10-17 Thread Adam Holmes via slurm-users
Hi Laura,

That might work for what we need to catch.

Many thanks,

Adam 

-Original Message-
From: Laura Hild via slurm-users  
Sent: 16 October 2024 16:49
To: a...@bramblecfd.com
Cc: slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: Dependency jobs

> I know you can show job info and find what dependency a job is waiting
> for, but I'm more after checking whether there are jobs waiting on the
> current job to complete, given its job ID.

You mean you don't want to just do something like

  squeue -o%i,%E | grep SOME_JOBID

?

Although I guess that won't catch a matching `singleton`.
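
(For the archive, a slightly fuller sketch along those lines -- JOBID is a placeholder, and the singleton part is only a rough heuristic, since singleton dependencies key on job name and user rather than on a job ID:)

  JOBID=123456   # placeholder

  # jobs whose dependency string mentions this job ID
  squeue --noheader -o "%i %E" | awk -v j="$JOBID" '$2 ~ j { print $1 }'

  # rough singleton check: other jobs by the same user with the same name
  NAME=$(squeue -h -j "$JOBID" -o "%j")
  OWNER=$(squeue -h -j "$JOBID" -o "%u")
  squeue -h -u "$OWNER" -n "$NAME" -o "%i %E" | grep -i singleton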


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com



-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [EXTERN] How do you guys track which GPU is used by which job ?

2024-10-17 Thread Pierre-Antoine Schnell via slurm-users

Hello,

we recently started monitoring GPU usage on our GPUs with NVIDIA's DCGM: 
https://developer.nvidia.com/blog/job-statistics-nvidia-data-center-gpu-manager-slurm/


We create a new dcgmi group for each job and start the statistics 
retrieval for it in a prolog script.


Then we stop the retrieval, save the dcgmi verbose stats output and 
delete the dcgmi group in an epilog script.


The output presents JobID, GPU IDs, runtime, energy consumed, and SM 
utilization, among other things.


We retrieve the relevant data into a database and hope to be able to 
advise our users on better practices based on the analysis of it.


Best wishes,
Pierre-Antoine Schnell

On 16.10.24 at 15:10, Sylvain MARET via slurm-users wrote:

Hey guys !

I'm looking to improve GPU monitoring on our cluster. I want to install
this https://github.com/NVIDIA/dcgm-exporter and saw in the README that
it can support tracking of job IDs:
https://github.com/NVIDIA/dcgm-exporter?tab=readme-ov-file#enabling-hpc-job-mapping-on-dcgm-exporter


However, I haven't been able to find any examples of how to do it, nor
does Slurm seem to expose this information by default.
Does anyone here do this? And if so, do you have any examples I could
try to follow? If you have advice on best practices for monitoring GPUs,
I'd be happy to hear it!


Regards,
Sylvain Maret




--
Pierre-Antoine Schnell

Medizinische Universität Wien
IT-Dienste & Strategisches Informationsmanagement
Enterprise Technology & Infrastructure
High Performance Computing

1090 Wien, Spitalgasse 23
Bauteil 88, Ebene 00, Büro 611

+43 1 40160-21304

pierre-antoine.schn...@meduniwien.ac.at

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Issue with interactive jobs

2024-10-17 Thread Nerjes, Onno via slurm-users
Dear all,

we've set up Slurm 24.05.3 on our cluster and are experiencing an issue with
interactive jobs. Previously we used 21.08 with pretty much the same settings, but
without these issues. We've started with a fresh DB, etc.

The behavior of interactive jobs is very erratic. Sometimes they start 
absolutely fine, at other times they die silently in the background, while the 
user has to wait indefinitely. We have been unable to isolate certain users or 
nodes affected by this. On a given node, one user might be able to start an 
interactive job, while another user at the same time isn't able to. The day 
after, the situation might be the other way around.

The exception is jobs that use a reservation; these start fine every time as
far as we can tell. At the same time, the number of idle nodes does not seem to
influence the behavior I described above.

Failed allocation on the front end:
[user1@login1 ~]$ salloc
salloc: Pending job allocation 5052052
salloc: job 5052052 queued and waiting for resources

The same job on the backend:
2024-10-14 11:41:57.680 slurmctld: _job_complete: JobId=5052052 done
2024-10-14 11:41:57.678 slurmctld: _job_complete: JobId=5052052 WEXITSTATUS 1
2024-10-14 11:41:57.678 slurmctld: Killing interactive JobId=5052052: Communication connection failure
2024-10-14 11:41:46.666 slurmctld: sched/backfill: _start_job: Started JobId=5052052 in devel on m02n01
2024-10-14 11:41:30.096 slurmctld: sched: _slurm_rpc_allocate_resources JobId=5052052 NodeList=(null) usec=6258

Raising the debug level has not brought any additional information. We were
hoping that one of you might be able to provide some insight into what the next
troubleshooting steps might be.
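
(For reference, a generic data-gathering sketch, not a diagnosis: the "Communication connection failure" when an interactive job is killed can mean slurmctld could not connect back to the waiting salloc, so the path from the controller back to the login host is one thing worth checking. Host and port below are placeholders.)

# on the login host, with a test allocation pending/granted
salloc -vvv                          # verbose client-side output
ss -tlnp | grep salloc               # which TCP port is salloc listening on?

# from the slurmctld host, check it can reach that port (placeholders)
nc -zv login1 PORT

# and check whether a fixed port range is configured
scontrol show config | grep -i SrunPortRange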

Best regards,


Onno


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [EXTERN] How do you guys track which GPU is used by which job ?

2024-10-17 Thread Markus Kötter via slurm-users

Hi,


As their example was limited to "allgpus", I had posted my take on this
on the NVIDIA developer blog.


It is basically the same, but it looks up the group ID from the dcgmi group
JSON using jp instead of a file.


https://developer.nvidia.com/blog/job-statistics-nvidia-data-center-gpu-manager-slurm/

prolog

group=$(sudo -u $SLURM_JOB_USER dcgmi group -c j$SLURM_JOB_ID)
if [ $? -eq 0 ]; then
  groupid=$(echo $group | awk '{print $10}')
  sudo -u $SLURM_JOB_USER dcgmi group --group $groupid --add $SLURM_JOB_GPUS
  sudo -u $SLURM_JOB_USER dcgmi stats --group $groupid --enable
  sudo -u $SLURM_JOB_USER dcgmi stats --group $groupid --jstart $SLURM_JOBID
fi



epilog

OUTPUTDIR=/tmp/
sudo -u $SLURM_JOB_USER dcgmi stats --jstop $SLURM_JOBID
sudo -u $SLURM_JOB_USER dcgmi stats --verbose --job $SLURM_JOBID | sudo -u $SLURM_JOB_USER tee $OUTPUTDIR/dcgm-gpu-stats-$HOSTNAME-$SLURM_JOBID.out

groupid=$(sudo -u $SLURM_JOB_USER dcgmi group -l --json | jp "body.Groups.children.[*][0][?children.\"Group Name\".value=='j$SLURM_JOBID'].children.\"Group ID\".value | [0]" | sed s/\"//g)

sudo -u $SLURM_JOB_USER dcgmi group --delete $groupid
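
For anyone without jp installed, roughly the same lookup with jq, assuming the JSON shape implied by the query above:

groupid=$(sudo -u $SLURM_JOB_USER dcgmi group -l --json | \
  jq -r --arg name "j$SLURM_JOBID" \
    '.body.Groups.children[] | select(.children."Group Name".value == $name)
       | .children."Group ID".value')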



MfG
--
Markus Kötter, +49 681 870832434
30159 Hannover, Lange Laube 6
Helmholtz Center for Information Security



-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [EXTERN] How do you guys track which GPU is used by which job ?

2024-10-17 Thread Paul Raines via slurm-users


We do the same thing.  Our prolog has

==
# setup DCGMI job stats
if [ -n "$CUDA_VISIBLE_DEVICES" ] ; then
  if [ -d /var/slurm/gpu_stats.run ] ; then
    if pgrep -f nv-hostengine >/dev/null 2>&1 ; then

      groupstr=$(/usr/bin/dcgmi group -c J$SLURM_JOB_ID -a $CUDA_VISIBLE_DEVICES)

      groupid=$(echo $groupstr | awk '{print $10}')

      /usr/bin/dcgmi stats -e
      /usr/bin/dcgmi stats -g $groupid -s $SLURM_JOB_ID

      echo $groupid > /var/slurm/gpu_stats.run/J$SLURM_JOB_ID
    fi
  fi
fi
==

And our epilog has

==
if [ -n "$CUDA_VISIBLE_DEVICES" ] ; then
  if [ -f /var/slurm/gpu_stats.run/J$SLURM_JOB_ID ] ; then
    if pgrep -f nv-hostengine >/dev/null 2>&1 ; then

      groupid=$(cat /var/slurm/gpu_stats.run/J$SLURM_JOB_ID)

      /usr/bin/dcgmi stats -v -j $SLURM_JOBID > /var/slurm/gpu_stats/$SLURM_JOBID
      if [ $? -eq 0 ] ; then
        /bin/rsync -a /var/slurm/gpu_stats/$SLURM_JOBID /cluster/batch/GPU/
        /bin/rm -rf /tmp/gpuprocess.out
        # put the data in MYSQL database with perl script
        /cluster/batch/ADMIN/SCRIPTS/gpuprocess.pl $SLURM_JOB_ID > /tmp/gpuprocess.out 2>&1
        if [ -s /tmp/gpuprocess.out ] ; then
          cat /tmp/gpuprocess.out | mail -s GPU_stat_process_error al...@nmr.mgh.harvard.edu
        fi
      fi

      /usr/bin/dcgmi stats -x $SLURM_JOBID

      /usr/bin/dcgmi group -d $groupid

      /bin/rm /var/slurm/gpu_stats.run/J$SLURM_JOB_ID
    fi
  fi
fi
===


We also have a cron job on each node with GPUs that runs every 10 minutes
to query dcgmi stats and write snapshot data for each GPU to the MySQL
database.
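
A sketch of what such a snapshot could look like (not the actual cron job described above; the DCGM field IDs below are GPU utilization, framebuffer used, and power draw on recent DCGM releases -- verify them on your install with dcgmi dmon -l):

# /etc/cron.d/gpu-snapshot -- sketch only
*/10 * * * * root /usr/bin/dcgmi dmon -e 203,252,155 -c 1 >> /var/slurm/gpu_stats/snapshot-$(hostname -s).log 2>&1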


If you are on RHEL-based boxes, the RPM you need from the NVIDIA repos is
datacenter-gpu-manager.

On Thu, 17 Oct 2024 4:45am, Pierre-Antoine Schnell via slurm-users wrote:



Hello,

we recently started monitoring GPU usage on our GPUs with NVIDIA's DCGM: 
https://developer.nvidia.com/blog/job-statistics-nvidia-data-center-gpu-manager-slurm/


We create a new dcgmi group for each job and start the statistics retrieval 
for it in a prolog script.


Then we stop the retrieval, save the dcgmi verbose stats output and delete 
the dcgmi group in an epilog script.


The output presents JobID, GPU IDs, runtime, energy consumed, and SM 
utilization, among other things.


We retrieve the relevant data into a database and hope to be able to advise 
our users on better practices based on the analysis of it.


Best wishes,
Pierre-Antoine Schnell

On 16.10.24 at 15:10, Sylvain MARET via slurm-users wrote:

 Hey guys !

 I'm looking to improve GPU monitoring on our cluster. I want to install
 this https://github.com/NVIDIA/dcgm-exporter and saw in the README that
 it can support tracking of job IDs:
 https://github.com/NVIDIA/dcgm-exporter?tab=readme-ov-file#enabling-hpc-job-mapping-on-dcgm-exporter


 However, I haven't been able to find any examples of how to do it, nor
 does Slurm seem to expose this information by default.
 Does anyone here do this? And if so, do you have any examples I could try
 to follow? If you have advice on best practices for monitoring GPUs, I'd
 be happy to hear it!

 Regards,
 Sylvain Maret




--
Pierre-Antoine Schnell

Medizinische Universität Wien
IT-Dienste & Strategisches Informationsmanagement
Enterprise Technology & Infrastructure
High Performance Computing

1090 Wien, Spitalgasse 23
Bauteil 88, Ebene 00, Büro 611

+43 1 40160-21304

pierre-antoine.schn...@meduniwien.ac.at

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com






[slurm-users] Job information is not being added to accounting database on new setup

2024-10-17 Thread Adrian Brady via slurm-users
Hi Everyone,

I'm new to Slurm administration and looking for a bit of help!

I've just added accounting to an existing cluster, but job information is not
being added to the accounting MariaDB. When I submit a test job it gets
scheduled fine and is visible with squeue, but I get nothing back from sacct!

I have turned the logging up to debug5 in both the slurmctld and slurmdbd logs
and can't see any errors. I believe communication between slurmctld and
slurmdbd is fine: when I run sacct I can see the database being queried, but it
returns nothing, because nothing has been added to the tables. The cluster
tables were created fine when I ran

#sacctmgr add cluster ny5ktt

$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------

# tail -f slurmdbd.log
[2024-10-17T12:34:45.232] debug:  REQUEST_PERSIST_INIT: CLUSTER:ny5ktt 
VERSION:9216 UID:10001 IP:10.202.233.117 CONN:10
[2024-10-17T12:34:45.232] debug2: accounting_storage/as_mysql: 
acct_storage_p_get_connection: acct_storage_p_get_connection: request new 
connection 1
[2024-10-17T12:34:45.233] debug2: Attempting to connect to localhost:3306
[2024-10-17T12:34:45.274] debug2: DBD_GET_JOBS_COND: called
[2024-10-17T12:34:45.317] debug2: DBD_FINI: CLOSE:1 COMMIT:0
[2024-10-17T12:34:45.317] debug4: accounting_storage/as_mysql: 
acct_storage_p_commit: got 0 commits

The MariaDB is running on its own node with slurmdbd and munged for
authentication. I haven't set up any accounts, users, associations or
enforcements yet. On my lab cluster, jobs were visible in the database without
these being set up. I guess I must be missing something simple in the config
that is stopping jobs from being reported to slurmdbd.
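
(For reference, a rough sketch of the accounting-related lines that usually need to be present for job records to reach slurmdbd, plus two quick checks -- a generic sketch, not a statement about what is or isn't in this particular config:)

# slurm.conf on the controller -- sketch only
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=ny5-pr-kttslurmdb-01.ktt.schonfeld.com
JobAcctGatherType=jobacct_gather/linux    # optional for bare job records; adds per-job usage data

# quick checks
sacctmgr show cluster                      # is ny5ktt registered with slurmdbd?
scontrol show config | grep -i AccountingStorage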

Master Node packages
# rpm -qa |grep slurm
slurm-slurmdbd-20.11.9-1.el8.x86_64
slurm-libs-20.11.9-1.el8.x86_64
slurm-20.11.9-1.el8.x86_64
slurm-slurmd-20.11.9-1.el8.x86_64
slurm-perlapi-20.11.9-1.el8.x86_64
slurm-doc-20.11.9-1.el8.x86_64
slurm-contribs-20.11.9-1.el8.x86_64
slurm-slurmctld-20.11.9-1.el8.x86_64

Database Node packages
# rpm -qa |grep slurm
slurm-slurmdbd-20.11.9-1.el8.x86_64
slurm-20.11.9-1.el8.x86_64
slurm-libs-20.11.9-1.el8.x86_64
slurm-devel-20.11.9-1.el8.x86_64

slurm.conf
#
# See the slurm.conf man page for more information.
#
ClusterName=ny5ktt
ControlMachine=ny5-pr-kttslurm-01
ControlAddr=10.202.233.71
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=99
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
MailProg=/bin/true
MaxJobCount=20
#MaxStepCount=4
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
#MinJobAge=300
#MinJobAge=43200
# CHG0057915
MinJobAge=14400
# CHG0057915
#MaxJobCount=5
#MaxJobCount=10
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
DefMemPerCPU=3000
#FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#SelectTypeParameters=CR_Core
#SelectTypeParameters=CR_CPU
SelectTypeParameters=CR_CPU_Memory
# ECR CHG0056915 10/14/2023
MaxArraySize=5001
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageEnforce=limits
AccountingStorageHost=ny5-pr-kttslurmdb-01.ktt.schonfeld.com
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
#AccountingStorageType=accounting_storage/none
AccountingStor

[slurm-users] Re: Why AllowAccounts not work in slurm-23.11.6

2024-10-17 Thread Paul Raines via slurm-users


I am using Slurm 23.11.3 and AllowAccounts works for me.  We
have a partition defined with AllowAccounts, and if one tries to
submit under an account not in the list, one will get

srun: error: Unable to allocate resources: Invalid account or
account/partition combination specified



Do you have EnforcePartLimits=ALL set?
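
For reference, a sketch of the pieces involved (placeholder node and account names, not taken from either cluster):

# slurm.conf -- sketch only
EnforcePartLimits=ALL
PartitionName=restricted Nodes=node[01-04] AllowAccounts=acct_a,acct_b State=UP

# submitting under an account that is not listed should then fail:
srun -p restricted -A some_other_acct hostname
# srun: error: Unable to allocate resources: Invalid account or account/partition combination specified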


On Wed, 16 Oct 2024 8:55pm, shaobo liu via slurm-users wrote:



I tested Slurm 23.* versions; the AllowAccounts parameter does not work.

daijiangkuicgo--- via slurm-users wrote on Saturday, 29 June 2024 at 16:30:


AllowGroups is ok.

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com



-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: How do you guys track which GPU is used by which job ?

2024-10-17 Thread Sylvain MARET via slurm-users

I started testing in the prolog and you're right!
Before doing anything I wanted to see if there were any best practices.

Regards,
Sylvain Maret

On 16/10/2024 18:03, Brian Andrus via slurm-users wrote:


Looks like there is a step you would need to do to create the required 
job mapping files:


/The DCGM-exporter can include High-Performance Computing (HPC) job 
information into its metric labels. To achieve this, HPC environment 
administrators must configure their HPC environment to generate files 
that map GPUs to HPC jobs./


It does go on to show the conventions/format of the files.

I imagine you could have some bits in a prolog script that create those
mapping files as the job starts on the node, and point dcgm-exporter there.
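
For example, a rough prolog/epilog sketch for such mapping files (the directory and the one-file-per-GPU-index layout here are assumptions -- check the README section linked above for the exact convention dcgm-exporter expects, and point the exporter at the same directory):

# prolog -- sketch only; paths and file layout are assumptions
MAPDIR=/run/dcgm-job-map
mkdir -p "$MAPDIR"
IFS=',' read -ra gpus <<< "${SLURM_JOB_GPUS:-}"
for g in "${gpus[@]}"; do
    echo "$SLURM_JOB_ID" >> "$MAPDIR/$g"
done

# epilog -- remove this job's entry again
MAPDIR=/run/dcgm-job-map
IFS=',' read -ra gpus <<< "${SLURM_JOB_GPUS:-}"
for g in "${gpus[@]}"; do
    [ -f "$MAPDIR/$g" ] && sed -i "/^${SLURM_JOB_ID}\$/d" "$MAPDIR/$g"
done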


Brian Andrus

On 10/16/24 06:10, Sylvain MARET via slurm-users wrote:

Hey guys !

I'm looking to improve GPU monitoring on our cluster. I want to
install this https://github.com/NVIDIA/dcgm-exporter and saw in the
README that it can support tracking of job IDs:
https://github.com/NVIDIA/dcgm-exporter?tab=readme-ov-file#enabling-hpc-job-mapping-on-dcgm-exporter


However, I haven't been able to find any examples of how to do it, nor
does Slurm seem to expose this information by default.
Does anyone here do this? And if so, do you have any examples I could
try to follow? If you have advice on best practices for monitoring GPUs,
I'd be happy to hear it!


Regards,
Sylvain Maret



-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [EXTERN] How do you guys track which GPU is used by which job ?

2024-10-17 Thread Sylvain MARET via slurm-users

Interesting solution, I didn't know it was possible to do this.
I will try to test this as well!

Sylvain

On 17/10/2024 10:45, Pierre-Antoine Schnell via slurm-users wrote:



Hello,

we recently started monitoring GPU usage on our GPUs with NVIDIA's DCGM:
https://developer.nvidia.com/blog/job-statistics-nvidia-data-center-gpu-manager-slurm/ 



We create a new dcgmi group for each job and start the statistics
retrieval for it in a prolog script.

Then we stop the retrieval, save the dcgmi verbose stats output and
delete the dcgmi group in an epilog script.

The output presents JobID, GPU IDs, runtime, energy consumed, and SM
utilization, among other things.

We retrieve the relevant data into a database and hope to be able to
advise our users on better practices based on the analysis of it.

Best wishes,
Pierre-Antoine Schnell

On 16.10.24 at 15:10, Sylvain MARET via slurm-users wrote:

Hey guys !

I'm looking to improve GPU monitoring on our cluster. I want to install
this https://github.com/NVIDIA/dcgm-exporter and saw in the README that
it can support tracking of job IDs:
https://github.com/NVIDIA/dcgm-exporter?tab=readme-ov-file#enabling-hpc-job-mapping-on-dcgm-exporter


However, I haven't been able to find any examples of how to do it, nor does
Slurm seem to expose this information by default.
Does anyone here do this? And if so, do you have any examples I could
try to follow? If you have advice on best practices for monitoring GPUs, I'd
be happy to hear it!

Regards,
Sylvain Maret




--
Pierre-Antoine Schnell

Medizinische Universität Wien
IT-Dienste & Strategisches Informationsmanagement
Enterprise Technology & Infrastructure
High Performance Computing

1090 Wien, Spitalgasse 23
Bauteil 88, Ebene 00, Büro 611

+43 1 40160-21304

pierre-antoine.schn...@meduniwien.ac.at

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com