Thank you for your response. Your explanation helped me understand what was happening.
After writing and running a new test program that only logs on SIGTERM, I could confirm that the GraceTime was applied. Thank you once again. Below is the sample code, for reference, in case it helps others:

$ cat run-gpu.cu
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <cuda_runtime.h>

/* Log the signal but keep running, so the job survives into the grace period. */
void sigterm_handler(int signum) {
    printf("Received SIGTERM, but not terminating\n");
}

__global__ void dummy_kernel(int *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] = idx;
}

int main() {
    signal(SIGTERM, sigterm_handler);

    int *device_data;
    cudaMalloc((void **)&device_data, 1024 * sizeof(int));

    dummy_kernel<<<1, 1024>>>(device_data);
    cudaDeviceSynchronize();

    /* Loop forever; the job only ends when Slurm kills it after GraceTime. */
    while (1) {
        sleep(1);
        printf("Working with GPU...\n");
    }

    cudaFree(device_data);  /* never reached */
    return 0;
}

On Wed, Nov 8, 2023 at 5:02 PM, Rémi Palancher <r...@rackslab.io> wrote:

> On 08/11/2023 at 02:28, 김형진 wrote:
> > Hello ~
> >
> > …
> >
> > However, as soon as the base QoS job is created, the large QoS job is
> > immediately canceled without any waiting time.
> >
> > But in the slurmctld log, there is a grace time log:
> >
> > [2023-11-02T11:37:36.589] debug: setting 3600 sec preemption grace time
> > for JobId=153 to reclaim resources for JobId=154
> >
> > Could you help me understand what might be going wrong?
>
> Note that Slurm sends the SIGTERM signal by default to slurmstepd's
> immediate children (which might be gpu_burn in your case) at _the
> beginning_ of the GraceTime, to notify them of approaching termination.
>
> If the processes react to SIGTERM by terminating, which is generally
> the case, you may have the impression that GraceTime is not honored.
>
> To benefit from the GraceTime, your program must either trap SIGTERM
> with a signal handler, or you must enable the send_user_signal
> PreemptParameters flag and submit your job with --signal and another
> signal.
>
> --
> Rémi Palancher
> Rackslab: Open Source Solutions for HPC Operations
> https://rackslab.io/
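
A small follow-up for anyone reading this later: if the job should actually use the grace period to clean up, rather than simply ignoring SIGTERM as my test program above does, the handler can just set a flag that the main loop checks. This is only a rough sketch of that variation; the flag name and the cleanup steps are my own additions, not something taken from Rémi's reply or from Slurm itself:

#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <cuda_runtime.h>

/* Illustrative variation of run-gpu.cu above: exit cleanly during the
   grace period instead of running until Slurm force-kills the job.
   Only async-signal-safe work is done in the handler itself. */
static volatile sig_atomic_t got_sigterm = 0;

void sigterm_handler(int signum) {
    (void)signum;
    got_sigterm = 1;
}

__global__ void dummy_kernel(int *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] = idx;
}

int main() {
    signal(SIGTERM, sigterm_handler);

    int *device_data;
    cudaMalloc((void **)&device_data, 1024 * sizeof(int));

    dummy_kernel<<<1, 1024>>>(device_data);
    cudaDeviceSynchronize();

    /* Keep working until Slurm signals the start of the grace period,
       then clean up and exit before GraceTime expires. */
    while (!got_sigterm) {
        sleep(1);
        printf("Working with GPU...\n");
    }

    printf("SIGTERM received, freeing GPU memory and exiting\n");
    cudaFree(device_data);
    return 0;
}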