Hi,
As an update, I was able to clear out the orphaned/cancelled jobs by rebooting the compute nodes that had cancelled jobs. The error messages have ceased.
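For anyone hitting the same thing, here is a rough sketch of how such a reboot-based cleanup could be done with standard Slurm tooling. The node list is only illustrative (taken from the "Orphan StepId" messages quoted below), and the reboot step assumes RebootProgram is configured in slurm.conf; otherwise the nodes can be rebooted out-of-band after draining.

# Keep new work off the affected nodes while they are cleaned up
scontrol update NodeName=amd[03,07-10,12-13],aslab01,gpu05 State=DRAIN Reason="clear orphaned job steps"

# Reboot the nodes once they are idle and return them to service automatically
scontrol reboot ASAP nextstate=RESUME amd[03,07-10,12-13],aslab01,gpu05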
Regards,
Jeff

On Wed, Dec 6, 2023 at 8:26 AM Jeffrey McDonald <jmcdo...@umn.edu> wrote:
> Hi,
> Yesterday, an upgrade of Slurm from 22.05.4 to 23.11.0 went sideways and I
> ended up losing a number of jobs on the compute nodes. Ultimately, the
> installation seems to have been successful, but I now have some issues with
> job remnants. About once per minute (per job), the slurmctld daemon is
> logging:
>
> [2023-12-06T08:16:32.505] error: slurm_receive_msg [146.57.133.18:39104]: Zero Bytes were transmitted or received
> [2023-12-06T08:16:32.505] error: slurm_receive_msg [146.57.133.18:39106]: Zero Bytes were transmitted or received
> [2023-12-06T08:16:32.792] error: slurm_receive_msg [146.57.133.38:54722]: Zero Bytes were transmitted or received
> [2023-12-06T08:16:34.189] error: slurm_receive_msg [146.57.133.49:59058]: Zero Bytes were transmitted or received
> [2023-12-06T08:16:34.197] error: slurm_receive_msg [146.57.133.49:58232]: Zero Bytes were transmitted or received
> [2023-12-06T08:16:35.757] error: slurm_receive_msg [146.57.133.39:48856]: Zero Bytes were transmitted or received
> [2023-12-06T08:16:35.757] error: slurm_receive_msg [146.57.133.39:48860]: Zero Bytes were transmitted or received
> [2023-12-06T08:16:36.329] error: slurm_receive_msg [146.57.133.46:50848]: Zero Bytes were transmitted or received
> [2023-12-06T08:16:59.827] error: slurm_receive_msg [146.57.133.14:60328]: Zero Bytes were transmitted or received
> [2023-12-06T08:16:59.828] error: slurm_receive_msg [146.57.133.37:37734]: Zero Bytes were transmitted or received
> [2023-12-06T08:17:03.285] error: slurm_receive_msg [146.57.133.35:41426]: Zero Bytes were transmitted or received
> [2023-12-06T08:17:13.244] error: slurm_receive_msg [146.57.133.105:34416]: Zero Bytes were transmitted or received
> [2023-12-06T08:17:13.726] error: slurm_receive_msg [146.57.133.15:60164]: Zero Bytes were transmitted or received
>
> The controller also shows orphaned jobs:
>
> [2023-12-06T07:47:42.010] error: Orphan StepId=9050.extern reported on node amd03
> [2023-12-06T07:47:42.010] error: Orphan StepId=9055.extern reported on node amd03
> [2023-12-06T07:47:42.011] error: Orphan StepId=8862.extern reported on node amd12
> [2023-12-06T07:47:42.011] error: Orphan StepId=9065.extern reported on node amd07
> [2023-12-06T07:47:42.011] error: Orphan StepId=9066.extern reported on node amd07
> [2023-12-06T07:47:42.011] error: Orphan StepId=8987.extern reported on node amd09
> [2023-12-06T07:47:42.012] error: Orphan StepId=9068.extern reported on node amd08
> [2023-12-06T07:47:42.012] error: Orphan StepId=8862.extern reported on node amd13
> [2023-12-06T07:47:42.012] error: Orphan StepId=8774.extern reported on node amd10
> [2023-12-06T07:47:42.012] error: Orphan StepId=9051.extern reported on node amd10
> [2023-12-06T07:49:22.009] error: Orphan StepId=9071.extern reported on node aslab01
> [2023-12-06T07:49:22.010] error: Orphan StepId=8699.extern reported on node gpu05
>
> On the compute nodes, I see a corresponding error message like this one:
>
> [2023-12-06T08:18:03.292] [9052.extern] error: hash_g_compute: hash plugin with id:0 not exist or is not loaded
> [2023-12-06T08:18:03.292] [9052.extern] error: slurm_send_node_msg: hash_g_compute: REQUEST_STEP_COMPLETE has error
>
> The errors always seem to reference a job that was canceled, e.g.
> 9052:
>
> # sacct -j 9052
> JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
> 9052         sys/dashb+     a40gpu                    24  CANCELLED      0:0
> 9052.batch        batch                                24  CANCELLED      0:0
> 9052.extern      extern                                24  CANCELLED      0:0
>
> These jobs were running at the start of the update but were subsequently
> canceled because of the slurmd or slurmctld timeouts during the update.
> How can I clean this up? I've tried canceling the jobs, but nothing seems
> to work to remove them.
>
> Thanks in advance,
> Jeff
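For completeness, a few checks of this sort should confirm the remnants are gone once the nodes come back. These are standard Slurm/shell commands; the node list is again only the nodes named in the quoted orphan messages, and the controller log path is just an example (it depends on SlurmctldLogFile in slurm.conf).

# No leftover jobs should still be listed on the previously affected nodes
squeue --nodelist=amd[03,07-10,12-13],aslab01,gpu05

# Accounting should show only the final CANCELLED state for the old jobs
sacct -j 9052 --format=JobID,JobName,State,ExitCode

# The "Orphan StepId" / "Zero Bytes" errors should stop accumulating
grep -E "Orphan StepId|Zero Bytes" /var/log/slurm/slurmctld.log | tail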