Tried deleting the existing reservations today, and got this: ----- root@captain1:~# scontrol show res ReservationName=res17-pc2 StartTime=2019-02-25T14:58:40 EndTime=2029-02-22T14:58:40 Duration=3650-00:00:00 Nodes=res17-pc2 NodeCnt=1 CoreCnt=6 Features=(null) PartitionName=desktops Flags=SPEC_NODES TRES=cpu=12 Users=samuel Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
ReservationName=res18-pc5 StartTime=2019-04-25T11:47:05 EndTime=2020-04-24T11:47:05 Duration=365-00:00:00 Nodes=res18-pc5 NodeCnt=1 CoreCnt=6 Features=(null) PartitionName=(null) Flags=SPEC_NODES TRES=cpu=12 Users=grv Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a root@captain1:~# scontrol delete ReservationName=res17-pc2 delete_reservation ReservationName=res17-pc2: Unspecified error root@captain1:~# scontrol delete ReservationName=res18-pc5 delete_reservation ReservationName=res18-pc5: Unspecified error ----- In looking in the logs, slurmctld.log has many of these lines: [(date/time-string)] debug: slurmdbd: PERSIST_RC is -1 from DBD_MODIFY_RESV(1463): (null) And as mentioned before, many of these: [(date/time-string)] debug2: DBD_MODIFY_RESV: called [(date/time-string)] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster' [(date/time-string)] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster' Can anyone out there help me to understand how to fix this? -----Original Message----- From: Will Dennis Sent: Tuesday, May 07, 2019 11:01 AM To: slurm-users@lists.schedmd.com Subject: Slurm database error messages (redux) Hi all, We had to restart the slurmdbd service on one of our clusters running Slurm 17.11.7 yesterday, since folks were experiencing errors with job scheduling, and running 'sacct': ----- $ sacct -X -p -o jobid,jobname,user,partition%-30,nodelist,alloccpus,reqmem,cputime,qos,state,exitcode,AllocTRES%-50 -s R --allusers sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to captain1:6819: Connection refused sacct: error: slurmdbd: Sending PersistInit msg: Connection refused sacct: error: Problem talking to the database: Connection refused ----- Looking in the logs post-restart, I see a large number of messages such as these: ----- [2019-05-07T07:35:17.000] debug2: DBD_MODIFY_RESV: called [2019-05-07T07:35:17.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster' [2019-05-07T07:35:17.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster' [2019-05-07T07:35:35.000] debug2: DBD_MODIFY_RESV: called [2019-05-07T07:35:35.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster' [2019-05-07T07:35:35.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster' [2019-05-07T07:35:53.000] debug2: DBD_MODIFY_RESV: called [2019-05-07T07:35:53.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster' [2019-05-07T07:35:53.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster' [2019-05-07T07:36:11.000] debug2: DBD_MODIFY_RESV: called [2019-05-07T07:36:11.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster' [2019-05-07T07:36:11.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster' ----- I read today's list message entitled "Slurm database failure messages", and although different, I saw that there was a linked bug report that had to do with problems with reservations. It suggested gathering data via three commands, the output of which from our cluster are seen here: ----- root@captain1:/var/log# scontrol show reservations ReservationName=res17-pc2 StartTime=2019-02-25T14:58:40 EndTime=2029-02-22T14:58:40 Duration=3650-00:00:00 Nodes=res17-pc2 NodeCnt=1 CoreCnt=6 Features=(null) PartitionName=desktops Flags=SPEC_NODES TRES=cpu=12 Users=samuel Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a ReservationName=res18-pc5 StartTime=2019-04-25T11:47:05 EndTime=2020-04-24T11:47:05 Duration=365-00:00:00 Nodes=res18-pc5 NodeCnt=1 CoreCnt=6 Features=(null) PartitionName=(null) Flags=SPEC_NODES TRES=cpu=12 Users=grv Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a root@captain1:/var/log# sacctmgr show reservations Cluster Name TRES TimeStart TimeEnd UnusedWall ---------- --------------- ------------------------------ ------------------- ------------------- ---------- rescluster res17-pc2 cpu=12 2019-04-24T13:29:52 2029-02-22T14:58:40 0.000000 mysql> select * from rescluster_resv_table\G *************************** 1. row *************************** id_resv: 1 deleted: 1 assoclist: 12 flags: 65535 nodelist: res17-pc2,captain2,server13k,server15k,server25k node_inx: 0-4 resv_name: res17-pc2 time_start: 1551135476 time_end: 1551135512 tres: 1=140 unused_wall: 36 *************************** 2. row *************************** id_resv: 2 deleted: 0 assoclist: 12 flags: 32768 nodelist: res17-pc2 node_inx: 0 resv_name: res17-pc2 time_start: 1551135520 time_end: 1551141705 tres: 1=12 unused_wall: 6176.5 *************************** 3. row *************************** id_resv: 2 deleted: 0 assoclist: 12 flags: 32768 nodelist: res17-pc2 node_inx: 0 resv_name: res17-pc2 time_start: 1551141705 time_end: 1551734095 tres: 1=12 unused_wall: 581590 *************************** 4. row *************************** id_resv: 2 deleted: 0 assoclist: 12 flags: 32768 nodelist: res17-pc2 node_inx: 0 resv_name: res17-pc2 time_start: 1551734095 time_end: 1551847812 tres: 1=12 unused_wall: 117173.666667 *************************** 5. row *************************** id_resv: 2 deleted: 0 assoclist: 12 flags: 32768 nodelist: res17-pc2 node_inx: 0 resv_name: res17-pc2 time_start: 1551847812 time_end: 1552353438 tres: 1=12 unused_wall: 480521 *************************** 6. row *************************** id_resv: 2 deleted: 0 assoclist: 12 flags: 32768 nodelist: res17-pc2 node_inx: 0 resv_name: res17-pc2 time_start: 1552353438 time_end: 1554771615 tres: 1=12 unused_wall: 2367043 *************************** 7. row *************************** id_resv: 2 deleted: 0 assoclist: 12 flags: 32768 nodelist: res17-pc2 node_inx: 0 resv_name: res17-pc2 time_start: 1554771615 time_end: 1556137792 tres: 1=12 unused_wall: 2006236 *************************** 8. row *************************** id_resv: 2 deleted: 0 assoclist: 12 flags: 32768 nodelist: res17-pc2 node_inx: 0 resv_name: res17-pc2 time_start: 1556137792 time_end: 1866495520 tres: 1=12 unused_wall: 0 8 rows in set (0.00 sec) ----- So it seems to me that the reservations are messed up; how to go about fixing this? Thanks in advance for any help provided...