I tried upgrading Slurm to 18.08.9 and I am still getting this Segmentation Fault!

On Tue, May 5, 2020 at 2:39 PM Dustin Lang <dstnd...@gmail.com> wrote:

> Hi,
>
> Apparently my colleague upgraded the mysql client and server, but, as far
> as I can tell, this was only 5.7.29 to 5.7.30, and checking the mysql
> release notes I don't see anything that looks suspicious there...
>
> cheers,
> --dustin
>
>
> On Tue, May 5, 2020 at 1:37 PM Dustin Lang <dstnd...@gmail.com> wrote:
>
>> Hi,
>>
>> We're running Slurm 17.11.12. Everything has been working fine, and then
>> suddenly slurmctld is crashing and slurmdbd is crashing.
>>
>> We use fair-share as part of the queuing policy, and previously set up
>> accounts with sacctmgr; that has been working fine for months.
>>
>> If I run slurmdbd in debug mode,
>>
>> slurmdbd -D -v -v -v -v -v
>>
>> it eventually (after being contacted by slurmctld) segfaults with:
>>
>> ...
>> slurmdbd: debug2: DBD_NODE_STATE: NODE:cn049 STATE:UP REASON:(null) TIME:1588695584
>> slurmdbd: debug4: got 0 commits
>> slurmdbd: debug2: DBD_NODE_STATE: NODE:cn050 STATE:UP REASON:(null) TIME:1588695584
>> slurmdbd: debug4: got 0 commits
>> slurmdbd: debug4: got 0 commits
>> slurmdbd: debug2: DBD_GET_TRES: called
>> slurmdbd: debug4: got 0 commits
>> slurmdbd: debug2: DBD_GET_QOS: called
>> slurmdbd: debug4: got 0 commits
>> slurmdbd: debug2: DBD_GET_USERS: called
>> slurmdbd: debug4: got 0 commits
>> slurmdbd: debug2: DBD_GET_ASSOCS: called
>> slurmdbd: debug4: 10(as_mysql_assoc.c:2033) query
>> call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0); select @par_id, @mj, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, @delta_qos;
>> Segmentation fault (core dumped)
>>
>>
>> It looks (running slurmdbd in gdb) like that segfault is coming from
>>
>> https://github.com/SchedMD/slurm/blob/slurm-17-11-12-1/src/plugins/accounting_storage/mysql/as_mysql_assoc.c#L2073
>>
>> and If I connect to the mysql database directly and call that stored
>> procedure, I get
>>
>> mysql> call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0);
>>
>> +---------------------+-----------------+-------------------------+----------------------+---------------------------+-------------+-----------------------------------------------------------------+-------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-----------------------------+
>> | @par_id := id_assoc | @mj := max_jobs | @msj := max_submit_jobs | @mwpj := max_wall_pj | @def_qos_id := def_qos_id | @qos := qos | @delta_qos := REPLACE(CONCAT(delta_qos, @delta_qos), ',,', ',') | @mtpj := CONCAT(@mtpj, if (@mtpj != '' && max_tres_pj != '', ',', ''), max_tres_pj) | @mtpn := CONCAT(@mtpn, if (@mtpn != '' && max_tres_pn != '', ',', ''), max_tres_pn) | @mtmpj := CONCAT(@mtmpj, if (@mtmpj != '' && max_tres_mins_pj != '', ',', ''), max_tres_mins_pj) | @mtrm := CONCAT(@mtrm, if (@mtrm != '' && max_tres_run_mins != '', ',', ''), max_tres_run_mins) | @my_acct_new := parent_acct |
>> +---------------------+-----------------+-------------------------+----------------------+---------------------------+-------------+-----------------------------------------------------------------+-------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-----------------------------+
>> | 1                   | NULL            | NULL                    | NULL                 | NULL                      | ,1,         | NULL                                                            | NULL                                                                                | NULL                                                                                | NULL                                                                                             | NULL                                                                                            |                             |
>> +---------------------+-----------------+-------------------------+----------------------+---------------------------+-------------+-----------------------------------------------------------------+-------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-----------------------------+
>>
>> and if I run
>>
>> mysql> call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0); select @par_id, @mj, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, @delta_qos;
>>
>> I get
>>
>> +---------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
>> | @par_id | @mj  | @msj | @mwpj | @mtpj | @mtpn | @mtmpj | @mtrm | @def_qos_id | @qos | @delta_qos |
>> +---------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
>> | 1       | NULL | NULL | NULL  | NULL  | NULL  | NULL   | NULL  | NULL        | ,1,  | NULL       |
>> +---------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
>>
>> but I don't know what to do about this.
>>
>> We use another product ("Bright Cluster Manager") to manage some aspects
>> of the cluster and Slurm installation, so we are hesitant to just upgrade
>> Slurm.
>>
>> I would appreciate any tips.
>>
>> Thanks,
>> --dustin