Dear all,

We have been running a computing cluster with slurm since 2016; I installed it back then with some help from others. I was quite late on upgrades and decided to upgrade the cluster from Debian stretch, which runs slurm 16.05.9, to Debian bullseye, which runs slurm 20.11.7.

While the upgrade of the system itself went smoothly, slurm is now broken. Of course, that is the stage at which I thought "Oh, I should have checked whether the upgrade is supposed to be harmless"... Now that the self-bashing is rightfully done, I would be very grateful for some help! I am hesitating between two strategies: removing slurm completely and doing a fresh installation, or trying to save what can be saved... I am tempted by the former since I remember suffering a bit to get the installation right in the first place...

Munge still works fine, but when I run

slurmctld -Dvvvvv -c

everything goes smoothly until:

[...]
slurmctld: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin loaded
slurmctld: debug3: Success.
slurmctld: debug2: slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 127.0.1.1:6819: Connection refused
slurmctld: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:kandinsky:6819: Connection refused
slurmctld: error: Sending PersistInit msg: Connection refused
slurmctld: accounting_storage/slurmdbd: _load_dbd_state: recovered 0 pending RPCs
slurmctld: accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd
slurmctld: debug2: slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 127.0.1.1:6819: Connection refused
slurmctld: error: Sending PersistInit msg: Connection refused
slurmctld: debug:  Association database appears down, reading from state file.
slurmctld: debug:  create_mmap_buf: Failed to open file `/var/spool/slurm.state/last_tres`, No such file or directory
slurmctld: debug2: No last_tres file (/var/spool/slurm.state/last_tres) to recover
slurmctld: debug:  create_mmap_buf: Failed to open file `/var/spool/slurm.state/assoc_mgr_state`, No such file or directory
slurmctld: debug2: No association state file (/var/spool/slurm.state/assoc_mgr_state) to recover
slurmctld: fatal: You are running with a database but for some reason we have no TRES from it.  This should only happen if the database is down and you don't have any state files.

6819 is the port on which slurmdbd is supposed to be listening, so I tried:

slurmdbd -Dvvvvv

which yields

slurmdbd: debug:  Log file re-opened
slurmdbd: pidfile not locked, assuming no running daemon
slurmdbd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/auth_munge.so
slurmdbd: debug:  auth/munge: init: Munge authentication plugin loaded
slurmdbd: debug3: Success.
slurmdbd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/accounting_storage_mysql.so
slurmdbd: debug2: accounting_storage/as_mysql: init: mysql_connect() called for db slurm_db
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 10.5.11-MariaDB-1
slurmdbd: debug2: accounting_storage/as_mysql: _check_database_variables: innodb_buffer_pool_size: 134217728
slurmdbd: debug2: accounting_storage/as_mysql: _check_database_variables: innodb_log_file_size: 100663296
slurmdbd: debug2: accounting_storage/as_mysql: _check_database_variables: innodb_lock_wait_timeout: 50
slurmdbd: error: Database settings not recommended values: innodb_buffer_pool_size innodb_lock_wait_timeout
slurmdbd: debug4: accounting_storage/as_mysql: _set_db_curr_ver: 0(as_mysql_convert.c:128) query
select version from convert_version_table
slurmdbd: debug4: accounting_storage/as_mysql: as_mysql_convert_tables_pre_create: as_mysql_convert_tables_pre_create: No conversion needed, Horray!
slurmdbd: debug4: accounting_storage/as_mysql: as_mysql_convert_tables_post_create: as_mysql_convert_tables_post_create: No conversion needed, Horray!
slurmdbd: debug4: accounting_storage/as_mysql: as_mysql_convert_non_cluster_tables_post_create: as_mysql_convert_non_cluster_tables_post_create: No conversion needed, Horray!
slurmdbd: error: mysql_query failed: 1558 Column count of mysql.proc is wrong. Expected 21, found 20. Created with MariaDB 100126, now running 100511. Please use mariadb-upgrade to fix this error
drop procedure if exists get_parent_limits; create procedure get_parent_limits(my_table text, acct text, cluster text, without_limits int) begin set @par_id = NULL; set @mj = NULL; set @mja = NULL; set @mpt = NULL; set @msj = NULL; set @mwpj = NULL; set @mtpj = ''; set @mtpn = ''; set @mtmpj = ''; set @mtrm = ''; set @prio = NULL; set @def_qos_id = NULL; set @qos = ''; set @delta_qos = ''; set @my_acct = acct; if without_limits then set @mj = 0; set @msj = 0; set @mwpj = 0; set @prio = 0; set @def_qos_id = 0; set @qos = 1; end if; REPEAT set @s = 'select '; if @par_id is NULL then set @s = CONCAT(@s, '@par_id := id_assoc, '); end if; if @mj is NULL then set @s = CONCAT(@s, '@mj := max_jobs, '); end if; if @mja is NULL then set @s = CONCAT(@s, '@mja := max_jobs_accrue, '); end if; if @mpt is NULL then set @s = CONCAT(@s, '@mpt := min_prio_thresh, '); end if; if @msj is NULL then set @s = CONCAT(@s, '@msj := max_submit_jobs, '); end if; if @mwpj is NULL then set @s = CONCAT(@s, '@mwpj := max_wall_pj, '); end if; if @prio is NULL then set @s = CONCAT(@s, '@prio := priority, '); end if; if @def_qos_id is NULL then set @s = CONCAT(@s, '@def_qos_id := def_qos_id, '); end if; if @qos = '' then set @s = CONCAT(@s, '@qos := qos, @delta_qos := REPLACE(CONCAT(delta_qos, @delta_qos), \',,\', \',\'), '); end if; set @s = concat(@s, '@mtpj := CONCAT(@mtpj, if (@mtpj != \'\' && max_tres_pj != \'\', \',\', \'\'), max_tres_pj), @mtpn := CONCAT(@mtpn, if (@mtpn != \'\' && max_tres_pn != \'\', \',\', \'\'), max_tres_pn), @mtmpj := CONCAT(@mtmpj, if (@mtmpj != \'\' && max_tres_mins_pj != \'\', \',\', \'\'), max_tres_mins_pj), @mtrm := CONCAT(@mtrm, if (@mtrm != \'\' && max_tres_run_mins != \'\', \',\', \'\'), max_tres_run_mins), @my_acct_new := parent_acct from "', cluster, '_', my_table, '" where acct = \'', @my_acct, '\' && user=\'\''); prepare query from @s; execute query; deallocate prepare query; set @my_acct = @my_acct_new; UNTIL without_limits || @my_acct = '' END REPEAT; END;
slurmdbd: error: mysql_query failed: 1558 Column count of mysql.proc is wrong. Expected 21, found 20. Created with MariaDB 100126, now running 100511. Please use mariadb-upgrade to fix this error
drop procedure if exists get_coord_qos; create procedure get_coord_qos(my_table text, acct text, cluster text, coord text) begin set @qos = ''; set @delta_qos = ''; set @found_coord = NULL; set @my_acct = acct; REPEAT set @s = 'select @qos := t1.qos, @delta_qos := REPLACE(CONCAT(t1.delta_qos, @delta_qos), \',,\', \',\'), @my_acct_new := parent_acct, @found_coord_curr := t2.user '; set @s = concat(@s, 'from "', cluster, '_', my_table, '" as t1 left outer join acct_coord_table as t2 on t1.acct=t2.acct where t1.acct = @my_acct && t1.user=\'\' && (t2.user=\'', coord, '\' || t2.user is null)'); prepare query from @s; execute query; deallocate prepare query; if @found_coord_curr is not NULL then set @found_coord = @found_coord_curr; end if; if @found_coord is NULL then set @qos = ''; set @delta_qos = ''; end if; set @my_acct = @my_acct_new; UNTIL @qos != '' || @my_acct = '' END REPEAT; select REPLACE(CONCAT(@qos, @delta_qos), ',,', ','); END;
slurmdbd: accounting_storage/as_mysql: init: Accounting storage MYSQL plugin failed
slurmdbd: error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
slurmdbd: error: cannot create accounting_storage context for accounting_storage/mysql
slurmdbd: fatal: Unable to initialize accounting_storage/mysql accounting storage plugin
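
Since the verbose output is long, the relevant lines can be isolated by keeping only the error and fatal messages. This is a minimal sketch: only_errors is a hypothetical helper name, and it assumes that every message of interest contains the literal text "error:" or "fatal:", as in the output above.

```shell
# Keep only the error and fatal lines of a slurm daemon log read from stdin.
# only_errors is a hypothetical helper name, not part of slurm.
only_errors() {
    grep -E 'error:|fatal:'
}

# Intended usage (left commented out so the sketch has no side effects):
#   slurmdbd -Dvvvvv 2>&1 | only_errors
```

The SQL statement printed after each mysql_query error sits on its own line and is therefore dropped by the filter, which leaves just the messages themselves.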

It thus seems that the database format is wrong. I do not care about the previous logs, so I would be happy to erase the previous tables and create new ones, if possible, but I do not know how to proceed :-)
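
For what it is worth, the two numbers in the 1558 error can be decoded into ordinary version strings. The sketch below assumes the usual MySQL/MariaDB numeric version encoding (major*10000 + minor*100 + patch); decode_db_version is a hypothetical helper name.

```shell
# Decode a numeric version id, assuming major*10000 + minor*100 + patch.
# decode_db_version is a hypothetical helper name, not part of MariaDB or slurm.
decode_db_version() {
    id=$1
    echo "$((id / 10000)).$((id / 100 % 100)).$((id % 100))"
}

decode_db_version 100126   # "Created with MariaDB 100126" -> 10.1.26
decode_db_version 100511   # "now running 100511"          -> 10.5.11
```

The second value agrees with the "10.5.11-MariaDB-1" reported earlier in the slurmdbd output, so the error appears to concern system tables created by the old 10.1 server and not yet upgraded for 10.5.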

I tried running

mariadb-upgrade

but got

Version check failed. Got the following error when calling the 'mysql' command line client
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO)
FATAL ERROR: Upgrade failed

I have to admit that I do not remember setting a root password, but this was a long time ago and I was not the only one messing with the cluster... I tried to follow this guide to reset the root password:

https://linuxize.com/post/how-to-reset-a-mysql-root-password/

but this does not seem to work. I would be grateful for any suggestions!

Best,

Julien Tailleur




