Hello,

Thank you for this explanation.

We also ran into this problem (when migrating from 16.05.4 to 17.02.1) and finally fixed it by executing the following database query:

UPDATE slurm_acct_db.prometheus_job_table SET mem_req=IF(mem_req&0x80000000,(mem_req&0x7fffffff)|0x8000000000000000,mem_req);

where "prometheus" is our cluster name.

This query makes the modifications you described.

Regards,
Jacek

On 26.01.2018 at 11:59, Lech Nieroda wrote:
Dear slurm users,

we ran into a problem after upgrading from Slurm 15.08.12 to 17.02.6 back in August 2017: all old jobs that had requested their memory with the 'mem-per-cpu' option showed absurd values in the 'reqmem' attribute when queried with sacct. The values were in the petabyte range, whereas they should have been in the gigabyte range.

An analysis of the issue has shown the following:
The attribute corresponding to 'reqmem' in the database is 'mem_req' in the 'cheops_job_table' table. It stores both 'mem' and 'mem-per-cpu' values: a 'mem' value is stored directly, and a 'mem-per-cpu' value is stored with a certain flag (bit) set. In Slurm 15.08.12 the 'mem_req' attribute is a simple int (32 bit) and the flag is the 32nd bit (2^31). In Slurm 17.02.6 the 'mem_req' attribute is a bigint (64 bit) and the flag is the 64th bit (2^63). Thus the old 'mem-per-cpu' values, with 2^31 still "added" to them, appeared as petabytes.
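
To make the arithmetic concrete, here is a rough Python sketch of the effect. It is only an illustration and is not taken from the Slurm source: the names OLD_FLAG, NEW_FLAG and decode are made up, and the 4096 MB request is an invented example that assumes 'mem_req' is stored in megabytes.

```python
OLD_FLAG = 1 << 31  # per-CPU flag position in the old 32-bit column
NEW_FLAG = 1 << 63  # per-CPU flag position in the new 64-bit column


def decode(mem_req, flag):
    """Split a stored mem_req into (value without the flag, is_per_cpu)."""
    return mem_req & ~flag, bool(mem_req & flag)


# Hypothetical old job that requested 4096 MB per CPU.
stored = OLD_FLAG | 4096

# Read with the old flag position: the request is recovered correctly.
old_view = decode(stored, OLD_FLAG)

# Read with the new flag position: bit 31 is treated as part of the
# value, so the request looks like 2^31 + 4096 MB (about 2 PiB).
new_view = decode(stored, NEW_FLAG)

print(old_view)
print(new_view)
```

With the old flag position the value decodes as a 4096 MB per-CPU request; with the new one the flag is not recognised and 2^31 stays in the number.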

The uint32_t -> uint64_t change was made in a commit dated 2016-06-27, with the annotation that it requires further "table conversion logic to MySQL, as mem_req column needs to change type to 'bigint unsigned' from 'int unsigned'.". I don't know whether this work has been done, but when we upgraded Slurm and the database was converted automatically, the values were not corrected and no error was reported concerning this issue.

In case you have run into something similar, the fix is simple: we converted the values 'manually', i.e. we ran a query that selected all entries with 2^31 <= mem_req < 2^63, made a backup, cleared the 2^31 bit, set the 2^63 bit, stored the values and checked them.
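
The per-row conversion described above can be sketched in Python as follows. This is only a model of the bit manipulation for checking values by hand, not the query that was actually run; the function names needs_fix and convert are hypothetical.

```python
OLD_FLAG = 1 << 31  # flag bit used by the old 32-bit column
NEW_FLAG = 1 << 63  # flag bit used by the new 64-bit column


def needs_fix(mem_req):
    """True for entries in the range 2^31 <= mem_req < 2^63."""
    return OLD_FLAG <= mem_req < NEW_FLAG


def convert(mem_req):
    """Clear the 2^31 bit and set the 2^63 bit on affected entries only."""
    if not needs_fix(mem_req):
        return mem_req  # plain 'mem' value or already converted
    return (mem_req & ~OLD_FLAG) | NEW_FLAG


# Invented sample values: an old per-CPU entry, a plain 'mem' entry,
# and an entry that already uses the new flag position.
samples = [OLD_FLAG | 4096, 4096, NEW_FLAG | 2048]
converted = [convert(v) for v in samples]
print(converted)
```

Only the first sample changes. The other two fall outside the selected range, so applying the conversion a second time leaves every value as it is.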


Regards,
Lech

--
Dipl.-Wirt.-Inf. Lech Nieroda
Regionales Rechenzentrum der Universität zu Köln (RRZK)

--
Jacek Budzowski
System administrator
ACC Cyfronet AGH

