Hello,
Thank you for this explanation.
We also experienced this problem (when migrating from 16.05.4 to
17.02.1) and finally fixed it by executing the following database query:
UPDATE slurm_acct_db.prometheus_job_table SET
mem_req=IF(mem_req&0x80000000,(mem_req&0x7fffffff)|0x8000000000000000,mem_req);
where "prometheus" is our cluster name.
This query performs the modifications you described.
Regards,
Jacek
On 26.01.2018 at 11:59, Lech Nieroda wrote:
Dear slurm users,
we ran into a problem after upgrading from Slurm 15.08.12 to
17.02.6 back in August 2017: all old jobs that had requested memory
with the 'mem-per-cpu' option showed absurd values in the 'reqmem'
attribute when queried with sacct.
The values were somewhere in the petabyte range, whereas they should
have been in the gigabyte range.
An analysis of the issue has shown the following:
The attribute corresponding to 'reqmem' in the database is 'mem_req'
in the 'cheops_job_table' table. It stores both 'mem' and
'mem-per-cpu' values: the 'mem' value is stored directly and the
'mem-per-cpu' value is stored with a certain flag (bit) set.
In Slurm 15.08.12 the 'mem_req' attribute is a simple int (32-bit) and
the flag is the 32nd bit.
In Slurm 17.02.6 the 'mem_req' attribute is a bigint (64-bit) and the
flag is the 64th bit.
Thus, after the upgrade, the 'mem-per-cpu' values still carried the old
flag: the 2^31 bit was read as part of the value, so they appeared as
petabytes.
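
To illustrate the misreading, here is a minimal Python sketch. The names
OLD_FLAG, NEW_FLAG and decode are made up for this example and are not
Slurm identifiers, and 4000 is an arbitrary sample value; only the two
bit positions come from the analysis above.

```python
# Hypothetical names; only the bit positions are taken from the analysis.
OLD_FLAG = 1 << 31  # 32nd bit: per-CPU flag in the 32-bit column (15.08.12)
NEW_FLAG = 1 << 63  # 64th bit: per-CPU flag in the 64-bit column (17.02.6)


def decode(mem_req, flag):
    """Split a stored mem_req into (value, is_per_cpu) for a given flag bit."""
    return mem_req & ~flag, bool(mem_req & flag)


# A value written by the old version as "4000 per CPU" (arbitrary example).
stored = 4000 | OLD_FLAG

# Read with the old flag position, the value and the flag come back intact.
old_view = decode(stored, OLD_FLAG)

# Read with the new flag position, the old flag bit is treated as part of
# the value and the per-CPU marker is lost.
new_view = decode(stored, NEW_FLAG)

print(old_view)
print(new_view)
```

Read with the new flag position, the stored number stays above 2^31 and
no longer looks like a per-CPU request at all.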
The uint32_t -> uint64_t change took place with the commit of
2016-06-27, with the annotation that it requires further "table
conversion logic to MySQL, as mem_req column needs to change type to
'bigint unsigned' from 'int unsigned'".
I don't know whether this work has been done, but when we upgraded Slurm
and the database was converted automatically, the values were not
corrected and no error was reported concerning this issue.
In case you have run into something similar, the fix is simple: we
converted the values 'manually', i.e. we wrote a query that selected all
entries with 2^31 <= mem_req < 2^63, made a backup, cleared the 2^31
bit, set the 2^63 bit, stored the values and checked them.
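
The per-value conversion can be sketched in Python as follows. This is
only an illustration of the bit arithmetic, not the query we actually
ran; the names needs_fix and convert are hypothetical, and the sketch
assumes that every value in the selected range comes from the old 32-bit
column, so that its 2^31 bit is really the old flag.

```python
# Hypothetical helper names; the bit positions follow the steps above.
OLD_FLAG = 1 << 31  # flag position in the old 32-bit column
NEW_FLAG = 1 << 63  # flag position in the new 64-bit column


def needs_fix(mem_req):
    """Select entries with 2^31 <= mem_req < 2^63."""
    return OLD_FLAG <= mem_req < NEW_FLAG


def convert(mem_req):
    """Clear the 2^31 bit and set the 2^63 bit for selected entries.

    Assumption: selected values originate from the old 32-bit column,
    so the 2^31 bit is the old per-CPU flag and not part of the value.
    """
    if not needs_fix(mem_req):
        return mem_req  # plain 'mem' values and already-converted rows
    return (mem_req & ~OLD_FLAG) | NEW_FLAG


# Arbitrary sample rows: a plain value, an old-style flagged value and
# an already-converted value.
rows = [4000, 4000 | OLD_FLAG, 4000 | NEW_FLAG]
fixed = [convert(r) for r in rows]
print(fixed)
```

Running the conversion a second time leaves the values unchanged, since
converted entries are at or above 2^63 and fall outside the selected
range.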
Regards,
Lech
--
Dipl.-Wirt.-Inf. Lech Nieroda
Regionales Rechenzentrum der Universität zu Köln (RRZK)
--
Jacek Budzowski
System administrator
ACC Cyfronet AGH