In the past couple of days we've noticed an odd issue when creating new accounts with sacctmgr, which we think is related to the length of the account name.

Having recently launched a new cluster, we've switched to account names of the format <PIsurname>-<servicelevel>-<type>, where servicelevel is SL[1-4] and type is CPU, GPU, or KNL. At a minimum we'd expect to have <PIsurname>-SL3-CPU and <PIsurname>-SL4-CPU, then optionally <PIsurname>-SL[3-4]-GPU and/or <PIsurname>-SL[3-4]-KNL, and possibly paying SL1 or SL2 accounts for any or all of CPU, GPU, and KNL.
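To make the naming scheme concrete, here is a minimal sketch of a helper that builds and validates names in this format. The function name account_name is hypothetical and is not part of Slurm or of our tooling; it only encodes the rules described above.

```shell
#!/bin/sh
# Hypothetical helper: build an account name as <PIsurname>-<servicelevel>-<type>.
account_name() {
    surname=$1
    level=$2
    kind=$3
    # Service level must be SL1..SL4, as described above.
    case $level in
        SL[1-4]) ;;
        *) echo "bad service level: $level" >&2; return 1 ;;
    esac
    # Type must be CPU, GPU or KNL.
    case $kind in
        CPU|GPU|KNL) ;;
        *) echo "bad type: $kind" >&2; return 1 ;;
    esac
    printf '%s-%s-%s\n' "$surname" "$level" "$kind"
}

# The two accounts every PI is expected to have at a minimum.
# Prints TESTABCDE-FGHIJK-SL3-CPU and TESTABCDE-FGHIJK-SL4-CPU.
for level in SL3 SL4; do
    account_name TESTABCDE-FGHIJK "$level" CPU
done
```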

What we've found is that if we create a PI_TESTABCDE-FGHIJK account (I've replaced the actual PI's surname with TESTABCDE-FGHIJK, but it was that long: a double-barrelled surname), followed by TESTABCDE-FGHIJK-SL3-CPU and TESTABCDE-FGHIJK-SL4-CPU accounts, each with PI_TESTABCDE-FGHIJK as its parent, sacctmgr then complains when we try to create a TESTABCDE-FGHIJK-SL3-GPU account. See below for the commands and their output:

[root@slurm-master ~]# sacctmgr -vi add account Name=pi_testabcde-fghijk Description="Simon Flood" Cluster=csd3 parent=uis fairshare=parent
sacctmgr: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
 Adding Account(s)
  pi_testabcde-fghijk
 Settings
  Description     = simon flood
  Organization    = Parent/Account Name
 Associations
  A = pi_testabc C = csd3
 Settings
  Fairshare     = parent
  Parent        = uis
[root@slurm-master ~]# sacctmgr -vi add account Name=TESTABCDE-FGHIJK-SL3-CPU GrpTRESMins=cpu=12000000 DefaultQOS=cpu2 QOS=cpu2,intr Cluster=csd3 parent=pi_testabcde-fghijk fairshare=0
sacctmgr: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
 Adding Account(s)
  testabcde-fghijk-sl3-cpu
 Settings
  Description     = Account Name
  Organization    = Parent/Account Name
 Associations
  A = testabcde- C = csd3
 Settings
  Fairshare     = 0
  GrpTRESMins   = cpu=12000000
  Parent        = pi_testabcde-fghijk
  QOS           = cpu2,intr
  DefQOS        = cpu2
[root@slurm-master ~]# sacctmgr -vi add account Name=TESTABCDE-FGHIJK-SL4-CPU QOS=cpu3 Cluster=csd3 parent=pi_testabcde-fghijk fairshare=0
sacctmgr: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
 Adding Account(s)
  testabcde-fghijk-sl4-cpu
 Settings
  Description     = Account Name
  Organization    = Parent/Account Name
 Associations
  A = testabcde- C = csd3
 Settings
  Fairshare     = 0
  Parent        = pi_testabcde-fghijk
  QOS           = cpu3
[root@slurm-master ~]# sacctmgr -vi add account Name=TESTABCDE-FGHIJK-SL3-GPU GrpTRESMins=gres/gpu=480000 DefaultQOS=gpu2 QOS=gpu2,intr Cluster=csd3 parent=pi_testabcde-fghijk fairshare=0
sacctmgr: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
 Adding Account(s)
  testabcde-fghijk-sl3-gpu
 Settings
  Description     = Account Name
  Organization    = Parent/Account Name
 Associations
  A = testabcde- C = csd3
 Settings
  Fairshare     = 0
  GrpTRESMins   = gres/gpu=480000
  Parent        = pi_testabcde-fghijk
  QOS           = gpu2,intr
  DefQOS        = gpu2
 Problem adding accounts: Unspecified error
[root@slurm-master ~]# sacctmgr -n show account format=Account'%-25',Description'%-30',Organization'%-20' | grep -i testabcde-fghijk
pi_testabcde-fghijk       simon flood                    uis
testabcde-fghijk-sl3-cpu  testabcde-fghijk-sl3-cpu pi_testabcde-fghijk
testabcde-fghijk-sl4-cpu  testabcde-fghijk-sl4-cpu pi_testabcde-fghijk

When we originally saw this on Monday, trying to create the TESTABCDE-FGHIJK-SL3-GPU account gave output suggesting sacctmgr was trying to create an association rather than an account, but that didn't happen when I repeated the steps with the fake "PI surname" for this message.

The other odd thing, which we suspect is related, is that when we try to undo these account additions (as we have instead created the accounts with shorter names), the delete removes the association but not the actual account:

[root@slurm-master ~]# sacctmgr delete account name=testabcde-fghijk-sl3-cpu cluster=csd3
 Deleting account associations...
  C = csd3       A = testabcde-fghijk-sl3-cpu of pi_testabcde-fghijk
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
[root@slurm-master ~]# sacctmgr delete account name=testabcde-fghijk-sl4-cpu cluster=csd3
 Deleting account associations...
  C = csd3       A = testabcde-fghijk-sl4-cpu of pi_testabcde-fghijk
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
[root@slurm-master ~]# sacctmgr -n show account format=Account'%-25',Description'%-30',Organization'%-20' | grep -i testabcde-fghijk
pi_testabcde-fghijk       simon flood                    uis
testabcde-fghijk-sl3-cpu  testabcde-fghijk-sl3-cpu pi_testabcde-fghijk
testabcde-fghijk-sl4-cpu  testabcde-fghijk-sl4-cpu pi_testabcde-fghijk

If we then check the MySQL table, it shows that the accounts still exist but their associations do not. We're then tidying up by deleting the accounts manually in MySQL.
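For anyone wanting to check for the same leftover state without going into MySQL, here is a rough sketch that compares two lists of account names. The function name orphaned_accounts and the file names are hypothetical; the sacctmgr invocations in the comments use only the standard -n (no header) and -P (parsable) options, and I'm assuming their output is one account name per line.

```shell
#!/bin/sh
# orphaned_accounts ACCOUNTS_FILE ASSOC_FILE
# Print every name in ACCOUNTS_FILE that matches no whole line of ASSOC_FILE,
# i.e. accounts that still exist but have no association left.
orphaned_accounts() {
    # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file.
    # grep exits non-zero when nothing is printed; that is not an error here.
    grep -vxF -f "$2" "$1" || true
}

# Possible usage on the Slurm master (assumed, not run for this message):
#   sacctmgr -n -P show account format=Account > accounts.txt
#   sacctmgr -n -P show assoc format=Account | sort -u > assocs.txt
#   orphaned_accounts accounts.txt assocs.txt
```

In the state shown above we would expect this to list the two testabcde-fghijk-sl[3-4]-cpu accounts.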

Our guess is that, when creating the account, sacctmgr is comparing only part of each existing account name and hence thinks there's a clash. I've had a quick look through the sacctmgr source code but, with my limited C knowledge, haven't spotted anything obvious.
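To illustrate what such a partial comparison would look like, here is a small sketch that reports prefixes shared by more than one account name. The width of 10 is taken from the verbose output above, where each name appears as "A = testabcde-"; whether sacctmgr actually compares on that prefix, or merely truncates it for display, is an assumption on my part. The function name shared_prefixes is hypothetical.

```shell
#!/bin/sh
# shared_prefixes WIDTH
# Read account names on stdin (one per line, no spaces) and print each
# WIDTH-character prefix that is shared by more than one name.
shared_prefixes() {
    width=$1
    cut -c "1-$width" | sort | uniq -c | awk '$1 > 1 { print $2 }'
}

# The three account names from the session above all share their first
# 10 characters, so this prints: testabcde-
printf '%s\n' \
    testabcde-fghijk-sl3-cpu \
    testabcde-fghijk-sl4-cpu \
    testabcde-fghijk-sl3-gpu | shared_prefixes 10
```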

Previously we were using a mix of <PIsurname>-<servicelevel> for CPU and <PIsurname>-<servicelevel>-GPU for GPU (we didn't have KNL), so it's possible this issue existed in an earlier version of Slurm (we are using Slurm 14.11.8 on our old cluster) but we weren't hitting it.

Our new Slurm master is running Slurm 17.02.9 on Red Hat Enterprise Linux 7.3.

If anyone wants further information, please ask, though obviously we're coming up to the Christmas holidays so responses might be delayed.

Regards,
Simon
--
Simon Flood
HPC System Administrator
University of Cambridge Information Services
United Kingdom
