We had the same problem on our clusters. It was caused by our MySQL
backup script locking the tables (and taking too long).
Look at ''mod_time'' and ''control_host'' of ''cluster_table'' in
the database:
select mod_time,control_host from cluster_table;
We found that ''mod_time'' matched the backup time exactly and the
''control_host'' column was empty.
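If you want to automate that check, here is a rough sketch in Python. It
only filters rows you have already fetched from ''cluster_table''; the
helper name and the sample rows are made up for illustration. It also
assumes the table has a ''name'' column and that ''mod_time'' is stored
as Unix epoch seconds, so please verify both against your own schema.

```python
from datetime import datetime, timezone


def find_unregistered_clusters(rows):
    """Return (name, mod_time as UTC datetime) for rows with an empty control_host.

    rows: iterable of (name, mod_time, control_host) tuples, e.g. the result of
    "select name,mod_time,control_host from cluster_table;"
    Assumption: mod_time is a Unix epoch timestamp in seconds.
    """
    stale = []
    for name, mod_time, control_host in rows:
        # An empty or NULL control_host is the symptom described above.
        if not control_host:
            when = datetime.fromtimestamp(mod_time, tz=timezone.utc)
            stale.append((name, when))
    return stale


# Hypothetical sample rows, not real data from any cluster.
rows = [
    ("devops", 1541642400, ""),
    ("prod", 1541600000, "10.0.0.5"),
]

for name, when in find_unregistered_clusters(rows):
    print(f"{name}: control_host is empty, last modified {when:%Y-%m-%d %H:%M:%S} UTC")
```

Comparing the printed time with the time your backup job runs should
tell you whether you are hitting the same thing we did.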
Hope this will help you go forward with your problem.
best regards,
Magnus
On 2018-11-08 19:44, Brian Andrus wrote:
All,
I am seeing what looks like the same issue as
https://bugs.schedmd.com/show_bug.cgi?id=2119
where slurmctld is not picking up new accounts unless it is restarted.
I have 4 clusters (non-federated), all using the same slurmdbd.
When I added an association for user name=me cluster=DevOps
account=Project1 and then tried to start a job, I kept getting an error:
*srun: error: Unable to allocate resources: Invalid account or
account/partition combination specified*
Then I restarted slurmctld on DevOps master and my job ran fine.
Is there some slurmdbd caching going on by slurmctld?
This is an issue in a production environment. We don't want to have to
restart all the slurmctld daemons anytime there is a change to any
associations. That could get painful.
Brian Andrus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet