We had the same problem on our clusters. It was caused by our MySQL backup
script locking the tables (and taking too long).

We looked at ''mod_time'' and ''control_host'' of ''cluster_table'' in the database:

select mod_time,control_host from cluster_table;

We found that ''mod_time'' matched the backup time exactly and that the ''control_host'' column was empty.
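
If you want to script that check, something like the following might be a starting point. It is only a sketch: it uses an in-memory SQLite table as a stand-in for the real MySQL accounting database, the ''name'' column is an assumption on my part, and the sample rows and helper names are made up for illustration.

```python
import sqlite3


def clusters_without_controller(conn):
    """Return (name, mod_time) for clusters whose control_host is empty."""
    cur = conn.execute(
        "SELECT name, mod_time FROM cluster_table "
        "WHERE control_host IS NULL OR control_host = '' "
        "ORDER BY name"
    )
    return cur.fetchall()


def demo_connection():
    """Build a throwaway SQLite table with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cluster_table "
        "(name TEXT, mod_time INTEGER, control_host TEXT)"
    )
    conn.executemany(
        "INSERT INTO cluster_table VALUES (?, ?, ?)",
        [
            ("cluster_a", 1541635200, ""),          # no controller registered
            ("cluster_b", 1541700000, "10.0.0.5"),  # controller registered
        ],
    )
    return conn


if __name__ == "__main__":
    for name, mod_time in clusters_without_controller(demo_connection()):
        print(name, mod_time)
```

Against the real database you would swap the SQLite connection for a MySQL one and compare the reported ''mod_time'' with the time your backup runs.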

Hope this helps you move forward with your problem.

best regards,
Magnus

On 2018-11-08 19:44, Brian Andrus wrote:
All,
I am seeing what looks like the same issue as https://bugs.schedmd.com/show_bug.cgi?id=2119, where slurmctld does not pick up new accounts unless it is restarted.

I have 4 clusters (non-federated), all using the same slurmdbd.
When I added an association for user name=me cluster=DevOps account=Project1 and then tried to start a job, I kept getting an error: *srun: error: Unable to allocate resources: Invalid account or account/partition combination specified*

Then I restarted slurmctld on the DevOps master and my job ran fine.

Is there some slurmdbd caching going on by slurmctld?

This is an issue in a production environment. We don't want to have to restart all the slurmctld daemons any time there is a change to any associations. That could get painful.

Brian Andrus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
