Re: [slurm-users] Power save doesn't start nodes

2018-07-18 Thread Michael Gutteridge
John: thanks for the link. Curiously, sinfo doesn't show the asterisk, but has it documented. scontrol shows the asterisk and doesn't document it... at least for the state my cluster is in. Antony: Thanks for the steps. I tried them out, but there was no change. It seems like it should do the tri

Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2018-07-18 Thread Ole Holm Nielsen
On 07/18/2018 10:56 AM, Roshan Thomas Mathew wrote: We ran into this issue trying to move from 16.05.3 -> 17.11.7 with 1.5M records in job table. In our first attempt, MySQL reported "ERROR 1206 The total number of locks exceeds the lock table size" after about 7 hours. Increased InnoDB Buff

Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2018-07-18 Thread Roshan Thomas Mathew
We ran into this issue trying to move from 16.05.3 -> 17.11.7 with 1.5M records in job table. In our first attempt, MySQL reported "ERROR 1206 The total number of locks exceeds the lock table size" after about 7 hours. Increased InnoDB Buffer Pool size - https://dba.stackexchange.com/questions/27

Re: [slurm-users] Power save doesn't start nodes

2018-07-18 Thread John Hearns
If it is any help, https://slurm.schedmd.com/sinfo.html NODE STATE CODES Node state codes are shortened as required for the field size. These node states may be followed by a special character to identify state flags associated with the node. The following node suffixes and states are used: ***

Re: [slurm-users] Power save doesn't start nodes

2018-07-18 Thread Antony Cleave
I've not seen the IDLE* issue before but when my nodes got stuck I've always been able to fix them with this: [root@cloud01 ~]# scontrol update nodename=cloud01 state=down reason=stuck [root@cloud01 ~]# scontrol update nodename=cloud01 state=idle [root@cloud01 ~]# scontrol update nodename=cloud01