On Nov 5, 2008, at 2:26 PM, Bernd Schubert wrote:
Hello Andrew,
sorry for my late response.
On Sunday 02 November 2008 20:32:14 Andrew Beekhof wrote:
On Oct 30, 2008, at 6:08 PM, Bernd Schubert wrote:
Heartbeat calls crmd only if all nodes are already online.
Not everyone uses it on heartbeat anymore ;-)
I grepped the sources of openais and corosync for "KEY_INITDEAD", but
can't find anything.
Correct. For those two the sanity logic doesn't achieve anything.
Are there any further solutions pacemaker supports?
Just those three.
So introducing another possibly huge deadtime here will at least delay
the DC selection, and so resource startup, by heartbeat's initial
deadtime. If one node, e.g. after a global power failure, doesn't come
up at all, the DC selection was even delayed by 2 x initial hb
deadtime. Simply remove the usage of heartbeat's initial deadtime and
only use our own.
I don't understand.
The logic below is only triggered for people who haven't set a value
for dc_deadtime... why not just set a value in the cib?
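
A minimal sketch of the fallback as it is described in this thread, not
the actual crmd code: an explicitly configured dc_deadtime is meant to
win, heartbeat's initial deadtime is only inherited when nothing is
configured, and openais/corosync have nothing to inherit. The function
name, the parameter names and the built-in default are hypothetical.
Note that the behaviour reported later in the thread does not match
this intended logic.

```python
from typing import Optional


def effective_dc_deadtime(
    configured_s: Optional[float],
    hb_initdead_s: Optional[float],
    builtin_default_s: float,
) -> float:
    """Return the number of seconds to wait before electing a DC.

    All names are illustrative assumptions, not Pacemaker identifiers.
    """
    if configured_s is not None:
        # An explicit dc_deadtime in the CIB is meant to take precedence.
        return configured_s
    if hb_initdead_s is not None:
        # Heartbeat only: inherit the stack's initial deadtime.
        return hb_initdead_s
    # openais/corosync: no initial deadtime is available to inherit.
    return builtin_default_s
```

With deadtime=20min and dc_deadtime=10s, this intended logic would give
10s, which is the result the test later in the thread expected.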
Well firstly, the logs didn't tell me:
"Look here, you didn't set dc_deadtime, so crm is going to use a huge
useless timeout".
Yeah, but eventually you looked at the code and proposed the patch :-)
But instead, on each startup of heartbeat I get hundreds of lines in
syslog, and none of them look as if they are meant for the common
admin; IMHO 99% of it is developer information.
yeah :-(
the logging is a _lot_ better in 1.0, but could still be improved.
at the cluster summit in prague we also agreed on a "black box"
recorder that should help too.
this way we can log tracing details there and only dump it into the
logs (or recover it from core files) when needed.
but this will live in corosync, so it won't help people running on
heartbeat.
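
A rough sketch of the "black box" idea mentioned above, assuming a
fixed-size in-memory ring buffer: trace details are recorded cheaply
and only written out when something goes wrong. The class and method
names are hypothetical and are not the corosync API.

```python
from collections import deque
from typing import Deque, List


class BlackBox:
    """Keep only the most recent trace messages in memory (illustrative)."""

    def __init__(self, capacity: int = 1000) -> None:
        # A deque with maxlen silently discards the oldest entry when full.
        self._buf: Deque[str] = deque(maxlen=capacity)

    def trace(self, message: str) -> None:
        # Record a detail without writing it to syslog.
        self._buf.append(message)

    def dump(self) -> List[str]:
        # Hand back the buffered history, oldest first, and start afresh.
        entries = list(self._buf)
        self._buf.clear()
        return entries
```

A caller would invoke trace() freely during normal operation and call
dump() only on an error path, so routine tracing stays out of the logs.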
Then after I found the code in pacemaker, I already tested setting
dc_deadtime, but during my initial test that didn't change anything.
While we need a heartbeat deadtime > 10min for Lustre installations, I
set it to 180s on my test systems.
Now after your suggestion I tested it again, with deadtime=20min but
dc_deadtime=10s and, quite oddly, crm still needs about 3min to set the
nodes online (syslog attached). With the code removed it is only 10s.
Hmmm - that's odd - I'll take a look.
Since openais doesn't seem to support the code below at all, and since
it is wrong when used together with heartbeat, I still think removing
these lines is right. Please correct me if I'm wrong.
I'd prefer to fix the logic (if it's broken) since it's likely that
we'd add an equivalent default mechanism for CoroSync eventually.
PS: Sorry, the attached syslog is still with heartbeat-2.1.4. If you
think you fixed it in pacemaker already, please point me to the commit.
No, this area doesn't get updated much (because it mostly works)
_______________________________________________
Pacemaker mailing list
http://list.clusterlabs.org/mailman/listinfo/pacemaker