On Nov 5, 2008, at 3:11 PM, Bernd Schubert wrote:


at the cluster summit in Prague we also agreed on a "black box"
recorder that should help too.
this way we can log tracing details there and only dump it into the
logs (or recover it from core files) when needed.

but this will live in corosync, so it won't help people running on
heartbeat.
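
The "black box" idea described above (keep tracing details in memory, dump them only when needed) can be sketched roughly as follows. This is only an illustration of the general technique, not corosync's actual recorder, which was not implemented at the time of this thread; the class name `BlackBox` and its methods are hypothetical.

```python
from collections import deque


class BlackBox:
    """Hypothetical in-memory trace recorder (not corosync's API)."""

    def __init__(self, capacity=1000):
        # A deque with maxlen discards the oldest entry once it is full
        self._entries = deque(maxlen=capacity)

    def trace(self, message):
        # Record in memory only; nothing is written to the logs here
        self._entries.append(message)

    def dump(self):
        # Called only when needed; returns the retained entries, oldest first
        return list(self._entries)


box = BlackBox(capacity=3)
for i in range(5):
    box.trace(f"event {i}")
print(box.dump())  # only the three most recent events are retained
```

The fixed capacity is what keeps tracing cheap enough to leave on all the time: old details are overwritten unless something goes wrong and a dump is requested.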

Well, if openais + corosync are better, we can try to switch to it.

note the future tense there though... it's not implemented yet.




Then after I found the code in pacemaker, I already tested setting
dc_deadtime, but during my initial test that didn't change anything.
While we need a heartbeat deadtime > 10min for Lustre installations,
I set it to 180s on my test systems.
Now after your suggestion I tested it again, with deadtime=20min but
dc_deadtime=10s, and quite oddly, crm still needs about 3min to set the
nodes online (syslog attached). With the code removed it is only 10s.

Hmmm - that's odd - i'll take a look.

Thanks, I will also try to find some time to look at it again.


Since openais doesn't seem to support the code below at all and since
it is wrong when used together with heartbeat, I still think removing
these lines is right. Please correct me if I'm wrong.

I'd prefer to fix the logic (if it's broken) since it's likely that
we'd add an equivalent default mechanism for CoroSync eventually.

I just don't understand why we need that mechanism at all. I mean if
heartbeat/corosync/openais detect everything

Especially with autojoin, it doesn't know that "everything" is online.
There could be some extra nodes about to start/join the cluster.

Remember, this is only supposed to supply a default value.
Advanced users are free to set it as low as they like.

Of course they need to know they can - that's a documentation issue
which can be easily rectified.

is online, why does pacemaker need its own start timeout again?

because it needs to give any existing DC a chance to contact it rather than needlessly causing another DC election.

Shouldn't it try to online everything as
soon as it is started? Well, ok it needs a timeout to detect if other nodes
already have a DC.

exactly.  so any value should only be used by the first node to come up.
is that what you're seeing?

But then the DC detection timeout is not related at all to
node deadtime detection, is it?

at the time it was felt that they were related enough that it made the basis of a good default.
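
The startup wait discussed in this thread can be sketched as a small decision function. This is a simplified model under stated assumptions, not pacemaker's actual code: the function name `startup_decision` and the parameters `dc_announce_times` and `dc_timeout_s` are hypothetical, and the real election logic is more involved.

```python
def startup_decision(dc_announce_times, dc_timeout_s):
    """Decide what a freshly started node does (simplified model).

    dc_announce_times: seconds after startup at which a message from an
        existing DC arrives (empty list if no DC exists yet).
    dc_timeout_s: how long the node waits for an existing DC (seconds).
    """
    # Did any existing DC make contact before the timeout expired?
    heard_in_time = any(t <= dc_timeout_s for t in dc_announce_times)
    return "join-existing-dc" if heard_in_time else "start-election"


# First node to come up: no DC exists, so the full timeout is spent waiting
print(startup_decision([], 10))
# A DC makes contact after 4s: no election is needed
print(startup_decision([4], 10))
# Timeout too short: the DC would have answered at 30s, so the election
# that gets started is a needless one
print(startup_decision([30], 10))
```

In this model only the first node to come up ever waits out the whole timeout, which matches the point made above that the value should only matter for that node.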


_______________________________________________
Pacemaker mailing list
[email protected]
http://list.clusterlabs.org/mailman/listinfo/pacemaker
