On Wednesday 05 November 2008 14:46:14 Andrew Beekhof wrote:
> On Nov 5, 2008, at 2:26 PM, Bernd Schubert wrote:
> > Hello Andrew,
> >
> > sorry for my late response.
> >
> > "Look here, you didn't set dc_deadtime, so crm is going to use a
> > huge useless timeout".
>
> Yeah, but eventually you looked at the code and proposed the patch :-)

Well yes, but don't ask how much time it took to look at the right code.
I was first checking heartbeat and didn't understand what the problem
was, since additional logs showed that everything was actually online.
Until I found out crmd is entirely independent...

> > But instead on each startup of heartbeat I get hundreds of lines
> > into syslog and all of these don't look as if they are for the
> > common admin, but IMHO 99% of it is developer information.
>
> yeah :-(
> the logging is a _lot_ better in 1.0, but could still be improved.

Sounds good. I really need to test it.

> at the cluster summit in prague we also agreed on a "black box"
> recorder that should help too.
> this way we can log tracing details there and only dump it into the
> logs (or recover it from core files) when needed.
>
> but this will live in corosync, so it won't help people running on
> heartbeat.

Well, if openais + corosync are better, we can try to switch to them.

> > Then after I found the code in pacemaker, I already tested setting
> > dc_deadtime, but during my initial test that didn't change
> > anything. While we need a heartbeat deadtime > 10min for Lustre
> > installations, I set it on my test systems to 180s.
> > Now after your suggestion I tested it again, with deadtime=20min,
> > but dc_deadtime=10s and, quite odd, crm still needs about 3min to
> > set the nodes online (syslog attached). With the code removed it is
> > only 10s.
>
> Hmmm - that's odd - i'll take a look.

Thanks, I will also try to find some time to look at it again.

> > Since openais doesn't seem to support the code below at all and
> > since it is wrong when used together with heartbeat, I still think
> > removing these lines is right. Please correct me if I'm wrong.
>
> I'd prefer to fix the logic (if it's broken) since it's likely that
> we'd add an equivalent default mechanism for CoroSync eventually.

I just don't understand why we need that mechanism at all.
I mean if heartbeat/corosync/openais detect everything is online, why
does pacemaker need its own start timeout again? Shouldn't it try to
online everything as soon as it is started? Well, ok it needs a timeout
to detect if other nodes already have a DC. But then the DC detection
timeout is not related at all to node deadtime detection, is it?

> > PS: Sorry, the attached syslog is still with heartbeat-2.1.4. If
> > you think you fixed it in pacemaker already, please point me to the
> > commit.
>
> No, this area doesn't get updated much (because it mostly works)

Ok, thanks. So I can concentrate on finding the real issue.

Thanks a lot for your help!

--
Bernd Schubert
Q-Leap Networks GmbH

_______________________________________________
Pacemaker mailing list
[email protected]
http://list.clusterlabs.org/mailman/listinfo/pacemaker
