I agree - we are removing half of the masters in a couple of weeks to help things. Slaves only talk to masters; there are no "slaves of slaves," as we refer to them. Our architecture goal has been uniformity among the configurations, and this is part of the price we pay for that.
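For context, every slave carries the same kind of zone stanza, pointing only at the masters. This is just a sketch - the zone name and addresses are placeholders, not our real ones:

    // Sketch of one of our uniform slave zone stanzas (BIND 9 named.conf).
    // Every slave lists the same set of masters; no slave ever
    // transfers from another slave.
    zone "example.com" {
        type slave;
        file "slaves/example.com.db";
        masters { 192.0.2.1; 192.0.2.2; };
    };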
At a functional level, this will persist, just not with as much traffic. Because of our use of CVS, and our inability to control notifies, whenever we push out big updates the first master takes all the traffic while the others sit there unused. We've experimented with breaking up the notifies by pushing updates in chunks to the various masters, but that really breaks things, both process-wise and logically.

What we really need is some way to say "there is an update, and any one of these servers has an acceptable version of it," instead of "hey, there's an update," "hey, there's an update," "hey, there's an update," with each slave then going to each of the masters.

While these loads seem trivial to handle operationally, and we're not concerned about network bandwidth, a single master can only manage so many transactions at once. I doubt we're even in the top 50% of deployments zone-count-wise, so I'm confident that our number of zones isn't the issue. But I suspect that 80+ slaves is a little out there: with 80 slaves and 1,800 zones, each master sends out 144,000 notifies (for major changes or a master reload), which very quickly triggers 144,000 SOA queries back at the master. That is bound to cause delays.

One option we've considered is pointing our MASTERS and NS records at an anycast IP/load balancer, so that multiple masters can answer for the same notify. Another option would be to stop all notifies altogether and then trigger them manually (generating notifies via a Perl script or something similarly clever) so we can control where the notifies come from; there's a rough sketch of that idea at the bottom of this mail. When all the EU DNS servers get their notifies first from an NA master, they grab the data from there, so being able to control notifies would be nice sometimes.

Thankfully we're mid-rearchitecture, and this will (hopefully) be torn out soon; until it is, we need to make sure that our users can manage their changes in a reasonable manner. A for loop doing "rndc retransfer" for the changed zones, which seems to bypass all the congestion (also sketched below), is a short-term fix until we can figure out how to make things a little smoother.

Apologies for the wall of text - this is a frequent discussion with very little in the way of conclusion around here :)

Todd.

On Wed, Jan 20, 2010 at 10:33 PM, Joseph S D Yao <[email protected]> wrote:
> On Wed, Jan 20, 2010 at 03:52:33PM -0500, Todd wrote:
>> > serial-query-rate
>>
>> While this appears to be helping in the lab, it's still taking between
>> 2 and 3 minutes for each slave to even finish receiving the NOTIFYs
>> from the master. They then start hitting the master(s) with SOA
>> queries, which seems to take a really long time.
>
>
> Your NOTIFY tree sounds like it's many-to-many. Maybe you should be
> using a sparser tree.
>
>
> --
> /*********************************************************************\
> **
> ** Joe Yao [email protected] - Joseph S. D. Yao
> **
> \*********************************************************************/
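P.S. To make the "controlled notifies" idea above concrete, here is the sort of thing I have in mind. It's only a sketch: the zone-list path is made up, and I haven't verified how "rndc notify" interacts with "notify no" on our BIND version, so treat that as an assumption to be tested.

    // named.conf on the masters: suppress automatic NOTIFYs entirely.
    options {
        notify no;
    };

    # Then a wrapper script, run on whichever master we want the
    # notifies to originate from, re-sends NOTIFYs for just the
    # changed zones:
    for zone in $(cat /var/tmp/changed-zones.txt); do
        rndc notify "$zone"
    done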
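And the short-term fix mentioned above: a for loop of "rndc retransfer" over the changed zones, which sidesteps the NOTIFY/SOA-query congestion by telling each slave to just go fetch the zone. Again a sketch - the zone-list file is hypothetical, and it has to be run on (or pointed at) each slave:

    # Run on each slave (or aimed at one remotely with rndc -s):
    # force a retransfer of every changed zone, bypassing NOTIFY.
    for zone in $(cat /var/tmp/changed-zones.txt); do
        rndc retransfer "$zone"
    done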

