On 8/17/2011 2:43 PM, Alan McKinnon wrote:
> I'm just itching to type up the long list of horror stories I've
> stored from people doing their own DNS thinking it was real easy.
> But there's this little thing called an NDA and it says I can't :-(
heh, I think I can dredge one up for you that no one will care about
these days.
This was at a large ISP in '99 known for their free Internet. Bind 8
was fresh on the scene and somehow Network Engineering was in charge of
DNS rather than Systems. My intern and I came up with a plan to have
ns00.int as the internal master and make the rest of the name servers slave
off of it. All ns00 did was supply the production name servers with zones.
ns00 --> ns01(vip) --> ns01-[01-03]
    \--> ns02(vip) --> ns02-[01-03]
     \-> ns03(vip) --> ns03-[01-03]
Three virtual IPs and three name servers behind each vip.
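In named.conf terms it was the usual internal-master/slave arrangement.
A rough sketch of the two stanzas (the IPs and zone name here are made
up for illustration, not the real config):

  // on ns00.int, the internal master
  zone "example.net" {
          type master;
          file "db.example.net";
          allow-transfer { 10.0.1.1; 10.0.2.1; 10.0.3.1; };  // the vips
          also-notify    { 10.0.1.1; 10.0.2.1; 10.0.3.1; };
  };

  // on each production slave
  zone "example.net" {
          type slave;
          masters { 10.0.0.10; };     // ns00.int
          file "sec/db.example.net";
  };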
This way our tools only had to deal with updating zones on ns00 on the
internal network instead of pushing to a number of name servers. This
worked well for a few months and we generally forgot about it. Almost a
month after a reorganization in the local datacenter, DNS went down. Well,
not down down, but our zones weren't working. After a hectic hour of
freaking out, troubleshooting random things, and bouncing from machine
to machine by IP address because none of the DNS worked, we realized our
mistake. The TTL of the zone itself was set to three weeks. In the move,
Bind had silently died on ns00, which we didn't monitor because it was
inside the corp network. The slaves dutifully stayed up and working till
they hit the TTL of the zones and demanded to speak to the master again.
Restarting Bind on the prod servers did nothing other than remove the
already expired cache.
Once we restarted Bind on ns00 (and made it part of the runlevel), the
prod servers checked in and all was well.
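For reference, the three-week timer the slaves were honoring lives in
the zone's SOA record; a rough sketch of what such a zone header looks
like, with made-up values rather than our real ones:

  $TTL 1814400    ; 3 weeks
  example.net.  IN  SOA  ns00.int. hostmaster.example.net. (
                  1999083101  ; serial
                  10800       ; refresh - how often slaves poll the master
                  3600        ; retry
                  1814400     ; expire - slaves drop the zone if they can't
                              ;          reach the master for this long
                  86400 )     ; minimum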
The lessons:
Monitor *all* of your DNS infrastructure.
DNS can break even with a large distributed system and it is never
pretty.
kashani