Re: [Pdns-users] troubleshooting dnsdist -> recursor instability

Christoph via Pdns-users Mon, 24 Oct 2022 04:56:11 -0700

Hi,

thanks for the pointers.


A coincidence may have brought us a bit closer to the root cause:
due to a Linux kernel update the server had to be rebooted, and
after the reboot the problem disappeared.

Stacked graphs of
irate(pdns_recursor_sys_msec...
irate(pdns_recursor_user_msec...

show that the recursor CPU usage increased steadily over the 4 weeks of uptime and dropped to 1/5 after the reboot. At that level dnsdist's health checks currently no longer fail.
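
For readers who want to reproduce the numbers outside of a dashboard, here is a minimal sketch of what such an irate() graph computes. It assumes the two counters hold cumulative CPU milliseconds and that samples are (timestamp in seconds, counter value) pairs. The sample values and the helper names irate and cpu_cores are made up for illustration, and the reset handling only approximates what Prometheus does.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp_seconds, counter_value) samples."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must be strictly increasing")
    delta = v1 - v0
    if delta < 0:
        # Counter went backwards, e.g. after a restart: count from zero.
        delta = v1
    return delta / (t1 - t0)


def cpu_cores(sys_samples, user_samples):
    """CPU cores kept busy: (sys + user) milliseconds per second, divided by 1000."""
    return (irate(sys_samples) + irate(user_samples)) / 1000.0


# Made-up samples taken 15 seconds apart (not real measurements).
sys_msec = [(0, 10_000), (15, 11_500)]
user_msec = [(0, 50_000), (15, 54_500)]
print(cpu_cores(sys_msec, user_msec))  # 0.4, i.e. 40% of one core
```

A drop "to 1/5" after the reboot would then show up as this value falling to a fifth of its pre-reboot level.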

My first idea was: maybe the growing number of cache entries (pdns_recursor_cache_entries) takes up more CPU over time to search through, but it takes only 3 hours (not 4 weeks) to fill up the cache, and the count remains a flat line after that.

Also, the number of NSEC cache entries (pdns_recursor_aggressive_nsec_cache_entries) remains an almost flat line over this 4-week period while the recursor's CPU usage grows.

Memory usage (pdns_recursor_real_memory_usage) grows a lot slower than
cache entries and correlates with CPU usage to some degree.

I also checked these metrics for correlations with the growing CPU usage but didn't find any:

negcache_entries
nsspeeds_entries
packetcache_entries
over_capacity_drops
query_pipe_full_drops
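
The comparison above was done by eye on graphs. As a rough sketch of how it could be quantified, assuming each metric has been exported as a list of values sampled at the same points in time, a Pearson coefficient per metric would look like this. The series below are invented placeholders, not data from the affected server.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A perfectly flat series has no measurable linear relationship.
        return 0.0
    return cov / sqrt(var_x * var_y)


# Invented weekly samples: CPU grows, one metric is flat, one grows slowly.
cpu = [10, 20, 30, 40]
flat_metric = [1000, 1000, 1000, 1000]
slow_growing_metric = [100, 110, 125, 130]

for name, series in [("flat", flat_metric), ("slow", slow_growing_metric)]:
    print(name, round(pearson(cpu, series), 3))
```

A value near 0 matches "flat line while CPU grows"; a value near 1 only says the two series move together, not that one causes the other.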

QPS (pdns_recursor_questions...)
slightly decreased during these 4 weeks.

By comparing two setups that are largely identical we might have a hint:
the one that has the growing CPU usage issue has these 2 lines that the
other one does NOT have (that's the only difference):


loglevel=3
max-busy-dot-probes=5
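
To make such a comparison repeatable, a small sketch like the following could list the settings that differ between two recursor configs. It assumes plain key=value lines with # comments and handles nothing else. The function names and the threads=4 line are illustrative only; the two differing settings are the ones quoted above.

```python
def parse_conf(text):
    """Parse simple key=value lines, ignoring blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def diff_conf(a, b):
    """Return {key: (value_in_a, value_in_b)} for every setting that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Illustrative config contents; only the last two lines come from this thread.
affected = "threads=4\nloglevel=3\nmax-busy-dot-probes=5\n"
unaffected = "threads=4\n"

for key, (left, right) in diff_conf(parse_conf(affected), parse_conf(unaffected)).items():
    print(f"{key}: affected={left} unaffected={right}")
```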

The rate of pdns_recursor_dot_outqueries is not growing over time
on the one that has DoT probing enabled (and very low <5qps anyways).

If you also enabled DoT probing and are observing CPU usage growth over time, that would be interesting but unexpected to me.

https://blog.powerdns.com/2022/06/13/probing-dot-support-of-authoritative-servers-just-try-it/

Thomas Mieslinger via Pdns-users wrote:

> I'd use dnscap (tcpdump with a decent filter) on dnsdist and recursor
> machines. See if check query goes out from dnsdist, comes in to
> recursor, see if reply goes out from recursors, comes in to dnsdist.

thanks, will use dnscap with -x <custom health check QNAME> when it happens again

> Review nftables config on all machines. Maybe someone of your team
> installed hashlimit magic to avoid overload.

no iptables/nftables involved on the loopback interface used to connect dnsdist with the recursor

> Look for a metric which tells you whether you hit the "max in flight"
> limit. If you have long running queries (taking 1000ms in the recursor)
> the inflight limit can be reached quickly.


'answers-slow' is at about 5 qps.
I didn't find the "max in flight" metric yet.
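
As a back-of-the-envelope check of the "long running queries fill the in-flight limit" idea, Little's law (average in-flight = arrival rate x time in system) gives a rough estimate. The limit value below is a hypothetical placeholder, not a known dnsdist or recursor default, and the 1000 ms latency is the figure from the suggestion above, not a measurement.

```python
def expected_in_flight(qps, latency_seconds):
    """Little's law: average number of queries in flight at steady state."""
    return qps * latency_seconds


def reaches_limit(qps, latency_seconds, limit):
    """True if the steady-state estimate meets or exceeds the given limit."""
    return expected_in_flight(qps, latency_seconds) >= limit


slow_qps = 5             # 'answers-slow' rate mentioned above
slow_latency = 1.0       # seconds, assumed from the "1000ms" example
hypothetical_limit = 100 # placeholder, NOT a documented default

print(expected_in_flight(slow_qps, slow_latency))                 # 5.0
print(reaches_limit(slow_qps, slow_latency, hypothetical_limit))  # False
```

Under these assumptions the slow answers alone would keep only about 5 queries in flight, so they would reach a limit only if that limit were very small or the real latencies much longer.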

The documentation on newServer()
https://dnsdist.org/reference/config.html?highlight=newserver#newServer
does not mention the default value for:

checkInterval=NUM -- The time in seconds between health checks


Is it one per second?

thanks!
Christoph

_______________________________________________
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/pdns-users
