Hi,

thanks for the pointers.

A coincidence may have brought us a bit closer to the root cause:
due to a Linux kernel update the server had to be rebooted,
and after the reboot the problem disappeared.

Stacked graphs of
irate(pdns_recursor_sys_msec...
irate(pdns_recursor_user_msec...
show that the recursor's CPU usage increased steadily over the 4 weeks of uptime and dropped to 1/5 after the reboot. At that level dnsdist's health checks no longer fail.
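
For anyone who wants to reproduce the numbers outside of the graphs: a minimal Python sketch of what irate() does with the two CPU-time counters, assuming they count milliseconds of CPU time. The sample values are made up, and the function names are mine, not part of any tool.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp_s, counter_value) samples."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must be strictly increasing")
    if v1 < v0:
        # Counter went backwards (e.g. the process restarted):
        # treat the new value as the whole increase.
        v0 = 0
    return (v1 - v0) / (t1 - t0)


def cpu_cores_used(sys_samples, user_samples):
    """Sum of sys and user CPU time, converted from msec per second to cores."""
    return (irate(sys_samples) + irate(user_samples)) / 1000.0


# Made-up samples, 15 seconds apart: (unix time in s, counter in msec)
sys_samples = [(0, 1000), (15, 1600)]    # 600 msec in 15 s  -> 40 msec/s
user_samples = [(0, 5000), (15, 6500)]   # 1500 msec in 15 s -> 100 msec/s
print(cpu_cores_used(sys_samples, user_samples))
```

A result of 1.0 would mean one full core busy on average, so a drop to 1/5 after the reboot is a drop of this value to 1/5.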

My first idea was: maybe the growing number of cache entries (pdns_recursor_cache_entries) takes more CPU to search through over time, but it takes only 3 hours (not 4 weeks) to fill up the cache, and the count remains a flat line after that.
The NSEC cache entries (pdns_recursor_aggressive_nsec_cache_entries) also
remain an almost flat line over this 4-week period while the recursor's CPU usage grows.
Memory usage (pdns_recursor_real_memory_usage) grows a lot slower than
cache entries and correlates with CPU usage to some degree.

I also checked these metrics for correlations with the growing CPU usage but didn't find any:
negcache_entries
nsspeeds_entries
packetcache_entries
over_capacity_drops
query_pipe_full_drops
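
I only compared the graphs by eye. A more systematic check could look like the Python sketch below, which computes a Pearson correlation between the CPU usage series and each metric series. It assumes the series have already been exported as equally long lists sampled at the same times; the numbers are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A perfectly flat series has no defined correlation; report none.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Made-up weekly samples: CPU grows, one metric is flat, one grows too.
cpu = [0.10, 0.20, 0.35, 0.50]
series = {
    "flat_metric": [1000, 1000, 1000, 1000],
    "growing_metric": [10, 20, 35, 50],
}
for name, values in series.items():
    print(name, round(pearson(cpu, values), 3))
```

Values near +1 or -1 would point at a metric worth a closer look; values near 0 match what I saw in the graphs.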

QPS (pdns_recursor_questions...)
slightly decreased during these 4 weeks.

Comparing two largely identical setups might give a hint:
the one with the growing CPU usage issue
has these 2 lines that the other one does NOT have (that's the only difference):

loglevel=3
max-busy-dot-probes=5
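
In case someone wants to run the same comparison on their own pair of servers, here is a small Python sketch. It assumes the plain key=value style of recursor.conf and ignores comments; it does not handle "+=" lines or included files, and the local-port line is only made-up sample data.

```python
def parse_settings(text):
    """Parse key=value lines into a dict, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def only_in_first(first_text, second_text):
    """Settings of the first config that are missing or different in the second."""
    first = parse_settings(first_text)
    second = parse_settings(second_text)
    return {k: v for k, v in first.items() if second.get(k) != v}


affected = """
local-port=5300
loglevel=3
max-busy-dot-probes=5
"""
unaffected = """
local-port=5300
"""
print(only_in_first(affected, unaffected))
```

With real files, read both with open(...).read() and pass the contents in.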

The rate of pdns_recursor_dot_outqueries is not growing over time
on the one that has DoT probing enabled (and it is very low anyway, <5 qps).

If you also enabled DoT probing and are observing CPU usage growth over time, that would be interesting but unexpected to me.
https://blog.powerdns.com/2022/06/13/probing-dot-support-of-authoritative-servers-just-try-it/

Thomas Mieslinger via Pdns-users wrote:
> I'd use dnscap (tcpdump with a decent filter) on dnsdist and recursor
> machines. See if check query goes out from dnsdist, comes in to
> recursor, see if reply goes out from recursors, comes in to dnsdist.

Thanks, I will use dnscap with -x <custom health check QNAME> when it happens again.

> Review nftables config on all machines. Maybe someone of your team
> installed hashlimit magic to avoid overload.

No iptables/nftables rules are involved on the loopback interface that connects dnsdist with the recursor.

> Look for a metric which tells you whether you hit the "max in flight"
> limit. If you have long running queries (taking 1000ms in the recursor)
> the inflight limit can be reached quickly.

'answers-slow' is at about 5 qps.
I haven't found a "max in flight" metric yet.
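
Without such a metric, a rough estimate is possible with Little's law (mean queries in flight = arrival rate x mean time in the system). The Python sketch below is only back-of-the-envelope arithmetic: the 1 second latency is a lower bound for slow answers, and the limit of 100 is a hypothetical value, not one taken from my config or from the documentation.

```python
def mean_in_flight(arrival_qps, mean_latency_s):
    """Little's law: average number of queries outstanding at any moment."""
    return arrival_qps * mean_latency_s


def headroom(limit, arrival_qps, mean_latency_s):
    """How far the average stays below a given in-flight limit."""
    return limit - mean_in_flight(arrival_qps, mean_latency_s)


slow_qps = 5.0          # rate of slow answers observed above
slow_latency_s = 1.0    # assumed lower bound per slow answer
hypothetical_limit = 100

print(mean_in_flight(slow_qps, slow_latency_s))
print(headroom(hypothetical_limit, slow_qps, slow_latency_s))
```

This only covers the slow answers and only the average; bursts and much longer timeouts would push the real number higher.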

The documentation on newServer() https://dnsdist.org/reference/config.html?highlight=newserver#newServer
does not mention the default value for:
checkInterval=NUM -- The time in seconds between health checks

Is it one per second?

thanks!
Christoph

_______________________________________________
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/pdns-users