I think you should first find out whether the problem is in PowerDNS,
in the network, or somewhere in between.
If this happens regularly, just use tcpdump to capture all DNS traffic
to a file (rotate the files, keep only X of them, and choose X so that
you do not fill up your whole hard disk).
Or even simpler: just capture the loopback traffic (the queries from
your own check script) with tcpdump -i lo.
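
A rough sketch of such a rotating capture, written as a small Python helper that only builds the tcpdump command line. The file prefix, file size and file count are placeholder values I made up, so adjust them to your disk; -C, -W and -w are the standard tcpdump options for size-based rotation.

```python
import shlex

def tcpdump_argv(interface="lo", prefix="/var/tmp/dns-capture.pcap",
                 file_size_mb=100, file_count=10):
    """Build a tcpdump command that captures DNS traffic into a ring of files.

    All default values are illustrative assumptions, not recommendations.
    """
    return [
        "tcpdump",
        "-i", interface,          # "lo" for the local check script, "any" for all interfaces
        "-n",                     # do not resolve addresses to names
        "-C", str(file_size_mb),  # start a new file after roughly this many million bytes
        "-W", str(file_count),    # keep at most this many files, then reuse the oldest
        "-w", prefix,             # write raw packets to files named after this prefix
        "port", "53",             # DNS over both UDP and TCP
    ]

def tcpdump_command(**kwargs):
    """Return the same command as a single shell-quoted string."""
    return shlex.join(tcpdump_argv(**kwargs))
```

Printing tcpdump_command() gives a line you can paste into a root shell; with the defaults above the capture is capped at about 10 x 100 MB.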
Make sure you really see the requests to PDNS but no answers. Or maybe
there are answers, but they arrive much too late.
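
To tell "no answer" apart from "late answer" it can help to log the latency of each check. Below is a minimal sketch of a UDP probe in Python; it is not what dig does internally, just a hand-built single-question query with a timer around it. The default query ID, the timeout, and any server address or zone name you pass in are assumptions for illustration.

```python
import socket
import struct
import time

def build_query(name, qtype=1, query_id=0x1234):
    """Encode a minimal DNS query: one question, class IN, no flags set."""
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    # qtype 1 is an A record; the final 1 is class IN
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)

def time_query(server, name, timeout=5.0, port=53):
    """Return the answer latency in seconds, or None if nothing usable arrived."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.connect((server, port))   # only accept datagrams from this server
        start = time.monotonic()
        sock.send(query)
        try:
            data = sock.recv(4096)
        except OSError:                # timeout, ICMP port unreachable, etc.
            return None
        if data[:2] != query[:2]:      # transaction ID does not match our query
            return None
        return time.monotonic() - start
```

Calling time_query() with the node's public IP and one of your zones every few seconds, and logging the result with a timestamp, gives you something to line up against the tcpdump capture.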
The problem may be that PDNS reads from the socket too slowly. Then the
socket receive buffer fills up and you get packet loss (tools like
netstat can report this).
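
On Linux the counters behind the netstat output are also available in /proc/net/snmp, so a script can record them next to each check. A small sketch, assuming the usual two-line "Udp:" header/value layout of that file; the function names are my own.

```python
def parse_udp_counters(snmp_text):
    """Return the Udp counters from /proc/net/snmp-style text as a dict."""
    udp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]   # does not match "UdpLite:"
    if len(udp_lines) < 2:
        raise ValueError("no Udp header/value pair found")
    names, values = udp_lines[0][1:], udp_lines[1][1:]
    return dict(zip(names, (int(v) for v in values)))

def read_udp_counters(path="/proc/net/snmp"):
    """Read and parse the system-wide UDP counters (Linux only)."""
    with open(path) as handle:
        return parse_udp_counters(handle.read())
```

If RcvbufErrors (or InErrors) grows between two samples around the time a check fails, packets were dropped before PDNS could read them. Note that these counters are system-wide, not per socket.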
Also monitor the PDNS statistics. For example, read:
https://blog.powerdns.com/2014/12/11/powerdns-graphing-as-a-service/
Then watch the number of outstanding queries, maybe sending the
statistics every second.
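
If you do not want to set up the graphing service right away, the same counters can be polled locally. A sketch, assuming "pdns_control list" prints the statistics as comma-separated name=value pairs (please verify this on your version), and assuming qsize-q, udp-queries and udp-answers are the metrics of interest:

```python
import subprocess

def parse_pdns_list(output):
    """Parse comma-separated name=value statistics into a dict of ints."""
    stats = {}
    for item in output.strip().split(","):
        name, sep, value = item.partition("=")
        value = value.strip()
        if sep and value.lstrip("-").isdigit():   # skip empty or non-numeric items
            stats[name.strip()] = int(value)
    return stats

def read_pdns_stats():
    """Run pdns_control and return its statistics (needs access to the control socket)."""
    result = subprocess.run(["pdns_control", "list"], check=True,
                            capture_output=True, text=True)
    return parse_pdns_list(result.stdout)
```

Calling read_pdns_stats() once per second and logging qsize-q together with the difference between udp-queries and udp-answers should show whether queries pile up inside PDNS when a check times out.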
regards
Klaus
On 17.09.2019 at 16:09, Netsons - Federico Chiacchiaretta wrote:
Hi,
we have a PowerDNS cluster of authoritative servers running on 4 nodes:
OS: CentOS 7.6.1810 (fully updated)
Version: pdns-4.1.13-1pdns.el7.x86_64
Backend: mysql - MariaDB-server-10.1.41-1.el7.centos.x86_64
Backend is configured with 1 master and 3 slaves.
We perform recurring checks (every 30s) to verify that the DNS server is
working, and these checks randomly time out.
Checks are performed from both:
* an external tool (Pingdom) with a timeout of 30s
* a bash script on each node, which performs a dig on the public IP
address of that node (default timeout of 5 seconds).
When a timeout occurs, it occurs only on one check mechanism (pingdom
or script), never on both simultaneously.
Output from our script is simply:
";; connection timed out; no servers could be reached"
Logs from pdns.service report a lot of these messages:
set 17 06:00:00 dns4.netsons.net pdns_server[10277]: TCP Connection
Thread died because of network error: Timeout reading data
set 17 06:00:14 dns4.netsons.net pdns_server[10277]: TCP Connection
Thread died because of network error: Timeout reading data
set 17 06:00:29 dns4.netsons.net pdns_server[10277]: TCP Connection
Thread died because of network error: Timeout reading data
set 17 06:00:34 dns4.netsons.net pdns_server[10277]: TCP Connection
Thread died because of network error: Timeout reading data
set 17 06:00:34 dns4.netsons.net pdns_server[10277]: TCP Connection
Thread died because of network error: Timeout reading data
set 17 06:00:34 dns4.netsons.net pdns_server[10277]: TCP Connection
Thread died because of network error: Timeout reading data
set 17 06:00:34 dns4.netsons.net pdns_server[10277]: TCP Connection
Thread died because of network error: Timeout reading data
set 17 06:00:34 dns4.netsons.net pdns_server[10277]: TCP Connection
Thread died because of network error: Timeout reading data
set 17 06:00:34 dns4.netsons.net pdns_server[10277]: TCP Connection
Thread died because of network error: Timeout reading data
but these messages do not coincide with the timeouts on our checks
(though I'd like to understand why they get logged).
Do you have any hint about what I can check to further troubleshoot the
issue?
Thanks.
Best,