I think you should first find out whether the problem is in PowerDNS, in the network, or somewhere in between.

If this happens regularly, just use tcpdump to capture all DNS traffic to a file (rotate the files and keep only X of them, choosing X so that the capture cannot fill your hard disk).

Or even simpler: capture only the loopback traffic (the queries from your own check script) with tcpdump -i lo.
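
A rotating capture along these lines could be started from a small helper. This is only a sketch: the output path, file size and file count are made-up defaults, and the function names are hypothetical. The tcpdump options used are -i (interface), -n (no name resolution), -w (write to file), -C (rotate after that many million bytes) and -W (limit the number of files).

```python
import subprocess


def build_capture_command(interface="lo", prefix="/var/tmp/dns-capture.pcap",
                          max_file_mb=100, max_files=10):
    """Build a tcpdump command line that records DNS traffic into a ring of files.

    The path and the limits are illustrative assumptions, not recommendations.
    """
    return [
        "tcpdump",
        "-i", interface,          # "lo" is enough to see a local check script
        "-n",                     # do not resolve addresses while capturing
        "-w", prefix,             # write raw packets instead of printing them
        "-C", str(max_file_mb),   # start a new file after roughly this many MB
        "-W", str(max_files),     # keep at most this many files, then overwrite
        "port", "53",             # DNS over UDP and TCP
    ]


def start_capture(**kwargs):
    """Start tcpdump in the background (needs root or equivalent privileges)."""
    return subprocess.Popen(build_capture_command(**kwargs))
```

Calling start_capture(interface="eth0") would capture on a real interface instead (the interface name is an assumption). The disk usage is bounded by about max_file_mb times max_files megabytes. On some distributions tcpdump drops privileges after start, so the target directory may have to be writable by that user.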

Make sure you really see the requests to PDNS but no answers. Or maybe there are answers, but they arrive much too late.

The problem may be that PDNS reads from the socket too slowly. Then the socket receive buffer fills up and packets get dropped (tools like netstat can report this).
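
On Linux the same counters that netstat -su prints can be read from /proc/net/snmp. The following is a minimal sketch with hypothetical helper names. It assumes the usual layout of a "Udp:" header line followed by a "Udp:" value line, and it reports system-wide counters, not per-socket ones.

```python
def parse_udp_counters(snmp_text):
    """Return the Udp counters from /proc/net/snmp-style text as a dict."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) < 2:
        raise ValueError("no Udp header/value pair found")
    header, values = rows[0], rows[1]
    # Skip the leading "Udp:" label and pair each counter name with its value.
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


def read_udp_counters(path="/proc/net/snmp"):
    """Read the current UDP counters of this host."""
    with open(path) as handle:
        return parse_udp_counters(handle.read())


def receive_buffer_drops(before, after):
    """Number of datagrams dropped because a receive buffer was full."""
    return after.get("RcvbufErrors", 0) - before.get("RcvbufErrors", 0)
```

Taking one reading before and one after a failed check and passing both to receive_buffer_drops() shows whether datagrams were dropped in that window. A result above zero would point at a full receive buffer somewhere on the host, though not necessarily at the PDNS socket.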

Also monitor the PDNS statistics, e.g. read:
https://blog.powerdns.com/2014/12/11/powerdns-graphing-as-a-service/

Then watch the number of outstanding queries; maybe send the statistics every second.
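
For a quick look without a graphing service, the counters can also be polled locally once per second. This is a rough sketch that assumes pdns_control show NAME prints just the number. Which statistic corresponds best to "outstanding queries" is an assumption on my side (qsize-q, the number of questions waiting for the backend, looks like a candidate). The function names are hypothetical.

```python
import subprocess
import time


def read_counter(name):
    """Read one statistic through `pdns_control show NAME` and return it as int."""
    output = subprocess.check_output(["pdns_control", "show", name])
    return int(output.decode().strip())


def sample(names, reader=read_counter):
    """Take one reading of every requested statistic."""
    return {name: reader(name) for name in names}


def watch(names, interval=1.0, rounds=10, reader=read_counter, sleep=time.sleep):
    """Yield (timestamp, values) once per interval, `rounds` times in total."""
    for i in range(rounds):
        yield time.time(), sample(names, reader)
        if i + 1 < rounds:
            sleep(interval)
```

Looping over watch(["qsize-q", "udp-queries", "udp-answers"]) and printing each row gives one line per second. A queue size that keeps growing, or udp-answers falling behind udp-queries around the time a check fails, would suggest that the server or its backend is the slow part and not the network.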

regards
Klaus

On 17.09.2019 at 16:09, Netsons - Federico Chiacchiaretta wrote:
Hi,
we have a PowerDNS cluster of authoritative servers running on 4 nodes:

OS: CentOS 7.6.1810 (fully updated)
Version: pdns-4.1.13-1pdns.el7.x86_64
Backend: mysql - MariaDB-server-10.1.41-1.el7.centos.x86_64

Backend is configured with 1 master and 3 slaves.

We perform recurring checks (every 30s) to verify that the DNS server is
working, and these checks randomly time out.
Checks are performed from both:

* an external tool (Pingdom) with a timeout of 30s
* a bash script on each node, which performs a dig against the public IP
address of that node (default timeout of 5 seconds).

When a timeout occurs, it occurs only on one check mechanism (pingdom
or script), never on both simultaneously.

Output from our script is simply:

";; connection timed out; no servers could be reached"

Logs from pdns.service report a lot of these messages:

set 17 06:00:00 dns4.netsons.net pdns_server[10277]: TCP Connection Thread died because of network error: Timeout reading data
set 17 06:00:14 dns4.netsons.net pdns_server[10277]: TCP Connection Thread died because of network error: Timeout reading data
set 17 06:00:29 dns4.netsons.net pdns_server[10277]: TCP Connection Thread died because of network error: Timeout reading data
set 17 06:00:34 dns4.netsons.net pdns_server[10277]: TCP Connection Thread died because of network error: Timeout reading data
set 17 06:00:34 dns4.netsons.net pdns_server[10277]: TCP Connection Thread died because of network error: Timeout reading data
set 17 06:00:34 dns4.netsons.net pdns_server[10277]: TCP Connection Thread died because of network error: Timeout reading data
set 17 06:00:34 dns4.netsons.net pdns_server[10277]: TCP Connection Thread died because of network error: Timeout reading data
set 17 06:00:34 dns4.netsons.net pdns_server[10277]: TCP Connection Thread died because of network error: Timeout reading data
set 17 06:00:34 dns4.netsons.net pdns_server[10277]: TCP Connection Thread died because of network error: Timeout reading data

but these messages do not coincide with the timeouts of our checks (though
I'd like to understand why they get logged).

Do you have any hint about what I can check to further troubleshoot the
issue?

Thanks.

Best,


_______________________________________________
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/pdns-users
