Problem with a certain domain
> My go-to DNS debugging site at
>
>   https://dnsviz.net/d/s1._domainkey.mg-esp-prod-eu-eu.mallorcazeitung.es/dnssec/
>
> appears to indicate there is more than one problem, but the most
> serious one is probably this one:
>
> It might look like one or more of the publishing name servers responds
> incorrectly when queried for an "empty non-terminal" name
> (e.g. _domainkey...), which probably itself doesn't have any data on
> that node, but has data on "names below". The correct response code
> is then NOERROR with answer count=0 (aka. "NODATA"), not NXDOMAIN.
>
> When a recursor gets NXDOMAIN back, it is free to assume that the
> queried-for name does not exist (which is obvious), and nothing exists
> below that node either. See RFC 8020.
>
> Regards,
>
> - Håvard

Håvard, what you say is correct about the NXDOMAIN RCODE. However, Thomas's logs and dig output suggest that the failure is a timeout, possibly because BIND/named is not responding. So I don't think that DNSViz error matches the problem description. Having said that, one or more problems with the relevant zones could be triggering something in BIND...

Thomas, can you clarify whether all queries to 127.0.0.1/53 result in:

  ;; communications error to 127.0.0.1#53: timed out

when this problem occurs, or do just queries for s1._domainkey.mg-esp-prod-eu-eu.mallorcazeitung.es fail (or some level of failure in between all queries and the ones for that one domain)?

And at that time, can you successfully query from the same system using a public resolver (e.g. "dig @9.9.9.9 s1._domainkey.mg-esp-prod-eu-eu.mallorcazeitung.es TXT")?

And do you have BIND's logging for the queries that fail?

Thanks,

b.

Michael Batchelder
ISC Support
--
Visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

ISC funds the development of this software with paid support subscriptions. Contact us at https://www.isc.org/contact/ for more information.
bind-users mailing list bind-users@lists.isc.org https://lists.isc.org/mailman/listinfo/bind-users
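The NODATA versus NXDOMAIN distinction discussed in the message above can be sketched in a few lines of Python. This is only an illustration: the function name and the returned labels are made up for this example, and it looks at nothing but the RCODE and the answer count (it ignores referrals, CNAMEs, and DNSSEC).

```python
# Illustrative sketch only: classify a response from an authoritative
# server by RCODE and answer count. RCODE 0 (NOERROR) and 3 (NXDOMAIN)
# are the standard values; the function and labels are hypothetical.

NOERROR = 0
NXDOMAIN = 3


def classify_response(rcode: int, answer_count: int) -> str:
    """Return a coarse label for a response to a single query."""
    if rcode == NXDOMAIN:
        # Per RFC 8020, a resolver may conclude that nothing exists
        # at or below the queried name.
        return "NXDOMAIN"
    if rcode == NOERROR and answer_count == 0:
        # The name exists (for example as an empty non-terminal) but
        # holds no data of the requested type.
        return "NODATA"
    if rcode == NOERROR:
        return "ANSWER"
    return "OTHER"


def ok_for_empty_non_terminal(rcode: int, answer_count: int) -> bool:
    """An empty non-terminal should be answered with NODATA."""
    return classify_response(rcode, answer_count) == "NODATA"
```

With this model, a server that answers a query for an empty non-terminal such as _domainkey with RCODE 3 fails the check, while RCODE 0 with an empty answer section passes.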
Problem with a certain domain
> The newsletter is only sent out once a day, so I would have to wait
> until tomorrow. I'll record it then. I have already experimented with
> tshark and recorded port 53.

When you run your packet capture, do not restrict your capture to only port 53. As a general rule, always keep your filtering as open as possible. That will allow for capturing potentially critical evidence such as ICMP error messages, ARP broadcasts, etc... or the absence of such things when they should be there. So at minimum add "icmp and arp" to your filter expression.

> What I noticed as a network layman is that a certain
> response takes much longer on server 1 with the problems than
> on server 2.

Your tshark snippets do not show "a certain response" taking much longer. That might be the explanation, but what you show is not proof of that. Your snippets only show response packets with varying amounts of separation between them. Without the request packet which generated the response, we can't calculate an actual time to respond, and have no way of knowing with certainty what the situation really is.

Another general rule: don't limit the amount of information you provide to those who are trying to help you or make them infer information. It's fine to mention only certain packets in an email, but put the full packet capture on a public resource somewhere accessible.

Michael Batchelder
ISC Support
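To illustrate why the request packet is needed, here is a minimal Python sketch of how a time-to-respond could be computed from a capture. The DnsPacket record and its fields are assumptions made for this example (they are not tshark's output format), and matching on DNS ID and query name alone is a simplification; a real analysis would also compare addresses and ports.

```python
from typing import List, NamedTuple, Optional, Tuple


class DnsPacket(NamedTuple):
    """Hypothetical, simplified view of one captured DNS packet."""
    timestamp: float   # capture time in seconds
    dns_id: int        # DNS message ID
    qname: str         # name in the question section
    is_response: bool  # True if the QR bit is set


def response_times(packets: List[DnsPacket]) -> List[Tuple[str, Optional[float]]]:
    """Pair each response with its query and return (qname, seconds).

    A response whose query is not in the capture yields None, because
    no response time can be calculated for it.
    """
    pending = {}
    results = []
    for pkt in sorted(packets, key=lambda p: p.timestamp):
        key = (pkt.dns_id, pkt.qname.lower())
        if not pkt.is_response:
            # Remember when the first query with this key was sent.
            pending.setdefault(key, pkt.timestamp)
        elif key in pending:
            results.append((pkt.qname, pkt.timestamp - pending.pop(key)))
        else:
            results.append((pkt.qname, None))
    return results
```

A capture that contains only responses would produce None for every entry, which is the situation described above: the spacing between responses says nothing about how long each one took.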
Problem with a certain domain
Thomas,

I just incorrectly wrote:

> So at minimum add "icmp and arp" to your filter expression.

I did not mean to use the logical "and". Your minimum filter should be something like:

  "src port 53 or icmp or arp"

Sorry for the confusion,

Michael

Michael Batchelder
ISC Support
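The difference between the two filter expressions can be shown with a toy model in Python. The packet dictionaries and predicate functions below are invented for illustration; they only mimic the boolean logic of a capture filter and are not how tcpdump or tshark evaluate one.

```python
# Toy model of capture-filter logic. Packets are plain dicts with
# hypothetical keys; only the and/or semantics are being illustrated.

def is_icmp(pkt):
    return pkt["proto"] == "icmp"


def is_arp(pkt):
    return pkt["proto"] == "arp"


def is_src_port_53(pkt):
    return pkt.get("sport") == 53


packets = [
    {"proto": "udp", "sport": 53, "dport": 40000},   # DNS response
    {"proto": "icmp"},                                # e.g. an ICMP error
    {"proto": "arp"},                                 # ARP broadcast
    {"proto": "udp", "sport": 40000, "dport": 53},   # DNS query
]

# "icmp and arp": a packet cannot be both, so nothing matches.
and_filter = [p for p in packets if is_icmp(p) and is_arp(p)]

# "src port 53 or icmp or arp": any one condition is enough.
or_filter = [p for p in packets if is_src_port_53(p) or is_icmp(p) or is_arp(p)]
```

In this model and_filter is empty while or_filter keeps the first three packets. The fourth packet, a query sent to port 53, is not matched by "src port 53"; a filter using "port 53" would match traffic in both directions.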
SERVFAIL error during the evening
Sami,

After you regenerate your rndc key as Mark advised, you will need to provide us with more information, as what you've sent is not sufficient to troubleshoot your symptom.

As a first step, take a packet capture on the resolver that shows incoming queries from the client and the corresponding outgoing queries from the resolver to upstream servers. When you capture packets, do not filter out TCP or ICMP or ARP. A tcpdump filter such as 'icmp or arp or port 53' should be sufficient. I would capture on all interfaces of the server (-i any).

Send that capture file along with the BIND log segment which contains the failed queries.

Michael Batchelder
ISC Support
Re: qname minimization: me too :(
> Yes, sure. I grabbed three typical cases to analyze further, and
> currently trying to understand the proceedings - unsuccessfully, up
> to now. :(
>
> Case 1:
> ---
> Jun 19 17:42:12 conr named[24481]: lame-servers:
>   info: success resolving '26.191.165.185.in-addr.arpa/PTR'
>   after disabling qname minimization due to 'ncache nxdomain'
>
> This one does not point back to me, but nevertheless I do not
> see the lame server.
>
> Case 2:
> ---
> Jun 19 18:02:44 conr named[24481]: lame-servers:
>   info: success resolving 'reactivite.fr.intra.daemon.contact/'
>   after disabling qname minimization due to 'ncache nxdomain'
>
> Here, for whatever reason, the client was not happy with the official
> answer on "reactivite.fr", and tried to append the search domain for
> internal hosts on my LAN.
> So this does absolutely point to me, only. The recursing LAN server
> asks the authoritative LAN server (same image, different view), and
> that one basically says, this is bogus.
>
> Case 3:
> ---
> Jun 19 18:28:48 conr named[24481]: lame-servers:
>   info: success resolving
>   '1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.b.1.0.0.3.2.f.1.0.7.4.0.1.0.0.2.ip6.arpa/PTR'
>   after disabling qname minimization due to 'ncache nxdomain'

Peter,

Case 1: The 191.165.185.in-addr.arpa zone (@200.3.13.14) responds with NXDOMAIN to queries for any QTYPE for QNAME 191.165.185.in-addr.arpa.

Case 2: The intra.daemon.contact zone (@195.154.230.217) responds with NXDOMAIN to queries for any QTYPE for QNAME intra.daemon.contact.

Case 3: The f.1.0.7.4.0.1.0.0.2.ip6.arpa zone (@216.66.80.18) responds with NXDOMAIN to queries for any QTYPE for QNAME f.1.0.7.4.0.1.0.0.2.ip6.arpa.

You'll need to fix these zones so that the response is NOERROR rather than NXDOMAIN.

b.
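To see why an NXDOMAIN at an intermediate name matters for the three cases above, here is a simplified Python sketch of the names a minimizing resolver asks about. It is an assumption-laden model, not BIND's algorithm: it walks one label at a time from the top, whereas a real resolver starts at the closest zone cut it already knows and may add several labels per step. Both function names are hypothetical.

```python
from typing import List, Optional, Set


def minimized_qnames(qname: str) -> List[str]:
    """List the successively longer names, shortest first."""
    labels = qname.rstrip(".").lower().split(".")
    return [".".join(labels[i:]) + "." for i in range(len(labels) - 1, -1, -1)]


def first_breaking_name(qname: str, nxdomain_names: Set[str]) -> Optional[str]:
    """Return the first intermediate name answered with NXDOMAIN, if any.

    nxdomain_names holds the fully qualified names for which some
    server is assumed to answer NXDOMAIN. The full query name itself
    is not considered, only its ancestors.
    """
    for name in minimized_qnames(qname)[:-1]:
        if name in nxdomain_names:
            return name
    return None
```

For Case 1, minimized_qnames("26.191.165.185.in-addr.arpa") ends with "191.165.185.in-addr.arpa." followed by the full name. If the intermediate name draws NXDOMAIN, the walk stops there even though the PTR record below it exists, which matches the 'ncache nxdomain' reason in the log lines.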
Re: SERVFAIL error during the evening
> Hello Michael
> Thank you for your response. Here is a pcap file and some logs.

Hello Sami,

Your pcap shows your resolver making thousands of queries that get no responses (or at least the pcap does not contain them). There's not much I can say, beyond that this does not appear to be a problem related to BIND. You will need to look at your infrastructure and beyond to determine why you are not getting responses to your queries.

One possibility may be in your infrastructure/network, where a firewall or other stateful inspection device is running out of resources to make additional state table entries. You will need to speak with the technical support of that device's vendor if you need help in assessing this.

Michael
Re: SERVFAIL error during the evening
>> Hello Michael
>> Thank you for your response. Here is a pcap file and some logs.
>
> Hello Sami,
>
> Your pcap shows your resolver making thousands of queries that get
> no responses (or at least the pcap does not contain them). There's
> not much I can say, beyond that this does not appear to be a problem
> related to BIND.

Sami,

My co-worker helpfully pointed out something I missed when reviewing your packet capture. A large number of your resolution failures are because your BIND is configured to use QNAME minimization (a.k.a. "qmin") and the queries are to zones whose configuration is done incorrectly and breaks qmin.

The pcap indicates you have the 'qname-minimization strict' setting in your BIND configuration file. See the "qname-minimization" statement in the Options section of the BIND ARM (https://bind9.readthedocs.io/en/v9.16.25/reference.html#options-statement-definition-and-usage). For the general background on qmin, read RFCs 7816 and 9156.

I don't know of a reason why you would experience more qmin failures in the evening, other than the requests that fail are only made at that time.

Regardless, if you want to stop the failures completely, you can change the 'qname-minimization strict' setting to 'qname-minimization disabled'. The drawback is that your queries will no longer be minimized, so all authoritative servers will see the full query name during recursion.

As a compromise between doing nothing and fully disabling qmin, you can use the 'qname-minimization relaxed' setting which will try qmin and if BIND encounters a zone which breaks qmin, then BIND will switch to not doing qmin and do normal recursion (equivalent to 'qname-minimization disabled') for that query.

Also, you should upgrade your version of BIND, as we can see that the qmin queries are those used in older versions of BIND.
Michael
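The three qname-minimization settings described in the message above can be contrasted with a toy Python model. This is a sketch under stated assumptions, not BIND's implementation: the resolve function, its return values, and the one-label-at-a-time walk are all invented for illustration, and the full query name is assumed to resolve once it is actually asked.

```python
from typing import List, Set, Tuple


def resolve(qname: str, nxdomain_names: Set[str], mode: str = "relaxed") -> Tuple[str, List[str]]:
    """Toy model of qmin behavior; returns (outcome, names_asked).

    nxdomain_names: intermediate names that a misconfigured zone is
    assumed to answer with NXDOMAIN.
    mode: "strict", "relaxed", or "disabled", after the setting names
    used in the message; the logic itself is hypothetical.
    """
    full = qname.rstrip(".").lower() + "."
    if mode == "disabled":
        # No minimization: only the full name is ever asked.
        return "answer", [full]

    labels = full.rstrip(".").split(".")
    asked = []
    # Ask about each ancestor, shortest first, before the full name.
    for i in range(len(labels) - 1, 0, -1):
        name = ".".join(labels[i:]) + "."
        asked.append(name)
        if name in nxdomain_names:
            if mode == "strict":
                # Strict mode takes the NXDOMAIN at face value.
                return "failed", asked
            # Relaxed mode gives up on minimization for this query
            # and falls back to asking for the full name.
            asked.append(full)
            return "answer", asked
    asked.append(full)
    return "answer", asked
```

For example, resolve("a.b.example", {"b.example."}, "strict") stops at "b.example." and fails, while the same call with "relaxed" falls back and asks for "a.b.example." directly.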
Re: SERVFAIL error during the evening
> I have configured qname to disabled for now. Once the issue is resolved,
> I will set it to relaxed. I have provided a download link for the log
> files and a dig +trace test for more details on this issue, which I do
> not think is related to BIND or its configuration.

Sami,

Discussions of non-BIND issues are outside the scope of this list. If you believe an issue is not related to BIND, you should look for a different forum or resource (such as vendor technical support) whose purview is relevant to the problem you have.

Regarding your example dig +trace for push-rtmp-l96.douyincdn.com, you stopped at the point of tracing that produced this answer:

  push-rtmp-l96.douyincdn.com. 600 IN CNAME push-rtmp-l96.douyincdn.com.d.live.cdn.chinamobile.com.

You will need to do the next steps of troubleshooting to see how push-rtmp-l96.douyincdn.com.d.live.cdn.chinamobile.com is resolved. To do that, I recommend using an excellent tool written by Shumon Huque: resolve.py (https://github.com/shuque/resolve). In particular, this tool will help you to see problems when QNAME minimization breaks due to bad zone configurations.

Running the tool with the appropriate command-line switches:

  resolve.py -mv push-rtmp-l96.douyincdn.com.d.live.cdn.chinamobile.com

will reveal multiple issues. One is:

  # QUERY: com.d.live.cdn.chinamobile.com. A IN at zone d.live.cdn.chinamobile.com. address 139.159.208.46
  #[Got answer in 0.378 s]
  ERROR: NXDOMAIN: com.d.live.cdn.chinamobile.com. not found

NXDOMAIN is an incorrect response for this query; the correct response is NODATA (i.e. RCODE = 0, ANSWER = 0). So China Mobile's CDN has broken DNS configuration and this breaks QNAME minimization. And querying the domain of the CNAME you would get if this failure wasn't present (cmcczjcdnl.pushcmcc.rtmps.gslb.d.live.cdn.chinamobile.com) also produces an NXDOMAIN at gslb.d.live.cdn.chinamobile.com and the nodes below it. So same problem.
> I suspected that a firewall was blocking the DNS traffic, so I bypassed
> the firewall, but the result is the same. How can we ensure that this is
> a network-level issue?

I looked at some of your logs. The resolver.log file is mostly errors of the form:

  resolver: notice: DNS format error from #53 resolving ns-open3.qq.com/ for : Name qq.com (SOA) not subdomain of zone ns-open3.qq.com -- invalid response

If you look at the corresponding packets in your pcap, the responses are NODATA with an SOA record for the qq.com zone indicating the authoritative zone. But if you query for the NS records from the authoritative servers, you get a reply that indicates the zone ns-open3.qq.com is authoritative for resolving ns-open3.qq.com/all QTYPEs:

  # dig @59.36.132.142 ns-open3.qq.com ns

  ; <<>> DiG 9.18.18-0ubuntu0.22.04.2-Ubuntu <<>> @59.36.132.142 ns-open3.qq.com ns
  ; (1 server found)
  ;; global options: +cmd
  ;; Got answer:
  ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1621
  ;; flags: qr aa rd ad; QUERY: 1, ANSWER: 4, AUTHORITY: 4, ADDITIONAL: 1
  ;; WARNING: recursion requested but not available

  ;; OPT PSEUDOSECTION:
  ; EDNS: version: 0, flags:; udp: 4304
  ; COOKIE: 4cd34d4c82645709 (echoed)
  ;; QUESTION SECTION:
  ;ns-open3.qq.com. IN NS

  ;; ANSWER SECTION:
  ns-open3.qq.com. 86400 IN NS ns-tel1.qq.com.
  ns-open3.qq.com. 86400 IN NS ns-cnc1.qq.com.
  ns-open3.qq.com. 86400 IN NS ns-os1.qq.com.
  ns-open3.qq.com. 86400 IN NS ns-cmn1.qq.com.

This mismatch between the authoritative zone name in the SOA record (qq.com) and what the delegated nameservers claim is the authoritative zone (ns-open3.qq.com) causes these messages. If you use the search for this mailing list at https://lists.isc.org/pipermail/bind-users/ or just use any public search engine you will see examples of people reporting this issue, and even citing this particular domain. This is not a BIND problem, it's a misconfiguration of records/zones.
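The condition named in the "not subdomain of zone" log message can be approximated with a label-by-label comparison. The Python sketch below is an illustration only; the function name is hypothetical and this is not BIND's code. A name counts as a subdomain of a zone when the zone's labels form the rightmost labels of the name (a name is treated as a subdomain of itself).

```python
def is_subdomain(name: str, zone: str) -> bool:
    """Return True if name is at or below zone, comparing whole labels."""
    name_labels = [label for label in name.lower().rstrip(".").split(".") if label]
    zone_labels = [label for label in zone.lower().rstrip(".").split(".") if label]
    if len(name_labels) < len(zone_labels):
        # A shorter name can never sit inside a longer zone name.
        return False
    # Compare the rightmost labels of the name with the zone's labels.
    return name_labels[len(name_labels) - len(zone_labels):] == zone_labels
```

Here is_subdomain("qq.com", "ns-open3.qq.com") is False: the SOA owner name qq.com sits above the zone the resolver believes it is querying, which is the mismatch the log message reports. The reverse check, is_subdomain("ns-open3.qq.com", "qq.com"), is True.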

You can try contacting the administrator of the zone, webmas...@qq.com (per the SOA record). And before you ask for help from this list for future issues, I strongly recommend you run any domain that is failing to resolve through dnsviz.net to ensure that you're not asking about another zone misconfiguration, rather than an actual BIND problem.

> download link:
>
> https://we.tl/t-M77os84duE

This link does not appear to be a "public" link. A login appears to be required. In the future, please check that you are providing a public link (i.e. no login required) by using the "private" mode of your chosen browser to see if a link can be accessed without prior login.

Beyond that...

As I mentioned in my initial email, your version of BIND is old and end-of-life. You should upgrade so that any issues can be discussed and bugs filed if necessary. Problems found in EOL'd versions are less likely to be addressed by list members (beyond indicating that you should upgrade).