On 19/11/2025 21:56, Mark Andrews wrote:
On 20 Nov 2025, at 02:49, Petr Menšík <[email protected]> wrote:

No, I disagree.
On 19/11/2025 03:47, John Levine wrote:

The spec has always been clear that TXT records are strings of
arbitrary 8-bit data. If you want to put a particular interpretation
on some TXT records, pick an underscore _prefix and write a spec that
says what the format of the records is. See this registry for a dozen
examples:

https://www.iana.org/assignments/dns-parameters/dns-parameters.xhtml#underscored-globally-scoped-dns-node-names
RFC 1035 says about the TXT record:
TXT RRs are used to hold descriptive text. The semantics of the text depends on 
the domain where it is found.
Now, it does not specify anything about escaping the text when any byte with value 
>= 127 is used. Yes, it is 8-bit data. But record types designed to hold generic 
binary data use a base64 presentation format; TXT does not, by design. 
The assumption that UTF-8 code points are not letters, just because they have higher 
byte values, is wrong in my opinion.
ASCII-only text is always valid UTF-8. No change is needed to 
process a UTF-8 encoded zone file. But once you escape it, it stays escaped, 
and it becomes unreadable by humans.
Petr, when RFC 1035 was written UTF-8 did not exist, nor did base64.  You need
to read RFC 1035 taking into account the state of computing in 1987.  Yes,
there is still code from that time running that needs to be able to read these
records.  Network ASCII existed, i.e. RFC 20.  Octets with the high bit set had
no specified character set, so they were unsafe to display.  End-of-line conventions
still differ between systems, and emitting control characters is dangerous, so
they need to be escaped.

Yes, I understand that. I am old enough to remember what a mess plain text was with different encodings; I was born in 1982 and started with computers early. After all, for our Czech it differed by operating system: the Windows-1250 code page on Microsoft systems, ISO-8859-2 on Unices, and CP852 on DOS. I think we may have experienced a bit more confusion than most Western users. It was sheer chaos, and thank goodness it is history! It made complete sense to escape all bytes >= 127 at that time.

I think it is clear why they did not specify any encoding for those files and why escaping everything non-ASCII made sense back in the day. But what those old 8-bit code pages lacked was any way to check what the encoding of the text is; UTF-8's structure at least provides a strong hint that the data really is UTF-8. I do not propose any form of normalization, case-insensitive comparison, or similarly risky processing. The content is binary; treat it as binary. When it is compared, use memcmp(). Do not try to sort it with any locale-dependent comparison. Only for printing: do not escape UTF-8 text on a UTF-8 enabled terminal, unless it contains something isprint() would not allow either.
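To make the printing rule I have in mind concrete, here is a minimal sketch in Python (the function name escape_txt and the utf8_ok flag are mine, purely illustrative; real tools would do this in C): pass printable ASCII through, pass valid printable UTF-8 sequences through raw only when the terminal can display them, and \DDD-escape everything else.

```python
def escape_txt(data: bytes, utf8_ok: bool = True) -> str:
    """Render TXT rdata for display (illustrative sketch, not BIND's code).

    Printable ASCII passes through ('"' and '\\' are quoted); valid,
    printable UTF-8 sequences pass through raw only when utf8_ok says the
    terminal can display them; everything else gets the RFC 1035 \\DDD
    decimal escape."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0x20 <= b < 0x7F:                 # printable ASCII range
            out.append('\\' + chr(b) if b in (0x22, 0x5C) else chr(b))
            i += 1
            continue
        if utf8_ok and b >= 0x80:
            # Try to decode one 2-, 3- or 4-byte UTF-8 sequence here.
            for n in (2, 3, 4):
                try:
                    ch = data[i:i + n].decode('utf-8')
                except UnicodeDecodeError:
                    continue
                if ch.isprintable():
                    out.append(ch)
                    i += n
                    break
            else:                            # invalid or non-printable: escape
                out.append('\\%03d' % b)
                i += 1
            continue
        out.append('\\%03d' % b)             # control bytes: always escape
        i += 1
    return ''.join(out)
```

With utf8_ok=False this degenerates to the traditional behaviour of escaping every high byte, so the old output remains available where it is wanted.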

What I am most interested in are tools like dig, host or nsupdate. They are used by less technical people too. If you insist this is a problem for zone parsers, well, okay; those are not for common users. I personally do not see a good reason to skip UTF-8 support even in their text form, but something like a $UTF8 directive at the beginning of the zone would solve that.

But I think the reasons for those 8-bit encodings are mostly gone. Automatic conversions from one code page to another are no longer common. Sure, they are still possible, but not something you will meet often unless you do very weird things in your pipeline. And it is a simple and fast check whether your zone file contains only ASCII, valid UTF-8, or some more random bytes.
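That check is cheap because the three classes nest: ASCII is trivially valid UTF-8, and anything that fails a strict UTF-8 decode can be treated as opaque binary. A sketch in Python (the function name classify is mine, just for illustration):

```python
def classify(data: bytes) -> str:
    """Classify a zone-file buffer as 'ascii', 'utf-8', or 'binary'.

    ASCII is a subset of UTF-8, so the classes nest; anything that
    fails a strict UTF-8 decode is treated as opaque binary data."""
    if all(b < 0x80 for b in data):
        return 'ascii'
    try:
        data.decode('utf-8')                # strict decoding by default
        return 'utf-8'
    except UnicodeDecodeError:
        return 'binary'
```

A tool could run this once over the file (or a record) and only then decide whether unescaped display is safe.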


RFC 1035

Because these files are text files several special encodings are
necessary to allow arbitrary data to be loaded. In particular:


@ A free standing @ is used to denote the current origin.

\X where X is any character other than a digit (0-9), is
used to quote that character so that its special meaning
does not apply. For example, "\." can be used to place
a dot character in a label.

\DDD where each D is a digit is the octet corresponding to
the decimal number described by DDD. The resulting
octet is assumed to be text and is not checked for
special meaning.

Yes, I understand this works on all 7-bit safe terminals, and even when zone files are converted from ISO-8859-1 to UTF-8 or back. If that is done anywhere for a good reason, I would like to hear about it.

These escapes were important for consistent results when systems tried to automatically convert plain text to your native code page. I doubt that still happens today.

I do not want to forbid escaping wherever you want it. I am saying it is wrong to apply such escaping always, by default, and make non-ASCII text unreadable and unusable. I would like to omit escaping of valid UTF-8 text when there is no indication that my system cannot display it.
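For reference, the \X and \DDD rules quoted above are easy to decode; a minimal Python sketch (the function name unescape is mine, and it assumes DDD <= 255 and copies other characters through as UTF-8):

```python
def unescape(s: str) -> bytes:
    """Decode the RFC 1035 presentation-format escapes quoted above:
    \\DDD yields the octet with that decimal value, \\X quotes X
    literally; everything else is copied through (as UTF-8 here)."""
    out = bytearray()
    i = 0
    while i < len(s):
        if s[i] == '\\' and i + 1 < len(s):
            if i + 3 < len(s) and s[i + 1:i + 4].isdigit():
                out.append(int(s[i + 1:i + 4]))       # \DDD, assumed <= 255
                i += 4
            else:
                out.extend(s[i + 1].encode('utf-8'))  # \X: literal character
                i += 2
        else:
            out.extend(s[i].encode('utf-8'))
            i += 1
    return bytes(out)
```

Decoding is unambiguous either way; the debate is only about which bytes must be escaped on the way out.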


( ) Parentheses are used to group data that crosses a line
boundary. In effect, line terminations are not
recognized within parentheses.

; Semicolon is used to start a comment; the remainder of
the line is ignored.


--
Petr Menšík
Senior Software Engineer, RHEL
Red Hat, https://www.redhat.com/
PGP: DFCF908DB7C87E8E529925BC4931CA5B6C9FC5CB
_______________________________________________
DNSOP mailing list -- [email protected]
To unsubscribe send an email to [email protected]