On 19/11/2025 21:56, Mark Andrews wrote:
On 20 Nov 2025, at 02:49, Petr Menšík <[email protected]> wrote:

No, I disagree.
On 19/11/2025 03:47, John Levine wrote:

The spec has always been clear that TXT records are strings of
arbitrary 8-bit data. If you want to put a particular interpretation
on some TXT records, pick an underscore _prefix and write a spec that
says what the format of the records is. See this registry for a dozen
examples:

https://www.iana.org/assignments/dns-parameters/dns-parameters.xhtml#underscored-globally-scoped-dns-node-names
RFC 1035 says about the TXT record:
TXT RRs are used to hold descriptive text. The semantics of the text depends on 
the domain where it is found.
Now, it does not specify anything about escaping the text when any byte with value 
>= 127 is used. Yes, it is 8-bit data. But record types designed to hold generic 
binary data use a base64 presentation format; TXT does not, by design. 
The assumption that UTF-8 code points are not letters, just because they have higher 
byte values, is wrong in my opinion.
ASCII-only text is always valid UTF-8. No change is needed to 
process a UTF-8 encoded zone file. But once you escape it, it stays escaped, 
and it becomes unreadable by humans.
Petr, when RFC 1035 was written UTF-8 did not exist, nor did base64.  You need
to read RFC 1035 taking into account the state of computing in 1987.  Yes,
there is still code from that time running that needs to be able to read these
records.  Network ASCII existed, i.e. RFC 20.  Octets with the high bit set had
no specified character set, so they were unsafe to display.  End-of-line conventions
still differ between systems, and emitting control characters is dangerous, so
they need to be escaped.

Yes, I understand that. I am old enough to remember what a mess plain text was with different encodings; I was born in 1982 and started with computers early. After all, for our Czech it differed by operating system: the Windows-1250 code page on Microsoft systems, ISO-8859-2 on Unices, and CP852 on DOS. I think we may have experienced a bit more confusion than most Western users. It was sheer chaos, and thank goodness it is history! It made complete sense to escape all bytes >= 127 at that time.

I think it is clear why they did not specify any encoding for those files and why escaping everything non-ASCII made sense back in the day. But what those old 8-bit code pages lacked was any way to check what the encoding of the text is; UTF-8's structure at least provides a strong hint that the data really is UTF-8. I do not propose any form of normalization, case-insensitive comparison, or similarly risky processing. The content is binary; treat it as binary. When it is compared, use memcmp(). Do not try to sort it with any locale-dependent comparison. Only for printing: do not escape UTF-8 text on a UTF-8 enabled terminal, unless it contains something isprint() would not allow either.
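To make the printing rule I have in mind concrete, here is a minimal sketch in Python (the function name escape_txt and the utf8_ok flag are mine, purely illustrative; real tools would do this in C): pass printable ASCII through, pass valid printable UTF-8 sequences through raw only when the terminal can display them, and \DDD-escape everything else.

```python
def escape_txt(data: bytes, utf8_ok: bool = True) -> str:
    """Render TXT rdata for display (illustrative sketch, not BIND's code).

    Printable ASCII passes through ('"' and '\\' are quoted); valid,
    printable UTF-8 sequences pass through raw only when utf8_ok says the
    terminal can display them; everything else gets the RFC 1035 \\DDD
    decimal escape."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0x20 <= b < 0x7F:                 # printable ASCII range
            out.append('\\' + chr(b) if b in (0x22, 0x5C) else chr(b))
            i += 1
            continue
        if utf8_ok and b >= 0x80:
            # Try to decode one 2-, 3- or 4-byte UTF-8 sequence here.
            for n in (2, 3, 4):
                try:
                    ch = data[i:i + n].decode('utf-8')
                except UnicodeDecodeError:
                    continue
                if ch.isprintable():
                    out.append(ch)
                    i += n
                    break
            else:                            # invalid or non-printable: escape
                out.append('\\%03d' % b)
                i += 1
            continue
        out.append('\\%03d' % b)             # control bytes: always escape
        i += 1
    return ''.join(out)
```

With utf8_ok=False this degenerates to the traditional behaviour of escaping every high byte, so the old output remains available where it is wanted.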

What I am most interested in are tools like dig, host or nsupdate. They are used by less technical people too. If you insist this is a problem for zone parsers, well, okay; those are not for common users. I personally do not see a good reason to skip UTF-8 support even in their text form, but something like a $UTF8 directive at the beginning of the zone would solve that.

But I think the reasons for those 8-bit encodings are mostly gone. Automatic conversions from one code page to another are no longer common. Sure, they are still possible, but not something you will meet often unless you do very weird things in your pipeline. And it is a simple and fast check whether your zone file contains only ASCII, valid UTF-8, or some more random bytes.
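That check is cheap because the three classes nest: ASCII is trivially valid UTF-8, and anything that fails a strict UTF-8 decode can be treated as opaque binary. A sketch in Python (the function name classify is mine, just for illustration):

```python
def classify(data: bytes) -> str:
    """Classify a zone-file buffer as 'ascii', 'utf-8', or 'binary'.

    ASCII is a subset of UTF-8, so the classes nest; anything that
    fails a strict UTF-8 decode is treated as opaque binary data."""
    if all(b < 0x80 for b in data):
        return 'ascii'
    try:
        data.decode('utf-8')                # strict decoding by default
        return 'utf-8'
    except UnicodeDecodeError:
        return 'binary'
```

A tool could run this once over the file (or a record) and only then decide whether unescaped display is safe.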


RFC 1035

Because these files are text files several special encodings are
necessary to allow arbitrary data to be loaded. In particular:


@ A free standing @ is used to denote the current origin.

\X where X is any character other than a digit (0-9), is
used to quote that character so that its special meaning
does not apply. For example, "\." can be used to place
a dot character in a label.

\DDD where each D is a digit is the octet corresponding to
the decimal number described by DDD. The resulting
octet is assumed to be text and is not checked for
special meaning.

Yes, I understand this works on all 7-bit safe terminals, and even when zone files are converted from ISO-8859-1 to UTF-8 or back. If that is done anywhere for a good reason, I would like to hear about it.

These escapes were important for consistent results when systems tried to automatically convert plain text to your native code page. I doubt that still happens today.

I do not want to forbid escaping wherever you want it. I am saying it is wrong to apply such escaping always, by default, and make non-ASCII text unreadable and unusable. I would like to omit escaping of valid UTF-8 text when there is no indication that my system cannot display it.
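For reference, the \X and \DDD rules quoted above are easy to decode; a minimal Python sketch (the function name unescape is mine, and it assumes DDD <= 255 and copies other characters through as UTF-8):

```python
def unescape(s: str) -> bytes:
    """Decode the RFC 1035 presentation-format escapes quoted above:
    \\DDD yields the octet with that decimal value, \\X quotes X
    literally; everything else is copied through (as UTF-8 here)."""
    out = bytearray()
    i = 0
    while i < len(s):
        if s[i] == '\\' and i + 1 < len(s):
            if i + 3 < len(s) and s[i + 1:i + 4].isdigit():
                out.append(int(s[i + 1:i + 4]))       # \DDD, assumed <= 255
                i += 4
            else:
                out.extend(s[i + 1].encode('utf-8'))  # \X: literal character
                i += 2
        else:
            out.extend(s[i].encode('utf-8'))
            i += 1
    return bytes(out)
```

Decoding is unambiguous either way; the debate is only about which bytes must be escaped on the way out.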


( ) Parentheses are used to group data that crosses a line
boundary. In effect, line terminations are not
recognized within parentheses.

; Semicolon is used to start a comment; the remainder of
the line is ignored.


--
Petr Menšík
Senior Software Engineer, RHEL
Red Hat, https://www.redhat.com/
PGP: DFCF908DB7C87E8E529925BC4931CA5B6C9FC5CB
_______________________________________________
DNSOP mailing list -- [email protected]
To unsubscribe send an email to [email protected]