Hi, Teor, and sorry for the long delay! You had a lot of good questions on this proposal, and I didn't know how to answer them all. So in hopes of making progress here, I'm taking wild guesses and asking for help in making the wild guesses better :)

On Mon, Nov 13, 2017 at 5:28 PM, teor <teor2...@gmail.com> wrote:
> On 14 Nov 2017, at 05:51, Nick Mathewson <ni...@torproject.org> wrote:
>
> Filename: 285-utf-8.txt
> Title: Directory documents should be standardized as UTF-8
> Author: Nick Mathewson
> Created: 13 November 2017
> Status: Open
>
> 1. Summary and motivation
>
> People frequently want to include non-ASCII text in their router
> descriptors. The Contact line is a favorite place to do this, but in
> principle the platform line would also be pretty logical.
>
> Unfortunately, there's no specified way to encode non-ASCII in our
> directory documents.
>
> Fortunately, almost everybody who does it, uses UTF-8 anyway.
>
>
> How many current descriptors will be rejected as non-UTF-8?

I think that when last I checked, the number was something like 3.

> As we move towards Rust support in Tor, we gain another motivation
> for standardizing on UTF-8, since Rust's native strings strongly prefer
> UTF-8.
>
> So, in this proposal, we describe a migration path to having all
> directory documents be fully UTF-8.
>
> 2. Proposal
>
> First, we should have Tor relays reject ContactInfo lines (and any
> other lines copied directly into router descriptors) that are not
> UTF-8.
>
>
> How do we define UTF-8?

I tried to do so as follows:

   We define the allowable set of UTF-8 as:
     * Encoding the codepoints U+0001 through U+10FFFF,
     * but excluding the codepoints U+D800 through U+DFFF,
     * each encoded with the shortest possible encoding,
     * without any BOM.

Are there other restrictions we should make? If so, how should we
phrase them?

[...]

> How do we carry forward existing ASCII restrictions into UTF-8?

I don't understand this question.

> We will need to update the directory spec to acknowledge that
> contact and platform lines may be parsed as UTF-8 or
> ASCII-including-arbitrary-bytes-except-NUL, and that they are
> terminated by single-byte newlines regardless.

Ack.

> How do we deal with format confusion attacks?
>
> UTF-8 has a few alternative whitespace characters. These could
> be used in an attack that confuses either humans viewing the file,
> or automated software:
>
> If a human uses a UTF-8 compatible viewer or editor, it likely shows
> Unicode newlines and ASCII newlines in an identical way. Similarly,
> it may show Unicode spaces and ASCII spaces in the same way.
> This may confuse the human reader.

Right. I don't see an obvious attack here, but we should keep it in
mind. Do you have a different suggestion of what to do here?

> Similarly, if automated software parses using a Unicode whitespace
> or newline character class, it will mis-parse directory documents.
> (Our Rust protover code looks for ASCII spaces, so it appears to
> be fine.)
>
> Note that we already have this issue with line feeds and carriage
> returns, which I thought we had solved by banning carriage returns
> in directory documents. But it appears we allow "any printing ASCII
> character". (We will have to edit this to include Unicode.)

Also let's consider all the nonprinting ASCII: it's already a potential
display problem if you're using a bad editor, or whatever.

> https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n218
>
> At the same time, we should have authorities reject any router
> descriptors or extrainfo documents that are not valid UTF-8.
> Simultaneously, we can have all Tor instances reject all
> non-directory-descriptor directory documents that are not UTF-8,
> since none should exist today.
>
>
> If we apply the existing restrictions in dir-spec, which require
> non-directory-descriptor directory documents to be ASCII, they will
> also be UTF-8.
>
> Isn't it confusing to say "UTF-8", when what we really mean is "ASCII"?
> Do we expect to migrate these to non-ASCII UTF-8 at some point?

I think having non-ASCII in extrainfos is a reasonable possibility.
I'm not so sure about the others: there could be reasons in the future.
My rationale for declaring everything to be UTF-8 was that it seemed
more reasonable to have a single set of rules for parsing everything
than to have different rules for different documents.

> Also, does "non-directory-descriptor directory documents" mean we
> can reject non-UTF-8 microdescriptors? I think we should.

I think so.

> Does the NS consensus contain any lines that are copied verbatim from
> descriptors?

I don't think so.

[...]

> should be rejected entirely: "reject-encrypted-non-utf-8". If that
> parameter is set to 1, then hidden service clients will not only
> warn, but reject the descriptors.
>
> Once the vast majority of clients are running versions that support
> the "reject-encrypted-non-utf-8" parameter, that parameter can be set
> to 1.
>
>
> We also can't reject bridge descriptors at the authority level.
> (Bridge clients download bridge descriptors directly from bridges.)
> Do we need bridge clients to also use this consensus parameter?

I added an extra section for this, basically saying "bridge clients
should do that too":

   2.2. Bridge descriptors

   Since clients download bridge descriptors directly from the bridges,
   they also need a two-phase plan as for hidden service descriptors above.
   Here we take the same approach as in section 2.1 above, except using the
   parameter "reject-bridge-descriptor-non-utf-8".
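
For illustration only, here is a rough Python sketch of the "allowable
set of UTF-8" definition from earlier in this message, plus a helper
that flags the non-ASCII whitespace discussed under format confusion.
This is not code from Tor or from the proposal: the function names are
hypothetical, and reading "without any BOM" as "no U+FEFF at the start
of the input" is an assumption.

```python
def is_allowed_utf8(data: bytes) -> bool:
    """Return True if data fits the four-bullet UTF-8 definition."""
    try:
        # Python's strict UTF-8 decoder rejects overlong (non-shortest)
        # encodings, encoded surrogates U+D800..U+DFFF, and any value
        # above U+10FFFF.
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # U+0000 lies outside the allowed range U+0001..U+10FFFF.
    if "\x00" in text:
        return False
    # Assumption: "without any BOM" means no leading U+FEFF.
    if text.startswith("\ufeff"):
        return False
    return True


def non_ascii_whitespace(text: str) -> list:
    """Return (index, "U+XXXX") pairs for whitespace outside ASCII."""
    return [
        (i, f"U+{ord(c):04X}")
        for i, c in enumerate(text)
        if c.isspace() and ord(c) > 0x7F
    ]
```

A relay-side check might call is_allowed_utf8() on the raw bytes of a
ContactInfo line before building a descriptor, and a linting tool could
use non_ascii_whitespace() on the decoded text to surface characters
that a viewer may render like an ASCII space or newline.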