On Sun, Oct 5, 2014 at 7:26 PM, Simon Sapin <simon.sa...@exyr.org> wrote:
> JavaScript strings, however, can. (They are effectively potentially
> ill-formed UTF-16.) It’s possible (?) that the Web depends on these
> surrogates being preserved.

It's clear that JS programs depend on being able to hold unpaired
surrogates and then to be able to pair them later. (In WTF-8 terms,
that means concatenation has to be able to merge a trailing lead
surrogate with a leading trail surrogate; see the sketch below.)
However, is there evidence of the Web depending on the preservation
of unpaired surrogates outside the JS engine?
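
Here is a minimal Rust sketch of that merging, assuming both inputs
are well-formed WTF-8; the function name is mine, not from the spec:

fn wtf8_concat(a: &[u8], b: &[u8]) -> Vec<u8> {
    // In WTF-8, a lead surrogate (U+D800..U+DBFF) is encoded as
    // ED A0..AF 80..BF and a trail surrogate (U+DC00..U+DFFF) as
    // ED B0..BF 80..BF. If `a` ends with a lead surrogate and `b`
    // starts with a trail surrogate, the pair must be merged into
    // one 4-byte sequence for a supplementary code point.
    let n = a.len();
    if n >= 3
        && b.len() >= 3
        && a[n - 3] == 0xED && (0xA0..=0xAF).contains(&a[n - 2])
        && b[0] == 0xED && (0xB0..=0xBF).contains(&b[1])
    {
        // Decode both 3-byte sequences to surrogate code points.
        let lead = 0xD000 | ((a[n - 2] as u32 & 0x3F) << 6) | (a[n - 1] as u32 & 0x3F);
        let trail = 0xD000 | ((b[1] as u32 & 0x3F) << 6) | (b[2] as u32 & 0x3F);
        let cp = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
        let mut out = a[..n - 3].to_vec();
        // Encode the supplementary code point as four bytes.
        out.extend_from_slice(&[
            0xF0 | (cp >> 18) as u8,
            0x80 | ((cp >> 12) & 0x3F) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ]);
        out.extend_from_slice(&b[3..]);
        out
    } else {
        let mut out = a.to_vec();
        out.extend_from_slice(b);
        out
    }
}

// E.g., lone U+D83D (ED A0 BD) concatenated with lone U+DE00
// (ED B8 80) yields U+1F600 (F0 9F 98 80), not two 3-byte
// surrogate sequences, which would not be well-formed WTF-8.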

> * Specification: https://simonsapin.github.io/wtf-8/

Looks great, except for the definition of "code unit". I think it's a
bad idea to redefine "code unit"; I suggest minting a different
defined term, e.g. "16-bit code unit", instead.

Also, even though it's pretty obvious, it might be worthwhile to brag
in an informative note that "To convert lossily from WTF-8 to UTF-8,
replace any surrogate byte sequence with the sequence of three bytes
<0xEF, 0xBF, 0xBD>, the UTF-8 encoding of the replacement character."
means that you can do it *in place*, since a surrogate byte sequence
is always three bytes long and the UTF-8 representation of the
REPLACEMENT CHARACTER is also three bytes long. (A sketch of such a
pass is appended at the end of this message.)

> * document.write() converts its argument to WTF-8

Did you instrument Gecko to see whether supporting the halves of a
surrogate pair falling into different document.write() calls is
actually needed for Web compat? (Of course, if you believe that
supporting the halves of a surrogate pair falling into different
[adjacent] text nodes in the DOM is required for Web compat, you
might as well make the parser operate on WTF-8.)

> In the future, if the JS team thinks it’s a good idea (and figures
> something out for .charAt() and friends), SpiderMonkey could
> support WTF-8 internally for JS strings and Servo’s bindings could
> remove the conversion.

For me, absent evidence, it's much easier to believe that using WTF-8
instead of potentially ill-formed UTF-16 would be a win for the JS
engine than to believe that using WTF-8 instead of UTF-8 in the DOM
would be a win. Did anyone instrument SpiderMonkey in the Gecko case
to see whether performance-sensitive repetitive random-access
charAt() actually occurs on the real Web? (If perf-sensitive
charAt() occurs with sequential indices in practice, it should be
possible to optimize charAt() on WTF-8 backing storage to be O(1) in
that case even if it is O(N) in the general case; see the second
sketch appended below.)

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
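
Appendix 1: the in-place lossy conversion mentioned above, as a
minimal Rust sketch. The function name is mine, and the code assumes
well-formed WTF-8 input rather than validating it:

fn wtf8_to_utf8_lossy_in_place(bytes: &mut [u8]) {
    let mut i = 0;
    while i < bytes.len() {
        // In well-formed WTF-8, 0xED occurs only as a lead byte, and
        // a following byte in 0xA0..=0xBF marks a 3-byte surrogate
        // sequence (plain UTF-8 allows only 0x80..=0x9F after 0xED).
        if bytes[i] == 0xED
            && i + 2 < bytes.len()
            && (0xA0..=0xBF).contains(&bytes[i + 1])
        {
            // Overwrite the surrogate with U+FFFD, also three bytes.
            bytes[i] = 0xEF;
            bytes[i + 1] = 0xBF;
            bytes[i + 2] = 0xBD;
            i += 3;
        } else {
            i += 1;
        }
    }
}

Since every rewrite is three bytes over three bytes, the buffer
length never changes, which is the whole point of the note.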
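
Appendix 2: the sequential-charAt() idea, as a one-entry
(UTF-16 index, byte offset) cache over WTF-8 backing storage. The
type and method names are hypothetical (this is not SpiderMonkey's
actual string representation), the bytes are assumed to be
well-formed WTF-8, and bounds checking of the index is omitted:

use std::cell::Cell;

struct Wtf8Str {
    bytes: Vec<u8>, // well-formed WTF-8
    // (UTF-16 code unit index, byte offset) of the last code point
    // accessed, so a sequential scan resumes instead of restarting.
    // Construct with e.g. Wtf8Str { bytes, cache: Cell::new((0, 0)) }.
    cache: Cell<(usize, usize)>,
}

impl Wtf8Str {
    // Like JS charCodeAt(): returns the UTF-16 code unit at `index`.
    // Amortized O(1) for sequential indices, O(N) for a cold random
    // access.
    fn code_unit_at(&self, index: usize) -> u16 {
        let (mut u16_pos, mut byte_pos) = self.cache.get();
        if index < u16_pos {
            // Going backwards: restart from the beginning. (A real
            // implementation could scan backwards instead.)
            u16_pos = 0;
            byte_pos = 0;
        }
        loop {
            let s = &self.bytes[byte_pos..];
            let len = match s[0] {
                0x00..=0x7F => 1,
                0xC0..=0xDF => 2,
                0xE0..=0xEF => 3,
                _ => 4, // always a lead byte, per well-formedness
            };
            // A 4-byte sequence is a supplementary code point and
            // spans two UTF-16 code units; everything else, lone
            // surrogates included, spans one.
            let units = if len == 4 { 2 } else { 1 };
            if index < u16_pos + units {
                self.cache.set((u16_pos, byte_pos));
                return match len {
                    1 => s[0] as u16,
                    2 => ((s[0] as u16 & 0x1F) << 6) | (s[1] as u16 & 0x3F),
                    3 => ((s[0] as u16 & 0x0F) << 12)
                        | ((s[1] as u16 & 0x3F) << 6)
                        | (s[2] as u16 & 0x3F),
                    _ => {
                        // Re-encode the supplementary code point as
                        // a surrogate pair on the fly.
                        let cp = ((s[0] as u32 & 0x07) << 18)
                            | ((s[1] as u32 & 0x3F) << 12)
                            | ((s[2] as u32 & 0x3F) << 6)
                            | (s[3] as u32 & 0x3F);
                        if index == u16_pos {
                            0xD800 + ((cp - 0x10000) >> 10) as u16
                        } else {
                            0xDC00 + ((cp - 0x10000) & 0x3FF) as u16
                        }
                    }
                };
            }
            u16_pos += units;
            byte_pos += len;
        }
    }
}

In the sequential case each call advances the cache by at most one
code point, so the scan loop does O(1) work per call on average.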