On Sun, Oct 5, 2014 at 7:26 PM, Simon Sapin <simon.sa...@exyr.org> wrote:
> JavaScript strings, however, can. (They are effectively potentially
> ill-formed UTF-16.) It’s possible (?) that the Web depends on these
> surrogates being preserved.

It's clear that JS programs depend on being able to hold unpaired
surrogates and then to be able to pair them later. (In WTF-8 terms,
that means concatenation has to be able to merge a trailing lead
surrogate with a leading trail surrogate; see the sketch below.)
However, is there evidence of the Web depending on the preservation
of unpaired surrogates outside the JS engine?
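
Here is a minimal Rust sketch of that merging, assuming both inputs
are well-formed WTF-8; the function name is mine, not from the spec:

fn wtf8_concat(a: &[u8], b: &[u8]) -> Vec<u8> {
    // In WTF-8, a lead surrogate (U+D800..U+DBFF) is encoded as
    // ED A0..AF 80..BF and a trail surrogate (U+DC00..U+DFFF) as
    // ED B0..BF 80..BF. If `a` ends with a lead surrogate and `b`
    // starts with a trail surrogate, the pair must be merged into
    // one 4-byte sequence for a supplementary code point.
    let n = a.len();
    if n >= 3
        && b.len() >= 3
        && a[n - 3] == 0xED && (0xA0..=0xAF).contains(&a[n - 2])
        && b[0] == 0xED && (0xB0..=0xBF).contains(&b[1])
    {
        // Decode both 3-byte sequences to surrogate code points.
        let lead = 0xD000 | ((a[n - 2] as u32 & 0x3F) << 6) | (a[n - 1] as u32 & 0x3F);
        let trail = 0xD000 | ((b[1] as u32 & 0x3F) << 6) | (b[2] as u32 & 0x3F);
        let cp = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
        let mut out = a[..n - 3].to_vec();
        // Encode the supplementary code point as four bytes.
        out.extend_from_slice(&[
            0xF0 | (cp >> 18) as u8,
            0x80 | ((cp >> 12) & 0x3F) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ]);
        out.extend_from_slice(&b[3..]);
        out
    } else {
        let mut out = a.to_vec();
        out.extend_from_slice(b);
        out
    }
}

// E.g., lone U+D83D (ED A0 BD) concatenated with lone U+DE00
// (ED B8 80) yields U+1F600 (F0 9F 98 80), not two 3-byte
// surrogate sequences, which would not be well-formed WTF-8.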

> * Specification: https://simonsapin.github.io/wtf-8/

Looks great, except for the definition of "code unit". I think it's a
bad idea to redefine "code unit"; I suggest minting a different
defined term, e.g. "16-bit code unit", instead.

Also, even though it's pretty obvious, it might be worthwhile to brag
in an informative note that "To convert lossily from WTF-8 to UTF-8,
replace any surrogate byte sequence with the sequence of three bytes
<0xEF, 0xBF, 0xBD>, the UTF-8 encoding of the replacement character."
means that you can do it *in place*, since a surrogate byte sequence
is always three bytes long and the UTF-8 representation of the
REPLACEMENT CHARACTER is also three bytes long. (A sketch of such a
pass is appended at the end of this message.)

> * document.write() converts its argument to WTF-8

Did you instrument Gecko to see whether supporting the halves of a
surrogate pair falling into different document.write() calls is
actually needed for Web compat? (Of course, if you believe that
supporting the halves of a surrogate pair falling into different
[adjacent] text nodes in the DOM is required for Web compat, you
might as well make the parser operate on WTF-8.)

> In the future, if the JS team thinks it’s a good idea (and figures
> something out for .charAt() and friends), SpiderMonkey could
> support WTF-8 internally for JS strings and Servo’s bindings could
> remove the conversion.

For me, absent evidence, it's much easier to believe that using WTF-8
instead of potentially ill-formed UTF-16 would be a win for the JS
engine than to believe that using WTF-8 instead of UTF-8 in the DOM
would be a win. Did anyone instrument SpiderMonkey in the Gecko case
to see whether performance-sensitive repetitive random-access
charAt() actually occurs on the real Web? (If perf-sensitive
charAt() occurs with sequential indices in practice, it should be
possible to optimize charAt() on WTF-8 backing storage to be O(1) in
that case even if it is O(N) in the general case; see the second
sketch appended below.)

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
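
Appendix 1: the in-place lossy conversion mentioned above, as a
minimal Rust sketch. The function name is mine, and the code assumes
well-formed WTF-8 input rather than validating it:

fn wtf8_to_utf8_lossy_in_place(bytes: &mut [u8]) {
    let mut i = 0;
    while i < bytes.len() {
        // In well-formed WTF-8, 0xED occurs only as a lead byte, and
        // a following byte in 0xA0..=0xBF marks a 3-byte surrogate
        // sequence (plain UTF-8 allows only 0x80..=0x9F after 0xED).
        if bytes[i] == 0xED
            && i + 2 < bytes.len()
            && (0xA0..=0xBF).contains(&bytes[i + 1])
        {
            // Overwrite the surrogate with U+FFFD, also three bytes.
            bytes[i] = 0xEF;
            bytes[i + 1] = 0xBF;
            bytes[i + 2] = 0xBD;
            i += 3;
        } else {
            i += 1;
        }
    }
}

Since every rewrite is three bytes over three bytes, the buffer
length never changes, which is the whole point of the note.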
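
Appendix 2: the sequential-charAt() idea, as a one-entry
(UTF-16 index, byte offset) cache over WTF-8 backing storage. The
type and method names are hypothetical (this is not SpiderMonkey's
actual string representation), the bytes are assumed to be
well-formed WTF-8, and bounds checking of the index is omitted:

use std::cell::Cell;

struct Wtf8Str {
    bytes: Vec<u8>, // well-formed WTF-8
    // (UTF-16 code unit index, byte offset) of the last code point
    // accessed, so a sequential scan resumes instead of restarting.
    // Construct with e.g. Wtf8Str { bytes, cache: Cell::new((0, 0)) }.
    cache: Cell<(usize, usize)>,
}

impl Wtf8Str {
    // Like JS charCodeAt(): returns the UTF-16 code unit at `index`.
    // Amortized O(1) for sequential indices, O(N) for a cold random
    // access.
    fn code_unit_at(&self, index: usize) -> u16 {
        let (mut u16_pos, mut byte_pos) = self.cache.get();
        if index < u16_pos {
            // Going backwards: restart from the beginning. (A real
            // implementation could scan backwards instead.)
            u16_pos = 0;
            byte_pos = 0;
        }
        loop {
            let s = &self.bytes[byte_pos..];
            let len = match s[0] {
                0x00..=0x7F => 1,
                0xC0..=0xDF => 2,
                0xE0..=0xEF => 3,
                _ => 4, // always a lead byte, per well-formedness
            };
            // A 4-byte sequence is a supplementary code point and
            // spans two UTF-16 code units; everything else, lone
            // surrogates included, spans one.
            let units = if len == 4 { 2 } else { 1 };
            if index < u16_pos + units {
                self.cache.set((u16_pos, byte_pos));
                return match len {
                    1 => s[0] as u16,
                    2 => ((s[0] as u16 & 0x1F) << 6) | (s[1] as u16 & 0x3F),
                    3 => ((s[0] as u16 & 0x0F) << 12)
                        | ((s[1] as u16 & 0x3F) << 6)
                        | (s[2] as u16 & 0x3F),
                    _ => {
                        // Re-encode the supplementary code point as
                        // a surrogate pair on the fly.
                        let cp = ((s[0] as u32 & 0x07) << 18)
                            | ((s[1] as u32 & 0x3F) << 12)
                            | ((s[2] as u32 & 0x3F) << 6)
                            | (s[3] as u32 & 0x3F);
                        if index == u16_pos {
                            0xD800 + ((cp - 0x10000) >> 10) as u16
                        } else {
                            0xDC00 + ((cp - 0x10000) & 0x3FF) as u16
                        }
                    }
                };
            }
            u16_pos += units;
            byte_pos += len;
        }
    }
}

In the sequential case each call advances the cache by at most one
code point, so the scan loop does O(1) work per call on average.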