On 06/10/14 07:57, Henri Sivonen wrote:
> On Sun, Oct 5, 2014 at 7:26 PM, Simon Sapin <simon.sa...@exyr.org> wrote:
>> JavaScript strings, however, can. (They are effectively potentially
>> ill-formed UTF-16.) It’s possible (?) that the Web depends on these
>> surrogates being preserved.
> It's clear that JS programs depend on being able to hold unpaired
> surrogates and then to be able to later pair them. However, is there
> evidence of the Web depending on the preservation of unpaired
> surrogates outside the JS engine?
Is "outside the JS engine" anything other than DOM and CSSOM?
>> * Specification: https://simonsapin.github.io/wtf-8/
> Looks great, except for the definition of "code unit". I think it's a
> bad idea to redefine "code unit". I suggest minting a different
> defined term, e.g. "16-bit code unit", to avoid redefining "code
> unit".
> Also, even though it's pretty obvious, it might be worthwhile to brag
> in an informative note that "To convert lossily from WTF-8 to UTF-8,
> replace any surrogate byte sequence with the sequence of three bytes
> <0xEF, 0xBF, 0xBD>, the UTF-8 encoding of the replacement character."
> means that you can do it *in place*, since a surrogate byte sequence
> is always three bytes long and the UTF-8 representation of the
> REPLACEMENT CHARACTER is three bytes long.
Good points. I’ve done both of these as suggested.
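As an aside, the in-place conversion could be sketched like this (a hypothetical helper for illustration, not part of the spec; it assumes the input is already well-formed WTF-8):

```rust
/// Lossily convert well-formed WTF-8 to UTF-8 in place
/// (hypothetical helper, not from the spec).
fn wtf8_to_utf8_lossy_in_place(bytes: &mut [u8]) {
    let mut i = 0;
    while i < bytes.len() {
        // In WTF-8, a surrogate U+D800..U+DFFF encodes as
        // <0xED, 0xA0..0xBF, 0x80..0xBF>; in well-formed UTF-8
        // the byte after 0xED is at most 0x9F.
        if bytes[i] == 0xED && i + 2 < bytes.len() && bytes[i + 1] >= 0xA0 {
            // Overwrite with <0xEF, 0xBF, 0xBD>, the UTF-8 encoding
            // of U+FFFD REPLACEMENT CHARACTER -- also three bytes.
            bytes[i] = 0xEF;
            bytes[i + 1] = 0xBF;
            bytes[i + 2] = 0xBD;
            i += 3;
        } else {
            // Otherwise skip the whole code point.
            i += match bytes[i] {
                0x00..=0x7F => 1,
                0xC0..=0xDF => 2,
                0xE0..=0xEF => 3,
                _ => 4,
            };
        }
    }
}
```

Since both sequences are exactly three bytes, no bytes need to shift and the buffer length never changes.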
>> * document.write() converts its argument to WTF-8
> Did you instrument Gecko to see if supporting the halves of a
> surrogate pair falling into different document.write() calls is
> actually needed for Web compat?
I have not. Can our telemetry infrastructure support this kind of thing?
FWIW, it is supported in Gecko, WebKit, Blink, and Trident. (Presto
shows two "missing glyph" rectangles.)
document.write('"\uD83D');
document.write('\uDCA9"');
> (Of course, if you believe that
> supporting the halves of a surrogate pair falling into different
> [adjacent] text nodes in the DOM is required for Web compat, you might
> as well make the parser operate on wtf-8.)
I’m guessing that is not necessary, since WebKit, Blink, and Presto
don’t support it. (Gecko and Trident do.)
document.body.appendChild(document.createTextNode('"\uD83D'));
document.body.appendChild(document.createTextNode('\uDCA9"'));
For text rendering, WebKit and Blink appear to ignore everything from
the first unpaired surrogate to the end of the text node. (The data is
still in the DOM.) Presto shows two rectangles.
>> In the future, if the JS team thinks it’s a good idea (and figures something
>> out for .charAt() and friends), SpiderMonkey could support WTF-8 internally
>> for JS strings and Servo’s bindings could remove the conversion.
> For me, absent evidence, it's much easier to believe that using WTF-8
> instead of potentially ill-formed UTF-16 would be a win for the JS
> engine than to believe that using WTF-8 instead of UTF-8 in the DOM
> would be a win.
So you’re suggesting Servo could get away with UTF-8 in the DOM? I
hadn’t considered it. I withdraw my proposal at the start of this
thread; I’d like us to try this instead.
Servo would not need to use WTF-8 at all, and could just keep a single
16-bit code unit around for when the document.write() input ends with a
lead surrogate, as Keegan suggested here:
https://github.com/kmcallister/html5ever/issues/6
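That buffering could look something like the following sketch (the type and method names are hypothetical, not from html5ever): hold back a trailing lead surrogate and prepend it to the next call’s input, so a pair split across two document.write() calls recombines.

```rust
/// Hypothetical sketch: buffer a trailing lead surrogate between
/// document.write() calls so that split surrogate pairs recombine.
struct WriteBuffer {
    pending_lead: Option<u16>,
}

impl WriteBuffer {
    fn new() -> Self {
        WriteBuffer { pending_lead: None }
    }

    /// Return the code units ready to feed to the parser, holding
    /// back a trailing lead surrogate for the next call.
    fn take_units(&mut self, input: &[u16]) -> Vec<u16> {
        let mut units = Vec::with_capacity(input.len() + 1);
        // Prepend the lead surrogate left over from the previous call.
        if let Some(lead) = self.pending_lead.take() {
            units.push(lead);
        }
        units.extend_from_slice(input);
        // A lead surrogate is in the range U+D800..U+DBFF.
        if let Some(&last) = units.last() {
            if (0xD800..0xDC00).contains(&last) {
                self.pending_lead = Some(last);
                units.pop();
            }
        }
        units
    }
}
```

With the document.write() example above, the first call would yield only `"` and the second call would yield the recombined pair followed by `"`.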
> Did anyone instrument SpiderMonkey in the Gecko case to see if
> performance-sensitive repetitive random-access charAt() actually
> occurs on the real Web? (If perf-sensitive charAt() occurs with
> sequential indices in practice, it should be possible to optimize
> charAt() on WTF-8 backing storage to be O(1) in that case even if it
> was O(N) in the general case.)
Sounds interesting, but I’ll leave this to people who actually work on
SpiderMonkey.
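For anyone curious, the optimization Henri describes could be sketched like this (entirely hypothetical names; for simplicity this indexes by scalar value rather than by 16-bit code unit, which a real engine would need): cache the last (index, byte offset) pair so sequential lookups resume where the previous one stopped.

```rust
use std::cell::Cell;

/// Hypothetical sketch: indexed access over UTF-8 backing storage with
/// a cached cursor, making sequential access O(1) amortized. Indexes
/// by scalar value for simplicity, not by 16-bit code unit.
struct CachedStr<'a> {
    text: &'a str,
    cursor: Cell<(usize, usize)>, // (char index, byte offset)
}

impl<'a> CachedStr<'a> {
    fn new(text: &'a str) -> Self {
        CachedStr { text, cursor: Cell::new((0, 0)) }
    }

    fn char_at(&self, index: usize) -> Option<char> {
        let (mut i, mut off) = self.cursor.get();
        if index < i {
            // Random access behind the cursor: restart from the front
            // (the O(N) general case).
            i = 0;
            off = 0;
        }
        let mut iter = self.text[off..].chars();
        // Advance from the cached position to the requested index.
        while i < index {
            iter.next()?;
            i += 1;
        }
        // Remember where this index starts for the next call.
        off = self.text.len() - iter.as_str().len();
        self.cursor.set((i, off));
        iter.next()
    }
}
```

A loop calling `char_at(0)`, `char_at(1)`, … then does O(1) work per step, since each call resumes from the cached offset.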
--
Simon Sapin
_______________________________________________
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo