Re: [dev-servo] WTF-8 encoding for DOM strings and HTML parsing

Henri Sivonen Tue, 07 Oct 2014 06:58:10 -0700

On Mon, Oct 6, 2014 at 7:00 PM, Simon Sapin <simon.sa...@exyr.org> wrote:
> On 06/10/14 07:57, Henri Sivonen wrote:
>> On Sun, Oct 5, 2014 at 7:26 PM, Simon Sapin <simon.sa...@exyr.org> wrote:
>>> JavaScript strings, however, can. (They are effectively potentially
>>> ill-formed UTF-16.) It’s possible (?) that the Web depends on these
>>> surrogates being preserved.
>>
>> It's clear that JS programs depend on being able to hold unpaired
>> surrogates and then to be able to later pair them. However, is there
>> evidence of the Web depending on the preservation of unpaired
>> surrogates outside the JS engine?
>
> Is "outside the JS engine" anything other than DOM and CSSOM?


For practical purposes, probably not.

>>> * Specification: https://simonsapin.github.io/wtf-8/
>>
>>
>> Looks great, except for the definition of "code unit". I think it's a
>> bad idea to redefine "code unit". I suggest minting a different
>> defined term, e.g. "16-bit code unit", to avoid redefining "code
>> unit".
>>
>> Also, even though it's pretty obvious, it might be worthwhile to brag
>> in an informative note that "To convert lossily from WTF-8 to UTF-8,
>> replace any surrogate byte sequence with the sequence of three bytes
>> <0xEF, 0xBF, 0xBD>, the UTF-8 encoding of the replacement character."
>> means that you can do it *in place*, since a surrogate byte sequence
>> is always three bytes long and the UTF-8 representation of the
>> REPLACEMENT CHARACTER is three bytes long.
>
> Good points. I’ve done both of these as suggested.

Thanks.
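
The in-place property mentioned above can be sketched in Rust. This is only an illustration: the function name is hypothetical, and the input is assumed to be well-formed WTF-8 already (the code does not validate it). It relies on the fact that a surrogate byte sequence is 0xED followed by a byte in 0xA0..=0xBF and one more continuation byte.

```rust
/// Lossily convert well-formed WTF-8 to UTF-8 without reallocating.
/// Hypothetical helper; assumes `bytes` is well-formed WTF-8.
fn wtf8_to_utf8_lossy_in_place(bytes: &mut [u8]) {
    let mut i = 0;
    while i < bytes.len() {
        let lead = bytes[i];
        // Length of the sequence that starts at this lead byte.
        let len = if lead < 0x80 {
            1
        } else if lead < 0xE0 {
            2
        } else if lead < 0xF0 {
            3
        } else {
            4
        };
        // U+D800..=U+DFFF encode as ED A0..BF xx, always three bytes.
        if lead == 0xED && i + 2 < bytes.len() && bytes[i + 1] >= 0xA0 {
            // U+FFFD is also three bytes, so it fits in the same slot.
            bytes[i..i + 3].copy_from_slice(&[0xEF, 0xBF, 0xBD]);
        }
        i += len;
    }
}
```

Because the buffer length never changes, the same allocation can be reinterpreted as UTF-8 once the loop finishes.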

>>> * document.write() converts its argument to WTF-8
>>
>>
>> Did you instrument Gecko to see if supporting the halves of a
>> surrogate pair falling into different document.write() calls is
>> actually needed for Web compat?
>
>
> I have not. Can our telemetry infrastructure support this kind of thing?

Yes. At least it would be easy to count "number of sessions without an
unpaired surrogate at the start or end of a document.written string"
vs. "number of sessions with at least one unpaired surrogate at the
start or end of a document.written string".
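
As a rough sketch of the per-string check such a counter would need (in Rust rather than Gecko's actual code; all names here are hypothetical), the question is whether the first or last 16-bit code unit of the written string is a surrogate that has no partner inside the same string:

```rust
fn is_lead(u: u16) -> bool {
    (0xD800..=0xDBFF).contains(&u)
}

fn is_trail(u: u16) -> bool {
    (0xDC00..=0xDFFF).contains(&u)
}

/// True if the string starts or ends with an unpaired surrogate.
/// Hypothetical helper for illustration only.
fn has_unpaired_surrogate_at_edge(units: &[u16]) -> bool {
    let n = units.len();
    if n == 0 {
        return false;
    }
    let first = units[0];
    let last = units[n - 1];
    // At the start: a trail has nothing before it, and a lead is
    // unpaired unless the next unit is a trail.
    let start = is_trail(first) || (is_lead(first) && !(n > 1 && is_trail(units[1])));
    // At the end: a lead has nothing after it, and a trail is
    // unpaired unless the previous unit is a lead.
    let end = is_lead(last) || (is_trail(last) && !(n > 1 && is_lead(units[n - 2])));
    start || end
}
```

A session would then be counted in the second bucket as soon as this returns true for any document.written string.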

> FWIW, it is supported in Gecko, WebKit, Blink, and Trident.

When using potentially ill-formed UTF-16 in the JS engine, in the DOM
bindings and in the parser, that's the most obvious outcome if you do
nothing special, so this proves nothing about the Web requiring this.

> (Presto shows
> two "missing glyph" rectangles.)

That's encouraging!

>> (Of course, if you believe that
>> supporting the halves of a surrogate pair falling into different
>> [adjacent] text nodes in the DOM is required for Web compat, you might
>> as well make the parser operate on WTF-8.)
>
>
> I’m guessing that is not necessary, since WebKit, Blink, and Presto don’t
> support it. (Gecko and Trident do.)
>
>   document.body.appendChild(document.createTextNode('"\uD83D'));
>   document.body.appendChild(document.createTextNode('\uDCA9"'));
>
> WebKit and Blink appear to ignore for text rendering everything from the
> first unpaired surrogate until the end of the text node. (The data is still
> in the DOM.) Presto shows two rectangles.

Excellent!
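
To illustrate what the two createTextNode() calls quoted above would store if DOM strings were plain UTF-8, here is a small Rust sketch. It assumes a hypothetical binding that converts lossily from the JS engine's 16-bit code units using the standard library's String::from_utf16_lossy; nothing here is Servo's actual code.

```rust
/// Hypothetical binding step: JS code units in, UTF-8 text node data out.
/// Each unpaired surrogate becomes U+FFFD.
fn text_node_data_utf8(js_units: &[u16]) -> String {
    String::from_utf16_lossy(js_units)
}

/// The two halves from the example, converted separately and then joined.
fn split_pair_joined() -> String {
    let first = text_node_data_utf8(&[0x0022, 0xD83D]); // '"\uD83D'
    let second = text_node_data_utf8(&[0xDCA9, 0x0022]); // '\uDCA9"'
    format!("{}{}", first, second)
}

/// The same code units passed in a single string keep the pair intact.
fn whole_pair() -> String {
    text_node_data_utf8(&[0x0022, 0xD83D, 0xDCA9, 0x0022])
}
```

Under this assumption the split case ends up holding two replacement characters, while the unsplit string holds U+1F4A9. That is the information a UTF-8 DOM would give up relative to WTF-8.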

>>> In the future, if the JS team thinks it’s a good idea (and figures something
>>> out for .charAt() and friends), SpiderMonkey could support WTF-8 internally
>>> for JS strings and Servo’s bindings could remove the conversion.
>>
>>
>> For me, absent evidence, it's much easier to believe that using WTF-8
>> instead of potentially ill-formed UTF-16 would be a win for the JS
>> engine than to believe that using WTF-8 instead of UTF-8 in the DOM
>> would be a win.
>
> So you’re suggesting Servo could get away with UTF-8 in the DOM?

Yes.

> Servo would not need to use WTF-8 at all

Assuming "Servo" excludes the JS engine.

On Mon, Oct 6, 2014 at 8:27 PM, Cameron Zwarich <zwar...@mozilla.com> wrote:
>> So you’re suggesting Servo could get away with UTF-8 in the DOM? I hadn’t
>> considered it. I withdraw the proposal I made at the start of this thread;
>> I’d like us to try this instead.
>
> UTF-8 strings will mean that we have to copy all non-ASCII strings
> between the DOM and JS.

Not if JS stores strings as WTF-8. I think it would be tragic to pass
up the chance to make the JS engine use WTF-8 while we can still fix
things, and thereby miss the opportunity to use UTF-8 in the DOM in
Servo. UTF-16 is such a mistake.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
_______________________________________________
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo
