If JS can’t handle WTF-8 natively, then what’s the benefit of using it? I am 
opposed to anything that requires string copies between the DOM and JS, unless 
there’s some really great overriding reason.

Cameron

On Oct 5, 2014, at 9:26 AM, Simon Sapin <simon.sa...@exyr.org> wrote:

> We’ve discussed using UTF-8 internally for strings in Servo, but well-formed 
> UTF-8 cannot represent surrogate code points.
> 
> JavaScript strings, however, can. (They are effectively potentially 
> ill-formed UTF-16.) It’s possible (?) that the Web depends on these 
> surrogates being preserved.
> 
> So instead of UTF-8, we can use something we’ll call Wobbly Transformation 
> Format − 8-bit (WTF-8).
> 
> * Specification: https://simonsapin.github.io/wtf-8/
> * Rust implementation, as a Cargo library:
>  https://github.com/SimonSapin/rust-wtf8
> * Library documentation:
>  https://simonsapin.github.io/rust-wtf8/wtf8/index.html
> 
> It is a strict superset of UTF-8 (like UTF-8 is a strict superset of ASCII), 
> so converting from well-formed UTF-8 is a no-op. It can losslessly represent 
> every value a JavaScript string can hold (any sequence of code points, 
> including surrogates, as long as they’re not paired). Concatenation needs 
> care to behave like concatenating JS strings: a lead surrogate at the end of 
> one string and a trail surrogate at the start of the next must be converted 
> into a single supplementary code point.
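
To make the concatenation rule concrete, here is a minimal sketch operating on raw WTF-8 bytes. The names `surrogate_at` and `wtf8_concat` are hypothetical and are not the rust-wtf8 API; the sketch assumes both inputs are already well-formed WTF-8.

```rust
/// If `b` is exactly one 3-byte WTF-8 sequence encoding a surrogate
/// (U+D800..=U+DFFF), return that code point.
fn surrogate_at(b: &[u8]) -> Option<u32> {
    if b.len() == 3
        && b[0] == 0xED
        && (0xA0..=0xBF).contains(&b[1])
        && (0x80..=0xBF).contains(&b[2])
    {
        Some(0xD000 | ((b[1] as u32 & 0x3F) << 6) | (b[2] as u32 & 0x3F))
    } else {
        None
    }
}

/// Concatenate two well-formed WTF-8 byte strings (an assumption of this
/// sketch). A lead surrogate ending `left` and a trail surrogate starting
/// `right` are merged into one 4-byte supplementary code point.
pub fn wtf8_concat(left: &[u8], right: &[u8]) -> Vec<u8> {
    let lead = left
        .len()
        .checked_sub(3)
        .and_then(|i| surrogate_at(&left[i..]));
    let trail = right.get(..3).and_then(surrogate_at);

    match (lead, trail) {
        (Some(hi @ 0xD800..=0xDBFF), Some(lo @ 0xDC00..=0xDFFF)) => {
            let cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
            let c = char::from_u32(cp).expect("valid supplementary code point");
            let mut buf = [0u8; 4];
            // Drop the lead surrogate, append the merged code point,
            // then the rest of `right` after the trail surrogate.
            let mut out = left[..left.len() - 3].to_vec();
            out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            out.extend_from_slice(&right[3..]);
            out
        }
        // No newly formed pair: plain byte concatenation is correct.
        _ => [left, right].concat(),
    }
}
```

Checking only the last three bytes of `left` is enough here because 0xED is never a continuation byte, so it can only appear as the first byte of a sequence.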
> 
> 
> Proposal for Servo: use WTF-8 internally for all strings in the DOM and for 
> HTML parsing.
> 
> * rust-encoding decodes bytes from the network into well-formed UTF-8
> * document.write() converts its argument to WTF-8
> * The html5ever tokenizer takes buffers that are either UTF-8 or WTF-8
> * html5ever uses WTF-8 everywhere internally, and emits WTF-8 to the tree 
> builder.
> * (Optionally, html5ever could support a separate UTF-8 only interface for 
> non-Servo users that don’t need to support document.write().)
> * Servo’s DOM stores WTF-8.
> * Strings are converted to/from potentially ill-formed UTF-16 (that 
> SpiderMonkey can use) by the bindings code generation, at the boundary 
> between JavaScript and Rust.
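
As an illustration of the conversion at the JavaScript/Rust boundary, the following is a rough sketch of a lossless round trip between potentially ill-formed UTF-16 and WTF-8. The function names `utf16_to_wtf8` and `wtf8_to_utf16` are hypothetical; this is not what Servo's bindings or rust-wtf8 actually expose, and the decoder assumes its input is well-formed WTF-8 (it would panic on truncated input).

```rust
/// Encode potentially ill-formed UTF-16 as WTF-8. Paired surrogates become
/// one 4-byte sequence; unpaired surrogates become 3-byte sequences.
pub fn utf16_to_wtf8(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::with_capacity(units.len());
    for r in std::char::decode_utf16(units.iter().copied()) {
        match r {
            Ok(c) => {
                let mut buf = [0u8; 4];
                out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            }
            Err(e) => {
                // Unpaired surrogate: encode it with the generic 3-byte
                // pattern, which strict UTF-8 forbids for this range.
                let s = e.unpaired_surrogate();
                out.extend_from_slice(&[
                    0xED,
                    0x80 | ((s >> 6) & 0x3F) as u8,
                    0x80 | (s & 0x3F) as u8,
                ]);
            }
        }
    }
    out
}

/// Decode WTF-8 back into UTF-16 code units.
/// Assumes `bytes` is well-formed WTF-8; no validation is done.
pub fn wtf8_to_utf16(bytes: &[u8]) -> Vec<u16> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let b0 = bytes[i] as u32;
        // Sequence length and initial payload bits from the first byte.
        let (len, mut cp) = match b0 {
            0x00..=0x7F => (1, b0),
            0xC0..=0xDF => (2, b0 & 0x1F),
            0xE0..=0xEF => (3, b0 & 0x0F),
            _ => (4, b0 & 0x07),
        };
        for &b in &bytes[i + 1..i + len] {
            cp = (cp << 6) | (b as u32 & 0x3F);
        }
        i += len;
        if cp < 0x10000 {
            // BMP code point, possibly a lone surrogate: one code unit.
            out.push(cp as u16);
        } else {
            // Supplementary code point: split into a surrogate pair.
            let v = cp - 0x10000;
            out.push(0xD800 + (v >> 10) as u16);
            out.push(0xDC00 + (v & 0x3FF) as u16);
        }
    }
    out
}
```

Both directions allocate and copy, which is the cost the boundary conversion would incur on every string crossing between the DOM and JS.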
> 
> 
> In the future, if the JS team thinks it’s a good idea (and figures something 
> out for .charAt() and friends), SpiderMonkey could support WTF-8 internally 
> for JS strings and Servo’s bindings could remove the conversion.
> 
> 
> What do you think?
> -- 
> Simon Sapin
> _______________________________________________
> dev-servo mailing list
> dev-servo@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-servo
