If JS can’t handle WTF-8 natively, then what’s the benefit of using it? I am opposed to anything that requires string copies between the DOM and JS, unless there’s some really great overriding reason.
Cameron

On Oct 5, 2014, at 9:26 AM, Simon Sapin <simon.sa...@exyr.org> wrote:

> We’ve discussed using UTF-8 internally for strings in Servo, but well-formed
> UTF-8 cannot represent surrogate code points.
>
> JavaScript strings, however, can. (They are effectively potentially
> ill-formed UTF-16.) It’s possible (?) that the Web depends on these
> surrogates being preserved.
>
> So instead of UTF-8, we can use something we’ll call Wobbly Transformation
> Format − 8-bit (WTF-8).
>
> * Specification: https://simonsapin.github.io/wtf-8/
> * Rust implementation, as a Cargo library:
>   https://github.com/SimonSapin/rust-wtf8
> * Library documentation:
>   https://simonsapin.github.io/rust-wtf8/wtf8/index.html
>
> It is a strict superset of UTF-8 (like UTF-8 is a strict superset of ASCII),
> so converting from well-formed UTF-8 is a no-op. It can losslessly represent
> all values JavaScript strings can (code points, including surrogates, as long
> as they’re not paired.) Concatenating needs care to behave like concatenating
> JS strings would. (Convert newly-paired surrogates into supplementary code
> points.)
>
>
> Proposal for Servo: use WTF-8 internally for all strings in the DOM and for
> HTML parsing.
>
> * rust-encoding decodes bytes from the network into well-formed UTF-8
> * document.write() converts its argument to WTF-8
> * The html5ever tokenizer takes buffers that are either UTF-8 or WTF-8
> * html5ever uses WTF-8 everywhere internally, and emits WTF-8 to the tree
>   builder.
> * (Optionally, html5ever could support a separate UTF-8 only interface for
>   non-Servo users that don’t need to support document.write().)
> * Servo’s DOM stores WTF-8.
> * Strings are converted to/from potentially ill-formed UTF-16 (that
>   SpiderMonkey can use) by the bindings code generation, at the boundary
>   between JavaScript and Rust.
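
For concreteness, here is a rough sketch of the two operations the proposal relies on: converting potentially ill-formed UTF-16 into WTF-8 at the bindings boundary, and concatenating two WTF-8 buffers so that a newly paired surrogate becomes one supplementary code point. The function names (push_code_point, wtf8_from_utf16, wtf8_concat) are made up for illustration and are not the rust-wtf8 API. The sketch assumes its inputs are already well-formed WTF-8 where that matters.

```rust
/// Append one code point (surrogates allowed) in generalized UTF-8 form.
/// Hypothetical helper, not part of any real library.
fn push_code_point(out: &mut Vec<u8>, cp: u32) {
    match cp {
        0..=0x7F => out.push(cp as u8),
        0x80..=0x7FF => {
            out.push((0xC0 | (cp >> 6)) as u8);
            out.push((0x80 | (cp & 0x3F)) as u8);
        }
        0x800..=0xFFFF => {
            // Lone surrogates (U+D800..U+DFFF) land here as three bytes.
            out.push((0xE0 | (cp >> 12)) as u8);
            out.push((0x80 | ((cp >> 6) & 0x3F)) as u8);
            out.push((0x80 | (cp & 0x3F)) as u8);
        }
        _ => {
            out.push((0xF0 | (cp >> 18)) as u8);
            out.push((0x80 | ((cp >> 12) & 0x3F)) as u8);
            out.push((0x80 | ((cp >> 6) & 0x3F)) as u8);
            out.push((0x80 | (cp & 0x3F)) as u8);
        }
    }
}

/// Convert potentially ill-formed UTF-16 (a JS string) to WTF-8.
/// Proper surrogate pairs become one 4-byte sequence; lone surrogates are kept.
fn wtf8_from_utf16(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::with_capacity(units.len());
    let mut i = 0;
    while i < units.len() {
        let u = units[i] as u32;
        if (0xD800..=0xDBFF).contains(&u) && i + 1 < units.len() {
            let next = units[i + 1] as u32;
            if (0xDC00..=0xDFFF).contains(&next) {
                let cp = 0x10000 + ((u - 0xD800) << 10) + (next - 0xDC00);
                push_code_point(&mut out, cp);
                i += 2;
                continue;
            }
        }
        push_code_point(&mut out, u);
        i += 1;
    }
    out
}

/// Concatenate two well-formed WTF-8 buffers the way JS would concatenate
/// the corresponding strings: a lead surrogate ending `a` and a trail
/// surrogate starting `b` are merged into one supplementary code point.
fn wtf8_concat(a: &[u8], b: &[u8]) -> Vec<u8> {
    let n = a.len();
    // Lead surrogates encode as ED A0..AF xx, trail surrogates as ED B0..BF xx.
    let ends_with_lead = n >= 3 && a[n - 3] == 0xED && (0xA0..=0xAF).contains(&a[n - 2]);
    let starts_with_trail = b.len() >= 3 && b[0] == 0xED && (0xB0..=0xBF).contains(&b[1]);
    if !(ends_with_lead && starts_with_trail) {
        return [a, b].concat();
    }
    let lead = 0xD000 | ((a[n - 2] as u32 & 0x3F) << 6) | (a[n - 1] as u32 & 0x3F);
    let trail = 0xD000 | ((b[1] as u32 & 0x3F) << 6) | (b[2] as u32 & 0x3F);
    let cp = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
    let mut out = a[..n - 3].to_vec();
    push_code_point(&mut out, cp);
    out.extend_from_slice(&b[3..]);
    out
}
```

Note that wtf8_from_utf16 allocates a new buffer and copies every code unit, which is exactly the DOM/JS string copy in question; the sketch only shows what the conversion involves, not a way to avoid it.
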
>
>
> In the future, if the JS team thinks it’s a good idea (and figures something
> out for .charAt() and friends), SpiderMonkey could support WTF-8 internally
> for JS strings and Servo’s bindings could remove the conversion.
>
>
> What do you think?
> --
> Simon Sapin
> _______________________________________________
> dev-servo mailing list
> dev-servo@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-servo