encoding_rs landed, delivering correctness, safety, performance, and code size benefits, as well as new functionality. Here's a summary of the need-to-know stuff from the perspective of using it.
The docs for the Rust-visible API are at https://docs.rs/encoding_rs/ and the docs for the C++-visible API are at https://searchfox.org/mozilla-central/source/intl/Encoding.h#100. The docs for the Rust-visible API also explain some design decisions and how the API maps to the concepts of the Encoding Standard.

* We now have the capability of decoding external text directly into UTF-8 and encoding text directly from UTF-8. This is a genuine direct-to-UTF-8 capability that does not pivot through UTF-16 buffers. If you're writing new code that takes textual input from external sources, please make your code operate on UTF-8 internally instead of UTF-16. This way, the common decode case becomes mere validation, and the parser-sensitive syntax (ASCII in Web formats) takes half the space.

* nsIUnicodeDecoder and nsIUnicodeEncoder no longer exist and have been replaced with mozilla::Decoder and mozilla::Encoder, respectively. (encoding_rs::Decoder and encoding_rs::Encoder in Rust.)

* The above two types only need to be used for streaming conversions. You no longer need to implement non-streaming conversions yourself on top of the streaming converters. Instead, mozilla::Encoding (C++; both nsAString and nsACString overloads, for UTF-16 and UTF-8 respectively) and encoding_rs::Encoding (Rust; UTF-8 only) provide methods for non-streaming conversions, and these methods take care of avoiding copies when possible. (If you need to work with XPCOM strings from Rust, there are functions in the encoding_glue crate. They haven't been grouped into a trait but could be if deemed necessary/useful.)

* There is now a type-safe representation for the concept of an encoding: const mozilla::Encoding* in C++ and &'static encoding_rs::Encoding in Rust.

  - The two are toll-free bridged: they are the same thing. That is, when crossing the FFI, write const mozilla::Encoding* on the C++ side and *const encoding_rs::Encoding on the Rust side.
  - The referents are statically allocated, so there's no need to refcount, and using the plain pointers really is OK in C++.

  - Given that we now have a type-safe representation for the concept of an encoding, where possible, please use const mozilla::Encoding* mEncoding instead of nsCString mCharset to represent an encoding in new code. mozilla::Encoding::ForName() and mozilla::Encoding::Name() provide interop between the old and new ways.

  - For each encoding, there's a type-safe constant for referring to it. To refer to UTF-8 from C++, use UTF_8_ENCODING. From Rust, use encoding_rs::UTF_8.

* The new API provides the full set of options for handling the BOM correctly upon decode. Please pick the right one of the three options: the default (BOM sniffing, with the decoder potentially morphing into a decoder for the BOM-indicated encoding), BOM removal (no decoder morphing), and no BOM handling (the BOM is handled like any other input bytes).

* The new API handles the end of the stream correctly. Please actually let the decoder know about the end of the stream when using streaming decoding.

* You no longer need to implement replacement of unmappable characters yourself. The decoders generate REPLACEMENT CHARACTERs for you by default, and the encoders generate HTML numeric character references for you by default. These are the only modes that exist in the Web Platform, so other replacements are not supported (though it's possible to implement other replacement on top of the API entry points that do not perform replacement). The API lets you know if there were any of these replacements, so you can e.g. whine to the console without having to take over implementing the replacement yourself just because you want to know whether any occurred.
* For old-style, type-unsafe use of an encoding name in an nsACString to represent the concept of an encoding, the set of canonical names is now exactly the set of names from the WHATWG Encoding Standard. This means that:

  - ISO-8859-1 is no longer a Gecko-canonical name. Use windows-1252 instead. (I forgot to fix the remaining instances; the follow-up patch is in https://bugzilla.mozilla.org/show_bug.cgi?id=1372994.)

  - gbk is no longer a Gecko-canonical name. The new canonical name is GBK.

  - UTF-16 is no longer a Gecko-canonical name. Use UTF-16LE instead.

* Encoding to UTF-16 (LE or BE) is no longer supported. That is, Gecko no longer has the capability of generating a _byte_ stream for _interchange_ in a UTF-16 encoding. (Decoding into _in-RAM_ UTF-16 as a stream of _16-bit units_ is, of course, supported.)

* The encoders and decoders have no Reset() method. If you need a converter to go back to its start state, just create a new one. It's cheap: creation doesn't perform any lookup table preparation or the like. From C++, to avoid calling malloc again, you can use mozilla::Encoding::NewDecoderInto() and its variants to recycle the old heap allocation.

* We don't have third-party crates in m-c that (unconditionally) require rust-encoding. However, if you need to import such a crate and it's infeasible to make it use encoding_rs directly, please do not vendor rust-encoding into the tree: that would bring in another set of lookup tables, which encoding_rs is specifically trying to avoid. I have a compatibility shim ready in case the need to vendor rust-encoding-dependent crates arises: https://github.com/hsivonen/encoding_rs_compat

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/

_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform