encoding_rs landed, delivering correctness, safety, performance, and code size benefits, as well as new functionality. Here's a summary of the need-to-know stuff from the perspective of using it.
The docs for the Rust-visible API are at https://docs.rs/encoding_rs/ and the docs for the C++-visible API are at https://searchfox.org/mozilla-central/source/intl/Encoding.h#100. The docs for the Rust-visible API also explain some design decisions and how the API maps to the concepts of the Encoding Standard.

* We now have the capability of decoding external text directly into UTF-8 and encoding text directly from UTF-8. This is a genuine direct-to-UTF-8 capability that does not pivot through UTF-16 buffers. If you're writing new code that takes textual input from external sources, please make your code operate on UTF-8 internally instead of UTF-16. This way, the common decode case becomes mere validation, and the parser-sensitive syntax (ASCII in Web formats) takes half the space.

* nsIUnicodeDecoder and nsIUnicodeEncoder no longer exist and have been replaced with mozilla::Decoder and mozilla::Encoder, respectively. (encoding_rs::Decoder and encoding_rs::Encoder in Rust.)

* The above two types only need to be used for streaming conversions. You no longer need to implement non-streaming conversions yourself on top of the streaming converters. Instead, mozilla::Encoding (C++; both nsAString and nsACString overloads, for UTF-16 and UTF-8 respectively) and encoding_rs::Encoding (Rust; UTF-8 only) provide methods for non-streaming conversions, and these methods take care of avoiding copies when possible. (If you need to work with XPCOM strings from Rust, there are functions in the encoding_glue crate. They haven't been grouped into a trait but could be if deemed necessary/useful.)

* There is now a type-safe representation for the concept of an encoding: const mozilla::Encoding* in C++ and &'static encoding_rs::Encoding in Rust.

  - The two are toll-free bridged: they are the same thing. That is, when crossing the FFI, write const mozilla::Encoding* on the C++ side and *const encoding_rs::Encoding on the Rust side.
  - The referents are statically allocated, so there's no need to refcount, and using the plain pointers really is OK in C++.

  - Given that we now have a type-safe representation for the concept of an encoding, where possible, please use const mozilla::Encoding* mEncoding instead of nsCString mCharset to represent an encoding in new code. mozilla::Encoding::ForName() and mozilla::Encoding::Name() provide interop between the old and new ways.

  - For each encoding, there's a type-safe constant for referring to it. To refer to UTF-8 from C++, use UTF_8_ENCODING. From Rust, use encoding_rs::UTF_8.

* The new API provides the full set of options for handling the BOM correctly upon decode. Please pick the right one of the three options: the default (BOM sniffing, with the decoder potentially morphing into a decoder for the BOM-indicated encoding), BOM removal (no decoder morphing), and no BOM handling (the BOM is handled like any other input bytes).

* The new API handles the end of the stream correctly. Please actually let the decoder know about the end of the stream when using streaming decoding.

* You no longer need to implement replacement of unmappable characters yourself. The decoders generate REPLACEMENT CHARACTERs for you by default, and the encoders generate HTML numeric character references for you by default. These are the only modes that exist in the Web Platform, so other replacements are not supported (though it's possible to implement other replacement on top of the API entry points that do not perform replacement). The API lets you know if there were any of these replacements, so you can e.g. whine to the console without having to take over implementing the replacement yourself just because you want to know whether any occurred.
* For old-style, type-unsafe use of an encoding name in an nsACString to represent the concept of an encoding, the set of canonical names is now exactly the set of names from the WHATWG Encoding Standard. This means that:

  - ISO-8859-1 is no longer a Gecko-canonical name. Use windows-1252 instead. (I forgot to fix the remaining instances; the follow-up patch is in https://bugzilla.mozilla.org/show_bug.cgi?id=1372994.)

  - gbk is no longer a Gecko-canonical name. The new canonical name is GBK.

  - UTF-16 is no longer a Gecko-canonical name. Use UTF-16LE instead.

* Encoding to UTF-16 (LE or BE) is no longer supported. That is, Gecko no longer has the capability of generating a _byte_ stream for _interchange_ in a UTF-16 encoding. (Decoding into _in-RAM_ UTF-16 as a stream of _16-bit units_ is, of course, supported.)

* The encoders and decoders have no Reset() method. If you need a converter to go back to its start state, just create a new one. It's cheap: creation doesn't perform any lookup table preparation or the like. From C++, to avoid calling malloc again, you can use mozilla::Encoding::NewDecoderInto() and its variants to recycle the old heap allocation.

* We don't have third-party crates in m-c that (unconditionally) require rust-encoding. However, if you need to import such a crate and it's infeasible to make it use encoding_rs directly, please do not vendor rust-encoding into the tree: that would bring in another set of lookup tables, which encoding_rs is specifically trying to avoid. I have a compatibility shim ready in case the need to vendor rust-encoding-dependent crates arises: https://github.com/hsivonen/encoding_rs_compat

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/

_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform