Hi,

I have written a proposal to a) rewrite Gecko's encoding converters and b) do it in Rust:
https://docs.google.com/document/d/13GCbdvKi83a77ZcKOxaEteXp1SOGZ_9Fmztb9iX22v0/edit
I'd appreciate comments--especially from the owners of the uconv module and from people who have worked on encoding-related Rust code or on Rust code that needs encoding converters and is on track to be included in Gecko.

I've put the proposal on Google Docs in order to benefit from the GDoc commenting feature that allows comments from multiple reviewers to be attached to particular bits of text.

The document is rather long. The summary is:

I think we should rewrite Gecko's character encoding converters such that conversion to and from both in-memory UTF-16 and UTF-8 is supported, because

1) Currently, we can only convert to and from UTF-16, which steers us to write parsers that operate on UTF-16. This is bad: ideally, parsers would operate on UTF-8 so that they traverse a more compact memory representation (even HTML has plenty of ASCII markup; other formats we parse are even more ASCII-dominated). To make sure we don't write more UTF-16-based parsers in the future, we should have converters that can convert to and from UTF-8, too, but without paying the footprint cost of two independent sets of converters. (A toy sketch of what a dual-output decoder could look like is in the P.S. below.)

2) The footprint of Gecko is still a relevant concern in the Fennec case. (See, e.g., the complications arising from Gecko developers being blocked from including ICU [not its converters] in Gecko on Android.) Our current converters are bloated because they optimize the encode operation for legacy encodings for speed at the expense of lookup table size. We could make Gecko a bit smaller (i.e. make some room for good stuff on Android) by being smarter about encoding converter data tables--that is, by optimizing the relatively rare and performance-insensitive encode operation for legacy encodings for size instead of speed.

3) We should ensure the correctness of our converters and then stop tweaking them.

4) ...But our current converters are so unmaintainable that making these changes would be easiest to accomplish via a rewrite.

Furthermore, I think the rewrite should be in Rust, because

a) Now that we have Rust and are starting to include Rust code in Gecko, it doesn't make sense to write new C++ code when the component is isolated enough to be suited for being written in Rust.

b) Importing a separate UTF-8-oriented conversion library written in Rust for use by future Gecko components written in Rust (which would ideally use UTF-8 internally, since Rust strings are UTF-8) would be a footprint problem compared to a single conversion library designed for both UTF-16 and UTF-8 with the same data tables. (For example, the URL parser is being rewritten in Rust, and that URL parser depends on the rust-encoding library, which doesn't share data with our UTF-16-oriented C++-based converters.)

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
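P.S. To make the dual-output idea in point 1) a bit more concrete, here is a minimal toy sketch. All names and the four-entry table are invented for illustration; this is not the proposed API, just the general shape: one lookup table per legacy encoding, with both a UTF-8 and a UTF-16 output path reading from the same table.

  // Toy decode table for the non-ASCII bytes 0x80..=0x83 of a made-up
  // single-byte encoding; any other non-ASCII byte maps to U+FFFD.
  // A real converter would carry one such table per legacy encoding.
  const TABLE: [char; 4] = ['€', 'ä', 'ö', 'å'];

  fn byte_to_char(b: u8) -> char {
      if b < 0x80 {
          b as char
      } else {
          *TABLE.get((b - 0x80) as usize).unwrap_or(&'\u{FFFD}')
      }
  }

  // UTF-8 output path; shares byte_to_char and its table with the
  // UTF-16 path below.
  fn decode_to_utf8(src: &[u8]) -> String {
      src.iter().map(|&b| byte_to_char(b)).collect()
  }

  // UTF-16 output path, for callers (like today's Gecko parsers) that
  // want UTF-16. Same table, different output flavor.
  fn decode_to_utf16(src: &[u8]) -> Vec<u16> {
      let mut out = Vec::with_capacity(src.len());
      for &b in src {
          let mut buf = [0u16; 2];
          out.extend_from_slice(byte_to_char(b).encode_utf16(&mut buf));
      }
      out
  }

  fn main() {
      let input = b"5 \x80 is \x82k";
      println!("{}", decode_to_utf8(input));   // 5 € is ök
      println!("{:?}", decode_to_utf16(input));
  }

A real library would, of course, decode into caller-provided buffers to support streaming and would handle multi-byte encodings, but the sharing of the data table between the two output flavors is the part that matters for the footprint argument above.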