Hi,

I have written a proposal to a) rewrite Gecko's encoding converters and b) do it in Rust:
https://docs.google.com/document/d/13GCbdvKi83a77ZcKOxaEteXp1SOGZ_9Fmztb9iX22v0/edit
I'd appreciate comments--especially from the owners of the uconv module and from people who have worked on encoding-related Rust code or on Rust code that needs encoding converters and is on track to be included in Gecko.

I've put the proposal on Google Docs in order to benefit from the GDoc commenting feature that allows comments from multiple reviewers to be attached to particular bits of text.

The document is rather long. The summary is:

I think we should rewrite Gecko's character encoding converters such that conversion to and from both in-memory UTF-16 and UTF-8 is supported, because

1) Currently, we can only convert to and from UTF-16, which steers us to write parsers that operate on UTF-16. This is bad: ideally, parsers would operate on UTF-8 so that they traverse a more compact memory representation (even HTML has plenty of ASCII markup; other formats we parse are even more ASCII-dominated). To make sure we don't write more UTF-16-based parsers in the future, we should have converters that can convert to and from UTF-8, too, but without paying the footprint cost of two independent sets of converters. (A toy sketch of what a dual-output decoder could look like is in the P.S. below.)

2) The footprint of Gecko is still a relevant concern in the Fennec case. (See, e.g., the complications arising from Gecko developers being blocked from including ICU [not its converters] in Gecko on Android.) Our current converters are bloated because they optimize the encode operation for legacy encodings for speed at the expense of lookup table size. We could make Gecko a bit smaller (i.e. make some room for good stuff on Android) by being smarter about encoding converter data tables--that is, by optimizing the relatively rare and performance-insensitive encode operation for legacy encodings for size instead of speed.

3) We should ensure the correctness of our converters and then stop tweaking them.

4) ...But our current converters are so unmaintainable that making these changes would be easiest to accomplish via a rewrite.

Furthermore, I think the rewrite should be in Rust, because

a) Now that we have Rust and are starting to include Rust code in Gecko, it doesn't make sense to write new C++ code when the component is isolated enough to be suited for being written in Rust.

b) Importing a separate UTF-8-oriented conversion library written in Rust for use by future Gecko components written in Rust (which would ideally use UTF-8 internally, since Rust strings are UTF-8) would be a footprint problem compared to a single conversion library designed for both UTF-16 and UTF-8 with the same data tables. (For example, the URL parser is being rewritten in Rust, and that URL parser depends on the rust-encoding library, which doesn't share data with our UTF-16-oriented C++-based converters.)

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
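P.S. To make the dual-output idea in point 1) a bit more concrete, here is a minimal toy sketch. All names and the four-entry table are invented for illustration; this is not the proposed API, just the general shape: one lookup table per legacy encoding, with both a UTF-8 and a UTF-16 output path reading from the same table.

  // Toy decode table for the non-ASCII bytes 0x80..=0x83 of a made-up
  // single-byte encoding; any other non-ASCII byte maps to U+FFFD.
  // A real converter would carry one such table per legacy encoding.
  const TABLE: [char; 4] = ['€', 'ä', 'ö', 'å'];

  fn byte_to_char(b: u8) -> char {
      if b < 0x80 {
          b as char
      } else {
          *TABLE.get((b - 0x80) as usize).unwrap_or(&'\u{FFFD}')
      }
  }

  // UTF-8 output path; shares byte_to_char and its table with the
  // UTF-16 path below.
  fn decode_to_utf8(src: &[u8]) -> String {
      src.iter().map(|&b| byte_to_char(b)).collect()
  }

  // UTF-16 output path, for callers (like today's Gecko parsers) that
  // want UTF-16. Same table, different output flavor.
  fn decode_to_utf16(src: &[u8]) -> Vec<u16> {
      let mut out = Vec::with_capacity(src.len());
      for &b in src {
          let mut buf = [0u16; 2];
          out.extend_from_slice(byte_to_char(b).encode_utf16(&mut buf));
      }
      out
  }

  fn main() {
      let input = b"5 \x80 is \x82k";
      println!("{}", decode_to_utf8(input));   // 5 € is ök
      println!("{:?}", decode_to_utf16(input));
  }

A real library would, of course, decode into caller-provided buffers to support streaming and would handle multi-byte encodings, but the sharing of the data table between the two output flavors is the part that matters for the footprint argument above.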