Thanks Henri, I think we can find a middle ground so as to avoid a ton of
scope creep but leave the door open to a better iterative solution. See
notes inline.

-e

On Wed, May 24, 2017 at 11:18 PM, Henri Sivonen <hsivo...@hsivonen.fi>
wrote:

> On Wed, May 24, 2017 at 10:11 AM, Henri Sivonen <hsivo...@hsivonen.fi>
> wrote:
> >> Our current interface is UTF-16, so that's my target for now. I think
> >> whatever cache-friendliness would be lost converting from UTF-16 ->
> UTF-8 ->
> >> UTF-16.
> >
> > I hope this can be reconsidered, because the assumption that it would
> > have to be UTF-16 -> UTF-8 -> UTF-16 isn't accurate.
>
> I see that this part didn't get an on-list reply but got an blog reply:
> http://www.erahm.org/2017/05/24/a-rust-based-xml-parser-for-firefox/
>

Yes sorry, I should have followed up here as well!


> I continue to think it's a bad idea to write another parser that uses
> UTF-16 internally. Even though I can see your desire to keep the
> project tightly scoped, I think it's fair to ask you to expand the
> scope a bit by 1) adding a way to pass Latin-1 data to text nodes
> directly (and use this when the the parser sees a text node is all
> ASCII) and 2) replacing nsScanner with a bit of new buffering code
> that takes bytes from the network and converts them to UTF-8 using
> encoding_rs.
>
> We've both had the displeasure of modifying nsScanner as part of a
> security fix. nsScanner isn't valuable code that we should try to
> keep. It's no longer scanning for anything. It's just an
> over-complicated way of maintaining a buffer of UTF-16 data. While
> nsScanner and the associated classes are a lot of code, they do
> something simple that should be done in quite a bit less code, so as
> scope creep, replacing nsScanner should be a drop in a bucket
> effort-wise compared to replacing expat.
>
> I think it's super-sad if we get another UTF-16-using parser because
> replacing nsScanner was scoped out of the project.
>

Limiting to modifying nsScanner might be an option, but probably not
changing all callers that use the nsAString interface. I guess we can just
UTF-16 => UTF-8 those and file a bunch of follow ups?

One thing we've ignored are all the consumers expect output to be UTF-16,
so there's a fair amount of work on that side as well.

Maybe a reasonable approach is to use a UTF-8 interface for the replacement
Rust library and work on a staged rollout:

   1. Start just converting UTF-16 => UTF-8 for input at the nsExpatDriver
   level, UTF-8 => UTF-16 for output
   2. Modify/replace nsScanner with something that works with UTF-8 (and
   encoding_rs?), convert UTF-16 => UTF-8 for the nsAString methods
   3. Follow up replacing nsAString methods with UTF-8 versions
   4.  Look into whether modifying the consumers of the tokenized data to
   handle UTF-8 is reasonable, follow up as necessary

WDYT?


> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/
>
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform

Reply via email to