Thanks Henri, I think we can find a middle ground so as to avoid a ton of scope creep but leave the door open to a better iterative solution. See notes inline.
-e On Wed, May 24, 2017 at 11:18 PM, Henri Sivonen <hsivo...@hsivonen.fi> wrote: > On Wed, May 24, 2017 at 10:11 AM, Henri Sivonen <hsivo...@hsivonen.fi> > wrote: > >> Our current interface is UTF-16, so that's my target for now. I think > >> whatever cache-friendliness would be lost converting from UTF-16 -> > UTF-8 -> > >> UTF-16. > > > > I hope this can be reconsidered, because the assumption that it would > > have to be UTF-16 -> UTF-8 -> UTF-16 isn't accurate. > > I see that this part didn't get an on-list reply but got an blog reply: > http://www.erahm.org/2017/05/24/a-rust-based-xml-parser-for-firefox/ > Yes sorry, I should have followed up here as well! > I continue to think it's a bad idea to write another parser that uses > UTF-16 internally. Even though I can see your desire to keep the > project tightly scoped, I think it's fair to ask you to expand the > scope a bit by 1) adding a way to pass Latin-1 data to text nodes > directly (and use this when the the parser sees a text node is all > ASCII) and 2) replacing nsScanner with a bit of new buffering code > that takes bytes from the network and converts them to UTF-8 using > encoding_rs. > > We've both had the displeasure of modifying nsScanner as part of a > security fix. nsScanner isn't valuable code that we should try to > keep. It's no longer scanning for anything. It's just an > over-complicated way of maintaining a buffer of UTF-16 data. While > nsScanner and the associated classes are a lot of code, they do > something simple that should be done in quite a bit less code, so as > scope creep, replacing nsScanner should be a drop in a bucket > effort-wise compared to replacing expat. > > I think it's super-sad if we get another UTF-16-using parser because > replacing nsScanner was scoped out of the project. > Limiting to modifying nsScanner might be an option, but probably not changing all callers that use the nsAString interface. I guess we can just UTF-16 => UTF-8 those and file a bunch of follow ups? One thing we've ignored are all the consumers expect output to be UTF-16, so there's a fair amount of work on that side as well. Maybe a reasonable approach is to use a UTF-8 interface for the replacement Rust library and work on a staged rollout: 1. Start just converting UTF-16 => UTF-8 for input at the nsExpatDriver level, UTF-8 => UTF-16 for output 2. Modify/replace nsScanner with something that works with UTF-8 (and encoding_rs?), convert UTF-16 => UTF-8 for the nsAString methods 3. Follow up replacing nsAString methods with UTF-8 versions 4. Look into whether modifying the consumers of the tokenized data to handle UTF-8 is reasonable, follow up as necessary WDYT? > -- > Henri Sivonen > hsivo...@hsivonen.fi > https://hsivonen.fi/ > _______________________________________________ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform