Thanks for the feedback so far! If I go with the clone route (to work on the snapshotted version of the data), how can I later associate the cloned nodes with the original nodes from the document? One way I thought of is to set userData on the DOM nodes and then use the clone handler callback to associate each cloned node with its original (through weak refs or a WeakMap). That would mean first iterating through all nodes to add the handlers, but that's probably fine (I don't need to analyze anything or visit text nodes).
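Something like the following is what I have in mind for the WeakMap variant (a rough, untested sketch; it sidesteps the userData clone handler entirely by walking the original and cloned trees in lockstep, and associateClones is just an illustrative name, not an existing API):

    // Build a WeakMap from cloned nodes back to the original nodes.
    // Relies on cloneNode(true) preserving tree structure, so the two
    // trees can be walked in parallel.
    function associateClones(originalRoot) {
      const cloneRoot = originalRoot.cloneNode(true);
      const cloneToOriginal = new WeakMap();

      (function walk(orig, clone) {
        cloneToOriginal.set(clone, orig);
        let o = orig.firstChild;
        let c = clone.firstChild;
        while (o && c) {
          walk(o, c);
          o = o.nextSibling;
          c = c.nextSibling;
        }
      })(originalRoot, cloneRoot);

      return { cloneRoot, cloneToOriginal };
    }

The cloned tree could then be handed off for analysis while cloneToOriginal lets me map the results back onto the live document.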
I think serializing and re-parsing everything in the worker is not the ideal solution unless we can find a way to also keep accurate associations with the original nodes from content. Anything that introduces a possibly lossy data aspect will probably hurt translation, which is already an inaccurate science.

On Tue, Mar 4, 2014 at 6:26 AM, Andrew Sutherland <[email protected]> wrote:

> On 03/04/2014 03:13 AM, Henri Sivonen wrote:
>> It saddens me that we are using non-compliant ad hoc parsers when we
>> already have two spec-compliant (at least at some point in time) ones.
>
> Interesting! I assume you are referring to:
> https://github.com/davidflanagan/html5/blob/master/html5parser.js
>
> Which seems to be (explicitly) derived from:
> https://github.com/aredridel/html5
>
> Which in turn seems to actually include a few parser variants.
>
> Per the discussion with you on
> https://groups.google.com/d/msg/mozilla.dev.webapi/wDFM_T9v7Tc/Nr9Df4FUwuwJ
> for the Gaia e-mail app we initially ended up using an in-page data
> document mechanism for sanitization. We later migrated to using a
> worker-based parser. There were some coordination hiccups with this
> migration (https://bugzil.la/814257) and some B2G time-pressure, so a
> comprehensive survey of HTML parsers did not happen so much.
>
> While we have a defense-in-depth strategy (CSP and iframe sandbox should
> be protecting us from the worst possible scenarios) and we're hopeful
> that Service Workers will eventually let us provide
> nsIContentPolicy-level protection, the quality of the HTML parser is of
> course fairly important[1] to the operation of the HTML sanitizer. If
> you'd like to bless a specific implementation for workers to perform
> streaming HTML parsing or some other explicit strategy, I'd be happy to
> file a bug for us to go in that direction. Because we are using a
> white-list based mechanism and are fairly limited and arguably fairly
> luddite in what we whitelist, it's my hope that our errors are on the
> side of safety (and breaking adventurous HTML email :), but that is
> indeed largely hope. Your input is definitely appreciated, especially as
> it relates to prioritizing such enhancements and potential risk from our
> current strategy.
>
> Andrew
>
> 1: understatement

