Re: [dev-servo] HTML parsing alternatives
On 06/03/14 02:05, Keegan McAllister wrote:

> Writing our own HTML5 parser would be a lot of work, but does not seem
> infeasible. The parsers I've found (including the translated C++ code for
> Gecko) are in the 10-20 KLoC range. We can do a one-time translation from
> Java for the most mechanical parts, without building a complete translator.

FWIW I would estimate that a from-scratch implementation of an HTML parser
that could replace Hubbub would be a "summer of code" sized project, i.e. I
would expect a reasonably new contributor to manage it in a couple of months
and an experienced contributor to manage it in much less than that. Indeed,
much of Hubbub itself was originally done as a GSoC project [1].

> There is a standard test suite [2] for static HTML5 parsers. Browsers have
> additional requirements due to speculation and document.write(), but it
> looks like [3] Gecko implements that outside the translated parser, so this
> is code we would have to write and test in any case.

Part of the difficulty of document.write comes from the fact that it has to
interact with script loading and the document lifecycle. It is therefore
going to be hard to get those parts of (any) parser right until we actually
implement a more correct model of document loading. Ideally the two things
would be designed concurrently so that there isn't an impedance mismatch
between the parser and the loading code.

> For the short term I will continue to work on the translator and see if we
> can get more clarity about some of these unknowns. But I'm also inclined to
> try implementing parts of a new HTML5 parser in Rust. At any rate we should
> pay close attention to Gecko's parser design, and I will continue reading
> through that code.

My suspicion is that it's possible to spend more time talking about the
various options than it would take to stand up a rough prototype parser (with
e.g. less important tokenizer/treebuilder states missing). Therefore I think
this sounds like a great idea.
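To make "a rough prototype with less important states missing" concrete, here is a minimal sketch of what such a starting point might look like. It is written in present-day Rust, not the Rust of this thread, and every name in it (`Token`, `State`, `tokenize`, `flush_text`) is made up for illustration. It implements only four tokenizer states and ignores attributes, comments, character references and almost all error handling, so it is a toy and not a conforming tokenizer.

```rust
/// Tokens emitted by this toy tokenizer (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum Token {
    Text(String),
    StartTag(String),
    EndTag(String),
}

/// Only four states; attributes, comments, entities, etc. are deliberately missing.
enum State {
    Data,
    TagOpen,
    EndTagOpen,
    TagName { is_end: bool },
}

/// Emit any pending character data as a single Text token.
fn flush_text(text: &mut String, tokens: &mut Vec<Token>) {
    if !text.is_empty() {
        tokens.push(Token::Text(std::mem::take(text)));
    }
}

fn tokenize(input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut state = State::Data;
    let mut text = String::new();
    let mut name = String::new();

    for c in input.chars() {
        state = match state {
            State::Data => {
                if c == '<' {
                    State::TagOpen
                } else {
                    text.push(c);
                    State::Data
                }
            }
            State::TagOpen => {
                if c == '/' {
                    State::EndTagOpen
                } else if c.is_ascii_alphabetic() {
                    flush_text(&mut text, &mut tokens);
                    name.push(c.to_ascii_lowercase());
                    State::TagName { is_end: false }
                } else if c == '<' {
                    // The previous '<' was plain text; this one may open a tag.
                    text.push('<');
                    State::TagOpen
                } else {
                    // Not a tag after all: keep both characters as text.
                    text.push('<');
                    text.push(c);
                    State::Data
                }
            }
            State::EndTagOpen => {
                if c.is_ascii_alphabetic() {
                    flush_text(&mut text, &mut tokens);
                    name.push(c.to_ascii_lowercase());
                    State::TagName { is_end: true }
                } else {
                    // Simplification: silently drop malformed end tags.
                    State::Data
                }
            }
            State::TagName { is_end } => {
                if c == '>' {
                    let tag = std::mem::take(&mut name);
                    tokens.push(if is_end { Token::EndTag(tag) } else { Token::StartTag(tag) });
                    State::Data
                } else {
                    // Simplification: everything up to '>' counts as the tag name.
                    name.push(c.to_ascii_lowercase());
                    State::TagName { is_end }
                }
            }
        };
    }
    // At end of input, a lone '<' is text; an unfinished tag is dropped.
    if let State::TagOpen = state {
        text.push('<');
    }
    flush_text(&mut text, &mut tokens);
    tokens
}
```

Calling `tokenize("<p>Hi</p>")` yields a start tag, a text token and an end tag. Filling in the missing states would then be incremental work against the test suite mentioned above.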
[1] http://www.netsurf-browser.org/developers/gsoc/
___
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo
Re: [dev-servo] HTML parsing alternatives
> I've been working on that recently and I have some doubts about this
> approach. Java and C++ share some features that Rust does not have.
> hsivonen and I have worked around some of these mismatches, but it's been a
> fair amount of effort already, and the translator is not that close to
> producing Rust code that will even compile.
...
> I think the biggest unknown is memory management.

Is this the only thing that's blocking compilation or is there something
else, too? I thought the control structure translation was already OK a year
ago. I hope I wasn't mistaken about the control structure part.

> It's likely that an exact
> copy of the C++ approach will upset the borrowchecker, requiring either
> unsafe code or a more sophisticated translator.
...
> The translator directly prints C++ or Rust code as it traverses the Java AST.
> This makes it hard to implement anything beyond a close mapping of
> individual syntax elements.

Yes, if e.g. Rust unique pointer usage doesn't fit the JS/C++/Java code
structure, it's not worthwhile to try to do machine translation. In that
case, I'd expect developing a parser from scratch (or translating just the
tokenizer control structure once and using that as a starting point) to be
better.

> Writing our own HTML5 parser would be a lot of work, but does not seem
> infeasible. The parsers I've found (including the translated C++ code for
> Gecko) are in the 10-20 KLoC range.

Yeah, if translation doesn't work out fairly quickly, and it looks like we
are beyond "fairly quickly" by now, writing directly in Rust makes sense.

> We can do a one-time translation from
> Java for the most mechanical parts, without building a complete translator.

Makes sense at least for the tokenizer.

> At any rate we should
> pay close attention to Gecko's parser design

I think the off-the-main-thread design is worth copying.
However, it's probably a good idea to design for a code path without the
off-the-main-thread (or I guess task in Rust) overhead for innerHTML. See
https://bugzilla.mozilla.org/show_bug.cgi?id=959150#c10

-- 
Henri Sivonen
hsivo...@mozilla.com
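One possible shape for "the same tree builder, with and without the off-thread hop" is sketched below in present-day Rust. All names (`TreeOp`, `TreeSink`, `DirectSink`, `QueuedSink`, `build_tree`) are hypothetical and are not taken from Gecko or Servo. The "DOM" is just a `Vec`, and the channel stands in for whatever mechanism would really carry tree operations between tasks. The point is only that the tree builder can be generic over where its operations go, so an innerHTML-style call can skip the queue.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// A tree operation emitted by the tree builder (hypothetical, heavily reduced).
#[derive(Debug, Clone, PartialEq)]
enum TreeOp {
    AppendElement(String),
    AppendText(String),
}

/// Where the tree builder sends its operations (hypothetical trait).
trait TreeSink {
    fn apply(&mut self, op: TreeOp);
}

/// Direct path: apply each operation immediately, e.g. for innerHTML.
struct DirectSink {
    dom: Vec<TreeOp>,
}

impl TreeSink for DirectSink {
    fn apply(&mut self, op: TreeOp) {
        self.dom.push(op);
    }
}

/// Off-thread path: forward operations over a channel to the DOM's owner.
struct QueuedSink {
    tx: Sender<TreeOp>,
}

impl TreeSink for QueuedSink {
    fn apply(&mut self, op: TreeOp) {
        // Send errors are ignored in this sketch.
        let _ = self.tx.send(op);
    }
}

/// Stand-in for the tree builder; it does not know which sink it is driving.
fn build_tree<S: TreeSink>(sink: &mut S) {
    sink.apply(TreeOp::AppendElement("p".to_string()));
    sink.apply(TreeOp::AppendText("hello".to_string()));
}

/// Parse on the calling thread with no queueing overhead.
fn parse_direct() -> Vec<TreeOp> {
    let mut sink = DirectSink { dom: Vec::new() };
    build_tree(&mut sink);
    sink.dom
}

/// Parse on a separate thread and apply the operations on the calling thread.
fn parse_off_thread() -> Vec<TreeOp> {
    let (tx, rx) = channel();
    let parser = thread::spawn(move || {
        let mut sink = QueuedSink { tx };
        build_tree(&mut sink);
        // The sender is dropped here, which ends the receiving loop below.
    });
    let dom: Vec<TreeOp> = rx.iter().collect();
    parser.join().unwrap();
    dom
}
```

Both `parse_direct()` and `parse_off_thread()` end up with the same sequence of operations; only the delivery path differs.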
Re: [dev-servo] HTML parsing alternatives
(replies inline to multiple messages)

>> I think the biggest unknown is memory management.
>
> Is this the only thing that's blocking compilation

Unfortunately it's not. Some other problems I ran into:

- The Java code has static data with non-constant initializers that depend
  on each other. For C++ we produce an initializeStatics() function for each
  class. We'd need to do the same for Rust. It's harder because Rust
  distinguishes between "static" and "static mut" and only the former can be
  used as patterns in "match".

- The Java code uses "null" in various places. So the translator needs to
  insert Option wrapping/unwrapping on object creation, field access, etc.

- Java and C++ name data members without a "self" or "this" prefix, but
  arguments and local variables can shadow them. I wrote a special case to
  handle the Java idiom "this.foo = foo" but it doesn't catch everything. I
  think we will need a real shadowing analysis, and I don't expect the Rust
  compiler to catch mistakes because the types will match.

- Java and C++ support function overloading by argument count and type; Rust
  doesn't. I have manually reassigned unique names to the Java functions, but
  we might need a name-mangling approach to handle constructors.

I'm confident that we can solve all of these problems with reasonable effort.
What I don't know is how many more we will discover, especially once we get
past syntax and name resolution errors.

> However, it's probably a good idea to design for a code path without the
> off-the-main-thread (or I guess task in Rust) overhead for innerHTML. See
> https://bugzilla.mozilla.org/show_bug.cgi?id=959150#c10

Perhaps we can put the "create tree op" methods in a trait, and have one
implementation which just applies them directly.

Do we need to do script-initiated parsing in the script task in all cases? I
was imagining that (similar to COW DOM) we could let the script continue
until it touches the DOM again. For document.write it seems impossible due
to e.g.
x = 2; document.write('