date:20140306

Re: [dev-servo] HTML parsing alternatives

2014-03-06 Thread James Graham


On 06/03/14 02:05, Keegan McAllister wrote:


> Writing our own HTML5 parser would be a lot of work, but does not
> seem infeasible.  The parsers I've found (including the translated
> C++ code for Gecko) are in the 10-20 KLoC range.  We can do a
> one-time translation from Java for the most mechanical parts, without
> building a complete translator.


FWIW I would estimate that a from-scratch implementation of an HTML 
parser that could replace Hubbub would be a "summer of code" sized 
project, i.e. I would expect a reasonably new contributor to manage it in 
a couple of months and an experienced contributor to manage it in much 
less than that. Indeed, much of Hubbub itself was originally done as a 
GSoC project [1].



> There is a standard test suite [2] for static HTML5 parsers.
> Browsers have additional requirements due to speculation and
> document.write(), but it looks like [3] Gecko implements that outside
> the translated parser, so this is code we would have to write and
> test in any case.


So part of the difficulty of document.write comes from the fact that it 
has to interact with the script loading / document lifecycle. Therefore 
it's going to be hard to get those parts of (any) parser right until we 
actually implement a more correct model of document loading. Ideally the 
two things would be designed concurrently so that there isn't an 
impedance mismatch between the parser and the loading code.



> For the short term I will continue to work on the translator and see
> if we can get more clarity about some of these unknowns.  But I'm
> also inclined to try implementing parts of a new HTML5 parser in
> Rust.  At any rate we should pay close attention to Gecko's parser
> design, and I will continue reading through that code.


My suspicion is that it's possible to spend more time talking about 
various options than it would take to stand up a rough prototype parser 
(with e.g. less important tokenizer/treebuilder states missing). 
Therefore I think this sounds like a great idea.
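
To give a sense of scale for such a rough prototype, here is a minimal 
sketch of a tokenizer with most states deliberately missing. It is written 
in present-day Rust syntax, and every name in it (Token, State, tokenize) 
is an assumption made for illustration, not code from Hubbub, Gecko, or 
Servo. It handles only character data and bare start/end tags; attributes, 
comments, character references, and most error recovery are left out.

```rust
// Hypothetical prototype tokenizer: only four states are present.
#[derive(Debug, PartialEq)]
enum Token {
    Character(char),
    StartTag(String),
    EndTag(String),
}

enum State {
    Data,
    TagOpen,
    EndTagOpen,
    // `end` records whether the tag being read started with "</".
    TagName { end: bool },
}

fn tokenize(input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut name = String::new();
    let mut state = State::Data;

    for c in input.chars() {
        state = match state {
            State::Data => match c {
                '<' => State::TagOpen,
                _ => {
                    tokens.push(Token::Character(c));
                    State::Data
                }
            },
            State::TagOpen => match c {
                '/' => State::EndTagOpen,
                _ if c.is_ascii_alphabetic() => {
                    name.push(c.to_ascii_lowercase());
                    State::TagName { end: false }
                }
                // Simplification: treat "<" followed by anything else as text.
                _ => {
                    tokens.push(Token::Character('<'));
                    tokens.push(Token::Character(c));
                    State::Data
                }
            },
            State::EndTagOpen => {
                if c.is_ascii_alphabetic() {
                    name.push(c.to_ascii_lowercase());
                    State::TagName { end: true }
                } else {
                    // Simplification: silently drop malformed end tags.
                    State::Data
                }
            }
            State::TagName { end } => match c {
                '>' => {
                    let n = std::mem::take(&mut name);
                    tokens.push(if end { Token::EndTag(n) } else { Token::StartTag(n) });
                    State::Data
                }
                _ => {
                    name.push(c.to_ascii_lowercase());
                    State::TagName { end }
                }
            },
        };
    }
    tokens
}
```

Calling tokenize("<P>a</p>") yields a start tag "p", the character 'a', 
and an end tag "p". Filling in the missing states is then incremental 
work that can be driven by the test suite.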


[1] http://www.netsurf-browser.org/developers/gsoc/
___
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo

Re: [dev-servo] HTML parsing alternatives

2014-03-06 Thread Henri Sivonen

> I've been working on that recently and I have some doubts about this
> approach.  Java and C++ share some features that Rust does not have.
> hsivonen and I have worked around some of these mismatches, but it's been a
> fair amount of effort already, and the translator is not that close to
> producing Rust code that will even compile.
...
> I think the biggest unknown is memory management. 

Is this the only thing that's blocking compilation or is there something else, 
too? I thought the control structure translation was already OK a year ago. I 
hope I wasn't mistaken about the control structure part.

> It's likely that an exact
> copy of the C++ approach will upset the borrowchecker, requiring either
> unsafe code or a more sophisticated translator.
...
> The translator directly prints C++ or Rust code as it traverses the Java AST.
> This makes it hard to implement anything beyond a close mapping of
> individual syntax elements.

Yes, if e.g. Rust unique pointer usage doesn't fit the JS/C++/Java code 
structure, it's not worthwhile to try to do machine translation. In that case, 
I'd expect developing a parser from scratch (or translating just the tokenizer 
control structure once and using that as a starting point) to be better.

> Writing our own HTML5 parser would be a lot of work, but does not seem
> infeasible.  The parsers I've found (including the translated C++ code for
> Gecko) are in the 10-20 KLoC range.

Yeah, if translation doesn't work out fairly quickly, and it looks like we are 
beyond "fairly quickly" by now, writing directly in Rust makes sense.

> We can do a one-time translation from
> Java for the most mechanical parts, without building a complete translator.

Makes sense at least for the tokenizer.

> At any rate we should
> pay close attention to Gecko's parser design

I think the off-the-main-thread design is worth copying.

However, it's probably a good idea to design for a code path without the 
off-the-main-thread (or I guess task in Rust) overhead for innerHTML. See 
https://bugzilla.mozilla.org/show_bug.cgi?id=959150#c10

-- 
Henri Sivonen
hsivo...@mozilla.com

Re: [dev-servo] HTML parsing alternatives

2014-03-06 Thread Keegan McAllister

(replies inline to multiple messages)

>> I think the biggest unknown is memory management.
>
> Is this the only thing that's blocking compilation

Unfortunately it's not.  Some other problems I ran into:

- The Java code has static data with non-constant initializers that depend on 
each other.  For C++ we produce an initializeStatics() function for each class. 
 We'd need to do the same for Rust.  It's harder because Rust distinguishes 
between "static" and "static mut" and only the former can be used as patterns 
in "match".

- The Java code uses "null" in various places.  So the translator needs to 
insert Option wrapping/unwrapping on object creation, field access, etc.

- Java and C++ name data members without a "self" or "this" prefix, but 
arguments and local variables can shadow them.  I wrote a special case to 
handle the Java idiom "this.foo = foo" but it doesn't catch everything.  I 
think we will need a real shadowing analysis, and I don't expect the Rust 
compiler to catch mistakes because the types will match.

- Java and C++ support function overloading by argument count and type; Rust 
doesn't.  I have manually reassigned unique names to the Java functions, but we 
might need a name-mangling approach to handle constructors.
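
As a rough illustration of three of the mismatches listed above (null, 
shadowing of members, and overloading), here is a hand-written sketch of 
what translated output might need to look like. It uses present-day Rust 
syntax, and the Tokenizer struct with its fields and methods is a 
hypothetical stand-in, not taken from the real Java parser or from the 
translator's output.

```rust
// Hypothetical stand-in for a translated Java class.
#[derive(Debug)]
struct Tokenizer {
    // A Java field that may hold null becomes an Option.
    public_id: Option<String>,
    line: u32,
}

impl Tokenizer {
    // Java allows Tokenizer() and Tokenizer(int line) as overloads of one
    // constructor; in Rust each one needs its own name.
    fn new() -> Tokenizer {
        Tokenizer::with_line(1)
    }

    fn with_line(line: u32) -> Tokenizer {
        // Java: this.line = line;
        Tokenizer { public_id: None, line }
    }

    // Java: void setPublicId(String publicId) { this.publicId = publicId; }
    // The translator has to emit `self.` for every member access, including
    // the ones Java wrote without `this.`.
    fn set_public_id(&mut self, public_id: &str) {
        self.public_id = Some(public_id.to_string());
    }

    // Java: return publicId == null ? 0 : publicId.length();
    // The null check turns into Option unwrapping.
    fn public_id_len(&self) -> usize {
        self.public_id.as_ref().map_or(0, |s| s.len())
    }
}
```

Every member access here goes through `self`, so the translator must 
decide for each bare Java name whether it refers to a field or to a 
local. That decision is the shadowing analysis mentioned above.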

I'm confident that we can solve all of these problems with reasonable effort.  
What I don't know is how many more we will discover, especially once we get 
past syntax and name resolution errors.


> However, it's probably a good idea to design for a code path without the 
> off-the-main-thread (or I guess task in Rust) overhead for innerHTML. See 
> https://bugzilla.mozilla.org/show_bug.cgi?id=959150#c10

Perhaps we can put the "create tree op" methods in a trait, and have one 
implementation which just applies them directly.
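
A minimal sketch of that idea, in present-day Rust syntax. All names here 
(TreeSink, DirectSink, QueueSink, TreeOp, build_fragment) are assumptions 
for illustration and do not correspond to Gecko's or Servo's actual 
types. The tree builder is generic over the trait; one implementation 
mutates a tree immediately (the innerHTML path), the other records ops to 
be shipped to another task.

```rust
type NodeId = usize;

// The "create tree op" methods the tree builder calls.
trait TreeSink {
    fn create_element(&mut self, name: &str) -> NodeId;
    fn append_child(&mut self, parent: NodeId, child: NodeId);
}

#[derive(Debug, Default)]
struct Node {
    name: String,
    children: Vec<NodeId>,
}

// Applies every op directly to an in-memory tree.
#[derive(Debug, Default)]
struct DirectSink {
    nodes: Vec<Node>,
}

impl TreeSink for DirectSink {
    fn create_element(&mut self, name: &str) -> NodeId {
        self.nodes.push(Node { name: name.to_string(), children: Vec::new() });
        self.nodes.len() - 1
    }

    fn append_child(&mut self, parent: NodeId, child: NodeId) {
        self.nodes[parent].children.push(child);
    }
}

#[derive(Debug, PartialEq)]
enum TreeOp {
    CreateElement { id: NodeId, name: String },
    AppendChild { parent: NodeId, child: NodeId },
}

// Records ops in a queue instead of applying them.
#[derive(Debug, Default)]
struct QueueSink {
    next_id: NodeId,
    ops: Vec<TreeOp>,
}

impl TreeSink for QueueSink {
    fn create_element(&mut self, name: &str) -> NodeId {
        let id = self.next_id;
        self.next_id += 1;
        self.ops.push(TreeOp::CreateElement { id, name: name.to_string() });
        id
    }

    fn append_child(&mut self, parent: NodeId, child: NodeId) {
        self.ops.push(TreeOp::AppendChild { parent, child });
    }
}

// Stand-in for the tree builder: it only sees the trait.
fn build_fragment<S: TreeSink>(sink: &mut S) -> NodeId {
    let div = sink.create_element("div");
    let span = sink.create_element("span");
    sink.append_child(div, span);
    div
}
```

Because build_fragment is generic, the direct path compiles down to 
ordinary method calls with no queueing or cross-task overhead.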

Do we need to do script-initiated parsing in the script task in all cases?  I 
was imagining that (similar to COW DOM) we could let the script continue until 
it touches the DOM again.  For document.write it seems impossible due to e.g.


x = 2;
document.write('