[dev-servo] Memory management in the HTML parser (was Re: DOM rooting is live)

Keegan McAllister Thu, 08 May 2014 18:07:24 -0700

> How would you create a `JS<T>` on the stack?

I'll provide some context.  I'm designing the HTML parser API to be compatible 
with different DOM representations.  This is desirable for a few reasons:


    - I want the library to be useful outside of Servo.  It will have the
      ability to output a simple static parse tree, for users who don't have
      their own DOM type.

    - Decoupling the parser from the details of Servo's DOM will make both
      systems much easier to modify.

    - For off-thread parsing, we want to build a sequence of tree operations,
      but for on-thread parsing we should manipulate the DOM directly.

My API mock-up at the moment looks like

    pub trait TreeSink<Handle> {
        fn create_element(&mut self, name: Atom) -> Handle;
        fn detach_from_parent(&mut self, child: Handle);
        fn append_element(&mut self, parent: Handle, child: Handle);

        // ...
    }

(In the real API there will be more parameters, e.g. a namespace and attributes 
for create_element.)

The library client will provide an implementation of TreeSink with an 
appropriate Handle type, and the parser will call these methods to manipulate 
the DOM during parsing.

A Handle represents a reference to a mutable node in the DOM.  Handles are 
required to be Clone, because the parser will hold internal references to 
nodes, e.g. in the stack of open elements.

Implementing this trait for a refcounted DOM is straightforward:

    struct Node {
        pub name: Atom,
        pub parent: Option<WeakHandle>,
        pub children: Vec<Handle>,
    }

    #[deriving(Clone)]
    struct Handle {
        ptr: Rc<RefCell<Node>>,
    }

    struct WeakHandle {
        ptr: Weak<RefCell<Node>>,
    }

    struct Sink {
        root: Option<Handle>,
    }

    impl TreeSink<Handle> for Sink { ... }
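
For readers who want to see the elided impl spelled out, here is a minimal sketch in present-day Rust syntax (`#[derive]`, `Vec::new()`), not the 2014 dialect used in the rest of this message. It rests on several assumptions: `Atom` is stood in for by `String`, the weak parent link is stored directly as a `Weak` rather than through a separate `WeakHandle` struct, `append_element` detaches the child from any previous parent first, and how `root` gets set is left to the client. None of this is the library's actual implementation.

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

// Stand-in for the interned `Atom` type (assumption for this sketch).
type Atom = String;

pub trait TreeSink<Handle> {
    fn create_element(&mut self, name: Atom) -> Handle;
    fn detach_from_parent(&mut self, child: Handle);
    fn append_element(&mut self, parent: Handle, child: Handle);
}

pub struct Node {
    pub name: Atom,
    // Weak link upward, so parent and child do not keep each other alive.
    pub parent: Option<Weak<RefCell<Node>>>,
    pub children: Vec<Handle>,
}

#[derive(Clone)]
pub struct Handle {
    ptr: Rc<RefCell<Node>>,
}

pub struct Sink {
    // Not touched by the methods below; the client decides what the root is.
    pub root: Option<Handle>,
}

impl TreeSink<Handle> for Sink {
    fn create_element(&mut self, name: Atom) -> Handle {
        let node = Node { name, parent: None, children: Vec::new() };
        Handle { ptr: Rc::new(RefCell::new(node)) }
    }

    fn detach_from_parent(&mut self, child: Handle) {
        // Clear the child's parent link; `upgrade` yields None if the
        // parent has already been dropped.
        let parent = child.ptr.borrow_mut().parent.take().and_then(|w| w.upgrade());
        if let Some(parent) = parent {
            parent
                .borrow_mut()
                .children
                .retain(|c| !Rc::ptr_eq(&c.ptr, &child.ptr));
        }
    }

    fn append_element(&mut self, parent: Handle, child: Handle) {
        // Assumption: appending implicitly removes the child from its old parent.
        self.detach_from_parent(child.clone());
        child.ptr.borrow_mut().parent = Some(Rc::downgrade(&parent.ptr));
        parent.ptr.borrow_mut().children.push(child);
    }
}
```

Cloning a `Handle` only bumps the reference count, which is what makes it cheap for the parser to keep handles in its stack of open elements.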

In Servo we have a JS-managed DOM, and on-main-thread parsing should manipulate 
it directly.  I have something like

    #[deriving(Encodable)]
    struct Node {
        pub name: Atom,
        pub parent: Option<Handle>,
        pub children: Vec<Handle>,
    }

    type Handle = JS<Node>;

    impl TreeSink<Handle> for Sink {
        fn create_element(&mut self, name: Atom) -> Handle {
            let owned = ~Node {
                name: name,
                children: vec!(),
                parent: None,
            };

            // Not shown: also build a JS wrapper object.

            unsafe {
                JS::from_raw(cast::transmute::<~Node, *mut Node>(owned))
            }
        }

        fn append_element(&mut self, parent_hdl: Handle, child_hdl: Handle) {
            let mut parent = parent_hdl.root();
            let mut child = child_hdl.root();

            (*child).parent = Some(parent_hdl.clone());
            parent.children.push(child_hdl.clone());
        }
    }

This (approximately) compiles, but I think it's not memory-safe, because we 
pass and return un-rooted JS<T> values.  To fix this we need two handle types:

    pub trait TreeSink<InHandle, OutHandle> {
        fn create_element(&mut self, name: Atom) -> OutHandle;
        fn detach_from_parent(&mut self, child: InHandle);
        fn append_element(&mut self, parent: InHandle, child: InHandle);

        // ...
    }

which will be instantiated as &JSRef<Node> and Temporary<Node> respectively.  
(And the lifetime of the JSRef will be inferred as the lifetime of each call, 
as in DOM bindings, but I'm not sure how to express this within the trait 
impl.) We'll need another trait to let the generic parser code convert between 
these types.

There is also the question of what handles to store within the parser.

    - We could root every node as it's created, and unroot when the parser
      is destroyed.  We'd store JSRef<Node>, transmuting away the lifetimes.

    - We could root the parser itself, make it traceable, and store JS<Node>.
      This seems safer, but would complicate the generic interface further.

This is all a bit moot if a parser never lives across a JS operation that could 
GC.  But I wouldn't bet on that always being the case.  The current Hubbub 
bindings basically make this assumption, though; see 
http://irclog.gr/#show/irc.mozilla.org/servo/103713

I also thought about something like

    pub trait TreeSink<InHandle, OutHandle> {
        fn create_element<'t>(&'t mut self, name: Atom) -> OutHandle<'t>;
        fn detach_from_parent<'t>(&'t mut self, child: InHandle<'t>);

        // ...
    }

but this would require higher-kinded polymorphism.  It will never really be 
possible for safe code to use a handle which stores an &'t mut Node, because 
that would completely break Rust's mutable-pointer aliasing rules.

For off-thread parsing, the TreeSink methods just record tree operations to be 
executed later.  In that situation, I think handles should be sequential 
integer IDs.  The tree op executor will use them as indexes into a vector of 
the nodes that it has created.  This is similar to Gecko's approach, where a 
handle is an nsIContent**, i.e. a pointer to a slot where a node pointer will 
eventually be stored.
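
The integer-ID scheme can be sketched roughly as follows, again in present-day Rust syntax. The names `TreeOp`, `OpRecorder`, `ExecNode` and `execute` are hypothetical, `Atom` is stood in for by `String`, and the executor here builds a plain vector of nodes rather than a real DOM. It is an illustration of the idea under those assumptions, not a proposed API.

```rust
// Stand-in for the interned `Atom` type (assumption for this sketch).
type Atom = String;

pub trait TreeSink<Handle> {
    fn create_element(&mut self, name: Atom) -> Handle;
    fn detach_from_parent(&mut self, child: Handle);
    fn append_element(&mut self, parent: Handle, child: Handle);
}

// One recorded tree operation; handles are plain sequential IDs.
#[derive(Debug, Clone, PartialEq)]
pub enum TreeOp {
    CreateElement { id: usize, name: Atom },
    DetachFromParent { child: usize },
    AppendElement { parent: usize, child: usize },
}

// Off-thread sink: records operations instead of touching any DOM.
#[derive(Default)]
pub struct OpRecorder {
    pub ops: Vec<TreeOp>,
    next_id: usize,
}

impl TreeSink<usize> for OpRecorder {
    fn create_element(&mut self, name: Atom) -> usize {
        let id = self.next_id;
        self.next_id += 1;
        self.ops.push(TreeOp::CreateElement { id, name });
        id
    }

    fn detach_from_parent(&mut self, child: usize) {
        self.ops.push(TreeOp::DetachFromParent { child });
    }

    fn append_element(&mut self, parent: usize, child: usize) {
        self.ops.push(TreeOp::AppendElement { parent, child });
    }
}

// Node type used by the executor; a real executor would create DOM nodes.
#[derive(Debug)]
pub struct ExecNode {
    pub name: Atom,
    pub parent: Option<usize>,
    pub children: Vec<usize>,
}

fn detach(nodes: &mut [ExecNode], child: usize) {
    if let Some(p) = nodes[child].parent.take() {
        nodes[p].children.retain(|&c| c != child);
    }
}

// Replays the recorded operations; each ID indexes the vector of created nodes.
pub fn execute(ops: &[TreeOp]) -> Vec<ExecNode> {
    let mut nodes: Vec<ExecNode> = Vec::new();
    for op in ops {
        match op {
            TreeOp::CreateElement { id, name } => {
                // IDs are handed out sequentially, so the slot is the next index.
                debug_assert_eq!(*id, nodes.len());
                nodes.push(ExecNode { name: name.clone(), parent: None, children: Vec::new() });
            }
            TreeOp::DetachFromParent { child } => detach(&mut nodes, *child),
            TreeOp::AppendElement { parent, child } => {
                detach(&mut nodes, *child);
                nodes[*child].parent = Some(*parent);
                nodes[*parent].children.push(*child);
            }
        }
    }
    nodes
}
```

Because a handle is just a `usize`, it is trivially Clone and can be sent between threads, and the recorder never needs to know what node type the executor will eventually create.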

By the way, it's not relevant to Servo, but I think we can parse into an owning 
tree without refcounting or copying.  During tree building we'll have

    struct BuildNode {
        pub name: Atom,
        pub parent: Option<*mut BuildNode>,
        pub children: Vec<*mut BuildNode>,
    }

and every node will be owned by the TreeSink itself.  When parsing is finished 
we transmute the root to

    struct Node {
        pub name: Atom,
        /*priv*/ unused_parent: Option<uint>,
        pub children: Vec<~Node>,
    }

transferring ownership of each node to its parent.  Then we free any nodes that 
didn't make it into the final tree.
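
For comparison, here is a safe variant of the same idea in present-day Rust. It is deliberately not the transmute-based scheme described above: instead of raw pointers and a layout-compatible final struct, it keeps build nodes in a vector owned by the builder, links them by index, and moves each node into its parent when parsing finishes. That costs one move per node but needs no `unsafe`. The names `Builder`, `create`, `append` and `finish` are hypothetical, and `Atom` is stood in for by `String`.

```rust
// Stand-in for the interned `Atom` type (assumption for this sketch).
type Atom = String;

// Node as it exists during tree building: links are indices into the arena.
struct BuildNode {
    name: Atom,
    parent: Option<usize>,
    children: Vec<usize>,
}

// Final owning tree: each parent owns its children directly.
pub struct Node {
    pub name: Atom,
    pub children: Vec<Box<Node>>,
}

// Every build node is owned by the builder until `finish` is called.
#[derive(Default)]
pub struct Builder {
    arena: Vec<BuildNode>,
}

impl Builder {
    pub fn create(&mut self, name: Atom) -> usize {
        self.arena.push(BuildNode { name, parent: None, children: Vec::new() });
        self.arena.len() - 1
    }

    pub fn append(&mut self, parent: usize, child: usize) {
        // Re-parenting: drop the child from its previous parent, if any.
        if let Some(old) = self.arena[child].parent.replace(parent) {
            self.arena[old].children.retain(|&c| c != child);
        }
        self.arena[parent].children.push(child);
    }

    // Moves ownership of each reachable node into its parent. Nodes that
    // never made it into the final tree are freed when the arena is dropped.
    pub fn finish(self, root: usize) -> Box<Node> {
        let mut slots: Vec<Option<BuildNode>> = self.arena.into_iter().map(Some).collect();
        into_owned(&mut slots, root)
    }
}

// Recursive conversion; very deep trees would want an explicit stack instead.
fn into_owned(slots: &mut Vec<Option<BuildNode>>, id: usize) -> Box<Node> {
    let b = slots[id].take().expect("node appears twice in the tree");
    let children = b.children.iter().map(|&c| into_owned(slots, c)).collect();
    Box::new(Node { name: b.name, children })
}
```

The trade-off against the transmute approach is that node contents are moved once at the end, in exchange for not depending on the two structs having identical layout.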

Designing a library to be generic over its client's memory management approach 
in a statically safe way seems to be quite the challenge.  I'd be very happy to 
hear thoughts regarding any of the above.

keegan
