On 2013-02-08 12:36 PM, Simon Sapin wrote:
> Hi dev-servo,
>
> So my CSS parser in Rust is coming along nicely:
> https://github.com/SimonSapin/rust-cssparser/
>
> It’s mostly complete, although I still need to write some tests and
> catch up with spec changes. We’re working on the remaining css3-syntax
> issues in the CSS WG.
> http://dev.w3.org/csswg/css3-syntax/

I haven't looked at your code at all, but I have done quite a bit of work on the existing CSS parser in Gecko.

> Right now the parser returns a tree-like data structure where the nodes
> are at-rules, style rules, declarations, and unparsed "component
> values". The latter are close to tokens, and are expected to be parsed
> further into selectors, values for a given property, etc. This is out of
> scope for this parser library.

This seems like the Right Thing in an abstract sense. The existing one-pass parser is complicated by its need to be able to recover from an arbitrarily heinous syntax error in the middle of a property value; if there were a prescan that identified the boundaries of each value, that would make life simpler for the bulk of the code. This also gives you a natural parallelization point and a natural way to preallocate memory for "declaration blocks" (in Gecko's terminology).
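
To make that concrete, the kind of tree described above might look roughly like the following in Rust. These types are a sketch for illustration only, not the actual rust-cssparser definitions:

  // Illustrative sketch only, not the real rust-cssparser types.
  // Component values stay close to tokens; blocks nest recursively.
  enum ComponentValue {
      Ident(String),
      Number(f64),
      Delim(char),
      CurlyBlock(Vec<ComponentValue>),
      // ... other token kinds and bracket/paren blocks
  }

  // A declaration's value is kept as unparsed component values,
  // to be interpreted later according to the property.
  struct Declaration {
      name: String,
      value: Vec<ComponentValue>,
      important: bool,
  }

  enum Rule {
      AtRule {
          name: String,
          prelude: Vec<ComponentValue>,
          block: Option<Vec<ComponentValue>>,
      },
      // Selectors stay as component values too; parsing them
      // further is out of scope for the parser library.
      StyleRule {
          prelude: Vec<ComponentValue>,
          declarations: Vec<Declaration>,
      },
  }

Once declaration boundaries are known up front, each value can be handed off to a per-property parser independently, which is where the parallelization and preallocation opportunities come from.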

> Next on the roadmap for this parser is to keep track of source location
> (line and column number) for error reporting purposes. The current data
> structure approach would mean adding source location information to
> every token, or at least every rule/declaration, which is a bit
> heavy-weight.

You want to do this as lazily as possible, since with high probability no human will ever look at any given CSS error message. However, I would still go for the "add source location information to every token" approach in a greenfield design. It should be possible to keep it to two machine words per token: a pointer into the text of the entire style sheet, and a line number. That's not bad compared to the data that already has to be carried around for every token.
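
As a sketch of what the two-words-per-token idea might look like (the field names here are invented, not taken from either codebase):

  // Sketch only; a usize offset stands in for the pointer into the sheet text.
  #[derive(Clone, Copy)]
  struct SourceLocation {
      offset: usize, // byte offset into the style sheet source
      line: u32,     // line number; the column can be recovered lazily,
                     // by rescanning from the preceding newline, only
                     // when an error is actually reported
  }

  enum TokenValue {
      Ident(String),
      Number(f64),
      Delim(char),
      // ... whatever payload the tokenizer already carries
  }

  struct Token {
      value: TokenValue,
      location: SourceLocation,
  }

Deferring the column computation keeps the common case cheap, which fits the observation that most CSS error messages are never read.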

(While I'm on the notion of reducing per-token data, I have thought for some time that it would be interesting to see if Gecko's CSS parser would be sped up and/or simplified if we smushed together the existing concepts of "token type", "symbol", and "keyword". I can elaborate on this if you're interested.)

> I’m told that Gecko does all the parsing (including selectors and
> property values) in one pass, and thus just queries the tokenizer for
> source location information when an error is encountered.

This is correct. It also holds onto only one token at any given time, which means that error location information is not as accurate as it could be. Worse, we have to make sure to copy all the information we need out of the current token before retrieving the next one. That leads to nonobvious code constructs like this (CSSParserImpl::ParseCharsetRule):

  // Copy the identifier out of mToken before calling ExpectSymbol,
  // which advances the tokenizer and overwrites mToken.
  nsAutoString charset = mToken.mIdent;
  if (!ExpectSymbol(';', true)) {
    return false;
  }
  nsRefPtr<css::CharsetRule> rule = new css::CharsetRule(charset);

This can't be simplified, because ExpectSymbol clobbers mToken, but you'd be forgiven for thinking it could. Constructs like this appear in nearly every CSSParserImpl method.
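
For contrast, here is a minimal Rust sketch of a tokenizer that hands each token back by value, so there is no shared slot to clobber. All the names are invented for illustration; this is neither Gecko's API nor rust-cssparser's:

  #[derive(Debug, PartialEq)]
  enum Token {
      Ident(String),
      Delim(char),
      Eof,
  }

  struct Tokenizer {
      tokens: std::vec::IntoIter<Token>, // stand-in for real tokenizer state
  }

  impl Tokenizer {
      fn next_token(&mut self) -> Token {
          self.tokens.next().unwrap_or(Token::Eof)
      }
  }

  struct CharsetRule {
      charset: String,
  }

  fn parse_charset_rule(t: &mut Tokenizer) -> Option<CharsetRule> {
      // `charset` owns its string, so the second next_token() call
      // cannot invalidate it; there is no equivalent of mToken to clobber.
      let charset = match t.next_token() {
          Token::Ident(s) => s,
          _ => return None,
      };
      if t.next_token() != Token::Delim(';') {
          return None;
      }
      Some(CharsetRule { charset })
  }

Moving ownership of each token to the caller makes the copy-before-advancing dance unnecessary, and the compiler enforces the discipline rather than convention.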

zw