On 2013-02-08 12:36 PM, Simon Sapin wrote:
> Hi dev-servo,
>
> So my CSS parser in Rust is coming along nicely:
> https://github.com/SimonSapin/rust-cssparser/
>
> It’s mostly complete, although I still need to write some tests and
> catch up with spec changes. We’re working on the remaining css3-syntax
> issues in the CSS WG.
> http://dev.w3.org/csswg/css3-syntax/

I haven't looked at your code at all, but I have done quite a bit of work on the existing CSS parser in Gecko.

> Right now the parser returns a tree-like data structure where the nodes
> are at-rules, style rules, declarations, and unparsed "component
> values". The latter are close to tokens, and are expected to be parsed
> further into selectors, values for a given property, etc. This is out of
> scope for this parser library.

This seems like the Right Thing in an abstract sense. The existing one-pass parser is complicated by its need to be able to recover from an arbitrarily heinous syntax error in the middle of a property value; if there were a prescan that identified the boundaries of each value, that would make life simpler for the bulk of the code. This also gives you a natural parallelization point and a natural way to preallocate memory for "declaration blocks" (in Gecko's terminology).
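
To make that concrete, the kind of tree described above might look roughly like the following in Rust. These types are a sketch for illustration only, not the actual rust-cssparser definitions:

  // Illustrative sketch only, not the real rust-cssparser types.
  // Component values stay close to tokens; blocks nest recursively.
  enum ComponentValue {
      Ident(String),
      Number(f64),
      Delim(char),
      CurlyBlock(Vec<ComponentValue>),
      // ... other token kinds and bracket/paren blocks
  }

  // A declaration's value is kept as unparsed component values,
  // to be interpreted later according to the property.
  struct Declaration {
      name: String,
      value: Vec<ComponentValue>,
      important: bool,
  }

  enum Rule {
      AtRule {
          name: String,
          prelude: Vec<ComponentValue>,
          block: Option<Vec<ComponentValue>>,
      },
      // Selectors stay as component values too; parsing them
      // further is out of scope for the parser library.
      StyleRule {
          prelude: Vec<ComponentValue>,
          declarations: Vec<Declaration>,
      },
  }

Once declaration boundaries are known up front, each value can be handed off to a per-property parser independently, which is where the parallelization and preallocation opportunities come from.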

> Next on the roadmap for this parser is to keep track of source location
> (line and column number) for error reporting purposes. The current data
> structure approach would mean adding source location information to
> every token, or at least every rule/declaration, which is a bit
> heavy-weight.

You want to do this as lazily as possible, since with high probability no human will ever look at any given CSS error message. However, I would still go for the "add source location information to every token" approach in a greenfield design. It should be possible to keep it to two machine words per token: a pointer into the text of the entire style sheet, and a line number. That's not bad compared to the data that already has to be carried around for every token.
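
As a sketch of what the two-words-per-token idea might look like (the field names here are invented, not taken from either codebase):

  // Sketch only; a usize offset stands in for the pointer into the sheet text.
  #[derive(Clone, Copy)]
  struct SourceLocation {
      offset: usize, // byte offset into the style sheet source
      line: u32,     // line number; the column can be recovered lazily,
                     // by rescanning from the preceding newline, only
                     // when an error is actually reported
  }

  enum TokenValue {
      Ident(String),
      Number(f64),
      Delim(char),
      // ... whatever payload the tokenizer already carries
  }

  struct Token {
      value: TokenValue,
      location: SourceLocation,
  }

Deferring the column computation keeps the common case cheap, which fits the observation that most CSS error messages are never read.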

(While I'm on the notion of reducing per-token data, I have thought for some time that it would be interesting to see if Gecko's CSS parser would be sped up and/or simplified if we smushed together the existing concepts of "token type", "symbol", and "keyword". I can elaborate on this if you're interested.)

> I’m told that Gecko does all the parsing (including selectors and
> property values) in one pass, and thus just queries the tokenizer for
> source location information when an error is encountered.

This is correct. It also holds onto only one token at any given time, which means that error location information is not as accurate as it could be. Worse, we have to make sure to copy all the information we need out of the current token before retrieving the next one. That leads to nonobvious code constructs like this (CSSParserImpl::ParseCharsetRule):

  // Copy the identifier out of mToken before calling ExpectSymbol,
  // which advances the tokenizer and overwrites mToken.
  nsAutoString charset = mToken.mIdent;
  if (!ExpectSymbol(';', true)) {
    return false;
  }
  nsRefPtr<css::CharsetRule> rule = new css::CharsetRule(charset);

This can't be simplified, because ExpectSymbol clobbers mToken, but you'd be forgiven for thinking it could. Constructs like this appear in nearly every CSSParserImpl method.
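
For contrast, here is a minimal Rust sketch of a tokenizer that hands each token back by value, so there is no shared slot to clobber. All the names are invented for illustration; this is neither Gecko's API nor rust-cssparser's:

  #[derive(Debug, PartialEq)]
  enum Token {
      Ident(String),
      Delim(char),
      Eof,
  }

  struct Tokenizer {
      tokens: std::vec::IntoIter<Token>, // stand-in for real tokenizer state
  }

  impl Tokenizer {
      fn next_token(&mut self) -> Token {
          self.tokens.next().unwrap_or(Token::Eof)
      }
  }

  struct CharsetRule {
      charset: String,
  }

  fn parse_charset_rule(t: &mut Tokenizer) -> Option<CharsetRule> {
      // `charset` owns its string, so the second next_token() call
      // cannot invalidate it; there is no equivalent of mToken to clobber.
      let charset = match t.next_token() {
          Token::Ident(s) => s,
          _ => return None,
      };
      if t.next_token() != Token::Delim(';') {
          return None;
      }
      Some(CharsetRule { charset })
  }

Moving ownership of each token to the caller makes the copy-before-advancing dance unnecessary, and the compiler enforces the discipline rather than convention.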

zw