Re: [dev-servo] HTML parser-related datatypes

Henri Sivonen Tue, 16 Oct 2012 07:13:10 -0700

On Thu, Oct 11, 2012 at 4:03 AM, Johnny Stenback <j...@mozilla.com> wrote:
>> Is there a plan to use a special interned string type for identifiers
>> the way we use nsIAtom* in Gecko?
>
> I would imagine that we will use something like nsINodeInfo, and
> something like nsIAtom, as they not only help with memory use, but also
> with performance when doing selector matching etc.

OK.

> One possibility could be to have our charset converters produce UTF-8,
> the parser parses that, and converts attribute values etc to UTF-16 as
> it creates attributes etc.

Yeah, that seems like the winning approach. In particular, if a page
is in UTF-8 to begin with (and newly-authored pages should be),
there’s no need to validate UTF-8 before parsing. It could be done as
part of the piece-wise conversion to UTF-16. This way, the cost of
UTF-8 validation would be saved for whitespace between attributes and
for all well-known element and attribute names.
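
To make the idea concrete, here is a rough sketch in Rust. Everything in
it is an assumption for illustration: the names known_name and
value_to_utf16 are hypothetical, and the handful of names stands in for
a real table. Well-known names are matched on raw bytes, so no UTF-8
validation is needed for them. Only attribute values pay for
validation, as part of the conversion to UTF-16, with malformed
sequences becoming U+FFFD.

```rust
/// Hypothetical lookup of well-known names by raw byte comparison.
/// No UTF-8 validation happens here: bytes that are not a known name
/// simply fail to match.
fn known_name(bytes: &[u8]) -> Option<&'static str> {
    match bytes {
        b"div" => Some("div"),
        b"span" => Some("span"),
        b"href" => Some("href"),
        b"class" => Some("class"),
        _ => None,
    }
}

/// Hypothetical piece-wise conversion of one attribute value.
/// The input is checked for well-formedness only here, on the way to
/// UTF-16; malformed sequences are replaced with U+FFFD.
fn value_to_utf16(bytes: &[u8]) -> Vec<u16> {
    String::from_utf8_lossy(bytes).encode_utf16().collect()
}
```

A real tokenizer would fuse validation and conversion into a single
pass over the bytes; the standard library calls above only show where
the cost is paid.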

On Fri, Oct 12, 2012 at 12:01 AM, Boris Zbarsky <bzbar...@mit.edu> wrote:
> On 10/11/12 4:09 PM, Zack Weinberg wrote:
>>
>> I'm not seeing why the JS engine has to use any particular
>> representation internally just because JS's exposed semantics are
>> defined in terms of UCS-2.
>
> Well, because it's simpler and because it makes charAt() fast?

Does random charAt() need to be O(1) or is it enough for each charAt
in sequence over the string to be O(1)? That is, would it be enough to
store two mutable indices on each “immutable” string: the next UTF-16
index and the corresponding UTF-8 index so that the operation is fast
when charAt() is called with the index that’s cached as the next
UTF-16 index whose corresponding UTF-8 index is already known? Or
would it be feasible for the JIT to recognize iteration over a string
with charAt(), know about the internal storage and automatically
maintain a temporary UTF-8 index during the iteration without caching
indices on the string object?
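
A minimal sketch of the first option in Rust, under stated assumptions:
CachedStr and char_code_at are hypothetical names, not anything from a
JS engine. The string stores UTF-8 and keeps a cached pair of (next
UTF-16 index, corresponding UTF-8 byte offset). Sequential access hits
the cache and costs O(1) per call; a lookup behind the cached position
falls back to a scan from the start.

```rust
use std::cell::Cell;

/// Hypothetical string: UTF-8 storage, UTF-16-indexed access.
struct CachedStr {
    text: String,
    /// (UTF-16 index, UTF-8 byte offset) of a character boundary.
    cursor: Cell<(usize, usize)>,
}

impl CachedStr {
    fn new(s: &str) -> Self {
        CachedStr { text: s.to_owned(), cursor: Cell::new((0, 0)) }
    }

    /// Returns the UTF-16 code unit at `index`, or None if out of range.
    fn char_code_at(&self, index: usize) -> Option<u16> {
        let (mut u16_idx, mut u8_idx) = self.cursor.get();
        if index < u16_idx {
            // Behind the cached position: restart from the beginning.
            u16_idx = 0;
            u8_idx = 0;
        }
        for ch in self.text[u8_idx..].chars() {
            let n = ch.len_utf16();
            if index < u16_idx + n {
                let mut buf = [0u16; 2];
                let unit = ch.encode_utf16(&mut buf)[index - u16_idx];
                if index + 1 == u16_idx + n {
                    // Last unit of this character: advance past it.
                    self.cursor.set((u16_idx + n, u8_idx + ch.len_utf8()));
                } else {
                    // First half of a surrogate pair: stay on this character.
                    self.cursor.set((u16_idx, u8_idx));
                }
                return Some(unit);
            }
            u16_idx += n;
            u8_idx += ch.len_utf8();
        }
        // Ran off the end; remember the end position.
        self.cursor.set((u16_idx, u8_idx));
        None
    }
}
```

The cursor always sits on a character boundary, so slicing the UTF-8
storage at the cached byte offset is safe even in the middle of a
surrogate pair.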

On Fri, Oct 12, 2012 at 8:33 AM, David Herman <dher...@mozilla.com> wrote:
> Simpler maybe, but since strings are immutable it's perfectly reasonable to 
> have multiple internal string types and tag strings in the heap as being e.g. 
> ASCII-only, so that they can be stored more compactly and still have fast 
> accesses. It does mean more proliferation of string types, and string 
> operations have to have multiple code paths, but I imagine for many 
> applications it could have significant wins.

Gecko has compact storage for text nodes that only have characters
whose code point is <= U+00FF. This is not great. If you have a huge
text node as the child of <script> or <style>, the storage of the node
doubles if there is a copyright notice with a single character above
U+00FF even if all the functional parts of the script or style are
ASCII. UTF-8 or CESU-8 would make much more sense in this scenario.
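
The arithmetic can be sketched in Rust as follows. This is a simplified
model, not Gecko's actual text node storage: the function names are
hypothetical, and the first function just assumes one byte per
character when every code point is <= U+00FF and two bytes per UTF-16
code unit otherwise.

```rust
/// Simplified model of "one byte per character if everything is
/// <= U+00FF, else UTF-16" storage. Returns the size in bytes.
fn compact_or_utf16_size(text: &str) -> usize {
    if text.chars().all(|c| (c as u32) <= 0xFF) {
        text.chars().count()
    } else {
        2 * text.encode_utf16().count()
    }
}

/// Size in bytes of the same text stored as UTF-8.
fn utf8_size(text: &str) -> usize {
    text.len()
}
```

For example, 1000 ASCII characters take 1000 bytes in both models.
Append a single U+0159 (as might appear in a name in a copyright
notice) and the first model needs 2002 bytes, while UTF-8 needs 1002.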

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/