On Thu, Oct 11, 2012 at 4:03 AM, Johnny Stenback <j...@mozilla.com> wrote:
>> Is there a plan to use a special interned string type for identifiers
>> the way we use nsIAtom* in Gecko?
>
> I would imagine that we will use something like nsINodeInfo, and
> something like nsIAtom, as they not only help with memory use, but also
> with performance when doing selector matching etc.

OK.

> One possibility could be to have our charset converters produce UTF-8,
> the parser parses that, and converts attribute values etc to UTF-16 as
> it creates attributes etc.

Yeah, that seems like the winning approach. In particular, if a page
is in UTF-8 to begin with (and newly-authored pages should be),
there’s no need to validate UTF-8 before parsing. It could be done as
part of the piece-wise conversion to UTF-16. This way, the cost of
UTF-8 validation would be saved for whitespace between attributes and
for all well-known element and attribute names.

On Fri, Oct 12, 2012 at 12:01 AM, Boris Zbarsky <bzbar...@mit.edu> wrote:
> On 10/11/12 4:09 PM, Zack Weinberg wrote:
>>
>> I'm not seeing why the JS engine has to use any particular
>> representation internally just because JS's exposed semantics are
>> defined in terms of UCS-2.
>
> Well, because it's simpler and because it makes charAt() fast?

Does random charAt() need to be O(1) or is it enough for each charAt
in sequence over the string to be O(1)? That is, would it be enough to
store two mutable indices on each “immutable” string: the next UTF-16
index and the corresponding UTF-8 index, so that the operation is fast
when charAt() is called with the index that’s cached as the next
UTF-16 index whose corresponding UTF-8 index is already known? Or
would it be feasible for the JIT to recognize iteration over a string
with charAt(), know about the internal storage and automatically
maintain a temporary UTF-8 index during the iteration without caching
indices on the string object?

On Fri, Oct 12, 2012 at 8:33 AM, David Herman <dher...@mozilla.com> wrote:
> Simpler maybe, but since strings are immutable it's perfectly reasonable to
> have multiple internal string types and tag strings in the heap as being e.g.
> ASCII-only, so that they can be stored more compactly and still have fast
> accesses.
> It does mean more proliferation of string types, and string
> operations have to have multiple code paths, but I imagine for many
> applications it could have significant wins.

Gecko has compact storage for text nodes that only have characters
whose code point is <= U+00FF. This is not great. If you have a huge
text node as the child of <script> or <style>, the storage of the node
doubles if there is a copyright notice with a single character above
U+00FF even if all the functional parts of the script or style are
ASCII. UTF-8 or CESU-8 would make much more sense in this scenario.

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/
_______________________________________________
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo
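
To make the charAt() question above concrete, here is a rough sketch
of the "two mutable indices on an immutable string" idea. It is only
an illustration, not how any existing engine stores strings. The names
Utf8JsString and char_code_at are hypothetical. As a small variation
on the idea, the cached pair points at the start of the most recently
accessed code point rather than strictly at the next index, so that
both halves of a surrogate pair stay cheap. A request behind the
cursor simply rescans from the start.

```rust
use std::cell::Cell;

/// Hypothetical string type: UTF-8 storage, indexed in UTF-16 code
/// units the way JS charAt()/charCodeAt() expects.
pub struct Utf8JsString {
    text: String,
    // Cached cursor: (UTF-16 index, matching UTF-8 byte offset).
    // It always sits on a code point boundary. Cell lets us update
    // it even though the string contents never change.
    cursor: Cell<(usize, usize)>,
}

impl Utf8JsString {
    pub fn new(s: &str) -> Self {
        Utf8JsString {
            text: s.to_owned(),
            cursor: Cell::new((0, 0)),
        }
    }

    /// Returns the UTF-16 code unit at `index`, or None if `index`
    /// is past the end. Sequential forward access only walks one or
    /// two code points per call; backward access rescans.
    pub fn char_code_at(&self, index: usize) -> Option<u16> {
        let (mut u16_pos, mut u8_pos) = self.cursor.get();
        if index < u16_pos {
            // Behind the cursor: fall back to scanning from the start.
            u16_pos = 0;
            u8_pos = 0;
        }
        for ch in self.text[u8_pos..].chars() {
            let units = ch.len_utf16();
            if index < u16_pos + units {
                // `index` falls inside this code point: remember where
                // it starts, then pick the requested code unit.
                self.cursor.set((u16_pos, u8_pos));
                let mut buf = [0u16; 2];
                let encoded = ch.encode_utf16(&mut buf);
                return Some(encoded[index - u16_pos]);
            }
            u16_pos += units;
            u8_pos += ch.len_utf8();
        }
        None
    }
}
```

A loop that calls char_code_at(0), char_code_at(1), and so on never
rescans the string, which is the access pattern the question is
about. Whether the JIT could keep the cursor in a register instead of
on the string object is a separate question that this sketch does not
address.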