You could use a rope where individual chunks can be either UTF-8 or UCS-2. UTF-8 strings would also record whether they happen to be 7-bit ASCII, and UCS-2 strings would record whether they contain any surrogates (and also maybe whether all surrogates are correctly paired according to UTF-16). Together with the lengths stored throughout the rope's tree structure, this would enable efficient indexing by either UCS-2 or Unicode codepoint index.
That's a fair bit of complexity and so probably only worth it for large strings built in complicated ways. I don't know whether these are common in the wild. If you want to get really fancy, I think you could use succinct rank/select dictionaries or wavelet trees to make UTF-8 codepoint-indexable with minimal space overhead. (But I don't actually know much about these data structures!) keegan ----- Original Message ----- From: "Luke Wagner" <l...@mozilla.com> To: "Henri Sivonen" <hsivo...@hsivonen.fi> Cc: "Boris Zbarsky" <bzbar...@mit.edu>, mozilla-dev-se...@lists.mozilla.org, "Robert O'Callahan" <rob...@ocallahan.org>, "Eddy Bruel" <ejpbr...@mozilla.com> Sent: Thursday, April 3, 2014 9:02:38 AM Subject: Re: [dev-servo] character encoding in the HTML parser Another option we've just been discussing is to lazily compute a flag on the string indicating "contents are 7-bit ascii" that allowed us to use array indexing. I'd expect this to often be true. There are also many cases where we'd eagerly have this flag (atoms produced during parsing, strings converted from numbers, concatenations of 7-bit ascii strings, substrings of 7-bit ascii strings, as a parameter from the embedding, etc) so that we would be able to avoid much of the overhead of this check. (One could even imagine a background thread that computed 7-bit-ness ;) ----- Original Message ----- > On Wed, Apr 2, 2014 at 4:25 PM, Robert O'Callahan <rob...@ocallahan.org> > wrote: > > If we could get the JS engine to use evil-UTF8 with some hack to handle > > charAt and friends efficiently (e.g. tacking on a UCS-2 version of the > > string when necessary) > > Have we instrumented Gecko to find out what the access patterns are > like? If the main performance-sensitive access pattern is sequential > iteration over the string, instead of storing a UCS-2 copy, we could > store the next expected UCS-2 index and the next UTF-8 index. charAt > would then start by comparing if its argument equals the next expected > UCS-2 index in which case read would start at the next UTF-8 index. > > -- > Henri Sivonen > hsivo...@hsivonen.fi > https://hsivonen.fi/ > _______________________________________________ > dev-servo mailing list > dev-servo@lists.mozilla.org > https://lists.mozilla.org/listinfo/dev-servo > _______________________________________________ dev-servo mailing list dev-servo@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-servo _______________________________________________ dev-servo mailing list dev-servo@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-servo