Re: [dev-servo] character encoding in the HTML parser

Keegan McAllister Tue, 22 Apr 2014 16:16:23 -0700

You could use a rope where individual chunks can be either UTF-8 or UCS-2.  
UTF-8 strings would also record whether they happen to be 7-bit ASCII, and 
UCS-2 strings would record whether they contain any surrogates (and also maybe 
whether all surrogates are correctly paired according to UTF-16).  Together 
with the lengths stored throughout the rope's tree structure, this would enable 
efficient indexing by either UCS-2 or Unicode codepoint index.

That's a fair bit of complexity and so probably only worth it for large strings 
built in complicated ways.  I don't know whether these are common in the wild.

If you want to get really fancy, I think you could use succinct rank/select 
dictionaries or wavelet trees to make UTF-8 codepoint-indexable with minimal 
space overhead.  (But I don't actually know much about these data structures!)

keegan

----- Original Message -----
From: "Luke Wagner" <[email protected]>
To: "Henri Sivonen" <[email protected]>
Cc: "Boris Zbarsky" <[email protected]>, [email protected], 
"Robert O'Callahan" <[email protected]>, "Eddy Bruel" <[email protected]>
Sent: Thursday, April 3, 2014 9:02:38 AM
Subject: Re: [dev-servo] character encoding in the HTML parser

Another option we've just been discussing is to lazily compute a flag on the 
string indicating "contents are 7-bit ascii" that allowed us to use array 
indexing.  I'd expect this to often be true.  There are also many cases where 
we'd eagerly have this flag (atoms produced during parsing, strings converted 
from numbers, concatenations of 7-bit ascii strings, substrings of 7-bit ascii 
strings, as a parameter from the embedding, etc) so that we would be able to 
avoid much of the overhead of this check.  (One could even imagine a background 
thread that computed 7-bit-ness ;)

----- Original Message -----
> On Wed, Apr 2, 2014 at 4:25 PM, Robert O'Callahan <[email protected]>
> wrote:
> > If we could get the JS engine to use evil-UTF8 with some hack to handle
> > charAt and friends efficiently (e.g. tacking on a UCS-2 version of the
> > string when necessary)
> 
> Have we instrumented Gecko to find out what the access patterns are
> like? If the main performance-sensitive access pattern is sequential
> iteration over the string, instead of storing a UCS-2 copy, we could
> store the next expected UCS-2 index and the next UTF-8 index. charAt
> would then start by comparing if its argument equals the next expected
> UCS-2 index in which case read would start at the next UTF-8 index.
> 
> --
> Henri Sivonen
> [email protected]
> https://hsivonen.fi/
> _______________________________________________
> dev-servo mailing list
> [email protected]
> https://lists.mozilla.org/listinfo/dev-servo
> 
_______________________________________________
dev-servo mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-servo
_______________________________________________
dev-servo mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-servo

Re: [dev-servo] character encoding in the HTML parser

Reply via email to