On Wed, Oct 8, 2014 at 4:13 PM, Jan de Mooij <jandemo...@gmail.com> wrote:
> When I added Latin1 to SpiderMonkey, we did consider using UTF8 but it's
> complicated. As mentioned, we have to ensure charAt/charCodeAt stay fast
> (crypto benchmarks etc rely on this, sadly).
It would be even more tragic to miss the opportunity to use 8-bit code
units for strings in Servo because JS crypto benchmarks use strings.
What chances are there to retire the use of strings-for-crypto in
benchmarking? Such a benchmark doesn't represent a reasonable real
application. A reasonable real application would use the Web Crypto API
to delegate crypto operations to code outside the JS engine or would
use ArrayBuffers to perform byte-oriented operations inside the JS
engine.

> Many other string operations
> are also very perf-sensitive and extra branches in tight loops can hurt a
> lot.

Besides charAt/charCodeAt, what operations do you expect to be
adversely affected by the WTF-8 memory layout?

As for extra branches: suppose each logically immutable string
maintained an immutable "is ASCII-only" bit of state and, if that bit
is false, two mutable integers of state, a "next UTF-16 index" and a
"next WTF-8 index". Is a branch at the start of charAt that checks
whether the argument equals the "next UTF-16 index" (in which case
reading starts at the "next WTF-8 index") substantially worse than the
checks for whether a PIC for something exists? (The first sketch at the
end of this message illustrates this caching scheme.) Also, if the JIT
knows about the internals of strings, couldn't these checks be
optimized out in obviously sequential loops by temporarily hoisting the
"next UTF-16 index" and "next WTF-8 index" out of the string object and
into the code accessing the string?

> Also, the regular expression engine currently emits JIT code to load
> and compare multiple characters at once.

Moving to a smaller code unit turns each larger code unit into a
concatenation of smaller ones, and concatenation is a regular
construct, so whether a regexp engine can be retargeted to *TF-8 is not
an open research question but a matter of doing the work. (The second
sketch at the end of this message illustrates the concatenation point.)
It therefore doesn't make sense to me to block Servo's use of *TF-8 on
regexp concerns. When the time comes to have product-level (as opposed
to research placeholder) performance for regexps, it should be a matter
of doing the work--not a matter of researching whether it is possible.

> All this is fixable to work on
> WTF-8 strings, but it's a lot of work and performance is a risk.

Considering all the work involved in making Servo into an engine
suitable for browsing the Web, it seems to me that it would be fair to
have this work on the todo list among everything else and to accept
non-optimized WTF-8 string object support into SpiderMonkey as a
compile-time option for the time being.

> Also note that the copying we do for strings passed from JS to Gecko is not
> only necessary for moving GC, but also to inflate Latin1 strings (= most
> strings)

Has SpiderMonkey ever been instrumented to find out whether most
strings are, in fact, just ASCII?

> to TwoByte Gecko strings. If Servo or Gecko could deal with both
> Latin1 and TwoByte strings, we could think about ways to avoid the copying.
> Though, as Boris said, I'm not aware of any (non-micro-)benchmark
> regressions from the copying so I don't expect big wins from optimizing
> this. But again, doing a Latin1 -> TwoByte copy is a very tight loop that
> compilers can probably vectorize. UTF8/WTF8 -> TwoByte is more complicated
> and probably slower.

Gecko already has vectorized code for conversions between UTF-8 and
UTF-16, so it's probably worth measuring how much worse vectorized
UTF-8 <-> UTF-16 is compared to vectorized Latin-1 <-> UTF-16. It's
quite possible that the answer is "not too much slower", if there
aren't already microbenchmarks relying on the copy speed. (The third
sketch at the end of this message contrasts the structure of the two
conversion loops.)
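To make the charAt caching scheme above concrete, here is a minimal
Rust sketch. To be clear, all type and field names here are invented
for illustration, the sketch only handles BMP-only strings (no
surrogate pairs, no lone surrogates), and a real engine would want
something smarter than a linear rescan on the slow path:

use std::cell::Cell;

// Sketch: a logically immutable WTF-8 string with a mutable cursor
// cache so that sequential charCodeAt accesses stay O(1).
pub struct Wtf8String {
    bytes: Vec<u8>,                // WTF-8 bytes (valid UTF-8 here)
    is_ascii: bool,                // immutable, computed at creation
    next_utf16_index: Cell<usize>, // mutable cursor: UTF-16 index...
    next_wtf8_index: Cell<usize>,  // ...and the matching byte index
}

impl Wtf8String {
    pub fn new(s: &str) -> Self {
        Wtf8String {
            is_ascii: s.is_ascii(),
            bytes: s.as_bytes().to_vec(),
            next_utf16_index: Cell::new(0),
            next_wtf8_index: Cell::new(0),
        }
    }

    // charCodeAt, restricted to BMP strings to keep the sketch short.
    pub fn char_code_at(&self, utf16_index: usize) -> u16 {
        if self.is_ascii {
            // ASCII: the UTF-16 index is the byte index.
            return self.bytes[utf16_index] as u16;
        }
        // The extra branch: does this access continue where the
        // previous one left off?
        let byte_index = if utf16_index == self.next_utf16_index.get() {
            self.next_wtf8_index.get() // fast path: sequential access
        } else {
            self.byte_index_for(utf16_index) // slow path: rescan
        };
        let tail = std::str::from_utf8(&self.bytes[byte_index..]).unwrap();
        let c = tail.chars().next().unwrap();
        self.next_utf16_index.set(utf16_index + 1);
        self.next_wtf8_index.set(byte_index + c.len_utf8());
        c as u16
    }

    // Linear rescan; without surrogate pairs, the UTF-16 index equals
    // the code point index.
    fn byte_index_for(&self, utf16_index: usize) -> usize {
        let s = std::str::from_utf8(&self.bytes).unwrap();
        s.char_indices().nth(utf16_index).unwrap().0
    }
}

fn main() {
    let s = Wtf8String::new("héllo");
    assert_eq!(s.char_code_at(1), 0xE9);       // slow path: rescan
    assert_eq!(s.char_code_at(2), 'l' as u16); // fast path: cursor hit
}

Cell keeps the cursor mutable behind a logically immutable string; in a
real engine the cursor would live in the string header, and the JIT
could hoist it into registers in the sequential-loop case mentioned
above.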
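To illustrate the claim that shrinking the code unit is a matter of
concatenation, here is a toy Rust sketch (function names invented):
under 16-bit code units, matching 'é' (U+00E9) is a single code unit
comparison, and under UTF-8 the same match is a comparison against the
concatenated byte sequence [0xC3, 0xA9], which is still a regular
construct.

// Toy sketch: a one-character matcher retargeted from UTF-16 code
// units to UTF-8 bytes becomes a concatenation of byte matchers.
fn compile_char_to_utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

// Matching is then a sequence (concatenation) of byte comparisons.
fn matches_at(haystack: &[u8], pos: usize, needle: &[u8]) -> bool {
    haystack[pos..].starts_with(needle)
}

fn main() {
    let needle = compile_char_to_utf8_bytes('é');
    assert_eq!(needle, vec![0xC3, 0xA9]);
    assert!(matches_at("café".as_bytes(), 3, &needle));
}

A real engine additionally has to turn code unit ranges into
alternations of byte-sequence ranges, but that, too, stays within
regular constructs--it's work, not research.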
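Finally, on the conversion-cost comparison, here is a scalar Rust
sketch of the two inner loops, just to show the structural difference a
vectorizer faces. (No validation: well-formed input is assumed, and the
supplementary plane is omitted for brevity.)

// Latin1 -> UTF-16: every byte becomes exactly one code unit. A pure
// widening loop like this is easy for compilers to autovectorize.
fn latin1_to_utf16(src: &[u8]) -> Vec<u16> {
    src.iter().map(|&b| b as u16).collect()
}

// UTF-8 -> UTF-16: variable-length input with data-dependent control
// flow on every lead byte.
fn utf8_to_utf16(src: &[u8]) -> Vec<u16> {
    let mut out = Vec::with_capacity(src.len());
    let mut i = 0;
    while i < src.len() {
        let b = src[i] as u16;
        if b < 0x80 {
            out.push(b); // one-byte sequence
            i += 1;
        } else if b < 0xE0 {
            out.push(((b & 0x1F) << 6) | (src[i + 1] as u16 & 0x3F));
            i += 2;
        } else if b < 0xF0 {
            out.push(((b & 0x0F) << 12)
                | ((src[i + 1] as u16 & 0x3F) << 6)
                | (src[i + 2] as u16 & 0x3F));
            i += 3;
        } else {
            // Four-byte sequence -> surrogate pair; omitted here.
            unimplemented!();
        }
    }
    out
}

fn main() {
    assert_eq!(latin1_to_utf16(&[0x61, 0xE9]), vec![0x0061, 0x00E9]);
    assert_eq!(utf8_to_utf16("a\u{E9}".as_bytes()), vec![0x0061, 0x00E9]);
}

The interesting measurement is how much of that structural gap the
existing vectorized UTF-8 code paths (ASCII fast paths etc.) close in
practice.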
--
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/