On Wed, Oct 8, 2014 at 4:13 PM, Jan de Mooij <jandemo...@gmail.com> wrote:
> When I added Latin1 to SpiderMonkey, we did consider using UTF8 but it's
> complicated. As mentioned, we have to ensure charAt/charCodeAt stay fast
> (crypto benchmarks etc rely on this, sadly).
It would be even more tragic to miss the opportunity to use 8-bit code
units for strings in Servo because JS crypto benchmarks use strings.
What chances are there to retire the use of strings-for-crypto in
benchmarking? Such a benchmark doesn't represent a reasonable real
application. A reasonable real application would use the Web Crypto API
to delegate crypto operations to code outside the JS engine or would
use ArrayBuffers to perform byte-oriented operations inside the JS
engine.

> Many other string operations
> are also very perf-sensitive and extra branches in tight loops can hurt a
> lot.

Besides charAt/charCodeAt, what operations do you expect to be
adversely affected by the WTF-8 memory layout?

As for extra branches: suppose each logically immutable string
maintained an immutable "is ASCII-only" bit of state and, if that bit
is false, two mutable integers of state, a "next UTF-16 index" and a
"next WTF-8 index". Is a branch at the start of charAt that checks
whether the argument equals the "next UTF-16 index" (in which case
reading starts at the "next WTF-8 index") substantially worse than the
checks for whether a PIC for something exists? (The first sketch at the
end of this message illustrates this caching scheme.) Also, if the JIT
knows about the internals of strings, couldn't these checks be
optimized out in obviously sequential loops by temporarily hoisting the
"next UTF-16 index" and "next WTF-8 index" out of the string object and
into the code accessing the string?

> Also, the regular expression engine currently emits JIT code to load
> and compare multiple characters at once.

Moving to a smaller code unit turns each larger code unit into a
concatenation of smaller ones, and concatenation is a regular
construct, so whether a regexp engine can be retargeted to *TF-8 is not
an open research question but a matter of doing the work. (The second
sketch at the end of this message illustrates the concatenation point.)
It therefore doesn't make sense to me to block Servo's use of *TF-8 on
regexp concerns. When the time comes to have product-level (as opposed
to research placeholder) performance for regexps, it should be a matter
of doing the work--not a matter of researching whether it is possible.

> All this is fixable to work on
> WTF-8 strings, but it's a lot of work and performance is a risk.

Considering all the work involved in making Servo into an engine
suitable for browsing the Web, it seems to me that it would be fair to
have this work on the todo list among everything else and to accept
non-optimized WTF-8 string object support into SpiderMonkey as a
compile-time option for the time being.

> Also note that the copying we do for strings passed from JS to Gecko is not
> only necessary for moving GC, but also to inflate Latin1 strings (= most
> strings)

Has SpiderMonkey ever been instrumented to find out whether most
strings are, in fact, just ASCII?

> to TwoByte Gecko strings. If Servo or Gecko could deal with both
> Latin1 and TwoByte strings, we could think about ways to avoid the copying.
> Though, as Boris said, I'm not aware of any (non-micro-)benchmark
> regressions from the copying so I don't expect big wins from optimizing
> this. But again, doing a Latin1 -> TwoByte copy is a very tight loop that
> compilers can probably vectorize. UTF8/WTF8 -> TwoByte is more complicated
> and probably slower.

Gecko already has vectorized code for conversions between UTF-8 and
UTF-16, so it's probably worth measuring how much worse vectorized
UTF-8 <-> UTF-16 is compared to vectorized Latin-1 <-> UTF-16. It's
quite possible that the answer is "not too much slower", if there
aren't already microbenchmarks relying on the copy speed. (The third
sketch at the end of this message contrasts the structure of the two
conversion loops.)
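To make the charAt caching scheme above concrete, here is a minimal
Rust sketch. To be clear, all type and field names here are invented
for illustration, the sketch only handles BMP-only strings (no
surrogate pairs, no lone surrogates), and a real engine would want
something smarter than a linear rescan on the slow path:

use std::cell::Cell;

// Sketch: a logically immutable WTF-8 string with a mutable cursor
// cache so that sequential charCodeAt accesses stay O(1).
pub struct Wtf8String {
    bytes: Vec<u8>,                // WTF-8 bytes (valid UTF-8 here)
    is_ascii: bool,                // immutable, computed at creation
    next_utf16_index: Cell<usize>, // mutable cursor: UTF-16 index...
    next_wtf8_index: Cell<usize>,  // ...and the matching byte index
}

impl Wtf8String {
    pub fn new(s: &str) -> Self {
        Wtf8String {
            is_ascii: s.is_ascii(),
            bytes: s.as_bytes().to_vec(),
            next_utf16_index: Cell::new(0),
            next_wtf8_index: Cell::new(0),
        }
    }

    // charCodeAt, restricted to BMP strings to keep the sketch short.
    pub fn char_code_at(&self, utf16_index: usize) -> u16 {
        if self.is_ascii {
            // ASCII: the UTF-16 index is the byte index.
            return self.bytes[utf16_index] as u16;
        }
        // The extra branch: does this access continue where the
        // previous one left off?
        let byte_index = if utf16_index == self.next_utf16_index.get() {
            self.next_wtf8_index.get() // fast path: sequential access
        } else {
            self.byte_index_for(utf16_index) // slow path: rescan
        };
        let tail = std::str::from_utf8(&self.bytes[byte_index..]).unwrap();
        let c = tail.chars().next().unwrap();
        self.next_utf16_index.set(utf16_index + 1);
        self.next_wtf8_index.set(byte_index + c.len_utf8());
        c as u16
    }

    // Linear rescan; without surrogate pairs, the UTF-16 index equals
    // the code point index.
    fn byte_index_for(&self, utf16_index: usize) -> usize {
        let s = std::str::from_utf8(&self.bytes).unwrap();
        s.char_indices().nth(utf16_index).unwrap().0
    }
}

fn main() {
    let s = Wtf8String::new("héllo");
    assert_eq!(s.char_code_at(1), 0xE9);       // slow path: rescan
    assert_eq!(s.char_code_at(2), 'l' as u16); // fast path: cursor hit
}

Cell keeps the cursor mutable behind a logically immutable string; in a
real engine the cursor would live in the string header, and the JIT
could hoist it into registers in the sequential-loop case mentioned
above.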
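To illustrate the claim that shrinking the code unit is a matter of
concatenation, here is a toy Rust sketch (function names invented):
under 16-bit code units, matching 'é' (U+00E9) is a single code unit
comparison, and under UTF-8 the same match is a comparison against the
concatenated byte sequence [0xC3, 0xA9], which is still a regular
construct.

// Toy sketch: a one-character matcher retargeted from UTF-16 code
// units to UTF-8 bytes becomes a concatenation of byte matchers.
fn compile_char_to_utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

// Matching is then a sequence (concatenation) of byte comparisons.
fn matches_at(haystack: &[u8], pos: usize, needle: &[u8]) -> bool {
    haystack[pos..].starts_with(needle)
}

fn main() {
    let needle = compile_char_to_utf8_bytes('é');
    assert_eq!(needle, vec![0xC3, 0xA9]);
    assert!(matches_at("café".as_bytes(), 3, &needle));
}

A real engine additionally has to turn code unit ranges into
alternations of byte-sequence ranges, but that, too, stays within
regular constructs--it's work, not research.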
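Finally, on the conversion-cost comparison, here is a scalar Rust
sketch of the two inner loops, just to show the structural difference a
vectorizer faces. (No validation: well-formed input is assumed, and the
supplementary plane is omitted for brevity.)

// Latin1 -> UTF-16: every byte becomes exactly one code unit. A pure
// widening loop like this is easy for compilers to autovectorize.
fn latin1_to_utf16(src: &[u8]) -> Vec<u16> {
    src.iter().map(|&b| b as u16).collect()
}

// UTF-8 -> UTF-16: variable-length input with data-dependent control
// flow on every lead byte.
fn utf8_to_utf16(src: &[u8]) -> Vec<u16> {
    let mut out = Vec::with_capacity(src.len());
    let mut i = 0;
    while i < src.len() {
        let b = src[i] as u16;
        if b < 0x80 {
            out.push(b); // one-byte sequence
            i += 1;
        } else if b < 0xE0 {
            out.push(((b & 0x1F) << 6) | (src[i + 1] as u16 & 0x3F));
            i += 2;
        } else if b < 0xF0 {
            out.push(((b & 0x0F) << 12)
                | ((src[i + 1] as u16 & 0x3F) << 6)
                | (src[i + 2] as u16 & 0x3F));
            i += 3;
        } else {
            // Four-byte sequence -> surrogate pair; omitted here.
            unimplemented!();
        }
    }
    out
}

fn main() {
    assert_eq!(latin1_to_utf16(&[0x61, 0xE9]), vec![0x0061, 0x00E9]);
    assert_eq!(utf8_to_utf16("a\u{E9}".as_bytes()), vec![0x0061, 0x00E9]);
}

The interesting measurement is how much of that structural gap the
existing vectorized UTF-8 code paths (ASCII fast paths etc.) close in
practice.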
--
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/