I opened this as a GitHub issue (https://github.com/mozilla/servo/issues/2217) 
but I thought I'd mention it here as well.

The traditional approach for string interning is:

* Interned strings are pointers
* To intern a string, look in a hash table mapping strings to pointers
* If it's not found, allocate a copy of the string owned by the hash table, and 
insert it

Unfortunately this hash table needs to be shared between all threads that are 
creating interned strings which might be compared to each other.  If we are 
parsing CSS in parallel and two threads independently intern the same class 
name, we will incorrectly interpret the class names as not equal.

There's been much discussion (#1153, #1135) about what kind of concurrent hash 
table we should use for this.

Here is an alternative that does not require such a table:

* Interned strings are 128 bits
* To intern a string of 16 bytes or fewer, just store it "as is"
* To intern a string of more than 16 bytes, hash it using a 128-bit 
cryptographic hash function and store that.  If the hash function is not 
broken, we can assume collisions don't happen.

To prevent an attacker-chosen immediate string from colliding with a known hash 
value, set the top two bits of the first hash byte to `10`, which makes the 
first byte a UTF-8 continuation byte.  Also use hashing for any string 
containing `NULL`, so that immediate strings don't need to include their length.

So we can intern strings and compare them for equality without any shared 
state.  But there's a third operation we need to support: getting the original 
characters of an interned string.  For this we do need to build a table mapping 
hash values to strings.  But each thread can build this table independently, 
since they will always get the same result.  We can merge them later, in 
parallel with selector matching etc.  Only when we need to read out original 
strings (for CSSOM?) do we block on construction of the merged table.  And we 
don't need it at all for strings ≤ 16 bytes.

So the things to investigate here are:

* How much does a 128-bit compare cost, vs. 64 bits for a pointer?
* How much does bloating the DOM cost, in terms of memory usage and cache 
pressure?
* Can we make the cryptographic hash fast enough?  On my machine, SHA-256 of a 
64-byte string takes about 500 ns (`openssl speed sha256`).  Rust already uses 
SipHash, which is supposed to be fast and crypto-strength.  But we would need 
to run it twice to get 128 bits; 64 bits is not enough to assume collisions 
don't happen.
* How much does it cost to build and merge the hash → string tables?

We can study some of this using standalone programs before testing this 
approach in Servo.

I should also note that we can do the small-string optimization even when the 
non-optimized case is a pointer.  But in that case we'll only optimize 4- or 
8-byte strings.

keegan
_______________________________________________
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo

Reply via email to