I opened this as a GitHub issue (https://github.com/mozilla/servo/issues/2217) but I thought I'd mention it here as well.
The traditional approach for string interning is: * Interned strings are pointers * To intern a string, look in a hash table mapping strings to pointers * If it's not found, allocate a copy of the string owned by the hash table, and insert it Unfortunately this hash table needs to be shared between all threads that are creating interned strings which might be compared to each other. If we are parsing CSS in parallel and two threads independently intern the same class name, we will incorrectly interpret the class names as not equal. There's been much discussion (#1153, #1135) about what kind of concurrent hash table we should use for this. Here is an alternative that does not require such a table: * Interned strings are 128 bits * To intern a string of 16 bytes or fewer, just store it "as is" * To intern a string of more than 16 bytes, hash it using a 128-bit cryptographic hash function and store that. If the hash function is not broken, we can assume collisions don't happen. To prevent an attacker-chosen immediate string from colliding with a known hash value, set the top two bits of the first hash byte to `10`, which makes the first byte a UTF-8 continuation byte. Also use hashing for any string containing `NULL`, so that immediate strings don't need to include their length. So we can intern strings and compare them for equality without any shared state. But there's a third operation we need to support: getting the original characters of an interned string. For this we do need to build a table mapping hash values to strings. But each thread can build this table independently, since they will always get the same result. We can merge them later, in parallel with selector matching etc. Only when we need to read out original strings (for CSSOM?) do we block on construction of the merged table. And we don't need it at all for strings ≤ 16 bytes. So the things to investigate here are: * How much does a 128-bit compare cost, vs. 64 bits for a pointer? * How much does bloating the DOM cost, in terms of memory usage and cache pressure? * Can we make the cryptographic hash fast enough? On my machine, SHA-256 of a 64-byte string takes about 500 ns (`openssl speed sha256`). Rust already uses SipHash, which is supposed to be fast and crypto-strength. But we would need to run it twice to get 128 bits; 64 bits is not enough to assume collisions don't happen. * How much does it cost to build and merge the hash → string tables? We can study some of this using standalone programs before testing this approach in Servo. I should also note that we can do the small-string optimization even when the non-optimized case is a pointer. But in that case we'll only optimize 4- or 8-byte strings. keegan _______________________________________________ dev-servo mailing list dev-servo@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-servo