[dev-servo] Table-less string interning
I opened this as a GitHub issue (https://github.com/mozilla/servo/issues/2217) but I thought I'd mention it here as well. The traditional approach for string interning is: * Interned strings are pointers * To intern a string, look in a hash table mapping strings to pointers * If it's not found, allocate a copy of the string owned by the hash table, and insert it Unfortunately this hash table needs to be shared between all threads that are creating interned strings which might be compared to each other. If we are parsing CSS in parallel and two threads independently intern the same class name, we will incorrectly interpret the class names as not equal. There's been much discussion (#1153, #1135) about what kind of concurrent hash table we should use for this. Here is an alternative that does not require such a table: * Interned strings are 128 bits * To intern a string of 16 bytes or fewer, just store it "as is" * To intern a string of more than 16 bytes, hash it using a 128-bit cryptographic hash function and store that. If the hash function is not broken, we can assume collisions don't happen. To prevent an attacker-chosen immediate string from colliding with a known hash value, set the top two bits of the first hash byte to `10`, which makes the first byte a UTF-8 continuation byte. Also use hashing for any string containing `NULL`, so that immediate strings don't need to include their length. So we can intern strings and compare them for equality without any shared state. But there's a third operation we need to support: getting the original characters of an interned string. For this we do need to build a table mapping hash values to strings. But each thread can build this table independently, since they will always get the same result. We can merge them later, in parallel with selector matching etc. Only when we need to read out original strings (for CSSOM?) do we block on construction of the merged table. And we don't need it at all for strings ≤ 16 bytes. So the things to investigate here are: * How much does a 128-bit compare cost, vs. 64 bits for a pointer? * How much does bloating the DOM cost, in terms of memory usage and cache pressure? * Can we make the cryptographic hash fast enough? On my machine, SHA-256 of a 64-byte string takes about 500 ns (`openssl speed sha256`). Rust already uses SipHash, which is supposed to be fast and crypto-strength. But we would need to run it twice to get 128 bits; 64 bits is not enough to assume collisions don't happen. * How much does it cost to build and merge the hash → string tables? We can study some of this using standalone programs before testing this approach in Servo. I should also note that we can do the small-string optimization even when the non-optimized case is a pointer. But in that case we'll only optimize 4- or 8-byte strings. keegan ___ dev-servo mailing list dev-servo@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-servo
Re: [dev-servo] Table-less string interning
It seems like it would be a good idea to instrument Gecko to gather a trace of intern operations from regular Web browsing that you can run against test programs. Making the interned atom for <= 8-char or 4-char ASCII strings be the string itself is a really interesting idea. Another idea I like is to introduce a separate immutable prepopulated (with tag names, attribute values, CSS properties and keywords, known-to-be popular values, etc) table that uses the same hash values as the mutable concurrent table (so we don't have to compute the hash value twice) and use tag bits to identify the table an intern value belongs to. Seems like combining those ideas might get good results without being complex. Rob -- Jtehsauts tshaei dS,o n" Wohfy Mdaon yhoaus eanuttehrotraiitny eovni le atrhtohu gthot sf oirng iyvoeu rs ihnesa.r"t sS?o Whhei csha iids teoa stiheer :p atroa lsyazye,d 'mYaonu,r "sGients uapr,e tfaokreg iyvoeunr, 'm aotr atnod sgaoy ,h o'mGee.t" uTph eann dt hwea lmka'n? gBoutt uIp waanndt wyeonut thoo mken.o w ___ dev-servo mailing list dev-servo@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-servo
Re: [dev-servo] Table-less string interning
> Another idea I like is to introduce a separate immutable prepopulated (with > tag names, attribute values, CSS properties and keywords, known-to-be popular > values, etc) table that uses the same hash values as the mutable concurrent > table (so we don't have to compute the hash value twice) That sounds good. Right now I'm putting together static interning in the HTML parser using rust-phf [1], but converging on the same hash as the dynamic table definitely makes sense. > and use tag bits to identify the table an intern value belongs to If an interned string is a pointer to a canonical copy of the string, then most operations won't care whether it's a pointer to static or dynamic memory. And if we do need to distinguish (for refcounting or something), we can look at the memory map. keegan [1] https://github.com/sfackler/rust-phf - Original Message - From: "Robert O'Callahan" To: "Keegan McAllister" Cc: dev-servo@lists.mozilla.org Sent: Wednesday, April 23, 2014 5:35:29 PM Subject: Re: [dev-servo] Table-less string interning It seems like it would be a good idea to instrument Gecko to gather a trace of intern operations from regular Web browsing that you can run against test programs. Making the interned atom for <= 8-char or 4-char ASCII strings be the string itself is a really interesting idea. Another idea I like is to introduce a separate immutable prepopulated (with tag names, attribute values, CSS properties and keywords, known-to-be popular values, etc) table that uses the same hash values as the mutable concurrent table (so we don't have to compute the hash value twice) and use tag bits to identify the table an intern value belongs to. Seems like combining those ideas might get good results without being complex. Rob -- Jtehsauts tshaei dS,o n" Wohfy Mdaon yhoaus eanuttehrotraiitny eovni le atrhtohu gthot sf oirng iyvoeu rs ihnesa.r"t sS?o Whhei csha iids teoa stiheer :p atroa lsyazye,d 'mYaonu,r "sGients uapr,e tfaokreg iyvoeunr, 'm aotr atnod sgaoy ,h o'mGee.t" uTph eann dt hwea lmka'n? gBoutt uIp waanndt wyeonut thoo mken.o w ___ dev-servo mailing list dev-servo@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-servo
Re: [dev-servo] Table-less string interning
On 04/23/2014 04:38 PM, Keegan McAllister wrote: I opened this as a GitHub issue (https://github.com/mozilla/servo/issues/2217) but I thought I'd mention it here as well. Rust already uses SipHash, which is supposed to be fast and crypto-strength. But we would need to run it twice to get 128 bits; 64 bits is not enough to assume collisions don't happen. I'm not sure how relevant this is, but Rust does use 64-bit hashes for versioning and assumes collisions don't happen. We're assuming nobody is going to be attacking Rust symbols though, so it doesn't need to be truly 'secure'. On the other hand we also use 64-bit hashes for type ID's, and that is something somebody might want to forge for sure (though we don't use dynamic typing for much at all in Rust). ___ dev-servo mailing list dev-servo@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-servo
Re: [dev-servo] Table-less string interning
On 4/23/14, 7:38 PM, Keegan McAllister wrote: * How much does a 128-bit compare cost, vs. 64 bits for a pointer? Are we more or less assuming we're on 64-bit systems? My memory is that the 128-bit IID compares we end up doing in QueryInterface sorta suck, but we do a bunch of them at a time, which can't help Measuring is probably the right plan here, but need to make sure to do it on a variety of hardware. * How much does bloating the DOM cost, in terms of memory usage and cache pressure? Note that in Gecko the DOM does not store too much in the way of atoms. For example, element tag names are stored in nodeinfo structs (shared across all elements with the same document/localname/namespace/prefix), not in the element themselves. We do store an atom-or-nodeinfo in nsAttrName in Gecko to avoid the extra indirection in the super-common case of an attribute in the null namespace, but obviously one could always use nodeinfo here. So I suspect memory usage is not a huge issue in practice. Cache pressure is a much bigger deal, unfortunately. Selector matching in particular is not all that cache-friendly in Gecko * Can we make the cryptographic hash fast enough? We should see how often we end up atomizing things that are over 16 chars. I guess a fairly large number of IDs and classes might in fact be there... -Boris ___ dev-servo mailing list dev-servo@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-servo
Re: [dev-servo] Table-less string interning
On 4/23/14, 8:53 PM, Brian Anderson wrote: We're assuming nobody is going to be attacking Rust symbols though Oh, right, that's the other worry. We've had security issues in the past due to things like type="fİle"> being treated as a file input by some parts of the system but not others. So anything that allows collisions between user-provided things and built-in atoms for attribute names and values is bad. On the bright side, it may be possible to enforce that all built-in atoms are under the 16 char limit and hence not susceptible to collisions. -Boris ___ dev-servo mailing list dev-servo@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-servo
Re: [dev-servo] Table-less string interning
On 4/23/14 6:06 PM, Boris Zbarsky wrote: On 4/23/14, 8:53 PM, Brian Anderson wrote: We're assuming nobody is going to be attacking Rust symbols though Oh, right, that's the other worry. We've had security issues in the past due to things like being treated as a file input by some parts of the system but not others. So anything that allows collisions between user-provided things and built-in atoms for attribute names and values is bad. On the bright side, it may be possible to enforce that all built-in atoms are under the 16 char limit and hence not susceptible to collisions. Disclaimer: I am not a cryptographer; please correct me mercilessly. A collision in a widely used cryptographic hash would be a major, publishable security advance. SipHash was concerned with attacks that could somehow exploit the properties of `hash(message) % hash_table_size`, but we're taking the whole hash here. So I don't think that would be an issue. (We could mix in a secret, per-browser-session nonce if we were really paranoid, but since collisions are a death sentence for cryptographic hashes in the first place it's likely overkill.) I'm more concerned about the potential impact of a large hash on memory bandwidth (e.g. if we want to match multiple selectors in parallel using SIMD), but I haven't had a whole lot of luck with techniques that would benefit from that anyhow. So I'm totally fine with this move, assuming its performance characteristics are otherwise favorable; best not to hold up things that we know *do* provide benefits in favor of nebulous performance concerns for algorithms we don't know how to make work in the first place. As an added note, looks like Skylake is going to have SHA instructions [1]. Based on my experience with AES-NI I wouldn't be surprised if using these instructions ends up faster than interning in software anyway. Patrick [1]: http://en.wikipedia.org/wiki/Intel_SHA_extensions ___ dev-servo mailing list dev-servo@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-servo