[dev-servo] Table-less string interning

2014-04-23 Thread Keegan McAllister
I opened this as a GitHub issue (https://github.com/mozilla/servo/issues/2217) 
but I thought I'd mention it here as well.

The traditional approach for string interning is:

* Interned strings are pointers
* To intern a string, look in a hash table mapping strings to pointers
* If it's not found, allocate a copy of the string owned by the hash table, and 
insert it

Unfortunately this hash table needs to be shared between all threads that are 
creating interned strings which might be compared to each other.  If we are 
parsing CSS in parallel and two threads independently intern the same class 
name, we will incorrectly interpret the class names as not equal.

There's been much discussion (#1153, #1135) about what kind of concurrent hash 
table we should use for this.

Here is an alternative that does not require such a table:

* Interned strings are 128 bits
* To intern a string of 16 bytes or fewer, just store it "as is"
* To intern a string of more than 16 bytes, hash it using a 128-bit 
cryptographic hash function and store that.  If the hash function is not 
broken, we can assume collisions don't happen.

To prevent an attacker-chosen immediate string from colliding with a known hash 
value, set the top two bits of the first hash byte to `10`, which makes the 
first byte a UTF-8 continuation byte.  Also use hashing for any string 
containing `NULL`, so that immediate strings don't need to include their length.

So we can intern strings and compare them for equality without any shared 
state.  But there's a third operation we need to support: getting the original 
characters of an interned string.  For this we do need to build a table mapping 
hash values to strings.  But each thread can build this table independently, 
since they will always get the same result.  We can merge them later, in 
parallel with selector matching etc.  Only when we need to read out original 
strings (for CSSOM?) do we block on construction of the merged table.  And we 
don't need it at all for strings ≤ 16 bytes.

So the things to investigate here are:

* How much does a 128-bit compare cost, vs. 64 bits for a pointer?
* How much does bloating the DOM cost, in terms of memory usage and cache 
pressure?
* Can we make the cryptographic hash fast enough?  On my machine, SHA-256 of a 
64-byte string takes about 500 ns (`openssl speed sha256`).  Rust already uses 
SipHash, which is supposed to be fast and crypto-strength.  But we would need 
to run it twice to get 128 bits; 64 bits is not enough to assume collisions 
don't happen.
* How much does it cost to build and merge the hash → string tables?

We can study some of this using standalone programs before testing this 
approach in Servo.

I should also note that we can do the small-string optimization even when the 
non-optimized case is a pointer.  But in that case we'll only optimize 4- or 
8-byte strings.

keegan
___
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo


Re: [dev-servo] Table-less string interning

2014-04-23 Thread Robert O'Callahan
It seems like it would be a good idea to instrument Gecko to gather a trace
of intern operations from regular Web browsing that you can run against
test programs.

Making the interned atom for <= 8-char or 4-char ASCII strings be the
string itself is a really interesting idea.

Another idea I like is to introduce a separate immutable prepopulated (with
tag names, attribute values, CSS properties and keywords, known-to-be
popular values, etc) table that uses the same hash values as the mutable
concurrent table (so we don't have to compute the hash value twice) and use
tag bits to identify the table an intern value belongs to.

Seems like combining those ideas might get good results without being
complex.

Rob
-- 
Jtehsauts  tshaei dS,o n" Wohfy  Mdaon  yhoaus  eanuttehrotraiitny  eovni
le atrhtohu gthot sf oirng iyvoeu rs ihnesa.r"t sS?o  Whhei csha iids  teoa
stiheer :p atroa lsyazye,d  'mYaonu,r  "sGients  uapr,e  tfaokreg iyvoeunr,
'm aotr  atnod  sgaoy ,h o'mGee.t"  uTph eann dt hwea lmka'n?  gBoutt  uIp
waanndt  wyeonut  thoo mken.o w
___
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo


Re: [dev-servo] Table-less string interning

2014-04-23 Thread Keegan McAllister
> Another idea I like is to introduce a separate immutable prepopulated (with 
> tag names, attribute values, CSS properties and keywords, known-to-be popular 
> values, etc) table that uses the same hash values as the mutable concurrent 
> table (so we don't have to compute the hash value twice)

That sounds good.  Right now I'm putting together static interning in the HTML 
parser using rust-phf [1], but converging on the same hash as the dynamic table 
definitely makes sense.

> and use tag bits to identify the table an intern value belongs to

If an interned string is a pointer to a canonical copy of the string, then most 
operations won't care whether it's a pointer to static or dynamic memory.  And 
if we do need to distinguish (for refcounting or something), we can look at the 
memory map.

keegan

[1] https://github.com/sfackler/rust-phf

- Original Message -
From: "Robert O'Callahan" 
To: "Keegan McAllister" 
Cc: dev-servo@lists.mozilla.org
Sent: Wednesday, April 23, 2014 5:35:29 PM
Subject: Re: [dev-servo] Table-less string interning

It seems like it would be a good idea to instrument Gecko to gather a trace
of intern operations from regular Web browsing that you can run against
test programs.

Making the interned atom for <= 8-char or 4-char ASCII strings be the
string itself is a really interesting idea.

Another idea I like is to introduce a separate immutable prepopulated (with
tag names, attribute values, CSS properties and keywords, known-to-be
popular values, etc) table that uses the same hash values as the mutable
concurrent table (so we don't have to compute the hash value twice) and use
tag bits to identify the table an intern value belongs to.

Seems like combining those ideas might get good results without being
complex.

Rob
-- 
Jtehsauts  tshaei dS,o n" Wohfy  Mdaon  yhoaus  eanuttehrotraiitny  eovni
le atrhtohu gthot sf oirng iyvoeu rs ihnesa.r"t sS?o  Whhei csha iids  teoa
stiheer :p atroa lsyazye,d  'mYaonu,r  "sGients  uapr,e  tfaokreg iyvoeunr,
'm aotr  atnod  sgaoy ,h o'mGee.t"  uTph eann dt hwea lmka'n?  gBoutt  uIp
waanndt  wyeonut  thoo mken.o w
___
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo


Re: [dev-servo] Table-less string interning

2014-04-23 Thread Brian Anderson

On 04/23/2014 04:38 PM, Keegan McAllister wrote:

I opened this as a GitHub issue (https://github.com/mozilla/servo/issues/2217) 
but I thought I'd mention it here as well.

Rust already uses SipHash, which is supposed to be fast and crypto-strength.  
But we would need to run it twice to get 128 bits; 64 bits is not enough to 
assume collisions don't happen.


I'm not sure how relevant this is, but Rust does use 64-bit hashes for 
versioning and assumes collisions don't happen. We're assuming nobody is 
going to be attacking Rust symbols though, so it doesn't need to be 
truly 'secure'. On the other hand we also use 64-bit hashes for type 
ID's, and that is something somebody might want to forge for sure 
(though we don't use dynamic typing for much at all in Rust).


___
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo


Re: [dev-servo] Table-less string interning

2014-04-23 Thread Boris Zbarsky

On 4/23/14, 7:38 PM, Keegan McAllister wrote:

* How much does a 128-bit compare cost, vs. 64 bits for a pointer?


Are we more or less assuming we're on 64-bit systems?

My memory is that the 128-bit IID compares we end up doing in 
QueryInterface sorta suck, but we do a bunch of them at a time, which 
can't help


Measuring is probably the right plan here, but need to make sure to do 
it on a variety of hardware.



* How much does bloating the DOM cost, in terms of memory usage and cache 
pressure?


Note that in Gecko the DOM does not store too much in the way of atoms. 
 For example, element tag names are stored in nodeinfo structs (shared 
across all elements with the same document/localname/namespace/prefix), 
not in the element themselves.


We do store an atom-or-nodeinfo in nsAttrName in Gecko to avoid the 
extra indirection in the super-common case of an attribute in the null 
namespace, but obviously one could always use nodeinfo here.


So I suspect memory usage is not a huge issue in practice.

Cache pressure is a much bigger deal, unfortunately.  Selector matching 
in particular is not all that cache-friendly in Gecko



* Can we make the cryptographic hash fast enough?


We should see how often we end up atomizing things that are over 16 chars.

I guess a fairly large number of IDs and classes might in fact be there...

-Boris
___
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo


Re: [dev-servo] Table-less string interning

2014-04-23 Thread Boris Zbarsky

On 4/23/14, 8:53 PM, Brian Anderson wrote:

We're assuming nobody is going to be attacking Rust symbols though


Oh, right, that's the other worry.

We've had security issues in the past due to things like type="fİle"> being treated as a file input by some parts of the system 
but not others.  So anything that allows collisions between 
user-provided things and built-in atoms for attribute names and values 
is bad.


On the bright side, it may be possible to enforce that all built-in 
atoms are under the 16 char limit and hence not susceptible to collisions.


-Boris
___
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo


Re: [dev-servo] Table-less string interning

2014-04-23 Thread Patrick Walton

On 4/23/14 6:06 PM, Boris Zbarsky wrote:

On 4/23/14, 8:53 PM, Brian Anderson wrote:

We're assuming nobody is going to be attacking Rust symbols though


Oh, right, that's the other worry.

We've had security issues in the past due to things like  being treated as a file input by some parts of the system
but not others.  So anything that allows collisions between
user-provided things and built-in atoms for attribute names and values
is bad.

On the bright side, it may be possible to enforce that all built-in
atoms are under the 16 char limit and hence not susceptible to collisions.


Disclaimer: I am not a cryptographer; please correct me mercilessly.

A collision in a widely used cryptographic hash would be a major, 
publishable security advance. SipHash was concerned with attacks that 
could somehow exploit the properties of `hash(message) % 
hash_table_size`, but we're taking the whole hash here. So I don't think 
that would be an issue. (We could mix in a secret, per-browser-session 
nonce if we were really paranoid, but since collisions are a death 
sentence for cryptographic hashes in the first place it's likely overkill.)


I'm more concerned about the potential impact of a large hash on memory 
bandwidth (e.g. if we want to match multiple selectors in parallel using 
SIMD), but I haven't had a whole lot of luck with techniques that would 
benefit from that anyhow. So I'm totally fine with this move, assuming 
its performance characteristics are otherwise favorable; best not to 
hold up things that we know *do* provide benefits in favor of nebulous 
performance concerns for algorithms we don't know how to make work in 
the first place.


As an added note, looks like Skylake is going to have SHA instructions 
[1]. Based on my experience with AES-NI I wouldn't be surprised if using 
these instructions ends up faster than interning in software anyway.


Patrick

[1]: http://en.wikipedia.org/wiki/Intel_SHA_extensions

___
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo