> On Feb 26, 2019, at 3:30 AM, INADA Naoki <songofaca...@gmail.com> wrote:
> 
> I'm working on compact and ordered set implementation.
> It has internal data structure similar to new dict from Python 3.6.

I've looked at this as well.  Some thoughts:

* Set objects have a different and conflicting optimization that works better 
for a broad range of use cases.  In particular, there is a linear probing 
search step that gives excellent cache performance (multiple entries retrieved 
in a single cache line) and it reduces the cost of finding the next entry to a 
single increment (entry++). This greatly reduces the cost of collisions and 
makes it cheaper to verify an item is not in a set. 
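
The probing strategy can be sketched roughly in Python (the constants, names,
and perturb scheme here are illustrative approximations of the idea, not the
actual C code in setobject.c):

```python
# Rough sketch of the set probing strategy described above: after the
# initial hash-derived slot, scan a short run of adjacent slots (the
# cheap entry++ steps, which tend to stay within one cache line) before
# falling back to a pseudo-random jump to another region of the table.
LINEAR_PROBES = 9        # illustrative constant
PERTURB_SHIFT = 5        # illustrative constant

def find_slot(table, mask, h, key):
    """Return the index holding key, or the first empty slot found.

    Assumes the table is kept sparse enough that an empty slot exists.
    """
    i = h & mask
    perturb = h & 0xFFFFFFFFFFFFFFFF   # treat the hash as unsigned
    while True:
        if table[i] is None or table[i] == key:
            return i
        # Linear probe run: adjacent entries, excellent cache locality.
        limit = min(i + LINEAR_PROBES, mask)
        for j in range(i + 1, limit + 1):
            if table[j] is None or table[j] == key:
                return j
        # No luck nearby: jump to a different region of the table.
        perturb >>= PERTURB_SHIFT
        i = (i * 5 + perturb + 1) & mask
```

A miss on the first probe usually resolves within the adjacent run, which is
why collisions stay cheap in this design.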

* The technique for compaction involves making the key/hash entry array dense 
and augmenting it with a sparse array of indices.  This necessarily involves 
adding a layer of indirection for every probe.
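
To make that indirection concrete, here is a minimal Python sketch of the
compact layout (no resizing or deletion; the class and names are mine for
illustration, not CPython's):

```python
# Sketch of the compact ("dict-style") layout: a sparse index table
# pointing into a dense entry array.  Every probe now needs two reads --
# one into `indices`, then one into `entries` -- which is the extra
# layer of indirection described above.
EMPTY = -1

class CompactSet:
    def __init__(self, size=8):
        self.indices = [EMPTY] * size  # sparse: slot -> entry position
        self.entries = []              # dense: (hash, key) in insertion order
        self.mask = size - 1

    def _probe(self, h, key):
        i = h & self.mask
        perturb = h & 0xFFFFFFFFFFFFFFFF
        while True:
            ix = self.indices[i]           # first read: index table
            if ix == EMPTY:
                return i, EMPTY
            eh, ekey = self.entries[ix]    # second read: entry array
            if eh == h and ekey == key:
                return i, ix
            perturb >>= 5
            i = (i * 5 + perturb + 1) & self.mask

    def add(self, key):
        h = hash(key)
        i, ix = self._probe(h, key)
        if ix == EMPTY:
            self.indices[i] = len(self.entries)
            self.entries.append((h, key))

    def __contains__(self, key):
        return self._probe(hash(key), key)[1] != EMPTY
```

Note that even a first-probe hit costs two dependent memory reads here, where
the current set design needs only one.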

* With the cache misses, branching costs, and extra layer of indirection, 
collisions would stop being cheap, so we would need to work to avoid them 
altogether. To get anything like the current performance for a collision on 
the first probe, I suspect we would have to lower the table density from 60% 
down to 25%.  

* The intersection operation has an important optimization where it loops over 
the smaller of its two inputs.  To give a guaranteed order that preserves the 
order of the first input, you would have to forgo this optimization, possibly 
crippling any existing code that depends on it.
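
The optimization in question amounts to something like this (a simplified
Python rendering of the idea, not the C implementation):

```python
def intersection(a, b):
    # Iterate over the smaller operand so the work is
    # O(min(len(a), len(b))) lookups.  An order guarantee tied to the
    # *first* operand would force iterating over `a` even when it is
    # the huge one, losing this optimization.
    if len(a) > len(b):
        a, b = b, a
    return {x for x in a if x in b}
```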

* Maintaining order in the face of deletions adds a new workload to sets that 
didn't exist before. You risk thrashing the set to support a feature that 
hasn't been asked for and that isn't warranted mathematically (where the 
notion of a set is unordered).
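
For a sense of that new workload: in a compact layout, a deletion leaves a
hole in the dense entry array, and every later iteration pays to skip it (a
hypothetical illustration, not CPython code):

```python
# A deletion from the dense entry array either shifts every later entry
# down (O(n) per delete) or leaves a tombstone that every subsequent
# iteration must branch around.  Neither cost exists in today's sets.
DUMMY = object()                   # tombstone marking a deleted entry

entries = ["a", "b", "c", "d"]     # dense, insertion-ordered entries
entries[1] = DUMMY                 # delete "b" without shifting
ordered = [e for e in entries if e is not DUMMY]   # skip holes on every pass
```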

* It takes a lot of care and planning to avoid fooling yourself with benchmarks 
on sets.  Anything done with a small tight loop will tend to hide all branch 
prediction costs and cache miss costs, both of which are significant in real 
world uses of sets.
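
For example, a benchmark shaped like this keeps the whole set resident in L1
cache and trains the branch predictor perfectly, so it will understate
real-world costs (illustrative of the pitfall, not a recommended methodology):

```python
# A tight loop over one small set: the table stays hot in cache and the
# membership branch is perfectly predictable, so collision and cache
# miss costs -- the very things being debated -- barely register.
import timeit

s = set(range(1000))
elapsed = timeit.timeit("500 in s", globals={"s": s}, number=100_000)
```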

* For sets, we care much more about look-up performance than space.  And unlike 
dicts, where we usually expect to find a key, sets are all about checking 
membership, which means they have to be balanced for the case where the key is 
not in the set.

* Having and preserving order is one of the least important things a set can 
offer (it does have some value, but it was never contemplated by the original 
set PEP).

After the success of the compact dict, I can understand an almost irresistible 
urge to apply the same technique to sets.  If it were clear that it was a win, 
I would have already done it long ago, even before dicts (it was much harder 
to get buy-in for changing the dicts).  Please temper the enthusiasm with 
rationality and caution.  The existing setobject code has been finely tuned 
and micro-optimized over the years, giving it excellent performance on 
workloads we care about.  It would be easy to throw all of that away.


Raymond
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev