Raymond Hettinger added the comment:
> How does this change affect this case?
I would expect to see a tiny net benefit. The set.add() would be
microscopically faster (because of less work finding and going back to a free
slot). The set.discard() would do the same work (locating a value and marking
it as a dummy). The resize step would run at the same speed (it just skips
over dummies and reinserts only active values). The resize would run a little
earlier (because we're not reusing a small proportion of the dummy entries) but
it would clean out 100% of the dummies, making the table more sparse sooner.
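To make those mechanics concrete, here is a toy pure-Python sketch of an open-addressing table that behaves as described: insertion skips over dummies without remembering a freeslot, deletion just marks a dummy, and the resize step alone reclaims dummies in one pass. The class, names, probing scheme, and growth policy are all illustrative inventions; CPython's actual implementation is the C code in Objects/setobject.c and differs in many details (perturbed probing, linear probing within cache lines, etc.).

```python
EMPTY = object()
DUMMY = object()

class TinySet:
    """Toy open-addressing set with linear probing (illustration only)."""

    def __init__(self):
        self.table = [EMPTY] * 8
        self.used = 0   # active entries
        self.fill = 0   # active entries plus dummies

    def _probe(self, value):
        # Return the slot holding `value`, or the first truly EMPTY slot.
        # Dummies are simply skipped; no freeslot is remembered.
        i = hash(value) % len(self.table)
        while self.table[i] is not EMPTY:
            if self.table[i] is not DUMMY and self.table[i] == value:
                return i
            i = (i + 1) % len(self.table)
        return i

    def __contains__(self, value):
        return self.table[self._probe(value)] is not EMPTY

    def add(self, value):
        i = self._probe(value)
        if self.table[i] is EMPTY:
            self.table[i] = value
            self.used += 1
            self.fill += 1              # dummies never free up fill here...
            if self.fill * 3 >= len(self.table) * 2:
                self._resize()          # ...so the resize comes a bit earlier

    def discard(self, value):
        i = self._probe(value)
        if self.table[i] is not EMPTY:
            self.table[i] = DUMMY       # same work as before: mark a dummy
            self.used -= 1

    def _resize(self):
        # Reinsert only active values; 100% of the dummies vanish in one pass.
        old, self.table = self.table, [EMPTY] * (len(self.table) * 4)
        self.used = self.fill = 0
        for v in old:
            if v is not EMPTY and v is not DUMMY:
                self.add(v)
```

Note that discard() leaves fill unchanged, so a workload mixing adds and discards reaches the resize threshold sooner, and the resize restores a fully sparse, dummy-free table.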
> Sets are often used in the following pattern
Not really. This is one use pattern out of many and is one of the least
common. The two dominant use cases for sets are uniquification (deduping) and
fast membership testing. The third most common case is data analysis using
fast set-to-set operations (union, intersection, and difference). Then come
cases with an individual-element membership test followed by an individual
set.add() (i.e. the "if x not in seen: {seen.add(x); work_on(x);}" case).
Dead last are the affected cases that bounce around with a mix of
set.add() and set.discard() or set.remove().
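In Python terms, the ranked patterns look roughly like this (the data and variable names are made up for illustration):

```python
data = ["a", "b", "a", "c", "b"]

# 1) Uniquification (deduping)
unique = set(data)

# 2) Fast membership testing
allowed = {"a", "b", "c"}
hits = [x for x in data if x in allowed]

# 3) Data analysis with fast set-to-set operations
s, t = {1, 2, 3}, {2, 3, 4}
both, either, only_s = s & t, s | t, s - t

# 4) Membership test followed by an individual add
seen = set()
for x in data:
    if x not in seen:
        seen.add(x)
        # work_on(x) would go here

# 5) (least common) a mix of adds and discards
live = set()
live.add(1)
live.discard(1)   # internally leaves a dummy entry in the table
```

Only pattern 5 ever creates dummy entries, which is why a freeslot check in set_add_entry is pure overhead for patterns 1 through 4.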
The comment in dictobject.c suggests that the latter case is uncommon by a
factor of hundreds. Hence, I think work should be taken out of the inner loop
for the common cases and instead deferred to the high-speed resize step which
cleans out 100% of the dummy entries all at once.
The common cases all benefit (none of those have dummies, so there is no reason
at all to check for dummies in set_add_entry). The uncommon case (the mix of
individual adds and discards) is about neutral or slightly positive (as
explained above).
I've tried to think of a case that would be negatively affected and all I can
think of is repeatedly adding and removing exactly the *same* element or small
group of elements. In that case, the freeslot would be reused 100% of the time
and the set would never need a resize. I've not seen such a case, and if I had, I
would still care about the common cases more.
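A minimal illustration of that hypothetical worst case (the observable behavior is identical under either scheme; only the internal table traffic differs):

```python
# Churn on exactly one element.  With freeslot reuse, the dummy left by
# each discard() is reclaimed by the very next add(), so fill never grows
# and no resize is ever triggered.  Without freeslot reuse, each add()
# lands on a fresh slot until the periodic resize clears the dummies.
s = set()
for _ in range(100_000):
    s.add("x")
    s.discard("x")

# Either way, the set is externally empty at the end.
assert len(s) == 0
```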
Also, I like the simplification of set_add_entry() and the separation of
concerns (dummy reclamation occurs in exactly one place and that one place
eliminates 100% of the dummies in a single pass).
FWIW, there is a loose analogy between this idea and the trade-off between
reference counting and GC. Reference counting reuses memory quicker than
waiting for GC to reclaim memory in one pass later, but it entails encumbering
all of the setting and deleting code. GC-only systems make the rest of the
code much cleaner and faster, but they have to wait to reclaim memory all at
once. Where the analogy fails, though, is that reuse of dummies in sets
is by far the least common case, that early freeslot checks only recapture a
small fraction of the deletions (only the ones that happen to have exactly the
same hash slot), and that early freeslot checks are completely wasted in all of
the common cases (which typically have no dummies at all).
----------
nosy: +tim.peters
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue29476>
_______________________________________