Sorry, Saad; this is now *way* off topic...

On Thu, Oct 25, 2012 at 5:39 PM, Oscar Benjamin
<oscar.j.benja...@gmail.com> wrote:
>
>> degrades if there are many collisions). On average, you can check if a
>> set/dict contains an item in constant time, i.e. O(1). The amortized
>> worst case is O(n).
>
> Why do you say "*amortized* worst case"? Is there an occasional worse
> than O(n) operation that is insignificant when amortised?
>
> At first I assumed that was a slip of the tongue
Yes, it was badly stated. I've only implemented "separate chaining"
hash tables, which use slots that are linked lists or dynamic arrays.
Python uses an "open addressing" implementation. It seems to me that
with this implementation the worst case lookup is O(n), where n is the
table size. In the worst case you've removed all but 1 of a group of
keys that all hash the same, and the lookup has to probe through every
"dummy key" left in the table (I know this is insanely unrealistic).
With separate chaining, on the other hand, the worst case lookup is
O(k), where k is the number of keys.

Here's a concrete example to give a feel for what the hash table looks
like as keys are added and removed. First, define some structures to
peek at the underlying table of a set, plus a pathological class whose
instances always hash as 37:

http://hg.python.org/cpython/file/70274d53c1dd/Include/setobject.h#l21

from ctypes import *

PySet_MINSIZE = 8

class SetEntry(Structure):
    # mirrors setentry in Include/setobject.h
    _fields_ = [
        ('hash', c_long),
        ('_key', c_void_p),
    ]

    @property
    def key(self):
        if self._key:
            return cast(self._key, py_object)
        else:
            return 0

class SetObject(Structure):
    # mirrors PySetObject in Include/setobject.h
    _fields_ = [
        ('ob_refcnt', c_ssize_t),
        ('ob_type', py_object),
        ('fill', c_ssize_t),    # active + dummy entries
        ('used', c_ssize_t),    # active entries
        ('mask', c_ssize_t),    # table length == mask + 1
        ('table', POINTER(SetEntry)),
        ('lookup', c_void_p),
        ('smalltable', SetEntry * PySet_MINSIZE),
        ('hash', c_long),
        ('weakreflist', c_void_p),
    ]

    @classmethod
    def from_set(cls, s):
        if not isinstance(s, set):
            raise TypeError('must be a set')
        self = cls.from_address(id(s))
        self._set = s  # keep a reference
        return self

    def __iter__(self):
        for entrynum in xrange(self.mask + 1):
            entry = self.table[entrynum]
            yield entry.key

class C(object):
    def __hash__(self):
        return 37
    def __repr__(self):
        return 'C %d' % id(self)

Add 20 of these pathological objects to a set and then remove all but
1 of them. The table now has 19 dummy keys, 1 used slot, and 12 unused
slots:

>>> c_list = [C() for i in range(20)]
>>> s1 = set(c_list)
>>> s2 = s1.copy()
>>> c_keep = c_list.pop(10)
>>> for c in c_list: s1.remove(c)
...
>>> so1 = SetObject.from_set(s1)
>>> list(so1)
[py_object('<dummy key>'), py_object('<dummy key>'),
 py_object('<dummy key>'), 0, py_object('<dummy key>'),
 py_object('<dummy key>'), py_object('<dummy key>'),
 py_object('<dummy key>'), py_object('<dummy key>'),
 py_object('<dummy key>'), py_object('<dummy key>'),
 py_object('<dummy key>'), 0, py_object('<dummy key>'),
 py_object('<dummy key>'), 0, 0, 0, py_object('<dummy key>'),
 py_object('<dummy key>'), 0, py_object(C 3072909996), 0, 0, 0, 0, 0,
 py_object('<dummy key>'), py_object('<dummy key>'),
 py_object('<dummy key>'), 0, py_object('<dummy key>')]

Since all the objects hash the same, the number of dummy keys the
lookup has to skip before finding c_keep will vary depending on the
table history. I did a few timeit tests on large pathological sets,
and a lookup of the remaining key took anywhere from 4 to 25 times
longer when the table was in this state.

In contrast, difference_update() resizes and cleans up the table
because it's more than 1/5 dummies (it resizes down to using the
8-element "smalltable" embedded in the set object):

>>> s2.difference_update(c_list)
>>> so2 = SetObject.from_set(s2)
>>> list(so2)
[0, 0, 0, 0, 0, py_object(C 3072909996), 0, 0]

Inserting keys can also trigger a resize, if 3*fill > 2*(mask + 1). In
other words, it resizes if the table is more than 2/3 used, where
"fill" counts active plus dummy entries and mask + 1 is the table
length.
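As a sanity check, here's a rough pure-Python sketch of that rule (my
reading of the resize logic in Objects/setobject.c, not the C code
itself):

PySet_MINSIZE = 8

def needs_resize(fill, mask):
    # More than 2/3 of the slots hold active or dummy entries.
    return 3 * fill > 2 * (mask + 1)

def new_size(used):
    # Smallest power of 2, starting from PySet_MINSIZE, that is
    # greater than 4*used.
    n = PySet_MINSIZE
    while n <= 4 * used:
        n <<= 1
    return n

Plugging in the table above:

>>> needs_resize(20, 31)   # 19 dummies + 1 active in 32 slots
False
>>> needs_resize(22, 31)   # two more live entries: 66 > 64
True
>>> new_size(3)            # used == 3: first power of 2 > 12
16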
In this case, if the numbers 3 and 12 are added to the set, they hash
to the unused table slots 3 and 12 (adding more C instances would just
refill the dummy slots). This increases "fill" to 22 and triggers a
table resize. The table is resized to the first power of 2 greater
than 4*used == 4*3 == 12, which is 16 in this case. (Pedantically,
it's a power of 2 times PySet_MINSIZE, which is 8.)

>>> s1.add(3)
>>> s1.add(12)
>>> list(so1)
[0, 0, 0, py_object(3), 0, py_object(C 3072967820), 0, 0,
 0, 0, 0, 0, py_object(12), 0, 0, 0]
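By the way, the reason the leftover dummies hurt is visible in the
probe loop of set_lookkey(). Here's a simplified pure-Python rendering
of it, from memory of the 2.7-era source; it glosses over the freeslot
reuse and the identity-before-equality shortcuts, but it shows the
essential point: a dummy entry can't stop the probe sequence, only an
empty slot or a matching key can:

DUMMY = object()  # stand-in for the C-level dummy key

def probe(table, mask, h, key):
    # table is a list of slots: None means never used, DUMMY means
    # deleted. A real table always keeps some empty slots (fill is
    # capped at 2/3 of the size), so the loop terminates.
    i = h & mask
    perturb = h
    while True:
        entry = table[i & mask]
        if entry is None:       # truly empty slot: key not present
            return None
        if entry is not DUMMY and entry == key:
            return entry        # found it
        # A dummy or a colliding key: keep probing.
        i = (i << 2) + i + perturb + 1   # i = i*5 + perturb + 1
        perturb >>= 5                    # PERTURB_SHIFT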
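And for anyone who wants to reproduce the timing comparison, something
along these lines works. It's only a sketch: 2000 colliding keys keeps
the quadratic setup tolerable, and the exact dirty/clean ratio will
vary with the machine and the table history:

import timeit

setup = """
class C(object):
    def __hash__(self):
        return 37

c_list = [C() for i in range(2000)]
dirty = set(c_list)
c_keep = c_list.pop()
for c in c_list:
    dirty.remove(c)       # leaves 1999 dummy entries behind
clean = set([c_keep])     # fresh table, no dummies
"""

dirty_t = timeit.timeit('c_keep in dirty', setup=setup, number=100000)
clean_t = timeit.timeit('c_keep in clean', setup=setup, number=100000)
print 'dirty lookup is %.1fx slower' % (dirty_t / clean_t)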