Hi Raymond,

On 08.01.13 15:49, Maciej Fijalkowski wrote:
On Mon, Dec 10, 2012 at 3:44 AM, Raymond Hettinger
<raymond.hettin...@gmail.com> wrote:
The current memory layout for dictionaries is
unnecessarily inefficient.  It has a sparse table of
24-byte entries containing the hash value, key pointer,
and value pointer.

Instead, the 24-byte entries should be stored in a
dense table referenced by a sparse table of indices.

For example, the dictionary:

     d = {'timmy': 'red', 'barry': 'green', 'guido': 'blue'}

is currently stored as:

     entries = [['--', '--', '--'],
                [-8522787127447073495, 'barry', 'green'],
                ['--', '--', '--'],
                ['--', '--', '--'],
                ['--', '--', '--'],
                [-9092791511155847987, 'timmy', 'red'],
                ['--', '--', '--'],
                [-6480567542315338377, 'guido', 'blue']]

Instead, the data should be organized as follows:

     indices =  [None, 1, None, None, None, 0, None, 2]
     entries =  [[-9092791511155847987, 'timmy', 'red'],
                 [-8522787127447073495, 'barry', 'green'],
                 [-6480567542315338377, 'guido', 'blue']]

Only the data layout needs to change.  The hash table
algorithms would stay the same.  All of the current
optimizations would be kept, including key-sharing
dicts and custom lookup functions for string-only
dicts.  There is no change to the hash functions, the
table search order, or collision statistics.

The memory savings are significant (from 30% to 95%
compression depending on the how full the table is).
Small dicts (size 0, 1, or 2) get the most benefit.

For a sparse table of size t with n entries, the sizes are:

     curr_size = 24 * t
     new_size = 24 * n + sizeof(index) * t

In the above timmy/barry/guido example, the current
size is 192 bytes (eight 24-byte entries) and the new
size is 80 bytes (three 24-byte entries plus eight
1-byte indices).  That gives 58% compression.

Note, the sizeof(index) can be as small as a single
byte for small dicts, two bytes for bigger dicts and
up to sizeof(Py_ssize_t) for huge dict.

In addition to space savings, the new memory layout
makes iteration faster.  Currently, keys(), values, and
items() loop over the sparse table, skipping-over free
slots in the hash table.  Now, keys/values/items can
loop directly over the dense table, using fewer memory
accesses.

Another benefit is that resizing is faster and
touches fewer pieces of memory.  Currently, every
hash/key/value entry is moved or copied during a
resize.  In the new layout, only the indices are
updated.  For the most part, the hash/key/value entries
never move (except for an occasional swap to fill a
hole left by a deletion).

With the reduced memory footprint, we can also expect
better cache utilization.

For those wanting to experiment with the design,
there is a pure Python proof-of-concept here:

    http://code.activestate.com/recipes/578375

YMMV: Keep in mind that the above size statics assume a
build with 64-bit Py_ssize_t and 64-bit pointers.  The
space savings percentages are a bit different on other
builds.  Also, note that in many applications, the size
of the data dominates the size of the container (i.e.
the weight of a bucket of water is mostly the water,
not the bucket).


Raymond
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/fijall%40gmail.com
One question Raymond.

The compression ratios stay true provided you don't overallocate entry
list. If you do overallocate you don't really gain that much (it all
depends vastly on details), or even loose in some cases. What do you
think should the strategy be?


What is the current status of this discussion?
I'd like to know whether it is a considered alternative implementation.

There is also a discussion in python-ideas right now where this
alternative is mentioned, and I think especially for small dicts
as **kwargs, it could be a cheap way to introduce order.

Is this going on, somewhere? I'm quite interested on that.

cheers - chris

--
Christian Tismer             :^)   <mailto:tis...@stackless.com>
Software Consulting          :     Have a break! Take a ride on Python's
Karl-Liebknecht-Str. 121     :    *Starship* http://starship.python.net/
14482 Potsdam                :     PGP key -> http://pgp.uni-mainz.de
phone +49 173 24 18 776  fax +49 (30) 700143-0023
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
      whom do you want to sponsor today?   http://www.stackless.com/

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Reply via email to