[Python-Dev] CPython optimization: storing reference counters outside of objects

2011-05-21 Thread Artur Siekielski
Hi.
The problem with reference counters is that they are incremented and
decremented very often, even by read-only algorithms (like traversal
of a list). This has two drawbacks:
1. CPU cache lines (64 bytes on x86) containing the beginning of a
PyObject are invalidated very often, losing many chances to use the
CPU caches.
2. The copy-on-write optimization after fork() (on Linux) is almost
useless in CPython, because even if you don't modify data directly,
the refcounts are modified, and PyObjects with refcounts inside are
spread all over the process's memory - so one small refcount
modification causes the whole 4 kB page to be copied into the child
process.
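
To make drawback 2 concrete, here is a minimal sketch of the effect
(my own illustration, Linux-only, CPython-specific): a child process
that only *reads* an inherited list still ends up copying nearly every
page holding the list's items.

import os
import sys

def rss_kb():
    # Resident set size of the current process in kB (Linux /proc format).
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])

data = list(range(5 * 10**6))  # big structure built by the parent

pid = os.fork()
if pid == 0:
    before = rss_kb()
    for item in data:  # read-only traversal...
        pass           # ...but every iteration INCREFs/DECREFs `item`,
                       # dirtying the copy-on-write page it lives on
    after = rss_kb()
    print("child RSS grew by ~%d MB" % ((after - before) // 1024))
    sys.stdout.flush()
    os._exit(0)
os.waitpid(pid, 0)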

So an idea I would like to try is to move the reference counts outside
of PyObjects, into one or more contiguous blocks of memory. Each
PyObject would hold a pointer to its reference count inside such a
block. Doing this, I think that:
1. The beginnings of PyObject structs could stay in the CPU cache for
a much longer time (small objects like ints could be cached in full).
I don't know whether having the writes localized in the refcount block
would also help performance.
2. Copy-on-write after fork() would work much better: for read-only
algorithms, only the block with refcounts would be copied into the
child process.

However, the drawback is that such a design introduces a new level of
indirection: a pointer inside each PyObject instead of a direct value.
It also seems that the block with refcounts would have to be a
non-trivial data structure.
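
As a rough illustration of the layout I have in mind (a toy model in
Python, not the real C patch; the RefcountArena name is made up),
objects would hold only a slot index into a contiguous arena of
counters, and the free-list hints at why the block is non-trivial:

import array

class RefcountArena:
    def __init__(self):
        self.counts = array.array("l")  # contiguous block of counters
        self.free = []                  # recycled slots of freed objects

    def alloc(self):
        # Hand out a slot initialized to refcount 1.
        if self.free:
            slot = self.free.pop()
            self.counts[slot] = 1
            return slot
        self.counts.append(1)
        return len(self.counts) - 1

    def incref(self, slot):
        self.counts[slot] += 1

    def decref(self, slot):
        self.counts[slot] -= 1
        if self.counts[slot] == 0:
            self.free.append(slot)      # slot becomes reusable

All refcount writes land in the arena, so after fork() only the
arena's pages would get copied while the objects themselves stay
shared.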

I'm not a compiler/profiling expert, so the main question is whether
such a design can work, and whether someone has already thought about
something similar. Also, has CPython ever been profiled for CPU cache
usage?


Re: [Python-Dev] CPython optimization: storing reference counters outside of objects

2011-05-23 Thread Artur Siekielski
Ok, I managed to make a quick but working patch (sufficient to get a
working interpreter; it segfaults with extension modules). It uses the
"ememoa" allocator (http://code.google.com/p/ememoa/), which seems to
be a reasonable pool allocator. The patch: http://dpaste.org/K8en/. The
main obstacle was that there isn't a single function/macro that can be
used to initialize all PyObjects, so I had to initialize static
PyObjects (mainly PyTypeObjects) by hand.

I used a naive quicksort algorithm as a benchmark:
http://dpaste.org/qquh/ . The result is that after patching it runs
50% SLOWER. I profiled it, and the allocator's methods accounted for
35% of the running time, so there is still a 15% performance loss even
if we blame the rest on the allocator being poor.
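
Roughly, the benchmark looks like this (a sketch of the kind of code
in the paste, not the exact code) - a naive quicksort that stresses
the allocator by building many intermediate lists:

import random
import time

def qsort(lst):
    # Naive quicksort: allocates two fresh lists at every level.
    if len(lst) <= 1:
        return lst
    pivot = lst[0]
    less = [x for x in lst[1:] if x < pivot]
    more = [x for x in lst[1:] if x >= pivot]
    return qsort(less) + [pivot] + qsort(more)

data = [random.randint(0, 10**6) for _ in range(10**5)]
t0 = time.time()
qsort(data)
print("sorted in %.2f s" % (time.time() - t0))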

Anyway, I'd like to have working copy-on-write in CPython - in the
presence of the GIL I find it important to have multiprocess programs
optimized (and I think it's a common idiom that a parent process
prepares some big data structure, and child "worker" processes do some
read-only querying; see the sketch below).
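
The idiom, sketched (an illustration only, assuming Linux where
multiprocessing uses fork(); the index contents are made up):

import multiprocessing

index = {}  # big read-only structure, built once in the parent

def build_index():
    for i in range(10**6):
        index[i] = i * i

def query(key):
    # Runs in a child: `index` is inherited via fork(), so ideally it
    # would be shared copy-on-write - refcount writes spoil the sharing.
    return index.get(key)

if __name__ == "__main__":
    build_index()
    pool = multiprocessing.Pool(4)
    print(pool.map(query, [1, 42, 999]))
    pool.close()
    pool.join()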

Artur


Re: [Python-Dev] CPython optimization: storing reference counters outside of objects

2011-05-23 Thread Artur Siekielski
2011/5/23 Guido van Rossum:
>> Anyway, I'd like to have working copy-on-write in CPython - in the
>> presence of the GIL I find it important to have multiprocess programs
>> optimized (and I think it's a common idiom that a parent process
>> prepares some big data structure, and child "worker" processes do some
>> read-only querying).
>
> That is the question though -- *is* the idiom commonly used?

In fact, I came to this whole optimization idea because the idiom
didn't work for me. I had a big word index built by a parent process,
and then wanted the children to serve queries against this index (I
wanted to use all the cores on a server). The index consumed 50% of
RAM, and after a few minutes the children had consumed all of it.

In languages like Java it is common to use thread pools; in Python on
Linux we have multiprocess pools if we want to use all cores, and in
this setting having a working copy-on-write is really valuable.

Oh, and using explicit shared memory or mmap is much harder, because
you have to map the whole object graph into bytes.
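
For example (a toy sketch of what that means; the data is made up),
even sharing a flat list of ints through an anonymous mmap forces you
to serialize and deserialize it by hand:

import mmap
import struct

buf = mmap.mmap(-1, 1024)  # anonymous mapping, inherited across fork()

values = [3, 1, 4, 1, 5, 9]
buf.write(struct.pack("i", len(values)))
buf.write(struct.pack("%di" % len(values), *values))

# A child would have to decode the bytes again to get Python objects:
buf.seek(0)
(n,) = struct.unpack("i", buf.read(4))
print(struct.unpack("%di" % n, buf.read(4 * n)))

And that is a flat list of ints - a real object graph (dicts of lists
of strings...) is far worse.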

> It
> doesn't seem to me that it would scale all that far, since it only
> works as long as all forked copies live on the same machine and run on
> the same symmetrical multi-core processor.

I don't know about multi-processor systems, but on single-processor
multi-core systems (which are common even on servers) under Linux it
works.


Artur


Re: [Python-Dev] CPython optimization: storing reference counters outside of objects

2011-05-24 Thread Artur Siekielski
2011/5/24 Sturla Molden:
>> Oh, and using explicit shared memory or mmap is much harder, because
>> you have to map the whole object graph into bytes.
>
> It sounds like you need PYRO, POSH or multiprocessing's proxy objects.

PYRO and multiprocessing proxies aren't comparable solutions, because
their performance is ORDERS OF MAGNITUDE worse. You are comparing
direct memory access with serialization/message passing through
sockets or pipes.

POSH might be good, but the project has been dead for 8 years. And
this copy-on-write approach is nice because you don't need changes or
restrictions to your code, or a special garbage collector.


Artur


Re: [Python-Dev] CPython optimization: storing reference counters outside of objects

2011-05-24 Thread Artur Siekielski
2011/5/24 Sturla Molden:
> On 24.05.2011 11:55, Artur Siekielski wrote:
>>
>> PYRO and multiprocessing proxies aren't comparable solutions, because
>> their performance is ORDERS OF MAGNITUDE worse. You are comparing
>> direct memory access with serialization/message passing through
>> sockets or pipes.
> The bottleneck is likely the serialization, but only if you serialize large
> objects. IPC is always very fast, at least on localhost.

It cannot be "fast" compared to direct memory access. Here is a
benchmark: summing numbers in a small list in a child process using a
multiprocessing "manager": http://dpaste.org/QzKr/ , and using an
implicit copy of the structure after fork(): http://dpaste.org/q3eh/.
The first is 200 TIMES SLOWER. That means that if the work finishes in
20 seconds using fork(), the same work requires more than an hour
using a multiprocessing manager.
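
Roughly, the two cases look like this (a sketch of the comparison, not
the exact pastes; the list size is illustrative):

import multiprocessing
import os
import time

def sum_via_proxy(proxy):
    total = 0
    for i in range(len(proxy)):  # every index is a round trip to the manager
        total += proxy[i]

def time_manager(data):
    mgr = multiprocessing.Manager()
    proxy = mgr.list(data)       # accesses go through a socket/pipe
    t0 = time.time()
    p = multiprocessing.Process(target=sum_via_proxy, args=(proxy,))
    p.start()
    p.join()
    return time.time() - t0

def time_fork(data):
    t0 = time.time()
    pid = os.fork()
    if pid == 0:
        sum(data)                # direct memory access to the inherited copy
        os._exit(0)
    os.waitpid(pid, 0)
    return time.time() - t0

if __name__ == "__main__":
    data = list(range(10000))
    print("manager: %.4f s" % time_manager(data))
    print("fork:    %.4f s" % time_fork(data))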

> If a database is too slow, I am rather sure you need
> something else than Python as well.

Disk access is about 1000x slower than memory access in C, and Python
is at worst 50x slower than C, so even worst-case Python code working
from memory is still about 20x faster than a disk-bound program - a
huge win (not to mention that in the common case Python is only a few
times slower than C).


Artur