[Python-Dev] CPython optimization: storing reference counters outside of objects
Hi. The problem with reference counters is that they are incremented and decremented very often, even by read-only algorithms (like traversing a list). This has two drawbacks:

1. CPU cache lines (64 bytes on x86) containing the beginning of a PyObject are invalidated very often, so many chances to use the CPU caches are lost.

2. The copy-on-write optimization after fork() (on Linux) is almost useless in CPython: even if you don't modify data directly, the refcounts are modified, and PyObjects with their refcounts inside are spread all over the process's memory, so one small refcount modification causes a whole page - 4 kB - to be copied into the child process.

So the idea I would like to try is to move reference counts out of PyObjects into one or more contiguous blocks of memory. Each PyObject would hold a pointer to its reference count inside such a block. Doing this, I think that:

1. The beginning of PyObject structs could stay in the CPU cache for much longer (small objects like ints could be cached entirely). I don't know whether having the writes localized in the refcount block would also help performance.

2. Copy-on-write after fork() would work much better: for read-only algorithms, only the block with refcounts would be copied into the child process.

The drawback is that such a design introduces a new level of indirection - a pointer inside each PyObject instead of a direct value. It also seems that the "block" with refcounts would have to be a non-trivial data structure. I'm not a compiler/profiling expert, so the main question is whether such a design can work, and whether someone has been thinking about something similar. Also, has CPython ever been profiled for CPU cache usage?
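To make the premise concrete (this snippet is not from the original post, just an illustration): even a "read-only" traversal writes into every visited object, because binding the loop variable bumps its refcount. sys.getrefcount makes the extra reference visible (it adds one of its own for its argument):

import sys

items = [object() for _ in range(3)]

# Baseline: each object is referenced only by the list
# (+1 for the temporary reference getrefcount's argument adds).
print([sys.getrefcount(items[i]) for i in range(3)])   # typically [2, 2, 2]

# During a "read-only" traversal the loop variable holds an extra
# reference, so the count stored inside each object has been written.
for x in items:
    print(sys.getrefcount(x), end=" ")                  # typically 3 3 3
print()

Every such increment is a store into the object's header - the same cache line the proposal wants to keep clean - and after fork() it dirties the page the object lives on.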
Re: [Python-Dev] CPython optimization: storing reference counters outside of objects
Ok, I managed to make a quick but working patch (sufficient to get a working interpreter; it segfaults with extension modules). It uses the "ememoa" allocator (http://code.google.com/p/ememoa/), which seems to be a reasonable pool allocator. The patch: http://dpaste.org/K8en/.

The main obstacle was that there is no single function/macro that initializes all PyObjects, so I had to initialize statically allocated PyObjects (mainly PyTypeObjects) by hand.

I used a naive quicksort algorithm as a benchmark: http://dpaste.org/qquh/ . The result is that after patching it runs 50% SLOWER. I profiled it and the allocator methods take 35% of the time, so there is still roughly a 15% performance loss even if the slowdown from the (poor) allocator is discounted.

Anyway, I'd like to have working copy-on-write in CPython - in the presence of the GIL I find it important to have multiprocess programs optimized (and I think it's a common idiom that a parent process prepares some big data structure, and child "worker" processes do some read-only querying).

Artur
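The dpaste snippets above are no longer available; a naive quicksort benchmark of roughly this shape (my reconstruction, not the original code, with arbitrary sizes) allocates new lists at every recursion level and so exercises the allocator and refcounting heavily:

import random
import time

def quicksort(lst):
    # Naive, allocation-heavy quicksort: builds new lists at every level,
    # so it churns object allocation and refcount updates.
    if len(lst) <= 1:
        return lst
    pivot = lst[0]
    smaller = [x for x in lst[1:] if x < pivot]
    larger = [x for x in lst[1:] if x >= pivot]
    return quicksort(smaller) + [pivot] + quicksort(larger)

data = [random.randint(0, 1000000) for _ in range(100000)]
start = time.time()
result = quicksort(data)
print("sorted correctly:", result == sorted(data),
      "in %.2fs" % (time.time() - start))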
Re: [Python-Dev] CPython optimization: storing reference counters outside of objects
2011/5/23 Guido van Rossum :
>> Anyway, I'd like to have working copy-on-write in CPython - in the
>> presence of the GIL I find it important to have multiprocess programs
>> optimized (and I think it's a common idiom that a parent process
>> prepares some big data structure, and child "worker" processes do some
>> read-only querying).
>
> That is the question though -- *is* the idiom commonly used?

In fact I came to the whole idea of this optimization because the idiom didn't work for me. I had a big word index built by a parent process, and then wanted the children to serve queries against this index (I wanted to use all cores on a server). The index consumed 50% of RAM, and after a few minutes the children had consumed all of it. In languages like Java it is common to use thread pools; in Python on Linux we have multiprocess pools if we want to use all cores, and in this setting having a working copy-on-write is really valuable. Oh, and using explicit shared memory or mmap is much harder, because you have to map the whole object graph into bytes.

> It doesn't seem to me that it would scale all that far, since it only
> works as long as all forked copies live on the same machine and run on
> the same symmetrical multi-core processor.

? I don't know about multi-processor systems, but on single-processor multi-core systems (which are common even on servers) and Linux it works.

Artur
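For concreteness, here is a minimal sketch of the idiom being described (the names and sizes are made up, and it assumes the fork start method, i.e. Linux/Unix): the parent builds a large in-memory index, then a pool of workers only reads it, so ideally the pages would stay shared copy-on-write:

import multiprocessing

# Stand-in for the "big word index" built by the parent process.
index = {"word%d" % i: list(range(10)) for i in range(100000)}

def worker(word):
    # Read-only lookup in the parent's index. With fork() the child gets
    # the parent's pages copy-on-write -- but in current CPython the
    # refcount updates made while reading gradually force private copies
    # of those pages anyway.
    return len(index.get(word, ()))

if __name__ == "__main__":
    with multiprocessing.Pool(4) as pool:
        print(sum(pool.map(worker, ["word1", "word2", "missing"])))

With refcounts moved out of the objects, only the refcount block would get copied into each worker; with the current layout the "read-only" lookups slowly duplicate the whole index in every child.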
Re: [Python-Dev] CPython optimization: storing reference counters outside of objects
2011/5/24 Sturla Molden :
>> Oh, and using explicit shared memory or mmap is much harder, because
>> you have to map the whole object graph into bytes.
>
> It sounds like you need PYRO, POSH or multiprocessing's proxy objects.

PYRO and multiprocessing proxies aren't a comparable solution because their performance is ORDERS OF MAGNITUDE worse. You are comparing direct memory access with serialization/message passing through sockets or pipes. POSH might be good, but the project has been dead for 8 years. And copy-on-write is nice because you don't need any changes or restrictions to your code, or a special garbage collector.

Artur
Re: [Python-Dev] CPython optimization: storing reference counters outside of objects
2011/5/24 Sturla Molden :
> On 24.05.2011 11:55, Artur Siekielski wrote:
>>
>> PYRO and multiprocessing proxies aren't a comparable solution because
>> their performance is ORDERS OF MAGNITUDE worse. You are comparing
>> direct memory access with serialization/message passing through
>> sockets or pipes.
>
> The bottleneck is likely the serialization, but only if you serialize large
> objects. IPC is always very fast, at least on localhost.

It cannot be "fast" compared to direct memory access. Here is a benchmark: summing the numbers in a small list in a child process using a multiprocessing "manager": http://dpaste.org/QzKr/ , and using the implicit copy of the structure after fork(): http://dpaste.org/q3eh/ . The first is 200 TIMES SLOWER. That means that if the work finishes in 20 seconds using fork(), the same work takes more than an hour using a multiprocessing manager.

> If a database is too slow, I am rather sure you need
> something else than Python as well.

Disk access is about 1000x slower than memory access in C, and Python is at worst about 50x slower than C, so there is still a huge win (not to mention that in the common case Python is only a few times slower).

Artur
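The two dpaste benchmarks are gone; the comparison described reads roughly like the following reconstruction (mine, not the original code; Unix-only because of os.fork, and the list size is arbitrary): summing a list through a multiprocessing Manager proxy, where every item access is a round trip over a pipe, versus summing an ordinary list inherited copy-on-write through fork():

import multiprocessing
import os
import time

N = 100000  # the sizes used in the original benchmarks are unknown

def sum_proxy(shared, result):
    # Child process: every item access on the Manager proxy is a
    # message over a pipe to the manager server process.
    result.value = sum(shared[i] for i in range(N))

def via_manager():
    with multiprocessing.Manager() as mgr:
        shared = mgr.list(range(N))
        result = multiprocessing.Value("q", 0)
        p = multiprocessing.Process(target=sum_proxy, args=(shared, result))
        p.start()
        p.join()
        return result.value

def via_fork():
    data = list(range(N))       # plain list in the parent's heap
    r, w = os.pipe()
    pid = os.fork()             # Unix only
    if pid == 0:                # child: reads the parent's pages copy-on-write
        os.write(w, str(sum(data)).encode())
        os._exit(0)
    os.waitpid(pid, 0)
    return int(os.read(r, 64))

if __name__ == "__main__":
    for fn in (via_fork, via_manager):
        t0 = time.time()
        print(fn.__name__, fn(), "%.2fs" % (time.time() - t0))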