On 01/21/2012 07:50 PM, Stefan Behnel wrote:
Hi,

I did some callgrind profiling on Cython's generators and was surprised to
find that AddTraceback() represents a serious performance penalty for short
running generators.

I profiled a compiled Python implementation of itertools.groupby(), which
yields (key, group) tuples where the group is an iterator again. I ran this
code in Python for benchmarking:

"""
L = sorted(range(1000)*5)

all(list(g) for k,g in groupby(L))
"""

Groups tend to be rather short in real code, often just one or a couple of
items, so unpacking the group iterator into a list will usually be a quick
loop and then the generator raises StopIteration on termination and builds
a traceback for it. According to callgrind (which, I should note, tends to
overestimate the amount of time spent in memory allocation), the iteration
during the group unpacking takes about 30% of the overall runtime of the
all() loop, and the AddTraceback() call at the end of each group traversal
takes up to 25% (!) on my side. That means that more than 80% of the group
unpacking time goes into raising StopIteration from the generators. I
attached the call graph with the relative timings.

OT: Since you complain that callgrind is inaccurate; are you aware of sampling profilers, such as Google perftools? (I don't have experience with callgrind myself)

http://google-perftools.googlecode.com/svn/trunk/doc/cpuprofile.html

http://pypi.python.org/pypi/yep

Dag


About half of the exception raising time is eaten by PyString_FromFormat()
that builds the function-name + line-position string (which, I may note, is
basically a convenience feature). This string is a constant for a
generator's StopIteration exception, at least for each final return point
in a generator, but here it is being recreated over and over again, for
each exception that gets raised.

Even if we keep creating a new frame instance each time (which should be ok
because CPython has a frame instance cache already and we'd only create one
during the generator lifetime), the whole code object could actually be
cached after the first creation, preferably bound to the lifetime of the
generator creator function/method. Or, more generally, one code object per
generator termination point, which will be a single point in the majority
of cases. For the specific code above, that should shave off almost 20% of
the overall runtime of the all() loop.

I think that's totally worth doing.

Stefan



_______________________________________________
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel

_______________________________________________
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel

Reply via email to