Reducing "yield from" overhead in recursive generators

2022-03-17 Thread Rathmann
Summary: I am looking for feedback on a potential technique for coding
recursive generators.

Often, the most convenient algorithms for creating a sequence of
results are recursive, while the most convenient interface for the
consumption of those results is an iterator.  PEP 380 introduced the
"yield from" syntax which provides a technique for generators to
delegate to other generators, including recursively.

The trouble is that recursion and generators don't necessarily play
all that well together from a performance point of view.  As Greg
Ewing noted in PEP 380:

The overhead of passing __next__() calls and yielded values
down and up the chain can cause what ought to be an O(n)
operation to become, in the worst case, O(n**2).

Ewing goes on to describe some optimizations possible in
Python implementations.  However, in the Python versions I have
checked, there is still some O(depth) overhead to a chain of "yield
from" calls.

To make this concrete, let's consider a problem with a canonical
recursive implementation -- an inorder tree traversal.  (My actual
motivation for thinking about this issue comes from implementing
generators for various kinds of combinatorial objects.  But tree
traversal provides a good familiar example.)

For discussion purposes, let's represent a tree node as a tuple of the
form (key, left, right).  Then, a simple tree might be

```python
tree1 = ('b', ('a', None, None), ('c', None, None))
```

Or, a badly unbalanced example could be:

```python
tree2 = ('a', None,
         ('b', None,
          ('c', None,
           ('d', None,
            ('e', None,
             ('f', None,
              ('g', None,
               ('h', None,
                ('i', None, None)))))))))
```

Then, we can code inorder traversal as:

```python
def inorder_traverse(node):
    k, l, r = node
    if l:
        yield from inorder_traverse(l)
    yield k
    if r:
        yield from inorder_traverse(r)
```

So far, all very familiar.  The trouble is that if you are at a depth
of d, every value that is yielded by the algorithm gets passed through
d stack frames before it reaches the actual consumer of the data.
This issue, specifically the case of tree traversal, was the subject
of a 2016 thread in comp.lang.python "Linear Time Tree Traversal
Generator".  I am not reviving that 5+-year-old thread since my
objectives are a little different.

In that thread, Ned Batchelder suggested (and gave a nice
implementation for) reimplementing the traversal to use an explicit
stack and a loop to traverse the tree, so that "each yield directly
produces a value to the caller."  That, to my mind, basically solved
the original poster's problem -- the code for the iterative version is
reasonably concise and understandable, and is faster than the
recursive version both in the O(n) sense and also for small problems.
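Batchelder's exact code isn't reproduced here, but an explicit-stack
inorder traversal in the same spirit, using the (key, left, right)
tuples from above, might look something like this:

```python
def inorder_iterative(node):
    """Explicit-stack inorder traversal: every yield goes straight to the caller."""
    stack = []
    while stack or node:
        while node:                 # walk down the left spine
            stack.append(node)
            node = node[1]          # left child
        node = stack.pop()
        yield node[0]               # key
        node = node[2]              # right child

tree1 = ('b', ('a', None, None), ('c', None, None))
print(list(inorder_iterative(tree1)))   # prints ['a', 'b', 'c']
```

There is no recursion here at all, so each value passes through only
one stack frame regardless of the depth of the tree.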

But when the candidate recursive algorithm is more complicated than
tree traversal, translating to an iterative version can be difficult.
Maintaining a stack of local variables is not too hard, but converting
a recursive function to an iterative one completely changes the
control flow, so that familiar control structures become a tangled
mass of GOTOs.  (Which is even harder in a language without a GOTO.)
The result, at least when I have tried, can be hard to write and
(worse) hard to read.

I was looking at the problem and realized that there might be a middle
ground -- avoiding the O(depth) cost of a chain of "yield from"
clauses, but still keeping the recursive code looking reasonably close
to the original.

The idea is to modify the recursive function by replacing the "yield
from" forms with "yield".  The transformed function needs to run from
a driver, which simulates the normal call stack.  The driver checks
each result yielded from the function -- if it is a regular value, it
is returned (yielded) to the caller immediately, but if the result is
a generator, it is pushed onto that stack.  The Python code for such a
driver is easier to understand than any english description I have
been able to create, so here it is:

```python
import types

def recursive_generator_driver(g):
    """g is a generator object, which is likely to call other generators"""
    stack = [g]
    while True:
        g = stack[-1]
        try:
            val = next(g)
            if isinstance(val, types.GeneratorType):
                # Yielded value represents a recursive call, so push
                stack.append(val)
            else:
                # Regular value for the iterator; give it to the caller
                yield val
        except StopIteration:
            stack.pop()
            if not stack:
                return
```

and the modified tree traverser is just

```python
def inorder_traverse_mod(node):
    """ONLY for use with recursive_generator_driver"""
    k, l, r = node
    if l:
        yield inorder_traverse_mod(l)  # NOT yield from
    yield k
    if r:
        yield inorder_traverse_mod(r)  # NOT yield from
```
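Putting the two together, a quick self-contained check (repeating the
driver, in a slightly condensed form, and the modified traverser so
this snippet runs on its own) shows the driver yielding the keys in
sorted order:

```python
import types

def recursive_generator_driver(g):
    """Run a 'yield'-style recursive generator via an explicit stack."""
    stack = [g]
    while stack:
        try:
            val = next(stack[-1])
        except StopIteration:
            stack.pop()
            continue
        if isinstance(val, types.GeneratorType):
            stack.append(val)   # recursive call: push, don't yield
        else:
            yield val           # ordinary value: hand straight to the caller

def inorder_traverse_mod(node):
    k, l, r = node
    if l:
        yield inorder_traverse_mod(l)   # NOT yield from
    yield k
    if r:
        yield inorder_traverse_mod(r)   # NOT yield from

# A right-spine tree like tree2 above, but only five deep
tree = None
for key in 'edcba':
    tree = (key, None, tree)

print(list(recursive_generator_driver(inorder_traverse_mod(tree))))
# prints ['a', 'b', 'c', 'd', 'e']
```

The key property is that no matter how deep the recursion gets, each
yielded value crosses only two frames: the currently active generator
and the driver.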

Re: Reducing "yield from" overhead in recursive generators

2022-03-18 Thread Rathmann


On Friday, March 18, 2022 at 2:07:45 AM UTC-7, Antoon Pardon wrote:
> On 18/03/2022 at 04:14, Rathmann wrote:
> >
> > So what do people think of this hack/technique? Is it
> >
> > A) A terrible idea? (Please be specific.)
> > B) Already well-known (and I just missed it.)
> > C) Potentially useful for the niche case of deeply recursive
> > generators?
> I would go with B' + C. It seems there is a resemblance with the
> trampoline technique used to eliminate tail recursion. I know the
> people of macropy have written a decorating macro that eliminates
> tail recursion from such decorated functions. Maybe if you contact
> them they would be interested in making a similar decorating macro
> for use with such recursive generators.
>
> --
> Antoon Pardon.

Thank you Antoon.  I was not aware of macropy.  It certainly seems
like a macro could automate the process of switching the "yield from"
to yield forms.  That is already pretty easy to do by hand, though.

The driver itself could be packaged as a decorator using just the
facilities available in regular Python implementations.
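A minimal sketch of such a decorator (the name flatten_recursion is my
own invention, and it reuses a condensed form of the driver).  One
subtlety: the recursive calls inside the generator must yield the
*undecorated* function's generator, otherwise each level of recursion
spins up its own driver and the O(depth) chain comes back:

```python
import functools
import types

def recursive_generator_driver(g):
    stack = [g]
    while stack:
        try:
            val = next(stack[-1])
        except StopIteration:
            stack.pop()
            continue
        if isinstance(val, types.GeneratorType):
            stack.append(val)
        else:
            yield val

def flatten_recursion(genfunc):
    """Hypothetical decorator: run a 'yield'-style generator through the driver."""
    @functools.wraps(genfunc)
    def wrapper(*args, **kwargs):
        return recursive_generator_driver(genfunc(*args, **kwargs))
    return wrapper

def _countdown(n):
    if n:
        yield n
        yield _countdown(n - 1)   # yield the raw generator, NOT the wrapper

countdown = flatten_recursion(_countdown)
print(list(countdown(3)))   # prints [3, 2, 1]
```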

**Much** more ambitious would be a macro to automate the
transformation of a generator from (general) recursive to iterative
form, so as to avoid the need for this technique entirely.

The other challenge/question would be to see if there is a way to implement
this basic approach with lower overhead.  So far, the savings don't
kick in until the recursion hits a depth of 12 or so (even more with
CPython), and many use cases have an expected recursion depth
shallower than that.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Reducing "yield from" overhead in recursive generators

2022-03-22 Thread Rathmann
> On 19 Mar 2022, at 03:07, Greg Ewing wrote:
>
> On 19/03/22 9:40 am, Rathmann wrote:
>> The other challenge/question would be to see if there is a way to implement
>> this basic approach with lower overhead.
>
> My original implementation of yield-from didn't have this problem,
> because it didn't enter any of the intermediate frames -- it just
> ran down a chain of C pointers to find the currently active one.
>
> At some point this was changed, I think so that all the frames
> would show up in the traceback in the event of an exception.
> This was evidently seen as more important than having efficient
> resumption of nested generators.

So that sounds like the original CPython implementation had an
O(depth) component, but with a lower constant factor than the current
version?  (Of course, since Python limits recursion depth, speaking of
O(n) is not quite right.)

Prioritizing stack traces over performance seems consistent with the
well-known blog post on tail recursion elimination.  This is one of
the reasons I was motivated to try to attack this at the
library/application level -- when the programmer invokes a
transformation like I am proposing, they know that it is going to
affect the stack trace.

I am still hoping to find a way to implement the transformed
function/driver combination with lower overhead.  This is a little
tricky since the performance of the approaches is very dependent on
the Python implementation.  Below are some timings for CPython and
PyPy, on both 2.7 and 3.8.  For 2.7, the untransformed version used a
loop of yield statements, timed with time.clock(); for 3.8, it used
"yield from" and time.perf_counter().  All times are for traversing a
balanced tree of 8 million nodes on x86_64 Linux.

```
3.8.10 (default, Nov 26 2021, 20:14:08)
[GCC 9.3.0]
For depth= 22 nodes= 8388607 iters= 1
  Elapsed time (yield from): 6.840768865999053 driver: 6.702329244999419

3.6.9 (7.3.1+dfsg-4, Apr 22 2020, 05:15:29)
[PyPy 7.3.1 with GCC 9.3.0]
For depth= 22 nodes= 8388607 iters= 1
  Elapsed time (yield from): 4.037276787999872 driver: 2.3724582960003318

2.7.18 (default, Mar  8 2021, 13:02:45)
[GCC 9.3.0]
For depth= 22 nodes= 8388607 iters= 1
  Elapsed time (yield loop): 8.863244 driver: 13.788312

2.7.13 (7.3.1+dfsg-2, Apr 21 2020, 05:05:41)
[PyPy 7.3.1 with GCC 9.3.0]
For depth= 22 nodes= 8388607 iters= 1
  Elapsed time (yield loop): 8.802712562 driver: 2.034388533
```

The most extreme variation was 2.7 PyPy, where the driver was 4 times
faster.  (Of course, 2.7 is less used these days.)
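The exact benchmark script isn't reproduced in this thread, but a
self-contained harness along these lines can run the same kind of
comparison (the build_tree helper, the subtree-sharing trick, and the
smaller depth are my own choices, not from the original measurements):

```python
import time
import types

def build_tree(depth):
    """Perfect binary tree of (key, left, right) tuples."""
    if depth == 0:
        return None
    child = build_tree(depth - 1)
    return (depth, child, child)    # sharing subtrees keeps memory small

def inorder_traverse(node):
    k, l, r = node
    if l:
        yield from inorder_traverse(l)
    yield k
    if r:
        yield from inorder_traverse(r)

def inorder_traverse_mod(node):
    k, l, r = node
    if l:
        yield inorder_traverse_mod(l)   # NOT yield from
    yield k
    if r:
        yield inorder_traverse_mod(r)   # NOT yield from

def recursive_generator_driver(g):
    stack = [g]
    while stack:
        try:
            val = next(stack[-1])
        except StopIteration:
            stack.pop()
            continue
        if isinstance(val, types.GeneratorType):
            stack.append(val)
        else:
            yield val

def consume(it):
    n = 0
    for _ in it:
        n += 1
    return n

depth = 18                       # 2**18 - 1 nodes; the thread used depth 22
tree = build_tree(depth)
t0 = time.perf_counter()
n1 = consume(inorder_traverse(tree))
t1 = time.perf_counter()
n2 = consume(recursive_generator_driver(inorder_traverse_mod(tree)))
t2 = time.perf_counter()
print("nodes:", n1, "yield from:", t1 - t0, "driver:", t2 - t1)
```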