I apologize up front for the length of this. There's a tl;dr at the end.

> On Apr 20, 2018, at 04:07, Ben Noordhuis <[email protected]> wrote:
> 
> On Thu, Apr 19, 2018 at 4:08 PM, Jason Madden
> <[email protected]> wrote:
>> 
>> 
>> After some debugging it became clear what was going on. The gory
>> details are in the bug report [2], but in summary this is an
>> interaction between the facts that the QUEUE that is used to iterate
>> through the check watchers[3] is stack allocated...

>>  Combining switching
>> greenlets with stopping watchers thus can result in the QUEUE getting
>> restored to an invalid state (infinite loop in this case).
>> 
>> We solved this problem by heap allocating the QUEUE[5] on each entry to
>> uv__run_check so that it no longer got saved and restored by greenlet
>> switches. This obviously adds some overhead, though I didn't try to
>> quantify it.
>> 
>> Potentially a better fix would be to have the QUEUE as part of the
>> loop structure; I didn't do that here to keep the patch as small as
>> possible (and I didn't know if that would hurt the chances of it
>> getting upstreamed due to ABI concerns). Or maybe there are other
>> approaches that are better? Feedback is welcome!
> 
> I suppose we could change that in libuv v2 (not v1 because of ABI)

Understood. 

Since we can't change the loop structure, and assuming malloc/free are "too 
expensive"[1], what came to mind (something I've implemented in the past for a 
similar case) was a static freelist of objects (QUEUEs in this case). It could 
be either thread-local or, with appropriate intrinsics, implemented 
atomically. Objects would have to be checked out and returned through an API, 
much like malloc/free. I know that's not as foolproof as stack allocation from 
a correctness standpoint (but see below: that may be necessary to avoid 
regressions). It looks like the intrinsics may already be there in atomic-ops.h 
(though I'll be honest, it's been years since I implemented anything like that).

> I suspect it would break again quickly because there is no good way to
> regression-test it.  

Agreed. I gave it a bit of thought, and (aside from simply having the style 
guide say "don't do that") the only way I came up with to really avoid 
*needing to regression-test it* was to make it impossible: make the queue type 
private and wrap an API around getting and returning a pointer to one (again, 
just like malloc/free, except the API could be macros). I'm sure
nobody would be a big fan of that, but it would have the advantage of being 
flexible (the loop could keep a QUEUE of QUEUEs, popping and pushing as needed, 
growing a new entry should one be needed recursively). That flexibility may not 
be necessary if there are only a few defined places that are "reentrant" like 
in the original bug, in which case still hiding the type and a few defined 
macros could be a decent approach.

> For instance, there's currently a pull request
> under review that introduces a stack-allocated QUEUE in
> uv_write()-related code.

If you are referring to https://github.com/libuv/libuv/pull/1787, I don't think 
that would cause any problems; the new stack-allocated variable is not 
reachable outside that scope as far as I can see. But that does highlight the 
point that it's either "never ever have a stack-allocated queue" or "carefully 
eyeball each instance, expect to eventually misidentify a case, and wait for a 
downstream to report the bug".

Regarding v1 vs v2 and timelines, here's the way things look from gevent's 
perspective:

As a gevent maintainer, I don't really mind carrying around a small patch to 
libuv (we did that for quite a while with libev). However, it limits the 
audience for gevent+libuv. When I'm carrying a patch for libuv, I have to do a 
custom, embedded build of it; I can't use the system-provided library[2]. In 
some ways I don't hate that, because I know exactly what we're getting when I 
get bug reports, and I know I can set flags to get link-time optimization and 
the best performance out of the Python 
bindings. However, there are lots of people who install the packaged gevent 
provided by their (Linux) distribution, and there are lots of distributions 
that refuse to allow embedded libraries, for perfectly valid reasons, so if 
gevent can't use the system libuv, it can't use libuv on that distro at all. If 
gevent has significant users that can't use libuv, I can't ultimately deprecate 
and drop libev without leaving those users behind, should that moment ever get 
here.

That's the same reason I can't use libuv v2 until it's released and starts 
getting picked up by distros. 

TL;DR: 

This is all a bit hypothetical right now. gevent 1.3 is only in beta and hasn't 
been packaged by distros yet AFAIK. Being able to drop my patches for gevent 
1.3/libuv 1.21 would be cool, though given the complexity of an API- and 
CAS-based approach, that seems unlikely. But if I have to wait for next year 
and gevent 1.4/libuv v2 to build against a system libuv and drop my patches, 
I'm cool with that too.

I'm happy to prepare a simple (ABI-breaking) PR against master/v2 if that 
sounds attractive (although I see there are already outstanding QUEUE-changing 
PRs for master). In light of the measured expense of the malloc/free calls, 
I've already adapted my own patch to do that.

If it doesn't sound attractive right now because of testing and maintenance 
concerns, I understand that too. Like I said, I don't really mind carrying 
around a small patch to libuv, especially not in the early days of 
gevent+libuv. I can circle back sometime in the future if it seems warranted 
either because the patch complexity grows or because of requests from 
distros/users.


Thanks again for your thoughts on this,
Jason

[1]

I decided to quantify the effect of the malloc call on some systems I had 
lying around. I took the uv__run_ code, set up a bunch of handles with a no-op 
callback, compiled with all optimizations enabled, and ran timings for various 
numbers of active handles. Approximate costs of the malloc/free of the QUEUE on 
various systems:

Linux x86_64:                 30ns
FreeBSD 10 x86_64:            35ns
Solaris 11 x86_64:            60ns
macOS 10.13.4 w/jemalloc:     60ns
macOS 10.13.4:                80ns
Windows 10 x86_64:            80ns
Linux ARM (Raspberry Pi 2):  500ns

The approximate number of active handles needed to amortize the cost away (due 
to the extra function callbacks) ranged from 10 on the low end to 100 in the 
middle to 1000+ on the high end.

[2]

Actually, right now I can't build with a system libuv either. I was never able 
to get the Python extension to link correctly on Windows, so I'm doing what 
https://github.com/saghul/pyuv does and using setuptools to build libuv on all 
platforms. The plan is to circle back to that when distros ask.
