Cross-referenced to relevant bits of code where appropriate. (And just a quick reminder regarding the code quality disclaimer: I've been hacking away on this stuff relentlessly for a few months; the aim has been to make continual forward progress without getting bogged down in non-value-add busy work. Lots of wildly inconsistent naming conventions and dead code that'll be cleaned up down the track. And the relevance of any given struct will tend to be proportional to how many unused members it has (homeless hoarder + shopping cart analogy).)
On Thu, Mar 14, 2013 at 11:45:20AM -0700, Trent Nelson wrote:

> The basic premise is that parallel 'Context' objects (well, structs) are allocated for each parallel thread callback.

The 'Context' struct:

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel_private.h#l546

Allocated via new_context():

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l4211

Also relevant: new_context_for_socket(), which encapsulates a client/server instance within a context:

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l4300

The primary role of the context is to isolate memory management. This is achieved via 'Heap':

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel_private.h#l281

(Which I sort of half-started refactoring to use the _HEAD_EXTRA approach when I thought I'd need a separate heap type for a TLS avenue I explored -- turns out that wasn't necessary.)

> The context persists for the lifetime of the "parallel work".
>
> The "lifetime of the parallel work" depends on what you're doing. For a simple ``async.submit_work(foo)``, the context is considered complete once ``foo()`` has been called (presuming no exceptions were raised).

Managing context lifetime is one of the main responsibilities of async.run_once():

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3841

> For an async client/server, the context will persist for the entirety of the connection.

Marking a socket context as 'finished' for servers is the job of PxServerSocket_ClientClosed():

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l6885

> The context is responsible for encapsulating all resources related to the parallel thread. So, it has its own heap, and all memory allocations are taken from that heap.

The heap is initialized in two steps during new_context(). First, a handle is allocated for the underlying system heap (via HeapCreate):

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l4224

The first "heap" is then initialized for use with our context via the Heap_Init(Context *c, size_t n, int page_size) call:

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l1921

Heaps are actually linked together via a doubly-linked list. The first heap is a value member (not a pointer) of Context; however, the active heap is always accessed via the '*h' pointer, which is updated as necessary.

    struct Heap {
        Heap *prev;
        Heap *next;
        void *base;
        void *next;
        int   allocated;
        int   remaining;
        ...

    struct Context {
        Heap  heap;
        Heap *h;
        ...

> For any given parallel thread, only one context can be executing at a time, and this can be accessed via the ``__declspec(thread) Context *ctx`` global (which is primed by some glue code as soon as the parallel thread starts executing a callback).

The glue entry point for all callbacks is _PyParallel_EnteredCallback:

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3047

On the topic of callbacks, the main workhorse for the submit_(wait|work) callbacks is _PyParallel_WorkCallback:

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3120

The interesting logic starts at the 'start:' label:

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3251

The most interesting part is the error handling. If the callback raises an exception, we check to see if an errback has been provided. If so, we call the errback with the error details.
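(For illustration, the callback-then-errback dance boils down to something like the following sketch. The names here -- WorkSketch, dispatch_work -- are mine, not the actual code; the real version also deals with context bookkeeping, completion transitions and persistence.)

    /* Hedged sketch of the callback/errback dispatch described above. */
    #include <Python.h>

    typedef struct {
        PyObject *func;      /* the submitted callback                */
        PyObject *args;      /* its argument tuple                    */
        PyObject *errback;   /* optional error handler, may be NULL   */
        PyObject *error;     /* (exc, value, tb) for the main thread  */
    } WorkSketch;

    static void
    dispatch_work(WorkSketch *w)
    {
        PyObject *result;
        PyObject *exc = NULL, *val = NULL, *tb = NULL;

        result = PyObject_CallObject(w->func, w->args);
        if (result) {
            Py_DECREF(result);          /* callback completed successfully */
            return;
        }

        /* The callback raised: grab the exception, then try the errback. */
        PyErr_Fetch(&exc, &val, &tb);
        PyErr_NormalizeException(&exc, &val, &tb);

        if (w->errback) {
            result = PyObject_CallFunctionObjArgs(w->errback, val, NULL);
            if (result) {
                Py_DECREF(result);      /* errback completed successfully */
                Py_XDECREF(exc); Py_XDECREF(val); Py_XDECREF(tb);
                return;
            }
            PyErr_Clear();              /* the errback itself failed */
        }

        /* No errback, or the errback failed: stash (exc, value, tb) so the
         * main thread can percolate it back out of run_once(). */
        w->error = Py_BuildValue("(OOO)",
                                 exc ? exc : Py_None,
                                 val ? val : Py_None,
                                 tb  ? tb  : Py_None);
        Py_XDECREF(exc); Py_XDECREF(val); Py_XDECREF(tb);
    }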
If the callback completes successfully (or it fails, but the errback completes successfully), that is treated as successful callback or errback completion, respectively:

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3270
    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3294

If the errback fails, or no errback was provided, the exception percolates back to the main thread. This is handled at the 'error:' label:

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3300

This should make the behavior of async.run_once() clearer. The first thing it does is check whether any errors have been posted:

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3917

Errors are returned to calling code on a first-error-wins basis. (This involves fiddling with the context's lifetime, as we're essentially propagating an object created in a parallel context -- the (exception, value, traceback) tuple -- back to a main-thread context; so, we can't blow away that context until the exception has had a chance to properly bubble back up and be dealt with.)

If there are no errors, we then check to see if any "call from main thread" requests have been made:

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3936

I added support for this in order to ease unit testing, but it has general usefulness. It's exposed via two decorators:

    @async.call_from_main_thread
    def foo(arg):
        ...

    def callback():
        foo('abcd')

    async.submit_work(callback)

That runs callback() in a parallel thread, which in turn results in foo('abcd') eventually being called from the main thread. This would be useful for synchronising access to a database or something like that.

There's also @async.call_from_main_thread_and_wait, which I probably should have mentioned first:

    @async.call_from_main_thread_and_wait
    def update_login_details(login, details):
        db.update(login, details)

    def foo():
        ...
        update_login_details(x, y)
        # execution will resume when the main thread finishes
        # update_login_details()
        ...

    async.submit_work(foo)

Once all "main thread work requests" have been processed, completed callbacks and errbacks are processed. This basically just involves transitioning the associated context onto the "path to freedom" (the lifecycle that eventually results in the context being free()'d and the heap being destroyed):

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l4032

> No reference counting or garbage collection is done during parallel thread execution. Instead, once the context is finished, it is scheduled to be released, which means it'll be "processed" by the main thread as part of its housekeeping work (during ``async.run()`` (technically, ``async.run_once()``)).
>
> The main thread simply destroys the entire heap in one fell swoop, releasing all memory that was associated with that context.

The "path to freedom" lifecycle is a bit complicated at the moment and could definitely use a review. Basically, though, the main methods are _PxState_PurgeContexts() and _PxState_FreeContext(); the former checks that the context is ready to be freed, the latter does the actual freeing.

_PxState_PurgeContexts:

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3789

_PxState_FreeContext:

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3700

The reason for the separation is to maintain the bubbling effect: a context only makes one transition per run_once() invocation.
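(To make the "one transition per run_once()" point concrete, here's a rough sketch of the shape of that state machine. The state names and fields are invented for illustration; the real transitions live in the two functions linked above and are a fair bit messier.)

    /* Rough sketch of the "path to freedom"; names are illustrative only. */
    #include <windows.h>
    #include <stdlib.h>

    typedef enum {
        CTX_RUNNING,      /* callback/errback still executing             */
        CTX_FINISHED,     /* done, but results/errors not yet harvested   */
        CTX_RELEASABLE    /* nothing (pending exception, persisted        */
                          /* object) is pinning it any more               */
    } CtxState;

    typedef struct {
        CtxState state;
        long     refcnt;       /* persisted objects pin the context       */
        HANDLE   heap_handle;  /* from HeapCreate() in new_context()      */
    } CtxSketch;

    /* Called once per context from run_once(): advance at most one state
     * per pass, so errors and persisted objects get a chance to bubble up
     * before the backing heap disappears. */
    static void
    purge_one(CtxSketch *c)
    {
        switch (c->state) {
        case CTX_FINISHED:
            if (c->refcnt == 0)
                c->state = CTX_RELEASABLE;   /* freed on the next pass */
            break;
        case CTX_RELEASABLE:
            HeapDestroy(c->heap_handle);     /* blow away all context memory */
            free(c);                         /* ...and the context itself    */
            break;
        default:
            break;
        }
    }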
Putting this in place was a key step to stop wild crashes in the early days when unittest would keep hold of exceptions longer than I was expecting -- it should probably be reviewed in light of the new persistence support I implemented (much later).

> There are a few side effects to this. First, the heap allocator (basically, the thing that answers ``malloc()`` calls) is incredibly simple. It allocates LARGE_PAGE_SIZE chunks of memory at a time (2MB on x64), and simply returns pointers to that chunk for each memory request (adjusting h->next and allocation stats as it goes along, obviously). Once the 2MB has been exhausted, another 2MB is allocated.

_PyHeap_Malloc is the workhorse here:

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l2183

Very simple: it just keeps nudging the h->next pointer along for each request, allocating another heap when necessary. A nice side effect is that it's ridiculously fast and very cache friendly; Python code running within parallel contexts runs faster than normal main-thread code because of this (plus the boost from not doing any ref counting). The simplicity of the approach also made the heap snapshot logic really simple to implement; taking a snapshot and then rolling back is just a couple of memcpy's and some pointer fiddling.

> That approach is fine for the ``submit_(work|timer|wait)`` callbacks, which basically provide a way to run a presumably-finite-length function in a parallel thread (invoking callbacks/errbacks as required).
>
> However, it breaks down when dealing with client/server stuff. Each invocation of a callback (say, ``data_received(...)``) may only consume, say, 500 bytes, but it might be called a million times before the connection is terminated. You can't have cumulative memory usage with possibly-infinite-length client/server callbacks like you can with the once-off ``submit_(work|wait|timer)`` stuff.
>
> So, enter heap snapshots. The logic that handles all client/server connections is instrumented such that it takes a snapshot of the heap (and all associated stats) prior to invoking a Python method (via ``PyObject_Call()``, for example, i.e. the invocation of ``data_received``).

I came up with the heap snapshot stuff in a really perverse way. The first cut introduced a new 'TLS heap' concept; the idea was that before you'd call PyObject_CallObject(), you'd enable the TLS heap, then roll it back when you were done. i.e. the socket IO loop code had a lot of stuff like this:

    snapshot = ENABLE_TLS_HEAP();
    if (!PyObject_CallObject(...)) {
        DISABLE_TLS_HEAP_AND_ROLLBACK(snapshot);
        ...
    }
    DISABLE_TLS_HEAP();
    ...
    /* do stuff */
    ROLLBACK_TLS_HEAP(snapshot);

That was fine initially, until I had to deal with the (pretty common) case of allocating memory from the TLS heap (say, for an async send), and then having the callback picked up by a different thread. That thread then had to return the other thread's snapshot and, well, it just fell apart conceptually.

Then it dawned on me to just add the snapshot/rollback stuff to normal Context objects. In retrospect, it's silly I didn't think of this in the first place -- the biggest advantage of the Context abstraction is that it's thread-local, but not bindingly so (as in, it'll only ever run on one thread at a time, but it doesn't matter which one, which is essential given that a callback can be serviced by any thread in the pool). Once I switched out all the TLS heap cruft for Context-specific heap snapshots, everything "Just Worked".
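(To put some meat on the "couple of memcpy's and some pointer fiddling" claim, here's a toy standalone version of the idea -- a bump allocator plus snapshot/rollback. None of these names match the real _PyHeap_*/PxContext_* code, and it ignores alignment and the linked list of heaps.)

    /* Illustrative sketch of a bump allocator with snapshot/rollback. */
    #include <stdlib.h>
    #include <string.h>

    #define CHUNK_SIZE (2 * 1024 * 1024)   /* "2MB at a time" */

    typedef struct {
        char  *base;        /* start of the current chunk      */
        char  *next;        /* bump pointer                    */
        size_t remaining;   /* bytes left in the current chunk */
        size_t allocated;   /* running stats                   */
    } BumpHeap;

    typedef BumpHeap HeapSnapshot;  /* a snapshot is just a copy of the
                                       pointers and stats */

    static void *
    bump_malloc(BumpHeap *h, size_t n)
    {
        void *p;
        if (n > h->remaining) {
            /* The real code links another 2MB heap onto the context;
             * the sketch just grabs a fresh chunk. */
            h->base = malloc(CHUNK_SIZE);
            h->next = h->base;
            h->remaining = CHUNK_SIZE;
        }
        p = h->next;
        h->next      += n;          /* nudge the bump pointer along */
        h->remaining -= n;
        h->allocated += n;
        return p;
    }

    static HeapSnapshot
    heap_snapshot(const BumpHeap *h)
    {
        HeapSnapshot s;
        memcpy(&s, h, sizeof(s));   /* capture pointers + stats */
        return s;
    }

    static void
    heap_rollback(BumpHeap *h, const HeapSnapshot *s)
    {
        memcpy(h, s, sizeof(*h));   /* everything allocated since the
                                       snapshot is implicitly discarded */
    }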
(I haven't removed the TLS heap stuff yet as I'm still using it elsewhere (where it doesn't have the issue above). It's an xxx todo.)

The main consumer of this heap snapshot stuff (at the moment) is the socket IO loop logic:

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l5632

Typical usage now looks like this:

    snapshot = PxContext_HeapSnapshot(c, NULL);
    if (!PxSocket_LoadInitialBytes(s)) {
        PxContext_RollbackHeap(c, &snapshot);
        PxSocket_EXCEPTION();
    }

    /* at some later point... */
    PxContext_RollbackHeap(c, &snapshot);

> When the method completes, we can simply roll back the snapshot. The heap's stats and next pointers et al all get reset back to what they were before the callback was invoked.
>
> The only issue with this approach is detecting when the callback has done the unthinkable (from a shared-nothing perspective) and persisted some random object it created outside of the parallel context it was created in.
>
> That's actually a huge separate technical issue to tackle -- and it applies just as much to the normal ``submit_(wait|work|timer)`` callbacks as well. I've got a somewhat-temporary solution in place for that currently:
>
> That'll result in two contexts being created, one for each callback invocation. ``async.dict()`` is a "parallel safe" wrapper around a normal PyDict. This is referred to as "protection".
>
> In fact, the code above could have been written as follows:
>
>     d = async.protect(dict())
>
> What ``protect()`` does is instrument the object such that we intercept ``__getitem__``, ``__setitem__``, ``__getattr__`` and ``__setattr__``.

The 'protect' details are pretty hairy. _protect does a few checks:

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l1368

...and then palms things off to _PyObject_PrepOrigType:

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l1054

That method is where the magic happens. We basically clone the type object for the object we're protecting, then replace the setitem, getitem etc. methods with our counterparts (described next):

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l1100

Note the voodoo involved in 'protecting' heap objects versus normal C-type objects, GC objects versus non-GC, etc.

> We replace these methods with counterparts that serve two purposes:
>
>   1. The read-only methods are wrapped in a read lock, the write methods are wrapped in a write lock (using underlying system slim read/write locks, which are uber fast). (Basically, you can have unlimited readers holding the read lock, but only one writer can hold the write lock (excluding all the readers and other writers).)
>
>   2. Detecting when parallel objects (objects created from within a parallel thread, and thus backed by the parallel context's heap) have been assigned outside the context (in this case, to a "protected" dict object that was created from the main thread).

This is handled via _Px_objobjargproc_ass:

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l900

That is responsible for detecting when a parallel object is being assigned to a non-parallel object (and tries to persist the object where necessary).

> The first point is important as it ensures concurrent access doesn't corrupt the data structure.
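(As an aside, the slim read/write lock pattern referred to in the first point looks roughly like this. ProtectedSketch and the wrapper names are made up for illustration -- in the real code the wrappers are installed into the cloned type's slots by _PyObject_PrepOrigType.)

    /* Sketch of point 1: reads take a shared (read) lock, writes take an
     * exclusive (write) lock, using Windows slim read/write locks. */
    #include <windows.h>
    #include <Python.h>

    typedef struct {
        SRWLOCK   srw;       /* unlimited readers, single writer        */
        PyObject *wrapped;   /* the underlying dict/object              */
    } ProtectedSketch;       /* init with InitializeSRWLock(&p->srw)    */

    static PyObject *
    protected_getitem(ProtectedSketch *p, PyObject *key)
    {
        PyObject *value;
        AcquireSRWLockShared(&p->srw);            /* read lock  */
        value = PyObject_GetItem(p->wrapped, key);
        ReleaseSRWLockShared(&p->srw);
        return value;
    }

    static int
    protected_setitem(ProtectedSketch *p, PyObject *key, PyObject *value)
    {
        int result;
        AcquireSRWLockExclusive(&p->srw);         /* write lock */
        /* This is also where point 2 kicks in: if 'value' lives in a
         * parallel context's heap, try to persist that context. */
        result = PyObject_SetItem(p->wrapped, key, value);
        ReleaseSRWLockExclusive(&p->srw);
        return result;
    }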
> The second point is important because it allows us to prevent the persisted object's context from automatically transitioning into the complete->release->heapdestroy lifecycle when the callback completes.
>
> This is known as "persistence", as in, a context has been persisted. All sorts of things happen to the object when we detect that it's been persisted. The biggest thing is that reference counting is enabled again for the object (from the perspective of the main thread; ref counting is still a no-op within the parallel thread) -- however, once the refcount hits zero, instead of free()ing the memory like we'd normally do in the main thread (or garbage collecting it), we decref the reference count of the owning context.

That's the job of _Px_TryPersist (called via _Px_objobjargproc_ass as mentioned above):

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l861

That makes use of yet-another-incredibly-useful-Windows-feature called 'init once': basically, underlying system support for ensuring something only gets done *once*. Perfect for avoiding race conditions.

> Once the owning context's refcount goes to zero, we know that no more references exist to objects created from that parallel thread's execution, and we're free to release the context (and thus, destroy the heap -> free the memory).

All that magic is the unfortunate reason my lovely Py_INCREF/DECREF overrides went from very simple to quite-a-bit-more-involved. i.e. originally Py_INCREF was just:

    #define Py_INCREF(o) (Py_PXCTX ? (void)0 : (void)(Py_REFCNT(o)++))

With the advent of parallel object persistence and context-specific refcounts, things become less simple.

Py_INCREF:

    http://hg.python.org/sandbox/trent/file/7148209d5490/Include/object.h#l890

    __inline
    void
    _Py_IncRef(PyObject *op)
    {
        if ((!Py_PXCTX && (Py_ISPY(op) || Px_PERSISTED(op)))) {
            _Py_INC_REFTOTAL;
            (((PyObject *)(op))->ob_refcnt++);
        }
    }

Py_DECREF:

    http://hg.python.org/sandbox/trent/file/7148209d5490/Include/object.h#l911

    __inline
    void
    _Py_DecRef(PyObject *op)
    {
        if (!Py_PXCTX) {
            if (Px_PERSISTED(op))
                Px_DECREF(op);
            else if (!Px_ISPX(op)) {
                _Py_DEC_REFTOTAL;
                if ((--((PyObject *)(op))->ob_refcnt) != 0) {
                    _Py_CHECK_REFCNT(op);
                } else
                    _Py_Dealloc((PyObject *)(op));
            }
        }
    }

> That's currently implemented and works very well. There are a few drawbacks: one, the user must only assign to an "async protected" object. Use a normal dict and you're going to segfault or corrupt things (or worse) pretty quickly.
>
> Second, we're persisting the entire context, potentially for a single object. The context may be huge; think of some data processing callback that ran for ages, racked up a 100MB footprint, but only generated a PyLong with the value 42 at the end, which consumes, like, 50 bytes (or whatever the size of a PyLong is these days).
>
> It's crazy keeping a 100MB context around indefinitely until that PyLong object goes away, so we need another option. The idea I have for that is "promotion". Rather than persist the context, the object is "promoted"; basically, the parallel thread palms it off to the main thread, which proceeds to deep-copy the object and take over ownership. This removes the need for the context to be persisted.
>
> Now, I probably shouldn't have said "deep-copy" there. Promotion is a terrible option for anything other than simple objects (scalars).
> If you've got a huge list that consumes 98% of your 100MB heap footprint, well, persistence is perfect. If it's a 50 byte scalar, promotion is perfect. (Also, deep-copy implies collection interrogation, which has all sorts of complexities, so, err, I'll probably end up supporting promotion if the object is a scalar that can be shallow-copied. Any form of collection or non-scalar type will get persisted by default.)
>
> I haven't implemented promotion yet (persistence works well enough for now). And none of this is integrated into the heap snapshot/rollback logic -- i.e. we don't detect if a client/server callback assigned an object created in the parallel context to a main-thread object -- we just roll back blindly as soon as the callback completes.
>
> Before this ever has a chance of being eligible for adoption into CPython, those problems will need to be addressed. As much as I'd like to ignore those corner cases that violate the shared-nothing approach -- it's inevitable someone, somewhere, will be assigning parallel objects outside of the context, maybe for good reason, maybe by accident, maybe because they don't know any better. Whatever the reason, the result shouldn't be corruption.
>
> So, the remaining challenge is preventing the use case alluded to earlier where someone tries to modify an object that hasn't been "async protected". That's a bit harder. The idea I've got in mind is to instrument the main CPython ceval loop, such that we do these checks as part of opcode processing. That allows us to keep all the logic in the one spot and not have to go hacking the internals of every single object's C backend to ensure correctness.
>
> Now, that'll probably work to an extent. I mean, after all, there are opcodes for all the things we'd be interested in instrumenting: LOAD_GLOBAL, STORE_GLOBAL, SETITEM etc. What becomes challenging is detecting arbitrary mutations via object calls, i.e. how do we know, during the ceval loop, that foo.append(x) needs to be treated specially if foo is a main-thread object and x is a parallel-thread object?
>
> There may be no way to handle that *other* than hacking the internals of each object, unfortunately. So, the viability of this whole approach may rest on whether or not that's deemed an acceptable tradeoff (a necessary evil, even) by the Python developer community.

Actually, I'd sort of forgotten that I started adding protection support for lists in _PyObject_PrepOrigType. Well, technically, support for intercepting PySequenceMethods:

    http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l1126

I settled for just intercepting PyMappingMethods initially, which is why that chunk of code is commented out. Intercepting the mapping methods allowed me to implement the async protection for dicts and generic objects, which was sufficient for testing purposes at the time.

So, er, I guess my point is that automatically detecting object mutation might not be as hard as I'm alluding to above. I'll be happy if we're able to simply raise an exception if you attempt to mutate a non-protected main-thread object. That's infinitely better than segfaulting or silent corruption.

    Trent.