date:20181016

Re: [Python-Dev] bpo-34837: Multiprocessing.Pool API Extension - Pass Data to Workers w/o Globals

2018-10-16 Thread Sean Harrington

@Nathaniel this is what I am suggesting as well. No caching - just storing
the `fn` on each worker, rather than pickling it for each item in our
iterable.

As long as we store the `fn` post-fork on the worker process (perhaps as a
global), subsequent calls to Pool.map shouldn't be affected (referencing
Antoine's & Michael's points that "multiprocessing encapsulates each
subprocess's globals in a separate namespace").

@Antoine - I'm making an effort to take everything you've said into
consideration here.  My initial PR and talk were intended to shed light on
a couple of pitfalls that I often see Python end-users encounter with Pool.
Moving beyond my naive first attempt, and the onslaught of deserved
criticism, it seems that we have an opportunity here: No changes to the
interface, just an optimization to reduce the frequency of pickling.

Raymond Hettinger may also be interested in this optimization, as he speaks
(with great analogies) about different ways you can misuse concurrency in
Python. This would address
one of the pitfalls that he outlines: the "size of the
serialized/deserialized data".
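
To make the "size of the serialized data" point concrete, the sketch below measures how many bytes a callable bound to a large object costs to pickle, then compares sending it 1000 times with sending it 5 times. The object (`big_lookup`) and the helper `bytes_for_callable` are invented for illustration. As far as I understand the current implementation, Pool.map batches the iterable into chunks, so the callable travels once per chunk rather than once per item; 1000 sends corresponds to single-item tasks, for example imap with its default chunksize of 1.

```python
import functools
import operator
import pickle

# A callable bound to a large object: membership test against 100,000 ints.
big_lookup = functools.partial(operator.contains, frozenset(range(100_000)))

fn_bytes = len(pickle.dumps(big_lookup))  # cost of shipping the callable once
item_bytes = len(pickle.dumps(12345))     # cost of shipping one small argument


def bytes_for_callable(n_sends):
    """Total bytes spent on the callable if it is pickled `n_sends` times."""
    return fn_bytes * n_sends


per_task = bytes_for_callable(1000)  # once per task, 1000 single-item tasks
per_worker = bytes_for_callable(5)   # once per worker, 5 workers

print(f"callable: {fn_bytes} bytes, item: {item_bytes} bytes")
print(f"sent per task: {per_task} bytes; sent per worker: {per_worker} bytes")
```

The callable dwarfs the argument here, so the cost of a map call is dominated by how often the callable is re-sent.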

Is this an optimization that either of you would be willing to review, and
accept, if I find there is a *reasonable way* to implement it?

On Fri, Oct 12, 2018 at 3:40 PM Nathaniel Smith wrote:

> On Fri, Oct 12, 2018, 06:09 Antoine Pitrou wrote:
>
>> On Fri, 12 Oct 2018 08:33:32 -0400
>> Sean Harrington wrote:
>> > Hi Nathaniel - if this solution can be made performant, then I would
>> > be more than satisfied.
>> >
>> > I think this would require removing "func" from the "task tuple", and
>> > storing the "func" "once per worker" somewhere globally (maybe a class
>> > attribute set post-fork?).
>> >
>> > This also has the beneficial outcome of increasing general performance
>> > of Pool.map and friends. I've seen MANY folks across the interwebs doing
>> > things like passing instance methods to map, resulting in "big" tasks,
>> > and slower-than-sequential parallelized code. Parallelizing "instance
>> > methods" by passing them to map, w/o needing to wrangle with
>> > staticmethods and globals, would be a GREAT feature! It'd just be as
>> > easy as:
>> >
>> > Pool.map(self.func, ls)
>> >
>> > What do you think about this idea? This is something I'd be able to take
>> > on, assuming I get a few core dev blessings...
>>
>> Well, I'm not sure how it would work, so it's difficult to give an
>> opinion.  How do you plan to avoid passing "self"?  By caching (by
>> equality? by identity?)?  Something else?  But what happens if "self"
>> changed value (in the case of a mutable object) in the parent?  Do you
>> keep using the stale version in the child?  That would break
>> compatibility...
>>
>
> I was just suggesting that within a single call to Pool.map, it would be a
> reasonable optimization to only send the fn once to each worker. So e.g. if
> you have 5 workers and 1000 items, you'd only pickle fn 5 times, rather
> than 1000 times like we do now. I wouldn't want to get any fancier than
> that with caching data between different map calls or anything.
>
> Of course even this may turn out to be too complicated to implement in a
> reasonable way, since it would require managing some extra state on the
> workers. But semantically it would be purely an optimization of current
> semantics.
>
> -n
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] bpo-34837: Multiprocessing.Pool API Extension - Pass Data to Workers w/o Globals

2018-10-16 Thread Michael Selik

Would this change the other pool method behavior in some way if the user,
for whatever reason, mixed techniques?

imap_unordered will only block when nexting the generator. If the user
mingles nexting that generator with, say, apply_async, could the change
you're proposing have some side-effect?


Re: [Python-Dev] Arbitrary non-identifier string keys when using **kwargs

2018-10-16 Thread Jeff Hardy

On Sun, Oct 14, 2018 at 12:15 PM Jeff Allen wrote:
>
> On 10/10/2018 00:06, Steven D'Aprano wrote:
>
>> On Tue, Oct 09, 2018 at 09:37:48AM -0700, Jeff Hardy wrote:
>>
>> ...
>>
>>> From an alternative implementation point of view, CPython's behaviour
>>> *is* the spec. Practicality beats purity and all that.
>>
>> Are you speaking on behalf of all authors of alternate implementations,
>> or even of some of them?
>>
>> It certainly is not true that CPython's behaviour "is" the spec. PyPy
>> keeps a list of CPython behaviour they don't match, either because they
>> choose not to for other reasons, or because they believe that the
>> CPython behaviour is buggy. I daresay IronPython and Jython have
>> similar.
>
> While agreeing with the principle, unless it is one of the fundamental
> differences (GC, GIL), Jython usually lets practicality beat purity. When
> faced with a certain combination of objects, one has to do something, and it
> is least surprising to do what CPython does. It's also easier than keeping a
> record.

This is how it is for IronPython as well. When the pool of potential
users is already small, one cannot afford to get too pedantic about
whether something is in the spec or not. Matching what CPython does is
the easiest way to make sure as many people as possible can use an
alternative implementation.
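
The behaviour named in the subject line can be seen directly in CPython. A minimal sketch follows (the function name `collect` is made up for illustration). Whether other implementations should match it is the point under discussion.

```python
def collect(**kwargs):
    """Return the keyword arguments exactly as received."""
    return kwargs


# CPython accepts any string as a key when unpacking a mapping with **,
# even ones that could never be written as `name=value`.
accepted = collect(**{"not an identifier": 1, "class": 2, "1st": 3})

# Non-string keys are rejected with TypeError.
try:
    collect(**{1: "one"})
except TypeError as exc:
    rejected = type(exc).__name__
else:
    rejected = None

print(accepted, rejected)
```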

- Jeff

Re: [Python-Dev] bpo-34837: Multiprocessing.Pool API Extension - Pass Data to Workers w/o Globals

2018-10-16 Thread Sean Harrington

Is your concern something like the following?

    from multiprocessing import Pool

    with Pool(8) as p:
        gen = p.imap_unordered(func, ls)
        first_elem = next(gen)
        p.apply_async(long_func, (x,))
        remaining_elems = list(gen)

...here, if we store "func" on each worker process as a global and execute
the pattern above, we will likely alter the state of one of the worker
processes such that it stores "long_func" in place of the initial "func".
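
The failure mode can be illustrated without any real processes. The toy below assumes a worker that keeps a single function slot; `SingleSlotWorker` is a hypothetical stand-in, not anything in multiprocessing.

```python
class SingleSlotWorker:
    """Toy stand-in for a worker process that remembers only one function."""

    def __init__(self):
        self.fn = None

    def store(self, fn):
        # Overwrites whatever was stored before (the problematic part).
        self.fn = fn

    def run(self, arg):
        return self.fn(arg)


def func(x):
    return x * 2


def long_func(x):
    return f"long_func({x})"


worker = SingleSlotWorker()

worker.store(func)             # imap_unordered(func, ls) sets the slot
first = worker.run(1)          # first item is processed as expected

worker.store(long_func)        # apply_async(long_func, ...) replaces it
apply_result = worker.run(99)

# Remaining imap_unordered items carry no function of their own, so they
# pick up whatever is in the slot now.
rest = [worker.run(x) for x in (2, 3)]

print(first, apply_result, rest)
```

The remaining items come back as `long_func(...)` results rather than doubled values, which is the silent corruption described above.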

So yes, this could break things. *A potential solution*:

> Replace "func" in the task tuple with an identifier (maybe, *perhaps
> naively*, func.__qualname__), and store the "identifier => func map"
> somewhere globally accessible, maybe as a class attribute on Pool. On any
> call to Pool.map, Pool.apply, etc., this map is updated. Then, in the
> worker process, as each task is processed, we use the "func identifier" on
> the task to recover the globally mapped "func", and apply it.
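
A toy version of the identifier map, again without real processes. `KeyedWorker` and `Scaler` are hypothetical names used only for this sketch. It also shows why `__qualname__` on its own may be too naive a key.

```python
class KeyedWorker:
    """Toy worker that keeps one function per identifier."""

    def __init__(self):
        self.registry = {}

    def register(self, key, fn):
        self.registry[key] = fn

    def run(self, key, arg):
        # Each task names the function it wants, so tasks from different
        # calls can be interleaved safely.
        return self.registry[key](arg)


def func(x):
    return x * 2


def long_func(x):
    return x + 100


class Scaler:
    def __init__(self, factor):
        self.factor = factor

    def scale(self, x):
        return x * self.factor


worker = KeyedWorker()
func_key = func.__qualname__
long_key = long_func.__qualname__
worker.register(func_key, func)
worker.register(long_key, long_func)

interleaved = [
    worker.run(func_key, 1),
    worker.run(long_key, 1),
    worker.run(func_key, 2),
]

# Caveat: __qualname__ does not distinguish instances, so bound methods of
# two different objects would share one key.
same_key = Scaler(2).scale.__qualname__ == Scaler(10).scale.__qualname__

print(interleaved, same_key)
```

Interleaving now works, but the last line prints `True` for `same_key`: a real implementation would need a key that also identifies the bound object, or a per-call identifier.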


This would avoid the weird stateful bug above. We could also do something
slightly different if folks are averse to the "Pool class-attribute
func map" (i.e. averse to globals): store this map as an instance
attribute on the Pool object, and wrap the "initializer" func to make the
map globally available in the worker via the "global" keyword.

One note: this isn't a "cache", it's just a global map which has its keys &
values updated *blindly* with every call to a Pool method. It serves
as a way to bypass repeated serialization of functions in Pool, which can
be large when bound to big objects (like large class instances, or
functools.partial objects).

