Re: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication

Bruce Southey Thu, 16 Jun 2011 11:57:25 -0700

On 06/16/2011 11:44 AM, Christopher Barker wrote:

NOTE: I'm only taking part in this discussion because it's interesting and I hope to learn something. I do hope the OP chimes back in to clarify his needs, but in the meantime...
Bruce Southey wrote:
Remember that is what the OP wanted to do, not me.
Actually, I don't think that's what the OP wanted -- I think we have a conflict between the need for concrete examples, and the desire to find a generic solution, so I think this is what the OP wants:
How to best multiprocess a _generic_ operation that needs to be performed on a lot of arrays. Something like:
output = []
for a in a_bunch_of_arrays:
  output.append( a_function(a) )
More specifically, a_function() is an inner product, *defined by the user*.
So there is no way to optimize the inner product itself (that will be up to the user), nor any way to generally convert the bunch_of_arrays to a single array with a single higher-dimensional operation.
In testing his approach, the OP used a numpy multiply, and a simple, loop-through-the-elements multiply, and found that with his multiprocessing calls, the simple loop was a fair bit faster with two processors, but that the numpy one was slower with two processors. Of course, the looping method was much, much slower than the numpy one in any case.
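The generic pattern described above can be sketched with multiprocessing.Pool. This is only a minimal sketch under stated assumptions, not the OP's code: a_function is a hypothetical stand-in for the user-defined operation, serial_map and parallel_map are illustrative names, and the array sizes and chunksize value are arbitrary. Passing chunksize to pool.map batches several arrays into each task, which may help when each work item is very cheap.

```python
from multiprocessing import Pool

import numpy as np


def a_function(a):
    # Hypothetical stand-in for the user-defined operation.
    return float(np.sum(a * a))


def serial_map(func, arrays):
    # The plain loop from the text, written as a list comprehension.
    return [func(a) for a in arrays]


def parallel_map(func, arrays, processes=2, chunksize=8):
    # Every array is pickled and sent to a worker process;
    # chunksize groups several arrays into one task to cut per-task overhead.
    with Pool(processes=processes) as pool:
        return pool.map(func, arrays, chunksize=chunksize)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a_bunch_of_arrays = [rng.random((30, 20)) for _ in range(64)]
    serial = serial_map(a_function, a_bunch_of_arrays)
    parallel = parallel_map(a_function, a_bunch_of_arrays)
    print("results agree:", np.allclose(serial, parallel))
```

Whether the parallel version is faster depends on how much work a_function does per array relative to the cost of shipping that array to a worker.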
So Sturla's comments are probably right on:

Sturla Molden wrote:
"innerProductList = pool.map(myutil.numpy_inner_product, arrayList)"
1. Here we potentially have a case of false sharing and/or mutex contention, as the work is too fine grained. pool.map does not do any load balancing. If pool.map is to scale nicely, each work item must take a substantial amount of time. I suspect this is the main issue.
2. There is also the question of when the process pool is spawned. Though I haven't checked, I suspect it happens prior to calling pool.map. But if it does not, this is a factor as well, particularly on Windows (less so on Linux and Apple).
It didn't work well on my Mac, so it's either not an issue, or not Windows-specific, anyway.
3. "arrayList" is serialised by pickling, which has a significant overhead. It's not shared memory either, as the OP's code implies, but the main thing is the slowness of cPickle.
I'll bet this is a big issue, and one I'm curious about how to address. I have another problem where I need to multi-process, and I'd love to know a way to pass data to the other process and back *without* going through pickle. Maybe memmapped files?
"IPs = N.array(innerProductList)"
4. numpy.array is a very slow function. The benchmark should preferably not include this overhead.
I re-ran, moving that out of the timing loop, and, indeed, it helped a lot, but it still takes longer with the multi-processing.
I suspect that the overhead of pickling, etc. is overwhelming the operation itself. That and the load balancing issue that I don't understand!
To test this, I did a little experiment -- creating a "fake" operation, one that simply returns an element from the input array -- so it should take next to no time, and we can time the overhead of the pickling, etc:
$ python shared_mem.py

Using 2 processes
No shared memory, numpy array multiplication took 0.124427080154 seconds
Shared memory, numpy array multiplication took 0.586215019226 seconds
No shared memory, fake array multiplication took 0.000391006469727 seconds
Shared memory, fake array multiplication took 0.54935503006 seconds

No shared memory, my array multiplication took 23.5055780411 seconds
Shared memory, my array multiplication took 13.0932741165 seconds

Bingo!
The overhead of the multi-processing is about 0.54 seconds, which explains the slowdown for the numpy method.
Not so mysterious after all.
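One way to get a feel for the serialisation part of that overhead is to time a pickle round trip of an array next to the operation itself. The sketch below is an illustration only: best_time, pickle_round_trip and elementwise_inner are hypothetical helper names, the array shapes are arbitrary, and a pickle round trip in a single process only approximates what a process pool does (it leaves out pipe transfer and process scheduling).

```python
import pickle
import time

import numpy as np


def best_time(func, *args, repeat=5):
    """Return the fastest of `repeat` wall-clock timings of func(*args)."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def pickle_round_trip(arr):
    # Approximates the serialisation cost of sending one array to a worker.
    return pickle.loads(pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL))


def elementwise_inner(a, b):
    # The operation quoted later in the thread: numpy.sum(array1*array2).
    return float(np.sum(a * b))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for shape in [(30, 20), (300, 200), (3000, 200)]:
        a, b = rng.random(shape), rng.random(shape)
        print(shape,
              "pickle:", best_time(pickle_round_trip, a),
              "compute:", best_time(elementwise_inner, a, b))
```

If the pickle time is comparable to or larger than the compute time for a given shape, farming that operation out to another process is unlikely to pay off.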

Bruce Southey wrote:
But if everything is *single-threaded* and thread-safe, then you just create a function and use Anne's very useful handythread.py (http://www.scipy.org/Cookbook/Multithreading).
This may be worth a try -- though the GIL could well get in the way.
By the way, if the arrays are sufficiently small, there is a lot of overhead involved such that there is more time in communication than computation.
yup -- clearly the case here. I wonder if it's just array size though -- won't cPickle time scale with array size? So it may not be size per se, but rather how much computation you need for a given size array.
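On the earlier question of passing data to another process without pickling it, memory-mapped files are one option worth trying. The sketch below is a hedged illustration, not a tested recommendation for the OP's problem: it assumes all arrays share one shape and dtype and can be stacked in a single file, and SHAPE, write_arrays and inner_for_index are hypothetical names. Only a file path and an integer index are pickled per task; each worker opens the file itself with numpy.memmap.

```python
import os
import tempfile
from multiprocessing import Pool

import numpy as np

SHAPE = (8, 300, 200)  # hypothetical: 8 arrays of 300 x 200 float64


def write_arrays(path):
    # Create the backing file; array i is filled with the value i.
    data = np.memmap(path, dtype=np.float64, mode="w+", shape=SHAPE)
    data[:] = np.arange(SHAPE[0], dtype=np.float64)[:, None, None]
    data.flush()
    del data


def inner_for_index(task):
    # Worker: open the file read-only and operate on one slice of it.
    path, index = task
    data = np.memmap(path, dtype=np.float64, mode="r", shape=SHAPE)
    a = data[index]
    return float(np.sum(a * a))


if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".dat")
    os.close(fd)
    try:
        write_arrays(path)
        tasks = [(path, i) for i in range(SHAPE[0])]
        with Pool(processes=2) as pool:
            results = pool.map(inner_for_index, tasks)
        print(results)
    finally:
        os.remove(path)
```

The results still come back through pickle, but here each result is a single float, so that cost is small. Opening the memmap in every task has its own overhead, so this is only likely to help when the arrays are large.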
-Chris

[I've enclosed the OP's slightly altered code]

_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Please see:
http://mail.scipy.org/pipermail/numpy-discussion/2011-June/056766.html

"I'm doing element wise multiplication, basically innerProduct = numpy.sum(array1*array2) where array1 and array2 are, in general, multidimensional."


Thanks for the code as I forgot that it was sent.

I think there is something weird about these timings (probably because these use time and not timeit) - the shared timings should not be constant across the number of processors. The numpy multiplication approach shows rather constant differences, but using np.inner() clearly differs with the number of processors used. So I think that numpy may be using multiple threads here.
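For reference, a timing helper built on the standard timeit module might look like the sketch below. It is an assumption about how the benchmark could be restructured, not the code that produced the numbers in this thread; best_time and elementwise_inner are hypothetical names. Taking the minimum over several repeats reduces the influence of other activity on the machine.

```python
import timeit

import numpy as np


def elementwise_inner(a, b):
    # The operation quoted above: numpy.sum(array1*array2).
    return np.sum(a * b)


def best_time(func, *args, repeat=5, number=10):
    """Best per-call time in seconds over `repeat` runs of `number` calls each."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.random((300, 200))
    b = rng.random((300, 200))
    print("numpy.sum(a*b): %.6f s per call" % best_time(elementwise_inner, a, b))
```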


It is far more evident with large arrays:
    arraySize = (3000,200)
    numArrays = 50

Using 1 processes
No shared memory, numpy array multiplication took 0.279149055481 seconds
Shared memory, numpy array multiplication took 1.87239384651 seconds
No shared memory, inner array multiplication took 14.9514381886 seconds
Shared memory, inner array multiplication took 17.0087819099 seconds

Using 4 processes
No shared memory, numpy array multiplication took 0.279071807861 seconds
Shared memory, numpy array multiplication took 1.48242783546 seconds
No shared memory, inner array multiplication took 15.1401138306 seconds
Shared memory, inner array multiplication took 5.2479391098 seconds

Using 8 processes
No shared memory, numpy array multiplication took 0.281194925308 seconds
Shared memory, numpy array multiplication took 1.44942212105 seconds
No shared memory, inner array multiplication took 15.3794519901 seconds
Shared memory, inner array multiplication took 3.51714301109 seconds
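One thing to keep in mind when comparing the two columns: for 2-D inputs, np.inner(a, b) is not the same operation as numpy.sum(a*b). It computes the inner product of every row of a with every row of b, so two (3000, 200) arrays give a (3000, 3000) result. If the "inner" benchmark passes the 2-D arrays straight to np.inner (an assumption, since that code is not shown here), it does far more arithmetic than the elementwise version, which may account for part of the gap in the timings. A small sketch with arbitrary example arrays:

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
b = np.ones((2, 3))

elementwise = np.sum(a * b)        # scalar: sum over all matching elements
pairwise = np.inner(a, b)          # matrix: row i of a dotted with row j of b

print(elementwise)                 # a single number
print(pairwise.shape)              # (2, 2)

# The elementwise result is the trace of the pairwise matrix;
# np.vdot flattens both arrays and gives the same scalar directly.
print(np.trace(pairwise), np.vdot(a, b))
```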


Bruce
