On 06/16/2011 11:44 AM, Christopher Barker wrote:
NOTE: I'm only taking part in this discussion because it's interesting and I hope to learn something. I do hope the OP chimes back in to clarify his needs, but in the meantime...



Bruce Southey wrote:
Remember that is what the OP wanted to do, not me.

Actually, I don't think that's what the OP wanted -- I think we have a conflict between the need for concrete examples, and the desire to find a generic solution, so I think this is what the OP wants:


How to best multiprocess a _generic_ operation that needs to be performed on a lot of arrays. Something like:


output = []
for a in a_bunch_of_arrays:
  output.append( a_function(a) )


More specifically, a_function() is an inner product, *defined by the user*.

So there is no way to optimize the inner product itself (that will be up to the user), nor any way to generally convert the bunch_of_arrays to a single array with a single higher-dimensional operation.

In testing his approach, the OP used a numpy multiply and a simple loop-through-the-elements multiply, and found that with his multiprocessing calls, the simple loop was a fair bit faster with two processors, but the numpy one was slower with two processors. Of course, the looping method was much, much slower than the numpy one in any case.

So Sturla's comments are probably right on:

Sturla Molden wrote:

"innerProductList = pool.map(myutil.numpy_inner_product, arrayList)"

1. Here we potentially have a case of false sharing and/or mutex contention, as the work is too fine grained. pool.map does not do any load balancing. If pool.map is to scale nicely, each work item must take a substantial amount of time. I suspect this is the main issue.

2. There is also the question of when the process pool is spawned. Though I haven't checked, I suspect it happens prior to calling pool.map. But if it does not, this is a factor as well, particularly on Windows (less so on Linux and Apple).

It didn't work well on my Mac, so it's either not an issue, or not Windows-specific, anyway.

3. "arrayList" is serialised by pickling, which has a significant overhead. It's not shared memory either, as the OP's code implies, but the main thing is the slowness of cPickle.

I'll bet this is a big issue, and one I'm curious about how to address. I have another problem where I need to multi-process, and I'd love to know a way to pass data to the other process and back *without* going through pickle. Maybe memory-mapped files?
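
As a rough illustration of the memory-mapped-file idea, here is a minimal sketch. It assumes that all arrays have the same shape, so they can be stacked into one .npy file, and it uses np.sum(a * a) as a stand-in for the user's operation. The names inner_product_from_file and run are hypothetical, not from the OP's code. Each worker receives only a file path and an index, so only those small objects and the scalar result go through pickle.

```python
import multiprocessing as mp
import os
import tempfile

import numpy as np


def inner_product_from_file(task):
    """Open the stacked arrays as a read-only memory map and process one."""
    path, index = task
    stack = np.load(path, mmap_mode="r")  # maps the file, no full read up front
    a = stack[index]
    # Stand-in for the user-defined operation (an assumption).
    return float(np.sum(a * a))


def run(path, num_arrays, processes=2):
    """Send only (path, index) pairs to the workers, not the arrays."""
    tasks = [(path, i) for i in range(num_arrays)]
    with mp.Pool(processes=processes) as pool:
        return pool.map(inner_product_from_file, tasks)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stack = rng.random((10, 300, 20))  # 10 arrays of shape (300, 20)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "arrays.npy")
        np.save(path, stack)
        results = run(path, len(stack))
    expected = [float(np.sum(a * a)) for a in stack]
    assert np.allclose(results, expected)
```

The data still has to be written to disk once, so whether this beats pickling depends on how often the same arrays are reused.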

"IPs = N.array(innerProductList)"

4. numpy.array is a very slow function. The benchmark should preferably not include this overhead.

I re-ran, moving that out of the timing loop, and, indeed, it helped a lot, but it still takes longer with the multi-processing.

I suspect that the overhead of pickling, etc. is overwhelming the operation itself. That and the load balancing issue that I don't understand!

To test this, I did a little experiment -- creating a "fake" operation, one that simply returns an element from the input array -- so it should take next to no time, and we can time the overhead of the pickling, etc:

$ python shared_mem.py

Using 2 processes
No shared memory, numpy array multiplication took 0.124427080154 seconds
Shared memory, numpy array multiplication took 0.586215019226 seconds

No shared memory, fake array multiplication took 0.000391006469727 seconds
Shared memory, fake array multiplication took 0.54935503006 seconds

No shared memory, my array multiplication took 23.5055780411 seconds
Shared memory, my array multiplication took 13.0932741165 seconds

Bingo!

The overhead of the multi-processing is about 0.54 seconds, which explains the slowdown for the numpy method.

Not so mysterious after all.
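
The "fake" operation experiment can be sketched roughly as follows. This is not the enclosed shared_mem.py, only an assumed minimal version: fake_operation, serial, parallel and time_call are hypothetical names, and the array count and shape are placeholders. Because the fake operation does almost no work, the parallel timing is nearly all overhead (pool start-up, pickling, and inter-process communication).

```python
import multiprocessing as mp
import time

import numpy as np


def fake_operation(a):
    """Return one element: next to no computation, so timing shows overhead."""
    return float(a.flat[0])


def serial(arrays):
    return [fake_operation(a) for a in arrays]


def parallel(arrays, processes=2):
    # Every array is pickled and sent to a worker process.
    with mp.Pool(processes=processes) as pool:
        return pool.map(fake_operation, arrays)


def time_call(func, *args):
    """Single wall-clock measurement; includes pool start-up for parallel()."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    arrays = [np.full((300, 20), float(i)) for i in range(50)]
    serial_result, serial_time = time_call(serial, arrays)
    parallel_result, parallel_time = time_call(parallel, arrays)
    assert serial_result == parallel_result
    print(f"serial:   {serial_time:.6f} s")
    print(f"parallel: {parallel_time:.6f} s")
```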

Bruce Southey wrote:

But if everything is *single-threaded* and thread-safe, then you just create a function and use Anne's very useful handythread.py (http://www.scipy.org/Cookbook/Multithreading).

This may be worth a try -- though the GIL could well get in the way.

By the way, if the arrays are sufficiently small, there is a lot of overhead involved such that there is more time in communication than computation.

Yup -- clearly the case here. I wonder if it's just array size though -- won't cPickle time scale with array size? So it may not be size per se, but rather how much computation you need for a given size array.
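
One way to look at that question is to time a pickle round trip against the computation for several array sizes. The sketch below is only an assumption about how to set that up: it uses np.sum(a * a) as a stand-in for the user's operation, the Python 3 pickle module rather than cPickle, and the hypothetical helper names pickle_round_trip, inner_product and compare.

```python
import pickle
import timeit

import numpy as np


def pickle_round_trip(a):
    """Serialise and deserialise, roughly what happens per work item."""
    return pickle.loads(pickle.dumps(a, protocol=pickle.HIGHEST_PROTOCOL))


def inner_product(a):
    """Stand-in for the user-defined operation (an assumption)."""
    return float(np.sum(a * a))


def compare(shape, repeats=20):
    """Return (pickle seconds, compute seconds) per call for one shape."""
    a = np.random.default_rng(0).random(shape)
    t_pickle = timeit.timeit(lambda: pickle_round_trip(a), number=repeats)
    t_compute = timeit.timeit(lambda: inner_product(a), number=repeats)
    return t_pickle / repeats, t_compute / repeats


if __name__ == "__main__":
    for shape in [(30, 20), (300, 20), (3000, 200)]:
        t_pickle, t_compute = compare(shape)
        print(f"{shape}: pickle {t_pickle:.2e} s, compute {t_compute:.2e} s")
```

If both times grow at about the same rate with size, then the ratio of computation to data is what matters, not the size alone.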

-Chris

[I've enclosed the OP's slightly altered code]





_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Please see:
http://mail.scipy.org/pipermail/numpy-discussion/2011-June/056766.html
"I'm doing element wise multiplication, basically innerProduct = numpy.sum(array1*array2) where array1 and array2 are, in general, multidimensional."

Thanks for the code; I had forgotten that it was sent.

I think there is something weird about these timings (probably because they use time and not timeit): the shared timings should not be constant across the number of processors. The numpy multiplication approach shows rather constant differences, but np.inner() clearly differs with the number of processors used. So I think that numpy may be using multiple threads here.

It is far more evident with large arrays:
    arraySize = (3000,200)
    numArrays = 50

Using 1 processes
No shared memory, numpy array multiplication took 0.279149055481 seconds
Shared memory, numpy array multiplication took 1.87239384651 seconds
No shared memory, inner array multiplication took 14.9514381886 seconds
Shared memory, inner array multiplication took 17.0087819099 seconds

Using 4 processes
No shared memory, numpy array multiplication took 0.279071807861 seconds
Shared memory, numpy array multiplication took 1.48242783546 seconds
No shared memory, inner array multiplication took 15.1401138306 seconds
Shared memory, inner array multiplication took 5.2479391098 seconds

Using 8 processes
No shared memory, numpy array multiplication took 0.281194925308 seconds
Shared memory, numpy array multiplication took 1.44942212105 seconds
No shared memory, inner array multiplication took 15.3794519901 seconds
Shared memory, inner array multiplication took 3.51714301109 seconds
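
For reference, the time-versus-timeit concern mentioned above could be addressed with something like the following sketch. It times the numpy.sum(array1*array2) expression quoted earlier with timeit.repeat and takes the minimum, which is less sensitive to one-off delays than a single time.time() difference. The helper name best_time and the number/repeat values are assumptions; the array size is the one given above.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
array1 = rng.random((3000, 200))
array2 = rng.random((3000, 200))


def inner_product():
    """The element-wise product and sum quoted from the OP."""
    return np.sum(array1 * array2)


def best_time(func, number=10, repeat=5):
    """Best per-call time in seconds over several repeated runs."""
    totals = timeit.repeat(func, number=number, repeat=repeat)
    return min(totals) / number


if __name__ == "__main__":
    print(f"best per-call time: {best_time(inner_product):.6f} s")
```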


Bruce

