For 4 cores, on your system, your conclusion makes some sense. That said, I played around with this on both a Core 2 Duo and the 12-core system. On the 12-core system, in my tests, the 0-thread case ran extremely close to the 2-thread case for all my sizes.
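For anyone who wants to repeat this kind of comparison, here is a minimal sketch of a timing loop over problem sizes and thread settings. It is not the code I actually ran: `time_kernel` and `placeholder_kernel` are hypothetical names, and the placeholder ignores its thread argument, so you would need to swap in your own OpenMP-backed routine to see any real difference between settings.

```python
import timeit


def time_kernel(kernel, sizes, thread_counts, repeat=5, number=10):
    """Return {(size, threads): best observed seconds per call}."""
    results = {}
    for n in sizes:
        for t in thread_counts:
            # Take the minimum over several repeats to reduce noise
            # from other processes on the machine.
            runs = timeit.repeat(lambda: kernel(n, t),
                                 repeat=repeat, number=number)
            results[(n, t)] = min(runs) / number
    return results


def placeholder_kernel(n, num_threads):
    # Hypothetical stand-in for the real routine. It ignores
    # num_threads entirely; replace it with a call into the
    # OpenMP-enabled extension being benchmarked.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    timings = time_kernel(placeholder_kernel,
                          sizes=[1000, 10000],
                          thread_counts=[0, 1, 2])
    for (n, t), secs in sorted(timings.items()):
        print("size=%d threads=%d best=%.6f s" % (n, t, secs))
```

Taking the best of several repeats matters here, since the differences between thread settings can be only a few percent.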
The Core 2 Duo runs Windows 7, and after downloading pthreadGC2.dll from the pthreads project, I was able to use OpenMP under a year-old (32-bit) Python(x,y) distribution with MinGW. The result: 0 threads came in slightly faster than 1 thread (0.00102 versus 0.00106), and 2 threads took 0.00060.

My current theory is that gcc under Linux uses some background trick to get two thread-like streams going. As I assess scale-up under Linux, I will need to take this behavior into account.

Creating optimal code with OpenMP certainly requires a considerable commitment. Given the problem-specific fine-tuning required, I would not expect much gain in general-purpose routines. In specific routines like cdist, it might make more sense. I talked to a Dell HPC rep today, and he said that squeezing out an extra 15% performance boost on an Intel CPU was a pleasant surprise, so the 30% improvement is maybe not so bad.

Cheers,
Eric

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion