I've been running compute-intensive (multi-day), highly parallelizable Clojure
processes on high-core-count machines and blithely assuming that, since I saw
near-maximal CPU utilization in "top" and the like, I was probably getting
good speedups.
But a colleague recently did some tests and the results are really quite
alarming.
On Intel machines we're seeing speedups, but much less than I expected -- about
a 2x speedup going from 1 to 8 cores.
But on AMD processors we're seeing SLOWDOWNS, with the same tests taking almost
twice as long on 8 cores as on 1.
I'm baffled, and unhappy that my runs are probably going slower on 48-core and
64-core nodes than on single-core nodes.
It's possible that I'm just doing something wrong in the way that I dispatch
the tasks, or that I've missed some Clojure or JVM setting... but right now I'm
mystified and would really appreciate some help.
I'm aware that there's overhead for multicore distribution and that one can
expect slowdowns if the computations that are being distributed are fast
relative to the dispatch overhead, but this should not be the case here. We're
distributing computations that take seconds or minutes, and not huge numbers of
them (at least in our tests while trying to figure out what's going on).
I'm also aware that the test that produced the data I give below, insofar as it
uses pmap to do the distribution, may leave cores idle for a bit if some tasks
take a lot longer than others, because of the way that pmap allocates work to
threads. But that also shouldn't be a big issue here because for this test all
of the threads are doing the exact same computation. And I also tried using an
agent-based dispatch approach that shouldn't have the pmap thread allocation
issue, and the results were about the same.
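For concreteness, here's a sketch of the kind of alternative dispatch I mean
-- submitting every task as a future up front, so nothing waits on pmap's
sliding-window scheduling. (This is illustrative, not our exact agent code;
the function name is made up.)

```clojure
;; Eager, future-based dispatch (illustrative sketch): every task is
;; started immediately on its own thread, so one slow task can't delay
;; the scheduling of the others the way pmap's window can.
(defn dispatch-all
  "Apply f to each item on its own thread; block until all finish."
  [f items]
  (let [futs (doall (map #(future (f %)) items))] ; start all tasks now
    (mapv deref futs)))                           ; then wait for results
```

Usage would be e.g. (dispatch-all burn (range 8)) in place of
(pmap burn (range 8)).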
Note also that all of the computations in this test are purely functional and
independent -- there shouldn't be any resource contention issues.
The test: I wrote a time-consuming function that just does a bunch of math and
list manipulation (which is what takes a lot of time in my real applications):
(defn burn
  ([] (loop [i 0
             value '()]
        (if (>= i 10000)
          (count (last (take 10000 (iterate reverse value))))
          (recur (inc i)
                 (cons (* (int i)
                          (+ (float i)
                             (- (int i)
                                (/ (float i)
                                   (inc (int i))))))
                       value)))))
  ([_] (burn)))
Then I have a main function like this:
(defn -main
  [& args]
  (time (doall (pmap burn (range 8))))
  (System/exit 0))
We run it with "lein run" (we've tried both Leiningen 1.7.1 and
2.0.0-preview10) under Java 1.7.0_03, Java HotSpot(TM) 64-Bit Server VM. We also
tried Java 1.6.0_22. We've tried various JVM memory options (via :jvm-opts with
-Xmx and -Xms settings) and also with and without -XX:+UseParallelGC. None of
this seems to change the picture substantially.
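For reference, the memory settings were supplied roughly like this in
project.clj (the project name and heap sizes here are illustrative, not the
exact values we used):

```clojure
;; project.clj fragment -- names and sizes illustrative
(defproject burn-test "0.1.0"
  :dependencies [[org.clojure/clojure "1.3.0"]]
  :main burn-test.core
  :jvm-opts ["-Xms4g" "-Xmx4g" "-XX:+UseParallelGC"])
```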
The results that we get generally look like this:
- On an Intel Core i7 3770K with 4 cores (8 hardware threads) and 16GB of RAM,
running the code
above, it takes about 45 seconds (and all cores appear to be fully loaded as it
does so). If we change the pmap to just plain map, so that we use only a single
core, the time goes up to about 1 minute and 36 seconds. So the speedup for 8
cores is just about 2x, even though there are 8 completely independent tasks.
So that's pretty depressing.
- But much worse: on a 4 x Opteron 6272 with 48 cores and 32GB of RAM, running
the same test (with pmap) takes about 4 minutes and 2 seconds. That's really
slow! Changing the pmap to map here produces a runtime of about 2 minutes and
20 seconds. So it's quite a bit faster on one core than on 8! And all of these
times are terrible compared to those on the Intel.
Another strange observation is that we can run multiple instances of the test
on the same machine and (up to some limit, presumably) they don't seem to slow
each other down, even though just one instance of the test appears to be maxing
out all of the CPU according to "top". I suppose that means that "top" isn't
telling me what I thought -- my colleague says it can mean that something is
blocked in some way with a full instruction queue. But I'm not interested in
running multiple instances. I have single computations that involve multiple
expensive but independent subcomputations, and I want to farm those
subcomputations out to multiple cores -- and get speedups as a result. My
subcomputations are so completely independent that I think I should be able to
get speedups approaching a factor of n for n cores, but what I see is a factor
of only about 2 on Intel machines, and a bizarre factor of about 1/2 on AMD
machines.
Any help would be greatly appreciated!
Thanks,
-Lee
--
Lee Spector, Professor of Computer Science
Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
[email protected], http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to [email protected]
Note that posts from new members are moderated - please be patient with your
first post.
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en