Andy Fingerhut <[email protected]> writes:
> I'm not practiced in recognizing megamorphic call sites, so I could be
> missing some in the example code below, modified from Lee's original
> code. It doesn't use reverse or conj, and as far as I can tell
> doesn't use PersistentList, either, only Cons.
...
> Can you try to reproduce to see if you get similar results? If so, do
> you know why we get bad parallelism in a single JVM for this code? If
> there are no megamorphic call sites, then it is examples like this
> that lead me to wonder about locking in memory allocation and/or GC.
I think your benchmark is a bit different from Lee’s original. The
`reverse`-based versions perform heavy allocation as they repeatedly
reverse a sequence, but each thread holds a sequence of at most
10,000 elements at any given time. In your benchmark, each thread
holds a sequence of up to 2,000,000 elements, a naive 200x increase
in memory pressure and a potential increase in the number of objects
promoted out of the young generation.
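To make the contrast concrete, here is a hedged sketch of the two
allocation patterns (these are illustrative stand-ins, not Lee’s or
your actual code): the first keeps only a short sequence live while
churning through garbage, the second realizes and holds a
2,000,000-element sequence at once.

```clojure
;; Small live set: repeatedly reverse a 10,000-element sequence.
;; Each reversal allocates heavily, but the previous result becomes
;; garbage immediately, so little survives the young generation.
(defn small-live-set []
  (nth (iterate reverse (range 10000)) 200))

;; Large live set: realize and hold all 2,000,000 elements at once,
;; so every cell stays reachable for the duration of the step.
(defn large-live-set []
  (doall (map inc (range 2000000))))
```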
I ran your benchmark under a version of Cameron’s criterium-based
speed-up measurement wrapper I’ve modified to pass in the `pmap`
function to use. I reduced the number of iterations in your algorithm
by a factor of 5 to get it to run in a reasonable amount of time. And I
ran it using default JVM GC settings, on a 32-way AMD system.
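For reference, the wrapper has roughly this shape (a simplified
stand-in using wall-clock time rather than criterium’s statistics;
the names `speedup`, `time-ms`, and `work-fn` are mine, not
Cameron’s): it takes the `pmap` function to use as an argument and
compares it against serial `map` over the same work.

```clojure
;; Wall-clock timing of a thunk, in milliseconds.
(defn time-ms [f]
  (let [start (System/nanoTime)]
    (f)
    (/ (- (System/nanoTime) start) 1e6)))

;; Compare serial map against a pmap-like function passed in,
;; forcing both result sequences with doall.
(defn speedup [pmap-fn work-fn coll]
  (let [smap-ms (time-ms #(doall (map work-fn coll)))
        pmap-ms (time-ms #(doall (pmap-fn work-fn coll)))]
    {:smap-ms smap-ms
     :pmap-ms pmap-ms
     :speedup (/ smap-ms pmap-ms)}))

;; e.g. (speedup pmap #(reduce + (range %)) (repeat 8 100000))
```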
I get the following numbers for 1-32 way parallelism with a 500MB heap:
andy 1 : smap-ms 7.5, pmap-ms 7.7, speedup 0.97
andy 2 : smap-ms 7.8, pmap-ms 9.8, speedup 0.80
andy 4 : smap-ms 8.5, pmap-ms 10.6, speedup 0.80
andy 8 : smap-ms 8.6, pmap-ms 11.5, speedup 0.75
andy 16 : smap-ms 8.1, pmap-ms 12.5, speedup 0.65
andy 32 : [java.lang.OutOfMemoryError: Java heap space]
And these numbers with a 4GB heap:
andy 1 : smap-ms 3.8, pmap-ms 4.0, speedup 0.95
andy 2 : smap-ms 4.2, pmap-ms 2.1, speedup 2.02
andy 4 : smap-ms 4.2, pmap-ms 1.7, speedup 2.48
andy 8 : smap-ms 4.2, pmap-ms 1.2, speedup 3.44
andy 16 : smap-ms 4.4, pmap-ms 1.0, speedup 4.52
andy 32 : smap-ms 4.0, pmap-ms 1.6, speedup 2.55
I’m running out of time for breakfast experiments, but it seems
relatively likely to me that the increased at-once sequence size in your
benchmark is increasing the number of objects making it out of the young
generation. This in turn is increasing the number of pause-the-world
GCs, which increase even further in frequency at lower heap sizes. I’ll
run these again later with GC logging and report if the results are
unexpected.
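The GC logging run would look something like this (a hypothetical
invocation; the classpath and script name are placeholders, and the
flags are the standard HotSpot GC logging options):

```shell
# Enable GC logging to count stop-the-world collections at a given
# heap size; compare -Xmx500m against -Xmx4g.
java -Xmx4g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -cp clojure.jar clojure.main bench.clj
```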
-Marshall