Andy Fingerhut <[email protected]> writes:
> I'm not practiced in recognizing megamorphic call sites, so I could be
> missing some in the example code below, modified from Lee's original
> code. It doesn't use reverse or conj, and as far as I can tell
> doesn't use PersistentList, either, only Cons.
...
> Can you try to reproduce to see if you get similar results? If so, do
> you know why we get bad parallelism in a single JVM for this code? If
> there are no megamorphic call sites, then it is examples like this
> that lead me to wonder about locking in memory allocation and/or GC.
I think your benchmark is a bit different from Lee’s original. The
`reverse`-based versions perform heavy allocation as they repeatedly
reverse a sequence, but each thread holds a sequence of at most
10,000 elements at any given time. In your benchmark, each thread
holds a sequence of up to 2,000,000 elements, a naive 200x increase
in memory pressure and a potential increase in the number of objects
promoted out of the young generation.
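To make the contrast concrete, here is a hedged sketch of the two
allocation patterns (these are illustrative stand-ins, not Lee’s or
your actual code): the first keeps only a short sequence live while
churning through garbage, the second realizes and holds a
2,000,000-element sequence at once.

```clojure
;; Small live set: repeatedly reverse a 10,000-element sequence.
;; Each reversal allocates heavily, but the previous result becomes
;; garbage immediately, so little survives the young generation.
(defn small-live-set []
  (nth (iterate reverse (range 10000)) 200))

;; Large live set: realize and hold all 2,000,000 elements at once,
;; so every cell stays reachable for the duration of the step.
(defn large-live-set []
  (doall (map inc (range 2000000))))
```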
I ran your benchmark under a version of Cameron’s criterium-based
speed-up measurement wrapper I’ve modified to pass in the `pmap`
function to use. I reduced the number of iterations in your algorithm
by a factor of 5 to get it to run in a reasonable amount of time. And I
ran it using default JVM GC settings, on a 32-way AMD system.
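For reference, the wrapper has roughly this shape (a simplified
stand-in using wall-clock time rather than criterium’s statistics;
the names `speedup`, `time-ms`, and `work-fn` are mine, not
Cameron’s): it takes the `pmap` function to use as an argument and
compares it against serial `map` over the same work.

```clojure
;; Wall-clock timing of a thunk, in milliseconds.
(defn time-ms [f]
  (let [start (System/nanoTime)]
    (f)
    (/ (- (System/nanoTime) start) 1e6)))

;; Compare serial map against a pmap-like function passed in,
;; forcing both result sequences with doall.
(defn speedup [pmap-fn work-fn coll]
  (let [smap-ms (time-ms #(doall (map work-fn coll)))
        pmap-ms (time-ms #(doall (pmap-fn work-fn coll)))]
    {:smap-ms smap-ms
     :pmap-ms pmap-ms
     :speedup (/ smap-ms pmap-ms)}))

;; e.g. (speedup pmap #(reduce + (range %)) (repeat 8 100000))
```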
I get the following numbers for 1-32 way parallelism with a 500MB heap:
andy 1 : smap-ms 7.5, pmap-ms 7.7, speedup 0.97
andy 2 : smap-ms 7.8, pmap-ms 9.8, speedup 0.80
andy 4 : smap-ms 8.5, pmap-ms 10.6, speedup 0.80
andy 8 : smap-ms 8.6, pmap-ms 11.5, speedup 0.75
andy 16 : smap-ms 8.1, pmap-ms 12.5, speedup 0.65
andy 32 : [java.lang.OutOfMemoryError: Java heap space]
And these numbers with a 4GB heap:
andy 1 : smap-ms 3.8, pmap-ms 4.0, speedup 0.95
andy 2 : smap-ms 4.2, pmap-ms 2.1, speedup 2.02
andy 4 : smap-ms 4.2, pmap-ms 1.7, speedup 2.48
andy 8 : smap-ms 4.2, pmap-ms 1.2, speedup 3.44
andy 16 : smap-ms 4.4, pmap-ms 1.0, speedup 4.52
andy 32 : smap-ms 4.0, pmap-ms 1.6, speedup 2.55
I’m running out of time for breakfast experiments, but it seems
relatively likely to me that the increased at-once sequence size in your
benchmark is increasing the number of objects making it out of the young
generation. This in turn is increasing the number of pause-the-world
GCs, which increase even further in frequency at lower heap sizes. I’ll
run these again later with GC logging and report if the results are
unexpected.
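The GC logging run would look something like this (a hypothetical
invocation; the classpath and script name are placeholders, and the
flags are the standard HotSpot GC logging options):

```shell
# Enable GC logging to count stop-the-world collections at a given
# heap size; compare -Xmx500m against -Xmx4g.
java -Xmx4g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -cp clojure.jar clojure.main bench.clj
```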
-Marshall