OK, have a few updates to report:
- Oracle vs OpenJDK did not make a difference.
- Whenever I run N>1 threads calling any of these functions with
swap!/vswap!, there is some overhead compared to running 18 separate
single-run processes in parallel. This overhead seems to increase as N
increases.
- For both swap! and vswap!, the function timings from running 18 futures
(from one JVM) show about 1.5X the time of running 18 separate JVM
processes.
- For the swap! version (f2), very often a few of the calls would go
rogue and take around 3X the time of the others.
  - this did not happen for the vswap! version of f2.
- Running 9 processes with 2 f2-calling threads each was maybe 4%
slower than 18 processes of 1 thread each.
- Running 4 processes with 4 f2-calling threads each was mostly the same
speed as the 18x1 case, but a couple of those rogue threads took
2-3X the time of the others.
Any ideas?
On Thursday, November 19, 2015 at 1:08:14 AM UTC+9, David Iba wrote:
>
> No worries. Thanks, I'll give that a try as well!
>
> On Thursday, November 19, 2015 at 1:04:04 AM UTC+9, tbc++ wrote:
>>
>> Oh, then I completely misunderstood the problem at hand here. If that's
>> the case, then do the following:
>>
>> Change "atom" to "volatile!" and "swap!" to "vswap!". See if that changes
>> anything.
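>> Roughly along these lines (an untested sketch, assuming the f2 from your
>> original post; volatile!/vswap! need Clojure 1.7+):
>>
>> (defn f2-volatile []
>>   (let [x* (volatile! {})]
>>     (loop [i 1e9]
>>       (when-not (zero? i)
>>         (vswap! x* assoc :k i)
>>         (recur (dec i))))))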
>>
>> Timothy
>>
>>
>> On Wed, Nov 18, 2015 at 9:00 AM, David Iba <[email protected]> wrote:
>>
>>> Timothy: Each thread (call of f2) creates its own "local" atom, so I
>>> don't think there should be any swap retries.
>>>
>>> Gianluca: Good idea! I've only tried OpenJDK, but I will look into
>>> trying Oracle and report back.
>>>
>>> Andy: jvisualvm was showing pretty much all of the memory allocated in
>>> the eden space and a little in the first survivor (no major/full GC's), and
>>> total GC Time was very minimal.
>>>
>>> I'm in the middle of running some more tests and will report back when I
>>> get a chance today or tomorrow. Thanks for all the feedback on this!
>>>
>>> On Thursday, November 19, 2015 at 12:38:55 AM UTC+9, tbc++ wrote:
>>>>
>>>> This sort of code is somewhat of a worst-case situation for atoms (or
>>>> really for CAS). Clojure's swap! is based on the "compare-and-swap" (CAS)
>>>> operation that most x86 CPUs have as an instruction. If we expand swap!, it
>>>> looks something like this:
>>>>
>>>> (loop [old-val @x*]
>>>>   (let [new-val (assoc old-val :k i)]
>>>>     (if (compare-and-swap x* old-val new-val)
>>>>       new-val
>>>>       (recur @x*))))
>>>>
>>>> Compare-and-swap can be defined as "update the contents of the
>>>> reference to new-val only if the current value of the reference is equal to
>>>> old-val".
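>>>> As a rough illustration of that contract, here is a small sketch using a
>>>> plain AtomicReference (just to show the semantics; not a claim about how
>>>> Clojure's Atom is implemented):
>>>>
>>>> (import 'java.util.concurrent.atomic.AtomicReference)
>>>>
>>>> (let [old {}
>>>>       r   (AtomicReference. old)]
>>>>   [(.compareAndSet r old {:k 1})    ; => true,  current value is still old
>>>>    (.compareAndSet r old {:k 2})])  ; => false, old no longer matches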
>>>>
>>>> So in essence, only one core can be modifying the contents of an atom
>>>> at a time. If the atom is modified during the execution of the swap! call,
>>>> then swap! will continue to re-run your function until it is able to update
>>>> the atom without it being modified during the function's execution.
>>>>
>>>> So let's say you have some super long task that you need to integrate
>>>> into a ref; here's one way to do it, but probably not the best:
>>>>
>>>> (let [a (atom 0)]
>>>>   (dotimes [x 18]
>>>>     (future
>>>>       (swap! a long-operation-on-score some-param))))
>>>>
>>>>
>>>> In this case, long-operation-on-score will need to be re-run every time
>>>> another thread modifies the atom. However, if our function only needs the
>>>> state of the ref in order to add to it, then we can do something like this
>>>> instead:
>>>>
>>>> (let [a (atom 0)]
>>>>   (dotimes [x 18]
>>>>     (future
>>>>       (let [score (long-operation-on-score some-param)]
>>>>         (swap! a + score)))))
>>>>
>>>> Now we only have a simple addition inside the swap!, and we will have
>>>> less contention between the CPUs because they will most likely be spending
>>>> more time inside 'long-operation-on-score' than inside the swap.
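>>>> If you then want to wait for all of the workers and read the final total,
>>>> something along these lines would do it (a sketch; long-operation-on-score
>>>> and some-param are the same stand-ins as above):
>>>>
>>>> (let [a    (atom 0)
>>>>       futs (doall (for [x (range 18)]
>>>>                     (future
>>>>                       (swap! a + (long-operation-on-score some-param)))))]
>>>>   (run! deref futs)  ; block until every worker has finished
>>>>   @a)                ; the accumulated total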
>>>>
>>>> *TL;DR*: do as little work as possible inside swap!; the more you have
>>>> inside swap!, the higher the chance you will have of throwing away work
>>>> due to swap! retries.
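>>>> One way to actually see those retries is to count how many times the update
>>>> function runs versus how many updates you asked for (a rough sketch with
>>>> made-up names; the counter deliberately lives outside the swapped-in fn):
>>>>
>>>> (let [a     (atom 0)
>>>>       calls (java.util.concurrent.atomic.AtomicLong.)
>>>>       futs  (doall (for [_ (range 18)]
>>>>                      (future
>>>>                        (swap! a (fn [v]
>>>>                                   (.incrementAndGet calls) ; counts every attempt
>>>>                                   (inc v))))))]
>>>>   (run! deref futs)
>>>>   ;; :attempts > 18 means swap! had to retry under contention
>>>>   {:final @a :attempts (.get calls)})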
>>>>
>>>> Timothy
>>>>
>>>> On Wed, Nov 18, 2015 at 8:13 AM, gianluca torta <[email protected]>
>>>> wrote:
>>>>
>>>>> By the way, have you tried both Oracle JDK and OpenJDK, with the same
>>>>> results?
>>>>> Gianluca
>>>>>
>>>>> On Tuesday, November 17, 2015 at 8:28:49 PM UTC+1, Andy Fingerhut
>>>>> wrote:
>>>>>>
>>>>>> David, you say "Based on jvisualvm monitoring, doesn't seem to be
>>>>>> GC-related".
>>>>>>
>>>>>> What is jvisualvm showing you related to GC and/or memory allocation
>>>>>> when you tried the 18-core version with 18 threads in the same process?
>>>>>>
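>>>>>> (If it helps to cross-check what jvisualvm reports, here is a quick
>>>>>> in-process sketch that reads total GC activity from the standard GC
>>>>>> MXBeans; gc-stats is just an illustrative name:)
>>>>>>
>>>>>> (import 'java.lang.management.ManagementFactory)
>>>>>>
>>>>>> (defn gc-stats []
>>>>>>   (for [gc (ManagementFactory/getGarbageCollectorMXBeans)]
>>>>>>     {:name  (.getName gc)
>>>>>>      :count (.getCollectionCount gc)
>>>>>>      :ms    (.getCollectionTime gc)}))
>>>>>>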
>>>>>> Even memory allocation could become a point of contention, depending
>>>>>> upon how the memory allocator handles many threads. E.g., it depends on
>>>>>> whether a thread takes a global lock once to get a large chunk of memory,
>>>>>> and then locally carves it up into the small pieces it needs for each
>>>>>> individual Java 'new' allocation, or takes a global lock for every 'new'.
>>>>>> The latter would give terrible performance as the number of cores
>>>>>> increases, but I don't know how to tell whether that is the case, except
>>>>>> by knowing more about how the memory allocator is implemented in your JVM.
>>>>>> Maybe digging through the OpenJDK source code in the right place would
>>>>>> tell?
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On Tue, Nov 17, 2015 at 2:00 AM, David Iba <[email protected]> wrote:
>>>>>>
>>>>>>> correction: that "do" should be a "doall". (My actual test code was
>>>>>>> a bit different, but each run printed some info when it started, so it
>>>>>>> doesn't have to do with delayed evaluation of lazy seqs or anything.)
>>>>>>>
>>>>>>>
>>>>>>> On Tuesday, November 17, 2015 at 6:49:16 PM UTC+9, David Iba wrote:
>>>>>>>>
>>>>>>>> Andy: Interesting. Thanks for educating me on the fact that atom
>>>>>>>> swaps don't use the STM. Your theory seems plausible... I will try those
>>>>>>>> tests next time I launch the 18-core instance, but yeah, I'm not sure
>>>>>>>> how illuminating the results will be.
>>>>>>>>
>>>>>>>> Niels: along the lines of this (so that each thread prints its time
>>>>>>>> as well as printing the overall time):
>>>>>>>>
>>>>>>>> (time
>>>>>>>>   (let [f f1
>>>>>>>>         n-runs 18
>>>>>>>>         futs (do (for [i (range n-runs)]
>>>>>>>>                    (future (time (f)))))]
>>>>>>>>     (doseq [fut futs]
>>>>>>>>       @fut)))
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tuesday, November 17, 2015 at 5:33:01 PM UTC+9, Niels van
>>>>>>>> Klaveren wrote:
>>>>>>>>>
>>>>>>>>> Could you also show how you are running these functions in
>>>>>>>>> parallel and timing them? The way you start the functions can have as
>>>>>>>>> much impact as the functions themselves.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Niels
>>>>>>>>>
>>>>>>>>> On Tuesday, November 17, 2015 at 6:38:39 AM UTC+1, David Iba wrote:
>>>>>>>>>>
>>>>>>>>>> I have functions f1 and f2 below, and let's say they run in T1
>>>>>>>>>> and T2 amount of time when running a single instance/thread. The issue
>>>>>>>>>> I'm facing is that parallelizing f2 across 18 cores takes anywhere from
>>>>>>>>>> 2-5X T2, and for more complex functions it takes absurdly long.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> (defn f1 []
>>>>>>>>>>   (apply + (range 2e9)))
>>>>>>>>>>
>>>>>>>>>> ;; Note: each call to (f2) makes its own x* atom, so the 'swap!'
>>>>>>>>>> ;; should never retry.
>>>>>>>>>> (defn f2 []
>>>>>>>>>>   (let [x* (atom {})]
>>>>>>>>>>     (loop [i 1e9]
>>>>>>>>>>       (when-not (zero? i)
>>>>>>>>>>         (swap! x* assoc :k i)
>>>>>>>>>>         (recur (dec i))))))
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Of note:
>>>>>>>>>> - On a 4-core machine, both f1 and f2 parallelize well (roughly
>>>>>>>>>> T1 and T2 for 4 runs in parallel)
>>>>>>>>>> - running 18 f1's in parallel on the 18-core machine also
>>>>>>>>>> parallelizes well.
>>>>>>>>>> - Disabling hyperthreading doesn't help.
>>>>>>>>>> - Based on jvisualvm monitoring, doesn't seem to be GC-related
>>>>>>>>>> - also tried on dedicated 18-core ec2 instance with same issues,
>>>>>>>>>> so not shared-tenancy-related
>>>>>>>>>> - if I make a jar that runs a single f2 and launch 18 in
>>>>>>>>>> parallel, it parallelizes well (so I don't think it's
>>>>>>>>>> machine/aws-related)
>>>>>>>>>>
>>>>>>>>>> Could it be that running 18 f2's in parallel on a single JVM instance
>>>>>>>>>> is overworking the STM with all the swap!'s? Any other theories?
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> “One of the main causes of the fall of the Roman Empire was
>>>> that–lacking zero–they had no way to indicate successful termination of
>>>> their C programs.”
>>>> (Robert Firth)
>>>>
>>
>>
>>
>> --
>> “One of the main causes of the fall of the Roman Empire was that–lacking
>> zero–they had no way to indicate successful termination of their C
>> programs.”
>> (Robert Firth)
>>
>