Thanks very much - I followed your advice and have tried a variety of 
permutations (using ACML and LAPACK). For the most part I'm still 
'playing' with multiple threads, but given the performance I'm getting 
(quad Opteron 880, 16 GB RAM, 64-bit FC5), I'll stick with that for now 
(though based on your examples, a single-thread build is worth trying for 
comparison - the svd test is pretty compelling). Here are some 'average 
values' from my machine for the benchmarks you posted:

ACML3.5.0 - multi-threaded (compiled with gcc 4.0.1 and gfortran):

system.time(for(i in 1:25) X%*%X)
 11.750   0.335   3.900   0.000   0.000

system.time(for(i in 1:25) solve(X))
 22.410   2.621  13.481   0.000   0.000

system.time(for(i in 1:10) svd(X))
 67.384   4.280  38.585   0.000   0.000
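
Each figure above is the mean over several repetitions of your script; 
roughly how I took them (the count of 5 runs is my own arbitrary choice):

```r
# Sketch of how the 'average values' were computed; 5 repetitions is arbitrary
set.seed(123)
X <- matrix(rnorm(1e6), 1000)
runs <- replicate(5, system.time(for (i in 1:25) X %*% X))
rowMeans(runs)   # average user/system/elapsed (and child-process) times
```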


Needless to say, on a system of this class most things run pretty fast - 
except the svd benchmark, which lags, consistent with what you showed in 
your results. What is somewhat intriguing is why the svd example varies 
so much between, say, the internal BLAS (~165 s) and Goto BLAS (~43 s) 
for a single-thread compilation.

But, it does look as if ACML is holding its own.

Cheers...
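
For anyone else following along: the 'pluggable BLAS' swap described in 
the quoted message below boils down to replacing R's BLAS shared library 
with a symlink. A rough sketch - the paths here are illustrative 
assumptions, not canonical locations:

```shell
# Illustrative paths only - adjust to your R-devel and ACML install locations
R_LIB=/usr/local/lib64/R/lib        # hypothetical R installation lib dir
ACML_LIB=/opt/acml3.5.0/gnu64/lib   # hypothetical ACML 3.5.0 install

mv "$R_LIB/libRblas.so" "$R_LIB/libRblas.so.reference"  # keep R's own BLAS
ln -s "$ACML_LIB/libacml.so" "$R_LIB/libRblas.so"       # R now loads ACML
```

Switching back is just a matter of restoring the original libRblas.so.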

> The R-devel version of R provides a pluggable BLAS, which makes such tests 
> fairly easy (although building the BLAS themselves is not).  On dual 
> Opterons, using multiple threads is often not worthwhile and can be 
> counter-productive (Doug Bates has found some dramatic examples, and you 
> can see them in my timings below).
>
> So timings for FC3, gcc 3.4.6, dual Opteron 252, 64-bit build of R. ACML 
> 3.5.0 is by far the easiest to install (on R-devel all you need to do is 
> to link libacml.so to lib/libRblas.so) and pretty competitive, so that is 
> what I normally use.
>
> These timings are repeatable only to within a few percent, even after 
> averaging quite a few runs.
>
> set.seed(123)
> X <- matrix(rnorm(1e6), 1000)
> system.time(for(i in 1:25) X%*%X)
> system.time(for(i in 1:25) solve(X))
> system.time(for(i in 1:10) svd(X))
>
> internal BLAS (-O3)
>> system.time(for(i in 1:25) X%*%X)
> [1] 96.939  0.341 97.375  0.000  0.000
>> system.time(for(i in 1:25) solve(X))
> [1] 110.316   1.652 112.006   0.000   0.000
>> system.time(for(i in 1:10) svd(X))
> [1] 165.550   1.131 166.806   0.000   0.000
>
> Goto 1.03, 1 thread
>> system.time(for(i in 1:25) X%*%X)
> [1] 12.949  0.191 13.143  0.000  0.000
>> system.time(for(i in 1:25) solve(X))
> [1] 23.201  1.449 24.652  0.000  0.000
>> system.time(for(i in 1:10) svd(X))
> [1] 43.318  1.016 44.361  0.000  0.000
>
> Goto 1.03, dual CPU
>> system.time(for(i in 1:25) X%*%X)
> [1] 15.038  0.244  8.488  0.000  0.000
>> system.time(for(i in 1:25) solve(X))
> [1] 26.569  2.239 19.814  0.000  0.000
>> system.time(for(i in 1:10) svd(X))
> [1] 59.912  1.799 50.350  0.000  0.000
>
> ACML 3.5.0 (single-threaded)
>> system.time(for(i in 1:25) X%*%X)
> [1] 13.794  0.368 14.164  0.000  0.000
>> system.time(for(i in 1:25) solve(X))
> [1] 22.990  1.695 24.710  0.000  0.000
>> system.time(for(i in 1:10) svd(X))
> [1] 48.267  1.373 49.662  0.000  0.000
>
> ATLAS 3.6.0, single-threaded
>> system.time(for(i in 1:25) X%*%X)
> [1] 16.164  0.404 16.572  0.000  0.000
>> system.time(for(i in 1:25) solve(X))
> [1] 26.200  1.704 27.907  0.000  0.000
>> system.time(for(i in 1:10) svd(X))
> [1] 50.150  1.462 51.619  0.000  0.000
>
> ATLAS 3.6.0, multi-threaded
>> system.time(for(i in 1:25) X%*%X)
> [1] 17.657  0.468  9.775  0.000  0.000
>> system.time(for(i in 1:25) solve(X))
> [1] 38.388  2.353 30.141  0.000  0.000
>> system.time(for(i in 1:10) svd(X))
> [1] 95.611  3.039 88.917  0.000  0.000
>
>
> On Sun, 23 Jul 2006, Evan Cooch wrote:
>
>> Greetings -
>>
>> A quick perusal of some of the posts to this maillist suggest the level 
>> of the questions is probably beyond someone working at my level, but at 
>> the risk of looking foolish publicly (something I find I get 
>> increasingly comfortable with as I get older), here goes:
>>
>> My research group recently purchased a multi-Opteron system (bunch of 
>> 880 chips), running 64-bit RHEL 4 (which we have site licensed at our 
>> university, so it cost us nothing - good price) with SMP support built 
>> into the kernel (perhaps obviously, for a multi-pro system). Several of 
>> our users use [R], which I've only used on a few occasions. However, it 
>> is part of my task to get [R] installed for folks using this system.
>>
>> While the simple, basic compile sequence (./configure, make, make check, 
>> make install) went smoothly, it's pretty clear from our benchmarks that 
>> the [R] code isn't running as 'rocket-fast' as it should for a system 
>> like this. So, I dig a bit deeper. Most of the jobs we want to run could 
>> benefit from BLAS support (lots of array manipulations and other bits of 
>> linear algebra), and a few other compilation optimizations - and here is 
>> where I seek advice.
>>
>> 1) Looks like there are 3-4 flavours: LAPACK, ATLAS, ACML 
>> (AMD-chips...), and Goto. In reading what I can find, it seems that 
>> there are reasons not to use ACML (single-thread) despite the AMD chips, 
>> reasons to avoid ATLAS (some hassles compiling on RHEL 4 boxes), reasons 
>> to avoid LAPACK (ibid), but apparently no problems with Goto BLAS.
>>
>> Is that a reasonable summary? At the risk of starting a larger 
>> discussion, I'm simply looking to get BLAS support, yielding the fastest 
>> [R] code with the minimum of hassles (while tweaking lines of configure 
>> files, weird linker sequences and all that used to appeal when I was a 
>> student, I don't have time at this stage). So, any quick recommendation 
>> for *which* BLAS library? My quick assessment suggests goto BLAS, but 
>> I'm hoping for some confirmation.
>>
>> 2) Compilation of BLAS - I can compile for 32-bit or 64-bit. 
>> Presumably, given we've invested in 64-bit chips, and a 64-bit OS, we'd 
>> like to consider a 64-bit compilation. Which, also presumably, means 
>> we'd need 64-bit compilation for [R]. While I've read the short blurb on 
>> CRAN concerning 64-bit vs 32-bit compilations (data size vs speed), I'd 
>> be happy to have both on our machine. But, I'm not sure how one 
>> specifies 64-bit in the [R] compilation - what flags do I need to set 
>> during ./configure, or what config file do I need to edit?
>>
>> Thanks very much in advance - and, again, apologies for the 'low-level' 
>> of these questions, but one needs to start somewhere.
>>
>> ______________________________________________
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>>


