On Tue, 16 Dec 2008, Nathan S. Watson-Haigh wrote:


Charles C. Berry wrote:
On Mon, 15 Dec 2008, Nathan S. Watson-Haigh wrote:


Nathan S. Watson-Haigh wrote:
I'm trying to calculate Pearson correlation coefficients for a large
matrix of size 18563 x 18563. The following function takes about XX
minutes to complete, and I'd like to do this calculation about 15 times,
so speed is somewhat of an issue.

I think you are on the wrong track, Nathan.

The matrix you are starting with is 18563 x 18563 and the result of
finding the correlations amongst the columns of that matrix is also 18563
x 18563. It will require more than 5 Gigabytes of memory to store the
result and the original matrix.

Yes, the memory usage is somewhat large - luckily I have the use of a
cluster with lots of shared memory! However, I'm interested to learn how
you arrived at the calculation for the memory requirements.


The original object holds 18563^2 doubles at 8 bytes each, i.e.

> 18563^2 * 8 / 1024^3
[1] 2.567358

gigabytes, and so does the result. I added them together.



Likely the time needed to do the calculation is inflated by caching
issues and, if your machine has less than enough memory to hold the
result and all the intermediate pieces, by swapping as well.

You can finesse these by breaking your problem into smaller pieces, say
computing the correlations between each pair of 19 blocks of columns
(columns 1:977, (977+1):(2*977), ..., (18*977+1):(19*977)), then assembling the
results. A rough sketch follows.
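
Something like this, as a minimal sketch only (the object name x, the 19
blocks, and the 977-column block size are illustrative - adapt them to your
data):

## x is the 18563 x 18563 data matrix; 19 blocks of 977 columns each
nblk <- 19
blks <- split(seq_len(ncol(x)), rep(seq_len(nblk), each = ncol(x) / nblk))
res  <- matrix(NA_real_, ncol(x), ncol(x))
for (i in seq_len(nblk)) {
  for (j in i:nblk) {
    cij <- cor(x[, blks[[i]]], x[, blks[[j]]])   # one block of the correlation matrix
    res[blks[[i]], blks[[j]]] <- cij
    res[blks[[j]], blks[[i]]] <- t(cij)          # fill the symmetric block
  }
}

Each cor() call then only touches a pair of 18563 x 977 slices, so the
intermediate objects stay small even though the final result is still 2.5+ GB.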

This is possible; however, why is something like this not implemented
internally in the cor() function if it scales poorly due to the large
memory requirements?

Because nobody ever really needed it?

Seriously, optimizing something like this is machine dependent, and R-core probably has higher priorities.

cor() provides lots of options - it handles NAs, for example - and it is probably not worth the trouble to try to optimize over all of those options. The calculation sans NAs is a simple one and can be done using the built-in BLAS (as crossprod() does), and that BLAS can in turn be tuned to the machine used. So, if your environment has a tuned or multithreaded BLAS, you might be better off using crossprod() and scaling the result.
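
For instance, something along these lines (again, x stands in for your data
matrix, and this only applies when there are no NAs):

n  <- nrow(x)
xs <- scale(x)                  # center each column and divide by its sd
r  <- crossprod(xs) / (n - 1)   # t(xs) %*% xs via the BLAS; equals cor(x) up to rounding

With a tuned or multithreaded BLAS, the crossprod() step on a matrix this size
can be considerably faster than a plain cor() call.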



---

BTW, R already has the necessary machinery to calculate the crossproduct
matrix (etc) needed to find the correlations. You can access the low level
linear algebra that R uses. You can marry R to an optimized BLAS if you
like.

So pulling in some other code to do this will not save you anything. If
you ever do decide to incorporate C[++] code, there is excellent documentation
in the Writing R Extensions manual, which you should review before
attempting to import C++ code into R.

Thanks, I have seen this, but it seemed quite technical as a starting
point for someone unfamiliar with both C++ and with incorporating C++
code into R.


Well, in that case the path of least resistance is to start the process when you leave for the night and pick up the results the next morning.


HTH,

Chuck

Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cbe...@tajo.ucsd.edu               UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901

