Dear R developers,
I am visualising high-dimensional genomic data, and for this purpose I need to
compute pairwise distances between many points in a high-dimensional space (say
I have a matrix of 5,000 rows and 20,000 columns, so the result is a
5,000 x 5,000 matrix, or its upper triangle).

Computing this in R takes many hours (I am doing it on a Linux server with more
than 100 GB of RAM, so memory is not the problem). When I instead write the
matrix to disk, read it and compute the distances in C, write them to disk and
read them back into R, the whole round trip takes 10-15 minutes (and I did not
spend much time optimising my C code).
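
For concreteness, the computation I am running is essentially the following
(the random matrix below is only a stand-in for my data, and I am assuming
the default Euclidean distance):

x <- matrix(rnorm(5000 * 20000), nrow = 5000, ncol = 20000)  # stand-in for my data
system.time(d <- dist(x))  # this is the call that takes many hours
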
The question is: why is the R function so slow? I understand that it calls C
(or C++) to compute the distances. My suspicion is that the matrix is passed
to C transposed, so each distance is computed between two columns of the
matrix; since C stores matrices by rows, this is very inefficient and causes
many cache misses (my first C implementation was like this, and I had to stop
the run after an hour when it had still not completed).

If my suspicion is correct, would it be possible to rewrite the dist function
so that it works faster on large matrices?
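
For what it is worth, here is a minimal sketch of the kind of reformulation I
had in mind, assuming Euclidean distance and reusing x from above; it keeps
the heavy work inside BLAS, but I have not verified that it is actually
faster at this scale:

## squared distances between rows via ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a.b
sq <- rowSums(x^2)                            # squared norm of each row
d2 <- outer(sq, sq, "+") - 2 * tcrossprod(x)  # 5,000 x 5,000 squared distances
d2[d2 < 0] <- 0                               # clip tiny negatives from rounding
d  <- sqrt(d2)                                # full distance matrix

Whether something along these lines, or simply a row-wise inner loop in C,
could go into dist itself is exactly my question.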
Best regards,

Moshe Olshansky
Monash University