> By the way, since the tcrossprod function in the Matrix package is so fast,
> the Euclidean distance can be computed very fast:
Indeed.
> euc_dist <- function(m) {mtm <- Matrix::tcrossprod(m); sq <- rowSums(m*m);
> sqrt(outer(sq,sq,"+") - 2*mtm)}
There are two reasons why I didn't use this optimization in "wordspace":
1) It can be inaccurate for small distances between vectors of large Euclidean
length because of loss of significance in the subtraction step. This is not
just a theoretical concern – I've seen data sets were this became a real
problem.
2) It incurs substantial memory overhead for a large distance matrix. Your code
allocates at least five matrices of this size: outer(…), mtm, 2 * mtm, outer(…)
- 2*mtm, and the final result obtained by taking the square root. [Actually,
there is additional overhead for m*m (an even larger matrix) when computing the
Euclidean norms, but this could be avoided with sq <- rowNorms(m,
method="euclidean").]
I am usually more concerned about RAM than raw processing speed, so the package
was designed to keep memory overhead as low as possible and allow users to work
with realistic data sets on ordinary laptop computers.
> It takes less than 50 seconds for my (dense) matrix of 5,054 rows and 12,803
> columns, while dist.matrix with method="euclidean" takes almost 10 minutes
> (which is still orders of magnitude faster than dist).
It's a little disappointing that dist.matrix() is still relatively slow despite
all simplifications and better cache consistency (the function automatically
transposes the input matrix and computes distances by columns rather than
rows). I'm a little surprised about your timing, though. Testing with a
random 5000 x 20000 matrix, my MacBook computers the full Euclidean distance
matrix in about 5 minutes.
If your machine (and version of R) supports OpenMP, you can improve performance
by allowing multithreading with wordspace.openmp(threads=<n>). In my test
case, I get a 2.2x speed-up with 4 threads (2m 15s instead of 5m).
Best wishes,
Stefan
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel