Hi Stefan,
You are right about the possible loss of accuracy in computing the Euclidean 
distance the way I did. In some cases the quantity whose square root is taken 
can even come out negative (so I am clamping all negative values to 0). Anyone 
using this shortcut must first check that it is acceptable for their data.

I tried wordspace.openmp with 8 threads and it reduces the time to just over 
2.5 minutes, which is more than enough for me.

I am not sure whether you have any chance of beating the speed of (t)crossprod, 
since it may be using a (complexity-wise) faster algorithm for matrix 
multiplication (perhaps FFT-based; I am not sure).
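
For concreteness, the clamped variant I ended up with looks roughly like this 
(a sketch; it assumes m is a dense base-R matrix and uses pmax() to zero out 
the small negative values before taking the square root):

    # Clamped Euclidean distance via tcrossprod (sketch; assumes dense m)
    euc_dist_clamped <- function(m) {
      mtm <- Matrix::tcrossprod(m)         # all pairwise dot products m %*% t(m)
      sq  <- rowSums(m * m)                # squared Euclidean norms of the rows
      d2  <- outer(sq, sq, "+") - 2 * mtm  # squared distances; may dip below 0
      sqrt(pmax(d2, 0))                    # clamp rounding noise to 0, then sqrt
    }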
Once again, thank you very much for your comments and help.



From: Stefan Evert <stefa...@collocations.de>
To: Moshe Olshansky <m_olshan...@yahoo.com>
Cc: R-devel Mailing List <r-devel@r-project.org>
Sent: Monday, 19 June 2017, 2:23
Subject: Re: [Rd] dist function in R is very slow

> By the way, since the tcrossprod function in the Matrix package is so fast, 
> the Euclidean distance can be computed very fast:

Indeed.

> euc_dist <- function(m) {mtm <- Matrix::tcrossprod(m); sq <- rowSums(m*m);  
> sqrt(outer(sq,sq,"+") - 2*mtm)}

There are two reasons why I didn't use this optimization in "wordspace":

1) It can be inaccurate for small distances between vectors of large Euclidean 
length, because of loss of significance in the subtraction step.  This is not 
just a theoretical concern – I've seen data sets where this became a real 
problem.  (A toy example is sketched below, after point 2.)

2) It incurs substantial memory overhead for a large distance matrix. Your code 
allocates at least five matrices of this size: outer(…), mtm, 2 * mtm, outer(…) 
- 2*mtm, and the final result obtained by taking the square root.  [Actually, 
there is additional overhead for m*m (an even larger matrix) when computing the 
Euclidean norms, but this could be avoided with sq <- rowNorms(m, 
method="euclidean").]
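
To make point 1 concrete, here is a toy example (my own, not from the earlier 
messages): two long vectors at a tiny true distance.  The direct formula keeps 
full precision, while the dot-product shortcut cancels almost all significant 
digits:

    x <- rep(1e8, 10)                       # vectors of large Euclidean length
    y <- x + 1e-4                           # ... at a tiny true distance apart
    sqrt(sum((x - y)^2))                    # direct formula: ~3.162278e-04
    sqrt(sum(x*x) + sum(y*y) - 2*sum(x*y))  # shortcut: rounding noise, can be NaN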

I am usually more concerned about RAM than raw processing speed, so the package 
was designed to keep memory overhead as low as possible and allow users to work 
with realistic data sets on ordinary laptop computers.
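
In that spirit, one way to cap the transient allocations of the shortcut is to 
fill the distance matrix in row blocks, so that only a block_size x n slab of 
temporaries is live at a time.  A rough sketch (my own, not the wordspace 
implementation; block_size is a hypothetical tuning parameter):

    euc_dist_blocked <- function(m, block_size = 500) {
      n   <- nrow(m)
      sq  <- rowSums(m * m)              # squared row norms, computed once
      res <- matrix(0, n, n)             # final result, allocated up front
      for (start in seq(1, n, by = block_size)) {
        idx <- start:min(start + block_size - 1, n)
        d2  <- outer(sq[idx], sq, "+") -
               2 * tcrossprod(m[idx, , drop = FALSE], m)  # block of sq. distances
        res[idx, ] <- sqrt(pmax(d2, 0))  # clamp rounding noise before sqrt
      }
      res
    }

The peak overhead then scales with block_size * n rather than n^2, at the cost 
of a short R-level loop.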


> It takes less than 50 seconds for my (dense) matrix of 5,054 rows and 12,803 
> columns, while dist.matrix with method="euclidean" takes almost 10 minutes 
> (which is still orders of magnitude faster than dist).

It's a little disappointing that dist.matrix() is still relatively slow despite 
all the simplifications and better cache locality (the function automatically 
transposes the input matrix and computes distances by columns rather than 
rows).  I'm a little surprised by your timing, though.  Testing with a random 
5000 x 20000 matrix, my MacBook computes the full Euclidean distance matrix in 
about 5 minutes.

If your machine (and version of R) supports OpenMP, you can improve performance 
by allowing multithreading with wordspace.openmp(threads=<n>).  In my test 
case, I get a 2.2x speed-up with 4 threads (2m 15s instead of 5m).
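
For completeness, the corresponding calls look like this (assuming your 
wordspace build supports OpenMP and M is your dense input matrix):

    library(wordspace)
    wordspace.openmp(threads = 4)              # enable multithreading
    D <- dist.matrix(M, method = "euclidean")  # full Euclidean distance matrix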


Best wishes,
Stefan

   