Hi,

On Tue, Oct 8, 2013 at 4:38 AM, Ke Sun <[email protected]> wrote:
> On Tue, Oct 08, 2013 at 01:49:14AM -0700, Matthew Brett wrote:
>> Hi,
>>
>> On Tue, Oct 8, 2013 at 1:06 AM, Ke Sun <[email protected]> wrote:
>> > Dear all,
>> >
>> > I have written the following function to compute the square distances of a 
>> > large
>> > matrix (each sample a row). It compute row by row and print the overall 
>> > progress.
>> > The progress output is important and I didn't use matrix multiplication.
>> >
>> > I give as input a 70,000x800 matrix. The output should be a 70,000x70,000
>> > matrix. The program runs really slow (16 hours for 1/3 progress). And it 
>> > eats
>> > 36G memory (fortunately I have enough).
>>
>> That is very slow.
>>
>> As a matter of interest - why didn't you use matrix multiplication?
> Because it will cost hours and I want to see the progress and
> know how far it goes. Another concern is to save memory and
> compute one sample at a time.
>
>> On a machine I had access to it took about 20 minutes.
> How? I am using matrix multiplication (the same code as
> http://stackoverflow.com/a/4856692) and it runs for around 18 hours.

I wonder if you are running into disk swap - the code there does
involve a large temporary array.

I believe the appended version of the code is correct, and I think it
is also memory efficient.

On a fast machine with lots of memory, it ran in about 5 minutes.
It's using EPD, which might be using multiple cores for the matrix
multiply.

Does the code also work for you in reasonable time?

Another suggestion I saw which only calculates the unique values (say
lower diagonal) is scipy.spatial.distance

http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html#scipy.spatial.distance.pdist

To a first pass that seems to be slower than the matrix multiply.

Cheers,

Matthew

import datetime
import numpy as np

def pdista(X):
    """Squared pairwise distances between all columns of X."""
    B = np.dot(X.T, X)
    q = np.diag(B)[:, None].copy() # copy necessary?
    B *= -2
    B += q
    B += q.T
    return B

M = 70000
N = 800
A = np.random.normal(size=(M, N))

start = datetime.datetime.now()
dists = pdista(A.T)
elapsed = datetime.datetime.now() - start
print(elapsed)
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Reply via email to