your computation is symmetric so you only need to compute the upper or lower triangle which will save both memory and time.
On Tue, Oct 8, 2013 at 10:06 AM, Ke Sun <[email protected]> wrote: > Dear all, > > I have written the following function to compute the square distances of a > large > matrix (each sample a row). It compute row by row and print the overall > progress. > The progress output is important and I didn't use matrix multiplication. > > I give as input a 70,000x800 matrix. The output should be a 70,000x70,000 > matrix. The program runs really slow (16 hours for 1/3 progress). And it > eats > 36G memory (fortunately I have enough). > > Could you give some insights on how to modify the code to be efficient and > to eat less memory? > > thanks, > Ke Sun > > def dist2_large( data ): > import time > if data.ndim != 2: raise RuntimeError( "data should be a matrix" ) > N,D = data.shape > > print 'using the sample-wise implementation' > print '%d samples, %d dimensions' % (N,D) > > start_t = time.time() > d2 = np.zeros( [N,N] ) > for i in range( N ): > print "\r%5d/%d" % (i+1, N), > for j in range( N ): > d2[i,j] = ((data[i] - data[j])**2).sum() > > total_t = time.time() - start_t > hours = (total_t / 3600) > minutes = (total_t % 3600) / 60 > print "\nfinished in %2dh%2dm" % (hours, minutes) > > return d2 > > _______________________________________________ > NumPy-Discussion mailing list > [email protected] > http://mail.scipy.org/mailman/listinfo/numpy-discussion >
_______________________________________________ NumPy-Discussion mailing list [email protected] http://mail.scipy.org/mailman/listinfo/numpy-discussion
