Hi Jose, >>>>> "JQ" == Jose Quesada <ques...@gmail.com> >>>>> on Sat, 24 Oct 2009 23:49:08 +0200 writes:
JQ> -----BEGIN PGP SIGNED MESSAGE----- JQ> Hash: SHA1 JQ> Hi, JQ> I'm doing some basic operations on large sparse matrices, for example JQ> getting a row. JQ> it takes close to 30 seconds on a 3Ghz machine, and shots the memory JQ> usage up to the sky. Hmm, I can hardly beleave the memory shooting, are you sure about that ?? .. see below JQ> I suspect there are dense intermediary steps (which, if true would JQ> defeat the purpose of trying to use sparse representaitons). JQ> As much as I try understanding the hierarchy of Matrix classes, it's a JQ> mystery to me. JQ> Is subsetting sparse matrices memory-intensive? Does it have to do with JQ> features of the language, such as pass-by-value bu default? JQ> Or am I doing something inneficent without knowing? JQ> (Note: example below would only work with 64bit R and lots of memory; JQ> reduce size of matrix 2-3 orders of magnitude for 32-bit R) [... rSpMatrix() from help(spMatrix) .. ] JQ> ## (original ir a term x doc matrix of the entire wikipedia) JQ> mm <- rSpMatrix(793251, 1027355, nnz = 205746204) JQ> # make it column based: JQ> mm <- as(mm, "CsparseMatrix") { Note that the above Tsparse -> Csparse is not really beneficial, if your main issue is things like subsetting and sub-assignment: these are implemented to go via TsparseMatrix operations. However, this is really a side-issue } JQ> a=mm[1,,drop=F]#this takes close to 30 seconds on a 3Ghz machine Well, you have nnz := 205 million non-zero entries in your matrix. This is quite substantial, also with lots of RAM. The subsetting indeed uses several dense vectors of size 'nnz' and it's just these operations (vector operations only) that use the time (and memory) you see. Most of these vectors are logical or integers --> ~ 800 MB each, but of course the x slot of M is 8 * 205 = 1.6 GB alone. So even thought you have a sparse matrix, it *is* large in itself, and dealing with it and its components is not inexpensive. Note that in R m[i,,] needs to work for very general i, e.g., for i <- c(1,1,1,2:7, 10000:10100) or i a logical vector of length nrow(m), or even a character vector of rownames ... all these work for 'Matrix' as for traditional R matrices, so, the function that does the subsetting needs to be written generally enough. But of course, we are happy for improvement patches to the Matrix package, and this part of the code could probably be sped up by a factor of 10 or 20, e.g., by moving that part of the code to C; it's currently the code starting at around line 194 in Matrix/R/Tsparse.R, setMethod("[", signature(x = "TsparseMatrix", i = "index", j = "missing", drop = "logical"), function (x, i, j, ..., drop) { ## select rows ............... ............... ............... }) which you seem to offer to improve ;-) {{ A less nice alternative to moving it to C, might be to special case M[i,] , M[i,j] etc for the special case where i (and j) is just of length 1. }} Thank you for the reproducible code, and thanks in advance for your improvement patches or other propositions... Best regards, Martin Maechler, ETH Zurich ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.