Utkarsh,

Thanks again for the feedback and suggestions on bigmemory.
A follow-up on counting NAs: we have exposed a new function colna() to
the user in the upcoming release 3.7. Of course mwhich() can still be
helpful.

As for the last topic -- applying any function to the columns of a
big.matrix object: once you peel away the shell, a big.matrix column is
identical to an R matrix column (or vector) -- a pointer, a length, and
knowledge of the type are sufficient. Because we (ideally) want to
support our current four types (and hopefully add complex, and maybe
more, soon), we rely on C++ template functions for the summaries we
have implemented to date. But yes, looking at our implementation of
colmean(), for example, would be a good place to start.

Keep in mind that there are differences between big.matrix objects and
R internals. bigmemory indexes everything using longs instead of
integers (and uses numerics when passing indices between R and C/C++).
So simply using an existing R function (or the C function under the
hood of R) would be limiting, not only with respect to the various
types of big.matrix objects but also with respect to size. On 64-bit R
platforms there is no practical limit to the size of a filebacked
big.matrix (other than your disk space or filesystem limitations), but
R won't handle vectors in excess of 2 billion elements, even if you
have the RAM to support such beasts. Operating on chunks within R is of
course another possibility.

Further discussion of development ideas would be great, but should
probably be moved offline or over to R-devel. As always, we appreciate
feedback, complaints, bug reports, etc...

Thanks,

Jay

On Wed, Jun 3, 2009 at 3:16 AM, utkarshsinghal
<utkarsh.sing...@global-analytics.com> wrote:
> Thanks for the really valuable inputs, developing the package and
> updating it regularly. I will be glad if I can contribute in any way.
>
> In problem three, however, I am interested in knowing a generic way
> to apply any function on columns of a big.matrix object (obviously
> without loading the data into R). Maybe the source code of the
> function "colmean" can help, if that is not too much to ask for. Or
> if we can develop a function similar to "apply" of the base R.
>
> Regards
> Utkarsh
>
> Jay Emerson wrote:
>>
>> We also have ColCountNA(), which is not currently exposed to the
>> user but will be in the next version.
>>
>> Jay
>>
>> On Tue, Jun 2, 2009 at 2:08 PM, Jay Emerson <jayemer...@gmail.com> wrote:
>>>
>>> Thanks for trying this out.
>>>
>>> Problem 1. We'll check this. Options should certainly be
>>> available. Thanks!
>>>
>>> Problem 2. Fascinating. We just (yesterday) implemented a
>>> sub.big.matrix() function doing exactly this, creating something
>>> that is a big.matrix but which just references a contiguous subset
>>> of the original matrix. This will be available in an upcoming
>>> version (hopefully in the next week). A more specialized function
>>> would create an entirely new big.matrix from a subset of a first
>>> big.matrix, making an actual copy, but this is something else
>>> altogether. You could do this entirely within R without much work,
>>> by the way, and with only 2x memory overhead.
>>>
>>> Problem 3. You can count missing values using mwhich(). For other
>>> exploration (e.g. skewness), at the moment you should just extract
>>> a single column (variable) at a time into R, study it, then get the
>>> next column, etc... We will not be implementing all of R's
>>> functions directly with big.matrix objects. We will be creating a
>>> new package "bigmemoryAnalytics" and would welcome contributions
>>> to the package.
>>>
>>> Feel free to email us directly with bugs, questions, etc...
>>>
>>> Cheers,
>>>
>>> Jay
>>>
>>> ----------------------------------------------------------
>>>
>>> From: utkarshsinghal <utkarsh.sing...@global-analytics.com>
>>> Date: Tue, Jun 2, 2009 at 8:25 AM
>>> Subject: [R] bigmemory - extracting submatrix from big.matrix object
>>> To: r help <r-help@r-project.org>
>>>
>>> I am using library(bigmemory) to handle large datasets, say 1 GB,
>>> and am facing the following problems. Any hints from anybody would
>>> be helpful.
>>>
>>> _Problem 1:_
>>>
>>> I am using the "read.big.matrix" function to create a filebacked
>>> big.matrix of my data and get the following warning:
>>>
>>>> x = read.big.matrix("/home/utkarsh.s/data.csv", header = T,
>>>>   type = "double", shared = T, backingfile = "backup",
>>>>   backingpath = "/home/utkarsh.s")
>>>
>>> Warning message:
>>> In filebacked.big.matrix(nrow = numRows, ncol = numCols, type = type, :
>>>   A descriptor file has not been specified. A descriptor named
>>>   backup.desc will be created.
>>>
>>> However, there is no such argument in "read.big.matrix". There is
>>> an argument "descriptorfile" in the function "as.big.matrix", but
>>> if I try to use it in "read.big.matrix" I get an error showing it
>>> as an unused argument (as expected).
>>>
>>> _Problem 2:_
>>>
>>> I want to get a filebacked *sub*matrix of "x", say only selected
>>> columns: x[, 1:100]. Is there any way of doing that without
>>> actually loading the data into R memory?
>>>
>>> _Problem 3:_
>>>
>>> There are functions available like summary, colmean, colsd, ... for
>>> standard summary statistics. But is there any way to calculate
>>> other summaries, say the number of missing values or the skewness
>>> of each variable, without loading the whole data into R memory?
>>>
>>> Regards
>>> Utkarsh
>>>

--
John W. Emerson (Jay)
Assistant Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.