Hi Gopi

"Gopi Goswami" <[EMAIL PROTECTED]> writes:

> Hi there,
>
> Problem ::
> When one tries to change one or some of the columns of a data.frame, R makes
> a copy of the whole data.frame using the '*tmp*' mechanism (this does not
> happen for components of a list, tracemem( ) on R-2.6.2 says so).
>
> Suggested solution ::
> Store the columns of the data.frame as a list inside of an environment slot
> of an S4 class, and define the '[', '[<-' etc. operators using setMethod( )
> and setReplaceMethod( ).

The Bioconductor package Biobase has a class 'ExpressionSet' with slot
assayData. By default assayData is an environment that is 'locked', so it
can't be modified casually. The interface to ExpressionSet unlocks the
environment, and copies and modifies it when necessary. This is not quite
the same as you propose, but has some similar characteristics.

I've spent a lot of time with this data structure, and think this borders
on one of those ideas that 'seemed like a good idea at the time'. You end up
using R-level tools to manage memory. Copy-on-change is better than you
might naively think at not making unnecessary copies. S4 carries significant
overhead, including copies during method dispatch, that works against you
(subsetting an ExpressionSet in an OOP way, no behind-the-scenes tricks,
makes *5* copies of the S4 instance, though perhaps these are light-weight
because the big data is in an environment). And in the meantime computers
have gotten faster and bigger, and the 'big' data of ExpressionSets are now
only modestly sized or even small.

A somewhat different approach is in the Biostrings package, for instance
DNAStringSet, where the original object is 'read-only'. The user is
presented with a 'view' into the object; changing the view (subsetting)
changes the indices in the view but not the original data. This is both
fast and memory efficient. This is a read-only solution, though.

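To make the 'view' idea concrete, here is a minimal base-R sketch. It is not
how Biostrings implements DNAStringSet; make_view, subset_view and
view_values are hypothetical names made up for this illustration, and the
data are just a small character vector. The point is only that subsetting
rewrites an index vector while the shared data stay in one locked
environment.

```r
## Hypothetical helpers, base R only; NOT the Biostrings API.

## A view is a reference to shared, read-only data plus an index vector.
make_view <- function(data, index = seq_along(data)) {
  store <- new.env(parent = emptyenv())
  assign("data", data, envir = store)
  lockEnvironment(store, bindings = TRUE)  # shared data can no longer be reassigned
  list(store = store, index = index)
}

## Subsetting a view only changes the indices; the data are not touched.
subset_view <- function(view, i) {
  list(store = view$store, index = view$index[i])
}

## Materialise the elements a view currently points at.
view_values <- function(view) {
  get("data", envir = view$store)[view$index]
}

x  <- c("ACGT", "GGCC", "TTAA", "CATG")
v1 <- make_view(x)
v2 <- subset_view(v1, c(2, 4))

view_values(v2)                 # elements 2 and 4 of the original vector
identical(v1$store, v2$store)   # both views refer to the same environment
```

Because the binding is locked, an attempt to assign to "data" in the shared
environment signals an error, which is what makes this a read-only design.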
Hope that helps,

Martin

> Question ::
> This implementation will violate copy on modify principle of R (since
> environments are not copied), but will save a lot of memory. Do you see any
> other obvious problem(s) with the idea? Have you seen a related setup
> implemented / considered before (apart from the packages like filehash, ff,
> and database related ones for saving memory)?
>
> Implementation code snippet ::
> ### The S4 class.
> setClass('DataFrame',
>          representation(data = 'data.frame', nrow = 'numeric', ncol =
> 'numeric', store = 'environment'),
>          prototype(data = data.frame( ), nrow = 0, ncol = 0))
>
> setMethod('initialize', 'DataFrame', function(.Object) {
>     .Object <- callNextMethod( )
>     [EMAIL PROTECTED] <- new.env(hash = TRUE)
>     assign('data', as.list([EMAIL PROTECTED]), [EMAIL PROTECTED])
>     [EMAIL PROTECTED] <- nrow([EMAIL PROTECTED])
>     [EMAIL PROTECTED] <- ncol([EMAIL PROTECTED])
>     [EMAIL PROTECTED] <- data.frame( )
>     .Object
> })
>
> ### Usage:
> nn <- 10
> ## dd1 below could possibly be created by read.table or scan and data.frame
> dd1 <- data.frame(xx = rnorm(nn), yy = rnorm(nn))
> dd2 <- new('DataFrame', data = dd1)
> rm(dd1)
> ## Now work with dd2
>
> Thanks a lot,
> Gopi Goswami.
> PhD, Statistics, 2005
> http://gopi-goswami.net/index.html
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793
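
The quoted snippet is not runnable as archived, because the slot accesses
were redacted. The following is a hedged sketch of the same idea, not a
recovery of the original code: it assumes the redacted tokens were slot
accesses on .Object, uses a hypothetical class name 'EnvDataFrame', passes
the data.frame as an argument to initialize instead of through a 'data'
slot, and adds a hypothetical helper set_column. It also shows the
consequence raised in the question: two R variables that hold the same
object share one environment, so a change made through one is visible
through the other.

```r
library(methods)

## Hypothetical class: columns live as a list inside an environment slot.
setClass("EnvDataFrame",
         representation(nrow = "numeric", ncol = "numeric",
                        store = "environment"))

## Assumption: 'data' is passed to new() and stored only in the environment.
setMethod("initialize", "EnvDataFrame",
          function(.Object, data = data.frame(), ...) {
  .Object <- callNextMethod(.Object, ...)
  .Object@store <- new.env(hash = TRUE)   # fresh environment per instance
  assign("data", as.list(data), envir = .Object@store)
  .Object@nrow <- nrow(data)
  .Object@ncol <- ncol(data)
  .Object
})

## Hypothetical helper: replace one column inside the environment.
set_column <- function(object, name, value) {
  stopifnot(length(value) == object@nrow)
  cols <- get("data", envir = object@store)
  cols[[name]] <- value
  assign("data", cols, envir = object@store)
  invisible(object)
}

dd1 <- data.frame(xx = rnorm(10), yy = rnorm(10))
dd2 <- new("EnvDataFrame", data = dd1)
dd3 <- dd2                           # the environment itself is not copied
set_column(dd2, "xx", rep(0, 10))

get("data", envir = dd3@store)$xx    # dd3 sees the change made through dd2
```

This is the reference-semantics behaviour that departs from R's usual
copy-on-modify rules, so code that relies on 'dd3' being an independent
copy of 'dd2' would silently break.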