Re: [Rd] Doing the right amount of copy for large data frames.

Martin Morgan Mon, 14 Apr 2008 09:02:56 -0700

Hi Gopi

"Gopi Goswami" <[EMAIL PROTECTED]> writes:


> Hi there,
>
>
> Problem ::
> When one tries to change one or some of the columns of a data.frame, R makes
> a copy of the whole data.frame using the '*tmp*' mechanism (this does not
> happen for components of a list, tracemem( ) on R-2.6.2 says so).
>
>
> Suggested solution ::
> Store the columns of the data.frame as a list inside of an environment slot
> of an S4 class, and define the '[', '[<-' etc. operators using setMethod( )
> and setReplaceMethod( ).

The Bioconductor package Biobase has a class 'ExpressionSet' with slot
assayData. By default assayData is an environment that is 'locked' so
can't be modified casually. The interface to ExpressionSet unlocks the
environment, and copies and modifies it when necessary. This is not
quite the same as you propose, but has some similar characteristics.
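
A rough base-R illustration of the locked-environment idea follows.
This is not the Biobase implementation; 'store' and 'update_store'
are names made up for the sketch, and it assumes that an update copies
the whole environment before changing it.

```r
# Sketch only: a locked environment with copy-then-modify updates.
# 'store' and 'update_store' are hypothetical names, not Biobase API.
store <- new.env(hash = TRUE)
assign("exprs", matrix(1:6, nrow = 2), envir = store)
lockEnvironment(store, bindings = TRUE)

# A casual assignment into the locked environment signals an error.
casual <- tryCatch({
    assign("exprs", matrix(0, 2, 3), envir = store)
    "modified"
}, error = function(e) "blocked")

# Copy-and-modify: fill a fresh environment, change it, lock the result.
update_store <- function(env, name, value) {
    copy <- list2env(as.list(env, all.names = TRUE),
                     envir = new.env(hash = TRUE))
    assign(name, value, envir = copy)
    lockEnvironment(copy, bindings = TRUE)
    copy
}

store2 <- update_store(store, "exprs", get("exprs", envir = store) * 10L)
```

The original 'store' is left as it was, so code holding a reference to
it does not see the change; only 'store2' carries the new values.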

I've spent a lot of time with this data structure, and think this
borders on one of those ideas that 'seemed like a good idea at the
time'. You end up using R-level tools to manage memory. Copy-on-change
is better than you might naively think at not making unnecessary
copies. S4 carries significant overhead, including copies during method
dispatch, that works against you (subsetting an ExpressionSet in an
OOP way, no behind-the-scenes tricks, makes *5* copies of the S4
instance, though perhaps these are light-weight because the big data
is in an environment). And in the meantime computers have gotten
faster and bigger, and the 'big' data of ExpressionSets are now only
modestly sized or even small.
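
One way to see copy-on-change at work is tracemem(). The snippet below
is only a sketch: what tracemem() reports (and how many copies are
made) depends on the R version, and the function is only available
when R was built with memory profiling, hence the tryCatch() guard.

```r
# Sketch only: tracemem() output varies with R version and build.
df <- data.frame(xx = (1:10) / 10, yy = (10:1) / 10)

# tracemem() signals an error on builds without memory profiling.
traced <- tryCatch(tracemem(df), error = function(e) NULL)

alias <- df          # no duplication here: two names, one object
alias$xx[1] <- 0     # duplication is deferred until this modification

if (!is.null(traced)) untracemem(df)

# Value semantics are preserved: 'df' is unchanged, 'alias' is modified.
c(original = df$xx[1], modified = alias$xx[1])
```

The plain assignment costs nothing; any copying is reported only when
'alias' is actually changed.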

A somewhat different approach is in the Biostrings package, for
instance DNAStringSet, where the original object is 'read-only'. The
user is presented with a 'view' into the object; changing the view
(subsetting) changes the indices in the view but not the original
data. This is both fast and memory efficient. This is a read-only
solution, though.
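
The view idea can be sketched in base R along these lines. This is not
how Biostrings implements it; 'make_view', 'values' and the 'view'
class are hypothetical names used only for illustration.

```r
# Sketch only: a read-only store plus an index vector acting as a 'view'.
# 'make_view', 'values' and class 'view' are hypothetical names.
make_view <- function(data, idx = seq_along(data)) {
    store <- new.env()
    assign("data", data, envir = store)
    lockEnvironment(store, bindings = TRUE)   # the data become read-only
    structure(list(store = store, idx = idx), class = "view")
}

# Subsetting rewrites the indices only; the store is shared, not copied.
`[.view` <- function(x, i) {
    x$idx <- x$idx[i]
    x
}

# Materialize the elements the view currently points at.
values <- function(v) get("data", envir = v$store)[v$idx]

v  <- make_view(c("ACGT", "GGCC", "TTAA", "CCGG"))
v2 <- v[c(2, 4)]
values(v2)
identical(v$store, v2$store)   # both views refer to the same environment
```

Only the small index vector is duplicated when a view is subset, which
is why the approach is cheap in both time and memory.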

Hope that helps, Martin

> Question ::
> This implementation will violate copy on modify principle of R (since
> environments are not copied), but will save a lot of memory. Do you see any
> other obvious problem(s) with the idea? Have you seen a related setup
> implemented / considered before (apart from the packages like filehash, ff,
> and database related ones for saving memory)?
>
>
> Implementation code snippet ::
> ### The S4 class.
> setClass('DataFrame',
>               representation(data = 'data.frame', nrow = 'numeric', ncol =
> 'numeric', store = 'environment'),
>               prototype(data = data.frame( ), nrow = 0, ncol = 0))
>
> setMethod('initialize', 'DataFrame', function(.Object) {
>     .Object <- callNextMethod( )
>     .Object@store <- new.env(hash = TRUE)
>     assign('data', as.list(.Object@data), .Object@store)
>     .Object@nrow <- nrow(.Object@data)
>     .Object@ncol <- ncol(.Object@data)
>     .Object@data <- data.frame( )
>     .Object
> })
>
>
> ### Usage:
> nn  <- 10
> ## dd1 below could possibly be created by read.table or scan and data.frame
> dd1 <- data.frame(xx = rnorm(nn), yy = rnorm(nn))
> dd2 <- new('DataFrame', data = dd1)
> rm(dd1)
> ## Now work with dd2
>
>
> Thanks a lot,
> Gopi Goswami.
> PhD, Statistics, 2005
> http://gopi-goswami.net/index.html
>
>
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793

