[Rd] Doing the right amount of copy for large data frames.

Gopi Goswami Mon, 14 Apr 2008 20:29:32 -0700

Dear All,


Thanks a lot for your helpful comments (e.g., NAMED, ExpressionSet,
DNAStringSet).


Observations and questions ::

ooo   For a data.frame dd and a list ll with the same contents to begin with,
the following operations show a significant difference in the maximum memory
usage column of the gc( ) output on R-2.6.2 (the detailed code is in the PS
section below).

ll$xx <- zz
dd$xx <- zz

My understanding is that the '$<-.data.frame' S3 method above makes a copy
of the whole dd first (using '*tmp*'), but that for a list this is avoided
due to the use of SET_VECTOR_ELT at the C level. Is this a valid explanation,
or is something deeper happening behind the scenes?
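
One way to check this is with tracemem( ), which reports each time a traced
object is duplicated. The sketch below is only illustrative: it assumes an R
build with memory profiling enabled (checked via capabilities("profmem")),
and what gets reported depends on the R version, so no expected output is
shown.

```r
zz <- seq_len(1000000)
ll <- list(xx = zz)
dd <- data.frame(xx = zz)

# tracemem() only works when R was built with memory profiling,
# so guard the calls.
if (capabilities("profmem")) {
  invisible(tracemem(ll))
  invisible(tracemem(dd))
}

# Any duplication of a traced object is reported as a "tracemem[...]" line.
ll$yy <- zz
dd$yy <- zz

if (capabilities("profmem")) {
  untracemem(ll)
  untracemem(dd)
}
```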



ooo    I'll look into the read-only flag idea to avoid unhappy circumstances
that might arise while bypassing the copy-on-modify principle. Any pointers
or code snippets as to how to implement this idea?
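
For concreteness, here is a minimal sketch of what I take the read-only flag
to mean, assuming the columns live as a list inside an environment. The
helper names new_store( ), set_readonly( ) and set_column( ) are hypothetical
and not taken from any existing package.

```r
# Hypothetical helper: keep the columns as a list inside an environment,
# together with a 'readonly' flag.
new_store <- function(df) {
  store <- new.env(hash = TRUE)
  assign("data", as.list(df), envir = store)
  assign("readonly", FALSE, envir = store)
  store
}

# Hypothetical helper: set (or clear) the flag.
set_readonly <- function(store, value = TRUE) {
  assign("readonly", value, envir = store)
  invisible(store)
}

# Hypothetical helper: column replacement that stops with an error
# once the flag is set.
set_column <- function(store, name, value) {
  if (get("readonly", envir = store)) {
    stop("store is read-only; cannot modify column '", name, "'")
  }
  cols <- get("data", envir = store)
  cols[[name]] <- value
  assign("data", cols, envir = store)
  invisible(store)
}

ss <- new_store(data.frame(xx = 1:3))
set_column(ss, "yy", 4:6)          # allowed: the flag is still FALSE
set_readonly(ss)
res <- tryCatch(set_column(ss, "yy", 7:9),
                error = function(e) conditionMessage(e))
```

This only illustrates the flag; set_column( ) as written makes no attempt to
avoid copying the list of columns.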



ooo    The main reason I want to bypass copy-on-modify is that I want to
emulate Python-like behavior for lists (and data.frames), in the sense
that I want to take responsibility for making a deep copy if need be,
but most of the time I want to knowingly change things in place using the
proposed S4 class DataFrame.
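
The behavior I have in mind can be sketched with a plain environment, which
is not copied on assignment. The helper copy_env( ) is hypothetical; it
copies the bindings into a fresh environment only when explicitly asked, and
any environments nested inside would still be shared.

```r
# Environments are not copied on assignment: e2 is the same object as e1.
e1 <- new.env(hash = TRUE)
assign("xx", c(1, 2, 3), envir = e1)
e2 <- e1
assign("xx", c(10, 2, 3), envir = e2)
identical(get("xx", envir = e1), c(10, 2, 3))   # TRUE: change is visible via e1

# Hypothetical helper: an explicit copy, made only on request.
copy_env <- function(env) {
  list2env(as.list(env, all.names = TRUE), parent = parent.env(env))
}

e3 <- copy_env(e1)
assign("xx", c(0, 0, 0), envir = e3)
identical(get("xx", envir = e1), c(10, 2, 3))   # TRUE: e1 is unaffected
```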


Regards,
Gopi Goswami.
PhD, Statistics, 2005
http://gopi-goswami.net/index.html



PS:

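## data.frame case
zz <- seq_len(1000000)
gc( )                        # baseline; compare the 'max used' column
dd <- data.frame(xx = zz)
dd$yy <- zz                  # dispatches to '$<-.data.frame'
gc( )
object.size(dd)

######################################################################

## list case
zz <- seq_len(1000000)
gc( )                        # baseline; compare the 'max used' column
ll <- list(xx = zz)
ll$yy <- zz                  # plain list assignment
gc( )
object.size(ll)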




On Mon, Apr 14, 2008 at 10:18 AM, Tony Plate <[EMAIL PROTECTED]> wrote:

> Gopi Goswami wrote:
>
> > Hi there,
> >
> >
> > Problem ::
> > When one tries to change one or some of the columns of a data.frame, R
> > makes a copy of the whole data.frame using the '*tmp*' mechanism (this
> > does not happen for components of a list, tracemem( ) on R-2.6.2 says
> > so).
> >
> >
> > Suggested solution ::
> > Store the columns of the data.frame as a list inside of an environment
> > slot of an S4 class, and define the '[', '[<-' etc. operators using
> > setMethod( ) and setReplaceMethod( ).
> >
> >
> > Question ::
> > This implementation will violate the copy-on-modify principle of R (since
> > environments are not copied), but will save a lot of memory. Do you see
> > any other obvious problem(s) with the idea?
> >
> Well, because it violates the copy-on-modify principle it can potentially
> break code that depends on this principle.  I don't know how much there is
> -- did you try to see if R and recommended packages will pass checks with
> this change in place?
>
> > Have you seen a related setup implemented / considered before (apart
> > from the packages like filehash, ff, and database related ones for
> > saving memory)?
> >
> >
> I've frequently used a personal package that stores array data in a file
> (like ff).  It works fine, and I partially get around the problem of
> violating the copy-on-modify principle by having a readonly flag in the
> object -- when the flag is set to allow modification I have to be careful,
> but after I set it to readonly I can use it more freely with the knowledge
> that if some function does attempt to modify the object, it will stop with
> an error.
>
> In this particular case, why not just track down why data frame
> modification is copying the entire object and suggest a change so that it
> just copies the column being changed?  (should be possible if list
> modification doesn't copy all components).
>
> -- Tony Plate
>
> >
> > Implementation code snippet ::
> > ### The S4 class.
> > setClass('DataFrame',
> >          representation(data = 'data.frame', nrow = 'numeric',
> >                         ncol = 'numeric', store = 'environment'),
> >          prototype(data = data.frame( ), nrow = 0, ncol = 0))
> >
> > setMethod('initialize', 'DataFrame', function(.Object) {
> >    .Object <- callNextMethod( )
> >    .Object@store <- new.env(hash = TRUE)
> >    assign('data', as.list(.Object@data), .Object@store)
> >    .Object@nrow <- nrow(.Object@data)
> >    .Object@ncol <- ncol(.Object@data)
> >    .Object@data <- data.frame( )
> >    .Object
> > })
> >
> >
> > ### Usage:
> > nn  <- 10
> > ## dd1 below could possibly be created by read.table or scan and
> > ## data.frame
> > dd1 <- data.frame(xx = rnorm(nn), yy = rnorm(nn))
> > dd2 <- new('DataFrame', data = dd1)
> > rm(dd1)
> > ## Now work with dd2
> >
> >
> > Thanks a lot,
> > Gopi Goswami.
> > PhD, Statistics, 2005
> > http://gopi-goswami.net/index.html
> >
> >        [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> >
> >
>
>

