On 06/07/2010 9:04 PM, julian.tay...@csiro.au wrote:
Hi developers,
After some investigation I have found there can be large discrepancies in the on-disk size of
the same object when it is saved as an external "xx.RData" file. The immediate repercussion of this
is the possible increased size of your .RData workspace for no apparent reason.
The function and its three scenarios below highlight these discrepancies. Note that the object
being returned is exactly the same in each circumstance. The first scenario simply loops over a set
of lm() models from a simulated set of data. The second adds a reasonably large matrix calculation
within the loop. The third highlights exactly where the discrepancy lies. It appears that when the
object is saved to an "xx.RData" it is still burdened, in some capacity, with the objects
created in the function. Only deleting these objects at the end of the function ensures the
realistic size of the returned object. Performing gc() after each of these short simulations shows
that the "Vcells" that are accumulated in the function environment appear to remain after
the function returns. These cached remains are then transferred to the .RData upon saving of the
object(s). This is occurring quite broadly across the Windows 7 (R 2.10.1) and 64-bit Ubuntu Linux
(R 2.9.0) systems that I use.
A similar problem was partially pointed out four years ago
http://tolstoy.newcastle.edu.au/R/help/06/03/24060.html
and has been made more obvious in the scenarios given below.
Admittedly I have had many problems with workspace .RData sizes over the years
and it has taken me some time to realise what is actually occurring. Can
someone enlighten me and my colleagues as to why the objects created and
evaluated in a function call stack are saved, in some capacity, with the
returned object?
I haven't worked through your example, but in general the way that local
objects get captured is when part of the return value includes an
environment. Examples of things that include an environment are locally
created functions and formulas. It's probably the latter that you're
seeing. When R computes the result of "y ~ ." or a similar formula, it
attaches a pointer to the environment in which the calculation took
place, so that later when the formula is used, it can look up y there.
For example, in your line
lm(y ~ ., data = dat)
from your code, the formula "y ~ ." needs to be computed before R knows
that you've explicitly listed a dataframe holding the data, and before
it knows whether the variable y is in that dataframe or is just a local
variable in the current function.
Since these are just pointers to the environment, this doesn't take up
much space in memory, but when you save the object to disk, a copy of
the whole environment will be made, and that can end up wasting a lot
of space if the environment contains a lot of things that aren't needed
by the formula.
Duncan Murdoch
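The mechanism described in the reply above can be illustrated with a small
sketch. This is not code from the thread: the names make_formula, big and
keep_big are hypothetical, and serialize() is used only as a convenient
in-memory stand-in for what save() writes to disk.

```r
# Hypothetical helper: returns a formula created inside a function.
make_formula <- function(keep_big = TRUE) {
  big <- rnorm(1e6)        # large local object, unrelated to the formula
  f <- y ~ x               # the formula records the function's environment
  if (!keep_big) rm(big)   # optionally drop the local before returning
  f
}

f_heavy <- make_formula(keep_big = TRUE)
f_light <- make_formula(keep_big = FALSE)

# The captured environment still lists 'big' in the first case only.
ls(environment(f_heavy))
ls(environment(f_light))

# serialize() follows the environment pointer, so the leftover local
# is written out along with the formula.
size_heavy <- length(serialize(f_heavy, NULL))
size_light <- length(serialize(f_light, NULL))
c(heavy = size_heavy, light = size_light)
```

In this sketch the two formulas are identical as formulas; only the contents
of the environment they point to differ, which is what drives the difference
in serialized size.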
Cheers,
Julian
####################### small simulation from a clean directory
lmfunc <- function(loop = 20, add = FALSE, gr = FALSE){
lmlist <- rmlist <- list()
set.seed(100)
dat <- data.frame(matrix(rnorm(100*100), ncol = 100))
rm <- matrix(rnorm(100000), ncol = 1000)
names(dat)[1] <- "y"
i <- 1
for(i in 1:loop) {
lmlist[[i]] <- lm(y ~ ., data = dat)
if(add)
rmlist[[i]] <- rm
}
fm <- lmlist[[loop]]
if(gr) {
print(what <- ls(envir = sys.frame(which = 1)))
remove(list = setdiff(what, "fm"))
}
fm
}
# baseline gc()
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 153325 4.1 350000 9.4 350000 9.4
Vcells 99228 0.8 786432 6.0 386446 3.0
###### 1. simple lm() simulation
lmtest1 <- lmfunc()
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 184470 5.0 407500 10.9 350000 9.4
Vcells 842169 6.5 1300721 10.0 1162577 8.9
save(lmtest1, file = "lm1.RData")
system("ls -s lm1.RData")
4312 lm1.RData
## A moderate increase in Vcells; .RData object around 4.3 Mb
###### 2. add matrix calculation to loop
lmtest2 <- lmfunc(add = TRUE)
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 209316 5.6 407500 10.9 405340 10.9
Vcells 3584244 27.4 4175939 31.9 3900869 29.8
save(lmtest2, file = "lm2.RData")
system("ls -s lm2.RData")
19324 lm2.RData
## An enormous increase in Vcells; .RData object is now 19Mb+
###### 3. delete all objects in function call stack
lmtest3 <- lmfunc(add = TRUE, gr = TRUE)
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 210766 5.7 467875 12.5 467875 12.5
Vcells 3615863 27.6 6933688 52.9 6898609 52.7
save(lmtest3, file = "lm3.RData")
system("ls -s lm3.RData")
320 lm3.RData
## A minimal increase in Vcells; .RData object is now 320Kb
sapply(ls(pattern = "lmtest*"), function(x) object.size(get(x, envir =
.GlobalEnv)))
lmtest1 lmtest2 lmtest3
358428 358428 358428
## all objects are deemed the same size by object.size()
######################### End sim
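The gap between object.size() and the saved file size seen in the simulation
can be reproduced on a smaller scale. The sketch below is illustrative only:
fit_in_function, junk and clean_env are hypothetical names, and giving the
formula the global environment via as.formula() is one possible alternative to
removing locals by hand, not something proposed in the thread. It assumes every
variable in the formula is found in the data frame passed to lm().

```r
# Hypothetical helper: fits the same model with two different formula environments.
fit_in_function <- function(clean_env = FALSE) {
  dat  <- data.frame(y = rnorm(50), x = rnorm(50))
  junk <- rnorm(1e6)   # large local that the model never uses
  form <- if (clean_env) {
    as.formula("y ~ x", env = globalenv())  # formula points at the global env
  } else {
    y ~ x                                   # formula points at this function's frame
  }
  lm(form, data = dat)
}

set.seed(100)
fit_heavy <- fit_in_function(clean_env = FALSE)
fit_clean <- fit_in_function(clean_env = TRUE)

# object.size() does not include the environments attached to the object ...
mem_heavy <- as.numeric(object.size(fit_heavy))
mem_clean <- as.numeric(object.size(fit_clean))

# ... whereas serialization (and hence save()) does follow them.
disk_heavy <- length(serialize(fit_heavy, NULL))
disk_clean <- length(serialize(fit_clean, NULL))

c(mem_heavy = mem_heavy, mem_clean = mem_clean,
  disk_heavy = disk_heavy, disk_clean = disk_clean)
```

Comparing length(serialize(obj, NULL)) with object.size(obj) in this way is a
quick check for whether an object is carrying an unexpectedly large environment
before it is saved.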
--
---
Dr. Julian Taylor phone: +61 8 8303 8792
Postdoctoral Fellow fax: +61 8 8303 8763
CMIS, CSIRO mobile: +61 4 1638 8180
Private Mail Bag 2 email: julian.tay...@csiro.au
Glen Osmond, SA, 5064
---
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel