Reviving an old thread. I haven't noticed this being a problem for a while when saving RDS files, which is great. However, I noticed the problem again when saving `qs` files (https://github.com/traversc/qs), an RDS replacement with a fast serialization / compression system.
I'd like to get an idea of what change was made within R to address this issue for `saveRDS`. My thought is that this will help the author of the `qs` package do something similar. I have had a browse through the release notes for the last few years (Ctrl-F-ing "environment") and couldn't see it.

Many thanks for any help and best wishes to all.

The following code uses R 3.6.2 and requires you to run install.packages("qs") first:

save_size_qs <- function (object) {
  tf <- tempfile(fileext = ".qs")
  on.exit(unlink(tf))
  qs::qsave(object, file = tf)
  file.size(tf)
}

save_size_rds <- function (object) {
  tf <- tempfile(fileext = ".rds")
  on.exit(unlink(tf))
  saveRDS(object, file = tf)
  file.size(tf)
}

normal_lm <- function(){
  junk <- 1:1e+08
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}

normal_ggplot <- function(){
  junk <- 1:1e+08
  ggplot2::ggplot()
}

clean_lm <- function () {
  junk <- 1:1e+08
  # Run the lm in its own environment
  env <- new.env(parent = globalenv())
  env$subset <- subset
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris))
}

# The qs save size includes the junk but the rds does not
save_size_qs(normal_lm())
#> [1] 848396
save_size_rds(normal_lm())
#> [1] 4163

save_size_qs(normal_ggplot())
#> [1] 857446
save_size_rds(normal_ggplot())
#> [1] 12895

# Both exclude the junk when separating the lm into its own environment
save_size_qs(clean_lm())
#> [1] 6154
save_size_rds(clean_lm())
#> [1] 4255

On Thu, Jul 28, 2016 at 7:31 AM Kenny Bell <kmbel...@gmail.com> wrote:

> Thanks so much for all this.
>
> The first solution is what I'm going with as I want the terms object to
> come along so that predict still works.
>
> On Wed, Jul 27, 2016 at 12:28 PM, William Dunlap via R-devel <
> r-devel@r-project.org> wrote:
>
>> Another solution is to only save the parts of the model object that
>> interest you. As long as they don't include the formula (which is
>> what drags along the environment it was created in), you will
>> save space.
>> E.g.,
>>
>>    tfun2 <- function(subset) {
>>       junk <- 1:1e6
>>       list(subset=subset,
>>            lm(Sepal.Length ~ Sepal.Width, data=iris, subset=subset)$coef)
>>    }
>>
>>    saveSize(tfun2(1:4))
>>    #[1] 152
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>>
>> On Wed, Jul 27, 2016 at 11:19 AM, William Dunlap <wdun...@tibco.com> wrote:
>>
>> > One way around this problem is to make a new environment whose
>> > parent environment is .GlobalEnv and which contains only what
>> > the call to lm() requires and to compute lm() in that environment. E.g.,
>> >
>> >     tfun1 <- function (subset)
>> >     {
>> >         junk <- 1:1e+06
>> >         env <- new.env(parent = globalenv())
>> >         env$subset <- subset
>> >         with(env, lm(Sepal.Length ~ Sepal.Width, data = iris,
>> >                      subset = subset))
>> >     }
>> > Then we get
>> >     > saveSize(tfun1(1:4)) # see below for def. of saveSize
>> >     [1] 910
>> > instead of the 2129743 bytes in the save file when using the naive method.
>> >
>> >     saveSize <- function (object) {
>> >         tf <- tempfile(fileext = ".RData")
>> >         on.exit(unlink(tf))
>> >         save(object, file = tf)
>> >         file.size(tf)
>> >     }
>> >
>> > Bill Dunlap
>> > TIBCO Software
>> > wdunlap tibco.com
>> >
>> > On Wed, Jul 27, 2016 at 10:48 AM, Kenny Bell <km...@berkeley.edu> wrote:
>> >
>> >> In the below, I generate a model from an environment that isn't
>> >> .GlobalEnv with a large object that is unrelated to the model
>> >> generation. It seems to save the irrelevant object unnecessarily. In
>> >> my actual use case, I am running and saving many models in a loop that
>> >> each use a single large data.frame (that gets collapsed into a small
>> >> data.frame for estimation), so removing it isn't an option.
>> >>
>> >> In the case where the model exists in .GlobalEnv, everything is
>> >> peachy. So replicating whatever happens when saving the model that was
>> >> generated in .GlobalEnv at the return() stage of the function call
>> >> would fix this problem.
>> >>
>> >> I was referred to this list from r-bugs. First time r-devel poster.
>> >>
>> >> Hope this helps,
>> >>
>> >> Kendon
>> >>
>> >> ```
>> >> tmp_fun <- function(x){
>> >>   iris_big <- lapply(1:10000, function(x) iris)
>> >>   lm(Sepal.Length ~ Sepal.Width, data = iris)
>> >> }
>> >>
>> >> out <- tmp_fun(1)
>> >> object.size(out)
>> >> # 48008
>> >> save(out, file = "tmp.RData", compress = FALSE)
>> >> file.size("tmp.RData")
>> >> # 57196752 - way too big
>> >>
>> >> # Works fine when in .GlobalEnv
>> >> iris_big <- lapply(1:10000, function(x) iris)
>> >> out <- lm(Sepal.Length ~ Sepal.Width, data = iris)
>> >>
>> >> object.size(out)
>> >> # 48008
>> >> save(out, file = "tmp.RData", compress = FALSE)
>> >> file.size("tmp.RData")
>> >> # 16641 - good size.
>> >> ```
>> >>
>> >> ______________________________________________
>> >> R-devel@r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-devel
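
A footnote to the thread: Bill Dunlap's point that the formula "drags along the environment it was created in" can be checked directly by listing what lives in the environment attached to the model's terms. The sketch below is only an illustration under stated assumptions. The helper names `naive_fit`, `clean_fit` and `captured` are made up for this example, and `junk` is kept tiny so it runs instantly. It does not measure file sizes; it only shows which objects a serializer could end up walking.

```r
# Hypothetical helpers for illustration; none of these names come from R itself.

# Naive version: the formula is created inside the function's frame,
# so that frame (including junk) is attached to the model's terms.
naive_fit <- function() {
  junk <- rnorm(10)  # small stand-in for the large unrelated object
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}

# Workaround from the thread: evaluate lm() in a fresh environment
# whose parent is the global environment.
clean_fit <- function() {
  junk <- rnorm(10)
  env <- new.env(parent = globalenv())
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris))
}

# Names of the objects in the environment stored on the terms object
captured <- function(fit) ls(attr(terms(fit), ".Environment"))

captured(naive_fit())  # contains "junk"
captured(clean_fit())  # empty, nothing extra comes along

# The terms object is still there, so predict() keeps working
predict(clean_fit(), newdata = data.frame(Sepal.Width = 3))
```

Running `captured()` on a model before saving it is a cheap way to see whether anything unexpected would come along, whichever serializer is used.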