Re: [R] [External] Function environments serialize to a lot of data until they don't
On Fri, 8 Mar 2024, Ivan Krylov via R-help wrote:

Hello R-help,

I've noticed that my 'parallel' jobs take too much memory to store and
transfer to the cluster workers. I've managed to trace it to the
following:

    # `payload` is being written to the cluster worker.
    # The function FUN had been created as a closure inside my package:
    payload$data$args$FUN
    # function (l, ...)
    # withCallingHandlers(fun(l$x, ...), error = .wraperr(l$name))
    #
    #

    # The function seems to bring a lot of captured data with it.
    e <- environment(payload$data$args$FUN)
    length(serialize(e, NULL))
    # [1] 738202878
    parent.env(e)
    #

    # The parent environment has a name, so it all must be right here.
    # What is it?
    ls(e, all.names = TRUE)
    # [1] "fun"
    length(serialize(e$fun, NULL))
    # [1] 317

    # The only object in the environment is small!
    # Where is the 700 megabytes of data?
    length(serialize(e, NULL))
    # [1] 536
    length(serialize(payload$data$args$FUN, NULL))
    # [1] 1722

And once I've observed `fun`, the environment becomes very small and
now can be serialized in a very compact manner.

I managed to work around it by forcing the promise and explicitly
putting `fun` in a small environment when constructing the closure:

    .wrapfun <- function(fun) {
        e <- new.env(parent = loadNamespace('mypackage'))
        e$fun <- fun
        # NOTE: a naive return(function(...)) could serialize to 700
        # megabytes due to `fun` seemingly being a promise (?). Once the
        # promise is resolved, suddenly `fun` is much more compact.
        ret <- function(l, ...) withCallingHandlers(
            fun(l$x, ...),
            error = .wraperr(l$name)
        )
        environment(ret) <- e
        ret
    }

Creating and setting environments is brittle and easy to get wrong. I
prefer to use a combination of proper lexical scoping, regular
assignments, and force() as I do below.

Is this analysis correct? Could a simple f <- force(fun) have sufficed?
Where can I read more about this type of problems?
Just force(fun), without the assignment, should be enough, or even just
fun, as in

    function(fun) { fun; }

Using force() makes the intent clearer.

Closures or formulas capturing large amounts of data are something you
have to be careful about with serialization in general and distributed
memory computing in R in particular. There is a little on it in the
parallel vignette. I know I have talked and written about it in various
places but can't remember a specific reference right now.

I usually define a top level function to create any closures I want to
transmit and make sure they only capture what they need. A common
pattern is provided by a simple function for creating a normal
log-likelihood:

    mkLL <- function(x) {
        m <- mean(x)
        s <- sd(x)
        function(y) sum(dnorm(y, m, s, log = TRUE))
    }

This avoids recomputing the mean and sd on every call. It is fine for
use within a single process, and the fact that the original data is
available in the environment might even be useful for debugging:

    ll <- mkLL(rnorm(10))
    environment(ll)$x
    ## [1] -0.09202453  0.78901912 -0.66744232  1.36061149  1.50768816
    ## [6] -2.60754997  0.68727212  0.31557476  2.02027688 -1.42361769

But it does prevent the data from being garbage-collected until the
returned result is no longer reachable.

A more GC-friendly, and serialization-friendly definition is

    mkLL <- function(x) {
        m <- mean(x)
        s <- sd(x)
        x <- NULL  ## not needed anymore; remove from the result's enclosing env
        function(y) sum(dnorm(y, m, s, log = TRUE))
    }

    ll <- mkLL(rnorm(1e7))
    length(serialize(ll, NULL))
    ## [1] 734

If you prefer to calculate the mean and sd yourself you could use

    mkLL1 <- function(m, s)
        function(x) sum(dnorm(x, m, s, log = TRUE))

Until the result is called for the first time the evaluation of the
arguments will be delayed, i.e. encoded in promises that record the
expression to evaluate and the environment in which to evaluate it:

    f <- function(n) {
        x <- rnorm(n)
        mkLL1(mean(x), sd(x))
    }
    ll <- f(1e7)
    length(serialize(ll, NULL))
    ## [1] 80002223

Once the arguments are evaluated, the expressions are still needed for
substitute() and such, but the environment is not, so it is dropped,
and if the promise environment can no longer be reached it can be
garbage-collected. It will also no longer appear in a serialization:

    ll(1)
    ## [1] -1.419588
    length(serialize(ll, NULL))
    ## [1] 3537

Having a reference to a large environment is not much of an issue
within a single process, but can be in a distributed memory parallel
computing context. To avoid this you can force evaluation of the
promises:

    mkLL1 <- function(m, s) {
        force(m)
        force(s)
        function(x) sum(dnorm(x, m, s, log = TRUE))
    }
    ll <- f(1e7)
    length(serialize(ll, NULL))
    ## [1] 2146

The possibility of inadvertently transferring too much data is an issue
in distributed memory computing in general, so there are various tools
that help. A very sim
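
For anyone who wants to check a closure before shipping it to workers, here is a minimal sketch of the lazy versus forced comparison discussed in this message. The names make_lazy, make_forced and closure_size are made up for illustration (they are not from parallel or any package), and n = 1e5 is an arbitrary size chosen to keep the run quick. Exact byte counts will vary with the R version and with whether source references are kept, so none are quoted here.

```r
# Hypothetical helper names; the two closures mirror the mkLL1 variants
# shown in the message above.
make_lazy <- function(m, s)
    function(x) sum(dnorm(x, m, s, log = TRUE))

make_forced <- function(m, s) {
    force(m)  # evaluate the promise now, so its environment can be dropped
    force(s)
    function(x) sum(dnorm(x, m, s, log = TRUE))
}

# Build a closure from n simulated values and return the length of its
# serialization. The data `x` lives only in this function's frame, which
# the unevaluated promises for `m` and `s` still refer to.
closure_size <- function(maker, n) {
    x <- rnorm(n)
    ll <- maker(mean(x), sd(x))
    length(serialize(ll, NULL))
}

n <- 1e5
size_lazy <- closure_size(make_lazy, n)
size_forced <- closure_size(make_forced, n)
cat("lazy:", size_lazy, "bytes; forced:", size_forced, "bytes\n")
```

The lazy variant should come out larger than the 8 * n bytes needed for the data alone, while the forced variant should stay small. Running the same length(serialize(f, NULL)) check on any function about to be sent to a cluster is a cheap way to spot this kind of capture.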
Re: [R] [External] Re: Building Packages. (fwd)
[forgot to copy to R-help so re-sending]

-- Forwarded message --
Date: Thu, 21 Mar 2024 11:41:52 +
From: luke-tier...@uiowa.edu
To: Duncan Murdoch
Subject: Re: [External] Re: [R] Building Packages.

At least on my installed version (which tells me it is out of date)
they appear to just be modifying the "package:utils" parent frame of
the global search path.

There seem to be a few others:

    checkUtilsFun <- function(n)
        identical(get(n, "package:utils"), get(n, getNamespace("utils")))
    names(which(! sapply(ls("package:utils", all = TRUE), checkUtilsFun)))
    ## [1] "bug.report"       "file.edit"        "help.request"
    ## [4] "history"          "install.packages" "remove.packages"
    ## [7] "View"

I don't know why they don't put these overrides in the tools:rstudio
frame. At least that would make them more visible.

You can fix all of these with something like

    local({
        up <- match("package:utils", search())
        detach("package:utils")
        library(utils, pos = up)
    })

or just install.packages with

    local({
        up <- match("package:utils", search())
        unlockBinding("install.packages", pos.to.env(up))
        assign("install.packages", utils::install.packages, "package:utils")
        lockBinding("install.packages", pos.to.env(up))
    })

Best,

luke

On Thu, 21 Mar 2024, Duncan Murdoch wrote:

Yes, you're right. The version found in the search list entry for
"package:utils" is the RStudio one; the ones found with two or three
colons are the original.

Duncan Murdoch

On 21/03/2024 5:48 a.m., peter dalgaard wrote:

Um, what's with the triple colon? At least on my install, double seems
to suffice:

    identical(utils:::install.packages, utils::install.packages)
    [1] TRUE
    install.packages
    function (...)
    .rs.callAs(name, hook, original, ...)

-pd

On 21 Mar 2024, at 09:58 , Duncan Murdoch wrote:

The good news for Jorgen (who may not be reading this thread any more)
is that one can still be sure of getting the original
install.packages() by using

    utils:::install.packages( ... )

with *three* colons, to get the internal (namespace) version of the
function.
Duncan Murdoch

On 21/03/2024 4:31 a.m., Martin Maechler wrote:

Duncan Murdoch on Wed, 20 Mar 2024 13:20:12 -0400 writes:

> On 20/03/2024 1:07 p.m., Duncan Murdoch wrote:
>> On 20/03/2024 12:37 p.m., Ben Bolker wrote:
>>> Ivan, can you give more detail on this? I've heard this
>>> issue mentioned, but when I open RStudio and run
>>> find("install.packages") it returns
>>> "utils::install.packages", and running dump() from
>>> within RStudio console and from an external "R
>>> --vanilla" gives identical results.
>>>
>>> I thought at one point this might only refer to the GUI
>>> package-installation interface, but you seem to be
>>> saying it's the install.packages() function as well.
>>>
>>> Running an up-to-date RStudio on Linux, FWIW -- maybe
>>> weirdness only happens on other OSs?
>>
>> On MacOS, I see this:
>>
>> > install.packages
>> function (...) .rs.callAs(name, hook, original, ...)
>>
>> I get the same results as you from find(). I'm not sure
>> what RStudio is doing to give a different value for the
>> function than what find() sees.

> Turns out that RStudio replaces the install.packages
> object in the utils package.

> Duncan Murdoch

Yes, and this has been the case for several years now, and I have
mentioned this several times, too (though some of it possibly not in a
public R-* mailing list).

And yes, that they modify the package environment

    as.environment("package:utils")

but leave the namespace

    asNamespace("utils")

unchanged, makes it harder to see what's going on (but also has less
severe consequences; if they kept to the otherwise universal *rule*
that the namespace and package must have the same objects apart from
those only in the namespace, people would not even have access to R's
true install.packages() but only see the RStudio fake^Hsubstitute).

We are still not happy with their decision. Also help(install.packages)
goes to R's documentation of R's install.packages, so there's even more
misleading of useRs.
Martin

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.r-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa Phone: 319-335-3386
Depart
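
The comparison between an attached package environment and its namespace, as discussed in this thread, can be written as a small general helper. This is only a sketch: differing_bindings is a made-up name, and it takes two environments rather than a package name so that it can be tried on toy environments without touching the search path. Names present in the attached environment but absent from the namespace are reported as differing too, which is an assumption of this sketch rather than anything base R prescribes.

```r
# Hypothetical helper: report names whose value in `attached` is missing
# from, or not identical to, the value bound to the same name in `ns`.
differing_bindings <- function(attached, ns) {
    nms <- ls(attached, all.names = TRUE)
    same <- vapply(nms, function(n)
        exists(n, envir = ns, inherits = FALSE) &&
            identical(get(n, envir = attached, inherits = FALSE),
                      get(n, envir = ns, inherits = FALSE)),
        logical(1))
    nms[!same]
}

# Toy demonstration: `g` is shared, `f` has been replaced in the
# "attached" copy only.
ns_env <- new.env()
assign("f", function(x) x + 1, envir = ns_env)
assign("g", function(x) x * 2, envir = ns_env)

attached_env <- new.env()
assign("f", function(...) stop("replaced"), envir = attached_env)
assign("g", get("g", envir = ns_env), envir = attached_env)

overridden <- differing_bindings(attached_env, ns_env)
print(overridden)
```

For the case in this thread the call would be differing_bindings(as.environment("package:utils"), asNamespace("utils")), which should list the replaced functions in a front end that overrides them.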
Re: [R] [External] Re: Parser For Line Number Tracing
On Sun, 19 Jan 2025, Ivo Welch wrote:

Hi Duncan -- Wonderful. Thank you. Bug or no bug, I think it would be a
huge improvement for user-friendliness if R printed the last line by
default *every time* a script dies. Most computer languages do so.

Should I file it as a request for improvement to the R code development
team? Maybe R can be improved at a very low cost to the development
team and a very high benefit to newbies.

No. There are already many ways to influence the way the default error
handler prints information about errors, mostly via options(). In
particular you may want to look at entries in ?options for

    show.error.locations
    showErrorCalls
    showWarningCalls

and adjust your options settings accordingly.

Best,

luke

Regards,

/ivo

On Sun, Jan 19, 2025 at 2:39 AM Duncan Murdoch wrote:

On 2025-01-18 8:27 p.m., Ivo Welch wrote:

I am afraid my errors are worse! (so are my postings. I should have
given an example.)

```
x <- 1
y <- 2
nofunction("something stupid I am doing!")
z <- 4
```

and

```
source("where-is-my-water.R")
Error in nofunction("something stupid I am doing!") :
  could not find function "nofunction"
```

and no traceback is available.

Okay, I see. In that case traceback() doesn't report the line, but it
still is known internally.
You can see it using the following function:

    showKnownLocations <- function() {
      calls <- sys.calls()
      srcrefs <- sapply(calls, function(v)
        if (!is.null(srcref <- attr(v, "srcref"))) {
          srcfile <- attr(srcref, "srcfile")
          paste0(basename(srcfile$filename), "#", srcref[1L])
        } else ".")
      cat("Current call stack locations:\n")
      cat(srcrefs, sep = " ")
      cat("\n")
    }

I haven't done much testing on this, but I think it can be called
explicitly from any location if you want to know how you got there, or
you can set it as the error handler using

    options(error = showKnownLocations)

For example, try this script:

    options(error = showKnownLocations)
    f <- function() showKnownLocations()
    x <- 1
    f()
    y <- 2
    nofunction("something stupid I am doing!")
    z <- 4

I see this output from source("test.R"):

    > source("test.R")
    Current call stack locations:
    . . . . test.R#4 test.R#2
    Error in nofunction("something stupid I am doing!") :
      could not find function "nofunction"
    Current call stack locations:
    . . . . test.R#6

The first report is from the explicit call in f() on line 2 that was
invoked on line 4, and the second report happens during error handling.

I suppose the fact that traceback() isn't showing you the line 6
location could be considered a bug.

Duncan Murdoch

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:  319-335-3386
Department of Statistics and        Fax:    319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:  luke-tier...@uiowa.edu
Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu/
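
As a complement to the options and the error handler discussed in this thread, here is a minimal sketch of another way to get the failing line of a script: parse the file with source references kept and evaluate the top-level expressions one at a time. This is a simplified stand-in for source(), not what source() itself does, and run_with_lines is a made-up name. It reports only the first line of the failing top-level expression, so an error deep inside a function call is still attributed to the top-level line that triggered it.

```r
# Hypothetical helper: evaluate a script expression by expression and
# report the starting line of the first top-level expression that fails.
run_with_lines <- function(path, envir = new.env(parent = globalenv())) {
    exprs <- parse(file = path, keep.source = TRUE)
    srcrefs <- attr(exprs, "srcref")   # one srcref per top-level expression
    for (i in seq_along(exprs)) {
        ok <- tryCatch({
            eval(exprs[[i]], envir)
            TRUE
        }, error = function(e) {
            # srcrefs[[i]][1L] is the first line of the expression
            message(basename(path), "#", srcrefs[[i]][1L], ": ",
                    conditionMessage(e))
            FALSE
        })
        if (!ok) return(invisible(srcrefs[[i]][1L]))
    }
    invisible(NULL)
}

# The example script from this thread, written to a temporary file.
script <- tempfile(fileext = ".R")
writeLines(c('x <- 1',
             'y <- 2',
             'nofunction("something stupid I am doing!")',
             'z <- 4'),
           script)
failed_line <- run_with_lines(script)
```

Here failed_line holds the line number of the nofunction() call, and the message printed to stderr carries the file name, the line and the error text.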