Re: [R] [External] Function environments serialize to a lot of data until they don't

2024-03-08 Thread luke-tierney--- via R-help

On Fri, 8 Mar 2024, Ivan Krylov via R-help wrote:


Hello R-help,

I've noticed that my 'parallel' jobs take too much memory to store and
transfer to the cluster workers. I've managed to trace it to the
following:

# `payload` is being written to the cluster worker.
# The function FUN had been created as a closure inside my package:
payload$data$args$FUN
# function (l, ...)
# withCallingHandlers(fun(l$x, ...), error = .wraperr(l$name))
# 
# 

# The function seems to bring a lot of captured data with it.
e <- environment(payload$data$args$FUN)
length(serialize(e, NULL))
# [1] 738202878
parent.env(e)
# 

# The parent environment has a name, so it all must be right here.
# What is it?

ls(e, all.names = TRUE)
# [1] "fun"
length(serialize(e$fun, NULL))
# [1] 317

# The only object in the environment is small!
# Where is the 700 megabytes of data?

length(serialize(e, NULL))
# [1] 536
length(serialize(payload$data$args$FUN, NULL))
# [1] 1722

And once I've observed `fun`, the environment becomes very small and
now can be serialized in a very compact manner.

I managed to work around it by forcing the promise and explicitly
putting `fun` in a small environment when constructing the closure:

.wrapfun <- function(fun) {
  e <- new.env(parent = loadNamespace('mypackage'))
  e$fun <- fun
  # NOTE: a naive return(function(...)) could serialize to 700
  # megabytes due to `fun` seemingly being a promise (?). Once the
  # promise is resolved, suddenly `fun` is much more compact.
  ret <- function(l, ...) withCallingHandlers(
    fun(l$x, ...),
    error = .wraperr(l$name)
  )
  environment(ret) <- e
  ret
}


Creating and setting environments is brittle and easy to get wrong. I
prefer to use a combination of proper lexical scoping, regular
assignments, and force() as I do below.


Is this analysis correct? Could a simple f <- force(fun) have sufficed?
Where can I read more about this type of problems?


Just force(fun), without the assignment, should be enough, or even
just fun, as in

   function(fun) { fun;  }

Using force() makes the intent clearer.
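
As a minimal sketch of that advice applied to the wrapper from the question: the definition of `.wraperr` is not shown in the thread, so the version below is a hypothetical stand-in, and `make_job` is an illustrative name only. The point is that force(fun) plus ordinary lexical scoping replaces the manual environment construction.

```r
# Hypothetical stand-in for the package's .wraperr(): builds an error
# handler that re-signals the error with the job name prepended.
.wraperr <- function(name) {
  force(name)
  function(e) stop(sprintf("%s: %s", name, conditionMessage(e)), call. = FALSE)
}

# Wrapper relying on lexical scoping; force() evaluates the promise for
# `fun` so the caller's frame is no longer referenced by it.
.wrapfun <- function(fun) {
  force(fun)
  function(l, ...) withCallingHandlers(
    fun(l$x, ...),
    error = .wraperr(l$name)
  )
}

# Illustrative caller with a large local object in its frame.
make_job <- function() {
  big <- rnorm(1e5)   # roughly 800 kB that should not travel with the job
  .wrapfun(sqrt)
}

job <- make_job()
job(list(x = 4, name = "a"))
length(serialize(job, NULL))   # expected to be far smaller than `big`
```

Note that forcing only helps when the value of `fun` does not itself capture a large environment; a closure created inside make_job() would still carry that frame with it.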

Closures or formulas capturing a large amount of data is something you
have to be careful about with serialization in general and distributed
memory computing in R in particular. There is a little on it in the
parallel vignette. I know I have talked and written about it in
various places but can't remember a specific reference right now.

I usually define a top level function to create any closures I want to
transmit and make sure they only capture what they need. A common
pattern is provided by a simple function for creating a normal
log-likelihood:

mkLL <- function(x) {
  m <- mean(x)
  s <- sd(x)
  function(y) sum(dnorm(y, m, s, log = TRUE))
}

This avoids recomputing the mean and sd on every call. It is fine for
use within a single process, and the fact that the original data is
available in the environment might even be useful for debugging:

ll <- mkLL(rnorm(10))
environment(ll)$x
##  [1] -0.09202453  0.78901912 -0.66744232  1.36061149  1.50768816
##  [6] -2.60754997  0.68727212  0.31557476  2.02027688 -1.42361769

But it does prevent the data from being garbage-collected until
the returned result is no longer reachable. A more GC-friendly, and
serialization-friendly definition is

mkLL <- function(x) {
  m <- mean(x)
  s <- sd(x)
  x <- NULL  ## not needed anymore; remove from the result's enclosing env
  function(y) sum(dnorm(y, m, s, log = TRUE))
}

ll <- mkLL(rnorm(1e7))
length(serialize(ll, NULL))
## [1] 734

If you prefer to calculate the mean and sd yourself you could use

mkLL1 <- function(m, s) function(x) sum(dnorm(x, m, s, log = TRUE))

Until the result is called for the first time the evaluation of the
arguments will be delayed, i.e. encoded in promises that record the
expression to evaluate and the environment in which to evaluate
it:

f <- function(n) {
  x <- rnorm(n)
  mkLL1(mean(x), sd(x))
}
ll <- f(1e7)
length(serialize(ll, NULL))
## [1] 80002223

Once the arguments are evaluated, the expressions are still needed for
substitute() and such, but the environment is not, so it is dropped,
and if the promise environment can no longer be reached it can be
garbage-collected. It will also no longer appear in a serialization:

ll(1)
## [1] -1.419588
length(serialize(ll, NULL))
## [1] 3537

Having a reference to a large environment is not much of an issue
within a single process, but can be in a distributed memory parallel
computing context.  To avoid this you can force evaluation of the
promises:

mkLL1 <- function(m, s) {
  force(m)
  force(s)
  function(x) sum(dnorm(x, m, s, log = TRUE))
}
ll <- f(1e7)
length(serialize(ll, NULL))
## [1] 2146
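
One way to catch this before anything is sent to a worker is to measure the serialized size up front. The helper below is a hypothetical sketch, not part of the 'parallel' package; the name `check_payload` and the default limit are assumptions chosen for illustration. It uses serialize() itself, which, as the original report shows, writes out unevaluated promises without forcing them.

```r
# Hypothetical helper: report how many bytes an object serializes to
# and warn when that exceeds a chosen limit.
check_payload <- function(obj, limit = 1e6) {
  n <- length(serialize(obj, NULL))
  if (n > limit)
    warning(sprintf("object serializes to %d bytes, above the limit of %.0f",
                    n, limit))
  invisible(n)
}

# Usage: a closure that captures nothing large passes silently.
check_payload(function(x) x + 1)
```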

The possibility of inadvertently transferring too much data is an
issue in distributed memory computing in general, so there are various
tools that help. A very sim

Re: [R] [External] Re: Building Packages. (fwd)

2024-03-21 Thread luke-tierney--- via R-help

[forgot to copy to R-help so re-sending]

-- Forwarded message --
Date: Thu, 21 Mar 2024 11:41:52 +
From: luke-tier...@uiowa.edu
To: Duncan Murdoch 
Subject: Re: [External] Re: [R] Building Packages.

At least on my installed version (which tells me it is out of date)
they appear to just be modifying the "package:utils" parent frame of
the global search path.

There seem to be a few others:

checkUtilsFun <- function(n)
  identical(get(n, "package:utils"), get(n, getNamespace("utils")))
names(which(! sapply(ls("package:utils", all.names = TRUE), checkUtilsFun)))
## [1] "bug.report"       "file.edit"        "help.request"
## [4] "history"          "install.packages" "remove.packages"
## [7] "View"


I don't know why they don't put these overrides in the tools:rstudio frame.
At least that would make them more visible.

You can fix all of these with something like

local({
  up <- match("package:utils", search())
  detach("package:utils")
  library(utils, pos = up)
})

or just install.packages with

local({
  up <- match("package:utils", search())
  unlockBinding("install.packages", pos.to.env(up))
  assign("install.packages", utils::install.packages, "package:utils")
  lockBinding("install.packages", pos.to.env(up))
})

Best,

luke

On Thu, 21 Mar 2024, Duncan Murdoch wrote:

Yes, you're right.  The version found in the search list entry for 
"package:utils" is the RStudio one; the ones found with two or three colons 
are the original.


Duncan Murdoch

On 21/03/2024 5:48 a.m., peter dalgaard wrote:
Um, what's with the triple colon? At least on my install, double seems to 
suffice:



identical(utils:::install.packages, utils::install.packages)

[1] TRUE

install.packages

function (...)
.rs.callAs(name, hook, original, ...)


-pd


On 21 Mar 2024, at 09:58 , Duncan Murdoch  wrote:

The good news for Jorgen (who may not be reading this thread any more) is 
that one can still be sure of getting the original install.packages() by 
using


utils:::install.packages( ... )

with *three* colons, to get the internal (namespace) version of the 
function.


Duncan Murdoch


On 21/03/2024 4:31 a.m., Martin Maechler wrote:

Duncan Murdoch on Wed, 20 Mar 2024 13:20:12 -0400 writes:

 > On 20/03/2024 1:07 p.m., Duncan Murdoch wrote:
 >> On 20/03/2024 12:37 p.m., Ben Bolker wrote:
 >>> Ivan, can you give more detail on this? I've heard this
 >>> issue mentioned, but when I open RStudio and run
 >>> find("install.packages") it returns
 >>> "utils::install.packages", and running dump() from
 >>> within RStudio console and from an external "R
 >>> --vanilla" gives identical results.
 >>>
 >>> I thought at one point this might only refer to the GUI
 >>> package-installation interface, but you seem to be
 >>> saying it's the install.packages() function as well.
 >>>
 >>> Running an up-to-date RStudio on Linux, FWIW -- maybe
 >>> weirdness only happens on other OSs?
 >>
 >> On MacOS, I see this:
 >>
 >> > install.packages function (...)  .rs.callAs(name, hook,
 >> original, ...)  
 >>
 >> I get the same results as you from find().  I'm not sure
 >> what RStudio is doing to give a different value for the
 >> function than what find() sees.
 > Turns out that RStudio replaces the install.packages
 > object in the utils package.
 > Duncan Murdoch
Yes, and this has been the case for several years now, and I
have mentioned this several times, too  (though some of it
possibly not in a public R-* mailing list).
And yes, that they modify the package environment
   as.environment("package:utils")
but leave the
   namespace  asNamespace("utils")
unchanged, makes it harder to see what's
going on (but also has less severe consequences: if they kept to
the otherwise universal *rule* that the namespace and package must have
the same objects, apart from those only in the namespace,
people would not even have access to R's true install.packages()
but only see the RStudio fake^Hsubstitute).
We are still not happy with their decision. Also
help(install.packages) goes to R's documentation of R's
install.packages, so there's even more misleading of useRs.
Martin



__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide 
http://www.r-project.org/posting-guide.html

and provide commented, minimal, self-contained, reproducible code.







--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa  Phone: 319-335-3386
Depart

Re: [R] [External] Re: Parser For Line Number Tracing

2025-01-19 Thread luke-tierney--- via R-help

On Sun, 19 Jan 2025, Ivo Welch wrote:


Hi Duncan — Wonderful.  Thank you.  Bug or no bug, I think it would be
a huge improvement for user-friendliness if R printed the last line by
default *every time* a script dies.  Most computer languages do so.

Should I file it as a request for improvement to the R code
development team?  Maybe R can be improved at a very low cost to the
development team and a very high benefit to newbies.


No. There are already many ways to influence the way the default error
handler prints information about errors, mostly via options(). In
particular you may want to look at entries in ?options for

show.error.locations
showErrorCalls
showWarningCalls

and adjust your options settings accordingly.
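
A minimal sketch of such settings follows. The values are only one plausible choice, and `keep.source = TRUE` is an added assumption: source references are not kept by default in non-interactive sessions, and without them there are no line numbers to report.

```r
# Save the previous values so they can be restored afterwards.
old <- options(
  keep.source = TRUE,            # assumption: needed for srcrefs under Rscript
  show.error.locations = TRUE,   # print the source location of the error
  showErrorCalls = TRUE,         # include a call summary in error messages
  showWarningCalls = TRUE        # likewise for warnings
)

# source("test.R")   # hypothetical script name
# options(old)       # restore the previous settings when done
```

See ?options for the accepted values; show.error.locations also takes "top" and "bottom".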

Best,

luke



Regards,

/ivo

On Sun, Jan 19, 2025 at 2:39 AM Duncan Murdoch  wrote:


On 2025-01-18 8:27 p.m., Ivo Welch wrote:

I am afraid my errors are worse!  (so are my postings.  I should have
given an example.)

```
x <- 1
y <- 2
nofunction("something stupid I am doing!")
z <- 4
```

and

```

source("where-is-my-water.R")

Error in nofunction("something stupid I am doing!") :
   could not find function "nofunction"
```

and no traceback is available.


Okay, I see.  In that case traceback() doesn't report the line, but it
still is known internally.  You can see it using the following function:

showKnownLocations <- function() {
   calls <- sys.calls()
   srcrefs <- sapply(calls, function(v) if (!is.null(srcref <- attr(v, "srcref"))) {
      srcfile <- attr(srcref, "srcfile")
      paste0(basename(srcfile$filename), "#", srcref[1L])
   } else ".")
   cat("Current call stack locations:\n")
   cat(srcrefs, sep = " ")
   cat("\n")
}

I haven't done much testing on this, but I think it can be called
explicitly from any location if you want to know how you got there, or
you can set it as the error handler using

   options(error = showKnownLocations)

For example, try this script:

   options(error = showKnownLocations)
   f <- function() showKnownLocations()
   x <- 1
   f()
   y <- 2
   nofunction("something stupid I am doing!")
   z <- 4

I see this output from source("test.R"):

  > source("test.R")
   Current call stack locations:
   . . . . test.R#4 test.R#2
   Error in nofunction("something stupid I am doing!") :
 could not find function "nofunction"
   Current call stack locations:
   . . . . test.R#6

The first report is from the explicit call in f() on line 2 that was
invoked on line 4, and the second report happens during error handling.

I suppose the fact that traceback() isn't showing you the line 6
location could be considered a bug.

Duncan Murdoch







--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa  Phone: 319-335-3386
Department of Statistics and  Fax:   319-335-3017
   Actuarial Science
241 Schaeffer Hall  email:   luke-tier...@uiowa.edu
Iowa City, IA 52242 WWW:  http://www.stat.uiowa.edu/