Re: [Rd] likely bug in 'serialize' or please explain the memory usage

Duncan Murdoch Tue, 03 Nov 2009 05:02:11 -0800

On 03/11/2009 7:29 AM, Sklyar, Oleg (London) wrote:

Duncan,


thanks for suggestions, I will try attaching a new environment.

However this still does not explain the behaviour and does not confirm
that it is correct. What puzzles me most is that if I define a function
within another function then only the function gets serialized, yet when
this is within an S4 method definition, then also the args.

Okay, I've taken a look at your code. I think what you're seeing is
lazy evaluation. S4 generics evaluate their args when they dispatch to
a method, but normal functions don't. So the increase from 106 bytes to
253 bytes when the function was nested in a regular function was to hold
the promise to evaluate x, whereas in the method, x had been evaluated
to determine that it was numeric, and your particular method should be
dispatched to.


So if in your nested case you add a line

force(x)

I think you'll see the size balloon up.
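
A minimal sketch of this effect, not taken from the thread: the names ser_size, outer_lazy and outer_forced are hypothetical. Once x has been forced, its value is stored in the frame that nestedfun closes over, so the serialized function should be at least as large as the data itself. The size in the unforced case is printed but not predicted here, since it depends on which environment the promise refers to and possibly on the R version.

```r
## Hypothetical helper (not from the thread): bytes produced by serialize()
ser_size = function(obj) length(serialize(obj, NULL))

## nested function serialized while the argument x is still an unevaluated promise
outer_lazy = function(x) {
    nestedfun = function(z) z
    ser_size(nestedfun)
}

## same construct, but x is forced before nestedfun is serialized
outer_forced = function(x) {
    force(x)
    nestedfun = function(z) z
    ser_size(nestedfun)
}

x = runif(1e5)   # 1e5 doubles, i.e. about 800,000 bytes of data
lazy_size = outer_lazy(x)
forced_size = outer_forced(x)
cat(sprintf("lazy=%d; forced=%d; data=%d\n",
            lazy_size, forced_size, ser_size(x)))
```

Comparing the two printed sizes against the size of the data shows whether the argument travelled along with the function.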

Now, it might be a problem that you're serializing a promise, because I
think you'd likely get trouble with something like this:


 outerfun2 = function(x) {
     nestedfun = function() x
     mycall(x, nestedfun)
 }

If you serialize nestedfun and it only saves the promise to evaluate x,
then unserialize it somewhere else, the promise probably won't evaluate
to what you expected. But you often get problems when you create
functions that depend on unevaluated promises, and there might be a
valid reason to want to serialize one, so I wouldn't call it a bug.


Duncan Murdoch

Both have their own environments, so I do not see why it should be
different. As an interim measure I just removed all the inline function
definitions from these 'parallel' calls defining the functions as hidden
outside of the caller, a bit ugly but works. I'd be thankful if you
could look at the examples when you get some more time.

My main problem is less in ensuring that my code works, but in ensuring
that when users use these parallel functionalities with their code, they
do not get stuck in transferring data for ages simply because with every
function one gets all the data passed.

Best,
Oleg

Dr Oleg Sklyar
Research Technologist
AHL / Man Investments Ltd
+44 (0)20 7144 3803
oskl...@maninvestments.com
-----Original Message-----
From: Duncan Murdoch [mailto:murd...@stats.uwo.ca]
Sent: 03 November 2009 11:59
To: Sklyar, Oleg (London)
Cc: r-devel@r-project.org
Subject: Re: [Rd] likely bug in 'serialize' or please explain the memory usage

I haven't had a chance to look really closely at this, but I would guess
the problem is that in R functions are "closures". The environment
attached to the function will be serialized along with it, so if you
have a big dataset in the same environment, you'll get that too.

I vaguely recall that the global environment and other system
environments are handled specially, so that's not true for functions
created at the top level, but I'd have to do some experiments to confirm.

So the solution to your problem is to pay attention to the environment
of the functions you create. If they need to refer to local variables
in the creating frame, then you'll get all of them, so be careful about
what you create there. If they don't need to refer to the local frame
you can just attach a new smaller environment after building the function.

Duncan Murdoch
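
The suggestion above can be sketched as follows. This is an illustration under stated assumptions, not code from the thread: ser_size and make_fun are hypothetical names, and the function returned by make_fun deliberately never uses the local variable big. Replacing the closure environment is only safe when the function really does not need anything from the frame that created it.

```r
## Hypothetical helper (not from the thread): bytes produced by serialize()
ser_size = function(obj) length(serialize(obj, NULL))

## the returned function is created next to a large local variable it never uses
make_fun = function() {
    big = runif(1e5)   # about 800,000 bytes, lives in the creating frame
    function(z) z
}

f = make_fun()
size_before = ser_size(f)   # environment(f) still holds `big`

## attach a new, empty environment whose parent is the global environment
environment(f) = new.env(parent = globalenv())
size_after = ser_size(f)

cat(sprintf("before=%d; after=%d\n", size_before, size_after))
```

After the environment is replaced, f still computes the same result, but `big` is no longer reachable from it and so is not written out by serialize().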

Sklyar, Oleg (London) wrote:
Hi all,

assume the following problem: a function call takes a function object
and a data variable and calls this function with this data on a remote
host. It uses serialization to pass both the function and the data via a
socket connection to a remote host. The problem is that depending on the
way we call the same construct, the function may be serialized to
include the data, which was not requested as the example below
demonstrates (runnable). This is a problem for parallel computing. The
problem described below is actually a problem for Rmpi and any other
parallel implementation we tested leading to endless executions in some
cases, where the total data passed is huge.

Assume the below 'mycall' is the function that takes data and a function
object, serializes them and calls the remote host. To make it runnable I
just print the size of the serialized objects. In a parallel apply
implementation it would serialize individual list elements and a function
and pass those over. Assuming 1 element is 1Mb and having 100 elements
and a function as simple as function(z) z we would expect to pass around
100Mb of data, 1 Mb to each individual process. However what happens is
that in some situations all 100Mb of data are passed to all the slaves
as the function is serialized to include all of the data! This always
happens when we make such a call from an S4 method when the function
is defined inline, see last example.

Can anybody explain this, and possibly suggest a solution? Well, one is
-- do not define functions to call in the same environment as the caller
:(

I do not have immediate access to the newest version of R, so would be
grateful if somebody could test it in that and let me know if the problem
is still there. The example is runnable.

Thanks,
Oleg

Dr Oleg Sklyar
Research Technologist
AHL / Man Investments Ltd
+44 (0)20 7144 3803
oskl...@maninvestments.com
----------------------------------------------------------------------

mycall = function(x, fun) {
    ## serialize the function and the data separately
    FUN = serialize(fun, NULL)
    DAT = serialize(x, NULL)
    ## report the size in bytes of each serialized object
    cat(sprintf("length FUN=%d; length DAT=%d\n", length(FUN), length(DAT)))
    invisible(NULL) ## return results of a call on a remote host with FUN and DAT
}

## the function variant I  will be passing into mycall
innerfun = function(z) z
x = runif(1e6)

## test run from the command line
mycall(x, innerfun)
# output: length FUN=106; length DAT=8000022

## test run from within a function
outerfun1 = function(x) mycall(x, innerfun)
outerfun1(x)
# output: length FUN=106; length DAT=8000022

## test run from within a function, where function is defined within
outerfun2 = function(x) {
    nestedfun = function(z) z
    mycall(x, nestedfun)
}
outerfun2(x)
# output: length FUN=253; length DAT=8000022

## define a generic
setGeneric("outerfun3", function(x) standardGeneric("outerfun3"))

## test run from within a method
setMethod("outerfun3", "numeric",
    function(x) mycall(x, innerfun))
outerfun3(x)
# output: length FUN=106; length DAT=8000022

## test run from within a method, where function is defined within
setMethod("outerfun3", "numeric",
    function(x) {
        nestedfun = function(z) z
        mycall(x, nestedfun)
    })
## THIS WILL BE WRONG!
outerfun3(x)
# output: length FUN=8001680; length DAT=8000022


--------------------------------------------------
R version 2.9.0 (2009-04-17)
x86_64-unknown-linux-gnu
locale:
C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
**********************************************************************
Please consider the environment before printing this email
or its attachments.
The contents of this email are for the named addressees
...{{dropped:19}}
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel