[Rd] we need an exists/get hybrid

2014-12-03 Thread Peter Haverty
Hi All,

I've been looking into speeding up the loading of packages that use a lot
of S4.  After profiling I noticed the "exists" function accounts for a
surprising fraction of the time.  I have some thoughts about speeding up
exists (below). More to the point of this post, Martin Mächler noted that
'exists' and 'get' are often used in conjunction.  Both functions are
different usages of the do_get C function, so it's a pity to run that twice.

"get" gives an error when a symbol is not found, so you can't just do a
'get'.  With R's C library, one might do

SEXP x = findVarInFrame3(env, symbol, TRUE);
if (x != R_UnboundValue) {
    // do stuff with x
}

It would be very convenient to have something like this at the R level. We
don't want to do any tryCatch stuff or to add args to get (That would kill
any speed advantage. The overhead for handling redundant args accounts for
30% of the time used by "exists").  Michael Lawrence and I worked out that
we need a function that returns either the desired object, or something
that represents R_UnboundValue. We also need a very cheap way to check if
something equals this new R_UnboundValue. This might look like

if (defined(x <- fetch(symbol, env))) {
  do_stuff_with_x(x)
}
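
A rough R-level sketch of the intended semantics, for illustration only. The names fetch and defined are the hypothetical ones used above, .unbound is an invented sentinel, and the lookup is delegated to get0(), which assumes R 3.2.0 or later. It shows the behaviour (a single lookup, and a NULL value still counts as defined), not a C implementation or its speed.

```r
# Hypothetical sentinel standing in for R_UnboundValue: a fresh environment
# is identical() only to itself.
.unbound <- new.env()

# Hypothetical fetch(): one lookup, returns the sentinel if nothing is bound.
fetch <- function(name, env) {
  get0(name, envir = env, inherits = FALSE, ifnotfound = .unbound)
}

# Hypothetical defined(): identity check against the sentinel.
defined <- function(x) !identical(x, .unbound)

e <- new.env()
assign("a", NULL, envir = e)   # bound, but to NULL

if (defined(x <- fetch("a", e))) {
  message("a is bound; is.null(x) is ", is.null(x))
}
defined(fetch("zzz", e))       # nothing is bound under this name
```

A sentinel compared by identity keeps the check cheap and leaves NULL free to be an ordinary value.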

A few more thoughts about "exists":

Moving the bit of R in the exists function to C saves 10% of the time.
Dropping the redundant pos and frame args entirely saves 30% of the time
used by this function. I suggest that the arguments of both get and
exists should be simplified to (x, envir, mode, inherits). The existing C code handles
numeric, character, and environment input for where. The arg frame is
rarely used (0/128 exists calls in the methods package). Users that need to
can call sys.frame themselves. get already lacks a frame argument and the
manpage for exists notes that envir is only there for backwards
compatibility. Let's deprecate the extra args in exists and get and perhaps
move the extra argument handling to C in the interim.  Similarly, the
"assign" function does nothing with the "immediate" argument.

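On the point that users who need a frame can call sys.frame themselves, a minimal sketch (the function names inner_fn and outer_fn and the variable local_value are made up for the example):

```r
inner_fn <- function() {
  # Resolve the caller's frame explicitly rather than passing a `frame` arg.
  caller <- sys.frame(-1)
  exists("local_value", envir = caller, inherits = FALSE)
}

outer_fn <- function() {
  local_value <- 42
  inner_fn()
}

outer_fn()
```

The frame is looked up once by the caller and passed as envir, so exists itself needs no frame handling.
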
I'd be interested to hear if there is any support for a "fetch"-like
function (and/or deprecating some unused arguments).

All the best,
Pete



Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Bundling system dependencies in binary packages

2014-12-03 Thread Louis Aslett
On 3 December 2014 at 07:40, Prof Brian Ripley wrote:
> On 02/12/2014 18:34, Louis Aslett wrote:
>> I've been hunting round for the accepted method of bundling system
>> dependencies into binary packages.
>>
>> For example, there are some CRAN packages (e.g. gmp, RcppArmadillo,
>> ...) which don't require the system dependencies be installed for the
>> Windows and Mac binary builds.  I understand that there are a very
>> limited number of packages for which CRAN would do this, so as a first
>> step I'm *not* asking how to get this on CRAN, but rather this
>> highlights there must be a (fairly automated/easy) mechanism to
>> achieve this.  Is it as simple as statically linking?  If so there's
>
> Well, packages using just C++ headers (RcppArmadillo is one) do not have
> libraries to link to.

Sorry, yes I discovered when I went to do a concrete example that I
was mistaken in thinking RcppArmadillo fell under the category of
packages I was thinking about.  gmp is the exemplar for bundling
dependencies that my query was driving at.

>
> But as far as possible the Windows and OS X binary packages are
> statically linked.  What is available to CRAN package builds is at
> http://www.stats.ox.ac.uk/pub/Rtools/libs.html and
> http://r.research.att.com/libs/ (and includes gmp).
>
>> surely an automated way to trigger this without having to modify
>> Makevars to produce the static linked packages?
>>
>> The Writing R Extensions manual section on binary packages doesn't
>> mention this and I've tried extensive Googling without joy.
>
> Nothing special is needed: the linkers use static linking if there is no
> dynamic library available.  So the external software is built with
> configure options --enable-static --disable-shared .  On OS X it also
> has to be built with PIC flags (not the default for static libraries).
>

Ok, this was exactly the issue ... my GMP didn't have PIC for the
static library.  As soon as I recompiled GMP using --with-pic for the
configure script I was able to temporarily rename my dynamic library
versions of GMP and FLINT and it picked up the static libraries
without throwing any errors.

Thank you very much -- I wasn't at all aware PIC was needed for static
code on OS X!  I'd assumed I was missing something much more
elaborate.

>> So in a nutshell, I'm looking to bundle a binary version of GMP
>> (https://gmplib.org) and FLINT (http://flintlib.org) into my package
>> for Windows/Mac users who can't/won't compile the libraries and which
>> I can distribute independently of CRAN, but without having to do so in
>> a manual/hacky way by tweaking Makevars each time, or modifying the
>> tgz/zip produced by R.
>
> There are a few exceptions where dynamic linking is used on Windows, and
> the configure scripts are used to install DLLs into the libs directory.
>   (RCurl is one, currently.)  The main reason for not doing so is naming
> conflicts for DLLs on Windows. For example, if you were to have a
> gmp.dll and so did package gmp, the first loaded would win, even though
> they might be different versions (and this was common for zlib1.dll).
>

Many thanks again,

Louis



Re: [Rd] we need an exists/get hybrid

2014-12-03 Thread Winston Chang
I've looked at related speed issues in the past, and have a couple
related points to add. (I've put the info below at
http://rpubs.com/wch/46428.)

There’s a significant amount of overhead just from calling the R
function get(). This is true even when you skip the pos argument and
provide envir. For example, a call to get() takes much more time than
the .Internal(get()) call that it wraps.

If you already know that the object exists in an environment, it's
faster to use e$x, and slightly faster still to use e[["x"]]:

library(microbenchmark)

e <- new.env()
e$a <- 1

# Accessing objects in environments
microbenchmark(
  get("a", e, inherits = FALSE),
  get("a", envir = e, inherits = FALSE),
  .Internal(get("a", e, "any", FALSE)),
  e$a,
  e[["a"]],
  .Primitive("[[")(e, "a"),

  unit = "us"
)
#>   median name
#> 1 1.0300 get("a", e, inherits = FALSE)
#> 2 0.9425 get("a", envir = e, inherits = FALSE)
#> 3 0.3080 .Internal(get("a", e, "any", FALSE))
#> 4 0.2305 e$a
#> 5 0.1740 e[["a"]]
#> 6 0.2905 .Primitive("[[")(e, "a")


A similar thing happens with exists(): the R function wrapper adds
significant overhead on top of .Internal(exists()). It’s also faster
to use $ and [[, then test for NULL, but of course this won’t
distinguish between objects that don’t exist, and those that do exist
but have a NULL value:

# Test for existence of `a` (which exists), and `c` (which doesn't)
microbenchmark(
  exists('a', e, inherits = FALSE),
  exists('a', envir = e, inherits = FALSE),
  .Internal(exists('a', e, 'any', FALSE)),
  'a' %in% ls(e, all.names = TRUE),
  is.null(e[['a']]),
  is.null(e$a),

  exists('c', e, inherits = FALSE),
  exists('c', envir = e, inherits = FALSE),
  .Internal(exists('c', e, 'any', FALSE)),
  'c' %in% ls(e, all.names = TRUE),
  is.null(e[['c']]),
  is.null(e$c),

  unit = "us"
)
#>    median name
#> 1  1.2015 exists("a", e, inherits = FALSE)
#> 2  1.0545 exists("a", envir = e, inherits = FALSE)
#> 3  0.3615 .Internal(exists("a", e, "any", FALSE))
#> 4  7.6345 "a" %in% ls(e, all.names = TRUE)
#> 5  0.3055 is.null(e[["a"]])
#> 6  0.3270 is.null(e$a)
#> 7  1.1890 exists("c", e, inherits = FALSE)
#> 8  1.0370 exists("c", envir = e, inherits = FALSE)
#> 9  0.3465 .Internal(exists("c", e, "any", FALSE))
#> 10 7.5475 "c" %in% ls(e, all.names = TRUE)
#> 11 0.2675 is.null(e[["c"]])
#> 12 0.3010 is.null(e$c)
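
To make the NULL caveat concrete, here is a small sketch (the object `b` is added just for this example) in which the NULL test cannot tell a NULL-valued object from a missing one, while exists() can:

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", NULL, envir = e)   # exists, but its value is NULL

# The NULL test answers the same for a NULL-valued and a missing object ...
is.null(e[["b"]])
is.null(e[["c"]])

# ... while exists() tells them apart.
exists("b", envir = e, inherits = FALSE)
exists("c", envir = e, inherits = FALSE)
```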


-Winston

On Tue, Dec 2, 2014 at 8:46 PM, Peter Haverty wrote:
> Hi All,
>
> I've been looking into speeding up the loading of packages that use a lot
> of S4.  After profiling I noticed the "exists" function accounts for a
> surprising fraction of the time.  I have some thoughts about speeding up
> exists (below). More to the point of this post, Martin Mächler noted that
> 'exists' and 'get' are often used in conjunction.  Both functions are
> different usages of the do_get C function, so it's a pity to run that twice.
>
> "get" gives an error when a symbol is not found, so you can't just do a
> 'get'.  With R's C library, one might do
>
> SEXP x = findVarInFrame3(symbol,env);
> if (x != R_UnboundValue) {
> // do stuff with x
> }
>
> It would be very convenient to have something like this at the R level. We
> don't want to do any tryCatch stuff or to add args to get (That would kill
> any speed advantage. The overhead for handling redundant args accounts for
> 30% of the time used by "exists").  Michael Lawrence and I worked out that
> we need a function that returns either the desired object, or something
> that represents R_UnboundValue. We also need a very cheap way to check if
> something equals this new R_UnboundValue. This might look like
>
> if (defined(x <- fetch(symbol, env))) {
>   do_stuff_with_x(x)
> }
>
> A few more thoughts about "exists":
>
> Moving the bit of R in the exists function to C saves 10% of the time.
> Dropping the redundant pos and frame args entirely saves 30% of the time
> used by this function. I suggest that the arguments of both get and
> exists should
> be simplified to (x, envir, mode, inherits). The existing C code handles
> numeric, character, and environment input for where. The arg frame is
> rarely used (0/128 exists calls in the methods package). Users that need to
> can call sys.frame themselves. get already lacks a frame argument and the
> manpage for exists notes that envir is only there for backwards
> compatibility. Let's deprecate the extra args in exists and get and perhaps
> move the extra argument handling to C in the interim.  Similarly, the
> "assign" function does nothing with the "immediate" argument.
>
> I'd be interested to hear if there is any support for a "fetch"-like
> function (and/or deprecating some unused arguments).
>
> All the best,
> Pete
>
>
>
> Pete
>
> 

Re: [Rd] we need an exists/get hybrid

2014-12-03 Thread Peter Haverty
Thanks Winston!  I'm amazed that "[[" beats calling the .Internal
directly.  I guess the difference between .Primitive and .Internal is
pretty significant for things on this time scale.

NULL meaning NULL and NULL meaning undefined would lead to the same path
for much of my code.  I'll be swapping out many exists and get calls later
today.  Thanks!
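
A sketch of the kind of swap I have in mind, assuming NULL and "not there" really can share a path. The helper name lookup_or and the cache environment are made up for the example:

```r
# Hypothetical helper: one lookup, with NULL treated the same as "not there".
lookup_or <- function(env, name, default = NULL) {
  value <- env[[name]]
  if (is.null(value)) default else value
}

cache <- new.env()
cache$answer <- 42

lookup_or(cache, "answer", default = 0)
lookup_or(cache, "missing", default = 0)
```

This replaces an exists() call followed by a get() call with a single "[[" lookup, at the cost of not distinguishing the two NULL cases.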

I do still think it would be very useful to have some way to discriminate
the two NULL cases.  I'm reminded of how Perl does the same thing.  It's
been a while, but it was something like

# This is still two lookups, but it has the "defined" concept.
if (defined($x{'c'})) { print $x{'c'}; }

or maybe even

if (defined(my $foo = $x{'c'})) { print $foo; }


Thanks again for the timings!


Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Wed, Dec 3, 2014 at 12:48 PM, Winston Chang wrote:

> I've looked at related speed issues in the past, and have a couple
> related points to add. (I've put the info below at
> http://rpubs.com/wch/46428.)
>
> There's a significant amount of overhead just from calling the R
> function get(). This is true even when you skip the pos argument and
> provide envir. For example, if you call get(), it takes much more time
> than .Internal(get()), which is what get() does.
>
> If you already know that the object exists in an environment, it's
> faster to use e$x, and slightly faster still to use e[["x"]]:
>
> e <- new.env()
> e$a <- 1
>
> # Accessing objects in environments
> microbenchmark(
>   get("a", e, inherits = FALSE),
>   get("a", envir = e, inherits = FALSE),
>   .Internal(get("a", e, "any", FALSE)),
>   e$a,
>   e[["a"]],
>   .Primitive("[[")(e, "a"),
>
>   unit = "us"
> )
> #>   median name
> #> 1 1.0300 get("a", e, inherits = FALSE)
> #> 2 0.9425 get("a", envir = e, inherits = FALSE)
> #> 3 0.3080 .Internal(get("a", e, "any", FALSE))
> #> 4 0.2305 e$a
> #> 5 0.1740 e[["a"]]
> #> 6 0.2905 .Primitive("[[")(e, "a")
>
>
> A similar thing happens with exists(): the R function wrapper adds
> significant overhead on top of .Internal(exists()). It's also faster
> to use $ and [[, then test for NULL, but of course this won't
> distinguish between objects that don't exist, and those that do exist
> but have a NULL value:
>
> # Test for existence of `a` (which exists), and `c` (which doesn't)
> microbenchmark(
>   exists('a', e, inherits = FALSE),
>   exists('a', envir = e, inherits = FALSE),
>   .Internal(exists('a', e, 'any', FALSE)),
>   'a' %in% ls(e, all.names = TRUE),
>   is.null(e[['a']]),
>   is.null(e$a),
>
>   exists('c', e, inherits = FALSE),
>   exists('c', envir = e, inherits = FALSE),
>   .Internal(exists('c', e, 'any', FALSE)),
>   'c' %in% ls(e, all.names = TRUE),
>   is.null(e[['c']]),
>   is.null(e$c),
>
>   unit = "us"
> )
> #>    median name
> #> 1  1.2015 exists("a", e, inherits = FALSE)
> #> 2  1.0545 exists("a", envir = e, inherits = FALSE)
> #> 3  0.3615 .Internal(exists("a", e, "any", FALSE))
> #> 4  7.6345 "a" %in% ls(e, all.names = TRUE)
> #> 5  0.3055 is.null(e[["a"]])
> #> 6  0.3270 is.null(e$a)
> #> 7  1.1890 exists("c", e, inherits = FALSE)
> #> 8  1.0370 exists("c", envir = e, inherits = FALSE)
> #> 9  0.3465 .Internal(exists("c", e, "any", FALSE))
> #> 10 7.5475 "c" %in% ls(e, all.names = TRUE)
> #> 11 0.2675 is.null(e[["c"]])
> #> 12 0.3010 is.null(e$c)
>
>
> -Winston
>
> On Tue, Dec 2, 2014 at 8:46 PM, Peter Haverty wrote:
> > Hi All,
> >
> > I've been looking into speeding up the loading of packages that use a lot
> > of S4.  After profiling I noticed the "exists" function accounts for a
> > surprising fraction of the time.  I have some thoughts about speeding up
> > exists (below). More to the point of this post, Martin Mächler noted that
> > 'exists' and 'get' are often used in conjunction.  Both functions are
> > different usages of the do_get C function, so it's a pity to run that twice.
> >
> > "get" gives an error when a symbol is not found, so you can't just do a
> > 'get'.  With R's C library, one might do
> >
> > SEXP x = findVarInFrame3(symbol,env);
> > if (x != R_UnboundValue) {
> > // do stuff with x
> > }
> >
> > It would be very convenient to have something like this at the R level. We
> > don't want to do any tryCatch stuff or to add args to get (That would kill
> > any speed advantage. The overhead for handling redundant args accounts for
> > 30% of the time used by "exists").  Michael Lawrence and I worked out that
> > we need a function that returns either the desired object, or something
> > that represents R_UnboundValue. We also need a very cheap way to check if
> > something equals this new R_UnboundValue. This might look like
> >
> > if (defin