[Rd] Holding a large number of SEXPs in C++

2014-10-17 Thread Simon Knapp
Background:
I have an algorithm which produces a large number of small polygons (of the
spatial kind), which I would like to use within R as objects from sp. I
can't predict the exact number of polygons a priori. The polygons will be
grouped into regions, and each region will be filled sequentially, so an
appropriate C++ 'framework' (for the purpose of illustration) might be:

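#include <utility>
#include <vector>

// Template arguments were lost in the archive; double coordinates are assumed.
typedef std::pair<double, double> Point;
typedef std::vector<Point> Polygon;
typedef std::vector<Polygon> Polygons;
typedef std::vector<Polygons> Regions;

struct Holder {
    // Start a new, empty region (non-const: it modifies regions).
    void notifyNewRegion(void) {
        regions.push_back(Polygons());
    }

    // Append the polygon whose points are in [b, e) to the current region.
    template <typename Iter>
    void addSubPoly(Iter b, Iter e) {
        regions.back().push_back(Polygon(b, e));
    }

private:
    Regions regions;
};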

where the reference_type of Iter is convertible to Point. In practice I use
pointers in a couple of places to avoid resizing in push_back becoming too
expensive.

To construct the corresponding sp::Polygon, sp::Polygons and
sp::SpatialPolygons at the end of the algorithm, I iterate over the result
turning each Polygon into a two column matrix and calling the C functions
corresponding to the 'constructors' for these objects.

This is all working fine, but I could cut my memory consumption in half if
I could construct the sp::Polygon objects in addSubPoly, and the
sp::Polygons objects in notifyNewRegion. My vector typedefs would then all
be:

typedef std::vector<SEXP>

Question:
What I'm not sure about (and finally my question) is: I will have datasets
where I have more than 10,000 SEXPs in the Polygon and Polygons objects for
a single region, and possibly more than 10,000 regions, so how do I PROTECT
all those SEXPs (noting that the protection stack is limited to 10,000 and
bearing in mind that I don't know how many there will be before I start)?

I am also interested in this just out of general curiosity.




Thoughts:

1) I could create an environment and store the objects themselves in there
while keeping pointers in the vectors, but am not sure if this would be
that efficient (guidance would be appreciated), or

2) Just keep them in R vectors and grow these myself (as push_back is doing
for me in the above), but that sounds like a pain and I'm not sure if the
objects or just the pointers would be copied when I reassigned things
(guidance would be appreciated again). Bear in mind that I keep pointers in
the vectors, but omitted that for the sake of clarity.




Is there some other R type that would be suited to this, or a general
approach?

Cheers and thanks in advance,
Simon Knapp

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] model.matrix metadata

2014-10-17 Thread Patrick O'Reilly
Hi,

As far as I am aware, the model.matrix function does not return
perfect metadata on what each column of the model matrix "means".

The columns are named (e.g. age:genderM), but encoding the metadata as
strings can result in ambiguity. For example, the dummy variable for
level 0 of a factor var0 and the one for level 00 of a factor var are both
named var00.
Additionally, if a level of a factor variable contains a colon, this
could be confused for an interaction.

While a human can generally work out the meaning of each column
somewhat manually, I am interested in achieving this programmatically.

My solution is to edit the modelmatrix function in
/src/library/stats/src/model.c to additionally return the following:

intrcept
factors
contr1
contr2
count

With the availability of these in R it is possible to determine the
precise meaning of each column without the error-prone parsing of
strings. I have attached my edit: see lines 753-764.

I am seeking advice on this approach. Am I missing a simpler way of
achieving this (which perhaps avoids rebuilding R)?

Since model.matrix is used in so many modeling functions this would be
very helpful for the programmatic interpretation of model output. A
search on the Internet suggests there are other R users who would
welcome such functionality.

Many thanks in advance,

Pat O'Reilly
/*
 *  R : A Computer Language for Statistical Data Analysis
 *  Copyright (C) 1995, 1996  Robert Gentleman and Ross Ihaka
 *  Copyright (C) 1997--2013  The R Core Team
 *
 *  This program is free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 2 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  You should have received a copy of the GNU General Public License
 *  along with this program; if not, a copy is available at
 *  http://www.r-project.org/Licenses/
 */

#ifdef HAVE_CONFIG_H
#include <config.h>
#endif

#include 

#include "statsR.h"
#undef _
#ifdef ENABLE_NLS
#include <libintl.h>
#define _(String) dgettext ("stats", String)
#else
#define _(String) (String)
#endif

/* inline-able versions, used just once! */
static R_INLINE Rboolean isUnordered_int(SEXP s)
{
return (TYPEOF(s) == INTSXP
	&& inherits(s, "factor")
	&& !inherits(s, "ordered"));
}

static R_INLINE Rboolean isOrdered_int(SEXP s)
{
return (TYPEOF(s) == INTSXP
	&& inherits(s, "factor")
	&& inherits(s, "ordered"));
}

/*
 *  model.frame
 *
 *  The argument "terms" contains the terms object generated from the
 *  model formula.  We first evaluate the "variables" attribute of
 *  "terms" in the "data" environment.  This gives us a list of basic
 *  variables to be in the model frame.  We do some basic sanity
 *  checks on these to ensure that resulting object make sense.
 *
 *  The argument "dots" gives additional things like "weights", "offsets"
 *  and "subset" which will also go into the model frame so that they can
 *  be treated in parallel.
 *
 *  Next we subset the data frame according to "subset" and finally apply
 *  "na.action" to get the final data frame.
 *
 *  Note that the "terms" argument is glued to the model frame as an
 *  attribute.  Code downstream appears to need this.
 *
 */

/* model.frame(terms, rownames, variables, varnames, */
/* dots, dotnames, subset, na.action) */

SEXP modelframe(SEXP call, SEXP op, SEXP args, SEXP rho)
{
SEXP terms, data, names, variables, varnames, dots, dotnames, na_action;
SEXP ans, row_names, subset, tmp;
char buf[256];
int i, j, nr, nc;
int nvars, ndots, nactualdots;
const void *vmax = vmaxget();

args = CDR(args);
terms = CAR(args); args = CDR(args);
row_names = CAR(args); args = CDR(args);
variables = CAR(args); args = CDR(args);
varnames = CAR(args); args = CDR(args);
dots = CAR(args); args = CDR(args);
dotnames = CAR(args); args = CDR(args);
subset = CAR(args); args = CDR(args);
na_action = CAR(args);

/* Argument Sanity Checks */

if (!isNewList(variables))
	error(_("invalid variables"));
if (!isString(varnames))
	error(_("invalid variable names"));
if ((nvars = length(variables)) != length(varnames))
	error(_("number of variables != number of variable names"));

if (!isNewList(dots))
	error(_("invalid extra variables"));
if ((ndots = length(dots)) != length(dotnames))
	error(_("number of variables != number of variable names"));
if ( ndots && !isString(dotnames))
	error(_("invalid extra variable names"));

/*  check for NULL extra arguments -- moved from interpreted code */

nactualdots = 0;
for (i = 0; i < ndots; i++)
	if (VECTOR_ELT(dots, i) != R_NilValue) nactualdots++;

/* Assemble

Re: [Rd] Holding a large number of SEXPs in C++

2014-10-17 Thread Simon Urbanek

On Oct 17, 2014, at 7:31 AM, Simon Knapp  wrote:

> Question:
> What I'm not sure about (and finally my question) is: I will have datasets
> where I have more than 10,000 SEXPs in the Polygon and Polygons objects for
> a single region, and possibly more than 10,000 regions, so how do I PROTECT
> all those SEXPs (noting that the protection stack is limited to 10,000 and
> bearing in mind that I don't know how many there will be before I start)?

Lists in R (LISTSXP, aka pairlists) are suited to appending (since that is
fast and trivial) and to sequential processing. The only issue is that
pairlists are slow for random access. If you only want to load the polygons
and finalize, then you can hold them in a pairlist and at the end copy them
to a generic vector (if random access is expected). DB applications typically
use a hybrid approach: allocate vector blocks and keep them in pairlists.
That is probably overkill for your use (if you really cared about performance
you wouldn't use sp objects for this ;))

Note that you only have to protect the top-level object, so you don't need to 
protect the individual elements.
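
To make the hybrid idea concrete without leaving R, here is a minimal R-level sketch of accumulating an unknown number of objects in fixed-size list blocks and flattening them once at the end. It is only an analogy for what would be done in C (where a pairlist can be grown with CONS/SETCDR and only its head needs protecting); make_block_store, add and finalize are hypothetical names, not part of sp or of R.

```r
# Hypothetical helper: collect items in fixed-size blocks, flatten once at the end.
make_block_store <- function(block_size = 1024L) {
  blocks  <- list()                       # chain of completely filled blocks
  current <- vector("list", block_size)   # block currently being filled
  used    <- 0L                           # number of slots used in 'current'

  add <- function(item) {
    if (used == block_size) {             # current block is full: park it
      blocks[[length(blocks) + 1L]] <<- current
      current <<- vector("list", block_size)
      used <<- 0L
    }
    used <<- used + 1L
    current[[used]] <<- item
    invisible(NULL)
  }

  finalize <- function() {
    # One level of flattening turns the list of blocks into a list of items.
    c(unlist(blocks, recursive = FALSE), current[seq_len(used)])
  }

  list(add = add, finalize = finalize)
}

# Usage: ten small two-column coordinate matrices, blocks of three.
store <- make_block_store(block_size = 3L)
for (i in 1:10) store$add(matrix(i, nrow = 1, ncol = 2))
polys <- store$finalize()
length(polys)
```

Only the block chain is ever copied when it grows, never the stored objects themselves, which is the property the block-plus-chain layout is meant to give.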

Cheers,
Simon




Re: [Rd] model.matrix metadata

2014-10-17 Thread Charles Berry
Patrick O'Reilly writes:

> While a human can generally work out the meaning of each column
> somewhat manually, I am interested in achieving this programmatically.
> 

Why don't you just retain the terms.object?

i.e.

my.terms <- terms( my.formula, data=my.data.frame )
my.model.matrix <- model.matrix( my.terms, data= my.data.frame )

attributes(my.terms)


See ?terms, ?terms.object, ?model.frame (which contains a terms.object)
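
As a small sketch of how the terms object can be combined with the model matrix, the following maps each column back to its term through the "assign" attribute that model.matrix() attaches, instead of parsing column names. The data frame d and the names term_labels, assign_idx and col_term are made up for illustration; the factor levels are chosen so that the column names should collide in the way described in the question.

```r
# Two factors whose level names make the dummy-column names ambiguous.
d <- data.frame(
  var0 = factor(c("a", "0", "a", "0"),   levels = c("a", "0")),
  var  = factor(c("a", "a", "00", "00"), levels = c("a", "00"))
)

tt <- terms(~ var0 + var, data = d)
mm <- model.matrix(tt, data = d)

colnames(mm)                               # names alone do not identify the term

term_labels <- attr(tt, "term.labels")     # "var0" "var"
assign_idx  <- attr(mm, "assign")          # 0 = intercept, k = k-th term
col_term    <- c("(Intercept)", term_labels)[assign_idx + 1L]
col_term                                   # term behind each column, in order

attr(tt, "factors")                        # which variables enter which term
attr(mm, "contrasts")                      # contrasts used for each factor
```

This recovers the term for every column, though not which level or contrast column within a term a given column corresponds to; that still has to be derived from the factor levels and the contrasts attribute.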


HTH,

Chuck



[Rd] Most efficient way to check the length of a variable mentioned in a formula.

2014-10-17 Thread Joris Meys
Dear R gurus,

I need to know the length of a variable (let's call that X) that is
mentioned in a formula. So obviously I look for the environment from which
the formula is called and then I have two options:

- using eval(parse(text='length(X)'),
envir=environment(formula) )

- using length(get('X',
envir=environment(formula) ))

A bit of benchmarking showed that the first option is about 20 times
slower, to the extent that over 10,000 repetitions the second option saves
a little more than half a second. So speed is not really an issue here.

Personally I'd go for option 2, as it is easier to read and does the
job nicely, but with these functions I'm always a bit afraid that I'm
overlooking important details or side effects (possibly memory issues
when working with larger data).

Does anybody have an idea what the dangers of these methods are, and which
one is the most robust?

Thank you
Joris

-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics

tel : +32 9 264 59 87
joris.m...@ugent.be


Re: [Rd] Most efficient way to check the length of a variable mentioned in a formula.

2014-10-17 Thread Gabriel Becker
Joris,

For me

length(environment(form)[["x"]])

was about twice as fast as

length(get("x",environment(form)))

in the year-old version of R (3.0.2) that I have on the virtual machine I'm
currently using.

As for you, the eval method was much slower (though my factor was much
larger than 20)

> system.time({thing <- replicate(1,length(environment(form)[["x"]]))})
   user  system elapsed
  0.018   0.000   0.018
> system.time({thing <- replicate(1,length(get("x",environment(form))))})
   user  system elapsed
  0.031   0.000   0.033
> system.time({thing <- replicate(1,eval(parse(text = "length(x)"),
+ envir=environment(form)))})
   user  system elapsed
  4.528   0.003   4.656

I can't speak this second to whether this pattern will hold in the more
modern versions of R I typically use.
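
One caveat when swapping get() for `[[`, sketched below under the assumption that the variable lives in an enclosing environment rather than in the formula's own environment: `[[` on an environment looks only in that environment, while get() by default also searches its enclosures. The names make_form, form and env are illustrative only.

```r
x <- 1:10

# The formula's environment is the function's call frame, which holds no 'x'.
make_form <- function() ~ x
form <- make_form()
env  <- environment(form)

exists("x", envir = env, inherits = FALSE)  # is x stored in env itself?
length(env[["x"]])                          # `[[` does not search enclosures
length(get("x", envir = env))               # get() inherits by default
```

So the faster form silently returns length 0 in this situation, which may matter more than the timing difference.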

~G

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base






-- 
Gabriel Becker
Graduate Student
Statistics Department
University of California, Davis



Re: [Rd] Most efficient way to check the length of a variable mentioned in a formula.

2014-10-17 Thread William Dunlap
I would use eval(), but I think that most formula-using functions do
it more like the following.

getRHSLength <-
function (formula, data = parent.frame())
{
    rhsExpr <- formula[[length(formula)]]
    rhsValue <- eval(rhsExpr, envir = data, enclos = environment(formula))
    length(rhsValue)
}

* use eval() instead of get() so you will find variables that are in
  ancestral environments of envir (if envir is an environment), not just
  in envir itself.
* just evaluate the stuff in the formula using the non-standard
  evaluation frame, and call length() in the current frame.  Otherwise, if
  envir inherits directly from emptyenv() the 'length' function will not
  be found.
* use envir=data so it looks first in the data argument for variables.
* the enclos argument is used if envir is not an environment; it is used to
  find variables that are not in envir.

Here are some examples:
  > X <- 1:10
  > getRHSLength(~X)
  [1] 10
  > getRHSLength(~X, data=data.frame(X=1:2))
  [1] 2
  > getRHSLength((function(){X <- 1:4; ~X})(), data=data.frame())
  [1] 4
  > getRHSLength((function(){X <- 1:4; ~X})(), data=data.frame(X=1:2))
  [1] 2
  > getRHSLength((function(){X <- 1:4; ~X})(), data=list2env(data.frame()))
  [1] 10
  > getRHSLength((function(){X <- 1:4; ~X})(), data=emptyenv())
  Error in eval(expr, envir, enclos) : object 'X' not found

I think you will see the same lookups if you try analogous things with lm().
Bill Dunlap
TIBCO Software
wdunlap tibco.com




Re: [Rd] Most efficient way to check the length of a variable mentioned in a formula.

2014-10-17 Thread William Dunlap
I got the default value for getRHSLength's data argument wrong - it
should be NULL, not parent.frame().
   getRHSLength <- function (formula, data = NULL)
   {
       rhsExpr <- formula[[length(formula)]]
       rhsValue <- eval(rhsExpr, envir = data, enclos = environment(formula))
       length(rhsValue)
   }
so that the function firstHalf is found in the following
   > X <- 1:10
   > getRHSLength((function(){firstHalf<-function(x)x[seq_len(floor(length(x)/2))];
   + ~firstHalf(X)})())
   [1] 5
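
The difference between the two defaults can be seen side by side in the sketch below. It relies on documented eval() behaviour: enclos is only consulted when envir is a list, data frame, pairlist or NULL, so an environment passed as envir bypasses environment(formula) entirely. The two function names are hypothetical variants of getRHSLength written out so the block runs on its own.

```r
# Variant with the corrected default: NULL lets 'enclos' take effect.
get_rhs_length_null <- function(formula, data = NULL) {
  rhs <- formula[[length(formula)]]
  length(eval(rhs, envir = data, enclos = environment(formula)))
}

# Variant with the original default: an environment, so 'enclos' is ignored.
get_rhs_length_frame <- function(formula, data = parent.frame()) {
  rhs <- formula[[length(formula)]]
  length(eval(rhs, envir = data, enclos = environment(formula)))
}

X <- 1:10
form <- (function() {
  firstHalf <- function(x) x[seq_len(floor(length(x) / 2))]
  ~ firstHalf(X)
})()

ok  <- get_rhs_length_null(form)           # firstHalf found via environment(form)
msg <- tryCatch(get_rhs_length_frame(form),
                error = function(e) conditionMessage(e))
ok
msg                                        # firstHalf is not visible from the caller
```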


Bill Dunlap
TIBCO Software
wdunlap tibco.com




Re: [Rd] Most efficient way to check the length of a variable mentioned in a formula.

2014-10-17 Thread Joris Meys
Thank you both, great ideas.  William, I see the point of using eval, but
the problem is that I can't evaluate the formula itself yet. I need to know
the length of these variables to create a function that is used in the
evaluation. So if I try to evaluate the formula in some way before I have
created the function, it will just return an error.

Now I use the 'variables' attribute of the formula terms to get the variables
that, after some more manipulation, will eventually become the model matrix.
Something like this:

afun <- function(formula, ...){

    varnames <- all.vars(formula)
    fenv <- environment(formula)

    txt <- paste('length(',varnames[1],')')
    n <- eval(parse(text=txt), envir=fenv)

    fun <- function(x) x/n

    myterms <- terms(formula)
    eval(attr(myterms, 'variables'))

}

And that should give:

> x <- 1:10
> y <- 10:1
> z <- 11:20
> afun(z ~ fun(x) + y)
[[1]]
 [1] 11 12 13 14 15 16 17 18 19 20

[[2]]
 [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

[[3]]
 [1] 10  9  8  7  6  5  4  3  2  1

It might be I'm walking to Paris over Singapore, but I couldn't find a
better way to do it.

Cheers
Joris


Re: [Rd] Most efficient way to check the length of a variable mentioned in a formula.

2014-10-17 Thread William Dunlap
In my example function I did not evaluate the formula either, just a part of it.

If you leave off the envir and enclos arguments to eval in your
function you can get surprising (wrong) results.  E.g.,
  > afun(y ~ varnames)
  [[1]]
   [1] 10  9  8  7  6  5  4  3  2  1

  [[2]]
  [1] "y"        "varnames"

If you want to use the variables in data or environment(formula) and
some functions defined in your function, then you could make a child
environment of environment(formula), put your locally defined
functions in it, and use the child environment in the call to eval.
E.g., your code would become
afun2 <- function(formula, ...){

    varnames <- all.vars(formula)
    fenv <- environment(formula)

    n <- length(eval(as.name(varnames[1]), envir=fenv))
    childEnv <- new.env(parent=fenv)
    childEnv$fun <- function(x) x/n

    myterms <- terms(formula)
    eval(attr(myterms, 'variables'), envir=childEnv)
}
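
As a quick check of the child-environment approach, the sketch below repeats afun2 so that it runs on its own and then calls it on the example from earlier in the thread. It also tries the formula y ~ varnames, assuming no object called varnames exists in the calling environment; the variables res and bad are illustrative names.

```r
afun2 <- function(formula, ...) {
  varnames <- all.vars(formula)
  fenv <- environment(formula)

  n <- length(eval(as.name(varnames[1]), envir = fenv))
  childEnv <- new.env(parent = fenv)     # sees the caller's data ...
  childEnv$fun <- function(x) x / n      # ... plus the locally defined helper

  myterms <- terms(formula)
  eval(attr(myterms, "variables"), envir = childEnv)
}

x <- 1:10
y <- 10:1
z <- 11:20

res <- afun2(z ~ fun(x) + y)
res[[2]]                                 # fun(x), i.e. x divided by length(z)

# afun2's own local 'varnames' is no longer visible to the formula,
# so this fails instead of silently returning the local vector.
bad <- tryCatch(afun2(y ~ varnames),
                error = function(e) conditionMessage(e))
bad
```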

Bill Dunlap
TIBCO Software
wdunlap tibco.com

