[Rd] Holding a large number of SEXPs in C++
Background:

I have an algorithm which produces a large number of small polygons (of the spatial kind) which I would like to use within R using objects from sp. I can't predict the exact number of polygons a priori, the polygons will be grouped into regions, and each region will be filled sequentially, so an appropriate C++ 'framework' (for the purpose of illustration) might be:

    typedef std::pair<double, double> Point;  // coordinate type assumed to be double
    typedef std::vector<Point> Polygon;
    typedef std::vector<Polygon> Polygons;
    typedef std::vector<Polygons> Regions;

    struct Holder {
        // not const, since it modifies regions
        void notifyNewRegion(void) {
            regions.push_back(Polygons());
        }

        template <typename Iter>
        void addSubPoly(Iter b, Iter e) {
            regions.back().push_back(Polygon(b, e));
        }

    private:
        Regions regions;
    };

where the reference type of Iter is convertible to Point. In practice I use pointers in a couple of places to avoid resizing in push_back becoming too expensive.

To construct the corresponding sp::Polygon, sp::Polygons and sp::SpatialPolygons at the end of the algorithm, I iterate over the result, turning each Polygon into a two-column matrix and calling the C functions corresponding to the 'constructors' for these objects.

This is all working fine, but I could cut my memory consumption in half if I could construct the sp::Polygon objects in addSubPoly, and the sp::Polygons objects in notifyNewRegion. My vector typedefs would then all be std::vector<SEXP>.

Question:

What I'm not sure about (and finally my question) is: I will have datasets where I have more than 10,000 SEXPs in the Polygon and Polygons objects for a single region, and possibly more than 10,000 regions, so how do I PROTECT all those SEXPs (noting that the protection stack is limited to 10,000 and bearing in mind that I don't know how many there will be before I start)?

I am also interested in this just out of general curiosity.
Thoughts:

1) I could create an environment and store the objects themselves in there while keeping pointers in the vectors, but I am not sure whether this would be efficient (guidance would be appreciated), or

2) Just keep them in R vectors and grow these myself (as push_back is doing for me in the above), but that sounds like a pain and I'm not sure whether the objects or just the pointers would be copied when I reassigned things (guidance would be appreciated again). Bear in mind that I keep pointers in the vectors, but omitted that for the sake of clarity.

Is there some other R type that would be suited to this, or a general approach?

Cheers and thanks in advance,
Simon Knapp

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] model.matrix metadata
Hi,

As far as I am aware, the model.matrix function does not return perfect metadata on what each column of the model matrix "means".

The columns are named (e.g. age:genderM), but encoding the metadata as strings can result in ambiguity. For example, the dummy variables created when the factors var0 = 0 and var = 00 both are named var00. Additionally, if a level of a factor variable contains a colon, this could be confused for an interaction.

While a human can generally work out the meaning of each column somewhat manually, I am interested in achieving this programmatically.

My solution is to edit the modelmatrix function in /src/library/stats/src/model.c to additionally return the following:

    intrcept
    factors
    contr1
    contr2
    count

With the availability of these in R it is possible to determine the precise meaning of each column without the error-prone parsing of strings. I have attached my edit: see lines 753-764.

I am seeking advice on this approach. Am I missing a simpler way of achieving this (which perhaps avoids rebuilding R)?

Since model.matrix is used in so many modeling functions this would be very helpful for the programmatic interpretation of model output. A search on the Internet suggests there are other R users who would welcome such functionality.

Many thanks in advance,

Pat O'Reilly

    /*
     *  R : A Computer Language for Statistical Data Analysis
     *  Copyright (C) 1995, 1996  Robert Gentleman and Ross Ihaka
     *  Copyright (C) 1997--2013  The R Core Team
     *
     *  This program is free software; you can redistribute it and/or modify
     *  it under the terms of the GNU General Public License as published by
     *  the Free Software Foundation; either version 2 of the License, or
     *  (at your option) any later version.
     *
     *  This program is distributed in the hope that it will be useful,
     *  but WITHOUT ANY WARRANTY; without even the implied warranty of
     *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
     *  GNU General Public License for more details.
     *
     *  You should have received a copy of the GNU General Public License
     *  along with this program; if not, a copy is available at
     *  http://www.r-project.org/Licenses/
     */

    #ifdef HAVE_CONFIG_H
    #include <config.h>
    #endif

    #include <Defn.h>

    #include "statsR.h"
    #undef _
    #ifdef ENABLE_NLS
    #include <libintl.h>
    #define _(String) dgettext ("stats", String)
    #else
    #define _(String) (String)
    #endif

    /* inline-able versions, used just once! */
    static R_INLINE Rboolean isUnordered_int(SEXP s)
    {
        return (TYPEOF(s) == INTSXP
                && inherits(s, "factor")
                && !inherits(s, "ordered"));
    }

    static R_INLINE Rboolean isOrdered_int(SEXP s)
    {
        return (TYPEOF(s) == INTSXP
                && inherits(s, "factor")
                && inherits(s, "ordered"));
    }

    /*
     *  model.frame
     *
     *  The argument "terms" contains the terms object generated from the
     *  model formula.  We first evaluate the "variables" attribute of
     *  "terms" in the "data" environment.  This gives us a list of basic
     *  variables to be in the model frame.  We do some basic sanity
     *  checks on these to ensure that resulting object make sense.
     *
     *  The argument "dots" gives additional things like "weights", "offsets"
     *  and "subset" which will also go into the model frame so that they can
     *  be treated in parallel.
     *
     *  Next we subset the data frame according to "subset" and finally apply
     *  "na.action" to get the final data frame.
     *
     *  Note that the "terms" argument is glued to the model frame as an
     *  attribute.  Code downstream appears to need this.
     *
     */

    /* model.frame(terms, rownames, variables, varnames, */
    /*             dots, dotnames, subset, na.action) */

    SEXP modelframe(SEXP call, SEXP op, SEXP args, SEXP rho)
    {
        SEXP terms, data, names, variables, varnames, dots, dotnames, na_action;
        SEXP ans, row_names, subset, tmp;
        char buf[256];
        int i, j, nr, nc;
        int nvars, ndots, nactualdots;
        const void *vmax = vmaxget();

        args = CDR(args);
        terms = CAR(args); args = CDR(args);
        row_names = CAR(args); args = CDR(args);
        variables = CAR(args); args = CDR(args);
        varnames = CAR(args); args = CDR(args);
        dots = CAR(args); args = CDR(args);
        dotnames = CAR(args); args = CDR(args);
        subset = CAR(args); args = CDR(args);
        na_action = CAR(args);

        /* Argument Sanity Checks */

        if (!isNewList(variables))
            error(_("invalid variables"));
        if (!isString(varnames))
            error(_("invalid variable names"));
        if ((nvars = length(variables)) != length(varnames))
            error(_("number of variables != number of variable names"));

        if (!isNewList(dots))
            error(_("invalid extra variables"));
        if ((ndots = length(dots)) != length(dotnames))
            error(_("number of variables != number of variable names"));
        if ( ndots && !isString(dotnames))
            error(_("invalid extra variable names"));

        /*  check for NULL extra arguments -- moved from interpreted code */

        nactualdots = 0;
        for (i = 0; i < ndots; i++)
            if (VECTOR_ELT(dots, i) != R_NilValue) nactualdots++;

        /* Assemble
Re: [Rd] Holding a large number of SEXPs in C++
On Oct 17, 2014, at 7:31 AM, Simon Knapp wrote:

> Is there some other R type that would be suited to this, or a general
> approach?

Lists in R (LISTSXP aka pairlists) are suited to appending (since that is fast and trivial) and sequential processing. The only issue is that pairlists are slow for random access. If you only want to load the polygons and finalize, then you can hold them in a pairlist and at the end copy to a generic vector (if random access is expected). DB applications typically use a hybrid approach - allocate vector blocks and keep them in pairlists - but that's probably overkill for your use (if you really cared about performance you wouldn't use sp objects for this ;))

Note that you only have to protect the top-level object, so you don't need to protect the individual elements.

Cheers,
Simon
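The block-growth idea can also be tried from the R side before committing to C code. The following is only a sketch of option 2 from the original question (growing an R list yourself), not the pairlist approach described in the reply; `new_holder`, `add` and `finalize` are hypothetical names invented for this illustration. It relies on the fact that an R list stores references to its elements, so lengthening the list copies those references and not the polygons themselves.

```r
# Hypothetical growable holder backed by an ordinary R list (illustration only).
new_holder <- function(capacity = 4L) {
  items <- vector("list", capacity)
  n <- 0L
  add <- function(obj) {
    if (n == length(items)) {
      # Double the capacity; only the element references are copied.
      length(items) <<- 2L * length(items)
    }
    n <<- n + 1L
    items[[n]] <<- obj
    invisible(n)
  }
  # Drop the unused tail and return the filled part.
  finalize <- function() items[seq_len(n)]
  list(add = add, finalize = finalize)
}

h <- new_holder()
for (i in 1:10) h$add(matrix(runif(6), ncol = 2))  # ten small "polygons"
polys <- h$finalize()
length(polys)
```

Because the list itself is the only top-level object, this is also the shape that needs just one PROTECT (or one reference from R) when written with the C API.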
Re: [Rd] model.matrix metadata
Patrick O'Reilly writes:

> While a human can generally work out the meaning of each column
> somewhat manually, I am interested in achieving this programmatically.

Why don't you just retain the terms.object? i.e.

    my.terms <- terms(my.formula, data = my.data.frame)
    my.model.matrix <- model.matrix(my.terms, data = my.data.frame)
    attributes(my.terms)

See ?terms, ?terms.object, ?model.frame (which contains a terms.object)

HTH,

Chuck
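As a rough illustration of the suggestion above, the "assign" attribute that model.matrix() attaches to its result can be combined with the "term.labels" attribute of the terms object to map each column to its term without parsing column names. The data frame `df` and its columns are made up for this sketch.

```r
# Made-up data for illustration.
df <- data.frame(
  y      = c(1.2, 2.3, 3.1, 4.8, 5.0, 6.7),
  age    = c(21, 35, 42, 28, 50, 33),
  gender = factor(c("F", "M", "F", "M", "F", "M"))
)

tt <- terms(y ~ age * gender, data = df)
mm <- model.matrix(tt, data = df)

# "assign" holds, for each column, the index of the term it came from
# (0 for the intercept); "term.labels" names those terms.
term_labels <- attr(tt, "term.labels")
col_terms <- c("(Intercept)", term_labels)[attr(mm, "assign") + 1L]
names(col_terms) <- colnames(mm)
col_terms

# Which variables take part in which term, as a 0/1/2 matrix.
attr(tt, "factors")
```

The "contrasts" attribute of the model matrix records which contrast was used for each factor, which covers part of what the proposed C-level change would expose.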
[Rd] Most efficient way to check the length of a variable mentioned in a formula.
Dear R gurus,

I need to know the length of a variable (let's call it X) that is mentioned in a formula. So obviously I look for the environment from which the formula is called, and then I have two options:

- using eval(parse(text='length(X)'), envir=environment(formula))

- using length(get('X', envir=environment(formula)))

A bit of benchmarking showed that the first option is about 20 times slower, to the extent that repeating it 10,000 times costs an extra half a second or more. So speed is not really an issue here.

Personally I'd go for option 2, as that one is easier to read and does the job nicely, but with these functions I'm always a bit afraid that I'm overlooking important details or side effects (possibly memory issues when working with larger data).

Does anybody have an idea what the dangers of these methods are, and which one is the most robust?

Thank you
Joris

--
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics

tel : +32 9 264 59 87
joris.m...@ugent.be
---
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
Re: [Rd] Most efficient way to check the length of a variable mentioned in a formula.
Joris,

For me

    length(environment(form)[["x"]])

was about twice as fast as

    length(get("x", environment(form)))

in the year-old version of R (3.0.2) that I have on the virtual machine I'm currently using. As for you, the eval method was much slower (though my factor was much larger than 20).

    > system.time({thing <- replicate(1, length(environment(form)[["x"]]))})
       user  system elapsed
      0.018   0.000   0.018
    > system.time({thing <- replicate(1, length(get("x", environment(form))))})
       user  system elapsed
      0.031   0.000   0.033
    > system.time({thing <- replicate(1, eval(parse(text = "length(x)"), envir = environment(form)))})
       user  system elapsed
      4.528   0.003   4.656

I can't say right now whether this pattern will hold in the more modern versions of R I typically use.

~G

    > sessionInfo()
    R version 3.0.2 (2013-09-25)
    Platform: x86_64-pc-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

--
Gabriel Becker
Graduate Student
Statistics Department
University of California, Davis
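One behavioural difference between these lookups is worth checking alongside the timings. The sketch below assumes a formula created inside a function (the hypothetical `make_formula`) while the variable lives one level up: `[[` on an environment looks only in that environment and returns NULL when the name is absent, whereas get() (with its default inherits = TRUE) and eval() also search the enclosing environments.

```r
x <- 1:10

# The formula's environment is the call frame of make_formula(),
# which does not itself contain x.
make_formula <- function() ~ x
form <- make_formula()
fenv <- environment(form)

len_bracket <- length(fenv[["x"]])                 # no inheritance: NULL, length 0
len_get     <- length(get("x", envir = fenv))      # searches enclosing environments
len_eval    <- length(eval(quote(x), envir = fenv))

c(bracket = len_bracket, get = len_get, eval = len_eval)
# bracket 0, get 10, eval 10
```

So the fastest variant silently reports a length of 0 whenever the variable is not stored directly in the formula's environment.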
Re: [Rd] Most efficient way to check the length of a variable mentioned in a formula.
I would use eval(), but I think that most formula-using functions do it more like the following.

    getRHSLength <-
    function (formula, data = parent.frame())
    {
        rhsExpr <- formula[[length(formula)]]
        rhsValue <- eval(rhsExpr, envir = data, enclos = environment(formula))
        length(rhsValue)
    }

* use eval() instead of get() so you will find variables that are in ancestral environments of envir (if envir is an environment), not just envir itself.
* just evaluate the stuff in the formula using the non-standard evaluation frame, and call length() in the current frame. Otherwise, if envir inherits directly from emptyenv(), the 'length' function will not be found.
* use envir=data so it looks first in the data argument for variables.
* the enclos argument is used if envir is not an environment, and is used to find variables that are not in envir.

Here are some examples:

    > X <- 1:10
    > getRHSLength(~X)
    [1] 10
    > getRHSLength(~X, data=data.frame(X=1:2))
    [1] 2
    > getRHSLength((function(){X <- 1:4; ~X})(), data=data.frame())
    [1] 4
    > getRHSLength((function(){X <- 1:4; ~X})(), data=data.frame(X=1:2))
    [1] 2
    > getRHSLength((function(){X <- 1:4; ~X})(), data=list2env(data.frame()))
    [1] 10
    > getRHSLength((function(){X <- 1:4; ~X})(), data=emptyenv())
    Error in eval(expr, envir, enclos) : object 'X' not found

I think you will see the same lookups if you try analogous things with lm().

Bill Dunlap
TIBCO Software
wdunlap tibco.com
Re: [Rd] Most efficient way to check the length of a variable mentioned in a formula.
I got the default value for getRHSLength's data argument wrong - it should be NULL, not parent.frame().

    getRHSLength <- function (formula, data = NULL)
    {
        rhsExpr <- formula[[length(formula)]]
        rhsValue <- eval(rhsExpr, envir = data, enclos = environment(formula))
        length(rhsValue)
    }

so that the function firstHalf is found in the following:

    > X <- 1:10
    > getRHSLength((function(){firstHalf<-function(x)x[seq_len(floor(length(x)/2))]; ~firstHalf(X)})())
    [1] 5

Bill Dunlap
TIBCO Software
wdunlap tibco.com
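To see why the default matters, the two variants can be compared side by side. This is a sketch with the hypothetical names `rhs_length_env` and `rhs_length_null` standing in for the two versions of getRHSLength: when envir is an environment, eval() ignores enclos, so a helper defined only in the formula's environment is not found; when envir is NULL (or a list or data frame), enclos is consulted.

```r
# Version with an environment as the default 'data' (enclos is then ignored).
rhs_length_env <- function(formula, data = parent.frame()) {
  rhs <- formula[[length(formula)]]
  length(eval(rhs, envir = data, enclos = environment(formula)))
}

# Corrected version: with data = NULL, lookups go through enclos.
rhs_length_null <- function(formula, data = NULL) {
  rhs <- formula[[length(formula)]]
  length(eval(rhs, envir = data, enclos = environment(formula)))
}

X <- 1:10
f <- (function() {
  firstHalf <- function(x) x[seq_len(floor(length(x) / 2))]
  ~ firstHalf(X)
})()

with_null <- rhs_length_null(f)   # firstHalf found via environment(f)
with_env  <- tryCatch(rhs_length_env(f),
                      error = function(e) conditionMessage(e))
with_null
with_env
```

Here `with_null` is 5, matching the example in the message, while `with_env` holds the error message reporting that firstHalf could not be found.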
Re: [Rd] Most efficient way to check the length of a variable mentioned in a formula.
Thank you both, great ideas. William, I see the point of using eval, but the problem is that I can't evaluate the formula itself yet. I need to know the length of these variables to create a function that is used in the evaluation. So if I try to evaluate the formula in some way before I have created the function, it will just return an error.

Now I use the 'variables' attribute of the formula's terms to get the variables that - after some more manipulation - eventually will be the model matrix. Something like this:

    afun <- function(formula, ...){

        varnames <- all.vars(formula)
        fenv <- environment(formula)

        txt <- paste('length(', varnames[1], ')')
        n <- eval(parse(text=txt), envir=fenv)

        fun <- function(x) x/n

        myterms <- terms(formula)
        eval(attr(myterms, 'variables'))

    }

And that should give:

    > x <- 1:10
    > y <- 10:1
    > z <- 11:20
    > afun(z ~ fun(x) + y)
    [[1]]
     [1] 11 12 13 14 15 16 17 18 19 20

    [[2]]
     [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

    [[3]]
     [1] 10  9  8  7  6  5  4  3  2  1

It might be that I'm walking to Paris via Singapore, but I couldn't find a better way to do it.

Cheers
Joris
Re: [Rd] Most efficient way to check the length of a variable mentioned in a formula.
In my example function I did not evaluate the formula either, just a part of it.

If you leave off the envir and enclos arguments to eval in your function you can get surprising (wrong) results. E.g.,

    > afun(y ~ varnames)
    [[1]]
     [1] 10  9  8  7  6  5  4  3  2  1

    [[2]]
    [1] "y"        "varnames"

If you want to use the variables in data or environment(formula) and some functions defined in your function, then you could make a child environment of environment(formula), put your locally defined functions in it, and use the child environment in the call to eval. E.g., your code would become

    afun2 <- function(formula, ...){

        varnames <- all.vars(formula)
        fenv <- environment(formula)

        n <- length(eval(as.name(varnames[1]), envir=fenv))

        childEnv <- new.env(parent=fenv)
        childEnv$fun <- function(x) x/n

        myterms <- terms(formula)
        eval(attr(myterms, 'variables'), envir=childEnv)

    }

Bill Dunlap
TIBCO Software
wdunlap tibco.com
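A quick check of the child-environment version, using the same x, y and z as in the earlier message, might look like the sketch below. The function body is copied from the message above; the tryCatch() wrapper is added here only so the second call reports its error message instead of stopping the script.

```r
afun2 <- function(formula, ...) {
  varnames <- all.vars(formula)
  fenv <- environment(formula)
  n <- length(eval(as.name(varnames[1]), envir = fenv))
  # Child of the formula's environment that holds only the helper 'fun'.
  childEnv <- new.env(parent = fenv)
  childEnv$fun <- function(x) x / n
  myterms <- terms(formula)
  eval(attr(myterms, "variables"), envir = childEnv)
}

x <- 1:10
y <- 10:1
z <- 11:20

res <- afun2(z ~ fun(x) + y)
res[[2]]   # x / 10, as in the original afun() output

# Locals of afun2 (such as 'varnames') are no longer visible to the formula.
leak <- tryCatch(afun2(y ~ varnames),
                 error = function(e) conditionMessage(e))
leak
```

With this arrangement the formula sees `fun` plus whatever is reachable from its own environment, and nothing from the body of afun2 itself.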