Re: [R-pkg-devel] Absent variables and tibble

Duncan Murdoch Tue, 28 Jun 2016 07:58:49 -0700

On 28/06/2016 10:03 AM, William Dunlap wrote:

Currently exists("someName", where=someDataFrame) reports if "someName" is a column of the data.frame 'someDataFrame' and the 'where=' may be omitted. If we have an environment we use exists("someName", envir=someEnvironment). It might be nice to continue using exists() instead of introducing a new function has(), although, since we want the same syntax to work for environments, data.frames, tbl_dfs, data.tables, etc., we may need the new function.
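
A minimal sketch of the two call forms described above. The objects and their contents are made up for illustration. Since exists() searches enclosing environments by default (inherits = TRUE), the sketch passes inherits = FALSE so that only the object itself is consulted.

```r
# Illustrative objects; names and contents are assumptions for this sketch.
someDataFrame <- data.frame(someName = 1:3, other = letters[1:3])
someEnvironment <- new.env()
assign("someName", 42, envir = someEnvironment)

# Data frame: 'where' may be given by name or by position.
in_df_named      <- exists("someName", where = someDataFrame, inherits = FALSE)
in_df_positional <- exists("someName", someDataFrame, inherits = FALSE)
absent_in_df     <- exists("noSuchName", where = someDataFrame, inherits = FALSE)

# Environment: use 'envir='.
in_env     <- exists("someName", envir = someEnvironment, inherits = FALSE)
absent_env <- exists("noSuchName", envir = someEnvironment, inherits = FALSE)

print(c(in_df_named, in_df_positional, absent_in_df, in_env, absent_env))
```

The two spellings for the data frame case are interchangeable here; the environment case needs the differently named argument, which is the inconsistency a single has()-style function would hide.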

One issue with exists("someName", someDataFrame) is that it's quite a bit slower. (I think it converts the dataframe to an environment.) On the other hand, getting the names from an environment requires more work than checking for one, so exists("someName", someEnvironment) is faster than checking for the name in the obvious way. The slow operations could be sped up, but is that worth the effort?

The other issue with exists() is that it has a complicated definition and hard to follow argument list (with args "where", "envir", "frame" that all do related things); the thing I like about hasName() is that it is very clear what it does. A criticism of it is that it is hardly any shorter than just doing


  name %in% names(x)

so is there really any point in making a function for this?
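
For illustration, here is that one-liner applied to a list, a data frame and an environment. The objects are made up for this sketch, and the environment case assumes a version of R in which names() accepts environments (R 3.2.0 or later, as far as I know).

```r
# Made-up objects for the sketch.
x_list <- list(alpha = 1, beta = "b")
x_df   <- data.frame(alpha = 1:2, beta = c("u", "v"))
x_env  <- new.env()
assign("alpha", 1, envir = x_env)

name <- "alpha"
in_list <- name %in% names(x_list)
in_df   <- name %in% names(x_df)
in_env  <- name %in% names(x_env)  # relies on names() accepting environments

# %in% is vectorised over its left-hand side, so several names can be
# checked at once; the result has one element per requested name.
several <- c("alpha", "gamma") %in% names(x_df)

# An object with no names gives FALSE rather than an error,
# because names() returns NULL for it.
unnamed <- "alpha" %in% names(1:3)

print(list(in_list, in_df, in_env, several, unnamed))
```

The same expression works unchanged for all three kinds of object, which is the property being asked of a has()-style function.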

Duncan Murdoch



Bill Dunlap
TIBCO Software
wdunlap tibco.com <http://tibco.com>

On Tue, Jun 28, 2016 at 4:08 AM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:


    On 27/06/2016 10:15 PM, Lenth, Russell V wrote:

        Hadley's note on partial matching has me scared the most
        concerning the as.null() coding. So the need for a hasName()
        (or whatever) function seems all the more compelling, and that
        it be in base R. Perhaps it should be generic, with a default
        method that searches in the names attribute, potentially
        extensible to other classes.


    I am thinking of putting it in, but if I do the definition will be
    equivalent to the one-liner down below.  That's already slower
    than the is.null() test; making it generic would slow it down too
    much.

    Duncan Murdoch


        Thanks so much, several of you, for your positive and helpful
        responses.

        Russ

        -----Original Message-----
        From: Duncan Murdoch [mailto:murdoch.dun...@gmail.com]
        Sent: Monday, June 27, 2016 12:50 PM
        To: Hadley Wickham <h.wick...@gmail.com>; Lenth, Russell V
        <russell-le...@uiowa.edu>
        Cc: r-package-devel@r-project.org
        Subject: Re: [R-pkg-devel] Absent variables and tibble

        On 27/06/2016 1:09 PM, Hadley Wickham wrote:

            The other thing you need to be aware of if you're using the
            other approach is partial matching:

            df <- data.frame(xyz = 1)
            is.null(df$x)
            #> [1] FALSE
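
To spell out the contrast, here is a short sketch using the same data frame as the example above. The variable names on the left-hand side are illustrative only. `$` partially matches names on lists and data frames, whereas `[[` matches exactly by default and the names() test never partial-matches.

```r
df <- data.frame(xyz = 1)

# `$` partially matches "x" to the column "xyz", so the result is not NULL.
partial <- is.null(df$x)

# `[[` uses exact matching by default, so the absent column gives NULL.
exact <- is.null(df[["x"]])

# Comparing against names() only succeeds on a literal match.
by_names <- "x" %in% names(df)

print(c(partial = partial, exact = exact, by_names = by_names))
```

Only the first test is fooled by the partial match; the other two both report that there is no column named "x".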

            Duncan - I think that argues for including a has_name()
            (hasName() ?)
            function in base R. Is that something you'd consider?


        Yes, I'd consider it.  I think hasName() would be more
        consistent with other has*() functions in the R sources.

        I guess the implementation should be defined to be equivalent to

        hasName <- function(x, name)
           name %in% names(x)

        though it would make sense to make a faster internal
        implementation;
        !is.null(df$x) is quite a bit faster than "x" %in% names(df).
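
A rough way to check that claim on one's own machine, using only base R's system.time(). The data frame, the repetition count and the helper `time_it` are arbitrary choices for this sketch, not anything from the thread; timings depend on the R version and platform, so no figures are shown.

```r
df <- data.frame(x = 1:10, y = rnorm(10))
n  <- 1e5  # number of repetitions; arbitrary

# Hypothetical helper: call a zero-argument function n times and
# return the elapsed time in seconds.
time_it <- function(f, n) {
  unname(system.time(for (i in seq_len(n)) f())["elapsed"])
}

t_null  <- time_it(function() !is.null(df$x), n)
t_names <- time_it(function() "x" %in% names(df), n)

# Both tests agree when the column exists and no partial matching is involved.
same_answer <- identical(!is.null(df$x), "x" %in% names(df))

cat("!is.null(df$x):     ", t_null,  "s\n")
cat('"x" %in% names(df): ', t_names, "s\n")
```

The closure call adds the same overhead to both timings, so the difference between them is more meaningful than either absolute figure.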

        Duncan Murdoch



            Hadley

            On Mon, Jun 27, 2016 at 10:05 AM, Lenth, Russell V
            <russell-le...@uiowa.edu> wrote:

                Thanks, Hadley. I do understand why you'd want more
                careful checking.

                If you're going to provide a variable-existing
                function, may I suggest a short name like 'has'? I.e.,
                has(x, var) returns TRUE if x has var in it.

                Thanks

                Russ

                    On Jun 27, 2016, at 9:47 AM, Hadley Wickham
                    <h.wick...@gmail.com> wrote:

                    On Mon, Jun 27, 2016 at 9:03 AM, Duncan Murdoch
                    <murdoch.dun...@gmail.com> wrote:

                        On 27/06/2016 9:22 AM, Lenth, Russell V wrote:


                            My package 'lsmeans' is now suddenly
                            broken because of a new
                            provision in the 'tibble' package (loaded
                            by 'dplyr' 0.5.0), whereby the "[[" and "$"
                            methods for 'tbl_df' objects - as
                            documented - throw an error if
                            a variable is not found.

                            The problem is that my code uses tests
                            like this:

                                   if (is.null (x$var)) {...}

                            to see whether 'x' has a variable 'var'.
                            Obviously, I can work
                            around this using

                                   if (!("var" %in% names(x))) {...}

                            but (a) I like the first version better,
                            in terms of the code
                            being understandable; and (b) isn't there
                            a long history whereby
                            we can expect a NULL result when accessing
                            an absent member of a
                            list (and hence a data.frame)? (c) the
                            code base for 'lsmeans'
                            has about 50 instances of such tests.

                            Anyway, I wonder if a lot of other package
                            developers test for
                            absent variables in that first way; if so,
                            they too are in for a
                            rude awakening if their users provide a
                            tbl_df instead of a
                            data.frame. And what is considered the
                            best practice for testing
                            absence of a list member? Apparently, not
                            either of the above;
                            and because of (c), I want to do these
                            many tedious corrections only once.

                            Thanks for any light you can shed.



                        This is why CRAN asks that people test reverse
                        dependencies.


                    Which we did do - the problem is that this is
                    actually caused by a
                    recursive reverse dependency (lsmeans -> dplyr ->
                    tibble), and we
                    didn't correctly anticipate how much pain this
                    would cause.

                        I think the most defensive thing you can do is
                        to write a small
                        function

                        name_missing <- function(x, name)
                           !(name %in% names(x))

                        and use name_missing(x, "var") in your tests.
                        (Pick your own name
                        to make your code understandable if you don't
                        like my choice.)

                        You could suggest to the tibble maintainers
                        that they add a
                        function like this.


                    We're definitely going to add this.

                    And I think we'll make df[["var"]] return NULL
                    too, so at least
                    there's one easy way to opt out.

                    The motivation for this change was that returning
                    NULL + recycling
                    rules means it's very easy for errors to silently
                    propagate. But I
                    think this approach might be somewhat too
                    aggressive - I hadn't
                    considered that people use `is.null()` to check
                    for missing columns.

                    We'll try and get an update to tibble out soon
                    after useR.
                    Thoughts on what we should do are greatly appreciated.

                    Hadley

                    --
                    http://hadley.nz






    ______________________________________________
    R-package-devel@r-project.org
    <mailto:R-package-devel@r-project.org> mailing list
    https://stat.ethz.ch/mailman/listinfo/r-package-devel

