Re: [Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y

frederik Tue, 20 Feb 2018 19:16:30 -0800

Hi Scott,

I tried the new patch and can confirm that it has the advertised
behavior on a couple of test cases. I think it makes sense to apply
it, because any existing code which refers to a second duplicate
data.frame column by name is already broken, while if the reference is
by numerical index then changing the column name shouldn't break it.


I don't know if you need to update the documentation as part of your
patch, or if whoever applies it would be happy to do that. Somebody
from R core want to weigh in on this?

I attach a file with the test example from your original email as well
as a second test case I added with two "by" columns.

Thanks,

Frederick

On Wed, Feb 21, 2018 at 10:06:21AM +1100, Scott Ritchie wrote:
> Hi Frederick,
> 
> It looks like I didn't overwrite the patch.diff file after the last edits.
> Here's the correct patch (attached and copied below):
> 
> Index: src/library/base/R/merge.R
> ===================================================================
> --- src/library/base/R/merge.R (revision 74280)
> +++ src/library/base/R/merge.R (working copy)
> @@ -157,6 +157,14 @@
>          }
> 
>          if(has.common.nms) names(y) <- nm.y
> +        ## If by.x %in% names(y) then duplicate column names still arise,
> +        ## apply suffixes to just y - this keeps backwards compatibility
> +        ## when referring to by.x in the resulting data.frame
> +        dupe.keyx <- intersect(nm.by, names(y))
> +        if(length(dupe.keyx)) {
> +          if(nzchar(suffixes[2L]))
> +            names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx,
> suffixes[2L], sep="")
> +        }
>          nm <- c(names(x), names(y))
>          if(any(d <- duplicated(nm)))
>              if(sum(d) > 1L)
> 
> Best,
> 
> Scott
> 
> On 21 February 2018 at 08:23, <[email protected]> wrote:
> 
> > Hi Scott,
> >
> > I think that's a good idea and I tried your patch on my copy of the
> > repository. But it looks to me like the recent patch is identical to
> > the previous one, can you confirm this?
> >
> > Frederick
> >
> > On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote:
> > > Thanks Gabriel,
> > >
> > > I think your suggested approach is 100% backwards compatible
> > >
> > > Currently in the case of duplicate column names only the first can be
> > > indexed by its name. This will always be the column appearing in by.x,
> > > meaning the column in y with the same name cannot be accessed. Appending
> > > ".y" (suffixes[2L]) to this column means it can now be accessed, while
> > > keeping the current behaviour of making the key columns always accessible
> > > by using the names provided to by.x.
> > >
> > > I've attached a new patch that has this behaviour.
> > >
> > > Best,
> > >
> > > Scott
> > >
> > > On 19 February 2018 at 05:08, Gabriel Becker <[email protected]>
> > wrote:
> > >
> > > > It seems like there is a way that is backwards compatible-ish in the
> > sense
> > > > mentioned and still has the (arguably, but a good argument I think)
> > better
> > > > behavior:
> > > >
> > > > if by.x is 'name', (AND by.y is not also 'name'), then x's 'name'
> > column
> > > > is called name and y's 'name' column (not used int he merge) is
> > changed to
> > > > name.y.
> > > >
> > > > Now of course this would still change output, but it would change it to
> > > > something I think would be better, while retaining the 'merge columns
> > > > retain their exact names' mechanic as documented.
> > > >
> > > > ~G
> > > >
> > > > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <[email protected]>
> > > > wrote:
> > > >
> > > >> Thanks Duncan and Frederick,
> > > >>
> > > >> I suspected as much - there doesn't appear to be any reason why
> > conflicts
> > > >> between by.x and names(y) shouldn't and cannot be checked, but I can
> > see
> > > >> how this might be more trouble than its worth given it potentially may
> > > >> break downstream packages (i.e. any cases where this occurs but they
> > > >> expect
> > > >> the name of the key column(s) to remain the same).
> > > >>
> > > >> Best,
> > > >>
> > > >> Scott
> > > >>
> > > >> On 18 February 2018 at 11:48, Duncan Murdoch <
> > [email protected]>
> > > >> wrote:
> > > >>
> > > >> > On 17/02/2018 6:36 PM, [email protected] wrote:
> > > >> >
> > > >> >> Hi Scott,
> > > >> >>
> > > >> >> Thanks for the patch. I'm not really involved in R development; it
> > > >> >> will be up to someone in the R core team to apply it. I would
> > hazard
> > > >> >> to say that even if correct (I haven't checked), it will not be
> > > >> >> applied because the change might break existing code. For example
> > it
> > > >> >> seems like reasonable code might easily assume that a column with
> > the
> > > >> >> same name as "by.x" exists in the output of 'merge'. That's just my
> > > >> >> best guess... I don't participate on here often.
> > > >> >>
> > > >> >
> > > >> >
> > > >> > I think you're right.  If I were still a member of R Core, I would
> > want
> > > >> to
> > > >> > test this against all packages on CRAN and Bioconductor, and since
> > that
> > > >> > test takes a couple of days to run on my laptop, I'd probably never
> > get
> > > >> > around to it.
> > > >> >
> > > >> > There are lots of cases where "I would have done that differently",
> > but
> > > >> > most of them are far too much trouble to change now that R is more
> > than
> > > >> 20
> > > >> > years old.  And in many cases it will turn out that the way R does
> > it
> > > >> > actually does make more sense than the way I would have done it.
> > > >> >
> > > >> > Duncan Murdoch
> > > >> >
> > > >> >
> > > >> >
> > > >> >> Cheers,
> > > >> >>
> > > >> >> Frederick
> > > >> >>
> > > >> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
> > > >> >>
> > > >> >>> The attached patch.diff will make merge.data.frame() append the
> > > >> suffixes
> > > >> >>> to
> > > >> >>> columns with common names between by.x and names(y).
> > > >> >>>
> > > >> >>> Best,
> > > >> >>>
> > > >> >>> Scott Ritchie
> > > >> >>>
> > > >> >>> On 17 February 2018 at 11:15, Scott Ritchie <
> > [email protected]>
> > > >> >>> wrote:
> > > >> >>>
> > > >> >>> Hi Frederick,
> > > >> >>>>
> > > >> >>>> I would expect that any duplicate names in the resulting
> > data.frame
> > > >> >>>> would
> > > >> >>>> have the suffixes appended to them, regardless of whether or not
> > they
> > > >> >>>> are
> > > >> >>>> used as the join key. So in my example I would expect "names.x"
> > and
> > > >> >>>> "names.y" to indicate their source data.frame.
> > > >> >>>>
> > > >> >>>> While careful reading of the documentation reveals this is not
> > the
> > > >> >>>> case, I
> > > >> >>>> would argue the intent of the suffixes functionality should
> > equally
> > > >> be
> > > >> >>>> applied to this type of case.
> > > >> >>>>
> > > >> >>>> If you agree this would be useful, I'm happy to write a patch for
> > > >> >>>> merge.data.frame that will add suffixes in this case - I intend
> > to do
> > > >> >>>> the
> > > >> >>>> same for merge.data.table in the data.table package where I
> > initially
> > > >> >>>> encountered the edge case.
> > > >> >>>>
> > > >> >>>> Best,
> > > >> >>>>
> > > >> >>>> Scott
> > > >> >>>>
> > > >> >>>> On 17 February 2018 at 03:53, <[email protected]> wrote:
> > > >> >>>>
> > > >> >>>> Hi Scott,
> > > >> >>>>>
> > > >> >>>>> It seems like reasonable behavior to me. What result would you
> > > >> expect?
> > > >> >>>>> That the second "name" should be called "name.y"?
> > > >> >>>>>
> > > >> >>>>> The "merge" documentation says:
> > > >> >>>>>
> > > >> >>>>>      If the columns in the data frames not used in merging have
> > any
> > > >> >>>>>      common names, these have ‘suffixes’ (‘".x"’ and ‘".y"’ by
> > > >> default)
> > > >> >>>>>      appended to try to make the names of the result unique.
> > > >> >>>>>
> > > >> >>>>> Since the first "name" column was used in merging, leaving both
> > > >> >>>>> without a suffix seems consistent with the documentation...
> > > >> >>>>>
> > > >> >>>>> Frederick
> > > >> >>>>>
> > > >> >>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
> > > >> >>>>>
> > > >> >>>>>> Hi,
> > > >> >>>>>>
> > > >> >>>>>> I was unable to find a bug report for this with a cursory
> > search,
> > > >> but
> > > >> >>>>>>
> > > >> >>>>> would
> > > >> >>>>>
> > > >> >>>>>> like clarification if this is intended or unavoidable
> > behaviour:
> > > >> >>>>>>
> > > >> >>>>>> ```{r}
> > > >> >>>>>> # Create example data.frames
> > > >> >>>>>> parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
> > > >> >>>>>>                        sex=c("F", "M", "F", "M"),
> > > >> >>>>>>                        age=c(41, 43, 36, 51))
> > > >> >>>>>> children <- data.frame(parent=c("Sarah", "Max", "Qin"),
> > > >> >>>>>>                         name=c("Oliver", "Sebastian",
> > "Kai-lee"),
> > > >> >>>>>>                         sex=c("M", "M", "F"),
> > > >> >>>>>>                         age=c(5,8,7))
> > > >> >>>>>>
> > > >> >>>>>> # Merge() creates a duplicated "name" column:
> > > >> >>>>>> merge(parents, children, by.x = "name", by.y = "parent")
> > > >> >>>>>> ```
> > > >> >>>>>>
> > > >> >>>>>> Output:
> > > >> >>>>>> ```
> > > >> >>>>>>     name sex.x age.x      name sex.y age.y
> > > >> >>>>>> 1   Max     M    43 Sebastian     M     8
> > > >> >>>>>> 2   Qin     F    36   Kai-lee     F     7
> > > >> >>>>>> 3 Sarah     F    41    Oliver     M     5
> > > >> >>>>>> Warning message:
> > > >> >>>>>> In merge.data.frame(parents, children, by.x = "name", by.y =
> > > >> >>>>>> "parent") :
> > > >> >>>>>>    column name ‘name’ is duplicated in the result
> > > >> >>>>>> ```
> > > >> >>>>>>
> > > >> >>>>>> Kind Regards,
> > > >> >>>>>>
> > > >> >>>>>> Scott Ritchie
> > > >> >>>>>>
> > > >> >>>>>>        [[alternative HTML version deleted]]
> > > >> >>>>>>
> > > >> >>>>>> ______________________________________________
> > > >> >>>>>> [email protected] mailing list
> > > >> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> > > >> >>>>>>
> > > >> >>>>>>
> > > >> >>>>>
> > > >> >>>>
> > > >> >>>>
> > > >> >> Index: src/library/base/R/merge.R
> > > >> >>> ============================================================
> > =======
> > > >> >>> --- src/library/base/R/merge.R  (revision 74264)
> > > >> >>> +++ src/library/base/R/merge.R  (working copy)
> > > >> >>> @@ -157,6 +157,15 @@
> > > >> >>>           }
> > > >> >>>             if(has.common.nms) names(y) <- nm.y
> > > >> >>> +        ## If by.x %in% names(y) then duplicate column names
> > still
> > > >> >>> arise,
> > > >> >>> +        ## apply suffixes to these
> > > >> >>> +        dupe.keyx <- intersect(nm.by, names(y))
> > > >> >>> +        if(length(dupe.keyx)) {
> > > >> >>> +          if(nzchar(suffixes[1L]))
> > > >> >>> +            names(x)[match(dupe.keyx, names(x), 0L)] <-
> > > >> >>> paste(dupe.keyx, suffixes[1L], sep="")
> > > >> >>> +          if(nzchar(suffixes[2L]))
> > > >> >>> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
> > > >> >>> paste(dupe.keyx, suffixes[2L], sep="")
> > > >> >>> +        }
> > > >> >>>           nm <- c(names(x), names(y))
> > > >> >>>           if(any(d <- duplicated(nm)))
> > > >> >>>               if(sum(d) > 1L)
> > > >> >>>
> > > >> >>
> > > >> >> ______________________________________________
> > > >> >> [email protected] mailing list
> > > >> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> > > >> >>
> > > >> >>
> > > >> >
> > > >>
> > > >>         [[alternative HTML version deleted]]
> > > >>
> > > >> ______________________________________________
> > > >> [email protected] mailing list
> > > >> https://stat.ethz.ch/mailman/listinfo/r-devel
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Gabriel Becker, PhD
> > > > Scientist (Bioinformatics)
> > > > Genentech Research
> > > >
> >
> > > Index: src/library/base/R/merge.R
> > > ===================================================================
> > > --- src/library/base/R/merge.R        (revision 74264)
> > > +++ src/library/base/R/merge.R        (working copy)
> > > @@ -157,6 +157,15 @@
> > >          }
> > >
> > >          if(has.common.nms) names(y) <- nm.y
> > > +        ## If by.x %in% names(y) then duplicate column names still
> > arise,
> > > +        ## apply suffixes to these
> > > +        dupe.keyx <- intersect(nm.by, names(y))
> > > +        if(length(dupe.keyx)) {
> > > +          if(nzchar(suffixes[1L]))
> > > +            names(x)[match(dupe.keyx, names(x), 0L)] <-
> > paste(dupe.keyx, suffixes[1L], sep="")
> > > +          if(nzchar(suffixes[2L]))
> > > +            names(y)[match(dupe.keyx, names(y), 0L)] <-
> > paste(dupe.keyx, suffixes[2L], sep="")
> > > +        }
> > >          nm <- c(names(x), names(y))
> > >          if(any(d <- duplicated(nm)))
> > >              if(sum(d) > 1L)
> >
> > > ______________________________________________
> > > [email protected] mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> >

> Index: src/library/base/R/merge.R
> ===================================================================
> --- src/library/base/R/merge.R        (revision 74280)
> +++ src/library/base/R/merge.R        (working copy)
> @@ -157,6 +157,14 @@
>          }
>  
>          if(has.common.nms) names(y) <- nm.y
> +        ## If by.x %in% names(y) then duplicate column names still arise,
> +        ## apply suffixes to just y - this keeps backwards compatibility
> +        ## when referring to by.x in the resulting data.frame
> +        dupe.keyx <- intersect(nm.by, names(y))
> +        if(length(dupe.keyx)) {
> +          if(nzchar(suffixes[2L]))
> +            names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx, 
> suffixes[2L], sep="")
> +        }
>          nm <- c(names(x), names(y))
>          if(any(d <- duplicated(nm)))
>              if(sum(d) > 1L)

# Create example data.frames
parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
                      sex=c("F", "M", "F", "M"),
                      age=c(41, 43, 36, 51))
children <- data.frame(parent=c("Sarah", "Max", "Qin"),
                       name=c("Oliver", "Sebastian", "Kai-lee"),
                       sex=c("M", "M", "F"),
                       age=c(5,8,7))

# Merge() creates a duplicated "name" column:
# (not anymore)
merge(parents, children, by.x = "name", by.y = "parent")
##   name sex.x age.x    name.y sex.y age.y

# FHE 20 Feb 2018
# try it with two "by" columns
merge(parents, children, by.x = c("name","sex"), by.y = c("parent","sex"))
##   name sex age.x    name.y age.y

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y

Reply via email to