Hi Scott, I tried the new patch and can confirm that it has the advertised behavior on a couple of test cases. I think it makes sense to apply it, because any existing code which refers to a second duplicate data.frame column by name is already broken, while if the reference is by numerical index then changing the column name shouldn't break it.
I don't know if you need to update the documentation as part of your patch, or if whoever applies it would be happy to do that. Somebody from R core want to weigh in on this? I attach a file with the test example from your original email as well as a second test case I added with two "by" columns. Thanks, Frederick On Wed, Feb 21, 2018 at 10:06:21AM +1100, Scott Ritchie wrote: > Hi Frederick, > > It looks like I didn't overwrite the patch.diff file after the last edits. > Here's the correct patch (attached and copied below): > > Index: src/library/base/R/merge.R > =================================================================== > --- src/library/base/R/merge.R (revision 74280) > +++ src/library/base/R/merge.R (working copy) > @@ -157,6 +157,14 @@ > } > > if(has.common.nms) names(y) <- nm.y > + ## If by.x %in% names(y) then duplicate column names still arise, > + ## apply suffixes to just y - this keeps backwards compatibility > + ## when referring to by.x in the resulting data.frame > + dupe.keyx <- intersect(nm.by, names(y)) > + if(length(dupe.keyx)) { > + if(nzchar(suffixes[2L])) > + names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx, > suffixes[2L], sep="") > + } > nm <- c(names(x), names(y)) > if(any(d <- duplicated(nm))) > if(sum(d) > 1L) > > Best, > > Scott > > On 21 February 2018 at 08:23, <frede...@ofb.net> wrote: > > > Hi Scott, > > > > I think that's a good idea and I tried your patch on my copy of the > > repository. But it looks to me like the recent patch is identical to > > the previous one, can you confirm this? > > > > Frederick > > > > On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote: > > > Thanks Gabriel, > > > > > > I think your suggested approach is 100% backwards compatible > > > > > > Currently in the case of duplicate column names only the first can be > > > indexed by its name. This will always be the column appearing in by.x, > > > meaning the column in y with the same name cannot be accessed. Appending > > > ".y" (suffixes[2L]) to this column means it can now be accessed, while > > > keeping the current behaviour of making the key columns always accessible > > > by using the names provided to by.x. > > > > > > I've attached a new patch that has this behaviour. > > > > > > Best, > > > > > > Scott > > > > > > On 19 February 2018 at 05:08, Gabriel Becker <gmbec...@ucdavis.edu> > > wrote: > > > > > > > It seems like there is a way that is backwards compatible-ish in the > > sense > > > > mentioned and still has the (arguably, but a good argument I think) > > better > > > > behavior: > > > > > > > > if by.x is 'name', (AND by.y is not also 'name'), then x's 'name' > > column > > > > is called name and y's 'name' column (not used int he merge) is > > changed to > > > > name.y. > > > > > > > > Now of course this would still change output, but it would change it to > > > > something I think would be better, while retaining the 'merge columns > > > > retain their exact names' mechanic as documented. > > > > > > > > ~G > > > > > > > > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <s.ritchi...@gmail.com> > > > > wrote: > > > > > > > >> Thanks Duncan and Frederick, > > > >> > > > >> I suspected as much - there doesn't appear to be any reason why > > conflicts > > > >> between by.x and names(y) shouldn't and cannot be checked, but I can > > see > > > >> how this might be more trouble than its worth given it potentially may > > > >> break downstream packages (i.e. any cases where this occurs but they > > > >> expect > > > >> the name of the key column(s) to remain the same). > > > >> > > > >> Best, > > > >> > > > >> Scott > > > >> > > > >> On 18 February 2018 at 11:48, Duncan Murdoch < > > murdoch.dun...@gmail.com> > > > >> wrote: > > > >> > > > >> > On 17/02/2018 6:36 PM, frede...@ofb.net wrote: > > > >> > > > > >> >> Hi Scott, > > > >> >> > > > >> >> Thanks for the patch. I'm not really involved in R development; it > > > >> >> will be up to someone in the R core team to apply it. I would > > hazard > > > >> >> to say that even if correct (I haven't checked), it will not be > > > >> >> applied because the change might break existing code. For example > > it > > > >> >> seems like reasonable code might easily assume that a column with > > the > > > >> >> same name as "by.x" exists in the output of 'merge'. That's just my > > > >> >> best guess... I don't participate on here often. > > > >> >> > > > >> > > > > >> > > > > >> > I think you're right. If I were still a member of R Core, I would > > want > > > >> to > > > >> > test this against all packages on CRAN and Bioconductor, and since > > that > > > >> > test takes a couple of days to run on my laptop, I'd probably never > > get > > > >> > around to it. > > > >> > > > > >> > There are lots of cases where "I would have done that differently", > > but > > > >> > most of them are far too much trouble to change now that R is more > > than > > > >> 20 > > > >> > years old. And in many cases it will turn out that the way R does > > it > > > >> > actually does make more sense than the way I would have done it. > > > >> > > > > >> > Duncan Murdoch > > > >> > > > > >> > > > > >> > > > > >> >> Cheers, > > > >> >> > > > >> >> Frederick > > > >> >> > > > >> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote: > > > >> >> > > > >> >>> The attached patch.diff will make merge.data.frame() append the > > > >> suffixes > > > >> >>> to > > > >> >>> columns with common names between by.x and names(y). > > > >> >>> > > > >> >>> Best, > > > >> >>> > > > >> >>> Scott Ritchie > > > >> >>> > > > >> >>> On 17 February 2018 at 11:15, Scott Ritchie < > > s.ritchi...@gmail.com> > > > >> >>> wrote: > > > >> >>> > > > >> >>> Hi Frederick, > > > >> >>>> > > > >> >>>> I would expect that any duplicate names in the resulting > > data.frame > > > >> >>>> would > > > >> >>>> have the suffixes appended to them, regardless of whether or not > > they > > > >> >>>> are > > > >> >>>> used as the join key. So in my example I would expect "names.x" > > and > > > >> >>>> "names.y" to indicate their source data.frame. > > > >> >>>> > > > >> >>>> While careful reading of the documentation reveals this is not > > the > > > >> >>>> case, I > > > >> >>>> would argue the intent of the suffixes functionality should > > equally > > > >> be > > > >> >>>> applied to this type of case. > > > >> >>>> > > > >> >>>> If you agree this would be useful, I'm happy to write a patch for > > > >> >>>> merge.data.frame that will add suffixes in this case - I intend > > to do > > > >> >>>> the > > > >> >>>> same for merge.data.table in the data.table package where I > > initially > > > >> >>>> encountered the edge case. > > > >> >>>> > > > >> >>>> Best, > > > >> >>>> > > > >> >>>> Scott > > > >> >>>> > > > >> >>>> On 17 February 2018 at 03:53, <frede...@ofb.net> wrote: > > > >> >>>> > > > >> >>>> Hi Scott, > > > >> >>>>> > > > >> >>>>> It seems like reasonable behavior to me. What result would you > > > >> expect? > > > >> >>>>> That the second "name" should be called "name.y"? > > > >> >>>>> > > > >> >>>>> The "merge" documentation says: > > > >> >>>>> > > > >> >>>>> If the columns in the data frames not used in merging have > > any > > > >> >>>>> common names, these have ‘suffixes’ (‘".x"’ and ‘".y"’ by > > > >> default) > > > >> >>>>> appended to try to make the names of the result unique. > > > >> >>>>> > > > >> >>>>> Since the first "name" column was used in merging, leaving both > > > >> >>>>> without a suffix seems consistent with the documentation... > > > >> >>>>> > > > >> >>>>> Frederick > > > >> >>>>> > > > >> >>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote: > > > >> >>>>> > > > >> >>>>>> Hi, > > > >> >>>>>> > > > >> >>>>>> I was unable to find a bug report for this with a cursory > > search, > > > >> but > > > >> >>>>>> > > > >> >>>>> would > > > >> >>>>> > > > >> >>>>>> like clarification if this is intended or unavoidable > > behaviour: > > > >> >>>>>> > > > >> >>>>>> ```{r} > > > >> >>>>>> # Create example data.frames > > > >> >>>>>> parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"), > > > >> >>>>>> sex=c("F", "M", "F", "M"), > > > >> >>>>>> age=c(41, 43, 36, 51)) > > > >> >>>>>> children <- data.frame(parent=c("Sarah", "Max", "Qin"), > > > >> >>>>>> name=c("Oliver", "Sebastian", > > "Kai-lee"), > > > >> >>>>>> sex=c("M", "M", "F"), > > > >> >>>>>> age=c(5,8,7)) > > > >> >>>>>> > > > >> >>>>>> # Merge() creates a duplicated "name" column: > > > >> >>>>>> merge(parents, children, by.x = "name", by.y = "parent") > > > >> >>>>>> ``` > > > >> >>>>>> > > > >> >>>>>> Output: > > > >> >>>>>> ``` > > > >> >>>>>> name sex.x age.x name sex.y age.y > > > >> >>>>>> 1 Max M 43 Sebastian M 8 > > > >> >>>>>> 2 Qin F 36 Kai-lee F 7 > > > >> >>>>>> 3 Sarah F 41 Oliver M 5 > > > >> >>>>>> Warning message: > > > >> >>>>>> In merge.data.frame(parents, children, by.x = "name", by.y = > > > >> >>>>>> "parent") : > > > >> >>>>>> column name ‘name’ is duplicated in the result > > > >> >>>>>> ``` > > > >> >>>>>> > > > >> >>>>>> Kind Regards, > > > >> >>>>>> > > > >> >>>>>> Scott Ritchie > > > >> >>>>>> > > > >> >>>>>> [[alternative HTML version deleted]] > > > >> >>>>>> > > > >> >>>>>> ______________________________________________ > > > >> >>>>>> R-devel@r-project.org mailing list > > > >> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel > > > >> >>>>>> > > > >> >>>>>> > > > >> >>>>> > > > >> >>>> > > > >> >>>> > > > >> >> Index: src/library/base/R/merge.R > > > >> >>> ============================================================ > > ======= > > > >> >>> --- src/library/base/R/merge.R (revision 74264) > > > >> >>> +++ src/library/base/R/merge.R (working copy) > > > >> >>> @@ -157,6 +157,15 @@ > > > >> >>> } > > > >> >>> if(has.common.nms) names(y) <- nm.y > > > >> >>> + ## If by.x %in% names(y) then duplicate column names > > still > > > >> >>> arise, > > > >> >>> + ## apply suffixes to these > > > >> >>> + dupe.keyx <- intersect(nm.by, names(y)) > > > >> >>> + if(length(dupe.keyx)) { > > > >> >>> + if(nzchar(suffixes[1L])) > > > >> >>> + names(x)[match(dupe.keyx, names(x), 0L)] <- > > > >> >>> paste(dupe.keyx, suffixes[1L], sep="") > > > >> >>> + if(nzchar(suffixes[2L])) > > > >> >>> + names(y)[match(dupe.keyx, names(y), 0L)] <- > > > >> >>> paste(dupe.keyx, suffixes[2L], sep="") > > > >> >>> + } > > > >> >>> nm <- c(names(x), names(y)) > > > >> >>> if(any(d <- duplicated(nm))) > > > >> >>> if(sum(d) > 1L) > > > >> >>> > > > >> >> > > > >> >> ______________________________________________ > > > >> >> R-devel@r-project.org mailing list > > > >> >> https://stat.ethz.ch/mailman/listinfo/r-devel > > > >> >> > > > >> >> > > > >> > > > > >> > > > >> [[alternative HTML version deleted]] > > > >> > > > >> ______________________________________________ > > > >> R-devel@r-project.org mailing list > > > >> https://stat.ethz.ch/mailman/listinfo/r-devel > > > >> > > > > > > > > > > > > > > > > -- > > > > Gabriel Becker, PhD > > > > Scientist (Bioinformatics) > > > > Genentech Research > > > > > > > > > Index: src/library/base/R/merge.R > > > =================================================================== > > > --- src/library/base/R/merge.R (revision 74264) > > > +++ src/library/base/R/merge.R (working copy) > > > @@ -157,6 +157,15 @@ > > > } > > > > > > if(has.common.nms) names(y) <- nm.y > > > + ## If by.x %in% names(y) then duplicate column names still > > arise, > > > + ## apply suffixes to these > > > + dupe.keyx <- intersect(nm.by, names(y)) > > > + if(length(dupe.keyx)) { > > > + if(nzchar(suffixes[1L])) > > > + names(x)[match(dupe.keyx, names(x), 0L)] <- > > paste(dupe.keyx, suffixes[1L], sep="") > > > + if(nzchar(suffixes[2L])) > > > + names(y)[match(dupe.keyx, names(y), 0L)] <- > > paste(dupe.keyx, suffixes[2L], sep="") > > > + } > > > nm <- c(names(x), names(y)) > > > if(any(d <- duplicated(nm))) > > > if(sum(d) > 1L) > > > > > ______________________________________________ > > > R-devel@r-project.org mailing list > > > https://stat.ethz.ch/mailman/listinfo/r-devel > > > > > Index: src/library/base/R/merge.R > =================================================================== > --- src/library/base/R/merge.R (revision 74280) > +++ src/library/base/R/merge.R (working copy) > @@ -157,6 +157,14 @@ > } > > if(has.common.nms) names(y) <- nm.y > + ## If by.x %in% names(y) then duplicate column names still arise, > + ## apply suffixes to just y - this keeps backwards compatibility > + ## when referring to by.x in the resulting data.frame > + dupe.keyx <- intersect(nm.by, names(y)) > + if(length(dupe.keyx)) { > + if(nzchar(suffixes[2L])) > + names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx, > suffixes[2L], sep="") > + } > nm <- c(names(x), names(y)) > if(any(d <- duplicated(nm))) > if(sum(d) > 1L)
# Create example data.frames parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"), sex=c("F", "M", "F", "M"), age=c(41, 43, 36, 51)) children <- data.frame(parent=c("Sarah", "Max", "Qin"), name=c("Oliver", "Sebastian", "Kai-lee"), sex=c("M", "M", "F"), age=c(5,8,7)) # Merge() creates a duplicated "name" column: # (not anymore) merge(parents, children, by.x = "name", by.y = "parent") ## name sex.x age.x name.y sex.y age.y # FHE 20 Feb 2018 # try it with two "by" columns merge(parents, children, by.x = c("name","sex"), by.y = c("parent","sex")) ## name sex age.x name.y age.y
______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel