>>>>> Gabriel Becker <gmbec...@ucdavis.edu> >>>>> on Wed, 21 Feb 2018 07:11:44 -0800 writes:
> Hi all, > For the record this approach isnt 100% backwards compatible, because > names(mergeddf) will e incompatibly different. Thatx why i claimed > bakcwards compatable-ish exactly. > That said its still worth considering imho because of the reasons stated > (and honestly one particular simple reading of the docs might suggest that > this was thr intended behavior all along). Im not a member of Rcore through > so i cant do said considering myself. I agree with Scott, Frederik and you that this changes seems worth considering. As Duncan Murdoch has mentioned, this alone may not be sufficient. In addition to your proposed patch (which I have simplified, not using intersection() but working with underlying match() directly), it is little work to introduce an extra argument, I'm calling 'no.dups = TRUE' which when set to false would mirror current R's behavior... and documenting it, then also documents the new behavior (to some extent). My plan is to commit it soonish ;-) Martin > Best, > ~G > On Feb 20, 2018 7:15 PM, <frede...@ofb.net> wrote: > Hi Scott, > I tried the new patch and can confirm that it has the advertised > behavior on a couple of test cases. I think it makes sense to apply > it, because any existing code which refers to a second duplicate > data.frame column by name is already broken, while if the reference is > by numerical index then changing the column name shouldn't break it. > I don't know if you need to update the documentation as part of your > patch, or if whoever applies it would be happy to do that. Somebody > from R core want to weigh in on this? > I attach a file with the test example from your original email as well > as a second test case I added with two "by" columns. > Thanks, > Frederick > On Wed, Feb 21, 2018 at 10:06:21AM +1100, Scott Ritchie wrote: >> Hi Frederick, >> >> It looks like I didn't overwrite the patch.diff file after the last edits. >> Here's the correct patch (attached and copied below): >> >> Index: src/library/base/R/merge.R >> =================================================================== >> --- src/library/base/R/merge.R (revision 74280) >> +++ src/library/base/R/merge.R (working copy) >> @@ -157,6 +157,14 @@ >> } >> >> if(has.common.nms) names(y) <- nm.y >> + ## If by.x %in% names(y) then duplicate column names still arise, >> + ## apply suffixes to just y - this keeps backwards compatibility >> + ## when referring to by.x in the resulting data.frame >> + dupe.keyx <- intersect(nm.by, names(y)) >> + if(length(dupe.keyx)) { >> + if(nzchar(suffixes[2L])) >> + names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx, >> suffixes[2L], sep="") >> + } >> nm <- c(names(x), names(y)) >> if(any(d <- duplicated(nm))) >> if(sum(d) > 1L) >> >> Best, >> >> Scott >> >> On 21 February 2018 at 08:23, <frede...@ofb.net> wrote: >> >> > Hi Scott, >> > >> > I think that's a good idea and I tried your patch on my copy of the >> > repository. But it looks to me like the recent patch is identical to >> > the previous one, can you confirm this? >> > >> > Frederick >> > >> > On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote: >> > > Thanks Gabriel, >> > > >> > > I think your suggested approach is 100% backwards compatible >> > > >> > > Currently in the case of duplicate column names only the first can be >> > > indexed by its name. This will always be the column appearing in by.x, >> > > meaning the column in y with the same name cannot be accessed. > Appending >> > > ".y" (suffixes[2L]) to this column means it can now be accessed, while >> > > keeping the current behaviour of making the key columns always > accessible >> > > by using the names provided to by.x. >> > > >> > > I've attached a new patch that has this behaviour. >> > > >> > > Best, >> > > >> > > Scott >> > > >> > > On 19 February 2018 at 05:08, Gabriel Becker <gmbec...@ucdavis.edu> >> > wrote: >> > > >> > > > It seems like there is a way that is backwards compatible-ish in the >> > sense >> > > > mentioned and still has the (arguably, but a good argument I think) >> > better >> > > > behavior: >> > > > >> > > > if by.x is 'name', (AND by.y is not also 'name'), then x's 'name' >> > column >> > > > is called name and y's 'name' column (not used int he merge) is >> > changed to >> > > > name.y. >> > > > >> > > > Now of course this would still change output, but it would change > it to >> > > > something I think would be better, while retaining the 'merge > columns >> > > > retain their exact names' mechanic as documented. >> > > > >> > > > ~G >> > > > >> > > > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie < > s.ritchi...@gmail.com> >> > > > wrote: >> > > > >> > > >> Thanks Duncan and Frederick, >> > > >> >> > > >> I suspected as much - there doesn't appear to be any reason why >> > conflicts >> > > >> between by.x and names(y) shouldn't and cannot be checked, but I > can >> > see >> > > >> how this might be more trouble than its worth given it potentially > may >> > > >> break downstream packages (i.e. any cases where this occurs but > they >> > > >> expect >> > > >> the name of the key column(s) to remain the same). >> > > >> >> > > >> Best, >> > > >> >> > > >> Scott >> > > >> >> > > >> On 18 February 2018 at 11:48, Duncan Murdoch < >> > murdoch.dun...@gmail.com> >> > > >> wrote: >> > > >> >> > > >> > On 17/02/2018 6:36 PM, frede...@ofb.net wrote: >> > > >> > >> > > >> >> Hi Scott, >> > > >> >> >> > > >> >> Thanks for the patch. I'm not really involved in R development; > it >> > > >> >> will be up to someone in the R core team to apply it. I would >> > hazard >> > > >> >> to say that even if correct (I haven't checked), it will not be >> > > >> >> applied because the change might break existing code. For > example >> > it >> > > >> >> seems like reasonable code might easily assume that a column > with >> > the >> > > >> >> same name as "by.x" exists in the output of 'merge'. That's > just my >> > > >> >> best guess... I don't participate on here often. >> > > >> >> >> > > >> > >> > > >> > >> > > >> > I think you're right. If I were still a member of R Core, I > would >> > want >> > > >> to >> > > >> > test this against all packages on CRAN and Bioconductor, and > since >> > that >> > > >> > test takes a couple of days to run on my laptop, I'd probably > never >> > get >> > > >> > around to it. >> > > >> > >> > > >> > There are lots of cases where "I would have done that > differently", >> > but >> > > >> > most of them are far too much trouble to change now that R is > more >> > than >> > > >> 20 >> > > >> > years old. And in many cases it will turn out that the way R > does >> > it >> > > >> > actually does make more sense than the way I would have done it. >> > > >> > >> > > >> > Duncan Murdoch >> > > >> > >> > > >> > >> > > >> > >> > > >> >> Cheers, >> > > >> >> >> > > >> >> Frederick >> > > >> >> >> > > >> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote: >> > > >> >> >> > > >> >>> The attached patch.diff will make merge.data.frame() append the >> > > >> suffixes >> > > >> >>> to >> > > >> >>> columns with common names between by.x and names(y). >> > > >> >>> >> > > >> >>> Best, >> > > >> >>> >> > > >> >>> Scott Ritchie >> > > >> >>> >> > > >> >>> On 17 February 2018 at 11:15, Scott Ritchie < >> > s.ritchi...@gmail.com> >> > > >> >>> wrote: >> > > >> >>> >> > > >> >>> Hi Frederick, >> > > >> >>>> >> > > >> >>>> I would expect that any duplicate names in the resulting >> > data.frame >> > > >> >>>> would >> > > >> >>>> have the suffixes appended to them, regardless of whether or > not >> > they >> > > >> >>>> are >> > > >> >>>> used as the join key. So in my example I would expect > "names.x" >> > and >> > > >> >>>> "names.y" to indicate their source data.frame. >> > > >> >>>> >> > > >> >>>> While careful reading of the documentation reveals this is not >> > the >> > > >> >>>> case, I >> > > >> >>>> would argue the intent of the suffixes functionality should >> > equally >> > > >> be >> > > >> >>>> applied to this type of case. >> > > >> >>>> >> > > >> >>>> If you agree this would be useful, I'm happy to write a patch > for >> > > >> >>>> merge.data.frame that will add suffixes in this case - I > intend >> > to do >> > > >> >>>> the >> > > >> >>>> same for merge.data.table in the data.table package where I >> > initially >> > > >> >>>> encountered the edge case. >> > > >> >>>> >> > > >> >>>> Best, >> > > >> >>>> >> > > >> >>>> Scott >> > > >> >>>> >> > > >> >>>> On 17 February 2018 at 03:53, <frede...@ofb.net> wrote: >> > > >> >>>> >> > > >> >>>> Hi Scott, >> > > >> >>>>> >> > > >> >>>>> It seems like reasonable behavior to me. What result would > you >> > > >> expect? >> > > >> >>>>> That the second "name" should be called "name.y"? >> > > >> >>>>> >> > > >> >>>>> The "merge" documentation says: >> > > >> >>>>> >> > > >> >>>>> If the columns in the data frames not used in merging > have >> > any >> > > >> >>>>> common names, these have ‘suffixes’ (‘".x"’ and ‘".y"’ > by >> > > >> default) >> > > >> >>>>> appended to try to make the names of the result unique. >> > > >> >>>>> >> > > >> >>>>> Since the first "name" column was used in merging, leaving > both >> > > >> >>>>> without a suffix seems consistent with the documentation... >> > > >> >>>>> >> > > >> >>>>> Frederick >> > > >> >>>>> >> > > >> >>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie > wrote: >> > > >> >>>>> >> > > >> >>>>>> Hi, >> > > >> >>>>>> >> > > >> >>>>>> I was unable to find a bug report for this with a cursory >> > search, >> > > >> but >> > > >> >>>>>> >> > > >> >>>>> would >> > > >> >>>>> >> > > >> >>>>>> like clarification if this is intended or unavoidable >> > behaviour: >> > > >> >>>>>> >> > > >> >>>>>> ```{r} >> > > >> >>>>>> # Create example data.frames >> > > >> >>>>>> parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"), >> > > >> >>>>>> sex=c("F", "M", "F", "M"), >> > > >> >>>>>> age=c(41, 43, 36, 51)) >> > > >> >>>>>> children <- data.frame(parent=c("Sarah", "Max", "Qin"), >> > > >> >>>>>> name=c("Oliver", "Sebastian", >> > "Kai-lee"), >> > > >> >>>>>> sex=c("M", "M", "F"), >> > > >> >>>>>> age=c(5,8,7)) >> > > >> >>>>>> >> > > >> >>>>>> # Merge() creates a duplicated "name" column: >> > > >> >>>>>> merge(parents, children, by.x = "name", by.y = "parent") >> > > >> >>>>>> ``` >> > > >> >>>>>> >> > > >> >>>>>> Output: >> > > >> >>>>>> ``` >> > > >> >>>>>> name sex.x age.x name sex.y age.y >> > > >> >>>>>> 1 Max M 43 Sebastian M 8 >> > > >> >>>>>> 2 Qin F 36 Kai-lee F 7 >> > > >> >>>>>> 3 Sarah F 41 Oliver M 5 >> > > >> >>>>>> Warning message: >> > > >> >>>>>> In merge.data.frame(parents, children, by.x = "name", by.y = >> > > >> >>>>>> "parent") : >> > > >> >>>>>> column name ‘name’ is duplicated in the result >> > > >> >>>>>> ``` >> > > >> >>>>>> >> > > >> >>>>>> Kind Regards, >> > > >> >>>>>> >> > > >> >>>>>> Scott Ritchie >> > > >> >>>>>> >> > > >> >>>>>> [[alternative HTML version deleted]] >> > > >> >>>>>> >> > > >> >>>>>> ______________________________________________ >> > > >> >>>>>> R-devel@r-project.org mailing list >> > > >> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel >> > > >> >>>>>> >> > > >> >>>>>> >> > > >> >>>>> >> > > >> >>>> >> > > >> >>>> >> > > >> >> Index: src/library/base/R/merge.R >> > > >> >>> ============================================================ >> > ======= >> > > >> >>> --- src/library/base/R/merge.R (revision 74264) >> > > >> >>> +++ src/library/base/R/merge.R (working copy) >> > > >> >>> @@ -157,6 +157,15 @@ >> > > >> >>> } >> > > >> >>> if(has.common.nms) names(y) <- nm.y >> > > >> >>> + ## If by.x %in% names(y) then duplicate column names >> > still >> > > >> >>> arise, >> > > >> >>> + ## apply suffixes to these >> > > >> >>> + dupe.keyx <- intersect(nm.by, names(y)) >> > > >> >>> + if(length(dupe.keyx)) { >> > > >> >>> + if(nzchar(suffixes[1L])) >> > > >> >>> + names(x)[match(dupe.keyx, names(x), 0L)] <- >> > > >> >>> paste(dupe.keyx, suffixes[1L], sep="") >> > > >> >>> + if(nzchar(suffixes[2L])) >> > > >> >>> + names(y)[match(dupe.keyx, names(y), 0L)] <- >> > > >> >>> paste(dupe.keyx, suffixes[2L], sep="") >> > > >> >>> + } >> > > >> >>> nm <- c(names(x), names(y)) >> > > >> >>> if(any(d <- duplicated(nm))) >> > > >> >>> if(sum(d) > 1L) >> > > >> >>> >> > > >> >> >> > > >> >> ______________________________________________ >> > > >> >> R-devel@r-project.org mailing list >> > > >> >> https://stat.ethz.ch/mailman/listinfo/r-devel >> > > >> >> >> > > >> >> >> > > >> > >> > > >> >> > > >> [[alternative HTML version deleted]] >> > > >> >> > > >> ______________________________________________ >> > > >> R-devel@r-project.org mailing list >> > > >> https://stat.ethz.ch/mailman/listinfo/r-devel >> > > >> >> > > > >> > > > >> > > > >> > > > -- >> > > > Gabriel Becker, PhD >> > > > Scientist (Bioinformatics) >> > > > Genentech Research >> > > > >> > >> > > Index: src/library/base/R/merge.R >> > > =================================================================== >> > > --- src/library/base/R/merge.R (revision 74264) >> > > +++ src/library/base/R/merge.R (working copy) >> > > @@ -157,6 +157,15 @@ >> > > } >> > > >> > > if(has.common.nms) names(y) <- nm.y >> > > + ## If by.x %in% names(y) then duplicate column names still >> > arise, >> > > + ## apply suffixes to these >> > > + dupe.keyx <- intersect(nm.by, names(y)) >> > > + if(length(dupe.keyx)) { >> > > + if(nzchar(suffixes[1L])) >> > > + names(x)[match(dupe.keyx, names(x), 0L)] <- >> > paste(dupe.keyx, suffixes[1L], sep="") >> > > + if(nzchar(suffixes[2L])) >> > > + names(y)[match(dupe.keyx, names(y), 0L)] <- >> > paste(dupe.keyx, suffixes[2L], sep="") >> > > + } >> > > nm <- c(names(x), names(y)) >> > > if(any(d <- duplicated(nm))) >> > > if(sum(d) > 1L) >> > >> > > ______________________________________________ >> > > R-devel@r-project.org mailing list >> > > https://stat.ethz.ch/mailman/listinfo/r-devel >> > >> > >> Index: src/library/base/R/merge.R >> =================================================================== >> --- src/library/base/R/merge.R (revision 74280) >> +++ src/library/base/R/merge.R (working copy) >> @@ -157,6 +157,14 @@ >> } >> >> if(has.common.nms) names(y) <- nm.y >> + ## If by.x %in% names(y) then duplicate column names still arise, >> + ## apply suffixes to just y - this keeps backwards compatibility >> + ## when referring to by.x in the resulting data.frame >> + dupe.keyx <- intersect(nm.by, names(y)) >> + if(length(dupe.keyx)) { >> + if(nzchar(suffixes[2L])) >> + names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx, > suffixes[2L], sep="") >> + } >> nm <- c(names(x), names(y)) >> if(any(d <- duplicated(nm))) >> if(sum(d) > 1L) > [[alternative HTML version deleted]] > ______________________________________________ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel