>>>>> Scott Ritchie <s.ritchi...@gmail.com> >>>>> on Fri, 23 Feb 2018 12:32:41 +1100 writes:
> Thanks Martin! > Can you clarify the functionality of the 'no.dups' argument so I can change > my patch to `data.table:::merge.data.table` accordingly? > - When `no.dups=TRUE` will the suffix to the by.x column name? Or will it > take the functionality of the second functionality where only the column in > y has the suffix added? > - When `no.dups=FALSE` will the output be the same as it currently (no > suffix added to either column)? Or will add the suffix to the column in y? I had started from your patch... and worked from there. So, there's no need (and use) to provide another one. I also needed to update the man page, add a regression test, add an entry to NEWS.Rd ... Just wait until I commit.. Martin > Best, > Scott > On 22 February 2018 at 22:31, Martin Maechler <maech...@stat.math.ethz.ch> > wrote: >> >>>>> Gabriel Becker <gmbec...@ucdavis.edu> >> >>>>> on Wed, 21 Feb 2018 07:11:44 -0800 writes: >> >> > Hi all, >> > For the record this approach isnt 100% backwards compatible, because >> > names(mergeddf) will e incompatibly different. Thatx why i claimed >> > bakcwards compatable-ish >> >> exactly. >> >> > That said its still worth considering imho because of the reasons >> stated >> > (and honestly one particular simple reading of the docs might >> suggest that >> > this was thr intended behavior all along). Im not a member of Rcore >> through >> > so i cant do said considering myself. >> >> I agree with Scott, Frederik and you that this changes seems >> worth considering. >> As Duncan Murdoch has mentioned, this alone may not be >> sufficient. >> >> In addition to your proposed patch (which I have simplified, not >> using intersection() but working with underlying match() >> directly), it is little work to introduce an extra argument, I'm >> calling 'no.dups = TRUE' which when set to false would mirror >> current R's behavior... and documenting it, then also documents the >> new behavior (to some extent). >> >> My plan is to commit it soonish ;-) >> Martin >> >> > Best, >> > ~G >> >> > On Feb 20, 2018 7:15 PM, <frede...@ofb.net> wrote: >> >> > Hi Scott, >> >> > I tried the new patch and can confirm that it has the advertised >> > behavior on a couple of test cases. I think it makes sense to apply >> > it, because any existing code which refers to a second duplicate >> > data.frame column by name is already broken, while if the reference >> is >> > by numerical index then changing the column name shouldn't break it. >> >> > I don't know if you need to update the documentation as part of your >> > patch, or if whoever applies it would be happy to do that. Somebody >> > from R core want to weigh in on this? >> >> > I attach a file with the test example from your original email as >> well >> > as a second test case I added with two "by" columns. >> >> > Thanks, >> >> > Frederick >> >> > On Wed, Feb 21, 2018 at 10:06:21AM +1100, Scott Ritchie wrote: >> >> Hi Frederick, >> >> >> >> It looks like I didn't overwrite the patch.diff file after the last >> edits. >> >> Here's the correct patch (attached and copied below): >> >> >> >> Index: src/library/base/R/merge.R >> >> =================================================================== >> >> --- src/library/base/R/merge.R (revision 74280) >> >> +++ src/library/base/R/merge.R (working copy) >> >> @@ -157,6 +157,14 @@ >> >> } >> >> >> >> if(has.common.nms) names(y) <- nm.y >> >> + ## If by.x %in% names(y) then duplicate column names still >> arise, >> >> + ## apply suffixes to just y - this keeps backwards >> compatibility >> >> + ## when referring to by.x in the resulting data.frame >> >> + dupe.keyx <- intersect(nm.by, names(y)) >> >> + if(length(dupe.keyx)) { >> >> + if(nzchar(suffixes[2L])) >> >> + names(y)[match(dupe.keyx, names(y), 0L)] <- >> paste(dupe.keyx, >> >> suffixes[2L], sep="") >> >> + } >> >> nm <- c(names(x), names(y)) >> >> if(any(d <- duplicated(nm))) >> >> if(sum(d) > 1L) >> >> >> >> Best, >> >> >> >> Scott >> >> >> >> On 21 February 2018 at 08:23, <frede...@ofb.net> wrote: >> >> >> >> > Hi Scott, >> >> > >> >> > I think that's a good idea and I tried your patch on my copy of >> the >> >> > repository. But it looks to me like the recent patch is identical >> to >> >> > the previous one, can you confirm this? >> >> > >> >> > Frederick >> >> > >> >> > On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote: >> >> > > Thanks Gabriel, >> >> > > >> >> > > I think your suggested approach is 100% backwards compatible >> >> > > >> >> > > Currently in the case of duplicate column names only the first >> can be >> >> > > indexed by its name. This will always be the column appearing >> in by.x, >> >> > > meaning the column in y with the same name cannot be accessed. >> > Appending >> >> > > ".y" (suffixes[2L]) to this column means it can now be >> accessed, while >> >> > > keeping the current behaviour of making the key columns always >> > accessible >> >> > > by using the names provided to by.x. >> >> > > >> >> > > I've attached a new patch that has this behaviour. >> >> > > >> >> > > Best, >> >> > > >> >> > > Scott >> >> > > >> >> > > On 19 February 2018 at 05:08, Gabriel Becker < >> gmbec...@ucdavis.edu> >> >> > wrote: >> >> > > >> >> > > > It seems like there is a way that is backwards compatible-ish >> in the >> >> > sense >> >> > > > mentioned and still has the (arguably, but a good argument I >> think) >> >> > better >> >> > > > behavior: >> >> > > > >> >> > > > if by.x is 'name', (AND by.y is not also 'name'), then x's >> 'name' >> >> > column >> >> > > > is called name and y's 'name' column (not used int he merge) >> is >> >> > changed to >> >> > > > name.y. >> >> > > > >> >> > > > Now of course this would still change output, but it would >> change >> > it to >> >> > > > something I think would be better, while retaining the 'merge >> > columns >> >> > > > retain their exact names' mechanic as documented. >> >> > > > >> >> > > > ~G >> >> > > > >> >> > > > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie < >> > s.ritchi...@gmail.com> >> >> > > > wrote: >> >> > > > >> >> > > >> Thanks Duncan and Frederick, >> >> > > >> >> >> > > >> I suspected as much - there doesn't appear to be any reason >> why >> >> > conflicts >> >> > > >> between by.x and names(y) shouldn't and cannot be checked, >> but I >> > can >> >> > see >> >> > > >> how this might be more trouble than its worth given it >> potentially >> > may >> >> > > >> break downstream packages (i.e. any cases where this occurs >> but >> > they >> >> > > >> expect >> >> > > >> the name of the key column(s) to remain the same). >> >> > > >> >> >> > > >> Best, >> >> > > >> >> >> > > >> Scott >> >> > > >> >> >> > > >> On 18 February 2018 at 11:48, Duncan Murdoch < >> >> > murdoch.dun...@gmail.com> >> >> > > >> wrote: >> >> > > >> >> >> > > >> > On 17/02/2018 6:36 PM, frede...@ofb.net wrote: >> >> > > >> > >> >> > > >> >> Hi Scott, >> >> > > >> >> >> >> > > >> >> Thanks for the patch. I'm not really involved in R >> development; >> > it >> >> > > >> >> will be up to someone in the R core team to apply it. I >> would >> >> > hazard >> >> > > >> >> to say that even if correct (I haven't checked), it will >> not be >> >> > > >> >> applied because the change might break existing code. For >> > example >> >> > it >> >> > > >> >> seems like reasonable code might easily assume that a >> column >> > with >> >> > the >> >> > > >> >> same name as "by.x" exists in the output of 'merge'. >> That's >> > just my >> >> > > >> >> best guess... I don't participate on here often. >> >> > > >> >> >> >> > > >> > >> >> > > >> > >> >> > > >> > I think you're right. If I were still a member of R Core, >> I >> > would >> >> > want >> >> > > >> to >> >> > > >> > test this against all packages on CRAN and Bioconductor, >> and >> > since >> >> > that >> >> > > >> > test takes a couple of days to run on my laptop, I'd >> probably >> > never >> >> > get >> >> > > >> > around to it. >> >> > > >> > >> >> > > >> > There are lots of cases where "I would have done that >> > differently", >> >> > but >> >> > > >> > most of them are far too much trouble to change now that R >> is >> > more >> >> > than >> >> > > >> 20 >> >> > > >> > years old. And in many cases it will turn out that the >> way R >> > does >> >> > it >> >> > > >> > actually does make more sense than the way I would have >> done it. >> >> > > >> > >> >> > > >> > Duncan Murdoch >> >> > > >> > >> >> > > >> > >> >> > > >> > >> >> > > >> >> Cheers, >> >> > > >> >> >> >> > > >> >> Frederick >> >> > > >> >> >> >> > > >> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie >> wrote: >> >> > > >> >> >> >> > > >> >>> The attached patch.diff will make merge.data.frame() >> append the >> >> > > >> suffixes >> >> > > >> >>> to >> >> > > >> >>> columns with common names between by.x and names(y). >> >> > > >> >>> >> >> > > >> >>> Best, >> >> > > >> >>> >> >> > > >> >>> Scott Ritchie >> >> > > >> >>> >> >> > > >> >>> On 17 February 2018 at 11:15, Scott Ritchie < >> >> > s.ritchi...@gmail.com> >> >> > > >> >>> wrote: >> >> > > >> >>> >> >> > > >> >>> Hi Frederick, >> >> > > >> >>>> >> >> > > >> >>>> I would expect that any duplicate names in the resulting >> >> > data.frame >> >> > > >> >>>> would >> >> > > >> >>>> have the suffixes appended to them, regardless of >> whether or >> > not >> >> > they >> >> > > >> >>>> are >> >> > > >> >>>> used as the join key. So in my example I would expect >> > "names.x" >> >> > and >> >> > > >> >>>> "names.y" to indicate their source data.frame. >> >> > > >> >>>> >> >> > > >> >>>> While careful reading of the documentation reveals this >> is not >> >> > the >> >> > > >> >>>> case, I >> >> > > >> >>>> would argue the intent of the suffixes functionality >> should >> >> > equally >> >> > > >> be >> >> > > >> >>>> applied to this type of case. >> >> > > >> >>>> >> >> > > >> >>>> If you agree this would be useful, I'm happy to write a >> patch >> > for >> >> > > >> >>>> merge.data.frame that will add suffixes in this case - I >> > intend >> >> > to do >> >> > > >> >>>> the >> >> > > >> >>>> same for merge.data.table in the data.table package >> where I >> >> > initially >> >> > > >> >>>> encountered the edge case. >> >> > > >> >>>> >> >> > > >> >>>> Best, >> >> > > >> >>>> >> >> > > >> >>>> Scott >> >> > > >> >>>> >> >> > > >> >>>> On 17 February 2018 at 03:53, <frede...@ofb.net> wrote: >> >> > > >> >>>> >> >> > > >> >>>> Hi Scott, >> >> > > >> >>>>> >> >> > > >> >>>>> It seems like reasonable behavior to me. What result >> would >> > you >> >> > > >> expect? >> >> > > >> >>>>> That the second "name" should be called "name.y"? >> >> > > >> >>>>> >> >> > > >> >>>>> The "merge" documentation says: >> >> > > >> >>>>> >> >> > > >> >>>>> If the columns in the data frames not used in >> merging >> > have >> >> > any >> >> > > >> >>>>> common names, these have ‘suffixes’ (‘".x"’ and >> ‘".y"’ >> > by >> >> > > >> default) >> >> > > >> >>>>> appended to try to make the names of the result >> unique. >> >> > > >> >>>>> >> >> > > >> >>>>> Since the first "name" column was used in merging, >> leaving >> > both >> >> > > >> >>>>> without a suffix seems consistent with the >> documentation... >> >> > > >> >>>>> >> >> > > >> >>>>> Frederick >> >> > > >> >>>>> >> >> > > >> >>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie >> > wrote: >> >> > > >> >>>>> >> >> > > >> >>>>>> Hi, >> >> > > >> >>>>>> >> >> > > >> >>>>>> I was unable to find a bug report for this with a >> cursory >> >> > search, >> >> > > >> but >> >> > > >> >>>>>> >> >> > > >> >>>>> would >> >> > > >> >>>>> >> >> > > >> >>>>>> like clarification if this is intended or unavoidable >> >> > behaviour: >> >> > > >> >>>>>> >> >> > > >> >>>>>> ```{r} >> >> > > >> >>>>>> # Create example data.frames >> >> > > >> >>>>>> parents <- data.frame(name=c("Sarah", "Max", "Qin", >> "Lex"), >> >> > > >> >>>>>> sex=c("F", "M", "F", "M"), >> >> > > >> >>>>>> age=c(41, 43, 36, 51)) >> >> > > >> >>>>>> children <- data.frame(parent=c("Sarah", "Max", >> "Qin"), >> >> > > >> >>>>>> name=c("Oliver", "Sebastian", >> >> > "Kai-lee"), >> >> > > >> >>>>>> sex=c("M", "M", "F"), >> >> > > >> >>>>>> age=c(5,8,7)) >> >> > > >> >>>>>> >> >> > > >> >>>>>> # Merge() creates a duplicated "name" column: >> >> > > >> >>>>>> merge(parents, children, by.x = "name", by.y = >> "parent") >> >> > > >> >>>>>> ``` >> >> > > >> >>>>>> >> >> > > >> >>>>>> Output: >> >> > > >> >>>>>> ``` >> >> > > >> >>>>>> name sex.x age.x name sex.y age.y >> >> > > >> >>>>>> 1 Max M 43 Sebastian M 8 >> >> > > >> >>>>>> 2 Qin F 36 Kai-lee F 7 >> >> > > >> >>>>>> 3 Sarah F 41 Oliver M 5 >> >> > > >> >>>>>> Warning message: >> >> > > >> >>>>>> In merge.data.frame(parents, children, by.x = "name", >> by.y = >> >> > > >> >>>>>> "parent") : >> >> > > >> >>>>>> column name ‘name’ is duplicated in the result >> >> > > >> >>>>>> ``` >> >> > > >> >>>>>> >> >> > > >> >>>>>> Kind Regards, >> >> > > >> >>>>>> >> >> > > >> >>>>>> Scott Ritchie >> >> > > >> >>>>>> >> >> > > >> >>>>>> [[alternative HTML version deleted]] >> >> > > >> >>>>>> >> >> > > >> >>>>>> ______________________________________________ >> >> > > >> >>>>>> R-devel@r-project.org mailing list >> >> > > >> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel >> >> > > >> >>>>>> >> >> > > >> >>>>>> >> >> > > >> >>>>> >> >> > > >> >>>> >> >> > > >> >>>> >> >> > > >> >> Index: src/library/base/R/merge.R >> >> > > >> >>> ============================== >> ============================== >> >> > ======= >> >> > > >> >>> --- src/library/base/R/merge.R (revision 74264) >> >> > > >> >>> +++ src/library/base/R/merge.R (working copy) >> >> > > >> >>> @@ -157,6 +157,15 @@ >> >> > > >> >>> } >> >> > > >> >>> if(has.common.nms) names(y) <- nm.y >> >> > > >> >>> + ## If by.x %in% names(y) then duplicate column >> names >> >> > still >> >> > > >> >>> arise, >> >> > > >> >>> + ## apply suffixes to these >> >> > > >> >>> + dupe.keyx <- intersect(nm.by, names(y)) >> >> > > >> >>> + if(length(dupe.keyx)) { >> >> > > >> >>> + if(nzchar(suffixes[1L])) >> >> > > >> >>> + names(x)[match(dupe.keyx, names(x), 0L)] <- >> >> > > >> >>> paste(dupe.keyx, suffixes[1L], sep="") >> >> > > >> >>> + if(nzchar(suffixes[2L])) >> >> > > >> >>> + names(y)[match(dupe.keyx, names(y), 0L)] <- >> >> > > >> >>> paste(dupe.keyx, suffixes[2L], sep="") >> >> > > >> >>> + } >> >> > > >> >>> nm <- c(names(x), names(y)) >> >> > > >> >>> if(any(d <- duplicated(nm))) >> >> > > >> >>> if(sum(d) > 1L) >> >> > > >> >>> >> >> > > >> >> >> >> > > >> >> ______________________________________________ >> >> > > >> >> R-devel@r-project.org mailing list >> >> > > >> >> https://stat.ethz.ch/mailman/listinfo/r-devel >> >> > > >> >> >> >> > > >> >> >> >> > > >> > >> >> > > >> >> >> > > >> [[alternative HTML version deleted]] >> >> > > >> >> >> > > >> ______________________________________________ >> >> > > >> R-devel@r-project.org mailing list >> >> > > >> https://stat.ethz.ch/mailman/listinfo/r-devel >> >> > > >> >> >> > > > >> >> > > > >> >> > > > >> >> > > > -- >> >> > > > Gabriel Becker, PhD >> >> > > > Scientist (Bioinformatics) >> >> > > > Genentech Research >> >> > > > >> >> > >> >> > > Index: src/library/base/R/merge.R >> >> > > ============================================================ >> ======= >> >> > > --- src/library/base/R/merge.R (revision 74264) >> >> > > +++ src/library/base/R/merge.R (working copy) >> >> > > @@ -157,6 +157,15 @@ >> >> > > } >> >> > > >> >> > > if(has.common.nms) names(y) <- nm.y >> >> > > + ## If by.x %in% names(y) then duplicate column names >> still >> >> > arise, >> >> > > + ## apply suffixes to these >> >> > > + dupe.keyx <- intersect(nm.by, names(y)) >> >> > > + if(length(dupe.keyx)) { >> >> > > + if(nzchar(suffixes[1L])) >> >> > > + names(x)[match(dupe.keyx, names(x), 0L)] <- >> >> > paste(dupe.keyx, suffixes[1L], sep="") >> >> > > + if(nzchar(suffixes[2L])) >> >> > > + names(y)[match(dupe.keyx, names(y), 0L)] <- >> >> > paste(dupe.keyx, suffixes[2L], sep="") >> >> > > + } >> >> > > nm <- c(names(x), names(y)) >> >> > > if(any(d <- duplicated(nm))) >> >> > > if(sum(d) > 1L) >> >> > >> >> > > ______________________________________________ >> >> > > R-devel@r-project.org mailing list >> >> > > https://stat.ethz.ch/mailman/listinfo/r-devel >> >> > >> >> > >> >> >> Index: src/library/base/R/merge.R >> >> =================================================================== >> >> --- src/library/base/R/merge.R (revision 74280) >> >> +++ src/library/base/R/merge.R (working copy) >> >> @@ -157,6 +157,14 @@ >> >> } >> >> >> >> if(has.common.nms) names(y) <- nm.y >> >> + ## If by.x %in% names(y) then duplicate column names still >> arise, >> >> + ## apply suffixes to just y - this keeps backwards >> compatibility >> >> + ## when referring to by.x in the resulting data.frame >> >> + dupe.keyx <- intersect(nm.by, names(y)) >> >> + if(length(dupe.keyx)) { >> >> + if(nzchar(suffixes[2L])) >> >> + names(y)[match(dupe.keyx, names(y), 0L)] <- >> paste(dupe.keyx, >> > suffixes[2L], sep="") >> >> + } >> >> nm <- c(names(x), names(y)) >> >> if(any(d <- duplicated(nm))) >> >> if(sum(d) > 1L) >> >> > [[alternative HTML version deleted]] >> >> > ______________________________________________ >> > R-devel@r-project.org mailing list >> > https://stat.ethz.ch/mailman/listinfo/r-devel >> > [[alternative HTML version deleted]] ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel