Re: [Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y

Duncan Murdoch Sat, 17 Feb 2018 16:49:36 -0800

On 17/02/2018 6:36 PM, [email protected] wrote:

Hi Scott,


Thanks for the patch. I'm not really involved in R development; it
will be up to someone in the R core team to apply it. I would hazard
to say that even if correct (I haven't checked), it will not be
applied because the change might break existing code. For example it
seems like reasonable code might easily assume that a column with the
same name as "by.x" exists in the output of 'merge'. That's just my
best guess... I don't participate on here often.

I think you're right. If I were still a member of R Core, I would wantto test this against all packages on CRAN and Bioconductor, and sincethat test takes a couple of days to run on my laptop, I'd probably neverget around to it.

There are lots of cases where "I would have done that differently", butmost of them are far too much trouble to change now that R is more than20 years old. And in many cases it will turn out that the way R does itactually does make more sense than the way I would have done it.


Duncan Murdoch


Cheers,

Frederick

On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:

The attached patch.diff will make merge.data.frame() append the suffixes to
columns with common names between by.x and names(y).

Best,

Scott Ritchie

On 17 February 2018 at 11:15, Scott Ritchie <[email protected]> wrote:

Hi Frederick,

I would expect that any duplicate names in the resulting data.frame would
have the suffixes appended to them, regardless of whether or not they are
used as the join key. So in my example I would expect "names.x" and
"names.y" to indicate their source data.frame.

While careful reading of the documentation reveals this is not the case, I
would argue the intent of the suffixes functionality should equally be
applied to this type of case.

If you agree this would be useful, I'm happy to write a patch for
merge.data.frame that will add suffixes in this case - I intend to do the
same for merge.data.table in the data.table package where I initially
encountered the edge case.

Best,

Scott

On 17 February 2018 at 03:53, <[email protected]> wrote:

Hi Scott,

It seems like reasonable behavior to me. What result would you expect?
That the second "name" should be called "name.y"?

The "merge" documentation says:

     If the columns in the data frames not used in merging have any
     common names, these have ‘suffixes’ (‘".x"’ and ‘".y"’ by default)
     appended to try to make the names of the result unique.

Since the first "name" column was used in merging, leaving both
without a suffix seems consistent with the documentation...

Frederick

On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:

Hi,

I was unable to find a bug report for this with a cursory search, but

would

like clarification if this is intended or unavoidable behaviour:

```{r}
# Create example data.frames
parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
                       sex=c("F", "M", "F", "M"),
                       age=c(41, 43, 36, 51))
children <- data.frame(parent=c("Sarah", "Max", "Qin"),
                        name=c("Oliver", "Sebastian", "Kai-lee"),
                        sex=c("M", "M", "F"),
                        age=c(5,8,7))

# Merge() creates a duplicated "name" column:
merge(parents, children, by.x = "name", by.y = "parent")
```

Output:
```
    name sex.x age.x      name sex.y age.y
1   Max     M    43 Sebastian     M     8
2   Qin     F    36   Kai-lee     F     7
3 Sarah     F    41    Oliver     M     5
Warning message:
In merge.data.frame(parents, children, by.x = "name", by.y = "parent") :
   column name ‘name’ is duplicated in the result
```

Kind Regards,

Scott Ritchie

       [[alternative HTML version deleted]]

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Index: src/library/base/R/merge.R
===================================================================
--- src/library/base/R/merge.R  (revision 74264)
+++ src/library/base/R/merge.R  (working copy)
@@ -157,6 +157,15 @@
          }

if(has.common.nms) names(y) <- nm.y

+        ## If by.x %in% names(y) then duplicate column names still arise,
+        ## apply suffixes to these
+        dupe.keyx <- intersect(nm.by, names(y))
+        if(length(dupe.keyx)) {
+          if(nzchar(suffixes[1L]))
+            names(x)[match(dupe.keyx, names(x), 0L)] <- paste(dupe.keyx, suffixes[1L], 
sep="")
+          if(nzchar(suffixes[2L]))
+            names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx, suffixes[2L], 
sep="")
+        }
          nm <- c(names(x), names(y))
          if(any(d <- duplicated(nm)))
              if(sum(d) > 1L)


______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y

Reply via email to