[R] Duplicate names in the pivot column

phil Sat, 28 Mar 2020 18:20:09 -0700

I have a problem involving inefficient coding. My code works, but in my actual application it takes a very long time to execute. I have included a reprex here that uses the same code, but with a much smaller-scale application.

The data frame I am working with (df in my reprex) is in long form and I want to change it to wide form. My problem is that the pivot column, column 2 in my reprex, has some duplicate strings, so the pivot doesn't work well (df1 in my reprex). I want to find all the duplicates and tag them so they are no longer duplicates. My code succeeds (df3 in my reprex). But in the real application there can be over 100 "cases" and the for loops grind on far too long.

I encounter this problem frequently in the datasets I use, so I am looking for a general solution that is as efficient as possible. Any help will be much appreciated.


Philip

``` r
library(tidyverse)
df <- data.frame(time=c(1,1,1,1,1,1,2,2,2,2,2,2),
                 y=c("A","B","C","B","D","C","A","B","C","B","D","C"),
                 z=sample(1:100,12,replace=TRUE),stringsAsFactors=FALSE)
df1 <- pivot_wider(df,id_cols=1,names_from=y,values_from=z)

#> Warning: Values in `z` are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list(z = list)` to suppress this warning.
#> * Use `values_fn = list(z = length)` to identify where the duplicates arise
#> * Use `values_fn = list(z = summary_fun)` to summarise duplicates
fixcol <- function(dfm,cases,per,s,tag) {
  # dfm is the data frame
  # s is the target column number, containing character names
  # tag is a string to be added to a duplicate name
  # cases is the number of rows for a single time period
  # per is the number of time periods
  # all time periods must have the same number of rows
  for (k in 1:per) {
    for (i in (1+(k-1)*cases):(k*cases-1)) {
      for (j in (i+1):(k*cases)) {
        if (dfm[j,s]==dfm[i,s]) { # found a duplicate
          dfm[j,s] <- paste0(dfm[i,s],tag) # fix the duplicate
        }
      }
    }
  }
  return(dfm)
}
df2 <- fixcol(df,6,2,2,"_dup")
df3 <- pivot_wider(df2,id_cols=1,names_from=y,values_from=z)
```

<sup>Created on 2020-03-28 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)</sup>
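
For comparison, here is a minimal sketch of one vectorized way to do the same tagging without nested loops. It is not part of the original reprex: `tag_dups` and its argument names are hypothetical, and it assumes that "duplicate" means the same string appearing more than once within the same time period. It numbers each occurrence of a name within its period using base R's `ave()`, then appends the tag once per earlier occurrence, which should give the same names as `fixcol` (a second occurrence gets `_dup`, a third `_dup_dup`). Unlike `fixcol`, it does not require every period to have the same number of rows.

``` r
# Hypothetical helper, not from the original post.
# dfm       : the data frame
# group_col : name of the column that defines a period (here "time")
# name_col  : name of the character column to de-duplicate (here "y")
# tag       : string appended once for each earlier occurrence of the name
tag_dups <- function(dfm, group_col, name_col, tag = "_dup") {
  # occurrence number of each name within its group: 1, 2, 3, ...
  occ <- ave(seq_len(nrow(dfm)), dfm[[group_col]], dfm[[name_col]],
             FUN = seq_along)
  # first occurrence gets no tag, second gets one copy, third two, ...
  dfm[[name_col]] <- paste0(dfm[[name_col]], strrep(tag, occ - 1))
  dfm
}

# Same layout as the reprex, with fixed z values so the result is reproducible
df <- data.frame(time = rep(1:2, each = 6),
                 y = rep(c("A", "B", "C", "B", "D", "C"), times = 2),
                 z = 1:12, stringsAsFactors = FALSE)
df2 <- tag_dups(df, "time", "y")
df2$y
```

The returned data frame can then be passed to `pivot_wider()` exactly as `df2` is in the reprex above. Because the work is done by a single grouped pass rather than pairwise comparisons, the cost should grow roughly with the number of rows instead of with the square of the number of cases per period.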


