Re: [R] challenging data merging/joining problem

Rasmus Liland Sun, 05 Jul 2020 15:18:19 -0700

On 2020-07-05 14:50 -0400, Christopher W. Ryan wrote:
> I've been conducting relatively simple 
> COVID-19 surveillance for our jurisdiction.


Dear Christopher,

As I am a bit unfamiliar when it comes to the 
tidyverse, I wrote these lines using regular 
data.frames:

        ### Convert to data.frame
        dataSystemA <- as.data.frame(dataSystemA)
        dataSystemB <- as.data.frame(dataSystemB)
        
        ### Add some unique columns to show how 
        #   they are formatted later in this pipe.
        dataSystemA$someIncompleteInfo <- 1:4
        dataSystemB$other_incomplete_info <-
          c("Yes", "No", "Perhaps", "Sometimes", "Yes")
        
        ### Add the dfs to a list, as perhaps the 
        #   data kan be read somehow using 
        #   something like
        #   sapply(c("A", "B"), read.from.somewhere)
        dat <- list("A"=dataSystemA,
                    "B"=dataSystemB)
        
        ### Define a new dataSystem column in boths dfs
        dat <- sapply(names(dat), function(n, dat) {
          dat[[n]]$dataSystem <- n
          return(list(dat[[n]]))
        }, dat=dat)
        
        ### Read from a csv file column names 
        #   where you have defined which ones 
        #   are conceptually identical.
        text <- "A,B
        lastName,last_name
        firstName,first_name
        dob,birthdate
        onsetDate,date_of_onset
        symptomatic,symptoms_present"
        conceptually.identical <- read.csv(text=text)
        
        ### Rename dataSystemA columns to the 
        #   dataSystemB naming convention.
        idx <- match(x=conceptually.identical$A,
                     table=colnames(dat$A))
        colnames(dat$A)[idx] <-
          conceptually.identical[idx,"B"]
        
        ### Find all column names, and fill the
        #   ones that does not exists in each 
        #   df with NA, order the dfs by this 
        #   vector, then rbind the dfs.
        cn <- unique(unlist(lapply(dat, colnames)))
        dat <- sapply(dat, function(x, cn) {
          x[,cn[!(cn %in% colnames(x))]] <- NA
          list(x[,cn])
        }, cn=cn)
        dat <- do.call(rbind, dat)
        
        ### Order unified df decreasingly by 
        #   last_name and birthdate
        dat <- dat[order(dat$last_name,
          dat$birthdate, decreasing=FALSE),]
        rownames(dat) <- NULL
        
        dat

which yields

           last_name first_name  birthdate date_of_onset symptoms_present 
someIncompleteInfo dataSystem other_incomplete_info
        1    DIGGORY     cedric 2011-12-16    2020-07-12             TRUE       
          NA          B                   Yes
        2   GRAINGER   hermione 2010-12-05    2020-07-08               NA       
           3          A                  <NA>
        3   GRAINGER   hermione 2010-12-05    2020-07-08             TRUE       
          NA          B                   Yes
        4 LONGBOTTOM    neville 2011-01-24    2020-07-09               NA       
           4          A                  <NA>
        5 LONGBOTTOM    neville 2011-01-24    2020-07-09             TRUE       
          NA          B                    No
        6   LOVEGOOD       luna 2011-03-15    2020-07-11            FALSE       
          NA          B             Sometimes
        7     MALFOY      draco 2011-07-04    2020-07-10            FALSE       
          NA          B               Perhaps
        8     POTTER      harry 2010-12-16    2020-07-06             TRUE       
           1          A                  <NA>
        9    WEASLEY        ron 2010-12-30    2020-07-07            FALSE       
           2          A                  <NA>

When comparing the incomplete columns in each 
data system, it might be useful to do some 
reshaping like this:

        cols <- c("last_name", "birthdate", "dataSystem", "date_of_onset")
        reshape(dat[,cols],
                idvar=c("last_name", "birthdate"),
                timevar="dataSystem",
                direction="wide")

which yields

           last_name  birthdate date_of_onset.B date_of_onset.A
        1    DIGGORY 2011-03-17      2020-07-13            <NA>
        2   GRAINGER 2010-12-06      2020-07-09      2020-07-09
        4 LONGBOTTOM 2011-01-25      2020-07-10      2020-07-10
        6   LOVEGOOD 2010-10-15      2020-07-12            <NA>
        7     MALFOY 2010-12-25      2020-07-11            <NA>
        8     POTTER 2011-05-09            <NA>      2020-07-07
        9    WEASLEY 2012-04-05            <NA>      2020-07-08

Best,
Rasmus

signature.asc
Description: PGP signature

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] challenging data merging/joining problem

Reply via email to