On 2020-07-05 14:50 -0400, Christopher W. Ryan wrote:
> I've been conducting relatively simple
> COVID-19 surveillance for our jurisdiction.

Dear Christopher,

As I am not very familiar with the tidyverse, I wrote these lines using regular data.frames:

    ### Convert to data.frame
    dataSystemA <- as.data.frame(dataSystemA)
    dataSystemB <- as.data.frame(dataSystemB)

    ### Add some unique columns to show how
    # they are formatted later in this pipe.
    dataSystemA$someIncompleteInfo <- 1:4
    dataSystemB$other_incomplete_info <-
      c("Yes", "No", "Perhaps", "Sometimes", "Yes")

    ### Add the dfs to a list, as perhaps the
    # data can be read somehow using
    # something like
    # sapply(c("A", "B"), read.from.somewhere)
    dat <- list("A"=dataSystemA, "B"=dataSystemB)

    ### Define a new dataSystem column in both dfs
    dat <- sapply(names(dat), function(n, dat) {
      dat[[n]]$dataSystem <- n
      return(list(dat[[n]]))
    }, dat=dat)

    ### Read from a csv file column names
    # where you have defined which ones
    # are conceptually identical.
    text <- "A,B
    lastName,last_name
    firstName,first_name
    dob,birthdate
    onsetDate,date_of_onset
    symptomatic,symptoms_present"
    conceptually.identical <- read.csv(text=text)

    ### Rename dataSystemA columns to the
    # dataSystemB naming convention.
    # idx[i] is the position in colnames(dat$A)
    # of the i-th name in column A, so the
    # i-th name in column B is assigned there.
    idx <- match(x=conceptually.identical$A, table=colnames(dat$A))
    colnames(dat$A)[idx] <- conceptually.identical$B

    ### Find all column names, and fill the
    # ones that do not exist in each
    # df with NA, order the dfs by this
    # vector, then rbind the dfs.
    cn <- unique(unlist(lapply(dat, colnames)))
    dat <- sapply(dat, function(x, cn) {
      x[,cn[!(cn %in% colnames(x))]] <- NA
      list(x[,cn])
    }, cn=cn)
    dat <- do.call(rbind, dat)

    ### Order unified df increasingly by
    # last_name and birthdate
    dat <- dat[order(dat$last_name, dat$birthdate, decreasing=FALSE),]
    rownames(dat) <- NULL
    dat

which yields

       last_name first_name  birthdate date_of_onset symptoms_present someIncompleteInfo dataSystem other_incomplete_info
    1    DIGGORY     cedric 2011-12-16    2020-07-12             TRUE                 NA          B                   Yes
    2   GRAINGER   hermione 2010-12-05    2020-07-08               NA                  3          A                  <NA>
    3   GRAINGER   hermione 2010-12-05    2020-07-08             TRUE                 NA          B                   Yes
    4 LONGBOTTOM    neville 2011-01-24    2020-07-09               NA                  4          A                  <NA>
    5 LONGBOTTOM    neville 2011-01-24    2020-07-09             TRUE                 NA          B                    No
    6   LOVEGOOD       luna 2011-03-15    2020-07-11            FALSE                 NA          B             Sometimes
    7     MALFOY      draco 2011-07-04    2020-07-10            FALSE                 NA          B               Perhaps
    8     POTTER      harry 2010-12-16    2020-07-06             TRUE                  1          A                  <NA>
    9    WEASLEY        ron 2010-12-30    2020-07-07            FALSE                  2          A                  <NA>

When comparing the incomplete columns in each data system, it might be useful to do some reshaping like this:

    cols <- c("last_name", "birthdate", "dataSystem", "date_of_onset")
    reshape(dat[,cols],
      idvar=c("last_name", "birthdate"),
      timevar="dataSystem",
      direction="wide")

which yields

       last_name  birthdate date_of_onset.B date_of_onset.A
    1    DIGGORY 2011-03-17      2020-07-13            <NA>
    2   GRAINGER 2010-12-06      2020-07-09      2020-07-09
    4 LONGBOTTOM 2011-01-25      2020-07-10      2020-07-10
    6   LOVEGOOD 2010-10-15      2020-07-12            <NA>
    7     MALFOY 2010-12-25      2020-07-11            <NA>
    8     POTTER 2011-05-09            <NA>      2020-07-07
    9    WEASLEY 2012-04-05            <NA>      2020-07-08

Best,
Rasmus
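
A possible follow-up to the reshaping step: once the onset dates from both systems sit side by side, the wide table can be used to flag people who appear in only one system, or whose onset dates disagree. This is only a minimal sketch under assumptions. The toy data frame below is made up and merely stands in for the unified `dat`; the column names `only_in` and `onset_differs` are hypothetical, and it assumes that `date_of_onset` is always filled in whenever a system holds a record for a person.

```r
## Hypothetical toy data standing in for the unified `dat`;
## names and dates are invented for illustration only.
dat <- data.frame(
  last_name     = c("DIGGORY", "GRAINGER", "GRAINGER", "POTTER"),
  birthdate     = as.Date(c("2011-03-17", "2010-12-06",
                            "2010-12-06", "2011-05-09")),
  dataSystem    = c("B", "A", "B", "A"),
  date_of_onset = as.Date(c("2020-07-13", "2020-07-09",
                            "2020-07-10", "2020-07-07")),
  stringsAsFactors = FALSE
)

## One row per person, one onset column per data system.
wide <- reshape(dat,
  idvar = c("last_name", "birthdate"),
  timevar = "dataSystem",
  direction = "wide")

## Assumption: a missing onset date means the person has
## no record in that system.
wide$only_in <- ifelse(is.na(wide$date_of_onset.A), "B",
                ifelse(is.na(wide$date_of_onset.B), "A", NA_character_))

## Both systems know the person, but the onset dates differ.
wide$onset_differs <-
  !is.na(wide$date_of_onset.A) &
  !is.na(wide$date_of_onset.B) &
  wide$date_of_onset.A != wide$date_of_onset.B

wide
```

Rows where `only_in` is not NA are candidates for a record missing from the other system, and rows where `onset_differs` is TRUE are the ones worth checking by hand.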
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.