Have you talked directly to the designers of the new database? One would hope that they had a clear migration path in mind. Perhaps they just didn't document it to your satisfaction.
Rich On Sun, Jul 5, 2020 at 2:51 PM Christopher W. Ryan <cr...@binghamton.edu> wrote: > > I've been conducting relatively simple COVID-19 surveillance for our > jurisdiction. We get data on lab test results automatically, and then > interview patients to obtain other information, like clinical details. > We had been recording all data in our long-time data system (call it > dataSystemA). But as of a particular date, there was a major change in > the data system we were compelled to use. Call the new one dataSystemB. > dataSystemA and dataSystemB contain very similar information, > conceptually, but the variable names are all different, and there are > some variables in one that do not appear in the other. Total number of > variables in each is about 50-70. > > Furthermore, for about 2 weeks prior to the transition, lab test results > started being deposited into dataSystemB while dataSystemA was still > being used to record the full information from the interviews. > Subsequent to the transition, lab test results and interview information > are being recorded in dataSystemB, while the lab test results alone are > still being automatically deposited into dataSystemA. > > Diagrammatically: > > dataSystemA usage: ____________________ ............>> > > dataSystemB usage: ......._____________>> > > where ________ represents full data and ..... represents partial data, > and >> represents the progress of time. > > > The following will create MWE of the data wrangling problem, with the > change in data systems made to occur overnight on 2020-07-07: > > library(dplyr) > dataSystemA <- tibble(lastName = c("POTTER", "WEASLEY", "GRAINGER", > "LONGBOTTOM"), > firstName = c("harry", "ron", "hermione", "neville"), > dob = as.Date(Sys.Date() + c(sample(-3650:-3000, > size = 2), -3500, -3450)), > onsetDate = as.Date(Sys.Date() + 1:4), > symptomatic = c(TRUE, FALSE, NA, NA) ) > dataSystemB <- tibble(last_name = c("GRAINGER", "LONGBOTTOM", "MALFOY", > "LOVEGOOD", "DIGGORY"), > first_name = c("hermione", "neville", "draco", > "luna", "cedric"), > birthdate = as.Date(Sys.Date() + c(-3500, -3450, > sample(-3650:-3000, size = 3))), > date_of_onset = as.Date(Sys.Date() + 3:7), > symptoms_present = c(TRUE, TRUE, FALSE, FALSE, TRUE)) > > > > Obviously, this is all the same public health problem, so I don't want a > big uninterpretable gap in my reports. I am looking for advice on the > best strategy for combining two different tibbles with some overlap in > observations (some patients appear in both data systems, with varying > degrees of completeness of data) and with some of the same things being > mesaured and recorded in the two data systems, but with different > variable names. > > I've thought of two different strategies, neither of which seems ideal > but either of which might work: > > 1. change the variable names in dataSystemB to match their > conceptually-identical variables in dataSystemA, and then use some > version of bind_rows() > > 2. Create a unique identifier from last names, first names, and dates of > birth, use some type of full_join(), matching on that identifier, > obtaining all columns from both tibbles, and then "collapse" > conceptually-identical variables like onsetDate and date_of_onset using > coalesce() > > Sorry for my long-windedness. Grateful for any advice. > > --Chris Ryan > > ______________________________________________ > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.