Thank you very much. This is incredibly helpful. I just added an R package and put a bunch of code in it, which works very well. I have a quick follow-up question.
Suppose across the uncurated data-sets, stage of cancer progression is entered in the following way, where the column headers are all different names:

Example Data-set 1    Example Data-set 2    Example Data-set 3
Tumor stage: T3b      pt stage: 3           stage: I
Tumor stage: T2b      pt stage: 2           stage: II
Tumor stage: T3a      pt stage: 1           stage: IV
Tumor stage: T4c      pt stage: 4           stage: IV

What would the code in the R package I'm creating look like such that the corresponding code in each of these R scripts could be:

    curated$stage <- curate_stage(uncurated$characteristics_ch1.5)  # or whatever the column header is called

That is, what code in the function curate_stage() would make it such that the output would be:

Stage Output Data-set 1    Stage Output Data-set 2    Stage Output Data-set 3
3                          3                          1
2                          2                          2
3                          1                          4
4                          4                          4

Your helpful advice from before was that:

    curate <- function(characteristic, word="grade: ") {
      tmp <- sub(word, "", characteristic, fixed=TRUE)
      tmp[tmp=="I"] <- "low"
      tmp[tmp=="II"] <- "low"
      tmp[tmp=="III"] <- "high"
      tmp
    }

I'm wondering about situations when it's not the same exact wording (ie I can't just find "grade: " and take the value after it), but is very similar (ie when the word "stage" appears in the row but is preceded and followed by distinct words) (ie "Tumor stage: ", "pt stage: ", and "stage").

Thanks in advance-- you guys are saving me countless hours of repetitive coding, allowing me to move onto the more important and interesting parts of the study. I really appreciate all your suggestions and help.
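To make the question concrete, one possible shape for curate_stage() is sketched below. It is only a guess at an approach, not tested package code. It assumes every entry contains the word "stage" followed by either a digit (optionally with a leading "T" and a trailing sub-stage letter) or a Roman numeral from I to IV; the internal names (roman, is_digit) are made up for illustration.

```r
# Hypothetical helper: turn free-text stage labels into integers 1-4.
# Assumption: each entry looks like "<anything> stage: <value>", where
# <value> is a Roman numeral (I-IV) or contains a single stage digit.
curate_stage <- function(characteristic) {
  # Drop everything up to and including "stage", an optional colon, and spaces
  tmp <- sub("^.*stage:?[[:space:]]*", "", as.character(characteristic),
             ignore.case = TRUE)
  tmp <- toupper(trimws(tmp))

  # Roman numerals are looked up by name; anything else becomes NA here
  roman <- c(I = 1L, II = 2L, III = 3L, IV = 4L)
  out <- unname(roman[tmp])

  # For the remaining entries, keep the first digit: "T3B" -> 3, "3" -> 3
  is_digit <- is.na(out) & grepl("[0-9]", tmp)
  out[is_digit] <- as.integer(sub("^[^0-9]*([0-9]).*$", "\\1", tmp[is_digit]))

  # Entries matching neither pattern stay NA so they can be checked by hand
  out
}
```

Each curation script would then need only the one line from above, for example curated$stage <- curate_stage(uncurated$characteristics_ch1.5). Returning NA for unrecognised entries is a deliberate choice in this sketch, so that unexpected wordings show up instead of being silently mis-coded.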
Sincerely,
Ben

On Tue, Jun 7, 2011 at 12:53 PM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:

> On 07/06/2011 12:41 PM, Ben Ganzfried wrote:
>
>> Hi,
>>
>> My project is set up the following way:
>> root directory contains the following folders:
>> folders: "Breast_Cancer" AND "Colorectal_Cancer" AND "Lung_Cancer" AND
>> "Prostate_Cancer"
>>
>> I want to create a file, call it: "repeating_functions.R" and place it in
>> the root directory such that I can call these functions from within the
>> sub-folders in each type of cancer. My confusion is that I'm not sure of
>> the syntax to make this happen. For example:
>>
>> Within the "Prostate_Cancer" folder, I have the following folders:
>> "curated" AND "src" AND "uncurated"
>>
>> Within "uncurated" I have a ton of files, one of which could be:
>> PMID5377_fullpdata.csv
>>
>> within "src" I have my R scripts, the one corresponding to the above
>> "uncurated" file would be:
>> PMID5377_curation.R
>>
>> Here's the problem I'm trying to address:
>> Many of the uncurated files will require the same R code to curate them
>> and I find myself spending a lot of time copying and pasting the same
>> code over and over. I've spent at least 40 hours copying code I've
>> already written and pasting it into a new dataset. There has simply got
>> to be a better way to do this.
>>
>
> There is: you should put your common functions in a package. Packages are
> a good way to organize your own code, you don't need to publish them. (You
> will get a warning if you put "Not for distribution" into the License field
> in the DESCRIPTION file, but it's just a warning.) You can also put
> datasets in a package; this makes sense if they are relatively static. If
> you get new data every day you probably wouldn't.
>
>> A common example of the code I'll write in an "uncurated" file is the
>> following (let's call the following snippet of code UNCURATED_EXAMPLE1):
>> ##characteristics_ch1.2 -> G
>> tmp <- uncurated$characteristics_ch1.2
>> tmp <- sub("grade: ","",tmp,fixed=TRUE)
>> tmp[tmp=="I"] <- "low"
>> tmp[tmp=="II"] <- "low"
>> tmp[tmp=="III"] <- "high"
>> curated$G <- tmp
>>
>> The thing that changes depending on the dataset is *typically* the column
>> header (ie "uncurated$characteristics_ch1.2" might be
>> "uncurated$description" or "uncurated_characteristics_ch1.7" depending on
>> the dataset), although sometimes I want to substitute different words (ie
>> "grade" can be referred to in many different ways).
>>
>> What's the easiest way to automate this? I'd like, at a minimum, to make
>> UNCURATED_EXAMPLE1 look like the following:
>> tmp <- uncurated$characteristics_ch1.2
>> insert_call_to_repeating_functions.R_and_access_("grade")_function
>> curated$G <- tmp
>>
>> It would be even better if I could say, for Prostate_Cancer, write one R
>> script that standardizes all the "uncurated" datasets; rather than writing
>> 100 different R scripts. Although I don't know how feasible this is.
>>
>
> Both of those sound very easy. For example,
>
> curate <- function(characteristic, word="grade: ") {
>   tmp <- sub(word, "", characteristic, fixed=TRUE)
>
>   tmp[tmp=="I"] <- "low"
>   tmp[tmp=="II"] <- "low"
>   tmp[tmp=="III"] <- "high"
>   tmp
> }
>
> Then your script would just need one line
>
> curated$G <- curate(uncurated$characteristics_ch1.2)
>
> I don't know where you'll find the names of all the datasets, but if you
> can get them into a vector, it's pretty easy to write a loop that calls
> curate() for each one.
>
> Deciding how much goes in the package and how much is one-off code that
> stays with a particular dataset is a judgment call.
> I'd guess based on your description that curate() belongs in the package
> but the rest doesn't, but you know a lot more about the details than I do.
>
> Duncan Murdoch
>
>> I'm sorry if this sounds confusing. Basically, I have thousands of
>> "uncurated" datasets with clinical information and I'm trying to
>> standardize all the datasets via R scripts so that all the information is
>> standardized for statistical analysis. Not all of the datasets contain
>> the same information, but many of them do contain similar data (ie age,
>> stage, grade, days_to_recurrence, and many others). Furthermore, in many
>> cases the standardization code is very similar across datasets (ie I'll
>> want to delete the words "Age: " before the actual number). But this is
>> not always the case (ie sometimes a dataset will not put the different
>> patient data (ie age, stage, grade) in separate columns, instead putting
>> it all in one column, so I have to write a different function to split it
>> by the ";" and make a new table that is separated by column). Anyway, I
>> would be forever grateful for any advice to make this quicker and am
>> happy to provide any clarifications.
>>
>> Thank you very much.
>>
>> Ben
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>