On 07/06/2011 12:41 PM, Ben Ganzfried wrote:
Hi,

My project is set up the following way. The root directory contains the
following folders:
   "Breast_Cancer", "Colorectal_Cancer", "Lung_Cancer", and "Prostate_Cancer"

I want to create a file, call it: "repeating_functions.R" and place it in
the root directory such that I can call these functions from within the
sub-folders in each type of cancer.  My confusion is that I'm not sure of
the syntax to make this happen.  For example:

Within the "Prostate_Cancer" folder, I have the following folders:
"curated" AND "src" AND "uncurated"

Within "uncurated" I have a ton of files, one of which could be:
PMID5377_fullpdata.csv

within "src" I have my R scripts, the one corresponding to the above
"uncurated" file would be:
PMID5377_curation.R

Here's the problem I'm trying to address:
Many of the uncurated files will require the same R code to curate them and
I find myself spending a lot of time copying and pasting the same code over
and over. I've spent at least 40 hours copying code I've already written and
pasting it into a new dataset.  There has simply got to be a better way to
do this.

There is: you should put your common functions in a package. Packages are a good way to organize your own code, and you don't need to publish them. (You will get a warning if you put "Not for distribution" into the License field in the DESCRIPTION file, but it's just a warning.) You can also put datasets in a package; this makes sense if they are relatively static. If you get new data every day, you probably wouldn't.
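
As a rough sketch of how such a package could be started, package.skeleton() from the utils package can build the directory layout around an existing code file. The package name "curationtools" and the temporary paths below are made up for illustration; in practice you would point code_files at your own repeating_functions.R.

```r
# Write a shared function to a file. In practice this would be the
# existing repeating_functions.R; a temporary copy keeps the demo self-contained.
code_file <- file.path(tempdir(), "repeating_functions.R")
writeLines(c(
  "curate <- function(characteristic, word = \"grade: \") {",
  "  tmp <- sub(word, \"\", characteristic, fixed = TRUE)",
  "  tmp[tmp == \"I\"] <- \"low\"",
  "  tmp[tmp == \"II\"] <- \"low\"",
  "  tmp[tmp == \"III\"] <- \"high\"",
  "  tmp",
  "}"
), code_file)

# Create the package skeleton (DESCRIPTION, R/, man/) in a scratch directory.
# "curationtools" is a hypothetical package name.
pkg_parent <- file.path(tempdir(), "pkg_demo")
dir.create(pkg_parent, showWarnings = FALSE)
package.skeleton(name = "curationtools", code_files = code_file,
                 path = pkg_parent, force = TRUE)

pkg_dir <- file.path(pkg_parent, "curationtools")
list.files(pkg_dir)

# After editing DESCRIPTION and the generated help page templates in man/,
# the package can be installed from a shell with
#   R CMD INSTALL curationtools
# and each curation script then only needs
#   library(curationtools)
```

The generated help files are only templates, so expect to fill them in (or remove them) before the package installs cleanly.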
A common example of the code I'll write in an "uncurated" file is the
following (let's call the following snippet of code UNCURATED_EXAMPLE1):
##characteristics_ch1.2 ->  G
tmp<- uncurated$characteristics_ch1.2
tmp<- sub("grade: ","",tmp,fixed=TRUE)
tmp[tmp=="I"]<- "low"
tmp[tmp=="II"]<- "low"
tmp[tmp=="III"]<- "high"
curated$G<- tmp

The thing that changes depending on the dataset is *typically* the column
header (e.g. "uncurated$characteristics_ch1.2" might be
"uncurated$description" or "uncurated$characteristics_ch1.7" depending on
the dataset), although sometimes I want to substitute different words (e.g.
"grade" can be referred to in many different ways).

What's the easiest way to automate this?  I'd like, at a minimum, to make
UNCURATED_EXAMPLE1 look like the following:
tmp<- uncurated$characteristics_ch1.2
insert_call_to_repeating_functions.R_and_access_("grade")_function
curated$G<- tmp

It would be even better if I could say, for Prostate_Cancer, write one R
script that standardizes all the "uncurated" datasets; rather than writing
100 different R scripts.  Although I don't know how feasible this is.

Both of those sound very easy.   For example,

curate <- function(characteristic, word="grade: ") {
  tmp <- sub(word, "", characteristic, fixed=TRUE)
  tmp[tmp=="I"] <- "low"
  tmp[tmp=="II"] <- "low"
  tmp[tmp=="III"] <- "high"
  tmp
}

Then your script would just need one line

curated$G <- curate(uncurated$characteristics_ch1.2)
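
The same function also covers the case where the column header or the prefix differs between datasets. A small sketch follows; the "description" column holding values like "Grade=I" is invented here to illustrate the idea and is not taken from a real dataset.

```r
# curate() as defined above
curate <- function(characteristic, word = "grade: ") {
  tmp <- sub(word, "", characteristic, fixed = TRUE)
  tmp[tmp == "I"] <- "low"
  tmp[tmp == "II"] <- "low"
  tmp[tmp == "III"] <- "high"
  tmp
}

# Hypothetical dataset: the grade sits in a column called "description"
# and uses the prefix "Grade=" instead of "grade: ".
uncurated <- data.frame(description = c("Grade=I", "Grade=III", "Grade=II"),
                        stringsAsFactors = FALSE)
curated <- data.frame(row.names = seq_len(nrow(uncurated)))

# Only the column and the 'word' argument change from one dataset to the next.
curated$G <- curate(uncurated$description, word = "Grade=")
curated$G
```

Because sub() is called with fixed=TRUE, the prefix is matched literally, so characters such as "=" or "." in the prefix need no escaping.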

I don't know where you'll find the names of all the datasets, but if you can get them into a vector, it's pretty easy to write a loop that calls curate() for each one.
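
A minimal sketch of such a loop, assuming the uncurated files are CSVs in one directory and that you keep a small lookup of which column holds the grade in each file. The helper name curate_all(), the lookup vector, and the demo file contents are all invented for illustration.

```r
# curate() as defined above
curate <- function(characteristic, word = "grade: ") {
  tmp <- sub(word, "", characteristic, fixed = TRUE)
  tmp[tmp == "I"] <- "low"
  tmp[tmp == "II"] <- "low"
  tmp[tmp == "III"] <- "high"
  tmp
}

# Hypothetical helper: curate the grade column of every CSV in 'dir'.
# 'columns' is a named character vector: file name -> grade column name.
curate_all <- function(dir, columns) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  out <- lapply(files, function(f) {
    uncurated <- read.csv(f, stringsAsFactors = FALSE)
    col <- columns[[basename(f)]]
    data.frame(G = curate(uncurated[[col]]), stringsAsFactors = FALSE)
  })
  names(out) <- basename(files)
  out
}

# Demo with one made-up file in a scratch directory
demo_dir <- file.path(tempdir(), "uncurated_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(characteristics_ch1.2 = c("grade: I", "grade: III")),
          file.path(demo_dir, "PMID5377_fullpdata.csv"), row.names = FALSE)

result <- curate_all(demo_dir,
                     c(PMID5377_fullpdata.csv = "characteristics_ch1.2"))
result[["PMID5377_fullpdata.csv"]]
```

The lookup vector is the part that stays dataset-specific; everything else is shared.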

Deciding how much goes in the package and how much is one-off code that stays with a particular dataset is a judgment call. I'd guess based on your description that curate() belongs in the package but the rest doesn't, but you know a lot more about the details than I do.

Duncan Murdoch
I'm sorry if this sounds confusing.  Basically, I have thousands of
"uncurated" datasets with clinical information and I'm trying to standardize
all the datasets via R scripts so that all the information is standardized
for statistical analysis.  Not all of the datasets contain the same
information, but many of them do contain similar data (e.g. age, stage, grade,
days_to_recurrence, and many others).  Furthermore, in many cases the
standardization code is very similar across datasets (e.g. I'll want to delete
the words "Age: " before the actual number).  But this is not always the
case: sometimes a dataset will not put the different patient data (age,
stage, grade) in separate columns, instead putting it all in one column, so
I have to write a different function to split it by the ";" and make a new
table that is separated by column.  Anyway, I would be forever
grateful for any advice to make this quicker and am happy to provide any
clarifications.
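
The single-column case could be handled by one shared function as well. Below is a rough sketch, assuming each entry looks like "key: value" pairs separated by ";". The function name split_characteristics() and the sample strings are made up for illustration.

```r
# Hypothetical helper: turn strings such as "Age: 63; grade: II" into a
# data frame with one column per key. Keys missing from a row become NA.
split_characteristics <- function(x, sep = ";") {
  rows <- lapply(strsplit(x, sep, fixed = TRUE), function(fields) {
    fields <- trimws(fields)
    keys   <- trimws(sub(":.*$", "", fields))     # text before the first ":"
    values <- trimws(sub("^[^:]*:", "", fields))  # text after the first ":"
    setNames(values, keys)
  })
  keys <- unique(unlist(lapply(rows, names)))
  out <- lapply(keys, function(k) {
    vapply(rows, function(r) if (k %in% names(r)) r[[k]] else NA_character_,
           character(1))
  })
  names(out) <- keys
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Made-up example input; note the second row has a different field order
# and no stage.
combined <- c("Age: 63; grade: II; stage: T2",
              "grade: III; Age: 71")
pdata <- split_characteristics(combined)
pdata
```

All columns come back as character, so numeric fields such as Age would still need as.numeric() afterwards, and the grade column could be passed to a function like curate() with word="".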

Thank you very much.

Ben


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
