Thank you very much.  This is incredibly helpful.  I just created an R
package and put a bunch of code in it, and it works very well.  I have a
quick follow-up question.

Suppose across the uncurated data-sets, stage of cancer progression is
entered in the following way, where the column headers are all different
names:

Example Data-set 1    Example Data-set 2    Example Data-set 3
Tumor stage: T3b      pt stage: 3           stage: I
Tumor stage: T2b      pt stage: 2           stage: II
Tumor stage: T3a      pt stage: 1           stage: IV
Tumor stage: T4c      pt stage: 4           stage: IV

What would the code in the R package I'm creating look like such that the
corresponding code in each of these R scripts could be:

# or whatever the column header is called
curated$stage <- curate_stage(uncurated$characteristics_ch1.5)

That is, what code in the function curate_stage() would make it such that
the output would be:

Stage Output Data-set 1    Stage Output Data-set 2    Stage Output Data-set 3
3                          3                          1
2                          2                          2
3                          1                          4
4                          4                          4


Your helpful advice from before was that:

curate <- function(characteristic, word="grade: ") {
 tmp <- sub(word, "", characteristic, fixed=TRUE)

 tmp[tmp=="I"] <- "low"
 tmp[tmp=="II"] <- "low"
 tmp[tmp=="III"] <- "high"
 tmp
}

I'm wondering about situations where the wording isn't exactly the same
(i.e. I can't just find "grade: " and take the value after it) but is very
similar: the word "stage" appears in every row, but it is preceded and
followed by different text (e.g. "Tumor stage: T3b", "pt stage: 3", and
"stage: I").
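
To make the question concrete, here is a rough sketch of the kind of
function I have in mind.  It is only a guess at one possible approach, not
a tested solution: the regular expressions, the Roman-numeral lookup table,
and the choice to return NA for anything unrecognised are all my own
assumptions.

```r
# Hypothetical helper: pull a numeric stage out of strings such as
# "Tumor stage: T3b", "pt stage: 3", or "stage: IV".
curate_stage <- function(characteristic) {
  x <- as.character(characteristic)

  # Drop everything up to and including "stage", an optional colon, and any
  # following spaces, regardless of which words come before "stage".
  x <- sub("^.*stage:?[[:space:]]*", "", x, ignore.case = TRUE)
  x <- trimws(x)

  # Anything that matches neither pattern below stays NA.
  stage <- rep(NA_integer_, length(x))

  # Case 1: an Arabic digit, optionally preceded by "T" (e.g. "T3b", "3").
  # Only the first digit is kept; the sub-stage letter is discarded.
  has_digit <- grepl("^T?[0-9]", x, ignore.case = TRUE)
  stage[has_digit] <- as.integer(
    sub("^T?([0-9]).*$", "\\1", x[has_digit], ignore.case = TRUE)
  )

  # Case 2: an upper-case Roman numeral I-IV, optionally followed by a
  # sub-stage letter a-c (e.g. "II", "IIIb").
  roman <- c(I = 1L, II = 2L, III = 3L, IV = 4L)
  is_roman <- !has_digit & grepl("^(IV|I{1,3})[A-Ca-c]?$", x)
  stage[is_roman] <- unname(roman[sub("[A-Ca-c]$", "", x[is_roman])])

  stage
}
```

On the three example columns above, this should return 3 2 3 4, 3 2 1 4,
and 1 2 4 4 respectively.  Returning NA for unmatched values (rather than
passing the original text through) is deliberate, so that unexpected
encodings show up as missing values instead of being silently kept.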

Thanks in advance-- you guys are saving me countless hours of repetitive
coding, allowing me to move onto the more important and interesting parts of
the study. I really appreciate all your suggestions and help.

Sincerely,

Ben


On Tue, Jun 7, 2011 at 12:53 PM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:

> On 07/06/2011 12:41 PM, Ben Ganzfried wrote:
>
>> Hi,
>>
>> My project is set up the following way:
>> root directory contains the following folders:
>>   folders: "Breast_Cancer" AND "Colorectal_Cancer" AND "Lung_Cancer" AND
>> "Prostate_Cancer"
>>
>> I want to create a file, call it: "repeating_functions.R" and place it in
>> the root directory such that I can call these functions from within the
>> sub-folders in each type of cancer.  My confusion is that I'm not sure of
>> the syntax to make this happen.  For example:
>>
>> Within the "Prostate_Cancer" folder, I have the following folders:
>> "curated" AND "src" AND "uncurated"
>>
>> Within "uncurated" I have a ton of files, one of which could be:
>> PMID5377_fullpdata.csv
>>
>> within "src" I have my R scripts, the one corresponding to the above
>> "uncurated" file would be:
>> PMID5377_curation.R
>>
>> Here's the problem I'm trying to address:
>> Many of the uncurated files will require the same R code to curate them
>> and I find myself spending a lot of time copying and pasting the same code
>> over and over. I've spent at least 40 hours copying code I've already
>> written and pasting it into a new dataset.  There has simply got to be a
>> better way to do this.
>>
>
> There is:  you should put your common functions in a package.  Packages are
> a good way to organize your own code, you don't need to publish them.  (You
> will get a warning if you put "Not for distribution" into the License field
> in the DESCRIPTION file, but it's just a warning.)  You can also put
> datasets in a package; this makes sense if they are relatively static.  If
> you get new data every day you probably wouldn't.
>
>> A common example of the code I'll write in an "uncurated" file is the
>> following (let's call the following snippet of code UNCURATED_EXAMPLE1):
>> ##characteristics_ch1.2 ->  G
>> tmp<- uncurated$characteristics_ch1.2
>> tmp<- sub("grade: ","",tmp,fixed=TRUE)
>> tmp[tmp=="I"]<- "low"
>> tmp[tmp=="II"]<- "low"
>> tmp[tmp=="III"]<- "high"
>> curated$G<- tmp
>>
>> The thing that changes depending on the dataset is *typically* the column
>> header (ie "uncurated$characteristics_ch1.2" might be
>> "uncurated$description" or "uncurated_characteristics_ch1.7" depending on
>> the dataset), although sometimes I want to substitute different words (ie
>> "grade" can be referred to in many different ways).
>>
>> What's the easiest way to automate this?  I'd like, at a minimum, to make
>> UNCURATED_EXAMPLE1 look like the following:
>> tmp<- uncurated$characteristics_ch1.2
>> insert_call_to_repeating_functions.R_and_access_("grade")_function
>> curated$G<- tmp
>>
>> It would be even better if I could say, for Prostate_Cancer, write one R
>> script that standardizes all the "uncurated" datasets; rather than writing
>> 100 different R scripts.  Although I don't know how feasible this is.
>>
>
> Both of those sound very easy.   For example,
>
> curate <- function(characteristic, word="grade: ") {
>  tmp <- sub(word, "", characteristic, fixed=TRUE)
>
>  tmp[tmp=="I"] <- "low"
>  tmp[tmp=="II"] <- "low"
>  tmp[tmp=="III"] <- "high"
>  tmp
> }
>
> Then your script would just need one line
>
> curated$G <- curate(uncurated$characteristics_ch1.2)
>
> I don't know where you'll find the names of all the datasets, but if you
> can get them into a vector, it's pretty easy to write a loop that calls
> curate() for each one.
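
For reference, a loop along those lines might look roughly like the sketch
below.  Everything other than curate() itself is assumed for illustration:
the second dataset ID (PMID0000), the grade_column lookup vector, and the
toy data frames that stand in for read.csv() calls on the real uncurated
files.

```r
# curate() as suggested in this thread.
curate <- function(characteristic, word = "grade: ") {
  tmp <- sub(word, "", characteristic, fixed = TRUE)
  tmp[tmp == "I"] <- "low"
  tmp[tmp == "II"] <- "low"
  tmp[tmp == "III"] <- "high"
  tmp
}

# Hypothetical lookup: which column holds the grade in each dataset.
grade_column <- c(PMID5377 = "characteristics_ch1.2",
                  PMID0000 = "description")

# Toy stand-ins for reading the real uncurated CSV files from disk.
uncurated_list <- list(
  PMID5377 = data.frame(characteristics_ch1.2 = c("grade: I", "grade: III"),
                        stringsAsFactors = FALSE),
  PMID0000 = data.frame(description = c("grade: II", "grade: III"),
                        stringsAsFactors = FALSE)
)

# One pass over all dataset IDs, collecting a curated table for each.
curated_list <- list()
for (id in names(grade_column)) {
  uncurated <- uncurated_list[[id]]
  curated <- data.frame(G = curate(uncurated[[grade_column[[id]]]]),
                        stringsAsFactors = FALSE)
  curated_list[[id]] <- curated
}
```

Keeping the per-dataset differences (here just the column name) in one
lookup vector means the loop body stays the same for every dataset.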
>
> Deciding how much goes in the package and how much is one-off code that
> stays with a particular dataset is a judgment call.  I'd guess based on your
> description that curate() belongs in the package but the rest doesn't, but
> you know a lot more about the details than I do.
>
> Duncan Murdoch
>
>> I'm sorry if this sounds confusing.  Basically, I have thousands of
>> "uncurated" datasets with clinical information and I'm trying to
>> standardize all the datasets via R scripts so that all the information is
>> standardized for statistical analysis.  Not all of the datasets contain
>> the same information, but many of them do contain similar data (ie age,
>> stage, grade, days_to_recurrence, and many others).  Furthermore, in many
>> cases the standardization code is very similar across datasets (ie I'll
>> want to delete the words "Age: " before the actual number).  But this is
>> not always the case (ie sometimes a dataset will not put the different
>> patient data (ie age, stage, grade) in separate columns, instead putting
>> it all in one column, so I have to write a different function to split it
>> by the ";" and make a new table that is separated by column).  Anyway, I
>> would be forever grateful for any advice to make this quicker and am happy
>> to provide any clarifications.
>>
>> Thank you very much.
>>
>> Ben
>>
>>        [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>

