Hello All,

This will probably be easy for some but isn't for me. I am currently working on
a text mining exercise. I want to predict whether cancer patients got KRAS
testing and, if so, whether the test yielded a result of wild type/negative or
mutant/positive. I've begun with a "bag-of-words" approach that counts specific
terms in the medical records and then uses some of those counts as predictors.

This works well for predicting whether or not patients got tested. It is not so
good when it comes to predicting the outcome of testing. The trouble is that a
patient's record can contain a reference to KRAS testing and also many
references to, say, "positive" where that term has nothing to do with the
result of the KRAS testing.

So I'd like to count the instances in a patient's medical record where
relevant terms like "wild type", "negative", "mutant", or "positive" come
either shortly before or shortly after "KRAS". It would be great if there were
a way to do this within the tm package, which I've found very helpful for
preparing my data thus far.

If not though, I have a data frame that contains patient number in one column 
and the patient's complete text medical record in another. So some sort of 
regular expression likely would work just fine. 
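
A minimal sketch of the regular-expression route in base R. The data frame
`records`, its column names, and the three-intervening-word limit are all
assumptions made for illustration, and only "positive" and "negative" are
searched for to keep the pattern short.

```r
# Hypothetical data: one row per patient, full record text in `text`.
records <- data.frame(
  patient = c(1, 2, 3),
  text = c(
    "Tumor is KRAS negative",
    "Tumor is positive for KRAS mutation",
    "Will conduct KRAS testing prior to initiation of therapy. Bilirubin positive."
  ),
  stringsAsFactors = FALSE
)

# A result term with at most 3 intervening words after KRAS,
# or at most 3 intervening words before it.
# Note: \w+ treats each piece of a date such as xx/xx/xxxx as a separate word.
pattern <- paste0(
  "\\bKRAS\\W+(?:\\w+\\W+){0,3}?(?:positive|negative)\\b",
  "|",
  "\\b(?:positive|negative)\\W+(?:\\w+\\W+){0,3}?KRAS\\b"
)

# Count non-overlapping matches in each element of a character vector.
count_matches <- function(x, pattern) {
  m <- gregexpr(pattern, x, perl = TRUE, ignore.case = TRUE)
  # gregexpr() returns -1 when there is no match, so count positions > 0.
  vapply(m, function(hit) sum(hit > 0), integer(1))
}

records$kras_result_mentions <- count_matches(records$text, pattern)
records[, c("patient", "kras_result_mentions")]
```

With these three records the counts come out as 1, 1, and 0: the "Bilirubin
positive" mention is too far from KRAS to be picked up.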

Here are some examples of the sort of thing I'm looking to count:

"Received KRAS testing results on xx/xx/xxxx. Test results indicate the 
presence of a mutation."

"Tumor is KRAS negative"

"KRAS (mutated)" 

"Tumor is positive for KRAS mutation" 

And here's an example of something I want to ignore.

"Will conduct KRAS testing prior to initiation of therapy. ... (Several lines 
of material) ... Bilirubin positive."

A couple of things stand out here. The first is that I need to be able to pick 
up on variations of the relevant terms. So, for example, that means being able 
to identify that either "mutant" or "mutated" came in close proximity to 
"KRAS". 

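One way to pick up such variations is to match on word stems rather than
whole words. The sketch below is only an illustration: the two term families
are assumptions and would need extending to cover the vocabulary that actually
appears in the records.

```r
# Assumed term families; "mutat\\w*" covers mutated, mutation, mutations, ...
mutant_terms   <- "\\b(?:mutant|mutat\\w*|positive)\\b"
wildtype_terms <- "\\b(?:wild[- ]?type|negative)\\b"

examples <- c(
  "KRAS (mutated)", "KRAS mutant", "presence of a mutation",
  "KRAS wild-type", "KRAS wildtype", "Tumor is KRAS negative"
)

# Flag which family, if any, each snippet mentions.
flags <- data.frame(
  text     = examples,
  mutant   = grepl(mutant_terms, examples, perl = TRUE, ignore.case = TRUE),
  wildtype = grepl(wildtype_terms, examples, perl = TRUE, ignore.case = TRUE),
  stringsAsFactors = FALSE
)
flags
```

The first three snippets are flagged as mutant and the last three as wild
type.
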
The other thing is that while increasing the number of words to look forward
and backward will identify more valid cases, it will also tend to identify more
invalid ones. For example, looking as many as 12 words after KRAS will
lead to correct identification of:

"Received KRAS testing results on xx/xx/xxxx. Test results indicate the 
presence of a mutation."

but also incorrect identification of:

"Will conduct KRAS testing prior to initiation of therapy. Note that patient 
was positive for Lynch mutation."

I think I will need to keep the window short in order to obtain the best
results. It would be nice if I could easily increase or decrease the number of
words to look forward and backward, though. It would also be good if I could,
say, select a relatively small number of words to look backward and a larger
number of words to look forward.
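
An adjustable window of this kind can be sketched by splitting the text into
words and checking the positions around each "KRAS". The function name
`count_near`, its arguments, and the term list are all hypothetical; words are
split on whitespace, so a date such as xx/xx/xxxx counts as one word.

```r
count_near <- function(text, anchor = "kras",
                       terms = "^(wildtype|negative|mutant|mutat.*|positive)$",
                       before = 2, after = 5) {
  text <- tolower(text)
  # Treat "wild type" as a single token ("wild-type" collapses below anyway).
  text <- gsub("wild\\s+type", "wildtype", text)
  # Split on whitespace, then strip punctuation from each token.
  words <- strsplit(text, "\\s+")[[1]]
  words <- gsub("[^a-z0-9]", "", words)
  words <- words[nzchar(words)]

  hits <- which(words == anchor)
  if (length(hits) == 0) return(0L)

  # Positions inside any window: `before` words back, `after` words ahead.
  idx <- unlist(lapply(hits, function(i) {
    seq(max(1, i - before), min(length(words), i + after))
  }))
  # Count each position once and exclude the anchor word itself.
  idx <- setdiff(unique(idx), hits)
  sum(grepl(terms, words[idx]))
}

good <- "Received KRAS testing results on xx/xx/xxxx. Test results indicate the presence of a mutation."
bad  <- "Will conduct KRAS testing prior to initiation of therapy. Note that patient was positive for Lynch mutation."

count_near(good, after = 12)  # wide window: finds "mutation"
count_near(bad,  after = 12)  # wide window: also picks up the Lynch "positive"
count_near(bad,  after = 5)   # narrow window: finds nothing
```

Applied to a data frame, something like
`sapply(records$text, count_near, before = 2, after = 5)` would give one count
per patient, where `records$text` stands for the column of record text.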

Having gotten to the end of this description it occurs to me this is actually 
harder than I thought.

If one of you gurus could help me out, that would be greatly appreciated.

Thanks,

Paul

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
