Hello Everyone,

I'm trying to learn a little bit about data mining. I'm working on a text mining 
project that will attempt to predict whether cancer patients received a particular 
type of genetic testing. A subsequent stage will then aim to predict what the 
results of that testing were. 
 
I've used the tm package to prepare my data and am planning to use rattle to do 
the actual data mining. The tm package has proved to be a great help so far. 
I've managed to perform a variety of transformations of my data. I've also 
managed to create a document-term matrix that has a row for each of my patients 
and columns for each of the terms in my patient medical records. 
 
Because I'm not yet a particularly good R programmer, I've converted my 
document-term matrix to a data frame and then added information about the 
genetic testing. 
 
So here's the thing. The tm package has a feature that would allow me to drop 
words that occur infrequently in patient medical records. However, I've been 
asked not to use it because it's believed that even infrequently occurring 
terms may be highly diagnostic. The consequence is that my data frame has a 
large number of columns for the various words. In fact, over 27,000 of them.
 
So my question is how to reduce this to a more manageable number. One 
thought has been to look at semi-partial correlations. Here these would be 
between tested (y/n) and each predictor, controlling for the length of the 
medical record. The idea would be to carry forward into the actual data mining 
only those predictors whose semi-partial correlations were significant.
 
Is this likely to be a good approach? Or is there likely to be a better way of 
doing it?
 
If it is a good approach, I’m wondering how to go about obtaining the necessary 
results. I’ve managed to figure out how to compute semi-partial correlations 
using the spcor.test() function in the ppcor package, as in:
 
> spcor.test(as.numeric(Tested$TestStatus=="Yes"), Tested$predictor,
+            Tested$nchar_record)
 
   estimate      p.value statistic   n gp  Method
1 0.3853547 2.307562e-08  5.587203 182  1 pearson
 
This is fine for a single pair of variables. What I’d need though is to combine 
a whole series of such outputs, one for each of my predictors. After that, I’d 
need to be able to determine which semi-partial correlations were significant 
(or perhaps substantial) and to create a list that I could use to eliminate a 
lot of the predictors from my data frame. I’m just beginning to use R in my 
day-to-day work. So it’s not clear to me how to do this. 
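
To make the task concrete, here is a rough sketch of the loop-and-filter step 
described above. It is only an illustration under several assumptions. It uses 
base R rather than ppcor, computing each semi-partial correlation as the 
correlation between tested (y/n) and the predictor after record length has been 
regressed out of the predictor only, which is what the spcor.test() call above 
appears to do, though that is worth checking against the ppcor documentation. 
The t statistic uses n - 2 - 1 degrees of freedom, which reproduces the 
statistic in the output shown (n = 182, gp = 1). The toy data, the column names 
term_a, term_b and term_c, the helper semi_partial(), the Benjamini-Hochberg 
adjustment and the 0.05 cutoff are all made up for the example and are not part 
of the real project.

```r
# Hypothetical toy data standing in for the real data frame 'Tested'.
set.seed(1)
n <- 182
Tested <- data.frame(
  TestStatus   = sample(c("Yes", "No"), n, replace = TRUE),
  nchar_record = round(runif(n, 500, 5000))
)
# Three made-up term-count columns; term_a is built to track TestStatus.
Tested$term_a <- rpois(n, 1 + 2 * (Tested$TestStatus == "Yes"))
Tested$term_b <- rpois(n, Tested$nchar_record / 1000)
Tested$term_c <- rpois(n, 2)

y     <- as.numeric(Tested$TestStatus == "Yes")   # outcome coded 0/1
z     <- Tested$nchar_record                      # control variable
terms <- setdiff(names(Tested), c("TestStatus", "nchar_record"))

# Hypothetical helper: semi-partial correlation of y with x, where z is
# removed from x only (x is replaced by its residuals from lm(x ~ z)).
semi_partial <- function(x, y, z) {
  r    <- cor(y, resid(lm(x ~ z)))
  df   <- length(y) - 3             # n - 2 - (one control variable)
  stat <- r * sqrt(df / (1 - r^2))
  c(estimate = r, statistic = stat, p.value = 2 * pt(-abs(stat), df))
}

# One row of results per predictor column.
res      <- as.data.frame(t(sapply(Tested[terms], semi_partial, y = y, z = z)))
res$term <- rownames(res)

# With thousands of tests, some adjustment for multiple testing seems prudent;
# Benjamini-Hochberg is used here purely as an example.
res$p.adj <- p.adjust(res$p.value, method = "BH")

# Names of the predictors to keep, and the reduced data frame.
keep    <- res$term[!is.na(res$p.adj) & res$p.adj < 0.05]
Reduced <- Tested[c("TestStatus", "nchar_record", keep)]
```

Filtering on abs(res$estimate) instead of res$p.adj would select "substantial" 
rather than "significant" predictors. With some 27,000 columns the sapply() 
call fits one small regression per column, so it may take a little while to 
run.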
 
Thanks,
 
Paul

 

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
