Dear Peter,

There are several packages that try to address this type of problem
(although Max's remarks are something that we should always keep in
mind), and I also recommend those that perform some form of regularized,
penalized or shrunken linear discriminant analysis with a preliminary
variable selection step.

You can take a look at the hda, rda, sda, SDDA, HDclassif or my own
HiDimDA packages for some of the most important alternatives.

Hope this helps.

Best,
Pedro

Pedro Duarte Silva
Associate Professor of Statistics and Operations Research
Faculdade de Economia e Gestão
Universidade Católica Portuguesa / Porto
www.feg.porto.ucp.pt

Date: Mon, 19 Nov 2012 20:53:10 +0100
From: Peter Kupfer <peter.kup...@me.com>
To: Max Kuhn <mxk...@gmail.com>
Cc: "r-help@r-project.org" <r-help@r-project.org>
Subject: Re: [R] Classification methods - which one?
Message-ID: <ed56664a-e8ef-4733-a12b-35117f347...@me.com>
Content-Type: text/plain; CHARSET=US-ASCII

Dear Max,

First, thanks a lot for your suggestion and the frank words about
methods in real life. I guess that is exactly my problem.

Regarding my analysis: yes, that's the problem. I am coerced into doing
this analysis, because there is no time to start with something else or
with other methods.

So you suggest Linear Discriminant Analysis. Is there a particular
package you recommend?

Nearest Shrunken Centroids I checked with the package PAMR
(http://www-stat.stanford.edu/~tibs/PAM/Rdist/doc/readme.html). The
example works fine, but I guess I have too many rows (genes, in this
case) for the analysis.

My main problem is that I cannot reduce the number of genes, because
some of the bosses want to compare the output of the classification
methods with a rule-based algorithm that works with all genes on the
array (after P/A calls and an alternative CDF). So a reduction of the
17 000 genes is only possible in a limited way (to around 7000 genes
after some pre-processing steps).

I am more than happy about any tips and suggestions.

Best
Peter

On 19.11.2012 at 16:36, Max Kuhn <mxk...@gmail.com> wrote:

> My suggestion is not to do any predictive modeling. Basically, the
> data doesn't support a sensible and reproducible model. Yes, the
> literature is saturated with this type of analysis but almost none of
> the examples have any utility in real life.
>
> Stick to differential expression analysis, investigate the results
> statistically and biologically, then design a prospective experiment
> with a specific set of genes and a more refined measurement system.
>
> If you are doing this analysis to learn something from the data (as
> opposed to generating accurate predictions), a predictive model is one
> of the worst ways of going about it.
>
> If you are coerced to do this analysis, stick to linear methods
> (regularized LDA, nearest shrunken centroids, etc.) that are less
> likely to over-fit, and bias yourself towards those that have embedded
> feature selection.
>
> Max
>
>
> On Mon, Nov 19, 2012 at 10:16 AM, Peter Kupfer <peter.kup...@me.com> wrote:
>> Dear all,
>> I searched for some classification methods and I have no clue whether
>> I picked the right ones.
>> My problem: I have a matrix with 17000 rows and 33 columns (genes and
>> patients). The patients are grouped into 3 diseases.
>> Now I want to classify the patients and, of course, I want to know
>> which rows are more helpful for the classification than others.
>>
>> I tried SVM and random forest. Do you think these are the right
>> classification methods? Maybe there are some hints you can give me. I
>> am more familiar with the Bioconductor packages. Furthermore: this
>> is/was not my field of study in the past, but I want to understand it
>> and I am willing to deal with this field.
>> It would be amazing if one of the (more) mathematical people could
>> give me a hint.
>> Thanks and all the best
>>
>> Peter
>>
>>
>> PS: I can upload my underlying data if somebody is interested
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
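
For readers following the thread, the nearest shrunken centroids idea that Max mentions can be sketched in a few lines of base R. This is a simplified illustration on simulated data, not the pamr implementation: the function names nsc_fit, nsc_predict and loo_error, the simulated matrix and the threshold grid are all assumptions made for this example. The threshold is chosen by leave-one-out cross-validation with the gene selection redone inside every fold, since selecting genes on all patients first would make the error estimate too optimistic.

```r
set.seed(1)

# Simulated stand-in for the real data: genes in rows, patients in columns.
n_genes <- 2000
y <- factor(rep(c("A", "B", "C"), each = 11))       # 33 patients, 3 diseases
x <- matrix(rnorm(n_genes * length(y)), nrow = n_genes)
x[1:20,  y == "A"] <- x[1:20,  y == "A"] + 2        # genes informative for A
x[21:40, y == "B"] <- x[21:40, y == "B"] + 2        # genes informative for B

# Fit shrunken centroids for a given threshold delta (simplified sketch).
nsc_fit <- function(x, y, delta) {
  n <- ncol(x)
  classes <- levels(y)
  n_k <- as.vector(table(y))
  overall <- rowMeans(x)
  centroids <- sapply(classes, function(k) rowMeans(x[, y == k, drop = FALSE]))
  # Pooled within-class standard deviation per gene, plus a positive offset.
  resid <- x - centroids[, as.integer(y)]
  s <- sqrt(rowSums(resid^2) / (n - length(classes)))
  s0 <- median(s)
  m_k <- sqrt(1 / n_k - 1 / n)
  scale <- outer(s + s0, m_k)
  # Standardized class differences, soft-thresholded towards zero.
  d <- (centroids - overall) / scale
  d_shrunk <- sign(d) * pmax(abs(d) - delta, 0)
  list(centroids = overall + scale * d_shrunk,
       sd = s + s0,
       prior = n_k / n,
       classes = classes,
       selected = which(rowSums(d_shrunk != 0) > 0))  # genes still in use
}

# Assign each column of newx to the class with the closest shrunken centroid.
nsc_predict <- function(fit, newx) {
  scores <- sapply(seq_along(fit$classes), function(k) {
    colSums(((newx - fit$centroids[, k]) / fit$sd)^2) - 2 * log(fit$prior[k])
  })
  scores <- matrix(scores, nrow = ncol(newx))
  factor(fit$classes[apply(scores, 1, which.min)], levels = fit$classes)
}

# Leave-one-out error; the whole fit, including gene selection, is redone
# without the held-out patient.
loo_error <- function(x, y, delta) {
  wrong <- vapply(seq_len(ncol(x)), function(j) {
    fit <- nsc_fit(x[, -j, drop = FALSE], y[-j], delta)
    nsc_predict(fit, x[, j, drop = FALSE]) != y[j]
  }, logical(1))
  mean(wrong)
}

deltas <- c(0, 1, 2, 3, 4)                          # arbitrary example grid
errors <- sapply(deltas, function(d) loo_error(x, y, d))
print(data.frame(delta = deltas, loo_error = errors))

fit <- nsc_fit(x, y, delta = deltas[which.min(errors)])
cat("genes retained:", length(fit$selected), "\n")
```

With real data, x would be the expression matrix with genes in rows and patients in columns, and fit$selected would give the row indices of the genes that survive the chosen threshold. With only 33 patients, the cross-validated error itself remains very noisy, which is the caveat Max raises above.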