Hi Dennis, My replies are in-line.
On Tue, Apr 26, 2011 at 9:15 PM, Dennis Murphy <[email protected]> wrote:

> Hi:
>
> My view, which may well be narrow, is that techniques like PLS and PCR
> are useful fit procedures, but I would be very leery about using them
> as prediction machines. With new data, why should a similar set of
> principal components emerge? Why should the ordering be (close to) the
> same? Why should features present in the training data necessarily be
> present in test data? And if the PCs vary considerably from one set of
> data to another, what's the point of prediction, since the covariate
> set is variable from one iteration to the next? Thinking a little more
> mathematically, why should I believe that the same set of basis
> functions (covariates + PCs) would reasonably apply to future data?
> One problem, as I see it, is that the principal components, when used
> as basis functions, are functions of the training data; in that
> context, why is it believable that they would well predict future
> data? [If this is Greek to you (or 'Kling-on', as one of my friends
> says), the basis functions in regression are the columns of the model
> matrix X, which map to the terms in the 'linear predictor'.] One of
> the potential problems is that the effective dimension of the reduced
> PC space may well change from one data set to the next. If all PCs are
> retained, then there is a serious danger of overfitting, which is a
> serious problem in prediction.
>
> If you're going to contemplate using such models for prediction, I
> would seriously consider looking into model validation procedures;
> they should provide some clue about how well a fitted model predicts
> to new cases. One of the best treatments of the subject I know is
> Frank Harrell's Regression Modeling Strategies book (which I believe
> will have a new edition out within the next couple of months). There
> is a current thread about this topic re logistic regression validation
> where the OP has done a nice job of working through the process -
> Prof. Harrell has chimed in a few times with some nice comments and
> observations. Most of the code to do this kind of thing in R resides
> in the rms package; see ?validate and its related functions. I don't
> know if it can be applied to PLS/PCR models (rather doubtful) but the
> methodology is what is important; e.g., the estimation of optimism in
> various figures of merit (e.g., R^2, MSE) when applied over a number
> of test sets, which provides an indication of how much bias is present
> in the fitted model due to potential overfitting. The process relies
> heavily on bootstrapping, so is in some sense vulnerable to the issues
> that arise with the bootstrap (e.g., population undercoverage), but in
> very large training sets this becomes less of a problem. If you can
> validate a PCR model and provide evidence to back it up, then most
> people (present company included) would have less ammunition to attack
> your prediction model.
>

Thank you for these suggestions. The pls package I am using does
include cross-validation methods for evaluating the quality of
PCR/PLSR models and for selecting the number of components to use in
prediction, which should help avoid overfitting. I will also have a
look at the rms package.

>
> On Tue, Apr 26, 2011 at 11:26 AM, Alison Callahan <[email protected]> wrote:
> > Hello again all,
> >
> > I am responding to my own earlier post about a "non-conformable
> > arguments" error with the predict() function of the pls package
> > (http://cran.r-project.org/web/packages/pls/) in R 2.13.0 (running
> > in Ubuntu 10.10).
> >
> > I believe I have narrowed down the cause of the error.
> > My new understanding is that if the test data to be predicted using
> > a regression model (where the test data is passed in as 'newdata' to
> > the predict() function) does not contain all possible levels of
> > factors in the training data then the predict() function returns a
> > "non-conformable arguments" error.
> >
> > However, this seems like an odd behaviour to me. Surely not all new
> > data for which the dependent variable(s) are to be predicted will
> > contain all levels of a factor present in the training data. Can
> > someone shed some light on why the predict() function of the pls
> > package has this behaviour? And how to avoid it if possible in a way
> > that doesn't involve users having to insert dummy values in new data?
>
> I don't find this odd at all; rather, I find it comforting. From an R
> programming perspective, the factors in your newdata should have
> exactly the same defined levels as those in the training data. You
> could do this with something like
>
> newdata$somefactor <- factor(newdata$somefactor,
>                              levels = levels(trainingdata$somefactor))
>
> What happens if, in future data, one or more new levels of a factor
> arise that were not in the training data used to build the prediction
> model?
>

I absolutely agree with you. New factor levels in future data that did
not exist in the training data would of course be a problem for
prediction. However, in my case I am trying to use predict() on new
data that contains only a *subset* of the factor levels present in the
training data, and I am getting a "non-conformable arguments" error.
For example, my training data has levels A, B, C, D and E for a given
factor, while my test data contains only levels B, C and D. Being
somewhat new to R, I confused the values a factor takes in the new
data with the set of possible levels of that factor. Once I declared
the levels of the factor in my test data to be the same as in the
training data, the "non-conformable arguments" error went away.
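
For anyone who finds this thread later, here is a minimal sketch of
what I think was going on. It uses base R's model.matrix() rather than
pls itself, and the data frames `train` and `test` and their columns
`x` and `grp` are made up for illustration. I am assuming that
predict() built its design matrix from whatever levels it found in
newdata; that is consistent with the error message, but I have not
confirmed it in the pls source.

```r
# Hypothetical toy data: the training factor has levels A-E,
# while the test data only contains B, C and D.
train <- data.frame(x = c(1, 2, 3, 4, 5),
                    grp = factor(c("A", "B", "C", "D", "E")))
test <- data.frame(x = c(6, 7, 8),
                   grp = factor(c("B", "C", "D")))

# Dummy coding depends on the declared levels, so the two design
# matrices have different numbers of columns.
n_train    <- ncol(model.matrix(~ x + grp, data = train))
n_test_raw <- ncol(model.matrix(~ x + grp, data = test))

# Re-declare the test factor with the training levels (Dennis's fix).
test$grp <- factor(test$grp, levels = levels(train$grp))
n_test_aligned <- ncol(model.matrix(~ x + grp, data = test))

# Column counts before and after aligning the levels.
print(c(train = n_train, test_raw = n_test_raw,
        test_aligned = n_test_aligned))
```

With the levels aligned, the test design matrix has the same columns
as the training one, so a product like newX %*% B is conformable
again. Note that this only helps when the test levels are a subset of
the training levels; a genuinely new level would become NA.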

Thanks!

Alison

> Dennis
>
> >
> > Thanks,
> >
> > Alison
> >
> > On Mon, Apr 18, 2011 at 6:18 PM, Alison Callahan <[email protected]> wrote:
> >
> >> Hello all,
> >>
> >> I have generated a principal components regression model using the pcr()
> >> function from the PLS package (R version 2.13.0). I am getting a
> >> "non-conformable arguments" error when I try to use the predict()
> >> function on new data, but only when I try to read in the new data from
> >> a separate file.
> >>
> >> More specifically, when my data looks like this
> >>
> >> #########training data #1#################
> >>
> >> var1  var2  var3   response  train
> >> 1     2     type1  33        TRUE
> >> 2     23    type2  44        TRUE
> >> .....
> >> .......
> >> 18    11    type1  45        FALSE
> >>
> >>
> >> and I use the predict() function from the PLS package as in the example
> >> from http://rss.acs.unt.edu/Rdoc/library/pls/html/predict.mvr.html, e.g.
> >>
> >> ###################################
> >> library(pls)
> >>
> >> mydata <- read.csv("mydata.csv", header=TRUE)
> >>
> >> mydata <- data.frame(mydata)
> >>
> >> pcrmodel <- pcr(response ~ var1+var2+var3, data = mydata[mydata$train,])
> >>
> >> predict(pcrmodel, type = "response", newdata = mydata[!mydata$train,])
> >>
> >> ###################################
> >>
> >> the code works, and the model predicts new values for the "response"
> >> variable rows where train=FALSE.
> >>
> >> However, as soon as I put the rows where train = FALSE into a separate
> >> file and remove the "train" column so that my training data looks like
> >> this:
> >>
> >> #########training data #2 ################
> >> var1  var2  var3   response
> >> 1     2     type1  33
> >> 2     23    type2  44
> >> .....
> >>
> >>
> >> and my new test data, saved in a separate file (say "newdata.csv") looks
> >> like this
> >>
> >> ########test data in separate file, newdata.csv ###############
> >> var1  var2  var3   response
> >> 3     5     type1  23
> >> 4     7     type2  30
> >> .....
> >> 18    11    type1  45
> >>
> >> if I train a PCR model using the training data #2 and try to predict
> >> with the resulting model and the data from "newdata.csv", e.g.,
> >>
> >> ##################################
> >> trainingdata <- read.csv("mydata_without_train_column.csv", header=TRUE)
> >>
> >> trainingdata <- data.frame(trainingdata)
> >>
> >> testingdata <- read.csv("newdata.csv", header=TRUE)
> >>
> >> testingdata <- data.frame(testingdata)
> >>
> >> pcrmodel2 <- pcr(response ~ var1+var2+var3, data = trainingdata)
> >>
> >> predict(pcrmodel2, type = "response", newdata = testingdata)
> >> ##############################
> >>
> >> I get the following error:
> >>
> >> "Error in newX %*% B : non-conformable arguments"
> >>
> >> I don't understand why I get this error only when I put the non-training
> >> data into a separate file from the training data and load it as a
> >> separate object. Any help is appreciated,
> >>
> >> Alison
> >>
> >
> >         [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > [email protected] mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.

