I've brought this up numerous times... you shouldn't use `predict.rpart` (or whatever modeling function) from the `finalModel` object. That object has no idea what was done to the data prior to its invocation.
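To make the point about the fitted object not knowing what was done to the data concrete, here is a minimal sketch using only base R and `rpart`. It assumes the formula interface dummy-codes factors roughly the way `model.matrix` does. The `toy` data frame and the names `dummy_fit`, `raw_attempt` and `encoded_pred` are made up for illustration; none of this is the poster's data or caret's actual internals.

```r
library(rpart)

# Hypothetical toy data standing in for the poster's data frame
set.seed(1)
toy <- data.frame(
  sector        = factor(rep(c("Defense", "Hospitals", "Schools"), each = 20)),
  capital_value = runif(60, 1, 100)
)
toy$project_delay <- ifelse(toy$sector == "Hospitals", -200, 0) + rnorm(60, sd = 10)

# Dummy-code the factor and drop the intercept column
x <- model.matrix(project_delay ~ sector + capital_value, data = toy)[, -1]
colnames(x)
# "sectorHospitals" "sectorSchools" "capital_value"

# Fit rpart on the dummy-coded columns; the tree only knows these column names
dummy_fit <- rpart(project_delay ~ .,
                   data = data.frame(x, project_delay = toy$project_delay))

# Predicting from the raw data frame fails because 'toy' has no
# 'sectorHospitals' column; capture the error message instead of stopping
raw_attempt <- tryCatch(predict(dummy_fit, newdata = toy),
                        error = function(e) conditionMessage(e))

# Re-applying the same encoding to the new data works
encoded_pred <- predict(dummy_fit, newdata = data.frame(x))
```

In this sketch the re-encoding has to be done by hand before calling `predict` on the underlying tree. Calling `predict` on the `train` object itself takes care of that step.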
The issue here is that `train(formula)` converts the factors to dummy variables. `rpart` does not require that and the `finalModel` object has no idea that that happened.

Using `predict.train` works just fine so why not use it?

> table(predict(tr_m, newdata = testPFI))

-2617.42857142857 -1786.76923076923 -1777.58333333333           -1217.3
                3                 3                 6                 3
-886.666666666667          -408.375            -375.7 -240.307692307692
                5                 1                 4                 5
-201.612903225806 -19.6071428571429  30.8083333333333              43.9
               30                72                66                 9
            151.5  209.647058823529
                6                28

On Mon, May 9, 2016 at 2:46 PM, Muhammad Bilal <muhammad2.bi...@live.uwe.ac.uk> wrote:

> Please find the sample dataset attached along with R code pasted below to
> reproduce the issue.
>
> #Loading the data frame
> pfi <- read.csv("pfi_data.csv")
>
> #Splitting the data into training and test sets
> split <- sample.split(pfi, SplitRatio = 0.7)
> trainPFI <- subset(pfi, split == TRUE)
> testPFI <- subset(pfi, split == FALSE)
>
> #Cross validating the decision trees
> tr.control <- trainControl(method="repeatedcv", number=20)
> cp.grid <- expand.grid(.cp = (0:10)*0.001)
> tr_m <- train(project_delay ~ project_lon + project_lat + project_duration
> + sector + contract_type + capital_value, data = trainPFI, method="rpart",
> trControl=tr.control, tuneGrid = cp.grid)
>
> #Displaying the train results
> tr_m
>
> #Fetching the best tree
> best_tree <- tr_m$finalModel
>
> #Plotting the best tree
> prp(best_tree)
>
> #Using the best tree to make predictions *[This command raises the error]*
> best_tree_pred <- predict(best_tree, newdata = testPFI)
>
> #Calculating the SSE
> best_tree_pred.sse <- sum((best_tree_pred - testPFI$project_delay)^2)
>
> #
> tree_pred.sse
>
> ...
>
> Many Thanks and
>
> Kind Regards
>
> --
> Muhammad Bilal
> Research Fellow and Doctoral Researcher,
> Bristol Enterprise, Research, and Innovation Centre (BERIC),
> University of the West of England (UWE),
> Frenchay Campus,
> Bristol,
> BS16 1QY
>
> *muhammad2.bi...@live.uwe.ac.uk* <olugbenga2.akin...@live.uwe.ac.uk>
>
> ------------------------------
> *From:* Max Kuhn <mxk...@gmail.com>
> *Sent:* 09 May 2016 17:22:22
> *To:* Muhammad Bilal
> *Cc:* Bert Gunter; r-help@r-project.org
> *Subject:* Re: [R] Problem while predicting in regression trees
>
> It is extremely difficult to tell what the issue might be without a
> reproducible example.
>
> The only thing that I can suggest is to use the non-formula interface to
> `train` so that you can avoid creating dummy variables.
>
> On Mon, May 9, 2016 at 11:23 AM, Muhammad Bilal <
> muhammad2.bi...@live.uwe.ac.uk> wrote:
>
>> Hi Bert,
>>
>> Thanks for the response.
>>
>> I checked the datasets, however, the Hospitals level appears in both of
>> them. See the output below:
>>
>> > sqldf("SELECT sector, count(*) FROM trainPFI GROUP BY sector")
>>             sector count(*)
>> 1          Defense        9
>> 2        Hospitals      101
>> 3          Housing       32
>> 4           Others       99
>> 5 Public Buildings       39
>> 6          Schools      148
>> 7      Social Care       10
>> 8   Transportation       27
>> 9            Waste       26
>> > sqldf("SELECT sector, count(*) FROM testPFI GROUP BY sector")
>>             sector count(*)
>> 1          Defense        5
>> 2        Hospitals       47
>> 3          Housing       11
>> 4           Others       44
>> 5 Public Buildings       18
>> 6          Schools       69
>> 7      Social Care        9
>> 8   Transportation        8
>> 9            Waste       12
>>
>> Any thing else to try?
>>
>> --
>> Muhammad Bilal
>> Research Fellow and Doctoral Researcher,
>> Bristol Enterprise, Research, and Innovation Centre (BERIC),
>> University of the West of England (UWE),
>> Frenchay Campus,
>> Bristol,
>> BS16 1QY
>>
>> muhammad2.bi...@live.uwe.ac.uk
>>
>> ________________________________________
>> From: Bert Gunter <bgunter.4...@gmail.com>
>> Sent: 09 May 2016 01:42:39
>> To: Muhammad Bilal
>> Cc: r-help@r-project.org
>> Subject: Re: [R] Problem while predicting in regression trees
>>
>> It seems that the data that you used for prediction contained a level
>> "Hospitals" for the sector factor that did not appear in the training
>> data (or maybe it's the other way round). Check this.
>>
>> Cheers,
>> Bert
>>
>> Bert Gunter
>>
>> "The trouble with having an open mind is that people keep coming along
>> and sticking things into it."
>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>
>> On Sun, May 8, 2016 at 4:14 PM, Muhammad Bilal
>> <muhammad2.bi...@live.uwe.ac.uk> wrote:
>> > Hi All,
>> >
>> > I have the following script, that raises error at the last command. I am new to R and require some clarification on what is going wrong.
>> >
>> > #Creating the training and testing data sets
>> > splitFlag <- sample.split(pfi_v3, SplitRatio = 0.7)
>> > trainPFI <- subset(pfi_v3, splitFlag==TRUE)
>> > testPFI <- subset(pfi_v3, splitFlag==FALSE)
>> >
>> > #Structure of the trainPFI data frame
>> > > str(trainPFI)
>> > *******
>> > 'data.frame': 491 obs. of 16 variables:
>> >  $ project_id       : int 1 2 3 6 7 9 10 12 13 14 ...
>> >  $ project_lat      : num 51.4 51.5 52.2 51.9 52.5 ...
>> >  $ project_lon      : num -0.642 -1.85 0.08 -0.401 -1.888 ...
>> >  $ sector           : Factor w/ 9 levels "Defense","Hospitals",..: 4 4 4 6 6 6 6 6 6 6 ...
>> >  $ contract_type    : chr "Turnkey" "Turnkey" "Turnkey" "Turnkey" ...
>> >  $ project_duration : int 1826 3652 121 730 730 790 522 819 998 372 ...
>> >  $ project_delay    : int -323 0 -60 0 0 0 -91 0 0 7 ...
>> >  $ capital_value    : num 6.7 5.8 21.8 24.2 40.7 10.7 70 24.5 60.5 78 ...
>> >  $ project_delay_pct: num -17.7 0 -49.6 0 0 0 -17.4 0 0 1.9 ...
>> >  $ delay_type       : Ord.factor w/ 9 levels "7 months early & beyond"<..: 1 5 3 5 5 5 2 5 5 6 ...
>> >
>> > library(caret)
>> > library(e1071)
>> >
>> > set.seed(100)
>> >
>> > tr.control <- trainControl(method="cv", number=10)
>> > cp.grid <- expand.grid(.cp = (0:10)*0.001)
>> >
>> > #Fitting the model using regression tree
>> > tr_m <- train(project_delay ~ project_lon + project_lat + project_duration + sector + contract_type + capital_value, data = trainPFI, method="rpart", trControl=tr.control, tuneGrid = cp.grid)
>> >
>> > tr_m
>> >
>> > CART
>> > 491 samples
>> > 15 predictor
>> > No pre-processing
>> > Resampling: Cross-Validated (10 fold)
>> > Summary of sample sizes: 443, 442, 441, 442, 441, 442, ...
>> > Resampling results across tuning parameters:
>> >   cp     RMSE      Rsquared
>> >   0.000  441.1524  0.5417064
>> >   0.001  439.6319  0.5451104
>> >   0.002  437.4039  0.5487203
>> >   0.003  432.3675  0.5566661
>> >   0.004  434.2138  0.5519964
>> >   0.005  431.6635  0.5577771
>> >   0.006  436.6163  0.5474135
>> >   0.007  440.5473  0.5407240
>> >   0.008  441.0876  0.5399614
>> >   0.009  441.5715  0.5401718
>> >   0.010  441.1401  0.5407121
>> > RMSE was used to select the optimal model using the smallest value.
>> > The final value used for the model was cp = 0.005.
>> >
>> > #Fetching the best tree
>> > best_tree <- tr_m$finalModel
>> >
>> > Alright, all the aforementioned commands worked fine.
>> >
>> > Except the subsequent command raises error, when the developed model is used to make predictions:
>> > best_tree_pred <- predict(best_tree, newdata = testPFI)
>> > Error in eval(expr, envir, enclos) : object 'sectorHospitals' not found
>> >
>> > Can someone guide me what to do to resolve this issue.
>> >
>> > Any help will be highly appreciated.
>> >
>> > Many Thanks and
>> >
>> > Kind Regards
>> >
>> > --
>> > Muhammad Bilal
>> >
>> > ______________________________________________
>> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.