Many thanks Max for these valuable suggestions.
--
Muhammad Bilal
Research Fellow and Doctoral Researcher,
Bristol Enterprise, Research, and Innovation Centre (BERIC),
University of the West of England (UWE),
Frenchay Campus,
Bristol,
BS16 1QY

muhammad2.bi...@live.uwe.ac.uk

________________________________
From: Max Kuhn <mxk...@gmail.com>
Sent: 09 May 2016 23:22:30
To: Muhammad Bilal
Cc: Bert Gunter; r-help@r-project.org
Subject: Re: [R] Problem while predicting in regression trees

I've brought this up numerous times... you shouldn't use `predict.rpart` (or whatever modeling function) from the `finalModel` object. That object has no idea what was done to the data prior to its invocation.

The issue here is that `train(formula)` converts the factors to dummy variables. `rpart` does not require that and the `finalModel` object has no idea that that happened. Using `predict.train` works just fine so why not use it?

> table(predict(tr_m, newdata = testPFI))

-2617.42857142857 -1786.76923076923 -1777.58333333333           -1217.3
                3                 3                 6                 3
-886.666666666667          -408.375            -375.7 -240.307692307692
                5                 1                 4                 5
-201.612903225806 -19.6071428571429  30.8083333333333              43.9
               30                72                66                 9
            151.5  209.647058823529
                6                28

On Mon, May 9, 2016 at 2:46 PM, Muhammad Bilal <muhammad2.bi...@live.uwe.ac.uk> wrote:

Please find the sample dataset attached along with R code pasted below to reproduce the issue.

#Loading the data frame
pfi <- read.csv("pfi_data.csv")

#Splitting the data into training and test sets
split <- sample.split(pfi, SplitRatio = 0.7)
trainPFI <- subset(pfi, split == TRUE)
testPFI <- subset(pfi, split == FALSE)

#Cross validating the decision trees
tr.control <- trainControl(method="repeatedcv", number=20)
cp.grid <- expand.grid(.cp = (0:10)*0.001)
tr_m <- train(project_delay ~ project_lon + project_lat + project_duration +
              sector + contract_type + capital_value,
              data = trainPFI, method="rpart",
              trControl=tr.control, tuneGrid = cp.grid)

#Displaying the train results
tr_m

#Fetching the best tree
best_tree <- tr_m$finalModel

#Plotting the best tree
prp(best_tree)

#Using the best tree to make predictions [This command raises the error]
best_tree_pred <- predict(best_tree, newdata = testPFI)

#Calculating the SSE
best_tree_pred.sse <- sum((best_tree_pred - testPFI$project_delay)^2)

# tree_pred.sse
...

Many Thanks and

Kind Regards

--
Muhammad Bilal

________________________________
From: Max Kuhn <mxk...@gmail.com>
Sent: 09 May 2016 17:22:22
To: Muhammad Bilal
Cc: Bert Gunter; r-help@r-project.org
Subject: Re: [R] Problem while predicting in regression trees

It is extremely difficult to tell what the issue might be without a reproducible example.

The only thing that I can suggest is to use the non-formula interface to `train` so that you can avoid creating dummy variables.

On Mon, May 9, 2016 at 11:23 AM, Muhammad Bilal <muhammad2.bi...@live.uwe.ac.uk> wrote:

Hi Bert,

Thanks for the response.

I checked the datasets, however, the Hospitals level appears in both of them.
See the output below:

> sqldf("SELECT sector, count(*) FROM trainPFI GROUP BY sector")
            sector count(*)
1          Defense        9
2        Hospitals      101
3          Housing       32
4           Others       99
5 Public Buildings       39
6          Schools      148
7      Social Care       10
8   Transportation       27
9            Waste       26

> sqldf("SELECT sector, count(*) FROM testPFI GROUP BY sector")
            sector count(*)
1          Defense        5
2        Hospitals       47
3          Housing       11
4           Others       44
5 Public Buildings       18
6          Schools       69
7      Social Care        9
8   Transportation        8
9            Waste       12

Any thing else to try?

--
Muhammad Bilal

________________________________________
From: Bert Gunter <bgunter.4...@gmail.com>
Sent: 09 May 2016 01:42:39
To: Muhammad Bilal
Cc: r-help@r-project.org
Subject: Re: [R] Problem while predicting in regression trees

It seems that the data that you used for prediction contained a level "Hospitals" for the sector factor that did not appear in the training data (or maybe it's the other way round). Check this.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Sun, May 8, 2016 at 4:14 PM, Muhammad Bilal <muhammad2.bi...@live.uwe.ac.uk> wrote:
> Hi All,
>
> I have the following script, that raises error at the last command. I am new
> to R and require some clarification on what is going wrong.
>
> #Creating the training and testing data sets
> splitFlag <- sample.split(pfi_v3, SplitRatio = 0.7)
> trainPFI <- subset(pfi_v3, splitFlag==TRUE)
> testPFI <- subset(pfi_v3, splitFlag==FALSE)
>
>
> #Structure of the trainPFI data frame
> > str(trainPFI)
> *******
> 'data.frame': 491 obs. of 16 variables:
> $ project_id        : int 1 2 3 6 7 9 10 12 13 14 ...
> $ project_lat       : num 51.4 51.5 52.2 51.9 52.5 ...
> $ project_lon       : num -0.642 -1.85 0.08 -0.401 -1.888 ...
> $ sector            : Factor w/ 9 levels "Defense","Hospitals",..: 4 4 4 6 6 6 6 6 6 6 ...
> $ contract_type     : chr "Turnkey" "Turnkey" "Turnkey" "Turnkey" ...
> $ project_duration  : int 1826 3652 121 730 730 790 522 819 998 372 ...
> $ project_delay     : int -323 0 -60 0 0 0 -91 0 0 7 ...
> $ capital_value     : num 6.7 5.8 21.8 24.2 40.7 10.7 70 24.5 60.5 78 ...
> $ project_delay_pct : num -17.7 0 -49.6 0 0 0 -17.4 0 0 1.9 ...
> $ delay_type        : Ord.factor w/ 9 levels "7 months early & beyond"<..: 1 5 3 5 5 5 2 5 5 6 ...
>
> library(caret)
> library(e1071)
>
> set.seed(100)
>
> tr.control <- trainControl(method="cv", number=10)
> cp.grid <- expand.grid(.cp = (0:10)*0.001)
>
> #Fitting the model using regression tree
> tr_m <- train(project_delay ~ project_lon + project_lat + project_duration +
> sector + contract_type + capital_value, data = trainPFI, method="rpart",
> trControl=tr.control, tuneGrid = cp.grid)
>
> tr_m
>
> CART
> 491 samples
> 15 predictor
> No pre-processing
> Resampling: Cross-Validated (10 fold)
> Summary of sample sizes: 443, 442, 441, 442, 441, 442, ...
> Resampling results across tuning parameters:
>   cp     RMSE      Rsquared
>   0.000  441.1524  0.5417064
>   0.001  439.6319  0.5451104
>   0.002  437.4039  0.5487203
>   0.003  432.3675  0.5566661
>   0.004  434.2138  0.5519964
>   0.005  431.6635  0.5577771
>   0.006  436.6163  0.5474135
>   0.007  440.5473  0.5407240
>   0.008  441.0876  0.5399614
>   0.009  441.5715  0.5401718
>   0.010  441.1401  0.5407121
> RMSE was used to select the optimal model using the smallest value.
> The final value used for the model was cp = 0.005.
>
> #Fetching the best tree
> best_tree <- tr_m$finalModel
>
> Alright, all the aforementioned commands worked fine.
>
> Except the subsequent command raises error, when the developed model is used
> to make predictions:
> best_tree_pred <- predict(best_tree, newdata = testPFI)
> Error in eval(expr, envir, enclos) : object 'sectorHospitals' not found
>
> Can someone guide me what to do to resolve this issue.
>
> Any help will be highly appreciated.
>
> Many Thanks and
>
> Kind Regards
>
> --
> Muhammad Bilal
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
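
For anyone reading this thread later, the mechanism Max describes can be reproduced without caret. The sketch below is not caret's actual code. It assumes the formula interface expands factors roughly the way base R's `model.matrix` does, and it uses made-up toy data: `toy`, `duration`, `delay`, `outcome` and the three-level `sector` factor are hypothetical stand-ins for the PFI data in the thread. A tree fitted on the expanded columns only knows names such as `sectorHospitals`, so handing it the raw data frame fails, while applying the same expansion to the new data works.

```r
library(rpart)

set.seed(100)

# Toy data standing in for the PFI data; all values are invented for illustration.
n <- 200
toy <- data.frame(
  sector   = factor(sample(c("Defense", "Hospitals", "Schools"), n, replace = TRUE)),
  duration = runif(n, 100, 3000)
)
toy$delay <- ifelse(toy$sector == "Hospitals", -300, 0) +
  0.05 * toy$duration + rnorm(n, sd = 20)

train_idx <- sample(n, 140)
train_df  <- toy[train_idx, ]
test_df   <- toy[-train_idx, ]

# Expand the factor into dummy columns by hand (intercept column dropped).
# Assumption: this approximates what a formula interface does before fitting.
x_train <- as.data.frame(model.matrix(delay ~ sector + duration, data = train_df)[, -1])
x_train$outcome <- train_df$delay
print(names(x_train))

# The underlying tree is fitted on the dummy columns, not on `sector` itself.
fit <- rpart(outcome ~ ., data = x_train)

# Predicting from the raw data frame fails: the model looks for the dummy columns.
raw_attempt <- tryCatch(
  predict(fit, newdata = test_df),
  error = function(e) conditionMessage(e)
)
print(raw_attempt)

# Applying the same expansion to the new data makes prediction work.
x_test <- as.data.frame(model.matrix(delay ~ sector + duration, data = test_df)[, -1])
pred <- predict(fit, newdata = x_test)
print(head(pred))
```

In the thread's own setting there is no need to redo the expansion by hand: as Max says, calling `predict(tr_m, newdata = testPFI)` on the `train` object handles it, whereas `predict(tr_m$finalModel, ...)` does not.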