Why are you predicting from tr_m$finalModel instead of from tr_m?

Bill Dunlap
TIBCO Software
wdunlap tibco.com

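A minimal sketch of the difference behind this question, using synthetic data rather than the poster's pfi_data.csv (the data frame `toy` and its columns are hypothetical stand-ins). It assumes the caret and rpart packages are installed. With the formula interface, the factor is expanded into dummy columns before the tree is fitted, so the stored finalModel looks for those columns, while predict() on the train object repeats the expansion on newdata.

```r
library(caret)  # train(), trainControl(); method = "rpart" also needs rpart installed

set.seed(100)

# Synthetic stand-in for the project data (hypothetical columns)
n <- 200
toy <- data.frame(
  sector   = factor(sample(c("Defense", "Hospitals", "Schools"), n, replace = TRUE)),
  duration = runif(n, 100, 3000)
)
toy$delay <- 50 * (toy$sector == "Hospitals") + 0.01 * toy$duration + rnorm(n, sd = 5)

in_train <- sample(n, 140)
train_df <- toy[in_train, ]
test_df  <- toy[-in_train, ]

# Formula interface, as in the original post
fit <- train(delay ~ sector + duration, data = train_df, method = "rpart",
             trControl = trainControl(method = "cv", number = 5),
             tuneGrid  = expand.grid(cp = c(0.001, 0.01)))

# Predicting from the train object: caret rebuilds the dummy columns for newdata
pred <- predict(fit, newdata = test_df)

# Predicting from the bare rpart fit with the raw data frame is expected to
# fail, because columns such as sectorHospitals do not exist in test_df
raw_attempt <- try(predict(fit$finalModel, newdata = test_df), silent = TRUE)
failed <- inherits(raw_attempt, "try-error")
```

If `failed` is TRUE, the message stored in `raw_attempt` should resemble the "object ... not found" error quoted further down the thread.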
On Mon, May 9, 2016 at 11:46 AM, Muhammad Bilal <muhammad2.bi...@live.uwe.ac.uk> wrote:

> Please find the sample dataset attached along with R code pasted below to
> reproduce the issue.
>
> #Loading the data frame
> pfi <- read.csv("pfi_data.csv")
>
> #Splitting the data into training and test sets
> split <- sample.split(pfi, SplitRatio = 0.7)
> trainPFI <- subset(pfi, split == TRUE)
> testPFI <- subset(pfi, split == FALSE)
>
> #Cross validating the decision trees
> tr.control <- trainControl(method="repeatedcv", number=20)
> cp.grid <- expand.grid(.cp = (0:10)*0.001)
> tr_m <- train(project_delay ~ project_lon + project_lat + project_duration
>     + sector + contract_type + capital_value, data = trainPFI, method="rpart",
>     trControl=tr.control, tuneGrid = cp.grid)
>
> #Displaying the train results
> tr_m
>
> #Fetching the best tree
> best_tree <- tr_m$finalModel
>
> #Plotting the best tree
> prp(best_tree)
>
> #Using the best tree to make predictions [This command raises the error]
> best_tree_pred <- predict(best_tree, newdata = testPFI)
>
> #Calculating the SSE
> best_tree_pred.sse <- sum((best_tree_pred - testPFI$project_delay)^2)
>
> #
> tree_pred.sse
>
> ...
>
> Many Thanks and
>
> Kind Regards
>
> --
> Muhammad Bilal
> Research Fellow and Doctoral Researcher,
> Bristol Enterprise, Research, and Innovation Centre (BERIC),
> University of the West of England (UWE),
> Frenchay Campus,
> Bristol,
> BS16 1QY
>
> muhammad2.bi...@live.uwe.ac.uk
>
> ________________________________
> From: Max Kuhn <mxk...@gmail.com>
> Sent: 09 May 2016 17:22:22
> To: Muhammad Bilal
> Cc: Bert Gunter; r-help@r-project.org
> Subject: Re: [R] Problem while predicting in regression trees
>
> It is extremely difficult to tell what the issue might be without a
> reproducible example.
>
> The only thing that I can suggest is to use the non-formula interface to
> `train` so that you can avoid creating dummy variables.
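The non-formula interface suggested above can be sketched as follows, again on synthetic data (the data frame `toy`, its columns, and the `predictors` vector are hypothetical; caret and rpart are assumed to be installed). Passing the predictors as a data frame through `x` and the outcome through `y` leaves the factor column as it is, so the fitted tree refers to `sector` itself rather than to dummy columns.

```r
library(caret)  # train(), trainControl(); method = "rpart" also needs rpart installed

set.seed(100)

# Synthetic stand-in for the project data (hypothetical columns)
n <- 200
toy <- data.frame(
  sector   = factor(sample(c("Defense", "Hospitals", "Schools"), n, replace = TRUE)),
  duration = runif(n, 100, 3000)
)
toy$delay <- 50 * (toy$sector == "Hospitals") + 0.01 * toy$duration + rnorm(n, sd = 5)

in_train   <- sample(n, 140)
train_df   <- toy[in_train, ]
test_df    <- toy[-in_train, ]
predictors <- c("sector", "duration")

# Non-formula (x/y) interface: no dummy variables are created for the factor
fit_xy <- train(x = train_df[, predictors], y = train_df$delay, method = "rpart",
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid  = expand.grid(cp = c(0.001, 0.01)))

# Both routes now see the same column names as the raw data frame
pred_train_obj <- predict(fit_xy, newdata = test_df[, predictors])
pred_final     <- predict(fit_xy$finalModel, newdata = test_df)
```

Predicting from the train object remains the more general habit, since it also applies any pre-processing that train() was asked to perform.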
>
> On Mon, May 9, 2016 at 11:23 AM, Muhammad Bilal <muhammad2.bi...@live.uwe.ac.uk> wrote:
>
> Hi Bert,
>
> Thanks for the response.
>
> I checked the datasets; however, the Hospitals level appears in both of
> them. See the output below:
>
> > sqldf("SELECT sector, count(*) FROM trainPFI GROUP BY sector")
>             sector count(*)
> 1          Defense        9
> 2        Hospitals      101
> 3          Housing       32
> 4           Others       99
> 5 Public Buildings       39
> 6          Schools      148
> 7      Social Care       10
> 8   Transportation       27
> 9            Waste       26
>
> > sqldf("SELECT sector, count(*) FROM testPFI GROUP BY sector")
>             sector count(*)
> 1          Defense        5
> 2        Hospitals       47
> 3          Housing       11
> 4           Others       44
> 5 Public Buildings       18
> 6          Schools       69
> 7      Social Care        9
> 8   Transportation        8
> 9            Waste       12
>
> Anything else to try?
>
> --
> Muhammad Bilal
> Research Fellow and Doctoral Researcher,
> Bristol Enterprise, Research, and Innovation Centre (BERIC),
> University of the West of England (UWE),
> Frenchay Campus,
> Bristol,
> BS16 1QY
>
> muhammad2.bi...@live.uwe.ac.uk
>
> ________________________________________
> From: Bert Gunter <bgunter.4...@gmail.com>
> Sent: 09 May 2016 01:42:39
> To: Muhammad Bilal
> Cc: r-help@r-project.org
> Subject: Re: [R] Problem while predicting in regression trees
>
> It seems that the data that you used for prediction contained a level
> "Hospitals" for the sector factor that did not appear in the training
> data (or maybe it's the other way round). Check this.
>
> Cheers,
> Bert
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
>
>
> On Sun, May 8, 2016 at 4:14 PM, Muhammad Bilal
> <muhammad2.bi...@live.uwe.ac.uk> wrote:
> > Hi All,
> >
> > I have the following script, which raises an error at the last command.
> > I am new to R and require some clarification on what is going wrong.
> >
> > #Creating the training and testing data sets
> > splitFlag <- sample.split(pfi_v3, SplitRatio = 0.7)
> > trainPFI <- subset(pfi_v3, splitFlag==TRUE)
> > testPFI <- subset(pfi_v3, splitFlag==FALSE)
> >
> >
> > #Structure of the trainPFI data frame
> > > str(trainPFI)
> > *******
> > 'data.frame': 491 obs. of 16 variables:
> >  $ project_id        : int 1 2 3 6 7 9 10 12 13 14 ...
> >  $ project_lat       : num 51.4 51.5 52.2 51.9 52.5 ...
> >  $ project_lon       : num -0.642 -1.85 0.08 -0.401 -1.888 ...
> >  $ sector            : Factor w/ 9 levels "Defense","Hospitals",..: 4 4 4 6 6 6 6 6 6 6 ...
> >  $ contract_type     : chr "Turnkey" "Turnkey" "Turnkey" "Turnkey" ...
> >  $ project_duration  : int 1826 3652 121 730 730 790 522 819 998 372 ...
> >  $ project_delay     : int -323 0 -60 0 0 0 -91 0 0 7 ...
> >  $ capital_value     : num 6.7 5.8 21.8 24.2 40.7 10.7 70 24.5 60.5 78 ...
> >  $ project_delay_pct : num -17.7 0 -49.6 0 0 0 -17.4 0 0 1.9 ...
> >  $ delay_type        : Ord.factor w/ 9 levels "7 months early & beyond"<..: 1 5 3 5 5 5 2 5 5 6 ...
> >
> > library(caret)
> > library(e1071)
> >
> > set.seed(100)
> >
> > tr.control <- trainControl(method="cv", number=10)
> > cp.grid <- expand.grid(.cp = (0:10)*0.001)
> >
> > #Fitting the model using regression tree
> > tr_m <- train(project_delay ~ project_lon + project_lat +
> >     project_duration + sector + contract_type + capital_value, data = trainPFI,
> >     method="rpart", trControl=tr.control, tuneGrid = cp.grid)
> >
> > tr_m
> >
> > CART
> > 491 samples
> > 15 predictor
> > No pre-processing
> > Resampling: Cross-Validated (10 fold)
> > Summary of sample sizes: 443, 442, 441, 442, 441, 442, ...
> > Resampling results across tuning parameters:
> >   cp     RMSE      Rsquared
> >   0.000  441.1524  0.5417064
> >   0.001  439.6319  0.5451104
> >   0.002  437.4039  0.5487203
> >   0.003  432.3675  0.5566661
> >   0.004  434.2138  0.5519964
> >   0.005  431.6635  0.5577771
> >   0.006  436.6163  0.5474135
> >   0.007  440.5473  0.5407240
> >   0.008  441.0876  0.5399614
> >   0.009  441.5715  0.5401718
> >   0.010  441.1401  0.5407121
> > RMSE was used to select the optimal model using the smallest value.
> > The final value used for the model was cp = 0.005.
> >
> > #Fetching the best tree
> > best_tree <- tr_m$finalModel
> >
> > All of the commands above worked fine.
> >
> > However, the next command raises an error when the fitted model is
> > used to make predictions:
> > best_tree_pred <- predict(best_tree, newdata = testPFI)
> > Error in eval(expr, envir, enclos) : object 'sectorHospitals' not found
> >
> > Can someone guide me on how to resolve this issue?
> >
> > Any help will be highly appreciated.
> >
> > Many Thanks and
> >
> > Kind Regards
> >
> > --
> > Muhammad Bilal
> > Research Fellow and Doctoral Researcher,
> > Bristol Enterprise, Research, and Innovation Centre (BERIC),
> > University of the West of England (UWE),
> > Frenchay Campus,
> > Bristol,
> > BS16 1QY
> >
> > muhammad2.bi...@live.uwe.ac.uk
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
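Two checks raised in this thread can be sketched in base R: the comparison of factor levels between the training and test sets, and the dummy-variable naming that explains why the error mentions 'sectorHospitals' rather than 'sector'. The two small data frames are hypothetical stand-ins for trainPFI and testPFI, and model.matrix() is used here only to illustrate the kind of expansion a formula interface performs; it is not a copy of caret's internal code.

```r
# Hypothetical stand-ins for trainPFI and testPFI
train_df <- data.frame(
  sector        = factor(c("Defense", "Hospitals", "Schools", "Hospitals")),
  capital_value = c(6.7, 5.8, 21.8, 24.2)
)
test_df <- data.frame(
  sector        = factor(c("Hospitals", "Schools", "Waste")),
  capital_value = c(40.7, 10.7, 70)
)

# Check 1: levels present in the test set but absent from the training set
unseen_levels <- setdiff(levels(test_df$sector), levels(train_df$sector))

# Check 2: column names after formula expansion; the factor is replaced by
# one indicator column per non-baseline level, named <variable><level>
dummy_cols <- colnames(model.matrix(~ sector + capital_value, data = train_df))

# Names a model fitted on the expanded matrix would look for,
# but which the raw data frame does not contain
missing_in_raw <- setdiff(dummy_cols, c("(Intercept)", names(train_df)))

print(unseen_levels)
print(dummy_cols)
print(missing_in_raw)
```

In the thread, the first check came back clean (the same nine sectors appear in both sets), which points to the second mechanism as the likelier cause of the error.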