Re: [R] Problem while predicting in regression trees

William Dunlap via R-help Mon, 09 May 2016 12:29:40 -0700

Why are you predicting from tr_m$finalModel instead of from tr_m?
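
A minimal sketch of the difference, in case it helps. It uses a made-up
data frame rather than your PFI data: the names toy, sector, duration and
delay are my own stand-ins, and the model is an assumption-laden toy, not
your actual fit. As far as I understand it, the formula interface to train()
expands factors into dummy columns before calling rpart, so the inner
finalModel expects columns such as sectorSchools, while predict() on the
train object redoes that expansion on newdata for you.

```r
library(caret)  # the rpart package must also be installed

set.seed(100)

# Hypothetical stand-in data: one factor predictor and one numeric predictor
n <- 200
toy <- data.frame(
  sector   = factor(sample(c("Hospitals", "Schools", "Waste"), n, replace = TRUE)),
  duration = runif(n, 100, 3000)
)
toy$delay <- 0.1 * toy$duration + 50 * (toy$sector == "Hospitals") +
  rnorm(n, sd = 20)

# Simple random 70/30 split on row indices
in_train <- sample(n, 140)
train_df <- toy[in_train, ]
test_df  <- toy[-in_train, ]

fit <- train(delay ~ sector + duration, data = train_df, method = "rpart",
             trControl = trainControl(method = "cv", number = 5),
             tuneGrid = expand.grid(cp = (0:5) * 0.001))

# Predicting from the inner rpart object with the raw test data is expected
# to fail, because test_df has a 'sector' factor, not the dummy columns.
res <- try(predict(fit$finalModel, newdata = test_df), silent = TRUE)
inherits(res, "try-error")

# Predicting from the train object applies the same dummy expansion first.
pred <- predict(fit, newdata = test_df)
sse  <- sum((pred - test_df$delay)^2)
```

In your script that would amount to calling predict(tr_m, newdata = testPFI)
and computing the SSE from that result.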

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Mon, May 9, 2016 at 11:46 AM, Muhammad Bilal <
muhammad2.bi...@live.uwe.ac.uk> wrote:

> Please find the sample dataset attached along with R code pasted below to
> reproduce the issue.
>
>
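> #Loading the required packages
> library(caTools)     # sample.split
> library(caret)       # train, trainControl
> library(rpart.plot)  # prp
>
> #Loading the data frame
> pfi <- read.csv("pfi_data.csv")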
>
> #Splitting the data into training and test sets
> split <- sample.split(pfi, SplitRatio = 0.7)
> trainPFI <- subset(pfi, split == TRUE)
> testPFI <- subset(pfi, split == FALSE)
>
> #Cross validating the decision trees
> tr.control <- trainControl(method="repeatedcv", number=20)
> cp.grid <- expand.grid(.cp = (0:10)*0.001)
> tr_m <- train(project_delay ~ project_lon + project_lat + project_duration
> + sector + contract_type + capital_value, data = trainPFI, method="rpart",
> trControl=tr.control, tuneGrid = cp.grid)
>
> #Displaying the train results
> tr_m
>
> #Fetching the best tree
> best_tree <- tr_m$finalModel
>
> #Plotting the best tree
> prp(best_tree)
>
> #Using the best tree to make predictions [This command raises the error]
> best_tree_pred <- predict(best_tree, newdata = testPFI)
>
> #Calculating the SSE
> best_tree_pred.sse <- sum((best_tree_pred - testPFI$project_delay)^2)
>
> #Displaying the SSE
> best_tree_pred.sse
>
> ...
>
>
> Many Thanks and
>
>
> Kind Regards
>
>
>
> --
> Muhammad Bilal
> Research Fellow and Doctoral Researcher,
> Bristol Enterprise, Research, and Innovation Centre (BERIC),
> University of the West of England (UWE),
> Frenchay Campus,
> Bristol,
> BS16 1QY
>
> muhammad2.bi...@live.uwe.ac.uk
>
>
> ________________________________
> From: Max Kuhn <mxk...@gmail.com>
> Sent: 09 May 2016 17:22:22
> To: Muhammad Bilal
> Cc: Bert Gunter; r-help@r-project.org
> Subject: Re: [R] Problem while predicting in regression trees
>
> It is extremely difficult to tell what the issue might be without a
> reproducible example.
>
> The only thing that I can suggest is to use the non-formula interface to
> `train` so that you can avoid creating dummy variables.
>
> On Mon, May 9, 2016 at 11:23 AM, Muhammad Bilal
> <muhammad2.bi...@live.uwe.ac.uk> wrote:
> Hi Bert,
>
> Thanks for the response.
>
> I checked the datasets; however, the Hospitals level appears in both of
> them. See the output below:
>
> > sqldf("SELECT sector, count(*) FROM trainPFI GROUP BY sector")
>             sector count(*)
> 1          Defense        9
> 2        Hospitals      101
> 3          Housing       32
> 4           Others       99
> 5 Public Buildings       39
> 6          Schools      148
> 7      Social Care       10
> 8   Transportation       27
> 9            Waste       26
> > sqldf("SELECT sector, count(*) FROM testPFI GROUP BY sector")
>             sector count(*)
> 1          Defense        5
> 2        Hospitals       47
> 3          Housing       11
> 4           Others       44
> 5 Public Buildings       18
> 6          Schools       69
> 7      Social Care        9
> 8   Transportation        8
> 9            Waste       12
>
> Anything else to try?
>
>
> ________________________________________
> From: Bert Gunter <bgunter.4...@gmail.com>
> Sent: 09 May 2016 01:42:39
> To: Muhammad Bilal
> Cc: r-help@r-project.org
> Subject: Re: [R] Problem while predicting in regression trees
>
> It seems that the data that you used for prediction contained a level
> "Hospitals" for the sector factor that did not appear in the training
> data (or maybe it's the other way round). Check this.
>
> Cheers,
> Bert
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
>
>
> On Sun, May 8, 2016 at 4:14 PM, Muhammad Bilal
> <muhammad2.bi...@live.uwe.ac.uk> wrote:
> > Hi All,
> >
> > I have the following script, which raises an error at the last command.
> > I am new to R and require some clarification on what is going wrong.
> >
> > #Creating the training and testing data sets
> > splitFlag <- sample.split(pfi_v3, SplitRatio = 0.7)
> > trainPFI <- subset(pfi_v3, splitFlag==TRUE)
> > testPFI <- subset(pfi_v3, splitFlag==FALSE)
> >
> >
> > #Structure of the trainPFI data frame
> >> str(trainPFI)
> > *******
> > 'data.frame': 491 obs. of  16 variables:
> >  $ project_id             : int  1 2 3 6 7 9 10 12 13 14 ...
> >  $ project_lat            : num  51.4 51.5 52.2 51.9 52.5 ...
> >  $ project_lon            : num  -0.642 -1.85 0.08 -0.401 -1.888 ...
> >  $ sector                 : Factor w/ 9 levels "Defense","Hospitals",..: 4 4 4 6 6 6 6 6 6 6 ...
> >  $ contract_type          : chr  "Turnkey" "Turnkey" "Turnkey" "Turnkey" ...
> >  $ project_duration       : int  1826 3652 121 730 730 790 522 819 998 372 ...
> >  $ project_delay          : int  -323 0 -60 0 0 0 -91 0 0 7 ...
> >  $ capital_value          : num  6.7 5.8 21.8 24.2 40.7 10.7 70 24.5 60.5 78 ...
> >  $ project_delay_pct      : num  -17.7 0 -49.6 0 0 0 -17.4 0 0 1.9 ...
> >  $ delay_type             : Ord.factor w/ 9 levels "7 months early & beyond"<..: 1 5 3 5 5 5 2 5 5 6 ...
> >
> > library(caret)
> > library(e1071)
> >
> > set.seed(100)
> >
> > tr.control <- trainControl(method="cv", number=10)
> > cp.grid <- expand.grid(.cp = (0:10)*0.001)
> >
> > #Fitting the model using regression tree
> > tr_m <- train(project_delay ~ project_lon + project_lat +
> >               project_duration + sector + contract_type + capital_value,
> >               data = trainPFI, method = "rpart",
> >               trControl = tr.control, tuneGrid = cp.grid)
> >
> > tr_m
> >
> > CART
> > 491 samples
> > 15 predictor
> > No pre-processing
> > Resampling: Cross-Validated (10 fold)
> > Summary of sample sizes: 443, 442, 441, 442, 441, 442, ...
> > Resampling results across tuning parameters:
> >   cp     RMSE      Rsquared
> >   0.000  441.1524  0.5417064
> >   0.001  439.6319  0.5451104
> >   0.002  437.4039  0.5487203
> >   0.003  432.3675  0.5566661
> >   0.004  434.2138  0.5519964
> >   0.005  431.6635  0.5577771
> >   0.006  436.6163  0.5474135
> >   0.007  440.5473  0.5407240
> >   0.008  441.0876  0.5399614
> >   0.009  441.5715  0.5401718
> >   0.010  441.1401  0.5407121
> > RMSE was used to select the optimal model using  the smallest value.
> > The final value used for the model was cp = 0.005.
> >
> > #Fetching the best tree
> > best_tree <- tr_m$finalModel
> >
> > All of the commands above worked fine.
> >
> > However, the next command raises an error when the fitted model is used
> > to make predictions:
> > best_tree_pred <- predict(best_tree, newdata = testPFI)
> > Error in eval(expr, envir, enclos) : object 'sectorHospitals' not found
> >
> > Can someone guide me on how to resolve this issue?
> >
> > Any help will be highly appreciated.
> >
> > Many Thanks and
> >
> > Kind Regards
> >
> >
> >         [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
