Hi,
That's a tough one. I'll do my best and hope a more knowledgeable person
will correct me.
Since you can measure conditional importance by permuting predictors and
re-evaluating importance, perhaps try the randomForest package and examine
how your results change when each predictor is permuted. I understand that
permutation could take a prohibitively long time for certain applications.
Try these (clumsy) shortcuts:
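Some rough R code (x1 stands for whichever predictor you permute):

First Option:
library(randomForest)
myforest <- randomForest(y ~ ., data = df)
imp <- myforest$importance   # just for your info, importance is stored here
# permute one predictor
df_perm <- df
df_perm$x1 <- sample(df$x1, length(df$x1), replace = FALSE)
# fit a new forest on the permuted x1 with all else unchanged
newforest <- randomForest(y ~ ., data = df_perm)
# predict old x1 with newforest
# If somewhat accurate, problems afoot.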
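
To make "examine how your results change based on permutation of each predictor" concrete, here is a minimal sketch on simulated data. The data frame sim, the helper fit_importance, and the forest size are all made up for illustration, and this is only one possible reading of the idea, not the conditional importance of Strobl et al.

```r
library(randomForest)

set.seed(1)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$x3 <- sim$x1 + rnorm(n, sd = 0.3)    # x3 is correlated with x1
sim$y  <- 2 * sim$x1 + sim$x2 + rnorm(n)

# Hypothetical helper: fit a forest and return permutation importance (%IncMSE)
fit_importance <- function(data) {
  rf <- randomForest(y ~ ., data = data, ntree = 200, importance = TRUE)
  importance(rf, type = 1)[, 1]
}

base_imp <- fit_importance(sim)

# Refit once per predictor with that predictor shuffled, and record the
# importance of every predictor in the refitted forest.
# Rows: predictor being scored. Columns: predictor that was shuffled.
perm_imp <- sapply(names(base_imp), function(v) {
  shuffled <- sim
  shuffled[[v]] <- sample(shuffled[[v]])
  fit_importance(shuffled)
})

round(cbind(original = base_imp, perm_imp), 1)
```

Reading across a row shows how one predictor's importance shifts when each of the others is scrambled; a large shift hints that the two predictors share information.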

Second Option:
Predict your held-out variable 1000 times with the new forest (pretty quick
to do) and examine the quantile of each predicted value relative to the old
(non-permuted) distribution of the variable. The quantiles should be
uniformly distributed between 0 and 1 if the variable is truly random inside
the forest (and random outside, since we know it has been permuted). You
could measure this with a chi-square statistic.
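
The quantile check itself can be sketched in base R. This is only an illustration of the mechanics: uniformity_check is a hypothetical helper, the choice of 10 bins is arbitrary, and the two sets of "predictions" are simulated stand-ins rather than output from a real forest.

```r
# Convert each prediction to a quantile of the original variable, bin the
# quantiles, and test the bin counts against a uniform distribution.
uniformity_check <- function(pred, original, bins = 10) {
  q <- ecdf(original)(pred)                    # quantile of each prediction
  breaks <- seq(0, 1, length.out = bins + 1)
  counts <- table(cut(q, breaks = breaks, include.lowest = TRUE))
  chisq.test(counts)                           # H0: equal counts per bin
}

set.seed(42)
original <- rnorm(5000)                # stand-in for the unpermuted variable

# Two hypothetical sets of 1000 "predictions" (not from a real forest)
pred_spread <- sample(original, 1000)  # spread like the original variable
pred_narrow <- rnorm(1000, sd = 0.3)   # concentrated near its centre

res_spread <- uniformity_check(pred_spread, original)
res_narrow <- uniformity_check(pred_narrow, original)

res_spread$p.value   # quantiles roughly uniform across the bins
res_narrow$p.value   # quantiles pile up in the middle bins
```

A small p-value means the quantiles are not uniform, so the predicted values do not look like random draws from the original distribution.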

Third Option:
Permute the x's and plot the importance of each variable when the others are
held out (inferential only).

Weak, I know, but I hope it helps!
Ken Hutchison


On Fri, Oct 14, 2011 at 12:06 PM, Jason Roberts <jason.robe...@duke.edu> wrote:

> I would like to build a forest of regression trees to see how well some
> covariates predict a response variable and to examine the importance of the
> covariates. I have a small number of covariates (8) and large number of
> records (27368). The response and all of the covariates are continuous
> variables.
>
> A cursory examination of the covariates does not suggest they are
> correlated in a simple fashion (e.g. the variance inflation factors are
> all fairly low) but common sense suggests there should be some
> relationship: one of them is the day of the year and some of the others
> are environmental parameters such as water temperature. For this reason I
> would like to follow the advice of Strobl et al. (2008) and try the
> authors' conditional variable importance measure. This is implemented in
> the party package by calling varimp(..., conditional=TRUE). Unfortunately,
> when I call that on my forest I receive the error:
>
> > varimp(myforest, conditional=TRUE)
> Error in model.matrix.default(as.formula(f), data = blocks) :
>  term 1 would require 9e+12 columns
>
> Does anyone know what is wrong?
>
> I noticed a post in June 2011 where a user reported this message and the
> ultimate problem was that the importance measure was being conditioned on
> too many variables (47). I have only a small number of variables here so I
> guessed that was not the problem.
>
> Another suggestion was that there could be a factor with too many levels.
> In my case, all of the variables are continuous. Term 1 (x1 below) is the
> day of the year, which does happen to be integers 1 ... 366. But the
> variable is class numeric, not integer, so I don't believe cforest would
> treat it as a factor, although I do not know how to tell whether cforest
> is treating something as continuous or as a factor.
>
> Thank you for any help you can provide. I am running R 2.13.1 with party
> 0.9-99994. You can download the data from
> http://www.duke.edu/~jjr8/data.rdata (512 KB). Here is the complete code:
>
> > load("\\Temp\\data.rdata")
> > nrow(df)
> [1] 27368
> > summary(df)
>        y                 x1              x2               x3               x4
>  Min.   :  0.000   Min.   :  1.0   Min.   :0.0000   Min.   :  1.00   Min.   :  52
>  1st Qu.:  0.000   1st Qu.:105.0   1st Qu.:0.0000   1st Qu.: 30.00   1st Qu.:1290
>  Median :  1.282   Median :169.0   Median :0.2353   Median : 38.00   Median :1857
>  Mean   :  5.651   Mean   :178.7   Mean   :0.2555   Mean   : 55.03   Mean   :1907
>  3rd Qu.:  5.353   3rd Qu.:262.0   3rd Qu.:0.4315   3rd Qu.: 47.00   3rd Qu.:2594
>  Max.   :195.238   Max.   :366.0   Max.   :1.0000   Max.   :400.00   Max.   :3832
>        x5                  x6              x7                  x8
>  Min.   : 0.008184   Min.   :16.71   Min.   :0.0000000   Min.   : 0.02727
>  1st Qu.: 6.747035   1st Qu.:23.92   1st Qu.:0.0000000   1st Qu.: 0.11850
>  Median :11.310277   Median :26.35   Median :0.0001569   Median : 0.14625
>  Mean   :12.889021   Mean   :26.31   Mean   :0.0162043   Mean   : 0.20684
>  3rd Qu.:18.427410   3rd Qu.:28.95   3rd Qu.:0.0144660   3rd Qu.: 0.20095
>  Max.   :29.492380   Max.   :31.73   Max.   :0.3157486   Max.   :11.76877
> > library(HH)
> <output deleted>
> > vif(y ~ ., data=df)
>      x1       x2       x3       x4       x5       x6       x7       x8
> 1.374583 1.252250 1.021672 1.218801 1.015124 1.439868 1.075546 1.060580
> > library(party)
> <output deleted>
> > mycontrols <- cforest_unbiased(ntree=50, mtry=3)   # Small forest but requires a few minutes
> > myforest <- cforest(y ~ ., data=df, controls=mycontrols)
> > varimp(myforest)
>         x1         x2         x3         x4         x5         x6         x7         x8
>  11.924498 103.180195  16.228864  30.658946   5.053500  12.820551   2.113394   6.911377
> > varimp(myforest, conditional=TRUE)
> Error in model.matrix.default(as.formula(f), data = blocks) :
>  term 1 would require 9e+12 columns
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
