Re: [R] party for prediction [REPOST]

Achim Zeileis Sun, 14 Oct 2012 08:54:59 -0700

Ed:

I'm experiencing some problems using the party package (specifically mob) for prediction. I have a real scalar y I want to predict from a real-valued vector x and an integral vector z. mob seemed the ideal choice from the documentation.

I'm not sure what you mean by "integral vector". If you want to apply the approach to hundreds of thousands of observations, I guess that these are categorical (maybe even binary?) but maybe not...

The first problem I had was at some nodes in a partitioning tree, the components of x may be extremely highly correlated or effectively constant (that is x are not independent for all choices of components of z). When the resulting fit is fed into predict() the result is NA - this is not the same behaviour as models returned by say lm which ignore missing coefficients. I have fixed this by defining my own statsModel (myLinearModel - imaginative) which also ignores such coefficients when predicting.

If I recall correctly, we kept linearModel as simple as we did to save as much time as possible. This can be particularly important when one of the partitioning variables has many possible splits and the linearModel has to be fitted thousands of times.

Also, mob() assesses the stability of all coefficients of the model in all nodes during partitioning. If any of the coefficients is not identified, it would have to be excluded from all subsequent parameter stability tests in that node (and its child nodes). This is currently not provided for in mob().
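[Editorial sketch] The lm behaviour Ed refers to can be illustrated in base R. This is only a minimal sketch under stated assumptions: the data are simulated, and predict_ignore_na is a hypothetical helper name, not a function from party or modeltools, and not Ed's myLinearModel. It shows that lm reports NA for an unidentified coefficient and that treating NA coefficients as zero reproduces lm's fitted values.

```r
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- 2 * x1                  # exactly collinear with x1: its coefficient is not identified
y  <- 1 + 3 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)
cf  <- coef(fit)              # the entry for x2 is NA

# Hypothetical helper: predict by hand, treating unidentified (NA) coefficients as zero
predict_ignore_na <- function(coefs, X) {
  coefs[is.na(coefs)] <- 0
  drop(X %*% coefs)
}

X    <- model.matrix(fit)     # includes the aliased column for x2
pred <- predict_ignore_na(cf, X)
```

A prediction function along these lines avoids NA results, but it does not address the parameter stability tests, which would still see the unidentified coefficient.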

The second problem I have is that I get "Cholesky not positive definite" errors at some nodes. I guess this is because of numerical error and degeneracy in the covariance matrix? Any thoughts on how to avoid having this happen would be welcome; it is ignorable though for now.

This comes from the parameter stability tests and might be a result of an unidentified (or close to unidentified) model fit.
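[Editorial sketch] One way to see whether a given subset of the data leads to an unidentified linear model is to check the rank of its design matrix. The following base R sketch assumes simulated data; is_identified is a hypothetical helper, not part of party, and the tolerance is simply the default used by qr().

```r
# Hypothetical helper: TRUE when all columns of the design matrix are
# linearly independent up to the given tolerance
is_identified <- function(X, tol = 1e-7) {
  qr(X, tol = tol)$rank == ncol(X)
}

set.seed(2)
x1    <- rnorm(30)
X_ok  <- cbind(1, x1, rnorm(30))   # three independent columns
X_bad <- cbind(1, x1, 2 * x1)      # third column is a multiple of the second

ok_flag  <- is_identified(X_ok)
bad_flag <- is_identified(X_bad)
```

A check of this kind only detects exact or near-exact rank deficiency; a fit that is merely "close to unidentified" may pass it and still cause numerical trouble.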

The third and really big problem I have is that when I apply mob to
large datasets (say hundreds of thousands of elements) I get a
"logical subscript too long" error inside mob_fit_fluctests. It's
caught in a try(), and mob just gives up and treats the node as
terminal. This is really hurting me though; with 1% of my data I can
get a good fit and a worthwhile tree, but with the whole dataset I get
a very stunted tree with a pretty useless prediction ability.

With hundreds of thousands of observations, you would need some additional pruning strategy anyway. Significance test-based splitting will probably overfit because tiny differences in the coefficients will be picked up at such large sample sizes.

Furthermore, computationally the extensive search over all possible splits might be too burdensome with this many observations.


Hence, using some subsampling strategy might not be the worst thing.
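[Editorial sketch] A simple version of such a subsampling strategy, in base R. The data frame, the variable names and the 1% fraction (taken from Ed's description) are assumptions for illustration; the mob() call in the comment is not run here and assumes party is attached.

```r
set.seed(3)
n <- 200000
d <- data.frame(x = rnorm(n), z = sample(0:1, n, replace = TRUE))
d$y <- 1 + ifelse(d$z == 1, 2, -1) * d$x + rnorm(n)

# Draw a 1% random subsample for growing the tree
frac  <- 0.01
idx   <- sample.int(nrow(d), size = floor(frac * nrow(d)))
d_sub <- d[idx, ]

# Keep the remaining rows to assess prediction out of sample
d_holdout <- d[-idx, ]

# With party attached, the tree could then be grown on the subsample only,
# e.g. (assumed call, not run): mob(y ~ x | z, data = d_sub, model = linearModel)
```

Repeating this over several subsamples would give some idea of how stable the resulting tree is.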

I guess what I really want to know is:
(a) has anyone else had this problem, and if so how did they overcome it?

We have had non-identified model fits in binary GLMs (with quasi-complete separation) where we then set estfun() to all zero so that partitioning stops. But I don't think that such a strategy helps here.

(b) is there any way to get a line or stack trace out of a try()
without source modification?


Not sure, I don't know any off the top of my head.

(c) failing all of that, does anyone know of an alternative to mob
that does the same thing; for better or worse I'm now committed to
recursive partitioning over linear models, as per mob?

If your partitioning variables are particularly simple (e.g., all binary) you could exploit that and it may be easier to write a custom function for your particular data. Then likelihood-ratio tests (rather than LM-type tests) would also be easier to apply in case of unidentified parameters.

But if there are partitioning variables with different measurement scales, then this will not be that simple...
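[Editorial sketch] For the simple case of a single binary partitioning variable, a likelihood-ratio test of one pooled linear model against separate models in the two groups might look as follows. This is a sketch on simulated data, not code from party: lr_split_test is a hypothetical helper, and the chi-squared reference distribution is the usual asymptotic approximation. The degrees of freedom come from logLik(), which counts only the identified coefficients (the rank of the fit) plus the error variance.

```r
# Hypothetical helper: LR test of a pooled linear model against separate
# linear models (each with its own error variance) in the two groups of z
lr_split_test <- function(formula, data, z) {
  groups <- split(data, z)
  stopifnot(length(groups) == 2)
  ll0 <- logLik(lm(formula, data = data))
  ll1 <- lapply(groups, function(g) logLik(lm(formula, data = g)))
  stat <- 2 * (sum(sapply(ll1, as.numeric)) - as.numeric(ll0))
  # Difference in the number of estimated parameters, based on the rank of each fit
  df <- sum(sapply(ll1, function(l) attr(l, "df"))) - attr(ll0, "df")
  c(statistic = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

set.seed(4)
n <- 400
d <- data.frame(x = rnorm(n), z = rep(0:1, each = n / 2))
d$y <- 1 + ifelse(d$z == 1, 2, -1) * d$x + rnorm(n)   # slope differs between groups

res <- lr_split_test(y ~ x, data = d, z = d$z)
```

A recursive partitioning routine built on this would apply the test to each binary variable in a node, split on the one with the smallest p-value if it is below some threshold, and recurse.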

(d) failing all of this, does anyone have a link to a way to rebuild, or locally modify, an R package (preferably Windows, but anything would do)?

Have a look at the "Writing R Extensions" manual and the R for Windows FAQ.


Best,
Z

Sorry for the length of this post. If I should RTFM, please point me
at any relevant manual by all means. I've spent a few days on this as
you can maybe tell, but I'm far from being an R expert.

Thanks for any help you can give.

Best wishes,

Ed

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


