Ed:

I'm experiencing some problems using the party package (specifically mob) for prediction. I have a real scalar y I want to predict from a real valued vector x and an integral vector z. mob seemed the ideal choice from the documentation.

I'm not sure what you mean by "integral vector". If you want to apply the approach to hundreds of thousands of observations, I guess that these are categorical (maybe even binary?) but maybe not...

The first problem I had was at some nodes in a partitioning tree, the components of x may be extremely highly correlated or effectively constant (that is x are not independent for all choices of components of z). When the resulting fit is fed into predict() the result is NA - this is not the same behaviour as models returned by say lm which ignore missing coefficients. I have fixed this by defining my own statsModel (myLinearModel - imaginative) which also ignores such coefficients when predicting.

If I recall correctly, we kept linearModel as simple as we did to save as much time as possible. This can be particularly important when one of the partitioning variables has many possible splits and the linearModel has to be fitted thousands of times.

Also, mob() assesses the stability of all coefficients of the model in all nodes during partitioning. If any of the coefficients is not identified, this would have to be excluded from all subsequent parameter stability tests in that node (and its child nodes). This is currently not provided for in mob().
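
For reference, the lm behaviour mentioned above can be reproduced in base R alone. This is only a minimal sketch on simulated data: the names d, new_d and fit are made up for illustration, and it is not the myLinearModel from the question. It shows that an aliased coefficient is reported as NA and that predict() effectively treats it as zero.

```r
# Simulated data (assumption): x2 is an exact multiple of x1, so its
# coefficient cannot be identified.
set.seed(42)
d <- data.frame(x1 = rnorm(50))
d$x2 <- 3 * d$x1
d$y <- 1 + 2 * d$x1 + rnorm(50)

fit <- lm(y ~ x1 + x2, data = d)
coef(fit)  # the coefficient of x2 is NA (aliased)

# New data with the same collinearity structure.
new_d <- data.frame(x1 = c(-1, 0, 1))
new_d$x2 <- 3 * new_d$x1

# predict() drops the aliased column; it may warn about the rank-deficient fit.
builtin <- suppressWarnings(predict(fit, newdata = new_d))

# The same prediction by hand: set NA coefficients to zero before multiplying.
b <- coef(fit)
b[is.na(b)] <- 0
manual <- drop(model.matrix(~ x1 + x2, data = new_d) %*% b)

all.equal(unname(manual), unname(builtin))
```

A custom model for mob() that predicts this way would still face the stability-test issue described above, because the tests would see the unidentified coefficient.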

The second problem I have is that I get "Cholesky not positive definite" errors at some nodes. I guess this is because of numerical error and degeneracy in the covariance matrix? Any thoughts on how to avoid having this happen would be welcome; it is ignorable though for now.

This comes from the parameter stability tests and might be a result of an unidentified (or close to unidentified) model fit.

The third and really big problem I have is that when I apply mob to
large datasets (say hundreds of thousands of elements) I get a
"logical subscript too long" error inside mob_fit_fluctests. It's
caught in a try(), and mob just gives up and treats the node as
terminal. This is really hurting me though; with 1% of my data I can
get a good fit and a worthwhile tree, but with the whole dataset I get
a very stunted tree with a pretty useless prediction ability.

With hundreds of thousands of observations, you would need some additional pruning strategy anyway. Significance test-based splitting will probably overfit because tiny differences in the coefficients will be picked up at such large sample sizes.

Furthermore, computationally the extensive search over all possible splits might be too burdensome with this many observations.

Hence, using some subsampling strategy might not be the worst thing.
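
A rough base R illustration of both points, under made-up assumptions: the data are simulated, the slope differs by only 0.03 between the two levels of a binary z, and the helper interaction_p() is hypothetical. It uses a plain lm interaction test rather than the tests in mob(), so it is only meant to show the sample-size effect and a 1% subsample.

```r
# Simulated data (assumption): 200,000 observations with a practically
# negligible difference in slope between the two levels of z.
set.seed(123)
n <- 200000
d <- data.frame(x = rnorm(n), z = factor(rbinom(n, 1, 0.5)))
d$y <- 1 + d$x * ifelse(d$z == "1", 1.03, 1.00) + rnorm(n)

# Hypothetical helper: p-value for "the slope of x differs between z levels".
interaction_p <- function(data) {
  fit <- lm(y ~ x * z, data = data)
  coef(summary(fit))["x:z1", "Pr(>|t|)"]
}

# On the full data even this tiny difference is usually highly significant.
p_full <- interaction_p(d)

# A 1% random subsample has far less power to pick up such a difference.
sub <- d[sample(nrow(d), size = nrow(d) %/% 100), ]
p_sub <- interaction_p(sub)

c(full = p_full, subsample = p_sub)
```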

I guess what I really want to know is:
(a) has anyone else had this problem, and if so how did they overcome it?

We have had non-identified model fits in binary GLMs (with quasi-complete separation) where we then set estfun() to all zero so that partitioning stops. But I don't think that such a strategy helps here.

(b) is there any way to get a line or stack trace out of a try()
without source modification?

Not sure, I don't know any off the top of my head.
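
One possibility, sketched here in base R only, is to record the call stack with withCallingHandlers() before try() swallows the error. The functions f() and g() are made up for illustration, and the approach assumes you control the try() call yourself. For a try() buried inside a package function, the handler would not be reached from outside; there debug() or trace() on the internal function might help, without editing the package source.

```r
# Toy functions (assumption) standing in for a failing call chain.
f <- function(x) g(x)
g <- function(x) stop("logical subscript too long")

calls <- NULL
res <- try(
  withCallingHandlers(
    f(1),
    # Runs while the failing frames are still on the stack.
    error = function(e) calls <<- sys.calls()
  ),
  silent = TRUE
)

# res is a "try-error" object; calls holds the stack at the time of the error.
class(res)
stack <- vapply(calls, function(cl) deparse(cl)[1], character(1))
print(stack)

# For an internal function, debug() could be tried instead, for example
# debug(party:::mob_fit_fluctests), assuming the function lives in the
# party namespace under that name.
```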

(c) failing all of that, does anyone know of an alternative to mob
that does the same thing; for better or worse I'm now committed to
recursive partitioning over linear models, as per mob?

If your partitioning variables are particularly simple (e.g., all binary) you could exploit that and it may be easier to write a custom function for your particular data. Then likelihood-ratio tests (rather than LM-type tests) would also be easier to apply in case of unidentified parameters.
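
A minimal sketch of such a likelihood-ratio test for a single binary partitioning variable, using base R only. Everything here is an assumption for illustration (simulated data, the names pooled and split_fit), not what mob() does. It relies on logLik() for lm fits counting only the identified coefficients, so the collinear regressor x2 does not break the test.

```r
# Simulated data (assumption): the slope of x1 differs between z = 0 and z = 1,
# and x2 is perfectly collinear with x1.
set.seed(1)
n <- 1000
z <- rbinom(n, 1, 0.5)
x1 <- rnorm(n)
x2 <- 2 * x1
y <- 1 + x1 * ifelse(z == 1, 2, 1) + rnorm(n)
d <- data.frame(y, x1, x2, z = factor(z))

# One model for all observations vs. separate coefficients per level of z.
pooled <- lm(y ~ x1 + x2, data = d)
split_fit <- lm(y ~ (x1 + x2) * z, data = d)

# Likelihood-ratio statistic; the degrees of freedom come from the ranks of
# the two fits, so aliased coefficients are not counted.
lr_stat <- as.numeric(2 * (logLik(split_fit) - logLik(pooled)))
lr_df <- attr(logLik(split_fit), "df") - attr(logLik(pooled), "df")
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(statistic = lr_stat, df = lr_df, p.value = p_value)
```

With several binary partitioning variables, one could compute this p-value for each of them and split on the smallest, with some correction for multiple testing.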

But if there are partitioning variables with different measurement scales, then this will not be that simple...

(d) failing all of this, does anyone have a link to a way to rebuild, or locally modify, an R package (preferably windows, but anything would do)?

Have a look at the "Writing R Extensions" manual and the R for Windows FAQ.

Best,
Z

Sorry for the length of this post. If I should RTFM, please point me
at any relevant manual by all means. I've spent a few days on this as
you can maybe tell, but I'm far from being an R expert.

Thanks for any help you can give.

Best wishes,

Ed

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

