This was an exceptionally helpful answer, I can only thank you again. I have plenty of avenues ahead where I was worried before I was getting trapped in a dead end. If all else fails, the idea of using anova is brilliant. Thank you!
Ed On 14 October 2012 18:36, Achim Zeileis <achim.zeil...@uibk.ac.at> wrote: > On Sun, 14 Oct 2012, Ed wrote: > >> First up, thanks hugely for your response. I've been beating my head >> against this! >> >> On 14 October 2012 16:51, Achim Zeileis <achim.zeil...@uibk.ac.at> wrote: >>> >>> I'm not sure what you mean by "integral vector". If you want to apply the >>> approach to hundreds of thousands of observations, I gues that these are >>> categorical (maybe even binary?) but maybe not... >> >> >> I'm sorry I can't go into the details of the data, I would if I could. >> z are categorical variables represented as integers, mostly ordered, >> but not all. I've tried fitting them as integers, as well as ordered, >> but O don't think it made a huge difference. > > > The tests performed for categorical partitioning variables are rather > different from the tests for numerical partitioning variables. If all of the > variables are categorical, this may not be immediately obvious, but the > factor coding should be more appropriate (especially if the number of levels > is small or moderate). > > >>> If I recall correctly, we kept linearModel as simple as we did to save as >>> much time as possible. This can be particularly important when one of the >>> partitioning variables has many possible splits and the linearModel has >>> to >>> be fitted thousands of times. >> >> >> I can appreciate that, but maybe having an alternative linearModel which >> will predict when the fit is degenerate would be worth including? I'm happy >> to contribute what I have, although it's pretty obvious stuff (and probably >> done suboptimally since I'm not much of an R coder at this point). For me at >> least, even with huge datasets, the speed of party is quite good; it's >> getting a better result that's the problem. > > > As I explained in my last e-mail. In your situation this does not solve the > problem completely because subsequent the tests are also not adapted to > this. Setting the empirical estimating functions to zero for non-identified > coefficients might alleviate the problem but is not really a clean solution. > > >>> Also, mob() assesses the stability of all coefficients of the model in >>> all >>> nodes during partitioning. If any of the coefficients is not identified, >>> this would have to be excluded from all subsequent parameter stability >>> tests >>> in that node (and its child nodes). This is currently not provided for in >>> mob(). >> >> >> Would pretending the coefficients were fit at 0 fool mob into doing >> something moderately meaningful here? > > > The coefficients are not looked at during fitting, only the estfun(). This > would have to be set to 0. > > >> If not, I would try to hack the code, but I'm honestly at something of >> a loss as to how to modify it and feed the results back into my >> interpreter. I have bytecode installed; I downloaded the source, but I >> haven't squared the circle of modifying the source and installing the >> result. I will check out the docs on writing extensions you suggest. > > > Writing (or modifying) R packages and installing them under Windows is > pretty standard and well documented. The pointers I gave you should > hopefully get you started. > > >>>> The second problem I have is that I get "Cholesky not positive definite" >>>> errors at some nodes. I guess this is because of numerical error and >>>> degeneracy in the covariance matrix? Any thoughts on how to avoid having >>>> this happen would be welcome; it is ignorable though for now. >>> >>> >>> This comes from the parameter stability tests and might be a result of an >>> unidentified (or close to unidentified) model fit. >> >> >> This is a great help to know. I improved my results quite considerably >> with aggressive scaling of everything (scaling the response and all >> the predictors to lie between 0 an 1). That deepened my tree by a >> factor of two or so (say depth 3 to 7) and improved the quality of fit >> substantially. Is there any way I can engage a more numerically robust >> Cholesky in mob? > > > No, I don't think that this is conceivable with the way this is implemented > at the moment. Instead of the currently implemented tests, one could in > principle use the likelihood ratio test which is invariant against parameter > transformations and doesn't need the covariance matrix. > > >>> With hundreds of thousands of observations, you would need some >>> additional >>> pruning strategy anyway. Significance test-based splitting will probably >>> overfit because tiny differences in the coefficients will be picked up at >>> such large sample sizes. >> >> >> I'm okay with overfitting, honestly. At the moment it is underfitting >> by quite a large amount I think (the quality of the predictions on the >> training set is not very high). The problem really is there is so much >> going on the data, but the "noise" level is probably very low. I >> wouldn't be surprised if my data was accurate to 5 or 6 s.f. > > > OK. Maybe with so little noise some other splitting strategy (not based on > significance tests) would be better? > > >>> Furthermore, computationally the extensive search over all possible >>> splits >>> might be too burdensome with this many observations. >> >> >> I have plenty of compute time/power and RAM, though R seems to be >> running single threaded. But even on a few million observations, it's >> still pretty fast and doesn't use more than 30 or so gig of memory. If >> it takes a day and requires 150gig of RAM, that is absolutely fine, >> even over that would be viable though less optimal. > > > With this setup, you may consider writing your own partitioning algorithm > using the same type of ideas as MOB. Instead of using the parameter > stability tests, you could use plain likelihood ratio tests or ANOVAs to > decide about splitting further. > > If d is the data in the current node, y and x are response/regressors, and > f1 to fn are your categorical partitioning variables, you could do > > m0 <- lm(y ~ x, data = d) > m1 <- lm(y ~ f1 * x, data = d) > ... > mn <- lm(y ~ fn * x, data = d) > > and then > > anova(m0, m1) > ... > anova(m0, mn) > > and then choose the lowest p-value for splitting. You could then also > parallelize the computations in the daughter nodes. > > None of this is readily coded in party though... > > >>> Hence, using some subsampling strategy might not be the worst thing. >> >> >> I've tried this at various degrees, but the data is really very >> complicated with not a lot of error. I'm trying to encourage party to >> fit more closely, which I thought more data might encourage. At the >> moment I'm a long way from a clean fit. I have subsampled at various >> levels down to 1%, and although that increases the depth of the tree >> and quality of fit, it still doesn't give a very good quality fit and >> can encourage it to overlook obvious aspects of the training set. >> >>>> I guess what I really want to know is: >>>> (a) has anyone else had this problem, and if so how did they overcome >>>> it? >>> >>> >>> >>> We have had non-identified model fits in binary GLMs (with quasi-complete >>> separation) where we then set estfun() to all zero so that partitioning >>> stops. But I don't think that such a strategy helps here. >> >> >> I've considered using rpart() to partition into cells of constant >> gradient, then fitting linear models myself to the cells. This is my >> next thought. I'm pretty sure partitioning over linear regression is >> the way forward for the data we have. I tried mars and glm but there >> are good reasons to think they're less reasonable, even though the fit >> wasn't particularly poor. I'm not particularly wedded to party's >> approach except that it looked like it immediately returned what we >> needed, and with some degree of "optimality" into the bargain. > > > Possibly boosting linear models may be another route to go. > > >>>> (b) is there any way to get a line or stack trace out of a try() >>>> without source modification? >>> >>> >>> Not sure, I don't know any off the top off my head. >> >> >> I guess I really will have to bite the bullet and try to figure out >> how to install modified libraries. Thanks. > > > I think that's easier compared to solving the conceptual problems (thinking > about how to best partition degenerate linear models etc.). > ;-) > > Good luck! > Best wishes, > > Z > >>>> (c) failing all of that, does anyone know of an alternative to mob >>>> that does the same thing; for better or worse I'm now committed to >>>> recursive partitioning over linear models, as per mob? >>> >>> >>> If your partitioning variables are particularly simple (e.g., all binary) >>> you could exploit that and it may be easier to write a custom function >>> for >>> your particular data. Then likelihood-ratio tests (rather than LM-type >>> tests) would also be easier to apply in case of unidentified parameters. >>> >>> But if there are partitioning variables with different measurement >>> scales, >>> then this will not be that simple... >> >> >> Unfortunately each partitioning variable is essentially a state >> indicator, taking values say 0,...,R where R is different for each >> component. I'm not a stats expert either; I've spent some time with >> the party manuals and papers, but I wouldn't be confident of >> implementing something like it in the time available to me (though if >> I have to I will, but that wouldn't be a good situation to be in). >> >>>> (d) failing all of this, does anyone have a link to a way to rebuild, or >>>> locally modify, an R package (preferably windows, but anything would >>>> do)? >>> >>> >>> >>> Have a look at the "Writing R Extensions" manual and the R for Windows >>> FAQ. >> >> >> Will do. >> >> Thank you very much for your responses, I really appreciate it. >> >> Best wishes, >> >> Ed >> > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.