Oh, oops. I clearly embarrassed myself. :D I believe you are suggesting that, besides the evaluation functions proposed in the paper, you want to test the model produced by SR with statistical tests to establish its validity? I haven't really given much thought to using statistical tests in model evaluation. That seems to me like a hybrid approach rather than a purely evolutionary one, which betrays the title of SR. However, I haven't performed any tests myself to conclude which one will outdo the other.
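To make the degrees-of-freedom point concrete, here is a minimal sketch in Python (not R, and not anything taken from the paper or from Eureqa) of an F statistic comparing two nested least-squares fits. The data, the two candidate models, and every function name are made up for illustration. The idea is only that the extra parameter in the larger model has to pay for itself once both residual sums of squares are divided by their degrees of freedom.

```python
from statistics import mean


def rss(y, fitted):
    """Residual sum of squares between observations and fitted values."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def fit_constant(x, y):
    """Model 1 (1 parameter): y = c, fitted by least squares."""
    c = mean(y)
    return [c for _ in x]


def fit_line(x, y):
    """Model 2 (2 parameters): y = a + b*x, fitted by ordinary least squares."""
    xm, ym = mean(x), mean(y)
    sxy = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
    sxx = sum((xi - xm) ** 2 for xi in x)
    b = sxy / sxx
    a = ym - b * xm
    return [a + b * xi for xi in x]


def residual_std_error(rss_value, n, p):
    """Residual standard error using n - p residual degrees of freedom."""
    return (rss_value / (n - p)) ** 0.5


def f_statistic(rss_small, p_small, rss_big, p_big, n):
    """F statistic for nested models with p_small < p_big parameters."""
    numerator = (rss_small - rss_big) / (p_big - p_small)
    denominator = rss_big / (n - p_big)
    return numerator / denominator


# Illustrative data only: roughly y = 2x with a little noise.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9]
n = len(y)

rss1 = rss(y, fit_constant(x, y))
rss2 = rss(y, fit_line(x, y))
f_value = f_statistic(rss1, 1, rss2, 2, n)

print("F statistic:", f_value)
print("residual std. error, constant model:", residual_std_error(rss1, n, 1))
print("residual std. error, linear model:", residual_std_error(rss2, n, 2))
```

A large F value suggests the extra parameter is justified. Turning it into a p-value would need the F distribution, which this sketch leaves out.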
Chillu

On Mon, Mar 8, 2010 at 1:19 PM, James Salsman <jsals...@talknicer.com> wrote:
> Chillu, I meant that development on both a syrfr R package capable of
> using either F statistics or parametric derivatives should proceed in
> parallel with your work on such a derivatives package. You are right
> that genetic algorithm search (and general best-first search --
> http://en.wikipedia.org/wiki/Best-first_search -- of which genetic
> algorithms are various special cases) can be very effectively
> parallelized, too.
>
> In any case, thank you for pointing out Eureqa --
> http://ccsl.mae.cornell.edu/eureqa -- but I can see no evidence there
> or in the user manual or user forums that Eureqa is considering
> degrees of freedom in its goodness-of-fit estimation. That is a
> serious problem which will typically result in invalid symbolic
> regression. I am sending this message also to Michael Schmidt so that
> he might be able to comment on the extent to which Eureqa adjusts for
> degrees of freedom in his fit evaluations.
>
> Best regards,
> James Salsman
>
> On Sun, Mar 7, 2010 at 10:39 PM, Chidambaram Annamalai
> <quantumeli...@gmail.com> wrote:
> >
> >> If I understand your concern, you want to lay the foundation for
> >> derivatives so that you can implement the search strategies described
> >> in Schmidt and Lipson (2010) --
> >> http://www.springerlink.com/content/l79v2183725413w0/ -- is that
> >> right?
> >
> > Yes. Basically traditional "naive" error estimators or fitness functions
> > fail miserably when used in SR with implicit equations because they
> > immediately close in on "best" fits like f(x) = x - x and other trivial
> > solutions. In such cases no amount of regularization and complexity
> > penalizing methods will help since x - x is fairly simple by most measures
> > of complexity and it does have zero error.
> > So the paper outlines such problems associated with "direct" error
> > estimators and thus they infer the "triviality" of the fit by probing
> > its estimates around nearby points and seeing if it does follow the
> > pattern dictated by the data points -- ergo derivatives.
> >
> > Also, as a side benefit, this method enables us to perform
> > regression on closed loops and other implicit equations since the fitness
> > functions are based only on derivatives. The specific form of the error is
> > equation 1.2 which, I believe, makes up the internals of the
> > evaluation procedure used in Eureqa.
> >
> > You are correct in pointing out that there is no reason not to work in
> > parallel, since GAs generally have a more or less fixed form
> > (evaluate-reproduce cycle) which is quite easily parallelized. I have used
> > OpenMP in the past, in which it is fairly trivial to parallelize
> > well-formed for loops.
> >
> > Chillu
> >
> >> It is not clear to me how well this generalized approach will
> >> work in practice, but there is no reason not to proceed in parallel to
> >> establish a framework under which you could implement the metrics
> >> proposed by Schmidt and Lipson in the contemplated syrfr package.
> >>
> >> I have expanded the test I proposed with two more questions -- at
> >> http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010:syrfr
> >> -- specifically:
> >>
> >> 5. Critique http://sites.google.com/site/gptips4matlab/
> >>
> >> 6. Use anova to compare the goodness-of-fit of a SSfpl nls fit with a
> >> linear model of your choice. How can you characterize the
> >> degree-of-freedom-adjusted goodness of fit of nonlinear models?
> >>
> >> I believe pairwise anova.nls is the optimal comparison for nonlinear
> >> models, but there are several good choices for approximations,
> >> including the residual standard error, which I believe can be adjusted
> >> for degrees of freedom, as can the F statistic which TableCurve uses;
> >> see: http://en.wikipedia.org/wiki/F-test#Regression_problems
> >>
> >> Best regards,
> >> James Salsman
> >>
> >>
> >> On Sun, Mar 7, 2010 at 7:35 PM, Chidambaram Annamalai
> >> <quantumeli...@gmail.com> wrote:
> >> > It's been a while since I proposed syrfr and I have been constantly in
> >> > contact with the many people in the R community and I wasn't able to
> >> > find a mentor for the project. I later got interested in the Automatic
> >> > Differentiation proposal (adinr) and, on consulting with a few others
> >> > within the R community, I mailed John Nash (who proposed adinr in the
> >> > first place) if he'd be willing to take me up on the project. I got a
> >> > positive reply only a few hours ago and it was my mistake to have not
> >> > removed the syrfr proposal in time from the wiki, as being listed under
> >> > proposals looking for mentors.
> >> >
> >> > While I appreciate your interest in the syrfr proposal I am afraid my
> >> > allegiances have shifted towards the adinr proposal, as I got convinced
> >> > that it might interest a larger group of people and it has wider scope
> >> > in general.
> >> >
> >> > I apologize for having caused this trouble.
> >> >
> >> > Best Regards,
> >> > Chillu
> >> >
> >> > On Mon, Mar 8, 2010 at 6:41 AM, James Salsman <jsals...@talknicer.com>
> >> > wrote:
> >> >>
> >> >> Per http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010
> >> >> -- and
> >> >> http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010:syrfr
> >> >> -- I am applying to mentor the "Symbolic Regression for R" (syrfr)
> >> >> package for the Google Summer of Code 2010.
> >> >>
> >> >> I propose the following test which an applicant would have to pass in
> >> >> order to qualify for the topic:
> >> >>
> >> >> 1. Describe each of the following terms as they relate to statistical
> >> >> regression: categorical, periodic, modular, continuous, bimodal,
> >> >> log-normal, logistic, Gompertz, and nonlinear.
> >> >>
> >> >> 2. Explain which parts of http://bit.ly/tablecurve were adopted in
> >> >> SigmaPlot and which weren't.
> >> >>
> >> >> 3. Use the 'outliers' package to improve a regression fit maintaining
> >> >> the correct extrapolation confidence intervals as are between those
> >> >> with and without outlier exclusions in proportion to the confidence
> >> >> that the outliers were reasonably excluded. (Show your R transcript.)
> >> >>
> >> >> 4. Explain the relationship between degrees of freedom and correlated
> >> >> independent variables.
> >> >>
> >> >> Best regards,
> >> >>
> >> >> James Salsman
> >> >> jsals...@talknicer.com
> >> >> http://talknicer.com
> >> >>
> >> >> ______________________________________________
> >> >> R-devel@r-project.org mailing list
> >> >> https://stat.ethz.ch/mailman/listinfo/r-devel
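The trivial-fit problem discussed in the thread (an implicit candidate such as f(x) = x - x scoring zero error) can be illustrated with a small Python sketch. This is not equation 1.2 from Schmidt and Lipson and not Eureqa's procedure. It is a simplified stand-in that checks whether the gradient of a candidate f(x, y) is perpendicular to the local direction of the data, using points on a unit circle. The data set, the three candidates, and all function names are assumptions made for illustration.

```python
import math


def naive_error(f, points):
    """Mean |f(x, y)| over the data: the "direct" error estimator."""
    return sum(abs(f(x, y)) for x, y in points) / len(points)


def gradient(f, x, y, h=1e-5):
    """Central-difference estimate of (df/dx, df/dy) at (x, y)."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy


def derivative_error(f, points, tiny=1e-9):
    """Mean |cos(angle)| between grad f and the direction of the data.

    For a genuine implicit relation f(x, y) = 0 the gradient is perpendicular
    to the curve, so the score is near 0. A vanishing gradient (as for x - x)
    carries no directional information and is given the worst score, 1.0.
    """
    total = 0.0
    count = 0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        tx, ty = x1 - x0, y1 - y0              # direction between neighbours
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2  # evaluate gradient at midpoint
        gx, gy = gradient(f, xm, ym)
        gnorm = math.hypot(gx, gy)
        tnorm = math.hypot(tx, ty)
        if gnorm < tiny:
            total += 1.0
        else:
            total += abs(gx * tx + gy * ty) / (gnorm * tnorm)
        count += 1
    return total / count


# Illustrative data: 50 ordered points on the unit circle (a closed loop).
points = [(math.cos(2 * math.pi * k / 50), math.sin(2 * math.pi * k / 50))
          for k in range(50)]


def trivial(x, y):
    return x - x              # zero everywhere, so zero "direct" error


def circle(x, y):
    return x * x + y * y - 1  # the relation the data actually follows


def wrong(x, y):
    return x + y              # a non-trivial but incorrect candidate


for name, f in [("x - x", trivial), ("x^2 + y^2 - 1", circle), ("x + y", wrong)]:
    print(name, naive_error(f, points), derivative_error(f, points))
```

Under the direct error the trivial candidate ties with the true relation, while the derivative-based score separates them, which is the behaviour the quoted explanation describes.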