Re: [Rd] application to mentor syrfr package development for Google Summer of Code 2010

Chidambaram Annamalai Mon, 08 Mar 2010 00:23:49 -0800

Oh oops. I clearly embarrassed myself. :D

I believe you are suggesting that, besides the evaluation functions proposed
in the paper, you want to test the model produced by SR using statistical
tests to prove its validity? I haven't really given much thought to using
statistical tests in model evaluation. But that seems to me like a hybrid
-- not purely evolutionary, betraying the title of SR. However, I
haven't performed any tests myself to conclude which one will outdo the
other.


Chillu

On Mon, Mar 8, 2010 at 1:19 PM, James Salsman <jsals...@talknicer.com> wrote:

> Chillu, I meant that development of a syrfr R package capable of
> using either F statistics or parametric derivatives should proceed in
> parallel with your work on such a derivatives package. You are right
> that genetic algorithm search (and general best-first search --
> http://en.wikipedia.org/wiki/Best-first_search -- of which genetic
> algorithms are various special cases) can be very effectively
> parallelized, too.
>
> In any case, thank you for pointing out Eureqa --
> http://ccsl.mae.cornell.edu/eureqa -- but I can see no evidence there
> or in the user manual or user forums that Eureqa is considering
> degrees of freedom in its goodness-of-fit estimation.  That is a
> serious problem which will typically result in invalid symbolic
> regression.  I am sending this message also to Michael Schmidt so that
> he might be able to comment on the extent to which Eureqa adjusts for
> degrees of freedom in his fit evaluations.
>
> Best regards,
> James Salsman
>
> On Sun, Mar 7, 2010 at 10:39 PM, Chidambaram Annamalai
> <quantumeli...@gmail.com> wrote:
> >
> >> If I understand your concern, you want to lay the foundation for
> >> derivatives so that you can implement the search strategies described
> >> in Schmidt and Lipson (2010) --
> >> http://www.springerlink.com/content/l79v2183725413w0/ -- is that
> >> right?
> >
> > Yes. Basically, traditional "naive" error estimators or fitness functions
> > fail miserably when used in SR with implicit equations because they
> > immediately close in on "best" fits like f(x) = x - x and other trivial
> > solutions. In such cases no amount of regularization or complexity
> > penalization will help, since x - x is fairly simple by most measures
> > of complexity and it does have zero error. The paper outlines such
> > problems with "direct" error estimators, and the authors instead infer
> > the "triviality" of a fit by probing its estimates around nearby points
> > and seeing whether it follows the pattern dictated by the data points --
> > ergo derivatives.
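
To make the point above concrete, here is a minimal Python sketch. It is not the method from the paper and not Eureqa's code. It assumes toy data sampled from the unit circle and two hypothetical candidates for an implicit model f(x, y) = 0: the trivial x - x and the circle x^2 + y^2 - 1. The function names, the finite-difference step, and the slope-comparison score are all illustrative assumptions; the score is a simplified stand-in for a derivative-based fitness, not equation 1.2.

```python
import math

# Toy data (assumed): points on the unit circle, ordered by angle.
angles = [0.3 + 0.05 * k for k in range(20)]
points = [(math.cos(t), math.sin(t)) for t in angles]


def trivial(x, y):
    """Degenerate candidate: identically zero for every input."""
    return x - x


def circle(x, y):
    """Candidate that actually describes the data."""
    return x * x + y * y - 1.0


def direct_error(f, pts):
    """Naive fitness: mean |f(x, y)| over the data."""
    return sum(abs(f(x, y)) for x, y in pts) / len(pts)


def partials(f, x, y, h=1e-6):
    """Central-difference estimates of df/dx and df/dy."""
    fx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    fy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return fx, fy


def derivative_error(f, pts):
    """Mean gap between the slope implied by f and the slope seen in the data.

    Returns math.inf when f has no usable gradient, which is how the
    trivial candidate is rejected in this sketch.
    """
    total, count = 0.0, 0
    for i in range(1, len(pts) - 1):
        (x0, y0), (x, y), (x1, y1) = pts[i - 1], pts[i], pts[i + 1]
        data_slope = (y1 - y0) / (x1 - x0)  # slope between the two neighbours
        fx, fy = partials(f, x, y)
        if abs(fx) < 1e-9 and abs(fy) < 1e-9:
            return math.inf  # flat everywhere: f says nothing about the data
        if abs(fy) < 1e-9:
            continue  # vertical tangent, skip this point
        total += abs(data_slope - (-fx / fy))  # implicit dy/dx = -fx / fy
        count += 1
    return total / count if count else math.inf


for name, f in (("x - x", trivial), ("circle", circle)):
    print(name, direct_error(f, points), derivative_error(f, points))
```

Both candidates score essentially zero on the direct error, so that measure cannot tell them apart. Only the derivative-based score separates them, because x - x carries no gradient information at all.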
> >
> > Also, as a side benefit, this method enables us to perform
> > regression on closed loops and other implicit equations, since the fitness
> > functions are based only on derivatives. The specific form of the error is
> > equation 1.2, which, I believe, is what makes up the internals of the
> > evaluation procedure used in Eureqa.
> >
> > You are correct in pointing out that there is no reason not to work in
> > parallel, since GAs generally have a more or less fixed form
> > (evaluate-reproduce cycle) which is quite easily parallelized. I have used
> > OpenMP in the past, in which it is fairly trivial to parallelize
> > well-formed for loops.
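
As a rough illustration of why the evaluate step parallelizes so easily, here is a small Python sketch. It is not OpenMP. The fitness function, the candidate encoding as (a, b) pairs, and the worker count are hypothetical. It uses a thread pool from the standard library to keep the example self-contained; in standard CPython builds threads will not speed up CPU-bound scoring, so a process pool would be the likely substitute in practice.

```python
from concurrent.futures import ThreadPoolExecutor


def fitness(candidate):
    """Hypothetical fitness: squared error of a*x + b against y = 2x + 1."""
    a, b = candidate
    return sum((a * x + b - (2 * x + 1)) ** 2 for x in range(10))


def evaluate_population(population, workers=4):
    """Score every candidate; each evaluation is independent of the others."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so scores[i] belongs to population[i].
        return list(pool.map(fitness, population))


population = [(0.0, 0.0), (2.0, 1.0), (1.0, 3.0)]
scores = evaluate_population(population)
best = population[scores.index(min(scores))]
print(scores, best)
```

The loop body has no shared state between candidates, which is the same property that makes a well-formed for loop easy to hand to OpenMP.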
> >
> > Chillu
> >
> >> It is not clear to me how well this generalized approach will
> >> work in practice, but there is no reason not to proceed in parallel to
> >> establish a framework under which you could implement the metrics
> >> proposed by Schmidt and Lipson in the contemplated syrfr package.
> >>
> >> I have expanded the test I proposed with two more questions -- at
> >> http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010:syrfr
> >> -- specifically:
> >>
> >> 5. Critique http://sites.google.com/site/gptips4matlab/
> >>
> >> 6. Use anova to compare the goodness-of-fit of a SSfpl nls fit with a
> >> linear model of your choice. How can you characterize the
> >> degree-of-freedom-adjusted goodness of fit of nonlinear models?
> >>
> >> I believe pairwise anova.nls is the optimal comparison for nonlinear
> >> models, but there are several good choices for approximations,
> >> including the residual standard error, which I believe can be adjusted
> >> for degrees of freedom, as can the F statistic which TableCurve uses;
> >> see: http://en.wikipedia.org/wiki/F-test#Regression_problems
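
For readers unfamiliar with the degree-of-freedom adjustments mentioned above, here is a minimal Python sketch of the nested-model F statistic from the linked F-test article, alongside the residual standard error. It is not anova.nls and not what TableCurve computes internally. The toy data, the two models (mean-only versus straight line), and all function names are assumptions made for illustration. No p-value is computed because the Python standard library has no F distribution.

```python
import math


def fit_mean(xs, ys):
    """One-parameter model: predict the mean of y everywhere."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Two-parameter model: ordinary least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def rss(model, xs, ys):
    """Residual sum of squares of a fitted model."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys))


def residual_standard_error(rss_value, n, p):
    """sqrt(RSS / (n - p)): the error scale adjusted for p fitted parameters."""
    return math.sqrt(rss_value / (n - p))


def f_statistic(rss1, p1, rss2, p2, n):
    """F for a simpler model (rss1, p1) nested in a larger one (rss2, p2)."""
    return ((rss1 - rss2) / (p2 - p1)) / (rss2 / (n - p2))


# Toy data (assumed), roughly y = x with small noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]
n = len(xs)

rss_mean = rss(fit_mean(xs, ys), xs, ys)
rss_line = rss(fit_line(xs, ys), xs, ys)
f_value = f_statistic(rss_mean, 1, rss_line, 2, n)
rse_line = residual_standard_error(rss_line, n, 2)
print(rss_mean, rss_line, f_value, rse_line)
```

Both quantities divide by n - p rather than n, so a model that buys a lower residual sum of squares with extra parameters pays for it in the denominator.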
> >>
> >> Best regards,
> >> James Salsman
> >>
> >>
> >> On Sun, Mar 7, 2010 at 7:35 PM, Chidambaram Annamalai
> >> <quantumeli...@gmail.com> wrote:
> >> > It's been a while since I proposed syrfr and I have been constantly in
> >> > contact with many people in the R community, but I wasn't able to
> >> > find a mentor for the project. I later got interested in the Automatic
> >> > Differentiation proposal (adinr) and, on consulting with a few others
> >> > within the R community, I mailed John Nash (who proposed adinr in the
> >> > first place) to ask if he'd be willing to take me on for the project.
> >> > I got a positive reply only a few hours ago, and it was my mistake not
> >> > to have removed the syrfr proposal from the wiki in time, where it was
> >> > still listed under proposals looking for mentors.
> >> >
> >> > While I appreciate your interest in the syrfr proposal, I am afraid my
> >> > allegiances have shifted towards the adinr proposal, as I got convinced
> >> > that it might interest a larger group of people and it has wider scope
> >> > in general.
> >> >
> >> > I apologize for having caused this trouble.
> >> >
> >> > Best Regards,
> >> > Chillu
> >> >
> >> > On Mon, Mar 8, 2010 at 6:41 AM, James Salsman <jsals...@talknicer.com>
> >> > wrote:
> >> >>
> >> >> Per http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010
> >> >> -- and
> >> >> http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010:syrfr
> >> >> -- I am applying to mentor the "Symbolic Regression for R" (syrfr)
> >> >> package for the Google Summer of Code 2010.
> >> >>
> >> >> I propose the following test which an applicant would have to pass in
> >> >> order to qualify for the topic:
> >> >>
> >> >> 1. Describe each of the following terms as they relate to statistical
> >> >> regression: categorical, periodic, modular, continuous, bimodal,
> >> >> log-normal, logistic, Gompertz, and nonlinear.
> >> >>
> >> >> 2. Explain which parts of http://bit.ly/tablecurve were adopted in
> >> >> SigmaPlot and which weren't.
> >> >>
> >> >> 3. Use the 'outliers' package to improve a regression fit maintaining
> >> >> the correct extrapolation confidence intervals as are between those
> >> >> with and without outlier exclusions in proportion to the confidence
> >> >> that the outliers were reasonably excluded.  (Show your R transcript.)
> >> >>
> >> >> 4. Explain the relationship between degrees of freedom and correlated
> >> >> independent variables.
> >> >>
> >> >> Best regards,
> >> >>
> >> >> James Salsman
> >> >> jsals...@talknicer.com
> >> >> http://talknicer.com
> >> >>
> >> >> ______________________________________________
> >> >> R-devel@r-project.org mailing list
> >> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> >> >
> >> >
> >
> >
>

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
