Re: [R] Formula in a model

Bert Gunter Thu, 12 Sep 2013 12:45:25 -0700

Paulito:

Just my opinion:


You wrote: "Syntactically, I thought R is smart enough to detect that I'm
using one of the columns because I use data=mytable syntax which means that
input/output information are in the mytable. "

You appear to be suggesting that R do something contrary to what it is
documented to do (as you were shown). That is illegitimate and rather
foolish, don't you think?

It **is** legitimate to suggest that what R does be changed and the changed
behavior be documented (of course!). However, you are suggesting a change
in R's long-established core behavior. Such a change might break a bunch of
code, and the current behavior, as history shows (fitting models with many
variables is a standard task), does not appear to have caused distress.
Does this sound wise to you?

Others have made similar remarks about all sorts of issues on this list. I
confess that I don't understand. R has been in active use by thousands of
people for 20 years or so. R's authors and the maintainers of its core code
are not dopes. While this certainly does not exclude the possibility that
modifications to core behavior are sometimes needed and are, indeed, made,
logic would suggest that this is a relatively rare event. That
comparatively new users seem to ignore such considerations and propose
such modifications anyway therefore baffles me. Greater modesty would seem
to be in order.

But that's just my opinion, like I said.

Cheers,
Bert

On Thu, Sep 12, 2013 at 6:44 AM, Paulito Palmes <ppal...@yahoo.com> wrote:

> Hi Gerrit,
>
> Thank you very much for the precise explanation.
>
> Syntactically, I thought R is smart enough to detect that I'm using one of
> the columns because I use data=mytable syntax which means that input/output
> information are in the mytable.
>
> For generic support, I think it would be wise to support this syntax:
> genericModel(table[,columnLists] ~ ., data=table), because in many cases
> where you have hundreds of columns, you don't know the headers but you do
> know the column positions of your inputs and outputs. You may ask: why not
> use genericModel(table[,inputColumns],table[,outputColumns])? The formula
> expression shows more flexibility and elegance. Can this become a feature
> in the future? Or at least, can R be smart enough to detect that the
> output column is part of the input columns?
>
> I'm not sure how many will make the mistake of using this expression in
> the future, especially when dealing with many columns, where the easiest
> way to access them is by column number instead of header. It is sensible
> once you understand how R interprets it, but syntactically it makes sense
> to have the expression: mytable[,outputColumns] ~ .
>
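
One possible way to keep the formula interface while still addressing the
output column by position is to look up its name first and build the
formula from that name. This is only a minimal sketch under stated
assumptions: it uses base R alone, the names `out`, `fml` and `fit` are
invented for this illustration, lm() merely stands in for any formula-based
fitting function, and the column names are assumed to be syntactic (no
blanks or special characters).

```r
mytable <- airquality
colnames(mytable) <- c("a", "b", "c", "d", "e", "f")

out <- 6   # position of the output column (hypothetical choice)

# Turn the position into a name, then into the formula f ~ .
fml <- as.formula(paste(names(mytable)[out], "~ ."))

# Because the response is now a column *name*, "." excludes that column.
fit <- lm(fml, data = mytable)
names(coef(fit))
```

With the response given by name, the documented meaning of "." applies as
intended, so no change to R itself would be needed for this use case.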
> Regards,
> Paulito
>
>
>
>
> ----- Original Message -----
> From: Gerrit Eichner <gerrit.eich...@math.uni-giessen.de>
> To: Paulito Palmes <ppal...@yahoo.com>
> Cc: "r-help@r-project.org" <r-help@r-project.org>
> Sent: Thursday, 12 September 2013, 8:53
> Subject: Re: [R] Formula in a model
>
> Hello, Paulito,
>
> my comments are inline below:
>
> > Thanks for the explanation. Let me give a specific example. Assume Temp
> > (column 4) is the output and the rest of the columns are the input
> > training features. Note that I only use the air quality data for
> > illustration purposes. The input->output mapping may not make sense in
> > the real interpretation of this data.
> >
> > library(e1071)
> >
> > data(airquality)
> > mytable=airquality
> >
> > colnames(mytable)=c('a','b','c','d','e','f')
> >
> > modelSVM1=svm(mytable[,6] ~ .,data=mytable)
> > modelSVM2=svm(mytable[,-6],mytable[,6])
> > modelSVM3=svm(f ~ ., data=mytable)
> >
> > predSVM1=predict(modelSVM1,newdata=mytable)
> > predSVM2=predict(modelSVM2,newdata=mytable[,-6])
> > predSVM3=predict(modelSVM3,newdata=mytable)
> >
> > The results of predSVM2 are similar to those of predSVM3 but different
> > from those of predSVM1.
>
> Well, because already modelSVM1 is different from the other two. This is
> due to how the "." on the rhs of a formula is interpreted. From the help
> page of formula:
>
>     "There are two special interpretations of . in a formula. The
>     usual one is in the context of a data argument of model fitting
>     functions and means 'all columns not otherwise in the formula':
>     see terms.formula. In the context of update.formula, only, it
>     means 'what was previously in this part of the formula'."
>
> The first interpretation applies to your situation. With the formula for
> your modelSVM1 the function model.matrix() (which is called inside the
> formula version of svm()) creates a model matrix after looking for a
> column "mytable[,6]" in the data argument. And since there is no column
> with that name, it takes all columns of mytable (including the 6th, i.e.,
> the one named "f"). See what model.matrix() does in that case:
>
> > head( model.matrix(mytable[,6] ~ .,data=mytable), 3)
>    (Intercept)  a   b    c  d e f
> 1           1 41 190  7.4 67 5 1
> 2           1 36 118  8.0 72 5 2
> 3           1 12 149 12.6 74 5 3
>
>
>
> In the case of modelSVM3 model.matrix() does find column "f" in the data
> argument, and hence omits this column in forming the terms of the rhs of
> the formula:
>
> > head( model.matrix( f ~ .,data=mytable), 3)
>    (Intercept)  a   b    c  d e
> 1           1 41 190  7.4 67 5
> 2           1 36 118  8.0 72 5
> 3           1 12 149 12.6 74 5
>
>
>
> The call to svm() for modelSVM2 is the (non-formula) default version and
> does not need to call model.matrix() because (so to say) it expects that
> the user has done that already by supplying the response to its argument y
> and the adequately formed data matrix to its argument x.
>
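
The difference between the two formula versions described above can be
checked without fitting any model. The following is a minimal sketch using
base R only; `mm1` and `mm3` are names made up for this illustration and
correspond to the formulas of modelSVM1 and modelSVM3.

```r
mytable <- airquality
colnames(mytable) <- c("a", "b", "c", "d", "e", "f")

# Response given as an expression: no column is called "mytable[, 6]",
# so "." expands to all six columns, including f.
mm1 <- model.matrix(mytable[, 6] ~ ., data = mytable)

# Response given by name: "." leaves out column f.
mm3 <- model.matrix(f ~ ., data = mytable)

colnames(mm1)                      # contains "f"
colnames(mm3)                      # does not contain "f"
setdiff(colnames(mm1), colnames(mm3))
```

If setdiff() reports a column here, the response has leaked into the
predictors, which is what happens with modelSVM1.
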
>
> > Question: Which is the correct formulation?
>
> The second and the third (for a sensible purpose), unless you want to
> experiment with svm() to see what happens if one does something rather
> nonsensical.
>
>
> > Why R doesn't detect error/discrepancy in formulation?
>
> Because R, or in this case rather the concept of a formula and the
> function model.matrix() are not designed to replace the user who knows
> what s/he is doing after having read the documentation. ;)
>
>
>
> > If I use the same formulation with rpart using the same data:
> >
> > library(rpart)
> >
> > data(airquality)
> > mytable=airquality
> >
> > colnames(mytable)=c('a','b','c','d','e','f')
> >
> > modelRP1=rpart(mytable[,6]~.,data=mytable,method='anova') # this works
> > modelRP3=rpart(f ~ ., data=mytable,method='anova') # this works
> >
> > predRP1=predict(modelRP1,newdata=mytable)
> > predRP3=predict(modelRP3,newdata=mytable)
> >
> >
> > The results between predRP1 and predRP3 are different while the
> > statements:
> >
> > modelRP2=rpart(mytable[,-6],mytable[,6],method='anova')
> > predRP2=predict(modelRP2,newdata=mytable[,-6])
> >
> > have errors.
>
> This is presumably due to the same reasons as described above.
>
>
> Remark: It is generally - for various reasons - recommended to use "<-" as
> the assignment operator, not "=". (And I like to recommend using blanks
> to increase the readability of code.)
>
> [... snip ...]
>
>
>   I hope the fog has lifted  --  Gerrit
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm


