Hi Gerrit,

Thank you very much for the precise explanation.

Syntactically, I thought R was smart enough to detect that I'm using one of the columns, because the data=mytable syntax means that the input/output information is all in mytable. For generic support, I think it would be wise to support this syntax:

    genericModel(table[,columnLists] ~ ., data=table)

because in many cases where you have hundreds of columns, you don't know the headers but you do know the column positions of your inputs and outputs. You may ask, why not use

    genericModel(table[,inputColumns], table[,outputColumns])

instead? The formula expression shows more flexibility and elegance. Can this become a feature in the future? Or at least, could R be smart enough to detect that the output column is part of the input columns? I'm not sure how many people will make the mistake of using this expression in the future, especially when dealing with many columns, where the easiest way to access them is by column number instead of by header. The current behaviour is sensible once you understand how R interprets the formula, but syntactically it makes sense to have the expression:

    mytable[,outputColumns] ~ .

Regards,
Paulito

----- Original Message -----
From: Gerrit Eichner <gerrit.eich...@math.uni-giessen.de>
To: Paulito Palmes <ppal...@yahoo.com>
Cc: "r-help@r-project.org" <r-help@r-project.org>
Sent: Thursday, 12 September 2013, 8:53
Subject: Re: [R] Formula in a model

Hello, Paulito,

my comments are inline below:

> Thanks for the explanation. Let me give a specific example. Assume Temp
> (column 4) is the output and the rest of the columns are the input
> training features. Note that I only use the air quality data for
> illustration purposes. The input->output mapping may not make sense in the
> real interpretation of this data.
>
> library(e1071)
> data(airquality)
> mytable=airquality
> colnames(mytable)=c('a','b','c','d','e','f')
>
> modelSVM1=svm(mytable[,6] ~ .,data=mytable)
> modelSVM2=svm(mytable[,-6],mytable[,6])
> modelSVM3=svm(f ~ ., data=mytable)
>
> predSVM1=predict(modelSVM1,newdata=mytable)
> predSVM2=predict(modelSVM2,newdata=mytable[,-6])
> predSVM3=predict(modelSVM3,newdata=mytable)
>
> The results of predSVM2 are similar to those of predSVM3 but different from predSVM1.

Well, because modelSVM1 is already different from the other two. This is due to how the "." on the rhs of a formula is interpreted. From the help page of formula:

"There are two special interpretations of . in a formula. The usual one is in the context of a data argument of model fitting functions and means 'all columns not otherwise in the formula': see terms.formula. In the context of update.formula, only, it means 'what was previously in this part of the formula'."

The first interpretation applies to your situation. With the formula for your modelSVM1 the function model.matrix() (which is called inside the formula version of svm()) creates a model matrix after looking for a column "mytable[,6]" in the data argument. And since there is no column with that name, it takes all columns of mytable (including the 6th, i.e., the one named "f").
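The expansion of "." described above can also be inspected directly with terms(), without fitting any model. The following is a minimal sketch, assuming only base R and the renamed airquality columns from the quoted code; the variable names rhs_by_name and rhs_by_index are made up for illustration.

```r
# Sketch: how "." expands for the two formulas (base R only, no svm() needed).
mytable <- airquality
colnames(mytable) <- c("a", "b", "c", "d", "e", "f")

# Response given by name: "f" is recognised as a column of the data
# and is therefore left out of the right-hand side.
rhs_by_name <- attr(terms(f ~ ., data = mytable), "term.labels")

# Response given as an expression: no column is called "mytable[, 6]",
# so "." expands to every column of the data, including "f".
rhs_by_index <- attr(terms(mytable[, 6] ~ ., data = mytable), "term.labels")

print(rhs_by_name)
print(rhs_by_index)
```

The second vector should contain "f" in addition to "a" through "e", which is exactly the situation of modelSVM1.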
See what model.matrix() does in that case:

  > head( model.matrix(mytable[,6] ~ .,data=mytable), 3)
    (Intercept)  a   b    c  d e f
  1           1 41 190  7.4 67 5 1
  2           1 36 118  8.0 72 5 2
  3           1 12 149 12.6 74 5 3

In the case of modelSVM3 model.matrix() does find column "f" in the data argument, and hence omits this column in forming the terms of the rhs of the formula:

  > head( model.matrix( f ~ .,data=mytable), 3)
    (Intercept)  a   b    c  d e
  1           1 41 190  7.4 67 5
  2           1 36 118  8.0 72 5
  3           1 12 149 12.6 74 5

The call to svm() for modelSVM2 is the (non-formula) default version and does not need to call model.matrix() because (so to say) it expects that the user has done that already by supplying the response to its argument y and the adequately formed data matrix to its argument x.

> Question: Which is the correct formulation?

The second and the third (for a sensible purpose), unless you want to experiment with svm() to see what happens if one does something rather nonsensical.

> Why R doesn't detect error/discrepancy in formulation?

Because R, or in this case rather the concept of a formula and the function model.matrix() are not designed to replace the user who knows what s/he is doing after having read the documentation. ;)

> If I use the same formulation with rpart using the same data:
>
> library(rpart)
> data(airquality)
> mytable=airquality
> colnames(mytable)=c('a','b','c','d','e','f')
>
> modelRP1=rpart(mytable[,6]~.,data=mytable,method='anova') # this works
> modelRP3=rpart(f ~ ., data=mytable,method='anova') # this works
>
> predRP1=predict(modelRP1,newdata=mytable)
> predRP3=predict(modelRP3,newdata=mytable)
>
> The results between predRP1 and predRP3 are different while the statements:
>
> predRP2=predict(modelRP2,newdata=mytable[,-6])
> modelRP2=rpart(mytable[,-6],mytable[,6],method='anova')
>
> have errors.

This is presumably due to the same reasons as described above.

Remark: It is generally - for various reasons - recommended to use "<-" as the assignment operator, not "=".
(And I would also recommend using blanks to increase the readability of code.)

[... snip ...]

I hope the fog has lifted -- Gerrit
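For the column-position use case raised at the top of this thread, one possible workaround is to look up the column names by position and build the formula from them, so that the response is referred to by name and "." (or an explicit list of terms) behaves as intended. This is only a sketch under the assumption that all column names are syntactically valid R names; out_col, response, predictors, fml and fml_dot are hypothetical names chosen for this example.

```r
# Sketch: build a formula from column positions instead of headers.
mytable <- airquality
colnames(mytable) <- c("a", "b", "c", "d", "e", "f")

out_col    <- 6                          # position of the response column
response   <- names(mytable)[out_col]    # its name, looked up by position
predictors <- names(mytable)[-out_col]   # names of all remaining columns

# reformulate() from the stats package assembles response ~ term1 + term2 + ...
fml <- reformulate(predictors, response = response)
print(fml)

# Alternative that keeps the "." shorthand: only the response name is pasted in.
fml_dot <- as.formula(paste(response, "~ ."))

# The response column no longer appears among the model matrix columns.
mm <- model.matrix(fml, data = mytable)
print(colnames(mm))
```

Either formula can then be passed to a formula-based fitting function together with data = mytable, in the same way as f ~ . in modelSVM3.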