This is where a small, reproducible example will definitely help us discover your problem.
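
Something along these lines is the kind of example that would help. It is only a sketch, since I have not seen your data: the data frames, the column names (color, size, y) and the level order are all made up for illustration, and the svm() call is skipped unless e1071 is installed. It shows the character columns being turned into factors with one shared set of levels, so that the test data are encoded the same way as the training data even when a color is missing from the test set.

```r
# Made-up data standing in for the CSV; all names here are hypothetical.
train <- data.frame(
  color = c("red", "red", "blue", "green", "blue", "green"),
  size  = c(1.0, 1.2, 3.1, 2.0, 2.9, 2.2),
  y     = c("a", "a", "b", "b", "b", "a"),
  stringsAsFactors = FALSE
)
test <- data.frame(
  color = c("red", "green"),   # note: no "blue" in the test data
  size  = c(1.1, 2.1),
  stringsAsFactors = FALSE
)

# Fix the level set once and apply it to both data sets, so a color
# that is absent from the test data does not change the encoding.
color_levels <- c("red", "green", "blue")
train$color <- factor(train$color, levels = color_levels)
test$color  <- factor(test$color,  levels = color_levels)
train$y     <- factor(train$y)

str(train)

# The dummy coding that a formula interface derives from the factor.
mm_train <- model.matrix(~ color + size, data = train)
mm_test  <- model.matrix(~ color + size, data = test)
print(mm_test)

# Optional: fit and predict only if e1071 is available.
if (requireNamespace("e1071", quietly = TRUE)) {
  fit <- e1071::svm(y ~ color + size, data = train)
  print(predict(fit, newdata = test))
}
```

If this runs for you but your real call still fails, please post the exact svm() call, the error message, and the output of str() on the offending columns.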

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Noah Silverman
Sent: Wednesday, August 12, 2009 4:29 PM
To: Achim Zeileis
Cc: r help
Subject: Re: [R] Nominal variables in SVM?

Thanks for all the suggestions.

My data was loaded in from a csv file with about 80 columns (3 of these
columns are nominal), with no specific settings for the nominal columns.

Currently, if I call svm (e1071), I get an error about the nominal column.

Do I need to tell R to change the column to a factor?
i.e. foo$color <- factor(foo$color)

On 8/12/09 2:21 PM, Achim Zeileis wrote:
> On Wed, 12 Aug 2009, Noah Silverman wrote:
>
>> Hi,
>>
>> The answers to my previous question about nominal variables have led
>> me to a more important question.
>>
>> What is the "best practice" way to feed nominal variables to an SVM?
>
> As some of the previous posters have already indicated: The data
> structure for storing categorical (including nominal) variables in R
> is a "factor".
>
> Your comment about "truly nominal" is wrong. A character variable is a
> character variable, not necessarily a categorical variable.
> Categorical means that the answer falls into one of a finite number of
> known categories, known as "levels" in R's "factor" class.
>
> If you start out from character information:
>
>   x <- c("red", "red", "blue", "green", "blue")
>
> You can turn it into a factor via:
>
>   x <- factor(x, levels = c("red", "green", "blue"))
>
> R now knows how to do certain things with such a variable, e.g.,
> produces useful summaries or knows how to deal with it in regression
> problems:
>
>   model.matrix(~ x)
>
> which seems to be what you asked for. Moreover, you don't need to call
> this yourself, as most regression functions in R will do that for you
> (including svm() in "e1071" or ksvm() in "kernlab", among others).
>
> In short: Keep your categorical variables as "factor" columns in a
> "data.frame" and use the formula interface of svm()/ksvm() and you are
> fine.
> Z
>
>
>> For example:
>> color = ("red", "blue", "green")
>>
>> I could translate that into an index so I wind up with
>> color = (1,2,3)
>>
>> But my concern is that the SVM will now think that the values are
>> numeric in "range" and not discrete conditions.
>>
>> Another thought would be to create 3 binary variables from the single
>> color variable, so I have:
>>
>> red = (0,1)
>> blue = (0,1)
>> green = (0,1)
>>
>> An example fed to the SVM would have one positive and two negative
>> values to indicate the color value:
>> i.e. for a blue example:
>> red = 0, blue = 1, green = 0
>>
>> Or, do any of the SVM packages intelligently handle this internally
>> so that I don't have to mess with it? If so, do I need to be
>> concerned about different "translation" of the data if the test data
>> set isn't exactly the same as the training set?
>> For example:
>> training data = color ("red", "blue", "green")
>> test data = color ("red", "green")
>>
>> How would I be sure that the "red" and "green" examples get encoded
>> the same so that the SVM is accurate?
>>
>> Thanks in advance!!
>>
>> -N
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>

[[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.