On 27/09/11 01:58, R. Michael Weylandt wrote:
Why exactly do you want to "stabilize" your results?
If it's in preparation for publication/classroom demo/etc., certainly
resetting the seed before each run (and hence getting the same sample()
output) will make your results exactly reproducible. However, if you are
looking for a clearer picture of the true efficacy of your svm and
there's no real underlying order to the data set (i.e., not a time
series), then a straight sample() seems better to me.
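(A quick, untested illustration of that difference, with "myData" standing in
for your data frame:)
set.seed(23)
s1 <- sample(1:nrow(myData), trunc(nrow(myData)/3))
set.seed(23)
s2 <- sample(1:nrow(myData), trunc(nrow(myData)/3))
identical(s1, s2)   # TRUE: the same test indices on every run
s3 <- sample(1:nrow(myData), trunc(nrow(myData)/3))
identical(s1, s3)   # almost certainly FALSE: a fresh split each time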
I'm not particularly well read in the svm literature, but it sounds like
you are worried about widely varying performance of the svm itself. If
that's the case, it seems (to me at least) that there are certain data
points that are strongly informative, and it might be a more interesting
question to look into which ones those are.
I guess my answer, as a total non-savant in the field, is that it
depends on your goal: repeated runs with sample will give you more
information about the strength of the svm while setting the seed will
give you reproducibility. Importance sampling might be of interest,
particularly if it could be tied to the information content of each data
point, and a quick skim of the MC variance reduction literature might
just provide some fun insights.
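For concreteness, here is a rough, untested sketch of the repeated-runs
idea; it assumes the e1071 package and the myData / Factor names from your
code below, and 100 splits is an arbitrary choice:
library(e1071)
accs <- replicate(100, {          # 100 random train/test splits
    testindex <- sample(1:nrow(myData), trunc(nrow(myData)/3))
    fit <- svm(Factor ~ ., data = myData[-testindex, ])
    pred <- predict(fit, myData[testindex, ])
    mean(pred == myData$Factor[testindex])
})
summary(accs)   # how much the accuracy moves from split to split
hist(accs)
The spread of accs is exactly what changes each time you re-draw the split,
and it is arguably more informative than the accuracy of any single split.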
I'm not entirely sure how you mean to bootstrap the act of setting the
seed (a randomly set seed seems to be the same as not setting a seed at
all) but that might give you a nice middle ground.
Sorry this can't be of more help,
Michael
On Mon, Sep 26, 2011 at 6:32 PM, Riccardo G-Mail <ric.rom...@gmail.com> wrote:
Hi, I'm working with support vector machines for classification, and I
have a problem with the accuracy of my predictions.
I divided my data set into a training set (2/3 of the entire data set)
and a test set (1/3) using the "sample" function. Each time I fit the
svm model I obtain a different result, depending on the output of
"sample". I would like to "stabilize" the performance of my analysis.
To do this I used the "set.seed" function. Is there a better way to do
this? Should I perform a bootstrap on my work-flow (sample and svm)?
Here is an example of my workflow:
### not to run verbatim: 'myData' stands for my real data frame
library(e1071)   # svm() and tune.svm()
index <- 1:nrow(myData)
set.seed(23)
testindex <- sample(index, trunc(length(index)/3))
testset  <- myData[testindex, ]
trainset <- myData[-testindex, ]
## tune cost and gamma on the training set (example search grid; adjust to the data)
tuned <- tune.svm(Factor ~ ., data = trainset,
                  gamma = 2^(-4:0), cost = 2^(0:4))
## fit on the training set with the tuned parameters
svm.model <- svm(Factor ~ ., data = trainset,
                 gamma = tuned$best.parameters$gamma,
                 cost  = tuned$best.parameters$cost,
                 cross = 10)
summary(svm.model)
predict(svm.model, testset)
Best
Riccardo
Thanks for your suggestion. I agree with you about the uselessness of
set.seed inside a bootstrap; the idea of a bootstrap excludes setting the seed.
In my mind the bootstrap could allow me to understand the distribution
of the prediction accuracy of the model. My doubt stems from the fact
that I'm not a statistician.
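One way to make that concrete (a sketch only, again assuming e1071 and
the myData / Factor names from my earlier code; 200 resamples is an
arbitrary choice) is to refit on bootstrap resamples of the rows and
score each fit on the rows left out of that resample:
library(e1071)
boot.acc <- replicate(200, {
    idx <- sample(1:nrow(myData), nrow(myData), replace = TRUE)   # bootstrap resample
    oob <- setdiff(1:nrow(myData), idx)                           # out-of-bag rows
    fit <- svm(Factor ~ ., data = myData[idx, ])
    pred <- predict(fit, myData[oob, ])
    mean(pred == myData$Factor[oob])
})
quantile(boot.acc, c(0.025, 0.5, 0.975))   # rough picture of the accuracy distribution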
Best
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.