On 27/09/11 01:58, R. Michael Weylandt wrote:
Why exactly do you want to "stabilize" your results?

If it's in preparation for publication/classroom demo/etc., certainly
resetting the seed before each run (and hence getting the same sample()
output) will make your results exactly reproducible. However, if you are
looking for a clearer picture of the true efficacy of your svm and
there's no real underlying order to the data set (i.e., not a time
series), then a straight sample() seems better to me.
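
(For concreteness, a two-line sketch of what resetting the seed buys you; the particular values drawn don't matter:)

set.seed(23); sample(10, 3)   # draws three row indices
set.seed(23); sample(10, 3)   # same seed, same draw: a split built this way is reproducible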

I'm not particularly well read in the svm literature, but it sounds like
you are worried about the widely varying performance of the svm itself.
If that's the case, it seems (to me at least) that certain data points
are strongly informative, and it might be a more interesting question to
look into which ones those are.

I guess my answer, as a total non-savant in the field, is that it
depends on your goal: repeated runs with sample() will give you more
information about the strength of the svm (sketched below), while
setting the seed will give you reproducibility. Importance sampling
might be of interest, particularly if it could be tied to the
information content of each data point, and a quick skim of the MC
variance-reduction literature might just provide some fun insights.
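
Here is a minimal sketch of the repeated-splits idea, assuming a data frame
myData with a factor response Factor, the e1071 package, and placeholder
cost/gamma values (tune them as you already do):

library(e1071)                            # svm(), predict()

n   <- nrow(myData)
R   <- 100                                # number of random train/test splits
acc <- numeric(R)

for (r in seq_len(R)) {
    test.id <- sample(n, trunc(n/3))      # a fresh 1/3 test split on every run
    fit     <- svm(Factor ~ ., data = myData[-test.id, ], cost = 1, gamma = 0.1)
    pred    <- predict(fit, myData[test.id, ])
    acc[r]  <- mean(pred == myData$Factor[test.id])
}

summary(acc)                              # distribution of test accuracy across splits
sd(acc)                                   # how much the splits really vary

The spread of acc is exactly the variability you are seeing from sample(),
just summarized instead of hidden behind a single seed.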

I'm not entirely sure how you mean to bootstrap the act of setting the
seed (a randomly set seed seems to be the same as not setting a seed at
all) but that might give you a nice middle ground.

Sorry this can't be of more help,

Michael

On Mon, Sep 26, 2011 at 6:32 PM, Riccardo G-Mail <ric.rom...@gmail.com> wrote:

    Hi, I'm working with support vector machines for classification, and
    I have a question about prediction accuracy.

    I split my data set into a test set (1/3 of the entire data set,
    drawn with the "sample" function) and a training set (the remaining
    2/3). Each time I fit the svm model I obtain a different result,
    depending on the output of "sample". I would like to "stabilize" the
    performance of my analysis, and to do this I used the "set.seed"
    function. Is there a better way? Should I perform a bootstrap on my
    workflow (sample and svm)?

    Here is an example of my workflow:
    ### not run
    library(e1071)                     # provides svm() and tune.svm()

    index <- 1:nrow(myData)
    set.seed(23)                       # fixes the random split for reproducibility
    testindex <- sample(index, trunc(length(index)/3))
    testset   <- myData[testindex, ]   # 1/3 of the rows, held out for testing
    trainset  <- myData[-testindex, ]  # remaining 2/3, used for fitting

    ## choose cost and gamma on the training set (the grid is only illustrative)
    tuned <- tune.svm(Factor ~ ., data = trainset,
                      gamma = 10^(-3:0), cost = 10^(0:2))

    ## refit on the training set with the tuned parameters
    svm.model <- svm(Factor ~ ., data = trainset,
                     gamma = tuned$best.parameters$gamma,
                     cost  = tuned$best.parameters$cost,
                     cross = 10)
    summary(svm.model)
    predict(svm.model, testset)

    Best
    Riccardo



Thanks for your suggestion. I agree with you about the uselessness of set.seed inside a bootstrap; the idea of a bootstrap excludes set.seed. In my mind, a bootstrap could allow me to understand the distribution of the model's prediction accuracy, along the lines of the sketch below. My doubt stems from the fact that I'm not a statistician.
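
A minimal sketch of that idea, again assuming myData with a factor column
Factor and the e1071 package, with fixed placeholder cost/gamma (in practice
they would come from tuning):

library(e1071)

n   <- nrow(myData)
B   <- 200                                     # number of bootstrap replicates
acc <- numeric(B)

for (b in seq_len(B)) {
    boot.id <- sample(n, replace = TRUE)       # bootstrap sample of row indices
    oob.id  <- setdiff(seq_len(n), boot.id)    # out-of-bag rows act as the test set
    fit     <- svm(Factor ~ ., data = myData[boot.id, ], cost = 1, gamma = 0.1)
    pred    <- predict(fit, myData[oob.id, ])
    acc[b]  <- mean(pred == myData$Factor[oob.id])
}

quantile(acc, c(0.025, 0.5, 0.975))            # spread of out-of-bag accuracy
hist(acc, main = "Out-of-bag accuracy over bootstrap replicates")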

Best

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
