Hi,
I'm studying SVMs and found that if I run an SVM in R, Weka, and Python, the
results differ. To eliminate possible pitfalls, I decided to use the standard
iris dataset and wrote implementations in R, Weka, and Python for the same
SVM/kernel. I think the choice of kernel does not matter as long as it is
consistent across implementations. I excluded cross-validation, since my
Python script does not use it, and tried to keep a consistent set of input
parameters across all implementations (I went through them all and the
defaults seem consistent). Weka and Python both produce an identical confusion
matrix, but the R results stand apart (I tried both e1071 and kernlab; they
are consistent with each other, but differ from Weka/Python). That's why I
decided to post to the R community and ask for help identifying the "problem"
(if any), or for a reasonable explanation of why the R results can differ.
Please note that all implementations use libsvm underneath (at least, that is
what I gathered from reading), so I would expect the results to be the same.
I understand that seeds may differ, but I used the entire dataset without any
sampling; maybe there is internal normalization?
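On the normalization question, one thing worth ruling out (this is a guess on my part): e1071's svm() and kernlab's ksvm() scale the data by default, while sklearn's SVC and Weka's SMO with -N 2 leave it raw, and with an RBF kernel a fixed gamma is sensitive to feature scale. A minimal pure-Python sketch of that sensitivity (the two sample vectors are iris-like measurements and the per-feature standard deviations are rough illustrative values, not exact iris statistics):

```python
import math

def rbf(x, z, gamma=0.01):
    """RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

# two iris-like samples (sepal/petal measurements in cm)
x = [5.1, 3.5, 1.4, 0.2]
z = [7.0, 3.2, 4.7, 1.4]

# rough per-feature standard deviations, for illustration only
sd = [0.83, 0.44, 1.77, 0.76]
xs = [a / s for a, s in zip(x, sd)]
zs = [a / s for a, s in zip(z, sd)]

k_raw = rbf(x, z)      # kernel value on raw features
k_scaled = rbf(xs, zs) # kernel value after rescaling
print(k_raw, k_scaled) # the two values differ, so it is a different model
```

If this is the cause, passing scale=FALSE to svm() (or scaled=FALSE to ksvm()) should be a quick way to test it.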

I'm posting the code for all implementations along with the confusion-matrix
outputs. Feel free to reproduce and comment.

Thanks,
Valentin.

Weka:
--------------------------------------------------
#!/usr/bin/env bash
# set path to Weka
export CLASSPATH=/Applications/weka-3-6-9.app/Contents/Resources/Java/weka.jar
data=./iris.arff
kernel="weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 0.01"
c=1.0
t=0.001
# -V The number of folds for the internal cross-validation. (default -1, use training data)
# -N Whether to 0=normalize/1=standardize/2=neither. (default 0=normalize)
# -W The random number seed. (default 1)
#opts="-C $c -L $t -N 2 -V -1 -W 1"
opts="-C $c -L $t -N 2"
cmd="java weka.classifiers.functions.SMO"
if [ "$1" == "help" ]; then
    $cmd
    exit 0
fi
$cmd $opts -K "$kernel" -t $data

--------------------------------------------------

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  5 45 |  c = Iris-virginica

Python:
--------------------------------------------------
from sklearn import svm, datasets
from sklearn.metrics import classification_report, confusion_matrix

def report(clf, x_test, y_test):
    y_pred = clf.predict(x_test)
    print(clf)
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))

def classifier():
    # import some data to play with
    iris = datasets.load_iris()
    x_train = iris.data
    y_train = iris.target
    regC = 1.0  # SVM regularization parameter
    clf = svm.SVC(kernel='rbf', gamma=0.01, C=regC).fit(x_train, y_train)
    report(clf, x_train, y_train)

if __name__ == '__main__':
    classifier()
--------------------------------------------------

[[50  0  0]
 [ 0 47  3]
 [ 0  5 45]]

R:
--------------------------------------------------
library(kernlab)
library(e1071)

# load data
data(iris)

# run svm algorithm (e1071 library) for given vector of data and kernel
model <- svm(Species~., data=iris, kernel="radial", gamma=0.01)
print(model)
# the last column of this dataset is what we'll predict, so we'll exclude it
prediction <- predict(model, iris[,-ncol(iris)])
# the last column holds the true labels we check predictions against
tab <- table(pred = prediction, true = iris[,ncol(iris)])
print(tab)
cls <- classAgreement(tab)
msg <- sprintf("Correctly classified: %f, kappa %f", cls$diag, cls$kappa)
print(msg)
--------------------------------------------------

            true
pred         setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         46        11
  virginica       0          4        39
[1] "Correctly classified: 0.900000, kappa 0.850000"
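As a sanity check on the numbers, the 0.90 "correctly classified" figure R reports is just the diagonal mass of the table above; the same count on the Weka/Python matrix gives 142/150 ≈ 0.947:

```python
# confusion matrix from the R output (rows = predicted, cols = true)
tab = [[50,  0,  0],
       [ 0, 46, 11],
       [ 0,  4, 39]]

total = sum(sum(row) for row in tab)            # 150 samples
correct = sum(tab[i][i] for i in range(len(tab)))  # 50 + 46 + 39 = 135
print(correct / total)  # 135/150 = 0.9
```

So the gap is 7 extra misclassifications in R, all between versicolor and virginica.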
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
