Dear R community,
 I recently discovered the package oblique.tree, which was a pleasant
surprise, since I have written my own classifier that also uses oblique
splits (splits by means of hyperplanes). I would now like to compare the
two classifiers.

 What I do not understand is why the function predict.oblique.tree
requires the dependent variable to be present in `newdata`.
I set update.tree.predictions to FALSE and used the formula interface
when creating the model (y ~ .).
Is there a way to avoid this behaviour? Or should I just add a dummy
dependent-variable column to my test set in order to use the prediction
function? In the latter case, can I be sure that the dependent variable
is never used in the prediction procedure?

I would be really grateful for any tips regarding this problem.

Reproducible code:

# -------------------------------------------------------------------------------------------------------
library(oblique.tree)

N <- 100; nvars <- 3;
x <- array(rnorm(n = N*nvars), c(N,nvars))
y <- as.factor(sample(0:1, size = N, replace = T))

m <- data.frame(x,y);
var_names <- colnames(m);
var_x_names <- var_names[-length(var_names)]
n_train <- floor(N/2); n_test <- N - n_train;
train <- m[1:n_train,]; test <- m[-(1:n_train),];

bot <- oblique.tree(formula = y ~., data = train,
 oblique.splits = "on", variable.selection = "none",
 split.impurity = "gini");

 ### If the dependent variable is excluded from `newdata`, the code fails
 ### with this error:
 # Error in model.frame.default(formula = as.formula(eval(object$call$formula)),  :
 #   variable lengths differ (found for 'X1')
 # In addition: Warning message:
 # 'newdata' had 50 rows but variable(s) found have 100 rows

 pred <- predict(bot, newdata = train[, var_x_names],
  type="vector", update.tree.predictions = F)

  ### No error occurs if the dependent variable is included in `newdata`
  pred <- predict(bot, newdata = train[, var_names],
  type="vector", update.tree.predictions = F)

### However, the result of the prediction does not seem to depend on
###    the values of the dependent variable included in the data
    pred1 <- predict(bot, newdata = test[, var_names],
  type="vector", update.tree.predictions = F);
    test$y <- as.factor(sample(0:1, size = dim(test)[1], replace = T))
    pred2 <- predict(bot, newdata = test[, var_names],
  type="vector", update.tree.predictions = F);
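    # Largest absolute difference; abs(mean(...)) could hide differences
    # of opposite sign that cancel each other out.
    max_diff <- max(abs(pred1[,1] - pred2[,1]))
    max_diff

    if (max_diff > 1e-3) {
    print("Results do differ.");
    }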

    ### What is more curious is that the error message changes if I
    ###    write my data.frame and then read it again.
write.table(m, file = "m.txt", col.names = T, row.names = F, quote = F)
rm(list = ls());

m <- read.table("m.txt", header = T, colClasses = "numeric");
m$y <- as.factor(m$y);
var_names <- colnames(m);
var_x_names <- var_names[-length(var_names)]
N <- dim(m)[1];
n_train <- floor(N/2); n_test <- N - n_train;
train <- m[1:n_train,]; test <- m[-(1:n_train),]; rm(m);

bot <- oblique.tree(formula = y ~., data = train,
 oblique.splits = "on", variable.selection = "none",
 split.impurity = "gini");

 ### If the dependent variable is excluded from `newdata`, the code now fails
 ### with a different error:
 # Error in eval(expr, envir, enclos) : object 'y' not found

 pred <- predict(bot, newdata = train[, var_x_names],
  type="vector", update.tree.predictions = F)

# -------------------------------------------------------------------------------------------------------
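
A possible explanation for the two different error messages, sketched with
base R only. oblique.tree is not called here, and the sketch rests on an
assumption: that predict.oblique.tree builds a model frame from the full
formula, as the model.frame.default call in the first error suggests. In
that case `y` is looked up first in `newdata` and then in the formula's
environment. If a 100-row `y` is still visible there, the lengths clash;
after rm(list = ls()) there is no `y` left to find. The names d, train_x,
env_with_y and env_without_y are illustrative only.

```r
# Base-R sketch; nothing here comes from the oblique.tree package.
set.seed(1)
d <- data.frame(X1 = rnorm(10), X2 = rnorm(10), y = factor(rep(0:1, 5)))
train_x <- d[1:5, c("X1", "X2")]          # predictors only, like `newdata`

# Case 1: no `y` anywhere (mimics the session after rm(list = ls())).
env_without_y <- new.env(parent = baseenv())
f1 <- y ~ X1 + X2
environment(f1) <- env_without_y
msg_missing <- tryCatch(model.frame(f1, data = train_x),
                        error = function(e) conditionMessage(e))

# Case 2: a longer `y` is visible from the formula's environment
# (mimics the first session, where the full-length `y` still exists).
env_with_y <- new.env(parent = baseenv())
assign("y", d$y, envir = env_with_y)      # 10 values vs. 5 rows in train_x
f2 <- y ~ X1 + X2
environment(f2) <- env_with_y
msg_length <- tryCatch(model.frame(f2, data = train_x),
                       error = function(e) conditionMessage(e))

# Dropping the response from the terms removes the need for `y` entirely.
tt <- delete.response(terms(f1))
mf <- model.frame(tt, data = train_x)

print(msg_missing)
print(msg_length)
print(dim(mf))
```

If this is indeed what happens, a dummy column only satisfies the variable
lookup, which would agree with the observation that the predictions do not
change when `y` is resampled. This has not been checked against the package
source.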

-- 
Sincerely yours,
Yulia Matveyeva,
Department of Statistical Modelling,
Faculty of Mathematics and Mechanics,
St Petersburg State University, Russia


