This reproducible example is from the help page of the 'gbm' package in R.

I ran the following code in R, and it works fine as long as the response is
numeric. The problem starts when I convert the response from numeric to
binary (0/1): gbm then throws an error.

My question is: can converting the response from numeric to binary really
have this much of an effect?

Help page code:

N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]

SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)

# introduce some missing values
X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA

data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# fit initial model
gbm1 <-
  gbm(Y~X1+X2+X3+X4+X5+X6,         # formula
      data=data,                   # dataset
      var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
      # +1: monotone increase,
      #  0: no monotone restrictions
      distribution="gaussian",     # see the help for other choices
      n.trees=1000,                # number of trees
      shrinkage=0.05,              # shrinkage or learning rate,
      # 0.001 to 0.1 usually work
      interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
      bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
      train.fraction = 0.5,        # fraction of data for training,
      # first train.fraction*N used for training
      n.minobsinnode = 10,         # minimum total weight needed in each node
      cv.folds = 3,                # do 3-fold cross-validation
      keep.data=TRUE,              # keep a copy of the dataset with the object
      verbose=FALSE)               # don't print out progress

gbm1
summary(gbm1)


Now I slightly change the response variable to make it binary.

m <- mean(Y)    #My edit: fix the threshold first, since mean(Y) shifts after the first assignment
Y[Y < m] <- 0   #My edit
Y[Y >= m] <- 1  #My edit
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
fmla = as.formula(factor(Y)~X1+X2+X3+X4+X5+X6) #My edit

gbm2 <-
  gbm(fmla,                        # formula
      data=data,                   # dataset
      distribution="bernoulli",     # My edit
      n.trees=1000,                # number of trees
      shrinkage=0.05,              # shrinkage or learning rate,
      # 0.001 to 0.1 usually work
      interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
      bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
      train.fraction = 0.5,        # fraction of data for training,
      # first train.fraction*N used for training
      n.minobsinnode = 10,         # minimum total weight needed in each node
      cv.folds = 3,                # do 3-fold cross-validation
      keep.data=TRUE,              # keep a copy of the dataset with the object
      verbose=FALSE)               # don't print out progress

gbm2


> gbm2
gbm(formula = fmla, distribution = "bernoulli", data = data,
    n.trees = 1000, interaction.depth = 3, n.minobsinnode = 10,
    shrinkage = 0.05, bag.fraction = 0.5, train.fraction = 0.5,
    cv.folds = 3, keep.data = TRUE, verbose = FALSE)
A gradient boosted model with bernoulli loss function.
1000 iterations were performed.
The best cross-validation iteration was .
The best test-set iteration was .
Error in 1:n.trees : argument of length 0


My question is: can binarizing the response have so much of an effect that
gbm finds nothing useful in the predictors?
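For reference, here is a minimal variant I would try next, on the assumption
(unconfirmed on my end) that gbm's bernoulli distribution wants a plain
numeric 0/1 response rather than a factor() wrapped in the formula. The name
gbm3 is just illustrative; everything else reuses the objects built above.

library(gbm)
# Sketch, assuming bernoulli expects a numeric 0/1 response:
# Y in 'data' is already 0/1 after the edits above, so drop the
# factor() wrapper and pass Y directly in the formula.
gbm3 <- gbm(Y ~ X1 + X2 + X3 + X4 + X5 + X6,
            data = data,
            distribution = "bernoulli",
            n.trees = 1000,
            shrinkage = 0.05,
            interaction.depth = 3,
            bag.fraction = 0.5,
            train.fraction = 0.5,
            n.minobsinnode = 10,
            cv.folds = 3,
            keep.data = TRUE,
            verbose = FALSE)
gbm3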

Thanks

-- 
-------------
Mary Kindall
Yorktown Heights, NY
USA
