This reproducible example is from the help page of 'gbm' in R.
I ran the following code in R, and it works fine as long as the response is
numeric. The problem starts when I convert the response from numeric to
binary (0/1): the same model then fails with an error.
My question is: can converting the response from numeric to binary really
have this much effect?
Help page code:
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]
SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)
# introduce some missing values
X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
# fit initial model
gbm1 <-
gbm(Y~X1+X2+X3+X4+X5+X6, # formula
data=data, # dataset
var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
# +1: monotone increase,
# 0: no monotone restrictions
distribution="gaussian", # see the help for other choices
n.trees=1000, # number of trees
shrinkage=0.05, # shrinkage or learning rate,
# 0.001 to 0.1 usually work
interaction.depth=3, # 1: additive model, 2: two-way interactions, etc.
bag.fraction = 0.5, # subsampling fraction, 0.5 is probably best
train.fraction = 0.5, # fraction of data for training,
# first train.fraction*N used for training
n.minobsinnode = 10, # minimum total weight needed in each node
cv.folds = 3, # do 3-fold cross-validation
keep.data=TRUE, # keep a copy of the dataset with the object
verbose=FALSE) # don't print out progress
gbm1
summary(gbm1)
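With the numeric response, the follow-up steps from the help page also run
without trouble. For example (a minimal check; gbm.perf() is the standard
way to pick the best iteration):
# check performance using cross-validation
best.iter <- gbm.perf(gbm1, method="cv")
print(best.iter)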
Now I slightly change the response variable to make it binary.
Y[Y < mean(Y)] = 0 #My edit
Y[Y >= mean(Y)] = 1 #My edit
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
fmla = as.formula(factor(Y)~X1+X2+X3+X4+X5+X6) #My edit
gbm2 <-
gbm(fmla, # formula
data=data, # dataset
distribution="bernoulli", # My edit
n.trees=1000, # number of trees
shrinkage=0.05, # shrinkage or learning rate,
# 0.001 to 0.1 usually work
interaction.depth=3, # 1: additive model, 2: two-way interactions, etc.
bag.fraction = 0.5, # subsampling fraction, 0.5 is probably best
train.fraction = 0.5, # fraction of data for training,
# first train.fraction*N used for training
n.minobsinnode = 10, # minimum total weight needed in each node
cv.folds = 3, # do 3-fold cross-validation
keep.data=TRUE, # keep a copy of the dataset with the object
verbose=FALSE) # don't print out progress
gbm2
Printing the fitted object is where the error shows up:
> gbm2
gbm(formula = fmla, distribution = "bernoulli", data = data,
n.trees = 1000, interaction.depth = 3, n.minobsinnode = 10,
shrinkage = 0.05, bag.fraction = 0.5, train.fraction = 0.5,
cv.folds = 3, keep.data = TRUE, verbose = FALSE)
A gradient boosted model with bernoulli loss function.
1000 iterations were performed.
The best cross-validation iteration was .
The best test-set iteration was .
Error in 1:n.trees : argument of length 0
My question is: can binarizing the response have so much effect that gbm no
longer finds anything useful in the predictors?
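Re-reading ?gbm, distribution="bernoulli" seems to expect a numeric 0/1
response rather than a factor, so I suspect the factor(Y) wrapper in my
formula is the culprit. This is the untested variant I would try next (a
sketch only: it redoes the binarization in one step from the original
numeric Y, and drops factor() from the formula):
library(gbm)
Y <- as.integer(Y >= mean(Y)) # numeric 0/1, one-step version of my two edits
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
gbm2 <-
gbm(Y~X1+X2+X3+X4+X5+X6, # plain Y, no factor()
data=data,
distribution="bernoulli",
n.trees=1000,
shrinkage=0.05,
interaction.depth=3,
bag.fraction = 0.5,
train.fraction = 0.5,
n.minobsinnode = 10,
cv.folds = 3,
keep.data=TRUE,
verbose=FALSE)
best.iter <- gbm.perf(gbm2, method="cv") # best iteration by CV
summary(gbm2, n.trees=best.iter)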
Thanks
--
-------------
Mary Kindall
Yorktown Heights, NY
USA