(Sorry for the repost; this time in plain text.)

I recently used gbm for a binary classification problem. As expected, it gives 
very good results, measured by area under the ROC curve with 7-fold 
cross-validation. However, the application (malware detection) is 
cost-sensitive: a FP (classifying a clean sample as dirty) is much worse than 
a FN (missing a dirty sample). I would like to bias the gbm model toward a very 
low FP rate. The metric I use is the area under the ROC curve up to a 1% FP 
rate; the higher the better.
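
To make the metric concrete, here is a minimal base-R sketch of the area under the ROC curve restricted to FP rate <= 1%. The helper name partial_auc and its arguments are hypothetical (not part of gbm), and it assumes labels are coded 1 = dirty, 0 = clean, with higher scores meaning "more likely dirty". It returns the raw, unnormalized area, so a perfect ranking scores max_fpr.

```r
# Hypothetical helper: area under the ROC curve for FPR in [0, max_fpr].
# Assumes labels are 0/1 (1 = dirty) and higher scores mean "more likely dirty".
partial_auc <- function(scores, labels, max_fpr = 0.01) {
  stopifnot(length(scores) == length(labels), all(labels %in% c(0, 1)))
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  stopifnot(n_pos > 0, n_neg > 0)

  # Sort by decreasing score and accumulate true/false positives.
  ord <- order(scores, decreasing = TRUE)
  tp <- cumsum(labels[ord] == 1)
  fp <- cumsum(labels[ord] == 0)

  # Keep one ROC point per distinct score (the last of each tie group).
  last <- !duplicated(scores[ord], fromLast = TRUE)
  tpr <- c(0, tp[last] / n_pos)
  fpr <- c(0, fp[last] / n_neg)

  # Cut the curve at max_fpr, interpolating the TPR at the cut point.
  keep <- fpr <= max_fpr
  tpr_cut <- approx(fpr, tpr, xout = max_fpr, ties = max)$y
  x <- c(fpr[keep], max_fpr)
  y <- c(tpr[keep], tpr_cut)

  # Trapezoidal rule over the retained segment.
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}
```

With predictions from a fitted model, this would be called as partial_auc(pred, te.y, 0.01), where pred and te.y are placeholder names for held-out scores and labels.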

For this purpose, I tried both weighting and sampling strategies, but neither 
works as I expected so far. I noticed that gbm accepts a weight vector, so I 
tried overweighting the clean side (weight 10 for each clean sample and 1 for 
each dirty sample), but I don't see a big difference from the gbm model fitted 
without weights. I also tried feeding imbalanced data into gbm (a dataset with 
10 times as many clean samples as dirty ones), but that did not help either.
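
For reference, the 10:1 weighting described above can be sketched as follows. The toy labels and the 0/1 coding (1 = dirty, 0 = clean) are assumptions made for this illustration only, and CLEAN_WEIGHT is a hypothetical name.

```r
# Toy labels standing in for the real tr.y (assumed coding: 1 = dirty, 0 = clean).
tr.y <- c(0, 0, 0, 1, 0, 1)

# Weight each clean sample 10 times as heavily as a dirty one.
CLEAN_WEIGHT <- 10
tr.w <- ifelse(tr.y == 0, CLEAN_WEIGHT, 1)

# Share of the total weight carried by each class, as a sanity check.
class_share <- tapply(tr.w, tr.y, sum) / sum(tr.w)
print(class_share)
```

The resulting tr.w is what gets passed as w to gbm.fit or as weights to gbm.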

I think I am missing something here. I would very much appreciate it if anyone 
could advise me on how to implement cost-sensitive classification with gbm. 
Below is the gbm modeling script I used.

model.gbm <- gbm.fit(tr[, 1:DIM], tr.y,
                     offset = NULL,
                     misc = NULL,
                     distribution = "bernoulli",
                     w = tr.w,
                     var.monotone = NULL,
                     n.trees = NTREE,
                     interaction.depth = TREEDEPTH,
                     n.minobsinnode = 10,
                     shrinkage = 0.05,
                     bag.fraction = BAGRATIO,
                     train.fraction = 1.0,
                     keep.data = TRUE,
                     verbose = TRUE,
                     var.names = NULL,
                     response.name = NULL)

or 

model.gbm <- gbm(tr.y ~ .,
                 distribution = "bernoulli",
                 data = data.frame(cbind(tr[, 1:DIM], tr.y)),
                 weights = tr.w,
                 var.monotone = NULL,
                 n.trees = NTREE,
                 interaction.depth = TREEDEPTH,
                 n.minobsinnode = 10,
                 shrinkage = 0.05,
                 bag.fraction = 0.5,
                 train.fraction = 1.0,
                 cv.folds = 5,
                 keep.data = TRUE,
                 verbose = TRUE)


 
------------------------------------
Yuchun Tang, Ph.D.
Principal Engineer, Lead
 
McAfee, Inc.
4800 North Point Parkway
Suite 300
Alpharetta,
GA  30022
 
Main:     678.904.9153
www.mcafee.com
www.trustedsource.org

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
