Folks, I have a question about weighting in Random Forest (RF). I know that several earlier emails in this group have raised this issue, but I did not find an answer to my question.
I am working on a dataset (dataset1) of 4 million records that can be reduced to a dataset (dataset2) of approximately 1500 unique records, with frequency counts that add up to the same 4 million. Because of size issues, I cannot work with dataset1 in R, so I am working with dataset2. Each record indicates whether or not a patient chose a particular drug, together with 14 comorbidity (Yes/No) variables; I am using RF to understand the comorbidity drivers of drug adoption (Yes/No classification).

At the full dataset level (dataset1), the drug adoption incidence is ~11%. At the reduced dataset level (dataset2), the drug adoption incidence increases to ~38%.

My question is: if I am using the reduced dataset (dataset2), how should I inform RF that the adoption incidence at the full dataset level was 11%? Should that be used as a classwt prior with classwt=c(Yes=.11, No=.89)? My understanding is that RF does not allow case weighting. Or can this be handled with the sampsize argument through oversampling? What proportions should one use for this (e.g., sampsize=c(Yes=100, No=100))?

I would appreciate any feedback or pointers to any earlier thread that I may have overlooked.

Regards,
Raghu

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
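P.S. To make the discrepancy concrete, here is a minimal sketch with made-up numbers. The data frame d2 and its columns (comorb1, comorb2, adopted, freq) are hypothetical stand-ins, not my actual variables, and the toy has only four rows. It only illustrates how the incidence can differ between counting each unique record once and counting it by its frequency; it is not a proposed fix.

```r
# Toy stand-in for dataset2: one row per unique record plus its frequency.
# All names and values are invented for illustration.
d2 <- data.frame(
  comorb1 = c("No", "No", "Yes", "Yes"),
  comorb2 = c("No", "Yes", "No", "Yes"),
  adopted = c("No", "Yes", "No", "Yes"),
  freq    = c(700, 60, 190, 50)
)

# Incidence when every unique record counts once
# (what a model fitted directly to the reduced data would see).
unweighted <- mean(d2$adopted == "Yes")

# Incidence when each record counts freq times
# (what the full, unaggregated data would give).
weighted <- sum(d2$freq[d2$adopted == "Yes"]) / sum(d2$freq)

unweighted  # 0.5
weighted    # 0.11
```

In this toy, the adopters sit in low-frequency rows, so the unweighted incidence (0.5) is well above the frequency-weighted one (0.11). That is the same direction as the 38% versus 11% gap I see.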