Hello, I'm seeking ideas on how to remove outliers from a predictor variable that is not normally distributed. We wish to reset points deemed outliers to a truncated, less extreme value. (I've seen many posts requesting outlier removal methods. Most of the replies seem to center around "why do you want to remove them", "you shouldn't remove them", "it depends", etc., so I've added a lot of notes below in an attempt to answer these questions in advance.)
Currently we Winsorize, using the quantile function to get the new low and high values that the outliers at each end are reset to (this is summarized legacy code that I am revisiting):

    # Get the truncated values for resetting:
    lowq <- quantile(dat, probs = perc_low, na.rm = TRUE)
    hiq <- quantile(dat, probs = perc_hi, na.rm = TRUE)

    # Reset the lowest and highest values to the truncated values:
    dat[lowq > dat] <- lowq
    dat[hiq < dat] <- hiq

What I don't like about this is that it always truncates values (whether they truly are outliers or not), and the perc_low and perc_hi settings are arbitrary. I'd like to be more intelligent about it.

Notes:

1) Ranking has already been explored and is not an option at this time.

2) Reminder: these factors are almost always distributed non-normally.

3) For reasons I won't get into here, I have to do this programmatically. I can't manually inspect the data each time I remove outliers.

4) I will be removing outliers from candidate predictor variables. The predictor variable distributions all look very different from each other, so I can't make any generalizations about them.

5) As #4 above indicates, I am building and testing predictor variables for use in a regression model.

6) The predictor variable outliers are usually somewhat informative, but their "extremeness" is a result of the predictor variable calculation. I think "extremeness" takes away from the information that would otherwise be available (outlier effect). So I want to remove some, but not all, of their "extremeness". For example, take the percent change of a small number: from, say, 0.001 to 500. Yes, we want to know that it changed a lot, but 49,999,900% is not helpful and masks otherwise useful information.

I'd like to hear your ideas. Thanks in advance!
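For anyone who wants to reproduce the behavior I'm describing, here is a minimal, self-contained sketch of the legacy steps wrapped in a function. The function name winsorize, the 5%/95% defaults, and the test vector are assumptions for illustration only, not our production settings. It uses pmin/pmax rather than indexed assignment, which should give the same result.

```r
# Hypothetical wrapper around the legacy steps; name and defaults are assumptions.
winsorize <- function(dat, perc_low = 0.05, perc_hi = 0.95) {
  lowq <- quantile(dat, probs = perc_low, na.rm = TRUE, names = FALSE)
  hiq  <- quantile(dat, probs = perc_hi,  na.rm = TRUE, names = FALSE)
  # Clamp every value into [lowq, hiq]; NA values stay NA.
  pmin(pmax(dat, lowq), hiq)
}

# Evenly spaced data with no outliers at all:
clean <- 1:100
w <- winsorize(clean)

# Count how many values were altered even though nothing is extreme.
n_changed <- sum(w != clean)
print(n_changed)
```

With these assumed settings the five smallest and five largest values are reset even though none of them is an outlier, which is exactly the problem with fixed percentile cutoffs.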
Regards,
Ben