Hello,

I'm seeking ideas on how to handle outliers in a non-normally distributed
predictor variable. Rather than deleting them, we wish to reset points
deemed outliers to a truncated value that is less extreme. (I've seen many
posts requesting outlier removal systems. Most of the replies seem to center
on "why do you want to remove them", "you shouldn't remove them", "it
depends", etc., so I've added a lot of notes below in an attempt to answer
these questions in advance.)

Currently we Winsorize: we use the quantile function to get a low and a high
cut-off value, then set every point beyond a cut-off to that cut-off (this
is summarized legacy code that I am revisiting):

# Get the truncation bounds (perc_low and perc_hi are probabilities in [0, 1]):
lowq <- quantile(dat, probs = perc_low, na.rm = TRUE)
hiq <- quantile(dat, probs = perc_hi, na.rm = TRUE)

# Reset values beyond the bounds to the bounds; which() skips NAs:
dat[which(dat < lowq)] <- lowq
dat[which(dat > hiq)] <- hiq
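
For anyone who wants to run this, here is a minimal self-contained sketch of
the same step. The simulated data, the 1%/99% cut-offs, and the winsorize()
helper name are made up for illustration; they are not our production
settings.

```r
# Hypothetical, self-contained version of the legacy step above.
set.seed(42)                          # make the simulated data reproducible
dat <- c(rlnorm(1000), 5e4, NA)       # skewed data, one extreme value, one NA

perc_low <- 0.01                      # assumed cut-offs, for illustration only
perc_hi  <- 0.99

winsorize <- function(x, p_low, p_hi) {
  # Both bounds are computed once, before any value is changed.
  bounds <- quantile(x, probs = c(p_low, p_hi), na.rm = TRUE, names = FALSE)
  # pmax/pmin clamp element-wise; NAs in x stay NA.
  pmin(pmax(x, bounds[1]), bounds[2])
}

dat_w <- winsorize(dat, perc_low, perc_hi)

range(dat, na.rm = TRUE)              # includes the extreme value
range(dat_w, na.rm = TRUE)            # clamped to the two quantiles
```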

What I don't like about this is that it always truncates a fixed fraction of
the values (whether they truly are outliers or not) and that the perc_low
and perc_hi settings are arbitrary. I'd like to be more intelligent about it.
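
To illustrate the first complaint, here is a small sketch on simulated data
that has no extreme values at all. The uniform data and the 1%/99% cut-offs
are assumptions chosen only to make the point.

```r
# Quantile-based clipping changes values even when nothing is extreme.
set.seed(1)
clean <- runif(1000)                  # bounded data, no outliers by design

cutoffs <- quantile(clean, probs = c(0.01, 0.99), names = FALSE)
clipped <- pmin(pmax(clean, cutoffs[1]), cutoffs[2])

sum(clipped != clean)                 # roughly 2% of the points are reset
```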

Notes:
1) Ranking has already been explored and is not an option at this time.
2) Reminder: these factors are almost always distributed non-normally.
3) For reasons I won't get into here, I have to do this programmatically. I
can't manually inspect the data each time I remove outliers.
4) I will be removing outliers from candidate predictor variables.
Predictor variable distributions all look very different from each other,
so I can't make any generalizations about them.
5) As #4 above indicates, I am building and testing predictor variables for
use in a regression model.
6) The predictor variable outliers are usually somewhat informative, but
their "extremeness" is a result of the predictor variable calculation. I
think the "extremeness" takes away from the information that would otherwise
be available (outlier effect). So I want to remove some, but not all, of
their "extremeness". For example, take the percent change of a small number:
from, say, 0.001 to 500. Yes, we want to know that it changed a lot, but
49,999,900% is not helpful and masks otherwise useful information.
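
To make the arithmetic in note 6 concrete, here is a minimal sketch.
pct_change() is a made-up helper for illustration, not our actual predictor
calculation, and the other sample values are invented.

```r
# Percent change blows up when the starting value is close to zero.
pct_change <- function(old, new) (new - old) / old * 100

extreme  <- pct_change(0.001, 500)    # about 49,999,900 (%)
ordinary <- pct_change(100, 150)      # 50 (%)

# One such value dominates the scale of the whole predictor.
x <- c(-12, 4, ordinary, extreme)
format(x, big.mark = ",", scientific = FALSE)
```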

I'd like to hear your ideas. Thanks in advance!

Regards,

Ben


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
