Re: [R] Identifying outliers in non-normally distributed data

Kevin Wright Thu, 07 Jan 2010 12:51:53 -0800

Here is a simple function I use.  It flags values outside
Median +/- 5.2 * MAD, with the MAD unscaled (constant = 1).  If I
recall, this flags about 1/2000 of values from a true Normal distribution.


is.outlier <- function(x) {
    # See: Davies, P.L. and Gather, U. (1993).
    # "The identification of multiple outliers" (with discussion)
    # J. Amer. Statist. Assoc., 88, 782-801.

    # Ignore NAs when computing the limits, but keep them in x so the
    # result has the same length as the input (NA stays NA).
    lims <- median(x, na.rm = TRUE) +
        c(-1, 1) * 5.2 * mad(x, constant = 1, na.rm = TRUE)
    x < lims[1] | x > lims[2]
}

Maybe the function should be called "is.patentable".  I definitely agree
with Bert's comments.
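
As a rough check of the 1/2000 figure, the expected flag rate under a
Normal distribution can be computed directly.  This is only a sketch: it
assumes the population MAD (constant = 1) of a standard Normal, which is
qnorm(0.75), and the variable names are made up for illustration.

```r
# Population MAD of a standard Normal with constant = 1 (about 0.6745 SD).
mad_normal <- qnorm(0.75)

# The 5.2 MAD cutoff expressed in SD units (about 3.5 SD).
cutoff_sd <- 5.2 * mad_normal

# Two-sided tail probability beyond that cutoff.
flag_rate <- 2 * pnorm(-cutoff_sd)

# Roughly "one in how many" values would be flagged.
one_in <- 1 / flag_rate
```

The result should land close to the 1/2000 quoted above.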

Kevin Wright



On Wed, Dec 30, 2009 at 11:47 AM, Jerry Floren <jerry.flo...@state.mn.us> wrote:

>
> Greetings:
>
> I could also use guidance on this topic. I provide manure sample
> proficiency sets to agricultural labs in the United States and Canada.
> There are about 65 labs in the program.
>
> My data sets are much smaller and typically non-symmetrical with obvious
> outliers. Usually, there are 30 to 60 sets of data, each with triple
> replicates (90 to 180 observations).
>
> There are definitely outliers caused by the following: reporting in the
> wrong units, sending in the wrong spreadsheet, entering data in the wrong
> row, misplacing decimal points, calculation errors, etc. For each analysis,
> it is common that two to three labs make these types of errors.
>
> Since there are replicates, errors like misplaced decimal points are more
> obvious. However, most of the outlier errors are repeated for all three
> replicates.
>
> I use the median and Median Absolute Deviation (MAD, constant = 1) to flag
> labs for accuracy. Labs where the average of their three reps deviates more
> than 2.5 MAD values from the median are flagged for accuracy. With this
> method, it is not necessary to identify the outliers.
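
The flagging rule described above might be sketched as follows.  The data
are made up, and I am assuming the median and MAD are taken over the lab
means (the message does not say whether they come from the lab means or
from the individual replicates).

```r
# Hypothetical data: 6 labs, 3 replicates each (values are invented).
results <- data.frame(
  lab   = rep(c("A", "B", "C", "D", "E", "F"), each = 3),
  value = c(10.1, 10.3,  9.9,
            10.4, 10.6, 10.5,
             9.6,  9.8,  9.7,
            10.0, 10.2, 10.1,
            101,  103,  102,    # lab E: decimal point misplaced in all reps
            10.2, 10.0, 10.3)
)

# Average the three replicates for each lab.
lab_means <- sapply(split(results$value, results$lab), mean)

# Robust center and spread of the lab means (assumption: computed on means).
center <- median(lab_means)
spread <- mad(lab_means, constant = 1)

# Flag labs whose mean is more than 2.5 MAD from the median.
flagged <- abs(lab_means - center) > 2.5 * spread
flagged_labs <- names(lab_means)[flagged]
```

Because the median and MAD barely move when one lab is far off, the bad
lab does not need to be removed before the others are assessed.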
>
> A colleague suggested running the data twice. On the first run, outliers
> more than 4.0 MAD units from the median are removed. On the second run,
> values exceeding 2.9 times the MAD are flagged for accuracy. I tried this
> in R with a normally distributed data set of 100,000 values, and the 4.0
> MAD values were nearly identical to the outliers identified with boxplot.
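
The agreement with boxplot is expected for Normal data.  With constant = 1
the population MAD equals the upper quartile q, and the boxplot fence is
q + 1.5 * IQR = q + 1.5 * 2q = 4q, i.e. 4.0 MAD.  A small sketch of that
check (the seed, sample and variable names are my own choices):

```r
# Population values for a standard Normal.
q75     <- qnorm(0.75)
mad_cut <- 4.0 * q75               # 4.0 MAD with constant = 1
box_cut <- q75 + 1.5 * (2 * q75)   # upper boxplot fence: Q3 + 1.5 * IQR

# Simulated comparison on 100,000 Normal values.
set.seed(1)
x <- rnorm(100000)
by_mad <- abs(x - median(x)) > 4.0 * mad(x, constant = 1)
by_box <- x %in% boxplot.stats(x)$out

# Share of values on which the two rules agree.
agreement <- mean(by_mad == by_box)
```

For skewed data the two rules need not agree, since the boxplot fences
are set separately on each side while the MAD rule is symmetric.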
>
> With my data set, the flags do not change very much if the data is run one
> time with the flags set at 2.5 MAD units compared to running the data twice
> and removing the 4.0 MAD outliers and flagging the second set at 2.9 MAD
> units. Using either one of these methods might work for you, but I am not
> sure of the statistical value of these methods.
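
To compare the two approaches on your own data, something along these
lines may help.  It is a sketch under my own assumptions: the function
name and arguments are hypothetical, the lab means are invented, and
values removed in the first pass are still compared against the
second-pass limits (so they end up flagged too).

```r
# Two-pass flagging: trim at `trim` MADs, then flag at `cut` MADs
# using the median and MAD of the trimmed data.
two_pass_flags <- function(x, trim = 4.0, cut = 2.9) {
  keep   <- abs(x - median(x)) <= trim * mad(x, constant = 1)
  center <- median(x[keep])
  spread <- mad(x[keep], constant = 1)
  abs(x - center) > cut * spread
}

# Hypothetical lab means (one lab reported in the wrong units).
lab_means <- c(A = 10.1, B = 10.4, C = 9.8, D = 10.0,
               E = 102,  F = 10.2, G = 9.9)

# Single pass at 2.5 MAD versus two passes at 4.0 and 2.9 MAD.
single <- abs(lab_means - median(lab_means)) >
  2.5 * mad(lab_means, constant = 1)
double <- two_pass_flags(lab_means)

flagged_single <- names(lab_means)[single]
flagged_double <- names(lab_means)[double]
```

Tabulating `single` against `double` with table() on real data shows
how often the two schemes disagree.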
>
> Yours,
>
> Jerry Floren
>
>
>
> Brian G. Peterson wrote:
> >
> > John wrote:
> >> Hello,
> >>
> >> I've been searching for a method for identifying outliers for quite some
> >> time now. The complication is that I cannot assume that my data is
> >> normally distributed nor symmetrical (i.e. some distributions might
> >> have one longer tail) so I have not been able to find any good tests.
> >> Walsh's Test (http://www.statistics4u.info/
> >> fundsta...liertest.html#), as I understand it, assumes that the data is
> >> symmetrical, for example.
> >>
> >> Also, while I've found some interesting articles:
> >> http://tinyurl.com/yc7w4oq ("Missing Values, Outliers, Robust
> >> Statistics & Non-parametric Methods")
> >> I don't really know what to use.
> >>
> >> Any ideas? Any R packages available for this? Thanks!
> >>
> >> PS. My data has thousands of observations.
> >
> > Take a look at package 'robustbase'; it provides most of the standard
> > robust measures and calculations.
> >
> > While you didn't say what kind of data you're trying to identify outliers
> > in, if it is time series data the function Return.clean in
> > PerformanceAnalytics may be useful.
> >
> > Regards,
> >
> >    - Brian
> >
> >
> > --
> > Brian G. Peterson
> > http://braverock.com/brian/
> > Ph: 773-459-4973
> > IM: bgpbraverock
> >
> > ______________________________________________
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> >
>



-- 
Kevin Wright