Here is a simple function I use. It uses Median +/- 5.2 * MAD. If I recall, this flags about 1/2000 of values from a true Normal distribution.
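
The 1/2000 figure can be checked against the Normal distribution. With constant = 1, the MAD of Normal data estimates qnorm(0.75) * sigma (about 0.6745 sigma), so 5.2 MAD corresponds to roughly 3.5 sigma. A minimal sketch of that arithmetic, where the names k, z and p are only illustrative:

```r
# Assumes exactly Normal data and the population value of the MAD.
k <- 5.2                    # multiplier used in the rule
z <- k * qnorm(0.75)        # cutoff expressed in standard-deviation units
p <- 2 * pnorm(-z)          # two-sided probability of falling outside the limits
c(sigma_units = z, prob = p, one_in = 1 / p)
```

This should give a cutoff near 3.5 standard deviations and a rate of roughly 1 in 2,200, in line with the 1/2000 recalled above.
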
is.outlier = function (x) {
  # See: Davies, P.L. and Gather, U. (1993).
  # "The identification of multiple outliers" (with discussion)
  # J. Amer. Statist. Assoc., 88, 782-801.

  # Note: NAs are dropped here, so the result is shorter than the
  # input whenever x contains missing values.
  x <- na.omit(x)
  # Limits are median +/- 5.2 * MAD (unscaled MAD, constant = 1)
  lims <- median(x) + c(-1, 1) * 5.2 * mad(x, constant = 1)
  # TRUE for values outside the limits
  x < lims[1] | x > lims[2]
}

Maybe the function should be called "is.patentable".

I definitely agree with Bert's comments.

Kevin Wright

On Wed, Dec 30, 2009 at 11:47 AM, Jerry Floren <jerry.flo...@state.mn.us> wrote:

>
> Greetings:
>
> I could also use guidance on this topic. I provide manure sample
> proficiency sets to agricultural labs in the United States and Canada.
> There are about 65 labs in the program.
>
> My data sets are much smaller and typically non-symmetrical with obvious
> outliers. Usually, there are 30 to 60 sets of data, each with triple
> replicates (90 to 180 observations).
>
> There are definitely outliers caused by the following: reporting in the
> wrong units, sending in the wrong spreadsheet, entering data in the wrong
> row, misplacing decimal points, calculation errors, etc. For each
> analysis, it is common that two to three labs make these types of errors.
>
> Since there are replicates, errors like misplaced decimal points are more
> obvious. However, most of the outlier errors are repeated for all three
> replicates.
>
> I use the median and Median Absolute Deviation (MAD, constant = 1) to flag
> labs for accuracy. Labs where the average of their three reps deviates
> more than 2.5 MAD values from the median are flagged for accuracy. With
> this method, it is not necessary to identify the outliers.
>
> A colleague suggested running the data twice. On the first run, outliers
> more than 4.0 MAD units from the median are removed. On the second run,
> values exceeding 2.9 times the MAD are flagged for accuracy. I tried this
> in R with a normally distributed data set of 100,000, and the 4.0 MAD
> values were nearly identical to the outliers identified with boxplot.
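
The agreement with boxplot is what one would expect for Normal data. With constant = 1, 4.0 MAD is 4 * qnorm(0.75), about 2.70 sigma from the median. The boxplot whiskers sit 1.5 * IQR beyond the quartiles, which is qnorm(0.75) + 1.5 * 2 * qnorm(0.75) = 4 * qnorm(0.75) from the centre as well. A small simulation sketch of the comparison, where the seed, the sample size and the names mad_flag and box_flag are arbitrary choices for illustration:

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
x <- rnorm(100000)

# Rule 1: more than 4.0 MAD (constant = 1) from the median
mad_flag <- abs(x - median(x)) > 4.0 * mad(x, constant = 1)

# Rule 2: points boxplot() would draw as outliers
# (beyond 1.5 * IQR from the hinges, the default coef)
box_flag <- x %in% boxplot.stats(x)$out

# Cross-tabulate the two rules; off-diagonal cells are disagreements
table(mad = mad_flag, boxplot = box_flag)
```

The two rules can still disagree on a few points near the cutoffs, because the sample quartiles are not exactly symmetric about the sample median.
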
>
> With my data set, the flags do not change very much if the data is run
> one time with the flags set at 2.5 MAD units compared to running the data
> twice and removing the 4.0 MAD outliers and flagging the second set at
> 2.9 MAD units. Using either one of these methods might work for you, but
> I am not sure of the statistical value of these methods.
>
> Yours,
>
> Jerry Floren
>
>
>
> Brian G. Peterson wrote:
> >
> > John wrote:
> >> Hello,
> >>
> >> I've been searching for a method for identify outliers for quite some
> >> time now. The complication is that I cannot assume that my data is
> >> normally distributed nor symmetrical (i.e. some distributions might
> >> have one longer tail) so I have not been able to find any good tests.
> >> The Walsh's Test (http://www.statistics4u.info/
> >> fundsta...liertest.html#), as I understand assumes that the data is
> >> symmetrical for example.
> >>
> >> Also, while I've found some interesting articles:
> >> http://tinyurl.com/yc7w4oq ("Missing Values, Outliers, Robust
> >> Statistics & Non-parametric Methods")
> >> I don't really know what to use.
> >>
> >> Any ideas? Any R packages available for this? Thanks!
> >>
> >> PS. My data has 1000's of observations..
> >
> > Take a look at package 'robustbase', it provides most of the standard
> > robust measures and calculations.
> >
> > While you didn't say what kind of data you're trying to identify
> > outliers in, if it is time series data the function Return.clean in
> > PerformanceAnalytics may be useful.
> >
> > Regards,
> >
> > - Brian
> >
> >
> > --
> > Brian G. Peterson
> > http://braverock.com/brian/
> > Ph: 773-459-4973
> > IM: bgpbraverock
> >
> > ______________________________________________
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
> --
> View this message in context:
> http://n4.nabble.com/Identifying-outliers-in-non-normally-distributed-data-tp987921p991062.html
> Sent from the R help mailing list archive at Nabble.com.
>

--
Kevin Wright