Ralf: Don't bother testing. You will reject normality.
But don't bother paying attention to the results of the normality testing anyway -- normality testing is generally useless. (IMO -- others disagree.)

DO pay attention to the plots; I would place a modest bet that you will find that your data are not homogeneous with a strong central peak -- i.e. that they may look more uniform-ish or even exhibit 2 or more modes, indicating that you have a mixture of distributions. If true, this will have a (possibly large) effect on statistical inference... and what this would mean and what you should do depend very much on the substantive context in which you are working (about which I know zip of course).

If I'm wrong in my guesses, please reply to the list so that everyone knows (including me). Hubris begs comeuppance.

Finally, FWIW, 10000 is not considered "very large" these days; maybe 10,000,000,000 might be...

Cheers,
Bert Gunter
Genentech Nonclinical Biostatistics

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Peter Ehlers
Sent: Wednesday, June 23, 2010 11:35 AM
To: Ralf B
Cc: r-help@r-project.org
Subject: Re: [R] About normality tests...

On 2010-06-23 12:05, Ralf B wrote:
> Hi all,
>
> I have two very large samples of data (10000+ data points) and would
> like to perform normality tests on it. I know that p < .05 means that
> a data set is considered as not normal with any of the two tests. I am
> also aware that large samples tend to lead more likely to normal
> results (Andy Field, 2005).

I think that depends on what you mean by 'tend to lead ...'

>
> I have a few questions to ensure that I am using them right.
>
> 1) The Shapiro-Wilk test requires to provide mean and sd. Is it
> correct to add here the mean and sd of the data itself (since I am
> comparing to a normal distribution with the same parameters) ?
>
> mySD <- sd(mydata$myfield)
> myMean <- mean(mydata$myfield)
> shapiro.test(rnorm(100, mean = myMean, sd = mySD))

I don't think that your understanding of the S-W test is correct. You would just do:

  shapiro.test(mydata$myfield)

to test for Normality. However, shapiro.test() won't accept sample sizes greater than 5000. So use ks.test. Or use a graphical method: I like qq.plot in the 'car' package (named qqPlot in current versions).

>
> 2) If I just want to test each distribution individually, I assume
> that I am doing a one-sample Kolmogorov-Smirnov test. Is that correct?

I don't understand this. What do you mean by 'test ... individually'?

>
> 3) If I simply want to know if normality exists or not, what should I
> put for the parameter 'alternative' ? Does it actually matter?
>
> alternative = c("two.sided", "less", "greater")

Leave it at the default 'two.sided' unless you have good reason to suspect that the cdf lies above or below the Normal cdf.

 -Peter Ehlers

>
> Thank you,
> Ralf
>

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
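
For anyone following along, here is a rough sketch of the suggestions made in this thread. It is not code posted by Bert or Peter. The vector x is simulated and only a hypothetical stand-in for mydata$myfield; the two-component mixture, the seed, and the subsample size of 5000 are all assumptions chosen for illustration. It uses base R graphics (qqnorm/qqline) rather than the 'car' package so that it runs without extra dependencies.

```r
## Hypothetical stand-in for mydata$myfield: a two-component normal
## mixture with n = 10000 (an assumption made for illustration only).
set.seed(1)
x <- c(rnorm(5000, mean = 0, sd = 1), rnorm(5000, mean = 3, sd = 1))

## 1) Look at the data first: histogram and normal Q-Q plot.
op <- par(mfrow = c(1, 2))
hist(x, breaks = 50, main = "Histogram", xlab = "x")
qqnorm(x)
qqline(x)
par(op)

## 2) shapiro.test() takes the data itself, not rnorm() draws.
##    It accepts between 3 and 5000 values, so test a random subsample.
sw <- shapiro.test(sample(x, 5000))

## 3) One-sample Kolmogorov-Smirnov test against a normal distribution.
##    Caveat: ?ks.test states that the parameters should be pre-specified,
##    not estimated from the data, so treat this p-value as approximate.
ks <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

sw$p.value
ks$p.value
```

With a sample this large, both tests would be expected to reject normality for data like these, which is why the plots are the more informative part of the output.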