On 12-Aug-2014 22:22:13 Ted Harding wrote: > On 12-Aug-2014 21:41:52 Rolf Turner wrote: >> On 13/08/14 07:57, Ron Michael wrote: >>> Hi, >>> >>> I would need to get a clarification on a quite fundamental statistics >>> property, hope expeRts here would not mind if I post that here. >>> >>> I leant that variance-covariance matrix of the standardized data is >>> equal to the correlation matrix for the unstandardized data. So I >>> used following data. >> >> <SNIP> >> >>> (t(Data_Normalized) %*% Data_Normalized)/dim(Data_Normalized)[1] >>> >>> Point is that I am not getting exact CORR matrix. Can somebody point >>> me what I am missing here? >> >> You are using a denominator of "n" in calculating your "covariance" >> matrix for your normalized data. But these data were normalized using >> the sd() function which (correctly) uses a denominator of n-1 so as to >> obtain an unbiased estimator of the population standard deviation. >> >> If you calculated >> >> (t(Data_Normalized) %*% Data_Normalized)/(dim(Data_Normalized)[1]-1) >> >> then you would get the same result as you get from cor(Data) (to within >> about 1e-15). >> >> cheers, >> Rolf Turner > > One could argue about "(correctly)"! > >>From the "descriptive statistics" point of view, if one is given a single > number x, then this dataset has no variation, so one could say that > sd(x) = 0. And this is what one would get with a denominator of "n". > > But if the single value x is viewed as sampled from a distribution > (with positive dispersion), then the value of x gives no information > about the SD of the distribution. If you use denominator (n-1) then > sd(x) = NA, i.e. is indeterminate (as it should be in this application). > > The important thing when using pre-programmed functions is to know > which is being used. R uses (n-1), and this can be found from > looking at > > ?sd > > or (with more detail) at > > ?cor > > Ron had assumed that the denominator was n, apparently not being aware > that R uses (n-1). > > Just a few thoughts ... > Ted. > ------------------------------------------------- > E-Mail: (Ted Harding) <ted.hard...@wlandres.net> > Date: 12-Aug-2014 Time: 23:22:09
After some hesitation (not wanting to prolong a thread whose original query has been effectively settled), nevertheless I think it may be worth stating the general picture, for the sake of users who might be confused or misled regarding the use of the functions var(), cov(), cor(), sd() etc. The distinction to become aware of is between "summary" statistics and statistics which will be used for estimation/inference. Given a set of numbers, x[1], x[2], ... , x[N], they have a mean MEAN = sum(x)/N and their variance VAR is the mean of the deviations between the x[i] and MEAN: VAR = (sum(x - MEAN)^2)/N If a random value X is drawn from {x[1], x[2], ... , x[N]}, with uniform probability, then the expectation of X is again E(X) = MEAN and the variance of X is again Var(X) = E((X - MEAN)^2) = VAR with MEAN and VAR as given above. And the R function mean(x) will return MEAN is its value. However, the R function var(x) will not return VAR -- it will return (N/(N-1))*VAR The above definitions of MEAN and VAR use division by N, i.e. "denominator = N". But the R functions var(x), sd(x) etc. divide by (N-1), i.e. use "denominator = N-1", as explained in ?var ?sd etc. The basic reason for this is that, given a random sample X[1], ... ,X[n] of size n from {x[1], x[2], ... , x[N]}, the expectation of sum((X - mean(X))^2) is (n-1)*VAR, so to obtain an unbiased estimate of VAR from the X sample one must use the "bias-corrected" sample variance var(X) = sum((X - mean(X))^2)/(n-1) i.e. "denominator = (n-1)" as described in the help pages. So the function var(), with denominator = (n-1), is "correct" for obtaining an unbiased estimator. But it will not be correct for the variance sum((X - mean(X))^2)/n of the numbers X[1], ... ,X[n]. Since sd() also uses denominator = (n-1), sd(X) is the square root of var(X). But, while var(X) is unbiased for VAR, sd(X) is not unbiased for SD = sqrt(VAR), since, for a positive ransom variable Y, in general E(sqrt(Y)) < sqrt(E(Y)) i.e. E(sd(X)) = E(sqrt(var(X))) < sqrt(E(var(X))) = sqrt(VAR) = SD. The R functions var() etc., which use denominator = (n-1), do not have an option which allows the user to choose denominator = n. Therefore, in particular, a user who has a set of numbers {x[1], x[2], ... , x[N]} (e.g. a record of a popuation) and wants the SD of these (e.g. for use in summary statistics Mean and SD), could inadvertently use R's sd(x), expecting SD, without being alerted to the fact that it will give the wrong answer. And the only way round it is to explicitly write one's own correction, e.g. SD <- function(x){n<-length(x); sd(x)*sqrt((n-1)/n)} Indeed, this topic has got me wondering how many times I may have blindly used sd(x) in the past, as if it were going to give me the standard (sum(x - mean(x))^2)/length(x) result! As I wrote earlier, when there is more than one definition which might be used, the important thing is to know which one is being used, and to correct accordingly if it is not the one you want. Best wishes to all, Ted. ------------------------------------------------- E-Mail: (Ted Harding) <ted.hard...@wlandres.net> Date: 13-Aug-2014 Time: 19:49:02 This message was sent by XFMail ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.