Dear all,

The other day I was reading a post [1] that slightly surprised me: "To reject the null of no correlation, a hypothesis test based on the normal distribution [is used]. If normality is not the base assumption you're working from, then p-values, significance tests and confidence intervals don't mean much (the value of the coefficient is not reliable)" (BOB SAMOHYL).
To me this implied that, in practice, Pearson's product-moment correlation (and its associated significance test) is often used incorrectly. I then went wrestling with the literature, and with my friends, over what the Pearson correlation actually assumes, and after about a week I am still banging my head against divergent opinions.

From what I understand, there are two aspects to this classical parametric procedure:

1. Estimating the magnitude of the correlation:
- the sample data should come from a bivariate normal distribution (?cor, ?cor.test, Dalgaard 2003; somewhat implied in many examples such as ?rrcov::maryo or Wilcox 2005)
- the sample data should be (I presume univariately) normal (Crawley 2007)
- the sample data can come from any distribution (if I understand correctly the 'distribution-free' definition of correlation in Huber 1981, 2004)
- the sample data can come from just about any bivariate distribution (Wikipedia [2][3] and the references cited there)
- the coefficient is not at all robust to univariate outliers (e.g. Huber 1981), nor to multivariate outliers (?rrcov::maryo, with data from Maronna and Yohai 1998)

2. Assessing whether the correlation is significantly different from zero (using a statistic that follows the t distribution):
- the data should come from independent normal distributions (?cor.test)
- at least one of the marginal distributions should be normal (Wilcox 2005)

Surprisingly (to me), many sources seem quite evasive about stating clearly what the Pearson correlation assumes. The literature did convince me that the correlation coefficient is not robust to outliers, and that contaminated-normal and heavy-tailed distributions can invalidate the results of parametric tests. What I am still not clear about are the distributional assumptions on the data:

- does the data have to be bivariate normal in order to correctly estimate the linear correlation?
- does the data have to be univariate normal in order to correctly estimate the significance of the correlation?

If so, what are the preferable alternatives for non-Gaussian data (including heavy-tailed and contaminated-normal distributions)? Non-parametric tests (Spearman, Kendall)? The robust estimators MASS::cov.mcd, rrcov::CovOgk, robust::covRob()? Hypothesis testing via permutation tests [4]? Is there a robust cor.test, or some other robust test of independence? A few minimal sketches of what I have in mind follow the references below.

Thank you,
Liviu

[1] http://www.nabble.com/Correlation-on-Tick-Data-tp18589474p18595197.html
[2] http://en.wikipedia.org/wiki/Correlation#Sensitivity_to_the_data_distribution
[3] http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Sensitivity_to_the_data_distribution
[4] http://www.burns-stat.com/pages/Tutor/bootstrap_resampling.html#permtest
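To make the outlier point concrete, here is a toy sketch on simulated data (not my real data); I hope I have the mechanics right:

## One bivariate outlier added to independent data is enough to inflate
## the Pearson coefficient; the rank-based estimates move far less.
set.seed(1)
x <- rnorm(50)
y <- rnorm(50)                  # independent, so the true correlation is 0
cor(x, y)                       # Pearson estimate, close to 0
cor.test(x, y)                  # t-based test, typically non-significant here

x[51] <- 10; y[51] <- 10        # add a single bivariate outlier
cor(x, y)                       # Pearson estimate is badly inflated
cor(x, y, method = "spearman")  # rank-based alternatives stay near 0
cor(x, y, method = "kendall")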
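For the robust route, this is roughly what I have been trying, using MASS::cov.rob with the MCD option (I assume rrcov::CovOgk or robust::covRob would be plugged in the same way):

## Correlation derived from a robust covariance estimate (minimum
## covariance determinant), compared with the classical estimate.
library(MASS)
set.seed(1)
X <- cbind(x = rnorm(50), y = rnorm(50))
X <- rbind(X, c(10, 10))            # one bivariate outlier
cov2cor(cov(X))                     # classical correlation: badly distorted
rob <- cov.rob(X, method = "mcd")   # robust MCD covariance estimate
cov2cor(rob$cov)                    # robust correlation: close to 0 again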
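And this is my understanding of the permutation approach in [4], applied to the correlation coefficient; please correct me if the logic is wrong:

## Permutation test of independence: the reference distribution of r is
## built by reshuffling y, so no normality assumption is needed.
set.seed(2)
x <- rexp(40)                        # deliberately non-normal toy data
y <- rexp(40)
r.obs  <- cor(x, y)
r.perm <- replicate(9999, cor(x, sample(y)))
(1 + sum(abs(r.perm) >= abs(r.obs))) / (1 + length(r.perm))   # two-sided p-value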