On Thu, Dec 15, 2011 at 8:35 AM, PtitBleu <ptit_b...@yahoo.fr> wrote:
> Hello,
>
> I have two data.frames (data1 and data4), with dec="." and sep=";":
> http://r.789695.n4.nabble.com/file/n4199964/data1.txt data1.txt
> http://r.789695.n4.nabble.com/file/n4199964/data4.txt data4.txt
>
> When I do
> plot(data1$nx, data1$ny, col="red")
> points(data4$nx, data4$ny, col="blue")
> the results look very similar (at least to me), but the R-squared values from
> summary(lm(data1$ny ~ data1$nx))
> and
> summary(lm(data4$ny ~ data4$nx))
> are very different (0.48 versus 0.89).
>
> Could someone explain the reason?
>
> To be complete, I am looking for a simple indicator telling me whether it is
> worthwhile to keep the values provided by lm. I thought that R-squared could
> do the job: for me, if R-squared is far from 1, the data are not good enough
> for a linear fit. It seems that I'm wrong.
The problem is the outliers. Try using a robust measure instead. If we replace the Pearson correlations with Spearman (rank) correlations, the two R-squared values are much closer:

> # R^2 based on Pearson correlations
> cor(fitted(lm(ny ~ nx, data4)), data4$ny)^2
[1] 0.8916924
> cor(fitted(lm(ny ~ nx, data1)), data1$ny)^2
[1] 0.4868575
>
> # R^2 based on Spearman (rank) correlations
> cor(fitted(lm(ny ~ nx, data4)), data4$ny, method = "spearman")^2
[1] 0.8104026
> cor(fitted(lm(ny ~ nx, data1)), data1$ny, method = "spearman")^2
[1] 0.7266705

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
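[Since the original data files may not be available, here is a minimal self-contained sketch with simulated data (the names nx/ny merely echo the poster's; the outlier positions and magnitudes are arbitrary) showing the same effect: a few gross outliers drag the Pearson-based R^2 down while the rank-based version barely moves.]

# Simulated illustration, not the poster's data:
# a clean linear relationship plus three gross outliers.
set.seed(1)
nx <- 1:100
ny <- 2 * nx + rnorm(100, sd = 5)
ny[c(10, 50, 90)] <- ny[c(10, 50, 90)] + 300  # inject outliers

fit <- lm(ny ~ nx)

# Pearson-based R^2 (what summary(lm(...)) reports): pulled down by the outliers
summary(fit)$r.squared

# Spearman (rank) based version: only three ranks are disturbed, so it stays high
cor(fitted(fit), ny, method = "spearman")^2

[If you want the fit itself to resist the outliers, rather than just a more robust summary measure, one option is a robust regression such as MASS::rlm(ny ~ nx), which down-weights extreme residuals automatically.]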