> On 25 Mar 2016, at 10:41 am, peter dalgaard <pda...@gmail.com> wrote:
>
> As I see it, the display showing the first p << n PCs adding up to 100% of
> the variance is plainly wrong.
>
> I suspect it comes about via a mental short-circuit: If we try to control p
> using a tolerance, then that amounts to saying that the remaining PCs are
> effectively zero-variance, but that is (usually) not the intention at all.
>
> The common case is that the remainder terms have a roughly _constant_,
> small-ish variance and are interpreted as noise. Of course the magnitude of
> the noise is important information.
>

But then you should use Factor Analysis which has that concept of “noise” (unlike PCA).

Cheers, Jari Oksanen

>> On 25 Mar 2016, at 00:02 , Steve Bronder <sbron...@stevebronder.com> wrote:
>>
>> I agree with Kasper, this is a 'big' issue. Does your method of taking only
>> n PCs reduce the load on memory?
>>
>> The new addition to the summary looks like a good idea, but Proportion of
>> Variance as you describe it may be confusing to new users. Am I correct in
>> saying Proportion of variance describes the amount of variance with respect
>> to the number of components the user chooses to show? So if I only choose
>> one I will explain 100% of the variance? I think showing 'Total Proportion
>> of Variance' is important if that is the case.
>>
>> Regards,
>>
>> Steve Bronder
>>
>> On Thu, Mar 24, 2016 at 2:58 PM, Kasper Daniel Hansen <kasperdanielhan...@gmail.com> wrote:
>>
>>> Martin, I fully agree. This becomes an issue when you have big matrices.
>>>
>>> (Note that there are awesome methods for actually only computing a small
>>> number of PCs (unlike your code which uses svd which gets all of them);
>>> these are available in various CRAN packages).
>>>
>>> Best,
>>> Kasper
>>>
>>> On Thu, Mar 24, 2016 at 1:09 PM, Martin Maechler <maech...@stat.math.ethz.ch> wrote:
>>>
>>>> Following from the R-help thread of March 22 on "Memory usage in prcomp",
>>>>
>>>> I've started looking into adding an optional 'rank.' argument
>>>> to prcomp allowing to more efficiently get only a few PCs
>>>> instead of the full p PCs, say when p = 1000 and you know you
>>>> only want 5 PCs.
>>>>
>>>> (https://stat.ethz.ch/pipermail/r-help/2016-March/437228.html)
>>>>
>>>> As it was mentioned, we already have an optional 'tol' argument
>>>> which allows *not* to choose all PCs.
>>>>
>>>> When I do that,
>>>> say
>>>>
>>>>   C <- chol(S <- toeplitz(.9 ^ (0:31))) # Cov.matrix and its root
>>>>   all.equal(S, crossprod(C))
>>>>   set.seed(17)
>>>>   X <- matrix(rnorm(32000), 1000, 32)
>>>>   Z <- X %*% C  ## ==>  cov(Z) ~=  C'C = S
>>>>   all.equal(cov(Z), S, tol = 0.08)
>>>>   pZ <- prcomp(Z, tol = 0.1)
>>>>   summary(pZ) # only ~14 PCs (out of 32)
>>>>
>>>> I get for the last line, the summary.prcomp(.) call :
>>>>
>>>> > summary(pZ) # only ~14 PCs (out of 32)
>>>> Importance of components:
>>>>                           PC1    PC2    PC3    PC4     PC5     PC6     PC7     PC8
>>>> Standard deviation     3.6415 2.7178 1.8447 1.3943 1.10207 0.90922 0.76951 0.67490
>>>> Proportion of Variance 0.4352 0.2424 0.1117 0.0638 0.03986 0.02713 0.01943 0.01495
>>>> Cumulative Proportion  0.4352 0.6775 0.7892 0.8530 0.89288 0.92001 0.93944 0.95439
>>>>                            PC9    PC10    PC11    PC12    PC13   PC14
>>>> Standard deviation     0.60833 0.51638 0.49048 0.44452 0.40326 0.3904
>>>> Proportion of Variance 0.01214 0.00875 0.00789 0.00648 0.00534 0.0050
>>>> Cumulative Proportion  0.96653 0.97528 0.98318 0.98966 0.99500 1.0000
>>>> >
>>>>
>>>> which computes the *proportions* as if there were only 14 PCs in
>>>> total (but there were 32 originally).
>>>>
>>>> I would think that the summary should or could in addition show
>>>> the usual "proportion of variance explained" like result which
>>>> does involve all 32 variances or std.dev.s ... which are
>>>> returned from the svd() anyway, even in the case when I use my
>>>> new 'rank.' argument which only returns a "few" PCs instead of
>>>> all.
>>>>
>>>> Would you think the current summary() output is good enough or
>>>> rather misleading?
>>>>
>>>> I think I would want to see (possibly in addition) proportions
>>>> with respect to the full variance and not just to the variance
>>>> of those few components selected.
>>>>
>>>> Opinions?
>>>>
>>>> Martin Maechler
>>>> ETH Zurich
>>>>
>>>> ______________________________________________
>>>> R-devel@r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
> --
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
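
For anyone following the thread, here is a minimal sketch in R of the two ways of computing the proportions that Martin contrasts: relative to the kept PCs only (what summary() printed for the tol = 0.1 fit) and relative to the full variance of all 32 columns. The names k, v_kept, v_total, prop_kept and prop_total are made up for illustration, and this is not the code that was proposed for prcomp(). It assumes the default scale. = FALSE, so that the total variance is simply the sum of the column variances of Z.

```r
## Data simulated as in Martin's example
C <- chol(S <- toeplitz(.9 ^ (0:31)))   # covariance matrix S and its Cholesky root
set.seed(17)
X <- matrix(rnorm(32000), 1000, 32)
Z <- X %*% C

pZ <- prcomp(Z, tol = 0.1)
k  <- ncol(pZ$rotation)                 # number of PCs actually kept

## Variances of the kept PCs; indexing by 1:k works whether or not
## 'sdev' itself was truncated by 'tol'
v_kept <- pZ$sdev[seq_len(k)]^2
names(v_kept) <- paste0("PC", seq_len(k))

## Total variance of the data = trace of cov(Z)
##                            = sum of the variances of *all* 32 PCs
v_total <- sum(apply(Z, 2, var))

prop_kept  <- v_kept / sum(v_kept)      # relative to the kept PCs only
prop_total <- v_kept / v_total          # relative to the full variance

round(rbind("Prop. (kept PCs)" = prop_kept,
            "Prop. (all PCs)"  = prop_total,
            "Cumul. (all PCs)" = cumsum(prop_total)), 4)
```

The cumulative row based on v_total stops short of 1, and the shortfall is the share of variance sitting in the dropped PCs, i.e. the magnitude of the "noise" that Peter points to.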